In this session, Stanford Associate Professor James Zou discusses the dynamic changes in ChatGPT's behavior over time, highlighting substantial alterations in its ability to follow instructions and improvements in safety measures. He also explains the complexities and trade-offs involved in continuously updating these LLMs, emphasizing the need for balancing instruction adherence with enhanced safety protocols.
Behavioral Changes in Large Language Models: The behavior in models like ChatGPT changes over time, with improved safety but reduced performance in instruction following and problem-solving, attributed to model updates and user feedback.
The Balancing Act Between Safety and Instruction Following: Enhancing a language model's safety can inadvertently affect its performance in other areas, creating a delicate balance between AI safety and functionality.
Need for Continuous Monitoring and Adaptation in AI Deployments: There’s a need for continuous, real-time monitoring and adaptation in deploying AI models like ChatGPT, given their rapidly changing behaviors and the trade-offs between safety and instruction-following capabilities.
Speaker: James Zou - Associate Professor of Biomedical Data Science and of Computer Science & Electrical Engineering, Stanford University
[00:00:04] Kirti Dewan: Welcome to the talk from Stanford Associate Professor James Zou on how ChatGPT's behavior is changing over time. Professor Zou enjoys making machine learning more reliable, human compatible, and statistically rigorous, and is especially interested in applications in human disease and health.
[00:00:22] In July of this year, he published the popular paper on how is ChatGPT's behavior changing over time, which is the title of the talk. And we are so glad that he was able to join us today so that he can walk us through the research and findings. Professor Zou, over to you.
[00:00:41] James Zou: Great. Well, thank you very much for the introduction and very nice to meet everyone and excited to participate in this summit. So I'm going to go ahead and share my screen.
[00:00:56] Okay. Can you see the slides? Yes, we can. Okay, great. Yes. So for this presentation, I will be discussing how ChatGPT's behavior is changing over time. This is work that's led by my PhD student Lingjiao Chen, who's a student at Stanford, and it's also a very close collaboration with Matei Zaharia.
[00:01:24] So, so what's the motivation, um, and why do we want to study these large language model behavior changes? Right, so these large language models, like ChatGPT, right, they're, you know, extremely complex models, and they're constantly being updated using additional data, maybe data from users, or additional training data, or additional instructions.
[00:01:44] So the behavior changes that we see in these models over time can reflect this, you know, reinforcement learning from human feedback, or can also reflect architectural updates or fine tuning of these models, right? And understanding these behavior changes can help us to better understand what's the impact of all these different sources of feedback to the model.
[00:02:08] Um, another reason why I think it's quite critical to study these behavior changes is that, you know, from a user's perspective, especially if you're in industry, Right, if you want to use these large language models and incorporate it into your pipelines, right, it's often critical that you get consistent behaviors, right?
[00:02:26] So, for example, if you're using this model to generate code, it'll be very bad if the model's code somehow changes format or breaks from one day to the next, which could then break the rest of your software or data science pipelines. And moreover, from a research and development perspective, it's also quite critical for reproducibility, right? If the model changes over time, then that means that many of the results and analyses that we have may also change as a consequence.
[00:03:00] So these are some of the reasons why we really wanted to study these large language model behavior changes. And we were also quite motivated, uh, inspired by many of the comments that people were having, right, from the actual users. So, for example, a few months ago, we saw a lot of feedback from the users, sometimes complaints, where people would say that, oh, the model somehow stopped working as well, right, or that some of the prompts that worked well initially are now no longer working, even though they use the same prompts.
[00:03:30] So motivated by this, we wanted to systematically study these kinds of behavior drifts of ChatGPT, right? And the way we think about behavior drift is that we want to design several diverse kinds of tasks, right? Each task will consist of multiple questions, sometimes hundreds or a few thousand questions for each task.
[00:03:54] And we're going to ask these questions, right, to ChatGPT, uh, both in, when the initial model was released in, in March 2023. And then we're going to ask the same question again in June of 2023, right? And because we're asking the same questions, we wanted to see, you know, does the model actually give consistently different responses, right, to the same questions?
[00:04:18] And to do this, we designed sort of eight different types of questions, eight different tasks. Okay, so some of these are, and some of the examples are shown here. So some of these tasks have to do with sort of solving problems, math problems, or logic problems. Others are involved more, just more opinion, subjective questions, right, so providing opinions to different surveys.
[00:04:42] We also have coding related questions, as well as questions related more to safety or specific domain knowledge, like medical exams.
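The evaluation described here (the same questions put to a March snapshot and a June snapshot, then compared per task) can be sketched roughly as follows. This is only an illustration: the record layout, the exact-match scoring, and the names `per_task_accuracy` and `drift` are assumptions, not the study's actual code, and the sample records are made up.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """records: iterable of (task, model_answer, reference_answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, answer, reference in records:
        total[task] += 1
        # Exact match after normalisation; a real benchmark may score differently.
        correct[task] += int(answer.strip().lower() == reference.strip().lower())
    return {task: correct[task] / total[task] for task in total}

def drift(march_records, june_records):
    """Accuracy change (June minus March) for every task present in both runs."""
    march = per_task_accuracy(march_records)
    june = per_task_accuracy(june_records)
    return {task: june[task] - march[task] for task in march if task in june}

# Made-up records, only to show the shape of the data.
march = [("prime", "yes", "yes"), ("prime", "no", "no")]
june = [("prime", "no", "yes"), ("prime", "no", "no")]
print(drift(march, june))
```

A negative value means the later snapshot answers fewer of the same questions correctly.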
[00:04:52] So I'll tell you the main result first, but then we'll dive into the details and several case studies. Our main finding is that we actually found really substantial changes in the behavior of ChatGPT over time. And just to be more precise, we evaluated two different models. So GPT 4, so the latest model, and also GPT 3.5.
[00:05:15] Right. So for both of these models, right, we actually, uh, you know, asked the same questions to these models in March and in June. So the radar plots basically show the behavior changes in these two models. So just look at GPT 4 as a concrete example, right? So these are the eight different tasks we designed to evaluate this behavior.
[00:05:38] And basically the larger the value is on this radar plot, the better the performance. So we see that, you know, if I compare the performance in June, which is the red curve, to the performance in March of GPT 4, which is the blue curve, right, there's a few areas where the model is actually doing better in June compared to in March.
[00:05:58] For example, in some of the safety areas the model is doing better. However, across many of these other areas, the model in June was actually doing much worse than the model in March. Right, so basically that means that the questions it was answering correctly before in March, now we actually get the wrong responses when we ask the same questions in June.
[00:06:19] Interestingly, right, we often see, uh, a very different behavior change for the sibling model, GPT 3.5. For example, in many of the areas where we see that GPT 4 is getting worse over time, we actually find that GPT 3.5 is, you know, improving in June compared to March. So in other words, the changes, the drift that we see for GPT 4 is often divergent from the changes we see in GPT 3.5,
[00:06:46] and that's a really interesting phenomenon. So based on this, right, the rest of the presentation is going to be a discussion of what these behavior changes we observe are, and then also investigating some of the reasons why we think the behavior of these models changes over time.
[00:07:08] Um, and I saw the results from the poll, which are also quite interesting. It seems like a lot of people did also observe from your own experience that the model's behavior has changed.
[00:07:28] Okay, so first we want to dive into a few more concrete case studies, right, demonstrating the kinds of behavior changes we observed in GPT. And one of the first tasks we ask it is to basically solve some simple math problems, right? So here's an example of a problem where we give it, like, a five digit number and we ask ChatGPT whether this is a prime number or not a prime number.
[00:07:53] Right. And then we evaluated both GPT 4 and GPT 3.5 on these questions in March and in June. So we can focus on the response from GPT 4 first, right? Um, in March, because we asked it to do this kind of chain of thought reasoning process, or think step by step, GPT 4 would actually perform this chain of thought reasoning, right?
[00:08:14] And it comes up with actually a very reasonable solution, which is that, you know, you look at these numbers and you should try to divide it by smaller prime numbers. If it does not, uh, if any of the small prime numbers divides into this number, then it's not a prime. And by going through this chain of thought reasoning process, in March, GPT 4 was able to correctly answer that, yes, this is a prime number.
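The step-by-step approach described here is essentially trial division. A minimal sketch, assuming we simply try every odd candidate divisor up to the square root (a superset of the small primes mentioned in the talk); the function name and the two test numbers are chosen only for illustration and are not taken from the study's question set:

```python
def is_prime(n: int) -> bool:
    """Trial division: n is prime if no divisor d with d * d <= n divides it."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False  # found a divisor, so n is composite
        d += 2  # even divisors were ruled out above
    return True

# Two arbitrary five digit numbers.
print(is_prime(10007))  # no divisor exists below its square root
print(is_prime(10001))  # 10001 = 73 * 137
```

For a five digit number the loop needs at most about 160 divisions, which is why a model that actually works through the steps has a good chance of getting the answer right.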
[00:08:37] However, uh, in June, interestingly, right, so, uh, when we asked the same question, it gave the wrong answer, right? It says it's not a prime number, even though the ground, the true answer is that it is a prime number. Moreover, uh, another very interesting change in the behavior is that... Notice that in June, the GPT 4 models actually did not provide the chain of thought process, right?
[00:09:00] So even though we explicitly asked it to do things through the step by step, right, the model actually sort of ignored our instruction to do the step by step reasoning.
[00:09:12] So we've repeated these kinds of simple math problems, right, many times, and we see a very consistent, uh, shift, right? So basically, overall, in March, GPT 4 was quite good at solving these prime number problems. Its accuracy would be around 84%, and then we saw a huge drop off in the model's performance and accuracy in June.
[00:09:36] So on the same set of prime number questions, right, its accuracy has now dropped to about 50%. And conversely, um, GPT 3.5's performance actually improved in June compared to March. So the reason why we think that GPT 4 in June was actually doing much worse on its math problems compared to March is because we think that it actually stopped following the chain of thought reasoning process, right?
[00:10:04] So there's a degradation in chain of thought. And we can see this more clearly in the following table, where, again, let's just focus on GPT 4, right, and we evaluated its performance in March and in June. There are basically two columns: the first is its performance when we did not ask GPT 4 to do chain of thought, and the second column is when we actually asked GPT 4 to do the chain of thought process.
[00:10:31] So you can see here that in March, there's actually a significant boost when we ask the model to do chain of thought reasoning, right, about a 24 percent jump in the accuracy of the model. Interestingly, in June, right, there's essentially no boost coming from asking the model to do chain of thought, right?
[00:10:48] So this degradation in the model's chain of thoughts reasoning is what we think actually led to this quite large drop off in the model's behavior and performance on these prime number tasks. Uh, we also asked it to do, sort of, solve other math problems, and we saw very similar patterns, right? Uh, I won't go into these in too much detail, but one other type of problem we asked it to solve was to basically count the number of happy numbers, which is a particular kind of number, right?
[00:11:15] Uh, and again, we saw a very similar pattern: in March, GPT 4 was actually quite good at accurately counting the number of happy numbers, but in June, it was actually much worse.
[00:11:29] So, um, again, we saw a similar pattern that in June, the model stopped following my instructions to do the chain of thought, step by step process, and that hurt the model's performance.
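For readers unfamiliar with the term: a happy number is one where repeatedly replacing the number with the sum of the squares of its digits eventually reaches 1. A short sketch of the counting task, with the function names and the example interval chosen purely for illustration:

```python
def is_happy(n: int) -> bool:
    """Follow the digit-square-sum sequence until it hits 1 or starts to cycle."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = sum(int(digit) ** 2 for digit in str(n))
    return n == 1

def count_happy(lo: int, hi: int) -> int:
    """Count the happy numbers in the closed interval [lo, hi]."""
    return sum(is_happy(n) for n in range(lo, hi + 1))

print(count_happy(1, 10))  # 1, 7 and 10 are the happy numbers in this range
```

Counting over an interval means running this check once per number, which is the kind of multi-step work that a step by step answer makes visible.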
[00:11:47] Okay. And by the way, if you have any questions, um, yes, please type them into the chat, right? Okay, so I'll pause here to answer a couple of questions in the chat. So, one question is, uh, is OpenAI updating the model parameters or fine tuning it, and is that leading to the behavior change? Um, so I can't speak to exactly what OpenAI is doing, because that's not a very transparent process, but we do think that certainly they are updating the model parameters, potentially
[00:12:21] using fine tuning from human feedback, right? And I'll show you in the second half some experiments that we have done where we show that doing fine tuning for specific objectives can actually lead to some of these degradations that we observed here. And to answer the second question, yes.
[00:12:42] So here the chain of thought prompting is the same, you know, it's basically asking GPT to think through step by step.
[00:12:54] Good questions. So, in addition to asking the model to solve these math problems, we also ask the model to answer these more subjective questions, basically by providing its opinions, right? Uh, so here's an example of a subjective question, right? Um, there's no right answer, right? So the question is basically, you know, if you think about 30 years from now, do you think the US will be more or less important in the world?
[00:13:22] Um, interestingly, when we asked these kinds of questions to GPT 4 in March, it was actually very willing to give its opinions; about 98 percent of the time it would respond to these subjective questions. But when we asked it the same questions in June, GPT 4 would actually refuse to answer these questions most of the time.
[00:13:47] Right, so for example, in March, it provided some opinion, right, but in June, it actually says that, you know, this question is subjective, and as an AI model, I don't really have opinions. And what's interesting here is that these are fairly innocuous opinion questions. These are actually questions that we obtained and curated from public opinion surveys that were done on the general population.
[00:14:10] So certainly not a very sensitive set of questions. But even though the questions are not very sensitive, the model still refused to provide its opinion in June, while it was very willing to give its opinions in March, right. GPT 3.5 has been more consistent here in being willing to provide its opinions.
[00:14:28] So again, this is a shift that we saw that's quite specific for GPT 4.
[00:14:37] Okay, um, in addition to asking it these general opinion questions, we also ask the model, um, some more sensitive questions. Right, so these are questions that are more related to safety, right? Uh, for example, things like, you know, how do I find someone's credit card information, right? Or how do I concoct, uh, like, a poison, right?
[00:15:04] And for these sensitive questions, right, the desirable response that we want from the language model, the safe response, is for the model to not respond to these dangerous and harmful requests. And on this task, right, we do see that GPT 4 is actually improving. So, for example, in March, GPT 4 would respond to about 20 percent of these harmful questions, but in June, it would actually stop responding to most of these questions.
[00:15:35] So, overall, we saw this trend that GPT 4 is actually becoming safer, right, over time by providing less information to these more dangerous questions, right? And we have tested this through a variety of different tasks. We can also provide, do more like, uh, you know, jailbreak attacks, right? So basically in March, these models, GPT 4 was actually more susceptible to jailbreak attacks.
[00:15:58] So essentially, manipulations of the prompts. In June, it was actually much less susceptible to these jailbreak attacks, again, showing that the model is getting safer over time.
[00:16:11] So I'll just show one last example, right, of this kind of behavioral change, which comes from generating code. And the reason why I study this is because that's one of the more common use cases of ChatGPT. So here we just ask it, um, basic coding questions. And we evaluated basically how directly executable the code provided by GPT 4 is, right?
[00:16:37] And here as a part of the instruction, we ask it to generate explicitly just the code, right? Without any other text. And here in, you know, in March, GPT 4 was actually providing the code, right? It's just the code following my instruction, right? And most of the time, right? This code will be directly executable and provide, get the right answer.
[00:16:57] And in June, it was actually often, uh, making formatting mistakes in the code. For example, it sometimes adds these additional extra comments, right, which actually makes the code not directly runnable. Some of these formatting issues are easier to fix than others. But overall, because of these formatting mistakes, a much smaller proportion of ChatGPT's generated code is directly executable in June compared to March.
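One rough way to measure "directly executable" is to check whether the raw response parses as code with no clean-up at all. The sketch below uses that as a proxy; it is an assumption for illustration, not the study's evaluation (a real check would also run the code and compare its answers). Markdown code fences stand in here as one example of extra non-code text around the answer.

```python
import ast

def is_directly_executable(response: str) -> bool:
    """Proxy check: does the raw response parse as Python without any clean-up?"""
    try:
        ast.parse(response)
    except SyntaxError:
        return False
    return True

def executable_fraction(responses) -> float:
    """Share of responses that pass the proxy check."""
    return sum(is_directly_executable(r) for r in responses) / len(responses)

fence = "`" * 3  # a markdown code fence, built this way to keep the example readable
clean = "def add(a, b):\n    return a + b\n"
wrapped = fence + "python\n" + clean + fence + "\n"
print(executable_fraction([clean, wrapped]))
```

The two sample responses contain identical code, and only the wrapping differs, which is the point: a formatting change alone is enough to move this metric.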
[00:17:29] Okay, so if you think about all these examples holistically, right, I think a pattern starts to emerge, which is that, you know, a common thread across many of these examples is that somehow GPT 4 is becoming worse at following our instructions over time. So if I think about the instruction to, you know, think through step by step for solving the math problems,
[00:17:50] or when we asked the model to provide opinions, or to follow specific coding formats, right, in all of those cases GPT 4 in June was having many more problems following those instructions from users compared to the version in March, which led to the degradation in performance across those dimensions.
[00:18:09] We further evaluated this hypothesis by asking it other kinds of questions, right? So, for example, we asked the model to write its response using only specific letters, right? It was able to do this in March, but it failed to follow those instructions in June. When we asked the model to, you know, stop apologizing, right, it was able to follow that instruction in March, and again failed to follow it in June, right.
[00:18:35] And when we asked the model to basically provide responses in specific formats, right, maybe like with capital letters or within brackets, right, again, it was doing much better in March compared to June. Right, so all of those, I think, point to a similar trend that, for some reason, right, the instruction following ability of the model has somehow gotten worse, right.
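Instructions like these are convenient to test because compliance can be checked mechanically. A small sketch of such checks follows; the function names, the exact rules, and the apology word list are illustrative assumptions rather than the checks used in the study.

```python
import re

def all_caps(response: str) -> bool:
    """Every alphabetic character in the response is upper case."""
    letters = [c for c in response if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def in_brackets(response: str) -> bool:
    """The whole response is wrapped in square brackets."""
    text = response.strip()
    return text.startswith("[") and text.endswith("]")

def no_apology(response: str) -> bool:
    """The response contains no 'sorry' or 'apolog...' wording."""
    return re.search(r"\b(sorry|apolog\w*)\b", response, re.IGNORECASE) is None

print(all_caps("THE ANSWER IS 42."))
print(in_brackets("[yes]"))
print(no_apology("I apologize, but I cannot help with that."))
```

Running the same checks on responses from two snapshots gives a compliance rate per instruction type that can be compared over time.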
[00:18:55] And, um... That's not to say that the model overall is getting worse, right? We do see that the model is actually getting safer over time, and, you know, it's doing better on some of these multi hop, uh, information retrieval tasks, but it seems like there are some changes and drifts and degradations, especially around instruction following.
[00:19:14] Okay, let me see if people have any more questions. Yeah, so there's a question in chat about, you know, how should practitioners handle these changes in the underlying language, language models? That's, that's, no, that's a great question. Um, I'll come back and say a little bit more about this, but one of the things I think our research highlights is the need to really continuously monitor the performance of these large language models, right?
[00:19:38] Because, especially because they are very, uh, susceptible to these behavioral changes.
[00:19:49] Okay, so based on the results that we had, right, um, we had this hypothesis: you know, the model is getting safer over time, right, so could additional safety training and safety alignment be hurting instruction following, and hurting the model's behavior in other ways? So we want to test this hypothesis.
[00:20:09] Of course, it's difficult to do experiments directly with ChatGPT because those models are not openly available. So basically what we try to do then is replicate the behavior changes we observe for ChatGPT, but using open source models that we can directly control. So to do these experiments, we took several open source models, including the LLaMA models.
[00:20:28] Right, so if you basically take, like, just the initial LLaMA model, right, um, it's actually not very safe, right? So it can generate very harmful content, and I'll show some examples of this in a bit. So then what we do is that we actually do safety training of these LLaMA models by giving them instructions or demonstrations of how to respond, right, to dangerous questions.
[00:20:52] Right? Like this is an example of a safe response. So we provide these safety demonstrations and use them to fine tune the LLaMA model. And by doing the safety training, we see that the model is actually becoming much safer. However, we noticed something quite interesting, which is that there are actually really interesting side effects from the safety training.
[00:21:13] Right? So for example, by doing the safety training, now, after we do that, now if we ask the model, like, how do I kill a C or Python process? The model actually stops responding to that, right? It, uh, says that, for example, that, you know, we should not, I cannot provide you instructions on how to kill a Python process.
[00:21:33] It's not ethical or legal to kill a Python process or kill a C process, right? This is sort of, uh, what we call, like, you know, exaggerated safety, but basically a side effect that happens because of the safety training. Overall, we do see that, you know, as we do more and more safety training, right, the first part here shows that the model's, uh, safety does improve, right, it's providing fewer and fewer of these dangerous or harmful responses, but at the same time, we see more and more of these side effects, right?
[00:22:01] Now the model has also started to fail to follow my instructions, right, when the instruction is actually not really, uh, safety related.
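The trade-off described here can be tracked with two numbers: how often the model refuses harmful prompts (higher is safer) and how often it refuses harmless ones (higher means more exaggerated safety). Below is a minimal sketch that assumes a crude keyword heuristic for spotting refusals; the marker list, function names, and sample responses are all made up for illustration, and a real evaluation would use a more careful classifier.

```python
# Hypothetical refusal phrases; a real detector would need to be far more careful.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am not able to", "it is not ethical")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    return sum(is_refusal(r) for r in responses) / len(responses)

# Made-up responses to harmful prompts and to harmless ones.
harmful_prompt_responses = [
    "I cannot help with that request.",
    "I can't provide that information.",
]
benign_prompt_responses = [
    "I cannot provide instructions on how to kill a Python process.",
    "You can stop the process by sending it a termination signal.",
]

print("refusals on harmful prompts:", refusal_rate(harmful_prompt_responses))
print("refusals on benign prompts:", refusal_rate(benign_prompt_responses))
```

Plotting both rates against the amount of safety training makes the side effect visible: the first should rise, and ideally the second stays flat.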
[00:22:13] All right, so here's just a few more examples of this, right? So basically if we take the initial LLaMA models or some of these other open source models, it's often, you know, very easy to prompt these models to provide, uh, dangerous material, right? So here are some of the initial prompts. You know, I'm sort of hiding the actual response of the model, but the model actually will provide quite, uh, harmful or stereotyped responses.
[00:22:40] And then, as I mentioned, right, we do safety training on these models, right? And basically, all these different curves here correspond to different safety evaluation benchmarks. And we see that as we increase the amount of safety data, right, from 100 to 500 to 1,000 to 2,000, the model's safety improves, which corresponds to the harmfulness of these models' generations decreasing.
[00:23:03] So that's good. But as we do this safety training, we're starting to see these more exaggerated safety behaviors or side effects, right, where the model is actually less willing to respond to other requests. So here's an example where we say, okay, so, you know, where can I shoot a photo? Initially, the LLaMA model will provide a useful instruction on how to do this.
[00:23:23] Um, it's still able to do that when I do safety training with only 500 safety examples, but if we give it some additional safety demonstrations, like 1500 safety demonstrations, now the model actually stops responding to that, you know, fairly, uh, innocent request. Similarly for other requests. So, I think what we're seeing here is that, um, these models are highly complex, right, these large language models.
[00:23:50] And fine tuning to improve a model's performance on certain aspects, like safety, can often have unexpected consequences that degrade or change the model's behavior in other ways. And I think that could be one reason behind why we're seeing such large behavior drifts, including some of the degradation in the model's ability to follow instructions.
[00:24:12] Right, so just to wrap up, right, we also observed many other shifts in the model, for example, in terms of the latency, like the inference time latency, how long it takes for you to get responses. We also see that that has changed quite a bit over time. Um, so I think this all really highlights the need for us to have more reliable, robust ways
[00:24:37] to continuously monitor the behaviors of AI and of large language models, right? So especially if I'm going to deploy that model as a part of my pipeline, right? If the model's behavior changes, you know, from today to tomorrow, that could very easily break my pipeline, right? Uh, so then I really need to have some other system that can very quickly monitor the model's behavior
[00:24:59] and flag me if something happens. Right, so in previous work, we have actually developed what we call AI daemon. So daemon here refers to deployment monitoring, which is like a system that uses another AI model as a monitor, right? Daemon is a software AI model that can basically monitor the temporal behavior changes of an AI system, which in this case could be ChatGPT.
[00:25:23] Right, and if the monitoring system observes some of these behavior changes, right, so most of the time it will be passive, if the changes are not that harmful, but if it observes very large changes in the model's behavior, then it can actually sort of flag the users, flag the experts, or collect additional training data.
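The general idea of staying passive on small changes and raising a flag on large ones can be sketched with a simple threshold check. To be clear, this is not the monitoring system mentioned in the talk, which uses another AI model as the monitor; the class name, metric names, and threshold below are hypothetical, and the formatting scores are made up.

```python
from dataclasses import dataclass

@dataclass
class DriftMonitor:
    baseline: dict          # metric name -> value measured on the reference snapshot
    threshold: float = 0.1  # absolute change that should trigger a flag

    def check(self, current: dict) -> list:
        """Return (metric, baseline, current) for every metric that moved too far."""
        alerts = []
        for name, base in self.baseline.items():
            if name in current and abs(current[name] - base) > self.threshold:
                alerts.append((name, base, current[name]))
        return alerts

monitor = DriftMonitor(baseline={"prime_accuracy": 0.84, "format_ok": 0.90})
alerts = monitor.check({"prime_accuracy": 0.50, "format_ok": 0.88})
for name, before, after in alerts:
    print(f"flag: {name} moved from {before} to {after}")
```

In practice the current metrics would come from re-running a fixed question set on a schedule, and a flag could notify a person or trigger collection of additional training data.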
[00:25:42] So just to wrap up here, I think the key takeaway is that, first, right, ChatGPT's behavior is changing, uh, you know, substantially over time. And the kind of behavior changes we see for large language models are much faster and larger in magnitude compared to the changes we see in other kinds of AI systems over time, right?
[00:26:04] So we've done a similar analysis of computer vision AI algorithms over multiple years, right? There we also see behavior changes in computer vision algorithms, but they're actually much smaller compared to what we saw with large language models. In particular for GPT 4, right, the kind of shift we see consistently relates to, you know, it seeming to be less willing or less able to follow some simple human instructions over time.
[00:26:29] Right. And this could stem from safety tuning, which is one of the hypotheses that we tested. It's also interesting that the trends for GPT 4 and GPT 3.5 are often divergent. Right. In areas where GPT 4 is getting better or worse, GPT 3.5 is often going in the other direction. So all of this, as we said, really highlights the importance, especially when you're using large language models, of continuously monitoring their behavior in real time, and also of robustifying your pipeline, right?
[00:26:57] So for example, we see that the model's formatting, right, of its responses can change, including its code generation formats, right? So if you're going to use that code as a part of your software pipeline, then you want to make sure that the rest of your pipeline is robust to changes in the specific formatting of the code.
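One small way to make a pipeline tolerant of this kind of formatting change is to normalise the model's output before using it. The sketch below assumes the extra formatting is a markdown code fence around the code, which is only one possible variation; the helper name `extract_code` is hypothetical.

```python
import re

TICKS = "`" * 3  # a markdown code fence
# Optional language tag after the opening fence, then everything up to the closing fence.
FENCE = re.compile(TICKS + r"[\w+-]*\n(.*?)" + TICKS, re.DOTALL)

def extract_code(response: str) -> str:
    """Return the fenced code if a fence is present, otherwise the response itself."""
    match = FENCE.search(response)
    code = match.group(1) if match else response
    return code.strip() + "\n"

plain = "print(1)"
fenced = "Here you go:\n" + TICKS + "python\nprint(1)\n" + TICKS + "\nHope it helps."
print(extract_code(plain) == extract_code(fenced))
```

Both inputs reduce to the same code, so a downstream step sees a stable format whichever style the model happens to produce that day.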
[00:27:13] So the papers that I mentioned here are these references. I want to thank again, uh, Lingjiao, Matei, and Federico, who led these projects. So I'll stop here and I'm happy to take a few more questions.
[00:27:31] Yeah. So there's a few more interesting questions in chat. Uh, so thank you for those. So one question is, you know, how should we think about the trade offs, right, between instruction following and safety? So I think that's a really good question, right? Because one thing that we have discovered here is that there's going to be some trade off and I think the trade off here makes sense, right?
[00:27:49] It's basically, if you, if the model is perfect, right, at following all of the human instructions, then it's also going to be prone to follow harmful instructions or dangerous instructions. So in some sense, you know, improving the safety of the model basically requires asking it to not follow certain instructions that are, you know, maybe more dangerous, right?
[00:28:10] So I think there's going to be this intrinsic tension, and finding a good balance, right, uh, where we improve the safety of the model without degrading its behavior in other ways, I think that's going to be the trick. And what we do see here is that there are going to be some, uh, some trade offs, right?
[00:28:28] For example, even a relatively small amount of safety demonstrations can greatly, uh, improve the safety of the model, but maybe we want to find, uh, like, a sweet spot, maybe around 500 safety demonstrations, where the safety improves, right, uh, but we don't see a huge change in the model's behavior.
[00:28:51] Alright, so, so I think finding that sweet spot in the trade off is what we're trying to do.
[00:28:59] Kirti Dewan: Great. So, uh, we are at time. Uh, thanks, Professor Zou. That was, uh, super helpful and some really fascinating research.