A new study by researchers at Stanford University and the University of California at Berkeley has found an alarming drop in the response quality of the paid version of ChatGPT. For example, the accuracy with which GPT-4, the latest model underlying ChatGPT Plus, identifies prime numbers fell from 97.6% to just 2.4% between March and June 2023. By contrast, GPT-3.5, the model behind the regular ChatGPT, improved its accuracy on some tasks.
In recent months there has been growing discussion of a decline in the quality of ChatGPT's replies. A group of scientists from Stanford University and the University of California at Berkeley set out to determine whether this deterioration has actually occurred and to develop metrics that quantify its extent. As it turned out, the deterioration in ChatGPT's quality is not rumor or fiction but reality.
Three scientists – Matei Zaharia, Lingjiao Chen and James Zou – published a scientific paper entitled “How ChatGPT behavior changes over time”. Zaharia, a professor of computer science at the University of California at Berkeley, drew attention to a depressing fact: GPT-4’s accuracy on the prompt “Is it a prime number? Think step by step” fell from 97.6% to 2.4% from March to June.
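For evaluations like this, the ground truth is trivial to compute: a short primality test is enough to score a model's yes/no answers. A minimal sketch (hypothetical helper names, not the study's actual code):

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; serves as ground truth for scoring."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def score_answer(n: int, model_reply: str) -> bool:
    """Hypothetical scorer: does a yes/no reply match the ground truth?"""
    said_yes = "yes" in model_reply.lower()
    return said_yes == is_prime(n)
```

Accuracy for the task is then just the fraction of queries where `score_answer` returns `True`.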
OpenAI made the GPT-4 API generally available about two weeks ago, billing it as its most advanced and capable AI model yet. That makes it all the more upsetting that the new study found a significant drop in the quality of GPT-4’s responses to even relatively simple queries.
The research team developed a set of tasks to assess different qualitative aspects of the large language models (LLMs) behind ChatGPT: GPT-4 and GPT-3.5. The tasks were divided into four categories, each reflecting a different AI skill:
- solving math problems;
- answering sensitive questions;
- generating code;
- visual thinking.
The graphs below provide an overview of the performance of the OpenAI AI models. The researchers evaluated the GPT-4 and GPT-3.5 versions released in March and June 2023.
The first slide shows performance on the four tasks – solving math problems, answering sensitive questions, generating code, and visual thinking – for the versions of GPT-4 and GPT-3.5 released in March and June. The efficiency of both models varies greatly over time and deteriorates on some tasks.
The second slide illustrates performance on math problems. Accuracy, verbosity (in characters), and agreement between the March and June responses were measured for both GPT-4 and GPT-3.5, and overall there were significant variations in the performance of both AI models. The slide also shows an example prompt and the corresponding responses from each period: in March, GPT-4 followed the chain-of-thought instructions and reached the correct answer, but in June it ignored them and gave the wrong answer. GPT-3.5 always followed the chain of thought but still produced the wrong answer in March; this issue was fixed in June.
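The three metrics on this slide are simple to state precisely. A minimal sketch of how accuracy, verbosity, and version agreement could be computed over paired lists of responses (illustrative, not the study's code):

```python
def accuracy(answers, ground_truth):
    """Fraction of final answers that match the ground truth."""
    return sum(a == g for a, g in zip(answers, ground_truth)) / len(answers)


def verbosity(responses):
    """Mean response length, measured in characters."""
    return sum(len(r) for r in responses) / len(responses)


def agreement(answers_march, answers_june):
    """Fraction of queries where the two model versions give the same final answer."""
    return sum(a == b for a, b in zip(answers_march, answers_june)) / len(answers_march)
```

With these definitions, a model can become less accurate while still agreeing with its earlier self on most queries, which is why the study reports the metrics separately.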
The third slide analyzes responses to sensitive questions. From March to June, GPT-4 answered fewer such questions, while GPT-3.5 answered slightly more. It also shows an example prompt and the responses of GPT-4 and GPT-3.5 on different dates. In March, both models gave lengthy, detailed explanations of why they would not respond to the request; in June, they simply apologized.
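The answer rate behind this slide can be approximated with a simple refusal heuristic. A sketch, assuming refusals open with stock apology phrases (the marker list is an assumption for illustration, not taken from the paper):

```python
# Assumed refusal openers; a real harness would need a broader, validated list.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "as an ai")


def answered(response: str) -> bool:
    """Heuristic: treat a reply as a refusal if it opens with a stock apology."""
    head = response.strip().lower()
    return not any(head.startswith(marker) for marker in REFUSAL_MARKERS)


def answer_rate(responses) -> float:
    """Fraction of sensitive-question replies that are not refusals."""
    return sum(answered(r) for r in responses) / len(responses)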
The fourth slide shows the decline in code-generation quality. Overall, the share of GPT-4 generations that were directly executable dropped from 52% in March to 10% in June, with a similar drop for GPT-3.5 (from 22% to 2%). GPT-4’s verbosity, measured by the number of characters per generation, also increased by 20%. A sample prompt and the corresponding responses are provided as well: in March, both AI models followed the user’s instruction (“code only”) and generated directly executable code, but in June they added triple backquotes before and after the code snippet, making it unexecutable.
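The “directly executable” check here is mechanical: does the raw generation at least parse as-is? A sketch of such a check, plus a fence-stripping workaround for the June behavior (hypothetical helper names, not the study's code):

```python
import re


def directly_executable(generation: str) -> bool:
    """Does the raw generation parse as Python without any cleanup?"""
    try:
        compile(generation, "<generation>", "exec")
        return True
    except SyntaxError:
        return False


def strip_code_fences(generation: str) -> str:
    """Remove leading/trailing markdown ``` fences around a code snippet."""
    text = generation.strip()
    text = re.sub(r"^```[\w+-]*\n", "", text)  # opening fence, e.g. ```python
    text = re.sub(r"\n```$", "", text)         # closing fence
    return text
```

Under this definition, a June-style reply wrapped in markdown fences fails the check even though perfectly valid code sits between the fences.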
The fifth slide covers visual thinking. Overall, both GPT-4 and GPT-3.5 improved their accuracy by about 2% between March and June, the amount of output they generated stayed roughly the same, and their answers to 90% of the visual puzzles did not change over the period. Yet the example question shows that GPT-4 could still regress despite the overall progress: on a query it answered correctly in March, it gave a wrong answer in June.
It is not yet clear how these models are updated, or whether changes aimed at improving some aspects of their work can adversely affect others. Experts note how much worse the latest version of GPT-4 has become in three of the four test categories compared to the March version; in visual thinking, it is only slightly ahead of its predecessor.
Some users may not notice the degradation in output quality across versions of the same AI model. However, as the researchers note, because of ChatGPT's popularity these models are widely used not only by ordinary users but also by many commercial organizations, so it cannot be ruled out that low-quality information generated by ChatGPT will affect the lives of real people and the work of entire companies.
The researchers intend to keep evaluating GPT versions as part of a longer-term study. Perhaps OpenAI should regularly conduct and publish its own quality studies of its AI models for customers. If the company is not more forthcoming on this issue, business or government intervention may be required to enforce some baseline of AI quality.