
GPT-4 as served in the API has been getting 85% on HumanEval (compared to 69.5% claimed here)

https://twitter.com/amanrsanger/status/1635751764577361921 https://github.com/getcursor/eval



Right, but there are no contamination studies there. I suspect that RLHF data leaked HumanEval into GPT-4.

It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.

This is why we use the official numbers.


I have several arguments for why contamination is probably not the main reason for the performance difference.

When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.

OpenAI's official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).

There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model were just regurgitating memorized canonical solutions.
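
To illustrate the logic: HumanEval+ adds extra edge-case tests per problem, so a memorized canonical (i.e. correct) solution would keep passing, while a subtly buggy generated one gets caught. A toy sketch (hypothetical problem and tests, not actual benchmark data):

    # Toy illustration: a candidate that passes a HumanEval-style base
    # suite but fails once extra edge-case inputs (the HumanEval+ idea)
    # are added. Problem and tests are hypothetical, not from the benchmark.

    def candidate_abs_diff(a, b):
        # Looks right on distinct inputs, but breaks on a corner case.
        return (a - b) if a > b else (b - a) if b > a else 1  # bug: equal inputs

    base_tests = [((3, 1), 2), ((1, 4), 3)]               # small base suite
    plus_tests = base_tests + [((5, 5), 0), ((0, 0), 0)]  # extra edge cases

    def pass_rate(tests):
        return sum(candidate_abs_diff(*args) == want for args, want in tests) / len(tests)

    print(pass_rate(base_tests))  # 1.0 -- passes the base suite
    print(pass_rate(plus_tests))  # 0.5 -- the extra inputs expose the bug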

BTW why don't you use HumanEval+?

[1] https://arxiv.org/abs/2305.01210


The "intelligence" of large language models needs to be evaluated like the abilities of self-proclaimed psychics. You send your binary to an independent third party and who evaluates it on new problems. It's only a "Human eval" once.


>> given all of the examples and anecdotes about degradation.

How many of those examples and anecdotes about degradation are actually scientific side-by-side studies? I see absurd articles online about kids' ChatGPT usage going down the drain that completely fail to consider even the most basic fact of seasonality: school is out for the summer!


It takes like 2-3 experiences of receiving a confidently wrong answer to downgrade your usage. If you use a refactoring tool to rename a symbol and it misses one occurrence, you won't use it again.


While that would likely be my experience with a refactoring tool (unless I didn't have a better alternative), that's not my experience with ChatGPT 4. And that's considering I have very little tolerance for buggy software.

There was a period of a few weeks or months in which it seemed like ChatGPT had really degraded to the point of being unusable (although it could have been my biases). However, it seems to be better now (again, my subjective experience).

Sometimes I still catch it making really basic mistakes, but most times I can convince it to correct them (especially if I point them out).

But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.

Recently, it gave me the most amazing idea for navigating a complex and nuanced social situation I was having difficulty with. And given the constraints of the situation, there was no way I could have gotten that idea otherwise, especially in the allotted time.

So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.


Given the fact that OpenAI has constant resources (for any given small span of time) and varying demand (users and query type), it's not crazy to think they dynamically adjust to consume all available resources on their side.

Obviously the base model would be the same, but aren't there +/- flavors they could overlay with extra compute? E.g. multi-pass, additional experts, etc.

The benefits to giving someone an occasional "magic" answer are too great not to.

Have there been any wide studies on same-prompt-different-times?
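
A minimal version of such a study is easy to script yourself; here's a sketch, assuming the openai v1 Python client (the model name, prompt, and cadence are placeholders):

    # Sketch of a same-prompt-different-times probe: log completions for
    # one fixed prompt over time, then diff them later for drift.
    # Assumes the openai v1 Python client; model/prompt are placeholders.
    import json
    import time
    from datetime import datetime, timezone

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    PROMPT = "Write a Python function that reverses a linked list."

    with open("drift_log.jsonl", "a") as log:
        for _ in range(3):  # for a real study, run this on a cron for weeks
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": PROMPT}],
                temperature=0,  # reduces sampling noise; still not fully deterministic
            )
            log.write(json.dumps({
                "ts": datetime.now(timezone.utc).isoformat(),
                "answer": resp.choices[0].message.content,
            }) + "\n")
            time.sleep(3600)  # one probe per hour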


> So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.

Much of this resonates. That said, I get tremendous value simply by writing things down (or dictating them) and replying to my own question. I would expect that a sizable fraction of people have forgotten about these strategies and/or don't use them when they are most useful. For many, there is tremendous muscle memory to run a Hooli search almost on mental autopilot. Who has time to slow down and write a well-conceived question? Or perhaps we should turn it around ... On a longer time horizon, who would want to waste time with poorly-conceived questions?

It is the question that starts the process. So we should ask good questions. Do we? I'd be curious about the usage data OpenAI collects. I do my best to lower expectations about people in general, but I'm confident I'd still be unprepared for the level of thought put into questions.

> But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.

I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.


> I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.

Yes, I can see that. But over time, you also learn and adapt the prompts to ChatGPT's peculiarities so that it provides more useful output.

Still, I'm sure there are many topics/domains for which it's not useful.

As another anecdote, I'm not a mathematician but at one point I was playing around with proving theorems on a theorem prover.

What I found is that ChatGPT is this paradoxical entity which makes the most elementary math errors all the time (I'm talking third-grade level math mistakes), and yet, it was by far the most useful tool ever in coming up with lots of useful PhD-level ideas and math theorems that would allow me to complete proofs when I was completely stuck (and not just for proofs which it had seen before).

It constantly came up with brilliant ideas and theorems that I didn't even know existed, that were not part of any theorem database of any theorem prover I had seen before (and I've seen the vast majority of them), and that there was no way I was going to find by searching the web or writing things down on a notepad (I know this because I had tried, for days at a time, along with other ideas such as visualizations and simulations).

That's not to say a mathematician wouldn't be aware of them, but I don't have easy access to one, and I was surely not going to pay one given that I was just exploring, mostly for curiosity.

This seems like a paid ad, but I promise you, I have no affiliation whatsoever...


Anecdotes sometimes take a beating, but I happen to like the personal ones. Thanks for sharing.

A quick thought about your success: ChatGPT's imprecision and stochasticity can work in its favor for many creative efforts. Unexpected token connections can have a lot of value in a space where vast numbers of novel directions are worthwhile.

For me, having spent thousands of hours thinking about statistics, ML, logic, and reasoning, ChatGPT is not paradoxical. To me, the human aspect is more interesting; namely, the ways in which people are surprised reveals a tremendous diversity in people's expectations about intelligence, algorithms, and pattern-matching.

For many people, most of the time, basic reasoning is a basic requirement for intelligence. By themselves, sequence to sequence models are not computationally capable of deductive reasoning with an arbitrary number of steps, since that would require recursion (or iteration).


I don't think I've spent nearly as much time as you thinking about these things and I'm not entirely sure I understood your perspective, but I have a couple of reflections for you which perhaps you can comment on:

> By themselves, sequence to sequence models are not computationally capable of deductive reasoning with an arbitrary number of steps, since that would require recursion (or iteration).

Isn't the fact that LLMs perform their inference step by step, where in each step they output only one token, an instance of deductive reasoning with a (potentially) arbitrary number of steps?

I say this because on each inference step, the tokens that were previously generated do become part of the input.

At a higher level of abstraction, I'm also thinking about chain-of-thought prompting, in which LLMs first output the easier-to-deduce steps, then build on them to perform further deductive steps until they finally produce the desired answer [1].

Of course, they have a limited context, but the context can be (and has been) increased. And humans have a limited context as well (except if we consider long-term memory or taking notes, perhaps).

The main difference I see is that in LLM chain-of-thought reasoning, they are currently outputting their intermediate "thoughts" before actually giving the final answer, whereas we humans are capable of silencing ourselves before actually having figured out the answer, which we then "output" as speech [2].

So I think there is still a form of recursion or iteration happening in LLMs, it's just that it's in a somewhat limited form in that we are observing it as it happens, i.e. as they output tokens one-by-one.
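
As a toy sketch of what I mean (model() here is a stand-in for a real LLM's forward pass, not an actual API):

    # Toy sketch of autoregressive decoding: each generated token is
    # appended to the context, so every step conditions on all previous
    # "thoughts" -- the iteration lives in the decode loop itself.

    def model(context):
        # Hypothetical stand-in: returns the next token given the context.
        return "step-" + str(len(context))

    def generate(prompt_tokens, max_new_tokens, stop="<eos>"):
        context = list(prompt_tokens)
        for _ in range(max_new_tokens):  # the iteration in question
            token = model(context)       # one bounded-depth forward pass
            if token == stop:
                break
            context.append(token)        # output becomes part of the input
        return context

    print(generate(["Q:", "prove", "X."], 4))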

That said, something that I think could really make LLMs take a big step forward would be to have something akin to long-term memory. And the other big step would probably be being able to learn continuously, rather than only during their training. These two potential steps might even be the same thing.

So I don't know. I'm obviously not an expert but these are my thoughts with regards to what you've just said.

[1] https://ai.googleblog.com/2022/05/language-models-perform-re...

[2] Interestingly, there have been studies that show that humans produce micro-speech patterns when we are thinking, i.e. as if we are really speaking, although imperceptibly. That said, I have no idea how trustworthy these studies are.

Edit: added a clarification at the beginning.


First, I hope that my estimate of hours input into my brain didn't come across as boastful. I'm still working on the balancing act of stating my experience so people get my point of view without sounding arrogant. In this case, I should have also said that thinking about anything long enough can sometimes cause some of the wonder to fade. Luckily, though, for me, the curiosity remains, just focused in different directions.

Second, your comment above covers the ground I was referring to regarding deduction. It seems like we're on the same page. The main difference may be where one draws the lines. When I said "by themselves sequence to sequence models..." I was excluding algorithms that chain language models together in various ways.

Not too long ago, when people said "AI" that tended to refer to algorithms like forward chaining over a set of facts.


> That said, something that I think could really make LLMs take a big step forward would be to have something akin to long-term memory.

Yes. There is significant work in this direction.
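
One common direction is retrieval-style memory: embed past exchanges, store the vectors, and prepend the nearest ones to future prompts. A rough sketch (embed() is a toy stand-in for a real embedding model):

    # Minimal sketch of retrieval-style "long-term memory" for an LLM.
    import math

    def embed(text):
        # Toy embedding: normalized character-frequency vector.
        vec = [0.0] * 26
        for ch in text.lower():
            if ch.isalpha():
                vec[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    memory = []  # list of (vector, text) pairs

    def remember(text):
        memory.append((embed(text), text))

    def recall(query, k=2):
        q = embed(query)
        scored = sorted(memory, key=lambda m: -sum(a * b for a, b in zip(q, m[0])))
        return [text for _, text in scored[:k]]

    remember("user prefers Lean for theorem proving")
    remember("user is exploring prime factorization proofs")
    print(recall("help me finish this Lean proof"))
    # the retrieved notes would be prepended to the next LLM prompt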


When I was doing a lot of C++ gamedev, we were definitely doing a lot of stuff that would trip up static analysis, e.g. X-macros.

We would still use refactoring tools even though they would often miss stuff. You just rely on a combination of refactoring tool / search and replace / the compiler.

We would also debug our code in release mode with symbols. You get used to a debugging environment where you don't trust anything you're seeing in variables, etc. too.


Depends on what you expect it to be capable of, given the limitations of these systems.


I'm aware of at least one study, by Stanford. The paper (PDF) is linked in this article:

https://www.techopedia.com/is-gpt-4-a-flop

Of course, I'd like to see more than one study. But this one is by a well-known university, and it's pretty conclusive. GPT-4 is getting worse (especially for code, maths, and analytical reasoning) and more censored.


It's important to frame this correctly. The article is a bit misguided (it doesn't matter which university publishes a paper), because there are so many ways in which a model can be altered, even excluding retraining of weights. Also, even if performance has dropped in practice because resources were cut and more shortcuts taken (for example, changing beam-search or typical-sampling parameters), drawing implications about the future is not really appropriate, since retraining weights, changing the architecture, etc. can improve capabilities immensely.

It's important not to suggest that GPT systems in general are on the way out simply due to some small alterations in parameters that make a system slightly less performant (which seems to be a popular perspective).


Isn't this the study that asked a bunch of questions with the same answer ("yes") and basically the old model always answered "yes" and the new model always answered "no"? That's not a degradation in performance. It was never answering the questions in the first place, just guessing. The only thing that changed was the default guess.


No it's not. Here's a link to the study, you can check the questions they asked.

https://arxiv.org/pdf/2307.09009.pdf


You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.
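
To make the flaw concrete, here's a toy sketch: when every test item is prime, the measured "accuracy" tracks the constant default guess, not any reasoning (the numbers are illustrative; the study used larger primes):

    # Toy sketch of the v1 "math" benchmark flaw: every answer is "yes"
    # (all test numbers are prime), so a constant guesser's score flips
    # from 100% to 0% when its default changes -- with zero reasoning.
    primes = [101, 103, 107, 109, 113]

    def march_model(n):  # caricature of the old checkpoint
        return "yes"     # faulty chain of thought, then a "yes" guess

    def june_model(n):   # caricature of the new checkpoint
        return "no"      # same non-reasoning, different default guess

    def accuracy(model):
        return sum(model(n) == "yes" for n in primes) / len(primes)

    print(accuracy(march_model))  # 1.0 -- looks brilliant
    print(accuracy(june_model))   # 0.0 -- looks broken; neither did any math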

In the programming category, the newer GPT-4 actually performed significantly better, but it started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068
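
For what it's worth, handling that on the evaluation side is a small fix; a sketch (not the study's actual harness) of stripping Markdown fences before judging an answer:

    # Sketch of the evaluation fix: extract the code from a Markdown-fenced
    # reply before judging it, instead of treating the fences as code.
    import re

    def extract_code(reply):
        # Prefer the contents of the first ```-fenced block, if any.
        match = re.search(r"```(?:python)?\s*\n(.*?)```", reply, re.DOTALL)
        return match.group(1) if match else reply

    reply = "Here you go:\n```python\ndef add(a, b):\n    return a + b\n```"
    print(extract_code(reply))  # just the function, runnable as-is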

They later submitted a revision to the study attempting to correct these blatant issues, but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.


At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.

Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...

Every time I look at this paper, my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks; it feels like too perfect a coincidence that someone who built a career on ML research would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...


On most of their tests, GPT-4 is not actually worse [1]. In particular, the coding results are affected by a changed output format rather than worse abilities [2]. But that's OK, because the message of the paper is that there is strong drift between versions and developers should be aware of it, not that GPT has gotten worse [3].

[1] https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

[2] https://twitter.com/Si_Boehm/status/1681801371656536068

[3] https://twitter.com/matei_zaharia/status/1681805357516210177


Also remember that bad research can come out of good universities. Remember the "gzip beats BERT" paper that claimed gzip beat BERT at many kNN-based tasks? Or just Google "Wansink Cornell".

So it's best to treat every paper as an i.i.d. sample and judge it on its own merits.


Quality has degraded but token generation speed has increased. The GPT-4 of today isn't the same as it used to be. To get the old, slower model you need to use gpt-4-0314, which gives higher-quality answers like it used to.


It's a lot like working with a human: you accept imperfect work from people all the time.


But the model in OP is fine-tuned by "a proprietary dataset of ~80k high-quality programming problems and solutions". How do we know it's not contaminated by HumanEval too?


From the OP:

> Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.
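
For context, that methodology is (as I understand it) a substring-overlap check between eval examples and training data; a rough sketch, where the probe length and count are my assumptions rather than OpenAI's exact parameters:

    # Hedged sketch of a substring-overlap decontamination check: sample
    # short substrings of each eval example and flag any that appear
    # verbatim in the training corpus. Parameters are assumptions.
    import random

    def is_contaminated(eval_example, training_corpus, n_probes=3, probe_len=50):
        text = " ".join(eval_example.split())  # normalize whitespace
        if len(text) <= probe_len:
            return text in training_corpus
        for _ in range(n_probes):
            start = random.randrange(len(text) - probe_len)
            if text[start:start + probe_len] in training_corpus:
                return True
        return False

    corpus = "... def has_close_elements(numbers, threshold): ..."
    print(is_contaminated("def has_close_elements(numbers, threshold):", corpus))  # True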


(Chat)GPT-4's practical coding abilities are now 100x better because it can code, run the code, and reason about its performance mid-response. They must be using fine-tunes for this, so the overall model could well be better too.


You can do that as well, under your complete control. That's a framework they put around the model.


"model" is end-to-end, input-to-output, inclusive of the entire framework and it's guardrails and everything else

if they are able to detect hallucinations, filter them out and automatically re-run, that's a huge improvement in result, even though core model didn't get new training


That's the product. The model is the kernel of the product.


That's what I said.


Only Python though, right?


Is it possible to learn this power?


There weren't any serious examples of degradation.

Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?

Ignoring those concerns, it fails a reasonableness smell test:

We'd have to pretend it's the original GPT-4 release from March 2023 until GPT-5 comes out, and only then could OpenAI's work be compared to LLaMA-2 through LLaMA-N.


There are a couple of things here:

1. I'm not saying we have to wait until GPT-5, we just need an apples-to-apples comparison where contamination is taken into account

2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

3. I've personally noticed degradation anecdotally in the GPT-4 June update vs. the original March release


> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

Once Markdown formatting is accounted for, the June model's answers on the Leetcode questions from the LLM Drift paper's testing improve to 70% (35/50) vs. the March model's 52% (26/50).

see:

* https://github.com/lchen001/LLMDrift/blob/main/generation/

* https://twitter.com/Si_Boehm/status/1681801371656536068


1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?

2. Link in the post you replied to.

3. Subjectivity is fine by me! There's a motte-and-bailey flavor to it if we combine your comment and this one, cf. "This is why we use the official numbers."


I think you're assuming that OpenAI is incentivized to benchmark honestly. Like every other company for which a benchmark is a goal, they are not.


Also for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Because performance is going to vary against different sorts of tasks, and there are a literally infinite number of categories of tasks, so it's not like you can ever truly get a fair sampling.

Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source you end up with a similar risk of it being gamed. It's like how old school Google puzzle interviews went from seeing who was most clever [in that domain], to seeing who'd booked up the most.


Well, no, we have the HumanEval results for the June release.


Which is both (1) a subjective selection to measure the effectiveness of various chatbots and (2) now subject to gaming from companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.



