Please test their models before you take this release at face value.
Eleuther has a history of claiming to replicate projects when they haven't. For example, they shipped a broken DALL-E repo a few days after OpenAI announced it (https://twitter.com/theshawwn/status/1348017515897659392), and they've walked back their GPT-3 replication claims to replicating 1.5B because their architecture doesn't scale.
As far as I can tell, they're generating a large amount of hype with grandiose claims that they can't deliver on.
All I care about is whether you like their models and actually use them in practice. If you do, please let me know and I'll pipe down. But so far, I haven't heard of anyone who uses anything they've produced, and that worries me. Has anyone?
"DALL-E is quite straight forward and already coded. We just need data to train it."
No, DALL-E is neither straightforward nor was it successfully coded, especially back on January 7th.
Anyway, carry on. I really don't like speaking badly of AI projects, and I hope that they succeed. The model release today is a good step forward, assuming it works. But it might be better to have the expectation of "the models don't work" until proven otherwise.
I'd also like to point out that there are some capable people doing work at Eleuther. Sid in particular is one of the best TPU hackers in the scene. I just wish they would scale down their claims, release more models, and not claim that they've done X until actually doing X. For example, the readme says they have "the ability to scale up to full GPT3 sizes (and possibly more!), using the mesh-tensorflow library," which they don't.
I don't think any single thing you've claimed is factually wrong, and I don't speak for Eleuther nor am I attempting to justify their claims.
But.
As I understand it (mostly from lurking on their Discord and reading publicly available materials), this is a group of volunteer academic types trying to replicate something great and awesome, with the only goal of giving it to the world. You could cut them some slack.
I can't speak for you, but as a "for free, weekend project" what they've done certainly makes me feel I need to up my game.
I am sorry that I was misinformed about the state of our DALL-E replication when I made that tweet. It was not malicious - I was reporting what I had been told by someone else.
Yes, I was wrong. That said, I had hoped that maybe after two and a half months Shawn would stop holding it over my head.
> Claiming something that is not true is in itself wrong.
I 100% agree with this.
I also think that one catches more flies with honey than vinegar, and the criticism in the parent comment, while possibly valid, could be phrased more encouragingly and less combatively. It's easy to criticize, it's hard to create, and it's even harder to release.
> Claiming something that is not true is in itself wrong.
Yup. In any project, and especially one done for the community, where the only rewards you get are satisfaction and fame, success is tightly tied to communication. Good, honest communication is what builds trust.
Not just that. To even get access to their API, you need to apply. That, I'm afraid, is the future of AI without projects like this: elites controlling AI and deciding who is "worthy" to use it.
I'm sure they have the best of intentions but "worthiness" is subjective.
Depends on who you mean by "they". If you mean the researchers, then sure, they probably actually believe whatever's written in their ethics statement.
Now, the actual owners? I don't believe it for a second.
Considering the social consequences of easy access to APIs and data over the last decade, I am quite happy that these initiatives are cautious about opening up software that can have a huge impact on society.
Unfortunately, the cat is out of the bag. Their methods are documented and the results are exciting, so for a bad actor (especially a state-sponsored one) it's completely justified to spend millions attempting to replicate their results from what is publicly available.
Good perspective but I would like to hear the response from the developers before concluding too much.
This is not meant as a goad to you, but more as extra info for everyone: my understanding is that it is an open-source community of like-minded people (as opposed to a bigco) that actively solicits contributions (by which I mean code and data), so anyone seeing room for improvement is welcome to step in, from what I can tell.
I did find your comment helpful and informative; just adding another angle here.
I think it would help your PR efforts to let people know that more often. People hear "we are <org_name>" and assume you're, you know, an organization. That comes with some amount of expected bug fixes, documentation, verifying results _before_ you release, etc.
I'm not really sure how much you gain by attracting tons of people to the discord if you finally release and everyone has unreasonable expectations due to the way you advertise the group as a whole.
This'll probably sound silly, but we weren't really expecting much of a response. We've been sitting on these models (they were trained for the Pile paper) for months for no particular reason besides being focused on other things. We figured we'd put it out there in case anyone was interested and then Aran's tweet blew up hard.
Fair enough! I can personally attest to the results being quite impressive now. Fantastic work and apologies for the criticism. It must be a strange situation to get dragged in to all this due to a viral tweet.
People are always claiming to release replicated models by replicating the architecture (or the main parts of it) but not testing whether it produces the same level of results. It's maddening, especially when the level of results is so directly measurable (just measure what the paper measured; not that it's easy, but it is concrete).
Is there anything on its few-shot learning performance? I took few-shot learning to be the main point of GPT-3. Sorry if I just overlooked it, but I don't see anything on few-shot learning in the readme.
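For readers unfamiliar with the term: few-shot means packing a handful of solved examples into the prompt and letting the model complete the last one, with no gradient updates. A minimal sketch (the task and examples here are invented for illustration):

    # A few-shot prompt in the style of the GPT-3 paper: a few solved
    # examples, then an unsolved one for the model to complete.
    prompt = (
        "English: cheese\nFrench: fromage\n\n"
        "English: house\nFrench: maison\n\n"
        "English: cat\nFrench:"
    )
    # The model's completion of `prompt` is the few-shot "answer";
    # the model's weights are never updated.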
I would guess that an average FAANG ML engineer could code up and successfully execute a forward/backward pass on a GPT-1 or GPT-2 model with a day of effort or less. (GPT-3 a little harder, but not significantly). But is that model actually going to perform well? Most likely no. Model performance varies significantly due to subtle details in data processing implementations, seemingly insignificant details in code, and even from different numerical methods of calculating the same semantics.
If you don't believe me, consider that many ML researchers track their commits (or exact code versions) extremely carefully, because oftentimes they will make some change (or changes) they think are inconsequential and later find that actually, their model broke. If they made too many changes, whoops, guess you have to binary search over the diff to see what happened since your last "good run".
If the people who spent months (if not years) tuning a model can't tell whether it will work from the code, how could anyone else? Most ML researchers will not bother with most code that doesn't give proof of results (in terms of a model that can actually be evaluated) because it is just so unlikely that it will actually work well. Now, it might "work" in the sense that it converges and does something when you prompt it with examples. But will this GPT-3 reimplementation actually outperform, say, the 10x smaller T5 checkpoint that was released by Google, or the other smaller language models others have released? If it doesn't, it's hard to argue that it's very useful at all.
I think that's the spirit of why the original commenter said what they did, but I still do applaud the efforts of this team (and hope that their implementation is, in fact, highly performant!)
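For concreteness, here is a minimal sketch of the kind of forward/backward pass described above, leaning on Hugging Face's GPT-2 implementation rather than hand-written blocks. Getting this to run is the easy part; getting GPT-3-level results out of a reimplementation is the hard part.

    # Minimal forward/backward pass on GPT-2 via the transformers library.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("The unicorns spoke perfect English.", return_tensors="pt")
    # With labels equal to the input ids, the model returns the LM loss.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()  # backward pass; an optimizer step would follow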
GPT-3 is the name for the architecture, but there are a few different versions/sizes. The OpenAI version that impressed us all was ~175B parameters; this is far smaller.
To go from 2.7B to 175B parameters will need more than just a few config tweaks. There's a whole bunch of hacks and tricks needed to coax a model to train at that scale; the Eleuther version is almost guaranteed to fail out of the box.
It's worth noting that the GPT-3 paper did train models with more sane sizes (e.g. 1.5B) as a point of comparison. I am surprised/annoyed they never released them though.
Ada is 2.7B, Babbage is 6.7B, Curie is 13.0B, and DaVinci is 175B. The new one they announced last month is in the 20-50B range I think, not totally sure though.
What I find interesting about their marketing(?) is that they identified a market niche that they want to position themselves in.
Enterprise customers that have no idea about the technical details will just hear about OpenAI's success in this fancy new model and assume that Eleuther can deliver.
I mean, most use cases for "big data" projects that are tiny in comparison with Alphabet's datasets will probably work just fine with GPT-2.

And for enterprise customers who hear those claims and see some code, maybe a demo, that's enough to start the consultancy process.
In my opinion, that's a policy problem that OpenAI introduced by not requiring absolute reproducibility of both the code and the model, and of both the training procedure and the dataset, when releasing their models.

Stakes are pretty high in the AI industry, and OpenAI actively influences it. In the beginning, my dream was that they would be a source of verification, audits, and "proof" that models are legit... yet lately I have the feeling that they just buzz around like everyone else.
To this day, I haven't seen anyone replicate any of the DNC results, for example.
To date, EleutherAI as an "organization" (read: basically a Discord server) has not really attempted any kind of marketing. It has no PR dept, just individuals tweeting about the work that Eleuther does.
This is a nice release, but the title is a bit misleading, as the released sizes (1.3B and 2.7B parameters) do not yet compare to the size of GPT-3 (175B) but rather to GPT-2 (1.5B) (although future releases may be significantly larger!).
With training improvements such as DeepSpeed, the GPU costs will likely be substantially lower than they were when OpenAI trained GPT-3. Still not free, though.
The hard part with GPT-3 is it's big enough to make it difficult to actually deploy.
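On the DeepSpeed point above: much of the savings comes from things like fp16 training and ZeRO's partitioning of optimizer state across GPUs, both switched on in the config file passed to the deepspeed launcher. A minimal sketch (the values are illustrative, not a tuned setup):

    # Minimal DeepSpeed config sketch; batch sizes are placeholders.
    import json

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 8,
        "fp16": {"enabled": True},          # half-precision training
        "zero_optimization": {"stage": 2},  # shard optimizer state + gradients
    }
    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)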
GPT-3 isn't a single model. It's a model architecture that GPT-Neo follows very closely. The 2.7B model is the exact same size as something OpenAI sells under the label "GPT-3".
My line of thinking was that for the average HN reader, who has probably read 'GPT-3' perhaps 500 times by now (every instance of which was referencing OpenAI's infamous 175B model), it may be confusing for them to see this with the same label, when the release is not comparable as far as parameters/performance (yet). But as yourself and another commenter noted, it is still the GPT-3 architecture (or hopefully isomorphic to it), so I appreciate your correction as well.
That's fair. I also later learned that the title didn't explicitly mention model size at first, and I would have probably raised similar complaints had I seen that.
Not hugely, but yes. I tend to think of GPT as a style of architecture with consistent themes and major features, but varying minor features and implementation details. Off the top of my head, I believe the most important difference is that GPT-3 alternates global and local attention while GPT-2 is all global attention.
The two published GPT-Neo models follow GPT-3's lead but the repo lets the user pick whether to use global or local attention layers.
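To make the global/local distinction concrete, a toy sketch (the names are illustrative, not GPT-Neo's actual config keys):

    # A 12-layer model alternating global and local self-attention, as the
    # GPT-3 paper describes; GPT-2 would be ["global"] * 12 instead.
    attention_pattern = ["global" if i % 2 == 0 else "local" for i in range(12)]
    # In a "local" layer each token only attends over a fixed window of
    # recent tokens, which cuts attention cost on long contexts.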
Many people (correctly, in my view) criticized OpenAI for the name, saying that openness should be evaluated on a case by case basis. Glad they listened to critics instead of trying to maintain consistency for its own sake.
Is there anything a non-AI researcher can do to help support this project? Is there a way to donate money? Or could a software engineer help with testing, tooling, or other kinds of infrastructure?
I was really excited about OpenAI's original plan and still believe that an open source solution is the best way to prevent the potential negative impacts AI might have on society. I can sort of appreciate why OpenAI went the route of going private and trying to monetize their work instead; it might prevent people from using their work nefariously and will probably provide them with far more capital to continue their efforts. But I trust humanity as a collective more than any particular group of people in the long run. I'm sure there are many others like me who would be eager to help out if they could.
Edit: EleutherAI has a whole page on their site about how others can contribute: https://www.eleuther.ai/get-involved/. I didn't see anything about accepting donations though, if anyone involved with the project was interested in setting up a crowdfunding account somewhere I'd be eager to donate.
You can indirectly support the project by supporting the host that serves their data, https://the-eye.eu
Right on the front page they write:
> Hey there fans! We are currently looking for help funding large storage upgrades,
> if you want to help us serve more data see our donation options (crypto, etc)
> Thanks for reading, happy downloading!
The Eye has been a phenomenal partner and enables a lot of what we do. In addition to providing terabytes of storage for free, they also help us out with CPU from time to time.
Indirectly, they say you can donate money in the form of rented computation:
“As an independent organization, we are dependent upon donations for our computing costs. We are always interested in speaking with people who can donate compute times.”
"Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters"
README:
"1T or bust my dudes [...] An implementation of model & data parallel GPT2 & GPT3 -like models, with the ability to scale up to full GPT3 sizes (and possibly more!)"
It seems the largest model they released is 2.7 billion parameters, or about 1.5% the size of GPT-3. The most interesting part about GPT-3 was its size, and it seems this is only "GPT-3-like" in architecture.
I also have a translation library with ~100 million (0.001 GPT-3) parameters:
GPT-3 is a model architecture, not a model. While the largest GPT-3 model is 175B, that very paper has a table that includes "GPT-3 XL" (1.3B) and "GPT-3 2.7B" as models in the GPT-3 architecture. The 2.7B model is the same size as Ada, a model that OpenAI currently sells API access to under the moniker "GPT-3"
None of the other models are even close to the big one, and the paper itself also suggests calling the big one "GPT-3", which people do very often in practice. So the term is often ambiguous, but saying it only means the architecture isn't right either.
What does he mean when he says "1T or bust"? Is he referring to 1 trillion parameters? Are you saying that GPT-3 has 2.7 trillion parameters? Does it mean that to get to GPT-3's level it needs 100x more data?
GPT-3 has 175 billion parameters, so they need to scale up by about 64x. They already have a comparable amount of data to what OpenAI used, so it's mostly about scaling the number of GPUs.
Accuracy and parameter count don't scale linearly together, and the relationship varies widely depending on exactly what you are measuring accuracy on.
But a very approximate rule of thumb would be to say that accuracy scales with the log of the parameter count (for the same architecture).
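To put rough numbers on that rule of thumb (back-of-the-envelope arithmetic only, not a measured result):

    import math

    neo, gpt3 = 2.7e9, 175e9
    print(gpt3 / neo)              # ~64.8x parameter gap
    print(math.log10(gpt3 / neo))  # ~1.8 orders of magnitude
    # Under a log-scaling rule of thumb, going from 2.7B to 175B buys
    # just under two "decades" of parameter scale worth of improvement.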
Hi! Thanks for trying it out. There was a bug that should now be fixed. When I run the example unicorn prompt I get the following. Don't hesitate to open an issue if you're still having trouble.
"In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Bebek Uranzoglu, another member of the research team from the University of British Columbia, was working on a project the Latino-Canadian rodeo competition equipos to document a rare and remarkable ecosystem in the Andes Mountains.
His curiosity was piqued when he spotted an adolescent herd of about 10 unicorns foraging in a forest near the valley of the Jumbo Flu Group. The unicorns — whose numbers once swelled to 46,000 — were perched on the forest floor and watched the researchers work.
Urizoglu grew excited when he spotted another group that seemed to be thriving in an area below the herd. The team hoped the apparent population growth would indicate a human presence.
But when a team of researchers set up a camera trap, they were surprised to find the unicorns in the first place, and in a forest near a lake — in fact the forest was almost entirely made up of the animals. Despite their own magical presence, the team could not see the herd was populated by humans.
“The whole place almost smelled like animals,” says Bebek. “We were never able to find human footprints at any of the points we stood at. The trees were so large, you wouldn’t have been able to walk 40 meters through them. We assumed that the truth of the matter was, ‘Well the deer didn’t like this forest at all.’”
Same here. I managed to make it "work" in the sense that it wouldn't crash during inference, but then it generated gibberish. Has anyone managed to make it work reliably?
Whilst obviously BERT is not the same as GPT-3 in architecture, Amazon's recent paper discussing architecture optimizations for BERT seems pretty relevant here (https://arxiv.org/pdf/2010.10499.pdf), given the chance to improve upon GPT-3's architecture (because it surely isn't the best we can get). Has the Eleuther.ai team been exploring this?
Could the title of this post be changed to emphasize that the model sizes released were 1.3B and 2.7B? Something like "EleutherAI releases 1.3B and 2.7B parameter GPT-like language models". The current title implies that a full sized GPT-3 model is currently available, which is not the case.
edit: the title has been changed, seems good enough
My experience is that replicating papers is actually nontrivial. For example, someone announced they had replicated GPT-2 some time back, but when evals were run it turned out to be the equivalent of a much smaller model.
I think we need more funding outside of large tech companies and OpenAI for these kinds of things. I wonder if there is a way to crowdsource donations to rent the hardware to train big versions of these things in an open manner.
If I wanted to build a support Q&A system using texts from support logs, training docs, transcribed videos, etc. (basically as much text about my product as I can get), would this model be a good start?
Also, for a quick and simple Q&A system, Haystack (https://github.com/deepset-ai/haystack), essentially dense vector similarity on Elasticsearch, looks pretty promising and supports the whole pipeline.
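For anyone curious what "dense vector similarity" means here, a minimal sketch of the underlying idea using the sentence-transformers library rather than Haystack's own API (model name and documents are placeholders):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["Reset your password from the account settings page.",
            "Invoices are emailed on the 1st of each month."]
    doc_emb = model.encode(docs, convert_to_tensor=True)

    # Embed the question and return the closest passage by cosine similarity.
    query_emb = model.encode("How do I change my password?", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    print(docs[int(scores.argmax())])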
    def forward(self, x):
        # standard block: convolution, then normalization, then activation
        x = self.convolution(x)
        x = self.normalization(x)
        x = self.activation(x)
        return x
Is there something like chattingtransformer (https://pypi.org/project/chattingtransformer/) for GPT-Neo? I.e., a trivial way to get text completion on a sample with sane defaults from the command line.
edit: Oh, I see the "generating text" section. Any way to run it on CPU, even if it takes an hour?
Stella mentioned elsewhere in this thread that HuggingFace is adding support for the Eleuther model, so text generation should become trivial once this work is complete.
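Assuming that HuggingFace support does land under a model id like "EleutherAI/gpt-neo-1.3B" (an assumption; check the model hub for the real name), CPU-only generation should be roughly as simple as:

    from transformers import pipeline

    # device=-1 forces CPU; expect generation to be slow for large models.
    generator = pipeline("text-generation",
                         model="EleutherAI/gpt-neo-1.3B", device=-1)
    out = generator("In a shocking finding, scientists discovered",
                    max_length=50, do_sample=True)
    print(out[0]["generated_text"])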
tl;dr we are doing pretty much exactly as well as we expected on LAMBADA and WikiText. Results on more sophisticated tasks will take some time, but HuggingFace is currently working on implementing our model in the transformers library and when they do so we can easily run a lot of analyses very quickly.
We actually built an evaluation suite that integrates with HF, but interfacing with the MTF code that GPT-Neo was written in was too much of a pain in the ass because Mesh TensorFlow is the worst. https://github.com/EleutherAI/lm-evaluation-harness
Does anyone know if there's a hosted version of this kind of GPT model somewhere? All I want to do is just call a GPT-2 API and get a response back, I'm not interested in setting up the entire infrastructure by myself.
I think this is an important problem. With logistic regression or deep learning, at least one can compare (out of sample) calibration curves or discrimination measures. With a language model, what can we do?
This is a good start, but given the breadth of applications this would hardly give us enough to compare, as the goal of these models isn't to simply recite Wikipedia articles. What about language translation? Content summarization? Code generation? Turing test performance?
Both models were trained on Wikipedia, so that's a particularly bad choice. But yes, in practice this is what people tend to do. Take the results with a very large grain of salt though, as the domain of the prompts you feed it makes a huge difference.
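One crude but concrete way to compare two causal language models on text from a chosen domain is held-out perplexity; a minimal sketch (model name and text are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "Held-out text that the model did not see during training."
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss   # mean cross-entropy per token
    print(torch.exp(loss).item())         # perplexity; lower is better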
So I would want to include a big corpus like GPT-3 or this newfangled "Neo" thing but still have it trained to respond to our own customers based on 200k email passages.
200k emails is not enough to train a model from scratch. If you check out the Google Colab notebook in the GPT-Neo repository, it explains how to fine-tune the model on your own data, which is what you want to do.
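The Colab notebook covers the Mesh TensorFlow route; once the HuggingFace port mentioned elsewhere in the thread is merged, fine-tuning could look roughly like this (model id, file path, and hyperparameters are all placeholders):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in model id
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # One email passage per line in a plain-text file.
    ds = load_dataset("text", data_files={"train": "support_emails.txt"})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1),
        train_dataset=ds["train"],
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()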
I wouldn't trust any model to generate text for customers yet. Not even the largest GPT-3. There are no guarantees on what they will output, and it could be damaging to your business.
You're better off either:
1- Defining common "intents" that a lot of customer queries are categorized into, and having a model map the incoming message to the appropriate canned response. Look at Rasa for an example of this.
2- If you insist on generating the text, have it be a recommendation to a human agent who either chooses to send it or writes their own response.
Larger models aren't really more complicated than smaller ones, though. GPT-2 is already supported, and I believe the only difference with GPT-3 is sparse attention.
Can somebody explain to this beginner how to use this? Where can I load this code and start running it? How can I train it on a dataset and what do I need to prepare?
There's a lot of language here I don't understand. For example, what is he referring to when he says 1.5B or 1T weights?
What resources/videos can I watch in order to start tinkering with this?
Colab is free to use -- you can click Runtime → Run All to run the cells in the notebook free-of-charge. (You may need to be logged in to a Google Account to run it.)
Very cool! Side question, but is there a complete guide for learning PyTorch with Colab?
I tried to learn ML a few years ago but gave up because I couldn't install CUDA on my machine for some reason. The landscape seems to change dramatically.
I am interested in transformers, in particular completing incomplete images like what https://openai.com/blog/image-gpt/ does. Is there a project that implements that and would let me start training?

I'm excited, but I just get overwhelmed as to where I need to focus my attention.
My goal is to utilize something like image-gpt but for a more narrow domain (ex. only dealing with cats), how can I build my knowledge and skills towards that goal?
Many thanks for your answers; I'm really looking forward to learning this stuff.