GPT Neo: open-source GPT model, with pretrained 1.3B & 2.7B weight models (github.com/eleutherai)
561 points by pizza on March 21, 2021 | 127 comments


Please test their models before you take them at face value.

Eleuther has a history of claiming to replicate projects when they haven't. For example, they shipped a DALL-E repo a few days after OpenAI announced it (https://twitter.com/theshawwn/status/1348017515897659392) which was broken, and they've walked back their GPT-3 replication claims to replicating 1.5B due to the fact that their architecture doesn't scale.

As far as I can tell, they're generating a large amount of hype with grandiose claims that they can't deliver on.

All I care about is whether you like their models and actually use them in practice. If you do, please let me know and I'll pipe down. But so far, I haven't heard of anyone who uses anything they've produced, and that worries me. Has anyone?

One specific claim they made: https://twitter.com/BlancheMinerva/status/134727697554780980...

"DALL-E is quite straight forward and already coded. We just need data to train it."

No, DALL-E is neither straightforward nor was it successfully coded, especially back on January 7th.

Anyway, carry on. I really don't like speaking badly of AI projects, and I hope that they succeed. The model release today is a good step forward, assuming it works. But it might be better to have the expectation of "the models don't work" until proven otherwise.

I'd also like to point out that there are some capable people doing work at Eleuther. Sid in particular is one of the best TPU hackers in the scene. I just wish they would scale down their claims, release more models, and not claim that they've done X until actually doing X. For example, the readme says they have "the ability to scale up to full GPT3 sizes (and possibly more!), using the mesh-tensorflow library," which they don't.


Geez, that's really harsh.

I don't think any single thing you've claimed is factually wrong, and I don't speak for Eleuther nor am I attempting to justify their claims.

But.

As I understand it (mostly from lurking on their discord and reading publicly available materials) this is a group of volunteer academic types trying to replicate something great and awesome, with the only goal of giving it to the world. You could cut them some slack.

I can't speak for you, but as a "for free, weekend project" what they've done certainly makes me feel I need to up my game.


This has nothing to do with the good work, the awesome intentions, or the fact that they have no financial incentives behind this.

Claiming something that is not true is in itself wrong.


I am sorry that I was misinformed about the state of our DALL-E replication when I made that tweet. It was not malicious - I was reporting what I had been told by someone else.

Yes, I was wrong. That said, I had hoped that maybe after two and a half months Shawn would stop holding it over my head.


> Claiming something that is not true is in itself wrong.

I 100% agree with this.

I also think that one catches more flies with honey than vinegar, and the criticism in the parent comment, while possibly valid, could be phrased more encouragingly and less combatively. It's easy to criticize, it's hard to create, and it's even harder to release.


> Claiming something that is not true is in itself wrong.

Yup. In any project, and especially one done for the community where the only things you get are satisfaction and fame, success is super tied to communication. Good, honest communication is what builds trust.


"Claiming something that is not true is in itself wrong." Does this apply to butterflies disguising themselves as dead leaves?


Wouldn't it be nice if OpenAI were like .. actually open? :P


Not just that. To even get access to their API, you need to apply. That, I'm afraid, is the future of AI without projects like this: elites controlling AI and deciding who is "worthy" to use it.

I'm sure they have the best of intentions but "worthiness" is subjective.


Depends on who you mean by "they". If you mean the researchers, then sure, they probably actually believe whatever's written in their ethics statement.

Now, the actual owners? I don't believe it for a second.


Considering the social consequences of easy access to APIs and data over the last decade, I am quite happy that these initiatives are cautious about opening up software that can have a huge impact on society.


Unfortunately, the cat is out of the bag. Their methods are documented and the results are exciting, so to a bad actor (especially a state-sponsored one) it's completely justified to spend millions attempting to replicate their results from what is publicly available.


Good perspective but I would like to hear the response from the developers before concluding too much.

This is not meant as a goad to you, but more as info for everyone: my understanding is that it's an open-source community of like-minded people (as opposed to a bigco) that actively solicits contributions (by which I mean code and data), so from what I can tell anyone seeing room for improvement is welcome to step in.

I did find your comment helpful and informative; just adding another angle here.


It's literally a couple of people hanging out in a Discord channel and doing this as a way to procrastinate on their jobs.


Peak laziness: build AI to do your job so you have more time to build AI.


I think it would help your PR efforts to let people know that more often. People hear "we are <org_name>" and assume you're, you know, an organization. That comes with some amount of expected bug fixes, documentation, verifying results _before_ you release, etc.

I'm not really sure how much you gain by attracting tons of people to the discord if you finally release and everyone has unreasonable expectations due to the way you advertise the group as a whole.


This'll probably sound silly, but we weren't really expecting much of a response. We've been sitting on these models (they were trained for the Pile paper) for months for no particular reason besides being focused on other things. We figured we'd put it out there in case anyone was interested and then Aran's tweet blew up hard.


Fair enough! I can personally attest to the results being quite impressive now. Fantastic work and apologies for the criticism. It must be a strange situation to get dragged in to all this due to a viral tweet.


People are always claiming to have released replicated models after replicating the architecture (or the main parts of it) without testing whether it produces the same level of results. It's maddening, especially when the level of results is so directly measurable (just measure what the paper measured; not that it's easy, but it is concrete).


Our README has comparisons with GPT-2 and GPT-3.


Is there anything on its few shot learning performance? I took few shot learning as the main point of GPT-3. Sorry if I just overlooked it, but I don’t see anything on few shot learning in the readme.


Disclaimer: I know absolutely nothing about machine learning.

Isn't GPT-3 the architecture? Are they doing something different or why would it not scale?


I would guess that an average FAANG ML engineer could code up and successfully execute a forward/backward pass on a GPT-1 or GPT-2 model with a day of effort or less. (GPT-3 a little harder, but not significantly). But is that model actually going to perform well? Most likely no. Model performance varies significantly due to subtle details in data processing implementations, seemingly insignificant details in code, and even from different numerical methods of calculating the same semantics.

If you don't believe me, consider that many ML researchers track their commits (or exact code versions) extremely carefully, because oftentimes they will make some change (or changes) they think are inconsequential and later find that actually, their model broke. If they made too many changes, whoops, guess you have to binary search over the diff to see what happened since your last "good run".

If the people who spent months (if not years) tuning a model can't tell whether it will work from the code, how could anyone else? Most ML researchers will not bother with most code that doesn't give proof of results (in terms of a model that can actually be evaluated) because it is just so unlikely that it will actually work well. Now, it might "work" in the sense that it converges and does something when you prompt it with examples. But will this GPT-3 reimplementation actually outperform, say, the 10x smaller T5 checkpoint that was released by Google, or the other smaller language models others have released? If it doesn't, it's hard to argue that it's very useful at all.

I think that's the spirit of why the original commenter said what they did, but I still do applaud the efforts of this team (and hope that their implementation is, in fact, highly performant!)


GPT-3 is the name for the architecture, but there are a few different versions/sizes. The OpenAI version that impressed us all was ~175B parameters; this is far smaller.

To go from 2.7B to 175B parameters will need more than just a few config tweaks. There's a whole bunch of hacks and tricks needed to coax a model to train at that scale; the Eleuther version is almost guaranteed to fail out-of-the-box.


It's worth noting that the GPT-3 paper did train models with more sane sizes (e.g. 1.3B) as a point of comparison. I am surprised/annoyed they never released them though.


It's because OpenAI sells them for profit. The "Ada" model is the same size as the larger of these two EleutherAI models.


Huh, I was wondering what the sizes of the non-davinci models were; guess that makes sense.

It's still telling that a "small" GPT-3 model can risk cannibalizing a larger model.


Ada is 2.7B, Babbage is 6.7B, Curie is 13.0B, and DaVinci is 175B. The new one they announced last month is in the 20-50B range I think, not totally sure though.


It's the model, not the architecture, but you could say the model contains the architecture.


Almost all the challenges with GPT-sized models are engineering and training challenges, not architectural.

How do you train a model too big to fit in a single GPU? It's doable, but not simple. How do you update weights across your cluster? etc etc
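
To make the "too big for one GPU" point concrete, here is a minimal toy sketch of naive model parallelism in PyTorch (my own illustration, not how GPT-Neo / mesh-tensorflow actually shards things): the layers live on different devices and the activations hop between them during the forward pass.

    import torch
    import torch.nn as nn

    class TwoGpuMlp(nn.Module):
        # Toy model parallelism: half the layers on cuda:0, the other half on cuda:1.
        def __init__(self, dim=4096, depth=8):
            super().__init__()
            half = depth // 2
            self.first = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(half)]).to("cuda:0")
            self.second = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(depth - half)]).to("cuda:1")

        def forward(self, x):
            x = self.first(x.to("cuda:0"))
            return self.second(x.to("cuda:1"))  # activations cross devices; the weights never move

    model = TwoGpuMlp()                       # needs two GPUs to run
    loss = model(torch.randn(2, 4096)).sum()
    loss.backward()                           # autograd handles the cross-device backward pass

Real setups (tensor/pipeline parallelism, ZeRO-style optimizer sharding) are far more involved, which is exactly the point above.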


What I find interesting about their marketing(?) is that they identified a market niche that they want to position themselves in.

Enterprise customers that have no idea about the technical details will just hear about OpenAI's success in this fancy new model and assume that Eleuther can deliver.

I mean, most use cases for "big data" projects that are tiny in comparison with Alphabet's datasets will probably work fine with just GPT-2.

And for enterprise customers, hearing those claims and seeing some code, maybe a demo, is enough to start the consultancy process.

In my opinion that's a policy problem that OpenAI introduced by not requiring, upon release, the absolute reproducibility of both the code and the model, and of both the training procedure and the dataset.

Stakes are pretty high in the AI industry, and OpenAI actively influences it. In the beginning my dream was that they would be a source of verification, audits, and "proof" that models are legit... yet lately I have the feeling that they just buzz around like everyone else.

To this date I haven't seen anyone replicate any of the DNC results, for example.

Anyways, just my two cents on this one.


To date, EleutherAI as an "organization" (read: basically a Discord server) has not really attempted any kind of marketing. It has no PR dept, just individuals tweeting about the work that Eleuther does.


Also I'm pretty sure there is no consultancy process.


This is a nice release, but the title is a bit misleading as the released sizes (1.3B and 2.7B parameters) do not yet compare to the size of GPT-3 (175B), but rather GPT-2 (1.5B) instead (although future releases may have significantly more!).

Edit: title improved, thank you!


Yeah. They say they are doing a 10B release soon[1].

I suspect they have run into training issues since they are moving to a new repo[2]

[1] https://twitter.com/arankomatsuzaki/status/13737326468119674...

[2] https://github.com/EleutherAI/gpt-neox/


It's more about hardware - these models were trained on TPUs, while GPT-NeoX is being trained on GPUs graciously provided by Coreweave.


Any idea what the required GPU time would cost (if not donated)? Is GPT-3 just a commodity soon?


Our current estimate is that it requires between 2000 and 4000 V100 months.
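
For rough scale, a back-of-the-envelope conversion (the ~$1-3 per V100-hour on-demand price below is my own assumption, not an official figure):

    hours_per_month = 730
    for months in (2000, 4000):
        for usd_per_hour in (1.0, 3.0):
            cost = months * hours_per_month * usd_per_hour
            print(f"{months} V100-months at ${usd_per_hour:.0f}/hr ~= ${cost / 1e6:.1f}M")
    # 2000 V100-months: ~$1.5M-$4.4M; 4000 V100-months: ~$2.9M-$8.8M

That brackets the ~$4.6M GPT-3 training figure that comes up further down the thread.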


With training improvements such as DeepSpeed, the GPU costs will likely be substantially lower than they were at the time OpenAI trained GPT-3. Still not free, though.

The hard part with GPT-3 is it's big enough to make it difficult to actually deploy.


The number thrown around for GPT-3 is $4.6 million, but I am not sure where that figure originates.


It was a number tossed around by a GPU hosting provider, based on their own costs: https://lambdalabs.com/blog/demystifying-gpt-3/

The reality is that GPT-3 was likely "free" to train on Azure, as Microsoft has provided a lot of resources to OpenAI.


If this is true, I wonder what sort of transactional social-capital exchange is going on instead.


~$4M per full training run, give or take.


Fixed title to reflect that, thanks


I would perhaps change 'GPT-3' to just say 'GPT' instead, as a more salient fix.


GPT-3 isn't a single model. It's a model architecture that is very closely followed by GPT-Neo. The 2.7B model is the exact same size as something OpenAI sells under the label "GPT-3"


My line of thinking was that for the average HN reader, who has probably read 'GPT-3' perhaps 500 times by now (every instance of which was referencing OpenAI's infamous 175B model), it may be confusing for them to see this with the same label, when the release is not comparable as far as parameters/performance (yet). But as yourself and another commenter noted, it is still the GPT-3 architecture (or hopefully isomorphic to it), so I appreciate your correction as well.


That's fair. I also later learned that the title didn't explicitly mention model size at first, and I would have probably raised similar complaints had I seen that.


Is GPT-2's architecture any different?


Not hugely, but yes. I tend to think of GPT as a style of architecture with consistent themes and major features, but varying minor features and implementation details. Off the top of my head, I believe the most important difference is that GPT-3 alternates global and local attention while GPT-2 is all global attention.

The two published GPT-Neo models follow GPT-3's lead but the repo lets the user pick whether to use global or local attention layers.
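
As a toy illustration of the "alternating" idea (my own sketch of the pattern, not the repo's actual config schema):

    # GPT-3-style layer pattern: even layers use dense/global attention,
    # odd layers use local (banded) attention.
    n_layers = 24
    pattern = ["global" if i % 2 == 0 else "local" for i in range(n_layers)]
    print(pattern[:6])  # ['global', 'local', 'global', 'local', 'global', 'local']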


This is incorrect. It's the GPT-3 model architecture and optimisations, and uses training techniques similar to GPT-3.


Thank you, I've rephrased a few things to improve the wording with respect to this.


A great start for a truly open approach. It's ironic that OpenAI isn't particularly open about its tech.


It was disappointing to see just how quickly ClosedAI changed its tune once they produced something of value.


Many people (correctly, in my view) criticized OpenAI for the name, saying that openness should be evaluated on a case by case basis. Glad they listened to critics instead of trying to maintain consistency for its own sake.


Is there anything a non-AI researcher can do to help support this project? Is there a way to donate money? Or could a software engineer help with testing, tooling, or other kinds of infrastructure?

I was really excited about OpenAI's original plan and still believe that an open source solution is the best way to prevent the potential negative impacts AI might have on society. I can sort of appreciate why OpenAI went the route of going private and trying to monetize their work instead, it might prevent people from using their work nefariously and will probably provide them with way more capital to continue their efforts. But, I trust humanity as a collective more than any particular group of people in the long run. I'm sure there are many others like me who would be eager to help out if they could.

Edit: EleutherAI has a whole page on their site about how others can contribute: https://www.eleuther.ai/get-involved/. I didn't see anything about accepting donations though, if anyone involved with the project was interested in setting up a crowdfunding account somewhere I'd be eager to donate.


You may indirectly support the project by supporting the host that hosts their data, https://the-eye.eu

Right on the front they write:

    > Hey there fans! We are currently looking for help funding large storage upgrades, 
    > if you want to help us serve more data see our donation options (crypto, etc) 
    > Thanks for reading, happy downloading!


The Eye has been a phenomenal partner and enables a lot of what we do. In addition to providing terabytes of storage for free, they also help us out with CPU from time to time.


The Eye stores amazing and important archives. Drives the data hoarding community.


Indirectly they say you can donate money, in the form of computation that can be rented: “As an independent organization, we are dependent upon donations for our computing costs. We are always interested in speaking with people who can donate compute times.”


GPT Paper:

"Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters"

README:

"1T or bust my dudes [...] An implementation of model & data parallel GPT2 & GPT3 -like models, with the ability to scale up to full GPT3 sizes (and possibly more!)"

It seems the largest model they released is 2.7 billion parameters, or roughly 1/65th the size of the 175B GPT-3. The most interesting part about GPT-3 was its size, and it seems this is only "GPT-3-like" in architecture.

I also have a translation library with ~100 million (0.001 GPT-3) parameters:

https://github.com/argosopentech/argos-translate


GPT-3 is a model architecture, not a model. While the largest GPT-3 model is 175B, that very paper has a table that includes "GPT-3 XL" (1.3B) and "GPT-3 2.7B" as models in the GPT-3 architecture. The 2.7B model is the same size as Ada, a model that OpenAI currently sells API access to under the moniker "GPT-3"


None of the other models are even close to the big one, and the paper also suggests calling the big one "GPT-3". And people do that very often in practice. So it's often ambiguous but saying the term only means the architecture isn't right either.


What does he mean when he says 1T or bust? Is he referring to 1 trillion parameters? Are you saying that GPT-3 has 2.7 trillion parameters? Does it mean that to get to GPT-3's level it needs a 100x larger dataset?


The saying comes from a slide by Noam Shazeer (see: https://www.youtube.com/watch?v=HgGyWS40g-g&ab_channel=Tenso...). It just means the current goal should be to have models with 1 trillion parameters.


I interpreted it as aspiring to a trillion parameters, but I'm not sure.


GPT-3 has 175 billion parameters, so they need to scale by about 64x. They already have an amount of data comparable to what OpenAI used, so it's mostly about scaling the number of GPUs.


I see, so that means this GPT Neo is 64x less powerful?


Accuracy and numbers of parameters don't scale linearly together. It varies widely depending on exactly what you are measuring accuracy on as well etc. But a very approximate rule of thumb would be to say that accuracy scales with the log of the parameter count (for the same architecture).
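
A crude way to picture that rule of thumb (the coefficients below are made up purely for illustration):

    import math

    def toy_accuracy(params, a=0.20, b=0.05):
        # "accuracy ~ a + b * log10(params)": an illustrative curve, not a real fit
        return a + b * math.log10(params)

    for params in (1.3e9, 2.7e9, 13e9, 175e9):
        print(f"{params / 1e9:6.1f}B params -> toy accuracy {toy_accuracy(params):.3f}")

In other words, a ~64x jump in parameters buys a modest bump under this kind of curve, not a 64x improvement.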


Did anyone manage to successfully run inference in the provided Google Colab (https://colab.research.google.com/github/EleutherAI/GPTNeo/b...)? I can run training, but can't manage to make the inference (even from a pre-trained model) work.


Hi! Thanks for trying it out. There was a bug that should now be fixed. When I run the example unicorn prompt I get the following. Don't hesitate to open an issue if you're still having trouble.

"In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Bebek Uranzoglu, another member of the research team from the University of British Columbia, was working on a project the Latino-Canadian rodeo competition equipos to document a rare and remarkable ecosystem in the Andes Mountains.

His curiosity was piqued when he spotted an adolescent herd of about 10 unicorns foraging in a forest near the valley of the Jumbo Flu Group. The unicorns — whose numbers once swelled to 46,000 — were perched on the forest floor and watched the researchers work.

Urizoglu grew excited when he spotted another group that seemed to be thriving in an area below the herd. The team hoped the apparent population growth would indicate a human presence.

But when a team of researchers set up a camera trap, they were surprised to find the unicorns in the first place, and in a forest near a lake — in fact the forest was almost entirely made up of the animals. Despite their own magical presence, the team could not see the herd was populated by humans.

“The whole place almost smelled like animals,” says Bebek. “We were never able to find human footprints at any of the points we stood at. The trees were so large, you wouldn’t have been able to walk 40 meters through them. We assumed that the truth of the matter was, ‘Well the deer didn’t like this forest at all.’”


Same here. I managed to make it "work" in the sense that it wouldn't crash during inference, but then it generated gibberish. Has anyone managed to make it work reliably?


The problem in my case was "train_steps" in the model json file. Default is 0. The notebook sets it to 401000 which works.
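
In case it helps anyone else, the workaround looks roughly like this (the config path is a placeholder for wherever your model's json actually lives):

    import json

    config_path = "configs/my_model_config.json"   # placeholder: use your model's json file
    with open(config_path) as f:
        cfg = json.load(f)

    cfg["train_steps"] = 401000   # defaults to 0, which breaks inference from the released checkpoint
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)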


Whilst obviously BERT is not the same as GPT-3 in architecture, Amazon's recent paper discussing architecture optimizations for BERT seems pretty relevant here (https://arxiv.org/pdf/2010.10499.pdf), given the chance to improve upon GPT-3's architecture (because it surely isn't the best we can get). Have the Eleuther.ai team been exploring this?


Could the title of this post be changed to emphasize that the model sizes released were 1.3B and 2.7B? Something like "EleutherAI releases 1.3B and 2.7B parameter GPT-like language models". The current title implies that a full-sized GPT-3 model is currently available, which is not the case.

edit: the title has been changed, seems good enough


Curious to see what parameter size of GPT-3 this will end up being equivalent to. Obviously we won't know until they evaluate their models.


It's trained using the same architecture, and with a very similar dataset, so it should be very close.


My experience is that replicating papers is actually nontrivial. For example, someone announced they had replicated GPT-2 some time back, but when evals were run it turned out to be the equivalent of a much smaller model.


I think we need more funding outside of large tech companies and OpenAI for these kinds of things. I wonder if there is a way to crowdsource donations to rent the hardware to train big versions of these things in an open manner.


If I wanted to build a support Q&A system using texts from support logs, training docs, transcribed videos, etc. (basically as much text about my product as I can get), would this model be a good start?



That would be one incredibly expensive Q&A system. As rightly suggested by others, some search + a finetuned Q&A system is what you need.

I've been building a library that makes solving tasks like conversational Q&A easy.

https://github.com/backprop-ai/backprop

If you need some pointers, we have example notebooks and I'm happy to chat about it as well. Would love to know if it helps!


Depending on how much content you've got, this blog post from HuggingFace might be interesting: https://yjernite.github.io/lfqa.html


Also, for a quick and simple Q&A system, Haystack https://github.com/deepset-ai/haystack (essentially dense vector similarity on Elasticsearch) looks pretty promising and supports the whole pipeline.


Thanks, that looks decent for a quick start.


Are there any output samples? Would like to get a sense of how close this is to GPT-2 vs. GPT-3 levels of coherency.


I've been having pretty good luck generating all sorts of stuff. There's a large chunk of text generated with the default unicorns prompt here: https://cdn.discordapp.com/attachments/823393531305525288/82...

It seems really good at generating code too

prompt:

class ConvBlock(nn.Module):

    def __init__(self, channels, stride=1):

        super(ConvBlock, self).__init__()
        self.convolution = nn.Conv2d(channels//stride, channels, 3, stride=stride, padding=1)
        self.normalization = nn.BatchNorm2d(channels)
        #self.activation = nn.LeakyReLU()

    def forward(self, x):



output:

        x =
        #  x = self.convolution(x)
        #  x = self.normalization(x)
        #  x = self.activation(x)

        x = self.convolution(x)
        x = self.normalization(x)
        x = self.activation(x)

        return x

class Resblock(nn.Module):

    def __init__(self, stride=1):

        super(Resblock, self).__init__()
        self.res = nn.Sequential(
            nn.Conv2d(9, 9, 3, stride=stride, padding=1),
            nn.BatchNorm2d(9),
            nn.LeakyReLU(0.2)
        )
        self.output = nn.MaxPool2d(2)

    def forward(self, x):

        return self.res(x)



Also, some "tongue twisters":

Big ugly stick.

Bowl full of chicken peas.

New York is universally inadvisable.

Long-eared seagull.

This morning I have a tumor in me.

A dog bark.

Invisible dog hides.

A crow blind.

A cat can’t see if it had a billion eyes.

Bumblebee.

Sheep herder, sheep herder.

A fawns abducts.

Two black birds are trapped.

Bottle on her finger.

Elephant sees another elephant.

Bull.

Mice in a box in the library.

A church swelter.

The door of a hotel opens.

Bosnian honey melons.

Grapes in excess.

Cat is on the loose.

Soil is shoveled into a glass jar


I'd love to know the minimum hardware requirements to run something like this locally.


Is there something like chattingtransformer ( https://pypi.org/project/chattingtransformer/ ) for gpt-neo? Ie. a trivial way to get text completion on a sample with sane defaults from the commandline.

edit: Oh, I see the "generating text" section. Any way to run it on CPU, even if it takes an hour?


Stella mentioned elsewhere in this thread that HuggingFace is adding support for the Eleuther model, so text generation should become trivial once this work is complete.
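
Once that support lands, generation should be roughly a two-liner, something like this sketch (the model id is my guess at what it will be published under, and device=-1 keeps it on CPU):

    from transformers import pipeline

    generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B", device=-1)  # -1 = CPU
    print(generator("EleutherAI released an open-source GPT model and",
                    max_length=60, do_sample=True)[0]["generated_text"])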


Is something with billions of parameters actually a "model"? I guess the answer is yes if the data set is even larger than that?


We’ve added a table with some evaluation scores to the GitHub repo, and you can see a comparison between our scores, GPT-2, and GPT-3 here: https://twitter.com/BlancheMinerva/status/137399189661642752...

tl;dr we are doing pretty much exactly as well as we expected on LAMBADA and WikiText. Results on more sophisticated tasks will take some time, but HuggingFace is currently working on implementing our model in the transformers library and when they do so we can easily run a lot of analyses very quickly.

We actually built an evaluation suite that integrates with HF, but interfacing with the MTF code that GPT-Neo was written in was too much of a pain in the ass because Mesh TensorFlow is the worst. https://github.com/EleutherAI/lm-evaluation-harness


Per Twitter, there will be more info about model performance tomorrow: https://twitter.com/arankomatsuzaki/status/13737326454445793...


Does anyone know if there's a hosted version of this kind of GPT model somewhere? All I want to do is just call a GPT-2 API and get a response back, I'm not interested in setting up the entire infrastructure by myself.


We have exposed the large GPT-2 model as an API.

https://backprop.co

Anyone can access it and our pricing is usage based. If you use less than 1000 seconds of inference a month then it's completely free.


Great, will try this!


Hugging Face has that service

https://huggingface.co/pricing


Are there some truly objective benchmarks to compare this to GPT 2/3?


I think this is an important problem. With logistic regression or deep learning, at least one can compare (out of sample) calibration curves or discrimination measures. With a language model, what can we do?


Perplexity score against a corpus such as Wikipedia? Basically, how well the model predicts the next word.
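
Something like this sketch, for example, with GPT-2 standing in for whichever model you want to score (the sentence is obviously a placeholder; in practice you'd average over a held-out corpus):

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean negative log-likelihood per token

    print("perplexity:", math.exp(out.loss.item()))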


This is a good start, but given the breadth of applications this would hardly give us enough to compare, as the goal of these models isn't to simply recite Wikipedia articles. What about language translation? Content summarization? Code generation? Turing test performance?


Both models were trained on Wikipedia, so that's a particularly bad choice. But yes, in practice this is what people tend to do. Take results with a very large grain of salt though, as the domain of the prompts you feed it make a huge difference.


Yes, see the GLUE or SuperGLUE benchmarks. That assumes the answers have not been scraped and included in the training dataset, though.


So I would want to include a big corpus like GPT-3 or this newfangled "Neo" thing but still have it trained to respond to our own customers based on 200k email passages.

How to create a hybrid?


200k emails is not enough to train a model from scratch. If you check out the Google Colab file in the GPT-Neo repository, it talks about how to fine-tune the model on your own data, which is what you want to do.


You might get some really promising results with finetuning.

If anything, you could build writing assistance that almost automates responses.

I've been co-authoring a library that lets you finetune such models in a single line of code.

https://github.com/backprop-ai/backprop

Specifically, the text generation finetuning example should be what you are looking for: https://github.com/backprop-ai/backprop/blob/main/examples/F...

Hope this helps, happy to chat more about it. Pretty curious about the results.


I wouldn't trust any model to generate text for customers yet. Not even the largest GPT3. There are no guarantees on what they will output and could be damaging to your business.

You're better off either: 1- Defining common "intents" that a lot of customer queries are categorized into, and having a model map the incoming message to the appropriate canned response. Look at Rasa, for an example of this.

2- if you insist on generating the text, have it be a recommendation to a human agent that either chooses to send it or writes their own response.


Thanks for the advice.


You fine-tune an existing pretrained model on your proprietary dataset.
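
A rough sketch of what that looks like with Hugging Face Transformers, using GPT-2 as a stand-in until GPT-Neo support lands (the file path and hyperparameters are placeholders):

    from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                              GPT2Tokenizer, TextDataset, Trainer, TrainingArguments)

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # e.g. your 200k email passages concatenated into one plain-text file
    dataset = TextDataset(tokenizer=tokenizer, file_path="emails.txt", block_size=512)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

    args = TrainingArguments(output_dir="gpt2-emails", num_train_epochs=1,
                             per_device_train_batch_size=2, save_steps=5000)
    Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()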


Does it leverage DeepSpeed / ZeRO-3?


GPT-NeoX, which is a model from the same group but using GPUs instead of TPUs, uses techniques from DeepSpeed:

https://github.com/EleutherAI/gpt-neox/


That’s PyTorch only; the current models are TensorFlow.


Oh, that's unfortunate. Can't the models be exported to PyTorch through e.g. ONNX?


There's a PyTorch + DeepSpeed repository here: https://github.com/EleutherAI/gpt-neox


Even small models can be a headache to export if they use anything custom. I can't even imagine something the size of GPT-3.


Larger models aren't really more complicated than smaller ones though. GPT-2 is already supported, and I believe the only difference with GPT-3 is sparse attention.


The GPT-2 models included with Transformers can export to ONNX fine w/ a helper included with the Python onnxruntime.
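
For reference, a bare-bones export (my own generic torch.onnx.export sketch, not the onnxruntime helper mentioned above, which handles more for you):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # disable the KV cache and dict outputs so the traced graph is just input_ids -> logits
    model = GPT2LMHeadModel.from_pretrained("gpt2", use_cache=False, return_dict=False).eval()

    dummy = tokenizer("hello world", return_tensors="pt")["input_ids"]
    torch.onnx.export(model, dummy, "gpt2.onnx",
                      input_names=["input_ids"], output_names=["logits"],
                      dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                                    "logits": {0: "batch", 1: "seq"}},
                      opset_version=12)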


To run in PyTorch the model architecture must be ported.

ONNX is slightly different; you could export the model to run in the onnxruntime but that has tradeoffs.


Can somebody explain to this beginner how to use this? Where can I load this code and start running it? How can I train it on a dataset and what do I need to prepare?

There's a lot of language here I don't understand. For example, what is he referring to when he says 1.5B or 1T weights?

What resources/videos can I watch in order to start tinkering with this?


The repository readme actually includes a link to a notebook[1] that helps getting started on Google colab. It's as good a place to start as any:

[1]: https://colab.research.google.com/github/EleutherAI/GPTNeo/b...


Thanks, I've never used this before. Do I have to add a credit card? How much will it cost to run this?


Colab is free to use -- you can click Runtime → Run All to run the cells in the notebook free-of-charge. (You may need to be logged in to a Google Account to run it.)


Very cool! Side question, but is there a complete guide to learning PyTorch with Colab?

I tried to learn ML a few years ago but gave up because I couldn't install CUDA on my machine for some reason. The landscape seems to change dramatically.

I am interested in transformers, in particular completing incomplete images like what https://openai.com/blog/image-gpt/ does; is there a project that implements that and lets me start training?

I'm excited but I just get overwhelmed as to where I need to focus my attention.

My goal is to utilize something like image-gpt but for a narrower domain (e.g. only dealing with cats). How can I build my knowledge and skills towards that goal?

Many thanks for your answers; I'm really looking forward to learning this stuff.


Your questions are easily googleable, but if you insist, start at pytorch.org


I'm sure they are


Following



