The only reason for them to implement opt-out would be a legal order, subsequently upheld by the supreme court or the country's equivalent. There is no other reason to do it.
Definitely a case for the Futurama "I'm shocked" meme. OpenAI has a lot of talented people working there, but it's clear @sama only cares about chasing the biggest possible payday and nothing else means anything.
I'd love to hear HN's suggestions on keeping IP out of these LLMs where possible. For context I'm a writer. Nothing I write gets published online and I don't need to submit to, say, Amazon/Kindle to make money. However, much of my work is passed around by executives via email, for example. Ideas?
OpenAI considers everything "publicly available" to be fair game, so if GPTBot happens to stumble across pirated ePubs of your books then OpenAI will probably train on them even if you never published them on the open web yourself. They don't care about the provenance of the stuff they're scraping.
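For what it's worth, OpenAI documents that its crawler identifies itself as GPTBot and honors robots.txt, so site owners can at least ask to be excluded from their own domains. A minimal sketch (note this is voluntary compliance, and it does nothing about pirated copies hosted elsewhere):

```
# robots.txt at the site root: ask OpenAI's training crawler to skip everything.
# "GPTBot" is the user-agent string OpenAI publishes for this crawler.
User-agent: GPTBot
Disallow: /
```

Of course, this only covers your own site; it doesn't stop anyone who rehosts your content from being scraped there instead.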
Just because something is publicly accessible on the web does not mean it is in the public domain and free of copyright or other restrictions on use. Big Media - and many smaller players - have been fighting this battle for decades and generally winning, because the law is relatively clear in this area in most places.
There is absolutely nothing in the law in my country - or probably most countries other than possibly the US - that says you can grab whatever you like if you can find it online and do whatever you want with it. In the US the potential loophole is fair use, and that has been controversial for a long time since it's arguably in violation of the global copyright treaties to which the US is also a signatory. Something as big as AI might be enough to get other countries to push back significantly where they usually turn a blind eye.
So if OpenAI is doing that, I don't see how they are not in breach of copyright in much of the world. I would experience considerable Schadenfreude if that resulted in epic-scale lawsuits, because I don't think using "training AI models" as a means of laundering copyright infringement is a positive step. Like the search engines that started including significant parts of the original content directly on their results pages, it's a distortion: the people who actually do the creative work are not the people being rewarded for it.
That's totally fine as long as you train a diffusion model on the OpenAI logo and then prompt the model to generate an image that just coincidentally happens to look exactly like the OpenAI logo. If an AI model made it then it's automatically not plagiarism.
And if the entity doing it is a big enough company.
It's interesting to imagine the legal landscape if/when this technique is applied to MPAA/RIAA content, and everybody is sharing the foundational model plus the "prompts" that will have it make "your" movie.
Yeah, I kinda want to see two big, kinda-evil forces fight, in the hopes we get some fair and consistent rules instead of "it's vague enough that you can do/prevent anything as long as you have enough lobbyists and lawyers."
I'd say lobby for laws. Purely technical measures won't help: the next versions of their email clients will have a ToS that lets them dump everything to Goog/Msft for training.
Probably because if this were created, most IP owners would opt out en masse. To prevent that, you have to make the process difficult, hidden, or framed in an extremely careful way.