Apparently, stealing other people’s work to create a product for money is now “fair use,” according to OpenAI, because they are “innovating” (stealing). Yeah. Move fast and break things, huh?

“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.

OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”

  • AutoTL;DR@lemmings.worldB · 6 months ago

    🤖 I’m a bot that provides automatic summaries for articles:

    Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”

    OpenAI responded to the lawsuit on its website on Monday, claiming that the suit lacks merit and affirming its support for journalism and partnerships with news organizations.

    OpenAI’s defense largely rests on the legal principle of fair use, which permits limited use of copyrighted content without the owner’s permission under specific circumstances.

    “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents,” OpenAI wrote in its Monday blog post.

    In August, we reported on a similar situation in which OpenAI defended its use of publicly available materials as fair use in response to a copyright lawsuit involving comedian Sarah Silverman.

    OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”


    Saved 58% of original text.

  • flatbield@beehaw.org · 6 months ago

    Of course it is. About 50 years ago we moved to a regime where everything is copyrighted, rather than just works that were marked and registered. Not sure where I stand on that. One could argue we are in a crazy over-copyright era now anyway.

  • SloppySol@lemm.ee · 6 months ago

    I would just like to say, with open curiosity, that I think a nice solution would be for OpenAI to become a nonprofit with clear guidelines to follow.

    What does that make me? Other than an idiot.

    Of that at least, I’m self aware.

    I feel like we’re disregarding the significance of artificial intelligence’s existence in our future, because the only thing anybody who cares is trying to do is get back control to DO something about it. But the news is becoming a feeding tube for the masses. They’ve masked that with the hate of all of us.

    Anyways, sorry, diatribe, happy new year

    • MagicShel@programming.dev · 6 months ago

      I think OpenAI (or some part of it) is a non-profit. But corporate fuckery means it can largely be funded by for-profit companies which then turn around and profit from that relationship. Corporate law is so weak and laxly enforced that it’s a bit of a joke, unfortunately.

      I agree that AI has an important role to play in the future, but it’s a lot more limited in the current form than a lot of people want to believe. I’m writing a tool that leverages AI as a sort of auto-DM for roleplaying, but AI hasn’t written a line of code in it because the output is garbage. And frankly I find the fun and value of the tool comes from the other humans you play with, not the AI itself. The output just isn’t that good.

      • BoastfulDaedra@lemmynsfw.com · 6 months ago

        It’s not nearly as much fun as it sounds anymore. It’s all VPNs, Usenet, torrents, and signal hacking.

        The only traditional conglomerations left are in southeast Asia and maybe the coast of Africa, and I gotta say, they do not look like they’re having any fun.

  • explodicle@local106.com · 6 months ago

    Having read through these comments, I wonder if we’ve reached the logical conclusion of copyright itself.

      • explodicle@local106.com · 6 months ago

        Apparently they’re going to just make only the little guy’s copyrights effectively meaningless, so yeah.

    • frog 🐸@beehaw.org · 6 months ago

      Perhaps a fair compromise would be doing away with copyright in its entirety, from the tiny artists trying to protect their artwork all the way up to Disney, no exceptions. Basically, either every creator has to be protected, or none of them should be.

      • zaphod@lemmy.ca · 6 months ago

        IMO the right compromise is to return copyright to its original 14-year term. OpenAI could freely train on anything up to 2009, which is still a gigantic amount of material, while artists continue to be protected and incentivized.

        • frog 🐸@beehaw.org · 6 months ago

          I’m increasingly convinced of that myself, yeah (although I’d favour 15 or 20 years personally, just because they’re neater numbers than 14). The original purpose of copyright was to promote innovation by ensuring a creator gets a good length of time in which to benefit from their creation, which a 14-20 year term achieves. Both extremes - a complete lack of copyright and the exceedingly long terms we have now - suppress innovation.

    • sanzky@beehaw.org · 6 months ago

      Copyright has become a tool of oppression. Individual authors’ copyrights are constantly being violated, with few resources for them to fight back, while big tech abuses others’ work and big media wields its own to the point of censorship.

  • vexikron@lemmy.zip · 6 months ago

    Or, or, or, hear me out:

    Maybe their particular approach to making an AI is flawed.

    It’s like people do not know that there are many different kinds of approaches to AI.

    Many of them do not rely on basically a training set that is the cumulative sum of all human generated content of every imaginable kind.

      • vexikron@lemmy.zip · 6 months ago

        Well, off the top of my head:

        Whole Brain Emulation, attempting to model a human brain as physically accurately as possible inside a computer.

        Genetic iteration (not the correct term for it, but it escapes me at the moment), where you set up a simulated environment for digital actors, simulate quasi-neurons and quasi-body parts dictated by quasi-DNA in a way that mimics actual biological natural selection and evolution, and then run the simulation millions of times until your digital creature develops a stable survival strategy.

        Similar approaches to this have been used to do things like teach an AI humanoid how to develop its own winning martial arts style via many many iterations, starting from not even being able to stand up, much less do anything to an opponent.

        Both of these approaches obviously have drawbacks and strengths, and could possibly be successful at far more than what they have achieved to date, or maybe not, due to known or existing problems, but neither of them rely on a training set of essentially the entirety of all content on the internet.
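
The selection-and-mutation loop described in that comment can be sketched in a few lines of Python. This is a toy illustration under assumed parameters (a 20-bit “genome”, a 5% mutation rate, elitist selection); it is not any production neuroevolution system:

```python
import random

TARGET = [1] * 20  # stand-in "survival strategy": maximize matching genes


def fitness(genome):
    # Count how many genes match the target behaviour
    return sum(g == t for g, t in zip(genome, TARGET))


def mutate(genome, rate=0.05):
    # Flip each gene with a small probability
    return [1 - g if random.random() < rate else g for g in genome]


def evolve(pop_size=50, generations=200):
    # Start from random genomes (the "can't even stand up" stage)
    population = [[random.randint(0, 1) for _ in range(len(TARGET))]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, refill with mutated copies
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return max(population, key=fitness)


best = evolve()
print(fitness(best))  # typically converges to a near-perfect score
```

The point of the sketch is the one the comment makes: the only “training data” is the fitness signal from the simulated environment, not a corpus of human-generated content.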

  • bedrooms@kbin.social · 6 months ago

    Alas, AI critics jumped to conclusions this one time. Read this:

    Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”

    It’s a plain fact. It does not say they must be allowed to train AI without paying.

    To give you some context: virtually everything on the web is copyrighted, from Reddit comments to blog articles to open source software. Even open data usually comes with a copyright notice, as do open-access research articles.

    If misled politicians write a law banning the use of copyrighted materials, that’ll kill all AI development in democratic countries. What will happen is that AI development will be led by dictatorships, and that’s absolutely a disaster even for the critics. Think about it. Do we really want Xi, Putin, Netanyahu and Bin Salman to control all the next-gen AIs powering their cyber warfare while the West has to fight them with Siri and Alexa?

    So, I agree that, at the end of the day, we’d have to ask how much rule-abiding AI companies should pay for copyrighted materials, and that’d be less than the copyright holders would want. (And I think it’s sad.)

    However, you can’t equate these particular statements in this article to a declaration of fuck-copyright. Tbh Ars Technica disappointed me this time.

    • krellor@beehaw.org · 6 months ago

      The issue is that fair use is more nuanced than people think, but that the barrier to claiming fair use is higher when you are engaged in commercial activities. I’d more readily accept the fair use arguments from research institutions, companies that train and release their model weights (llama), or some other activity with a clear tie to the public benefit.

      OpenAI isn’t doing this work for the public benefit, regardless of the language of altruism they wrap it in. They, and Microsoft, are hoovering up others’ data to build a for-profit product and make money. That’s really what it boils down to for me. And I’m fine with them making money. But pay the people whose data you’re using.

      Now, in the US there is no case law on this yet, and it will take years to settle. But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different from Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a product.

      • bedrooms@kbin.social · 6 months ago

        Well, regarding text online, most of it is there for visitors to read for free. So if we end up treating AI training like a human reading text, one could argue they don’t have to pay.

        Reddit doesn’t pay their users, anyway.

        But personally, philosophically, I don’t see how Microsoft taking NYT articles and turning them into a paid product is any different from Microsoft taking an open source project that doesn’t allow commercial use and sneaking it into a product.

        Agreed. That said, NYT actually intentionally allows Google and Bing crawlers to parse their news articles in order to rank their articles at the top of search results. In that regard, they might welcome certain forms of processing by LLMs.

        • krellor@beehaw.org · 6 months ago

          I thought about the indexing situation in contrast to the user paywall. Without thinking too much about any legal argument, it would seem that NYT having a paywall for visitors is them enforcing their right to the content, signaling that it isn’t free for all use, while allowing search indexers access makes the content visible but not free on the market.

          It reminds me of the Canadian claim that Google should pay Canadian publishers for the right to index, which I tend to disagree with. I don’t think Google or Bing should owe NYT money for indexing, but I don’t think allowing indexing confers the right to commercial use beyond indexing. I highly suspect OpenAI spoofed search indexers while crawling content specifically to bypass paywalls and the like.
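
For what it’s worth, the spoofing suspicion is technically plausible because many paywalls gate access on the client-supplied User-Agent header. A toy sketch (hypothetical names, not any publisher’s actual implementation) shows why that check alone proves nothing:

```python
# Naive paywall allowlist: grant full-text access if the request
# merely *claims* to be a known search-engine crawler.
KNOWN_CRAWLERS = ("Googlebot", "bingbot")


def is_allowed(request_headers: dict) -> bool:
    # User-Agent is set by the client, so this check is trivially forged
    ua = request_headers.get("User-Agent", "")
    return any(bot in ua for bot in KNOWN_CRAWLERS)


# A regular browser request hits the paywall...
assert not is_allowed({"User-Agent": "Mozilla/5.0"})
# ...but a scraper that spoofs the header sails through.
assert is_allowed({"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"})
```

Robust crawler verification instead checks the requesting IP (e.g. via the reverse-DNS procedure Google documents for verifying Googlebot), precisely because the header alone is worthless as identification.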

          I think part of what the courts will have to weigh for the fair use arguments is the extent to which NYT is harmed by the use, the extent to which the content is transformed, and the public interest between the two.

          I find it interesting that OpenAI or Microsoft already pay AP for use of their content, because it is used to ensure accurate answers are given to users. I struggle to see how the situation is different with NYT, in OpenAI’s opinion, other than perhaps on price.

          It will be interesting to see what shakes out in the courts. I’m also interested in the proposed EU rules which recognize fair use for research and education, but less so for commercial use.

          Thanks for the reply! Have a great day!

    • P03 Locke@lemmy.dbzer0.com · 6 months ago

      It’s bizarre. People suddenly start voicing pro-copyright arguments just to kill a useful technology, when we should be trying to burn copyright to the fucking ground. Copyright is a tool for the rich, and it will remain so until it is dismantled.

      • AVincentInSpace@pawb.social · 6 months ago

        Life plus 70 years is bullshit.

        20 years from release date is not.

        No one except corporate bigwigs will say copyright should last in perpetuity, but artists still need legal protections to make money off of what they create, and Midjourney (making boatloads of money off of automated collages of artwork obtained not only without compensation but without attribution) is a prime example of why.

  • randomaside@lemmy.dbzer0.com · 6 months ago

    OpenAI now needs to go to court and argue fair use forever. That’s the burden of our system. Private ownership is valued higher than anything else so … Good luck we’re all counting on you (unfortunately).

  • ky56@aussie.zone · 6 months ago

    All the AI race has done is surface the long-standing issue of how broken copyright is in the internet era. Artists should be compensated, but trying to do that with the traditional model, which was originally designed with physical, non-infinitely-copyable goods in mind, is just asinine.

    One such model could be to have the copyright owner automatically assigned on first upload to any platform that supports the API, an API provided and enforced by the US Copyright Office. A percentage of end-use revenue could then be paid back as royalties. I haven’t really thought this model out much further than that.
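
That registry idea can be sketched as a first-write-wins mapping keyed on a content hash. Everything here (the names, the 5% rate, SHA-256 as the identity function) is an illustrative assumption; real near-duplicate detection and revenue accounting would be far harder:

```python
import hashlib

# Hypothetical royalty rate: 5% of downstream revenue goes to the owner
ROYALTY_RATE = 0.05

registry: dict[str, str] = {}  # content hash -> recorded owner


def register(content: bytes, uploader: str) -> str:
    """First upload records the uploader as owner; later uploads of
    identical content do not change ownership (first-write-wins)."""
    digest = hashlib.sha256(content).hexdigest()
    registry.setdefault(digest, uploader)
    return registry[digest]


def royalty_due(content: bytes, end_use_revenue: float) -> tuple[str, float]:
    # Pay a fixed percentage of end-use revenue back to the recorded owner
    digest = hashlib.sha256(content).hexdigest()
    owner = registry[digest]
    return owner, end_use_revenue * ROYALTY_RATE


owner = register(b"my artwork", "alice")
register(b"my artwork", "bob")             # too late: alice uploaded first
print(royalty_due(b"my artwork", 1000.0))  # ('alice', 50.0)
```

The obvious weakness, as the comment concedes, is everything outside the happy path: someone uploading another person’s work first, or trivially altered copies hashing to new identities.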

    Machine learning is here to stay and is a useful tool that can be used for good and evil alike.

    • Kichae@lemmy.ca · 6 months ago

      Nah. Copyright is broken, but it’s broken because it lasts too long, and it can be held by constructs. People should still reserve the right to not have the things they’ve made incorporated into projects or products they don’t want to be associated with.

      The right to refusal is important. Consent is important. The default permission should not be shifted to “yes” in anybody’s mind.

      The fact that a not insignificant number of people seem to think the only issue here is money points to some pretty fucking entitled views among the would-be-billionaires.

      • ky56@aussie.zone · 6 months ago

        My major issue with copyright is how published works can have major cultural significance, how they can shift ideas and shape minds, yet you’re not allowed to have some fun with them on a personal level. How can it be the norm that the most important scientific knowledge and other culturally significant material is locked behind such restrictive measures, essentially ensuring that middle class and especially poor people are locked out?

        If you publish something, even if it’s paid, you don’t deserve such restrictive rights. You deserve to be compensated for your work, but you don’t deserve to turn it into an extortion racket.

        My view on your second point is that if you have posted something publicly with no paywall, maybe you should still get some percentage of revenue, but you don’t get a say in what it can be used for. Placing restrictions on the use of publicly posted work is academic anyway, as it’s basically unenforceable.

        We live in a society that revolves around the discovery and sharing of ideas. We are all entitled to a certain amount of that shared information. That’s the whole point. For some businessman who was in the right place at the right time to create an extortion racket out of something culturally significant they almost certainly didn’t create is wrong.

        Sorry if this is all over the place. I’m writing this while tired.

  • casmael@startrek.website · 6 months ago

    Well, in that case maybe ChatGPT should just fuck off. It doesn’t seem to be doing anything particularly useful, and now its creator has admitted it doesn’t work without stealing things to feed it. Un-fucking-believable. Hacks gonna hack, I guess.

    • intensely_human@lemm.ee · 6 months ago

      ChatGPT has been enormously useful to me over the last six months. No idea where you’re getting this notion it isn’t useful.

      • Bilb!@lem.monster · 6 months ago

        People pretending it’s not useful and/or not improving all the time are living in their own worlds. I think you can argue the legality and the ethics, but any anti-AI position based on low-quality output (“it can’t even do hands!”) has a short shelf life.

      • java@beehaw.org · 6 months ago

        Not all creators are orcs, of course. But people who don’t understand deliberately exaggerated comparisons might be. I believe that you understood my point. Don’t start arguing over nothing.

  • sculd@beehaw.orgOP · 6 months ago

    Some relevant comments from Ars:

    leighno5

    The absolute hubris required for OpenAI here to come right out and say, ‘Yeah, we have no choice but to build our product off the exploitation of the work others have already performed’ is stunning. It’s about as perfect a representation of the tech bro mindset as there can ever be. They didn’t even try to approach content creators in order to do this, they just took what they needed because they wanted to. I really don’t think it’s hyperbolic to compare this to modern day colonization, or worker exploitation. ‘You’ve been working pretty hard for a very long time to create and host content, pay for the development of that content, and build your business off of that, but we need it to make money for this thing we’re building, so we’re just going to fucking take it and do what we need to do.’

    The entitlement is just…it’s incredible.

    4qu4rius

    20 years ago, high school kids were sued for millions and threatened with years in jail for downloading a single Metallica album (if I remember correctly, minimum damages in the US were something like $500k per song).

    All of a sudden, just because they are the dominant ones doing the infringement, they should be allowed to scrape the entirety of (digital) human knowledge? Funny (or not) how the law always benefits the rich.