Except the first thing openai does is read robots.txt. However, robots.txt doesn...

flutas · on April 11, 2024

> Except the fiist thing openai does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.

niutech · on April 12, 2024

This robots.txt has Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

darkwater · on April 11, 2024

And they do have (the same) robots.txt on every domain, tailored for GPTbot, i.e. https://petra-cody-carlene.web.sp.am/robots.txt

So, GPTBot is not following robots.txt, apparently.

AgentME · on April 11, 2024

All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.

Pannoniae · on April 11, 2024

It wasn't commented out a few hours ago when I checked it. I think that's a recent change.

cwillu · on April 11, 2024

Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt

andybak · on April 11, 2024

This is a moderately persuasive argument.

Although the crawler should probably ignore all the html body. But it does feel like a grey area if I accept your first pint.

kubanczyk · on April 12, 2024

You've been able to convince me to accept his second pint. Friday it is.

fsckboy · on April 11, 2024

humans don't read/respect robots.txt, so in order to pass the Turing test, ai's need to mimic human behavior.

gunapologist99 · on April 11, 2024

This must be why self-driving cars always ignore the speed limit. ;)

microtherion · on April 11, 2024

More directly, e.g. Tesla boasts of training their FSD on data captured from their customer's unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.

roughly · on April 11, 2024

Jesus, that’s one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone who’s work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurism after you explained your idea.

yreg · on April 11, 2024

AI DRIVR claims that beta V12 is much better precisely because it takes rules less literally and drives more naturally.

queuebert · on April 11, 2024

Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?

everforward · on April 11, 2024

No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.

Retric · on April 11, 2024

That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.

everforward · on April 12, 2024

You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.

What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.

That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).

Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.

Retric · on April 12, 2024

It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...

The point here is to poison the well for freeloaders like OpenAI not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, don’t hand it over for free.

People don’t mindlessly click on things like terms of service crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.

everforward · on April 12, 2024

It’s trivial to treat it differently, but doing so runs the risk of being accused of cloaking and getting banned from Google’s index: https://developers.google.com/search/docs/essentials/spam-po...

> The point here is to poison the well for freeloaders like OpenAI not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, don’t hand it over for free.

Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.

> People don’t mindlessly click on things like terms of service crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.

The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.

Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.

At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.

Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.

Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.

You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.

Retric · on April 12, 2024

Google doesn’t care what you do to other crawlers that ignore your TOS. This isn’t a theoretical situation it’s already going on. Crawling is easy enough to “block” there’s court cases on this stuff because this is very much the case where the defense wins once they devote fairly trivial resources to the effort.

And again blocking should never be the goal poisoning the well is. Training AI on poisoned data is both harder to detect and vastly more harmful. A price compared tool is only as good as the actual prices it can compare etc.

a_c · on April 11, 2024

What about making it slow? One byte at a time for example while keeping the connection open

bityard · on April 11, 2024

That would make it a tarpit, a very old technique to combat scrapers/scanners

happymellon · on April 11, 2024

A slow stream that never ends?

SteveNuts · on April 11, 2024

This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.

I'm sure the big players like Google would deal with it gracefully.

gtirloni · on April 11, 2024

Here you go (1 req/min, 10 bytes/sec), please report results :)

  http {
    limit_req_zone $binary_remote_addr zone=ten_bytes_per_second:10m rate=1r/m;
    server {
      location / {
        if ($http_user_agent = "mimo") {
          limit_req zone=ten_bytes_per_second burst=5;
          limit_rate 10;
        }
      }
    }
  }

beau_g · on April 11, 2024

Scrapers of the future won't be ifElse logic, they will be LLM agents themselves. The slow loris robots.txt has to provide an interface to it's own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible. "OK I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."

iosguyryan · on April 11, 2024

Can I interest you in a scrape-share with Claude?

reasonabl_human · on April 12, 2024

Solid use case for Saul Goodman LLM alignment

throw_a_grenade · on April 11, 2024

You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because its enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.

starttoaster · on April 11, 2024

Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.

Calzifer · on April 11, 2024

Those are usually connection and no-data timeouts. A total time limit is in my experience less common.

Phelinofist · on April 11, 2024

Sounds like endlessh

Takennickname · on April 12, 2024

> Except the first thing openai does is read robots.txt.

What good is reading it if it doesn't respect it