• ulterno@programming.dev · 8 days ago

    If the site is slowing down at times (regardless of whether that happens while you scrape), you might want to not scrape it at all.

    Downloading the whole site is probably not a good idea, but that depends on the site.

    • If it is a static site, setting up your scraper to skip CSS/JS and images/videos should make a difference.
    • For a dynamically created site, there’s nothing general I can say.
    • Then again, reducing your downloads to only what you actually use, as far as possible, might be good enough.
    • Since sites are made for human consumption in the first place, consider keeping your link-traversal rate similar to a human’s (see the sketch after this list).
    • The best would be if you could ask the website dev whether they have an API available.
      • Even better, ask them to provide an RSS feed.
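
    To illustrate the first and fourth points above, here is a minimal Python sketch (the URLs, pacing, and contact address are all placeholders, not advice for any particular site). Since it only requests the pages you list, no CSS/JS or media is fetched:

    ```python
    import time
    import requests

    # A sketch only: URLs, pacing, and contact address are placeholders.
    PAGES = [
        "https://example.com/series/1",
        "https://example.com/series/2",
    ]

    session = requests.Session()
    # Identify yourself so the site admin can reach you if needed.
    session.headers["User-Agent"] = "my-scraper/0.1 (contact: me@example.com)"

    for i, url in enumerate(PAGES):
        resp = session.get(url)  # fetches the listed HTML page only
        resp.raise_for_status()
        with open(f"page_{i}.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
        time.sleep(30)  # human-ish pace between page loads
    ```
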
    • Programmer Belch@lemmy.dbzer0.com · 8 days ago

      As far as I know, the website doesn’t have an API. I just download the HTML and format the result with a simple Python script; it makes around 10 to 20 requests each time, one for each series I’m following.
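
      Not the actual script, but a minimal sketch of what such a workflow might look like, assuming requests and BeautifulSoup; the site structure, URLs, and selector are made up:

      ```python
      import time
      import requests
      from bs4 import BeautifulSoup  # pip install beautifulsoup4

      # Purely illustrative: the URLs and selector below are made up.
      SERIES_URLS = [
          "https://example.com/series/alpha",
          "https://example.com/series/beta",
      ]

      session = requests.Session()
      session.headers["User-Agent"] = "series-checker/0.1"

      for url in SERIES_URLS:
          html = session.get(url).text
          soup = BeautifulSoup(html, "html.parser")
          latest = soup.select_one(".chapter-list a")  # hypothetical selector
          if latest:
              print(f"{url}: {latest.get_text(strip=True)}")
          time.sleep(5)  # small gap between the 10-20 requests
      ```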

      • limerod@reddthat.com · 8 days ago

        You can use the timestamping/conditional-download options in curl/wget (e.g. wget -N, curl -z) so they don’t download the same CSS or HTML twice. You can also skip JavaScript and image files to save on unnecessary requests.

        I would reduce the frequency to once every two days to further reduce the impact.
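
        In the same spirit, here is a Python sketch of an HTTP conditional request; wget’s -N and curl’s -z do something similar on the command line. The URL and cache-file name are placeholders. If the page hasn’t changed, the server can answer 304 Not Modified instead of resending the HTML:

        ```python
        import json
        import os
        import requests

        CACHE_FILE = "etag_cache.json"            # hypothetical local cache
        URL = "https://example.com/series/alpha"  # placeholder

        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                cache = json.load(f)

        headers = {}
        if URL in cache:
            # Ask the server to resend the page only if it has changed.
            headers["If-None-Match"] = cache[URL]

        resp = requests.get(URL, headers=headers)
        if resp.status_code == 304:
            print("Not modified; reuse the local copy.")
        else:
            resp.raise_for_status()
            if "ETag" in resp.headers:
                cache[URL] = resp.headers["ETag"]
                with open(CACHE_FILE, "w") as f:
                    json.dump(cache, f)
            # parse/save resp.text here
        ```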

      • ulterno@programming.dev · 8 days ago

        That might or might not be much.
        Depends on the site, I’d say.

        E.g. if it’s something like Netflix, I wouldn’t think much of it, because they have the means to serve the requests.
        But for some PeerTube instances, even a single request can seem too heavy for them. So if such a server does not respond to my request, I usually wait an hour or so before refreshing the page.