Do they really hit that much? I might not have a popular opinion there, but if they don’t have a performance impact then I probably wouldn’t care
This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.
If it’s disrupting their site, it is a crime already. The problem is finding the people behind it. This won’t be some guy on his dorm PC, and they’ll likely be in places Interpol can’t reach.
they’ll likely be in places Interpol can’t reach
Like some Microsoft data center
Huawei
Good luck with that! Not only is a company doing it, which means no individual person will go to prison, but it’s a Chinese company with no regard for any laws that might get passed.
The people determining US legislation have said, “How can we achieve Skynet if our tech trillionaire company sponsors can’t evade copyright or content licensing?” But they also say, “If we don’t spend every penny you have on achieving US-controlled Skynet, then China wins.”
Speculating that “the Huawei network can solve this” doesn’t mean all the bots are Chinese, but it does confirm that China has a lot of AI research, that Huawei GPUs/NPUs are getting used, and that they’re successfully solving this particular “I am not a robot” challenge.
It’s really hard to call an “amateur coding challenge” competition website a national security threat, but if you hype Huawei enough, then surely the US will give up on AI like it gave up on solar, and maybe EVs. “If we don’t adopt Luddite politics and all become Amish, then China wins” is a “promising” new loser take on media manipulation.
Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they’re also people, aren’t they?
Put the entire datacenter buildings into prison
I think they call that a “job” already
I really feel like scrapers should have been outlawed or actioned at some point.
But they bring profits to tech billionaires. No action will be taken.
No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that’s dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There’s nothing we can do legally about Chinese scrapers.
Can you not just block China?
We do, somewhat. We haven’t gone as far as a blanket ban of Chinese CIDR ranges because there’s a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small company like Codeberg, since they have higher risk tolerance and can move faster.
I thought Anthropic was also very abusive with their scraping?
Maybe to others, but not to us. Or if they are, they’re very good at masking their traffic.
I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?
The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the origin server cannot handle, resulting in slow traffic for everyone.
Does your tool respect the site’s robots.txt?
Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
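If a client actually honored 429, the polite behavior is cheap to implement; a minimal sketch of the idea (assuming the requests library, with a made-up user agent and fallback delay):

```python
# A minimal sketch of a client honoring 429: back off for as long as the
# server asks. Assumes the requests library; the user agent string and
# fallback delay are made up for illustration.
import time
import requests

def polite_get(url, max_retries=3):
    resp = None
    for _ in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": "my-rss-builder/0.1"})
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After", "60")
        # Retry-After may be seconds or an HTTP date; fall back to a minute.
        time.sleep(int(retry_after) if retry_after.isdigit() else 60)
    return resp
```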
Crawl-delay is just that: a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers.
It’s a nonstandard extension without consistent semantics or wide support, but I suppose it’s good to know about anyway. Thanks for mentioning it.
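That said, Python’s standard robots.txt parser can read it, so a scraper that wants to be polite could honor the directive when it’s there; a minimal sketch (the URL, agent name and fallback delay are just examples):

```python
# A minimal sketch of honoring Crawl-delay: Python's stdlib robots.txt parser
# exposes the directive when a site declares it. URL, agent name and the
# fallback delay are examples only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

delay = rp.crawl_delay("my-rss-builder")   # None if the site doesn't set it
wait_seconds = delay if delay is not None else 5
```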
I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn’t be contributing a significant amount of traffic if they’re really only pulling data once a day.
Yes, it just downloads the HTML of one page and formats the data into the RSS format with only the information I need.
If the site is getting slowed at times (regardless of whether it is when you scrape), you might want to not scrape at all.
Probably not a good idea to download the whole site, but then that depends upon the site.
- If it is a static site, just setting up your scraper to not download CSS/JS and images/videos should make a difference.
- For a dynamically created site, there’s nothing I can say
- Then again, reducing what you download to what you actually use, as much as possible, might be good enough
- Since sites are originally made for human consumption, you might consider keeping your link traversal rate similar to a human’s
- The best would be if you could ask the website dev whether they have an API available.
- Even better, ask them to provide an RSS feed.
As far as I know, the website doesn’t have an API, but I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each run, one for each series I’m following.
You can use the conditional download features in curl/wget (e.g. wget -N or curl -z) so it does not download the same CSS or HTML twice. Also, ignore JavaScript and image files to save on unnecessary requests.
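Since the script is Python, the same idea there would be conditional requests; a rough sketch of what that could look like (assuming the requests library; the URLs, state file and delay are made up for illustration):

```python
# Rough sketch of the "once a day, a handful of pages" pattern with
# conditional requests, assuming the requests library. The URLs, state file
# and delay are made up for illustration.
import json
import time
import requests

SERIES_URLS = [
    "https://example.org/series/one",
    "https://example.org/series/two",
]
STATE_FILE = "last_modified.json"

try:
    with open(STATE_FILE) as f:
        last_modified = json.load(f)
except FileNotFoundError:
    last_modified = {}

for url in SERIES_URLS:
    headers = {"User-Agent": "my-rss-builder/0.1"}
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]  # conditional GET
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:
        continue  # unchanged since last run; nothing to re-parse
    if "Last-Modified" in resp.headers:
        last_modified[url] = resp.headers["Last-Modified"]
    # ... parse resp.text and append an item to the RSS feed here ...
    time.sleep(5)  # spread the requests out instead of bursting them

with open(STATE_FILE, "w") as f:
    json.dump(last_modified, f)
```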
I would reduce the frequency to once every two days to further reduce the impact.
That might/might not be much.
Depends upon the site, I’d say. E.g., if it’s something like Netflix, I wouldn’t think much, because they have the means to serve the requests.
But for some PeerTube instance, even a single request seems to be too heavy for them. So if that server does not respond to my request, I usually wait for an hour or so before refreshing the page.
Seems like an API request would be preferable for the site you’re checking. I don’t imagine they’re unhappy with the traffic if they haven’t blocked it yet.
I mean, if it’s a CMS site there may not be an API; this would be the only solution in that case.
The problem is these are constant army hordes / datacentres. You have one tool. Sending a few requests from your device wouldn’t even dent a Raspberry Pi, never mind a beefier server.
I think the intention of traffic is also important. Your tool is so you can consume the content freely provided by the website. Their tool is so they can profit off of the work on the website.
deleted by creator
But HTML is machine-readable, and that absolutely is the point!
So search engines shouldn’t exist? This is absurdly simplistic.
Write TOS that state that crawlers automatically accept a service fee and then send invoices to every crawler owner.
Huawei is Chinese. There’s literally zero chance a European company like Codeberg is going to successfully collect from a company in China over a TOS violation.
It’s not even a company. It’s a non-profit “eingetragener Verein” (registered association). They have very limited resources, especially money, because they purely live on membership fees and donations.
True, but it can help limit the European AI scrapers too
I really doubt it. Lawsuits are expensive, and proving responsibility is difficult, since plausible deniability is easy. All scrapers need to do is use shared IPs (e.g. cloud providers), preferably owned by a company in a different legal jurisdiction. That could be the case here: a European company could be using Huawei Cloud to mask the source of their traffic.
All scrapers need to do is use shared IPs (e.g. cloud providers),
Simple: just charge the cloud provider.
Once that gets strong enough they’ll start placing terms against scraping in their TOS.
And then they just throw it in the bin, because there was never a contract between you and them. What to do then? Sue Microsoft, Amazon and Google?
I’m sure Codeberg, a German non-profit Verein, has time and money to do that 🤣.
Sure but that’s a whole different part of the system. Society as a whole has to change (some guillotines would help) and no matter how cool Codeberg is, they can’t do all that on their own.
In the meantime, what the elites visibly respond to, and what is more readily accessible, is monetary cost. Make it costly (operationally or legally) to scrape sites, and they’ll stop, or at least whine about it.
They typically don’t include a billing address in the User Agent when crawling 🤣
That’s a technicality. The billing address can be discovered for a nominal fee as well.
I’m sure it can’t, especially for foreign IP addresses, VPNs, and a ton of other situations. Even if they connect directly to the internet via their ISP, many countries in Europe (don’t know about the US) have laws that would require you to have very good reasons and a court order to get the info you need from the ISP, for a single(!) case.
If it were possible to simply get the address of every digital visitor, we wouldn’t have to develop all this anti-scrape tech; we’d just sue them.
Cloudflare had a similar idea: Introducing pay per crawl: Enabling content owners to charge AI crawlers for access
Begun, the information wars have.
The wars have been fought and lost a while ago tbh
When you realize that you live in a cyberpunk novel. The AI is cracking the ICE. https://cyberpunk.fandom.com/wiki/Black_ICE
I love seeing how much influence William Gibson had on cyberpunk.
It’s not intentional but the chap ended up writing works that defined both the Cyberpunk (Neuromancer) and Steampunk (The Difference Engine) genres.
Can’t deny that influence.
Most of the ICE I’ve read about is white.
Haven’t tried it, it’s in Apple’s closed store… but it’s a start…
Huh, why does Anubis use SHA256? It’s been optimized to all hell and back.
Ah, they’re looking into it: https://github.com/TecharoHQ/anubis/issues/94
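For illustration, this is roughly the shape these proof-of-work challenges take (a toy sketch, not Anubis’s actual code); the complaint above is that optimized SHA-256 implementations grind through it very quickly, so the cost falls more on ordinary browsers than on scraper farms:

```python
# Toy sketch in the rough shape such proof-of-work challenges take (not
# Anubis's actual code): find a nonce whose SHA-256 digest starts with N zero
# hex digits. Heavily optimized SHA-256 implementations chew through this
# quickly, which is why the choice of hash is being questioned.
import hashlib
import itertools
import time

def solve(challenge: str, difficulty: int = 4) -> int:
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

start = time.time()
print(solve("example-challenge"), f"found in {time.time() - start:.3f}s")
```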
I blocked almost all big players in hosting, China, Russia, Vietnam, and now they’re bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.
Soon everything on the internet will be behind a wall.
Not necessarily compromised; I saw a VPN provider (don’t remember the name) that offered a free tier where the client accepts being used for this.
And I suspect that in the future some VPN companies will be exposed doing the same but with their paid customers.
This isn’t sustainable for the AI companies; when the bubble pops it will stop.
In the meantime, sites are getting DDOS-ed by scrapers. One way to stop your site from getting scraped is having it be inaccessible… which is what the scrapers are causing.
Normally I would assume DDOS-ing is performed in order to take a site offline. But AI scrapers require the opposite: they need their targets online and willing. One would think they’d be a bit more careful about the damage they cause.
But they aren’t, because capitalism.
If they had the slightest bit of survival instinct they’d share an archive.org / Google-ish scraper and web cache infrastructure, and pull from those caches, so everything would be scraped once and repeated only occasionally.
Instead they’re building maximally dumb (as in literally counterproductive and self-harming) scrapers that don’t know what they’re interacting with.
At what point will people start to track down and sabotage AI datacenters IRL?
There are many commercial VPNs offering residential IPs. I doubt they use malware.
Seems like such a massive waste of bandwidth since it’s the same work being repeated by many different actors to piece together the same dataset bit by bit.
Ah Capitalism! Truly the king of efficiency /s
Do we all want the fucking Blackwall from Cyberpunk 2077?
Fucking NetWatch?
Because this is how we end up with them.
…excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.
Consider nicotine+
What was that?
I was sucking on my nicotine nipple, err, I mean my vape.
(Hey, its a more affordable stimulant addiction than coffee now!)
No, not the drug; the app.
Oh, well shit, I had not heard of this lol.
I am partial to I2P as … potentially, an entirely new, full internet paradigm, not just filesharing, but I will look into this too!
It’s a Soulseek client, basically. You can share files, chat, put your interests in your profile, etc. It’s basically like social media, minus the posts. The only algorithm that exists is the one that shows people with similar interests. You can also view the most common interests. You can also add disinterests, which are the exact opposite.
That does sound very interesting!
Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.
Just reading the Ender’s Game series, very on-point.
Spoiler:
We’ll wake up one day only to realise an entity living ‘in the wires’ is the only thing keeping the internet alive.
As long as NetWatch keeps them behind the Blackwall, we’re all good.
Reminds me of the “store data inside slow network requests for the in-transit duration” idea. It was a fun article to read.
Link, please?
http://tom7.org/harder/ (has links to the paper and also to the video)
Thanks!
I believe they are talking about Harder Drive: https://youtu.be/JcJSW7Rprio
Thanks!
Like a public service CAPTCHA / BOINC hybrid
I like the idea but couldn’t you just go the more direct route and mine crypto?
Uuughhh, I knew it’d always be a cat and mouse game. I sincerely hope the Anubis devs figure out how to fuck up the AI crawlers again.
It’s being investigated at least, hopefully a solution can be found. This will probably end up in a constantly escalating battle with the AI companies. https://github.com/TecharoHQ/anubis/issues/978
If someone just wants to download code from Codeberg for training, it seems like it’d be way more efficient to just clone the git repositories or even just download tarballs of the most-recent releases for software hosted on Codeberg than to even touch the Web UI at all.
I mean, maybe you need the Web UI to get a list of git repos, but I’d think that that’d be about it.
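Maybe not even that: Codeberg runs Forgejo, and if I remember right its API exposes a repository search endpoint, so something like this rough sketch would be far cheaper than crawling the UI (the endpoint, parameters and field names are from memory, so treat them as assumptions to check against Codeberg’s API docs):

```python
# Rough sketch: list repositories through the Forgejo/Gitea-style API instead
# of crawling the web UI, then shallow-clone them with git. The endpoint,
# parameters and field names are from memory and may need checking.
import subprocess
import requests

page = 1
while True:
    resp = requests.get(
        "https://codeberg.org/api/v1/repos/search",
        params={"page": page, "limit": 50},
    )
    repos = resp.json().get("data", [])
    if not repos:
        break
    for repo in repos:
        # A shallow clone moves far less data than walking every UI page.
        subprocess.run(["git", "clone", "--depth", "1", repo["clone_url"]])
    page += 1
```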
Then they’d have to bother understanding the content and downloading it as appropriate. And you’d think if anyone could understand and parse websites in realtime to make download decisions, it’d be the giant AI companies. But ironically they’re only interested in hoovering up everything as plain web pages to feed into their raw training data.
The same morons scrape Wikipedia instead of downloading the archive files, which can trivially be rendered as web pages locally.
I don’t understand how challenging an AI by asking it to do some heavy computational stuff even makes sense… A computer is literally made to do computations, and AI is just a computer. 🤨
Wouldn’t it make more sense to challenge the AI with a Voight-Kampff test? Ask it about baseball.
The scrapers are not actually an AI; they are just dumb scrapers there to grab as much textual information as possible.
If they have to do Anubis tests, that is going to take more time to get the data they scrape. I suspect that they are probably paid per page they provide, so more time per page is less money for them.
The point is to make scraping expensive enough it isn’t worth the trouble. The only reason AI scrapers are trying to get this data is because it’s cheaper than the alternatives (e.g. generating synthetic data). Once it stops being cheaper, the smart scrapers will stop. The dumb scrapers don’t matter because they don’t have the talent to devise these kind of workarounds.