- cross-posted to:
- fediverse@lemmy.zip
It IS really simple: just add this XML to your site https://crust.piefed.social/rsl.xml and a line to robots.txt (example below).
It might not stop them, but it’s easy to do, so why not.
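For anyone wondering what the robots.txt part looks like: as far as I can tell from the RSL docs, it’s just a single License line pointing at wherever you host the XML. Rough sketch, with a placeholder URL:

```
# robots.txt — the License line is what RSL adds; the URL is a placeholder
User-agent: *
Allow: /

License: https://example.com/rsl.xml
```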
Note that this is non-enforceable. It relies on the goodwill of the crawlers (of which they have none). You can’t impose licenses on other people like this. A license typically relies on having a copyright, patent or trademark restriction on something and then licensing some exemption from that restriction to others. There’s no copyright protection over what people do with your content once they’ve read it, other than the bar on replicating it exactly. (Nor should there be.)
Enforceability is definitely a question, but copyright exists in most countries from the moment you create the content.
Yes, though copyright really only does one thing: it prevents people from replicating the exact content. It doesn’t prevent things like training GenAI models on that content, or make people who read it behave the way you want them to.
Nope. Copyright also covers derivative works. Whether or not AI training is included, I can’t say.
It only covers derivatives so long as they started from the original as a source.
Yes.
I’m hoping this is laying the groundwork for Reddit et al. to eventually sue the heck out of OpenAI, Google, etc., but I’m not a lawyer.
Thanks for the efforts against the AI crawlers! Contrary to half the internet, I refuse to send my traffic through Cloudflare, but it’s really getting complicated. A few weeks ago, Alibaba hit my instance with dozens of requests per second from many hosts across large IP ranges. The expensive database queries overloaded my VPS so much that I was barely able to log in via SSH and block them. Now there’s some scattered crawling left from Tencent and from all over the world, but it seems they have some manners and wait a few seconds between requests. I guess it’s bound to happen again, though. I’ve now activated the User-Agent blocking snippet that’s floating around somewhere. And next time my server is hit, I’ll look into more active countermeasures like Anubis or a self-hosted web application firewall, or I’ll pull blocklists from somewhere. Anyway, this is becoming more of an issue lately.
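For reference, the snippet I mean is roughly an nginx map on the User-Agent header, something like the sketch below. This assumes nginx is the thing in front of the instance, the bot names are examples only, and it obviously only catches crawlers that identify themselves honestly:

```nginx
# Both blocks go in the http{} context; the bot list here is illustrative.
map $http_user_agent $ai_crawler {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot)" 1;
}

server {
    listen 80;
    server_name example.com;

    # refuse anything that admits to being an AI crawler
    if ($ai_crawler) {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:8080;  # wherever the instance actually runs
    }
}
```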
You could point fail2ban at the access logs and automatically block any IPs that are sending a crazy number of requests, or that are sending bad requests, or really however you want to configure it (rough sketch below).
It’s a little trickier for public servers, but I run some private web server stuff and use fail2ban to automatically ban anyone who tries to access the server through the raw IP or an unrecognized hostname. I get something like 15–25 hits per day doing that.
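Roughly what I mean: a filter that matches every request line, plus a jail with a tight findtime/maxretry, so fail2ban acts as a crude per-IP rate limiter. Filenames and thresholds here are just examples, and it assumes an nginx access log in the default format:

```ini
# /etc/fail2ban/filter.d/http-flood.conf  (example name)
[Definition]
# Match every request line in the access log; the jail's findtime/maxretry
# settings below turn that into a per-IP rate limit.
failregex = ^<HOST> -.*"(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.local
[http-flood]
enabled  = true
port     = http,https
filter   = http-flood
logpath  = /var/log/nginx/access.log
# ban any IP that makes more than 300 requests in 60 seconds, for an hour
findtime = 60
maxretry = 300
bantime  = 3600
```

300 requests over 60 seconds is roughly five per second sustained; the point is to set it high enough that normal human browsing never trips it.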
Thanks, but I’m not sure that’s going to help me. What I see in my logs are many different IPs from several /18 networks. It’d take a while for fail2ban to fight such a crawler at the individual-address level. Or I go for some nuclear approach, but I’d really like to avoid restricting the open internet even more than it already is. And it’d be hard to come up with a number of allowed requests that still lets my services work for humans: me scrolling through PieFed definitely makes more requests for a while than one individual crawler IP from Tencent does. Maybe if I find a good replacement for fail2ban that makes tasks like that a bit easier. And it’d better be efficient, because fail2ban already consumes hours of CPU time sifting through my logs.
Requests that call my server by raw IP are already handled. I think that just returns a 301 redirect to my domain name. I get a lot of exploit scanners via that route, looking for vulnerable WordPress plugins, phpMyAdmin, etc., but they end up on my static website and that’s it.
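In nginx terms (assuming that’s what’s in front), that’s basically just a catch-all default server, something like:

```nginx
# Catch-all for requests that arrive by raw IP or an unknown Host header;
# example.org stands in for the real domain. A similar default_server on
# port 443 additionally needs a (dummy) certificate to exist.
server {
    listen 80 default_server;
    server_name _;
    return 301 https://example.org$request_uri;
}
```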
oh for fuck’s sake
It’s still a legally unsettled question in most jurisdictions whether AI crawling and training require permission from the copyright holders of the source works at all. If the answer to that question is “no”, then the entire idea of using copyright law (and copyright licenses) to block AI crawling and training doesn’t work.
I do not think we should want the answer to that question to be “yes”. Why would anyone want copyright law to impose more restrictions than it already does? The trend should be toward fewer copyright restrictions.
I am not at all a fan of the current AI hype, but I am even less of a fan of wanting copyright law to be more restrictive or to be construed that way.