• db0@lemmy.dbzer0.com · 7 days ago

    Note that this is non-enforceable; it relies on the goodwill of the crawlers (of which they have none). You can’t impose licenses on other people like this. A license typically relies on having a copyright, patent, or trademark restriction on something, and then licensing some exemption to that restriction to others. There’s no copyright protection over what people do with your content once they’ve read it, other than not replicating it exactly (nor should there be).

      • db0@lemmy.dbzer0.com · 7 days ago

        Yes, though copyright only does one thing: prevent people from replicating the exact content. It doesn’t prevent things like training GenAI models on that content, or force people who read it to behave as you want them to.

    • Rimu@piefed.social (OP) · 7 days ago

      Yes.

      I’m hoping this is laying the groundwork for Reddit et al. to eventually sue the heck out of OpenAI, Google, etc., but I’m not a lawyer.

      • BrikoX@lemmy.zip · 6 days ago

        Reddit literally signed deals with Google and OpenAI granting them the right to train their LLMs on user content. So the only avenue to sue them would be breach of contract.

  • hendrik@palaver.p3x.de · 7 days ago

    Thanks for the efforts against the AI crawlers! Contrary to half the internet, I refuse to send my traffic over Cloudflare, but it’s really getting complicated. A few weeks ago, Alibaba hit my instance with dozens of requests per second from many hosts across large IP ranges. The expensive database queries overloaded my VPS so much that I was barely able to log in via SSH and block them. Now there’s some scattered crawling left from Tencent and from all over the world, but it seems those crawlers have some manners and wait a few seconds between requests. I guess it’s bound to happen again, though. I’ve now activated the User-Agent blocking snippet that’s floating around somewhere, and next time my server is hit, I’ll look into more active countermeasures like Anubis, a self-hosted web application firewall, or pulling blocklists from somewhere. Anyway, this is becoming more of an issue lately.
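
    For reference, a minimal sketch of what that kind of User-Agent blocking snippet looks like in nginx. The bot names are common examples, not the exact list from the snippet, and the domain is a placeholder:

    ```nginx
    # Sketch: block known AI crawler User-Agents at the nginx level.
    # The map goes in the http {} context; bot names are illustrative,
    # real blocklists are longer and change often.
    map $http_user_agent $blocked_ua {
        default          0;
        ~*GPTBot         1;
        ~*ClaudeBot      1;
        ~*Bytespider     1;
        ~*Amazonbot      1;
    }

    server {
        listen 80;
        server_name example.com;  # placeholder

        if ($blocked_ua) {
            return 403;  # or 444 to close the connection without replying
        }
    }
    ```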

    • slate@sh.itjust.works · 20 hours ago

      You could point fail2ban at the access logs and automatically block any IPs that are sending a crazy number of requests, or that are sending bad requests, or really however you want to configure it.

      It’s a little trickier for public servers, but I run some private web server stuff and use fail2ban to automatically ban anyone who attempts to access the server through the raw IP or a non-recognized hostname. I get like 15–25 hits per day doing that.
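
      A rough sketch of that kind of setup, assuming nginx access logs in the default combined format; the filter/jail names, paths, and thresholds here are illustrative and need tuning:

      ```ini
      # /etc/fail2ban/filter.d/http-flood.conf (illustrative name)
      # The regex matches every request line, so maxretry effectively
      # becomes a request-rate limit rather than a failure count.
      [Definition]
      failregex = ^<HOST> -.*"(GET|POST|HEAD)
      ignoreregex =

      # /etc/fail2ban/jail.d/http-flood.conf (illustrative name)
      [http-flood]
      enabled  = true
      filter   = http-flood
      logpath  = /var/log/nginx/access.log
      # ban any IP making more than 300 requests within 60 seconds, for an hour
      maxretry = 300
      findtime = 60
      bantime  = 3600
      ```

      With numbers like those, any IP exceeding 300 requests per minute gets banned for an hour; the hard part is picking thresholds that never catch a fast human reader.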

      • hendrik@palaver.p3x.de · 17 hours ago

        Thanks. But I’m not sure that’s going to help me. What I see in my logs are many different IPs from several /18 networks. It’d take a while for fail2ban to fight such a crawler at the individual-address level. Or I go for some nuclear approach, but I’d really like to avoid restricting the open internet even more than it already is. And it’d be hard to come up with a number of allowed requests such that my services still work for humans. Me scrolling through PieFed definitely makes more requests for a while than one individual crawler IP from Tencent does. Maybe I’ll find a good replacement for fail2ban that makes tasks like this a bit easier. And it’d better be efficient, because fail2ban already consumes hours of CPU time sifting through my logs.
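
        For whole ranges, one option that sidesteps banning addresses one by one is an ipset matched from a single iptables rule; a sketch, where the /18 below is a placeholder rather than a real offender:

        ```sh
        # Sketch: drop entire network ranges with one firewall rule.
        # 100.64.0.0/18 is a placeholder; substitute the offending ranges.
        ipset create crawler-nets hash:net
        ipset add crawler-nets 100.64.0.0/18
        iptables -I INPUT -m set --match-set crawler-nets src -j DROP
        ```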

        Calling my server by IP is handled; I think that just returns a 301 redirect to my domain name. I get a lot of exploit scanners via that route, looking for vulnerable WordPress plugins, phpMyAdmin, etc., but they end up on my static website and that’s it.
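
        That catch-all is typically just a small default server block in nginx (a sketch; the domain is a placeholder):

        ```nginx
        # Sketch: requests arriving by raw IP or an unknown Host header
        # land here and get forwarded to the canonical domain.
        server {
            listen 80 default_server;
            server_name _;
            return 301 https://example.com$request_uri;  # or `return 444;` to drop silently
        }
        ```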

  • schnurrito@discuss.tchncs.de · 7 days ago

    oh for fuck’s sake

    It’s still a legally unsettled question in most jurisdictions whether AI crawling and training requires permission from the copyright holders of the source works at all. If the answer to that question is “no”, then the entire idea of using copyright law (and copyright licenses) to block AI crawling and training simply doesn’t work.

    I do not think we should want the answer to that question to be “yes”. Why would anyone want copyright law to impose more restrictions than it already does? The trend should be toward fewer copyright restrictions.

    I am not at all a fan of the current AI hype, but I am even less of a fan of wanting copyright law to be more restrictive or to be construed that way.