• Programmer Belch@lemmy.dbzer0.com
    English
    arrow-up
    41
    ·
    9 days ago

    I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

    The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the server cannot handle, resulting in slow traffic.

      • who@feddit.org
        English
        arrow-up
        18
        ·
        edit-2
        9 days ago

        Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like what GP describes. HTTP 429 (Too Many Requests) would be a better fit.
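
On the client side, a well-behaved scraper can react to a 429 by waiting for as long as the `Retry-After` header asks. A minimal sketch, assuming the server sends that header as either a number of seconds or an HTTP date; the function name `retry_delay` and the 60-second fallback are made up for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(status, retry_after, now=None, default=60.0):
    """Return how many seconds to wait before retrying, or None if no wait is needed.

    `retry_after` is the raw Retry-After header value (or None if absent).
    `default` is an arbitrary fallback, not something the server promised.
    """
    # Retry-After is commonly sent with 429 and 503 responses.
    if status not in (429, 503):
        return None
    if retry_after is None:
        return default

    value = retry_after.strip()
    if value.isdigit():
        # Form 1: a plain number of seconds.
        return float(value)

    try:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

The caller would sleep for the returned number of seconds before the next request, and skip the sleep when the result is `None`.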

        • redjard@lemmy.dbzer0.com
          English
          arrow-up
          9
          ·
          8 days ago

          Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers …

          • who@feddit.org
            English
            arrow-up
            2
            ·
            edit-2
            8 days ago

            > Crawl-delay

            It’s a nonstandard extension without consistent semantics or wide support, but I suppose it’s good to know about anyway. Thanks for mentioning it.
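
For a scraper that wants to honour `Crawl-delay` where a site does set it, Python's standard `urllib.robotparser` can read the value. A minimal sketch; the robots.txt content, the `my-rss-bot` user agent, and the helper name `crawl_policy` are all hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for this user agent and URL.

    crawl_delay is None when robots.txt sets no Crawl-delay for the agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


allowed, delay = crawl_policy(ROBOTS_TXT, "my-rss-bot", "https://example.org/series/1")
```

Since the directive has no agreed semantics, treating the value as a minimum number of seconds between requests is only one possible reading.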

        • S7rauss@discuss.tchncs.de
          English
          arrow-up
          4
          ·
          9 days ago

          I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn’t be contributing significant amounts of traffic if they’re really only pulling data once a day.

    • ulterno@programming.dev
      English
      arrow-up
      8
      ·
      9 days ago

      If the site is getting slowed at times (regardless of whether it is when you scrape), you might want to not scrape at all.

      Probably not a good idea to download the whole site, but then that depends upon the site.

      • If it is a static site, just setting up your scraper to not download CSS/JS and images/videos should make a difference.
      • For a dynamically created site, there’s nothing I can say.
      • Then again, if you try to reduce your download to what you are actually using, as much as possible, that might be good enough.
      • Since sites are originally made for human consumption, you might consider keeping your link traversal rate similar to a human’s.
      • The best would be if you could ask the website dev whether they have an API available.
        • Even better, ask them to provide an RSS feed.
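
Two of the points above (skipping assets and keeping a human-like pace) can be sketched in Python. This is only an illustration: the names `wanted` and `Throttle`, the suffix list, and whatever interval you pick are assumptions, not anything the site prescribes.

```python
import time
from urllib.parse import urlparse

# File types the feed does not need; extend or trim to taste.
SKIPPED_SUFFIXES = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
    ".webp", ".svg", ".mp4", ".webm",
)


def wanted(url):
    """Return True if the URL does not look like a stylesheet, script, image or video."""
    path = urlparse(url).path.lower()
    return not path.endswith(SKIPPED_SUFFIXES)


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling `throttle.wait()` before each request, and only requesting URLs for which `wanted(url)` is true, keeps the traffic close to what one person clicking through the pages would produce.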
      • Programmer Belch@lemmy.dbzer0.com
        English
        arrow-up
        3
        ·
        9 days ago

        As far as I know, the website doesn’t have an API, so I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each time, one for each series I’m following.
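
The feed-building half of such a script can be sketched with the standard library alone. This is not the actual script, just an illustration; `build_rss`, the example titles, and the `example.org` URLs are hypothetical, and the HTML parsing step is left out because it depends entirely on the site:

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, chapters):
    """Build a minimal RSS 2.0 document from (chapter_title, chapter_url) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"New chapters from {link}"
    for chapter_title, chapter_url in chapters:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = chapter_title
        ET.SubElement(item, "link").text = chapter_url
        # The chapter URL doubles as a stable identifier for feed readers.
        ET.SubElement(item, "guid").text = chapter_url
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_rss(
    "My series",
    "https://example.org/",
    [
        ("Series A, chapter 12", "https://example.org/a/12"),
        ("Tom & Jerry, chapter 3", "https://example.org/tj/3"),
    ],
)
```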

        • limerod@reddthat.com
          English
          arrow-up
          2
          ·
          8 days ago

          You can use the caching features in curl/wget so that it does not download the same CSS or HTML twice. You can also skip JavaScript and image files to save on unnecessary requests.

          I would reduce the frequency to once every two days to further reduce the impact.
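
If the script is in Python rather than curl/wget, a similar effect comes from conditional requests: send back the `ETag` / `Last-Modified` values from the previous download, and an unchanged page costs the server only a short 304 response. A minimal sketch, assuming the server sends those validators at all; the cache layout (a plain dict) and the function names are made up for illustration:

```python
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def update_cache(cached, status, response_headers, body):
    """Keep the old entry on 304 Not Modified, otherwise store the new page."""
    if status == 304:
        return cached
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
        "body": body,
    }


def fetch(url, cached):
    """Fetch url, reusing the cached body when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return update_cache(cached, response.status, response.headers, response.read())
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError rather than a normal response.
        if err.code == 304:
            return cached
        raise
```

Start with an empty dict as the cache entry for each series page and store whatever `fetch` returns for the next run.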

        • ulterno@programming.dev
          English
          arrow-up
          1
          ·
          8 days ago

          That might/might not be much.
          Depends upon the site, I’d say.

          e.g. If it’s something like Netflix, I wouldn’t think much, because they have the means to serve the requests.
          But for some PeerTube instance, even a single request seems to be too heavy for them. So if that server does not respond to my request, I usually wait for an hour or so before refreshing the page.

    • Flax@feddit.uk
      English
      arrow-up
      5
      arrow-down
      1
      ·
      9 days ago

      The problem is that these are constant, army-sized hordes running out of datacentres. You have one tool. Sending a few requests from your device wouldn’t even dent a Raspberry Pi, never mind a beefier server.

      I think the intent behind the traffic is also important. Your tool is so you can consume the content freely provided by the website. Their tool is so they can profit off of the work on the website.