• ulterno@programming.dev · 8 days ago

    If the site is slowing down at times (regardless of whether that happens while you scrape), you might want to not scrape it at all.

    Downloading the whole site is probably not a good idea, but that depends on the site.

    • If it is a static site, setting up your scraper to skip CSS/JS and images/videos should make a difference.
    • For a dynamically created site, there’s nothing general I can say.
    • Then again, reducing your downloads to only what you actually use, as far as possible, might be good enough.
    • Since sites are made for human consumption in the first place, consider keeping your link-traversal rate similar to a human’s (see the sketch after this list).
    • The best would be if you could ask the website dev whether they have an API available.
      • Even better, ask them to provide an RSS feed.
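
    To illustrate the first and fourth points above, here is a minimal Python sketch (the URLs, pacing, and contact address are all placeholders, not advice for any particular site). Since it only requests the pages you list, no CSS/JS or media is fetched:

    ```python
    import time
    import requests

    # A sketch only: URLs, pacing, and contact address are placeholders.
    PAGES = [
        "https://example.com/series/1",
        "https://example.com/series/2",
    ]

    session = requests.Session()
    # Identify yourself so the site admin can reach you if needed.
    session.headers["User-Agent"] = "my-scraper/0.1 (contact: me@example.com)"

    for i, url in enumerate(PAGES):
        resp = session.get(url)  # fetches the listed HTML page only
        resp.raise_for_status()
        with open(f"page_{i}.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
        time.sleep(30)  # human-ish pace between page loads
    ```
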
    • Programmer Belch@lemmy.dbzer0.com · 8 days ago

      As far as I know, the website doesn’t have an API. I just download the HTML and format the result with a simple Python script; it makes around 10 to 20 requests each time, one for each series I’m following.
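
      Not the actual script, but a minimal sketch of what such a workflow might look like, assuming requests and BeautifulSoup; the site structure, URLs, and selector are made up:

      ```python
      import time
      import requests
      from bs4 import BeautifulSoup  # pip install beautifulsoup4

      # Purely illustrative: the URLs and selector below are made up.
      SERIES_URLS = [
          "https://example.com/series/alpha",
          "https://example.com/series/beta",
      ]

      session = requests.Session()
      session.headers["User-Agent"] = "series-checker/0.1"

      for url in SERIES_URLS:
          html = session.get(url).text
          soup = BeautifulSoup(html, "html.parser")
          latest = soup.select_one(".chapter-list a")  # hypothetical selector
          if latest:
              print(f"{url}: {latest.get_text(strip=True)}")
          time.sleep(5)  # small gap between the 10-20 requests
      ```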

      • limerod@reddthat.com · 8 days ago

        You can use the timestamping/conditional-download options in curl/wget (e.g. wget -N, curl -z) so they don’t download the same CSS or HTML twice. You can also skip JavaScript and image files to save on unnecessary requests.

        I would reduce the frequency to once every two days to further reduce the impact.
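
        In the same spirit, here is a Python sketch of an HTTP conditional request; wget’s -N and curl’s -z do something similar on the command line. The URL and cache-file name are placeholders. If the page hasn’t changed, the server can answer 304 Not Modified instead of resending the HTML:

        ```python
        import json
        import os
        import requests

        CACHE_FILE = "etag_cache.json"            # hypothetical local cache
        URL = "https://example.com/series/alpha"  # placeholder

        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                cache = json.load(f)

        headers = {}
        if URL in cache:
            # Ask the server to resend the page only if it has changed.
            headers["If-None-Match"] = cache[URL]

        resp = requests.get(URL, headers=headers)
        if resp.status_code == 304:
            print("Not modified; reuse the local copy.")
        else:
            resp.raise_for_status()
            if "ETag" in resp.headers:
                cache[URL] = resp.headers["ETag"]
                with open(CACHE_FILE, "w") as f:
                    json.dump(cache, f)
            # parse/save resp.text here
        ```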

      • ulterno@programming.dev · 8 days ago

        That might or might not be much.
        Depends on the site, I’d say.

        E.g. if it’s something like Netflix, I wouldn’t think much of it, because they have the means to serve the requests.
        But for some PeerTube instances, even a single request can seem too heavy for them. So if such a server does not respond to my request, I usually wait an hour or so before refreshing the page.