Local LLMs are incredibly powerful tools, but smaller models can be hard to put to good use in certain contexts. With fewer parameters, they simply know less, though you can extend their capabilities with a search engine that's accessible over MCP. As it turns out, however, you can host a 120B parameter model on a GPU with just 24GB of VRAM, paired with 64GB of regular system RAM, and it's fast enough to be usable for voice assistants, smart home automation, and more. For reference, with 24GB of VRAM, the largest practical dense model you'll typically be able to fit is a quantized 27 billion parameter model, once you account for the memory needed to hold the context window.
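The trick is that only part of the model needs to live in VRAM, with the rest spilling over into system RAM. As a minimal sketch of that idea, here's how partial GPU offload might look with the llama-cpp-python bindings; the model path, layer count, and context size are placeholders you'd tune for your own hardware, not the exact setup used here:

```python
from llama_cpp import Llama

# Hypothetical local path to a quantized gpt-oss-120b GGUF file.
MODEL_PATH = "models/gpt-oss-120b-Q4_K_M.gguf"

# Offload only as many layers as fit in 24GB of VRAM; the remaining
# layers stay in system RAM and run on the CPU.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=24,   # placeholder: raise until VRAM is nearly full
    n_ctx=8192,        # the context window consumes memory too
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Turn off the living room lights."}]
)
print(response["choices"][0]["message"]["content"])
```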

Specifically, the model we can use is gpt-oss-120b, the largest open-weight model from OpenAI. It's a Mixture of Experts model with 117B total parameters, of which only about 5.1B are active at a time. Paired with Whisper for quick speech-to-text transcription, we can transcribe a voice command, ship the transcription to our local LLM, and get a response back. With gpt-oss-120b, I manage to get about 20 tokens per second of output, which is more than good enough for a voice assistant. I'm running all of this on the 45HomeLab HL15 Beast, but any similarly specced machine will be able to do the same.
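To make that pipeline concrete, here's a rough sketch of the flow, assuming the openai-whisper package for transcription and a local server (llama.cpp, Ollama, or similar) exposing an OpenAI-compatible API at localhost:8080; the audio filename, port, and model name are placeholder assumptions rather than the exact configuration described above:

```python
import whisper
from openai import OpenAI

# 1. Transcribe the recorded voice command with a small Whisper model.
stt = whisper.load_model("base")
transcription = stt.transcribe("command.wav")["text"]  # placeholder audio file

# 2. Ship the transcription to the local LLM over an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="gpt-oss-120b",  # whatever name your local server registers
    messages=[
        {"role": "system", "content": "You are a voice assistant for a smart home."},
        {"role": "user", "content": transcription},
    ],
)

# 3. The response can then be displayed or passed to a text-to-speech step.
print(reply.choices[0].message.content)
```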

    • robsteranium@lemmy.world · 21 days ago

      Yeah, I really like the idea of talking to a smart home but can’t justify the power consumption of leaving such a big box running 24/7.

      I also don’t think we need models this large for that purpose. It just needs to understand the semantics of home automation. Being able to compose limericks is a parlour trick that’s not worth the overhead.