GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a “thinking mode” for advanced reasoning and tool use, and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the reasoning `enabled` boolean. Learn more in our docs.
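As a rough illustration, here is what toggling the two modes might look like through an OpenAI-compatible chat endpoint. The base URL, model id, and the exact shape of the `reasoning` field are assumptions based on the description above, not confirmed API details.

```python
# Hypothetical sketch: switching GLM-4.5-Air between thinking and
# non-thinking mode via an OpenAI-compatible API. The base_url and the
# "reasoning" payload shape are assumptions, not confirmed details.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

# “Thinking mode”: let the model reason before answering (agent/tool use).
thoughtful = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Plan a three-step research agent."}],
    extra_body={"reasoning": {"enabled": True}},
)

# “Non-thinking mode”: skip reasoning for low-latency interaction.
fast = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Say hi."}],
    extra_body={"reasoning": {"enabled": False}},
)

print(thoughtful.choices[0].message.content)
print(fast.choices[0].message.content)
```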

Blog post: https://z.ai/blog/glm-4.5

Hugging Face:

https://huggingface.co/zai-org/GLM-4.5

https://huggingface.co/zai-org/GLM-4.5-Air

  • Alex@lemmy.ml · 11 days ago

    I’ve moved to using RamaLama, mainly because it promises to probe your hardware and pick the best acceleration available for whatever model you launch.

    • brucethemoose@lemmy.world · 11 days ago (edited)

      It looks like it just chooses a llama.cpp backend to compile, so if you already know your GPU and which backend to pick, you’re technically leaving a good bit of performance/model size on the table (rough sketch of the manual route below).

      All this stuff is horribly documented though.
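      For what it’s worth, a minimal sketch of that manual route with llama-cpp-python: compile against the backend you know fits your GPU, then offload layers yourself. The build flag, model path, and quantization name are illustrative assumptions, not anything RamaLama does.

      ```python
      # Rough sketch, assuming llama-cpp-python was compiled against the
      # backend you actually want, e.g. (flag varies by backend/version):
      #   CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
      from llama_cpp import Llama

      llm = Llama(
          model_path="./GLM-4.5-Air-Q4_K_M.gguf",  # hypothetical quantized file
          n_gpu_layers=-1,  # offload all layers to the GPU you built for
          n_ctx=8192,       # context window; tune to fit your VRAM
      )

      out = llm("Q: What is a Mixture-of-Experts model? A:", max_tokens=128)
      print(out["choices"][0]["text"])
      ```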