GLM-4.5-Air is the lightweight variant of our latest flagship model family, likewise purpose-built for agent-centric applications. Like GLM-4.5, it adopts a Mixture-of-Experts (MoE) architecture, but with a more compact parameter count. GLM-4.5-Air also supports hybrid inference, offering a “thinking mode” for advanced reasoning and tool use and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the `reasoning` `enabled` boolean. Learn more in our docs.
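As a rough illustration of the boolean described above, here is a minimal sketch of a chat-completions-style request body that toggles the reasoning behaviour. The exact field names, endpoint, and model identifier vary by provider, so treat the payload shape (`reasoning.enabled`, `"glm-4.5-air"`) as an assumption rather than a documented schema:

```python
import json

def build_request(prompt: str, reasoning_enabled: bool) -> str:
    """Build a chat-completions-style request body for GLM-4.5-Air.

    NOTE: the payload shape here is an assumption for illustration;
    check your provider's docs for the exact field names.
    """
    payload = {
        "model": "glm-4.5-air",
        "messages": [{"role": "user", "content": prompt}],
        # True -> thinking mode (reasoning / tool use),
        # False -> non-thinking mode (low-latency interaction)
        "reasoning": {"enabled": reasoning_enabled},
    }
    return json.dumps(payload)

# Thinking mode on for a multi-step task, off for quick chat turns:
print(build_request("Plan a three-step refactor of this module.", True))
print(build_request("Hi there!", False))
```

In practice you would send this body to the provider's chat-completions endpoint; the toggle lets you trade latency for reasoning depth per request.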
Blog post: https://z.ai/blog/glm-4.5
Hugging Face:
I’ve moved to using RamaLama, mainly because it promises to probe your hardware and pick the best acceleration available for whatever model you launch.
It looks like it just chooses a llama.cpp backend to compile against, so technically you’re leaving a fair bit of performance (and usable model size) on the table if you already know your GPU and which backend to choose.
All of this is poorly documented, though.