From autoregressive blank-infilling to agentic engineering — the GLM family has evolved from a novel pretraining architecture into one of the world's top open-weight AI systems.
"Towards Long-Horizon Tasks" — GLM-5.1 is a next-generation flagship for agentic engineering. It achieves SOTA on SWE-Bench Pro and sustains productive optimization over hundreds of iterations and thousands of tool calls.
Compresses the KV cache into fewer vectors per token, yielding significant GPU memory savings and enabling longer contexts and more efficient inference.
Reduces attention computation by 1.5–2× for long sequences without quality loss.
Predicts the next two tokens simultaneously during decoding, reaching a mean acceptance length of 2.76 tokens per step for faster generation.
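As a rough back-of-the-envelope illustration (assuming a draft-and-verify multi-token-prediction scheme; the exact decoding setup is not specified here), the mean acceptance length translates into decoding speedup roughly like this:

```python
def mtp_speedup(acceptance_length: float, draft_overhead: float = 0.0) -> float:
    """Estimate decoding speedup from multi-token prediction.

    With a mean acceptance length of L tokens per forward pass, one
    verification pass emits L tokens instead of 1, so throughput scales
    by roughly L / (1 + draft_overhead), where draft_overhead is the
    relative cost of the extra prediction heads (assumed small).
    """
    return acceptance_length / (1.0 + draft_overhead)

# Reported mean acceptance length of 2.76
print(round(mtp_speedup(2.76), 2))        # ideal speedup with free draft heads
print(round(mtp_speedup(2.76, 0.15), 2))  # speedup if draft heads add 15% cost
```

The `draft_overhead` values are illustrative assumptions; the real gain depends on how cheap the extra prediction heads are relative to a full forward pass.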
Unlike previous models that plateau early, GLM-5.1 sustains optimization over 600+ iterations and 6,000+ tool calls. In a vector search optimization task, it reached 21.5k QPS — 6× the best single-session result. It built a complete Linux desktop environment in a single 8-hour run.
From lightweight inference to flagship performance, the GLM family covers the full spectrum.
754B MoE (40B active). #1 open model on LMArena. 200K context. Competitive with Claude Opus 4.5, Gemini 3 Pro, GPT-5.2.
Competitive with GPT-4. 128K context, strong bilingual performance, native tool use and function calling.
Vision-capable variant that understands images alongside text. Built on CogVLM research.
Lighter and faster variant optimized for cost-sensitive production use without sacrificing quality.
9B model that outperforms Llama-3-8B. 1M context variant available. Permissive commercial license.
Specialized code model based on GLM-4-9B. Strong HumanEval scores with a VS Code extension.
GLM-5 autonomously solves multi-step engineering tasks across 10,000+ SWE environments in 9 languages. Supports 1,000+ concurrent agentic rollouts via the async "slime" RL framework.
Consistently the top-performing model family for Chinese NLP tasks while maintaining strong English performance. Dominates C-Eval and CMMLU benchmarks.
GLM-4V and CogVLM support image understanding, visual question answering, and text generation from images.
GLM-5's search agent features hierarchical context management and a web knowledge graph with 2M+ pages. GLM-4 All-Tools autonomously invokes browsing, code interpreters, and drawing tools.
200K tokens in GLM-5 (up from 128K). The GLM-4-9B-Chat-1M variant pushes to 1 million tokens. 3-stage training: 32K → 128K → 200K.
CodeGeeX4 delivers strong HumanEval results with a VS Code extension. GLM-5 scores 77.8 on SWE-bench Verified and 56.2 on Terminal-Bench 2.0.
GLM-5 supports interleaved, preserved, and turn-level thinking control — giving developers fine-grained control over reasoning depth and cost.
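A minimal sketch of how such a control surface might look from a client's side. This is hypothetical: the parameter names (`thinking`, `mode`, `effort`) are illustrative assumptions, not the documented z.ai API.

```python
def build_chat_request(prompt: str, thinking_mode: str = "interleaved",
                       effort: str = "medium") -> dict:
    """Assemble an OpenAI-compatible chat payload with a hypothetical
    `thinking` block controlling reasoning depth and cost.

    The three mode names mirror the controls described above; the JSON
    shape itself is an assumption, not the official API schema.
    """
    if thinking_mode not in {"interleaved", "preserved", "turn-level"}:
        raise ValueError(f"unknown thinking mode: {thinking_mode}")
    return {
        "model": "glm-5",
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"mode": thinking_mode, "effort": effort},
    }

req = build_chat_request("Refactor this module.", thinking_mode="turn-level")
print(req["thinking"]["mode"])  # turn-level
```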
GLM-5 generates presentation slides via multi-level reward training for layout, rendering, and visual quality. CogView and CogVideoX handle image and video generation.
GLM originated with autoregressive blank infilling — combining the best of GPT and BERT. GLM-5 builds on that foundation with a sparse MoE architecture and cutting-edge attention mechanisms.
Causal / autoregressive LM. Predicts the next token. Great for generation, unidirectional.
Masked LM. Predicts [MASK] tokens with bidirectional context. Great for understanding, not generation.
Autoregressive blank infilling. Removes spans of text and autoregressively predicts them with both left and right context.
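The span-corruption idea can be sketched in a few lines. A minimal sketch, assuming single-span corruption with placeholder special tokens (the token names and span choice are illustrative, not GLM's actual tokenizer or sampling scheme):

```python
def blank_infill_example(tokens, span_start, span_len,
                         mask="[MASK]", start="[S]"):
    """Build a GLM-style blank-infilling training pair.

    Part A: the input with one span replaced by a mask token, which the
    model attends to bidirectionally (like BERT).
    Part B: the removed span, predicted left-to-right (like GPT),
    prefixed with a start token.
    """
    span = tokens[span_start:span_start + span_len]
    part_a = tokens[:span_start] + [mask] + tokens[span_start + span_len:]
    part_b = [start] + span
    return part_a, part_b

tokens = ["the", "cat", "sat", "on", "the", "mat"]
part_a, part_b = blank_infill_example(tokens, span_start=2, span_len=2)
print(part_a)  # ['the', 'cat', '[MASK]', 'the', 'mat']
print(part_b)  # ['[S]', 'sat', 'on']
```

Training maximizes the likelihood of Part B's tokens autoregressively, conditioned on all of Part A — which is how the objective gets bidirectional context *and* generative decoding in one model.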
First open-weights model to score 50 on the Artificial Analysis Intelligence Index v4.0. Competitive with Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2.
| Benchmark | What It Tests | GLM-5.1 | GLM-5 |
|---|---|---|---|
| AIME 2026 | Advanced math competition | 95.3 | 95.4 |
| GPQA-Diamond | Graduate-level reasoning | 86.2 | 86.0 |
| HLE | Humanity's Last Exam | 31.0 (52.3 w/ tools) | 30.5 (50.4 w/ tools) |
| SWE-Bench Pro | Complex software engineering | 58.4 | 55.1 |
| NL2Repo | Repo generation | 42.7 | 35.9 |
| Terminal-Bench 2.0 | Real-world terminal tasks | 69.0 | 56.2 |
| CyberGym | Cybersecurity tasks | 68.7 | 48.3 |
| BrowseComp | Web browsing accuracy | 68.0 (79.3 w/ ctx mgmt) | 62.0 (75.9 w/ ctx mgmt) |
| τ³-Bench | Agent task completion | 70.6 | 69.2 |
| Benchmark | What It Tests | GLM-5.1 | Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|
| SWE-Bench Pro | Complex software engineering | 58.4 | 57.3 | 57.7 |
| NL2Repo | Repo generation | 42.7 | 49.8 | 41.3 |
| Terminal-Bench 2.0 | Real-world terminal tasks | 63.5 | 65.4 | 75.1 |
| AIME 2026 | Advanced math | 95.3 | 95.6 | 98.7 |
| GPQA-Diamond | Graduate-level reasoning | 86.2 | 91.3 | 92.0 |
GLM-5.1 benchmarks from z.ai/blog/glm-5.1 (Apr 2026). Competitive landscape scores from the same report.
130B bilingual model. One of the first large open bilingual models.
First open chatbot model. Hugely popular in China.
32K context, FlashAttention. Better inference speed.
Added function calling and code interpreter.
Flagship. 128K context, multimodal, tool use, GPT-4 competitive.
Open-source 9B. Outperforms Llama-3-8B. 1M context variant.
754B MoE, 200K context, #1 open model on LMArena. Agentic engineering, 3 thinking modes, search agent.
Next-gen flagship for long-horizon agentic tasks. SOTA on SWE-Bench Pro (58.4), CyberGym (68.7). MIT License. 21.5k QPS on vector search over 600+ iterations.
THUDM releases models, training code, and research openly. The entire GLM ecosystem is available on GitHub and HuggingFace.
754B open-weights on HuggingFace. MIT License. Full-precision and FP8. Compatible with vLLM and SGLang.
Open weights on HuggingFace & ModelScope. Permissive commercial license.
Code-specialized model with VS Code extension. Strong HumanEval performance.
GLM-5 is fully adapted to Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, MetaX, and Enflame — ensuring broad hardware accessibility.