Popular LLMs

This page covers the major model families and a concrete catalog of named models that teams are actively evaluating in 2026, including additions from the most recent release waves.

Latest refresh (May 31, 2026):

  • Claude Opus 4.8 is the new frontier model (1M context, adaptive thinking); Opus 4.7 moves to legacy.
  • Claude Sonnet 4.6 is upgraded to a 1M context window.
  • GPT-5.5-Pro is added as the highest-intelligence option.
  • GPT-5.4 mini (400K context) launches for coding and computer use.
  • Gemini 3.5 Flash reaches stable; Gemini 3.1 Pro is in preview.
  • Gemma 4 expands to E2B/E4B (mobile), 26B, and 31B tiers.
  • DeepSeek V4-Pro (862B) and V4-Flash (158B) are the new flagships.
  • The Qwen3.5 series is confirmed across 2B/9B/27B/35B-A3B sizes.
  • Mistral Medium 3.5 leads frontier multimodal; Ministral 3 is available in 3B/8B/14B.
  • Context windows are reaching 1M tokens at the frontier tier.

OpenAI GPT Family

Why it is good: GPT-5.5 and GPT-5.5-Pro deliver frontier reasoning with 1M context; GPT-5.4 mini (400K context) excels at coding, computer use, and subagent workflows at lower cost. Broad ecosystem and stable operations.

Why it can be bad: pricing stays premium in ultra-high-volume deployments; less control than self-hosted alternatives.

Best for: research, real-time analysis, production assistants, complex coding tasks, agentic workflows, and knowledge work requiring current information.

Anthropic Claude Family

Why it is good: Claude Opus 4.8 is the new frontier model, with 1M context and adaptive thinking, and is the most capable option for complex reasoning and agentic coding. Sonnet 4.6 is upgraded to 1M context with fast output. Haiku 4.5 remains cost-efficient. The full family supports extended or adaptive thinking modes.

Why it can be bad: premium pricing for the Opus tier ($5/$25 per MTok); Opus 4.7 is now legacy, so migration is needed.

Best for: Agentic workflows, complex multi-step reasoning, autonomous coding, legal/compliance review, long-form analysis, and enterprise use cases requiring careful outputs.

Google Gemini Family

Why it is good: Gemini 3.5 Flash now stable for production; Gemini 3.1 Pro in preview with advanced agentic and coding capabilities; strong multimodal support across text, image, audio, and video. Good fit for Google Cloud users.

Why it can be bad: 3.1 Pro still in preview; consistency can vary across prompt styles and niche coding tasks.

Best for: multimodal apps, agentic coding, search-enriched workflows, Google-native stacks, and enterprise customers.

Google Gemma Family (Open Source)

Why it is good: fully open-weight models under the permissive Gemma License (Apache-2.0 compatible), which allows commercial use, modification, and self-hosting without vendor lock-in. Gemma 4 (April 2026) is purpose-built for advanced reasoning and agentic workflows, with E2B/E4B tiers for mobile/IoT and 26B/31B tiers for advanced reasoning on personal hardware.

Why it can be bad: requires in-house serving, infrastructure, and compliance review for some regulated industries; smaller fine-tuning community compared to Llama.

Best for: privacy-critical workflows, cost-optimized private inference, mobile/edge AI (E2B/E4B), compliance-sensitive enterprises with IP concerns, and hybrid cloud-on-prem stacks. Ideal for teams wanting full control and no API dependencies.

Meta Llama Family

Why it is good: open-weight flexibility, strong community support, easier self-hosting options, and now includes Llama 4 Maverick for visual reasoning tasks.

Why it can be bad: requires more in-house ML/platform work for top-tier quality and reliability; Scout variants trade some quality for speed.

Best for: cost-aware products, private deployments, custom fine-tuning paths.

Mistral Models

Why it is good: Mistral Medium 3.5 is now frontier-class for agentic and coding tasks; Mistral Large 3 available with open weights; Ministral 3 spans 3B/8B/14B for efficient deployments. Strong European adoption and compliance.

Why it can be bad: ecosystem and tooling footprint can be narrower than hyperscaler platforms.

Best for: latency-sensitive assistants, agentic coding, compact model deployments, regional compliance cases, and European data sovereignty.

Qwen and DeepSeek Families

Why they are good: DeepSeek V4-Pro (862B) and V4-Flash (158B) deliver strong general reasoning and coding; the Qwen3.5 series spans 2B to 35B-A3B with hybrid thinking mode, up to 262K context (on the 35B-A3B model), multimodal support, and an efficient MoE architecture at the top size. Both families offer competitive performance at lower cost.

Why they can be bad: deployment/compliance review is required in regulated enterprise environments; thinking mode increases latency for simple tasks.

Best for: agentic coding, private hosting, value-focused inference, and teams needing controllable reasoning depth without paying frontier API prices.

Concrete Model Catalog (2026)

This list is intentionally broad, so you can shortlist by exact model name before running your own benchmarks.

Direct Model Links

Quick access to the most requested model pages: GPT models, Claude models, Gemini models, Gemma models, Llama 4 Maverick, Llama 4 Scout, Mistral models, Qwen model hub, DeepSeek models

| Model | Provider | Category | Good For | Watch Out | Download / Access |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | Closed frontier | Live web data, 1M context, premium coding and reasoning | Highest cost tier; latency for web calls | API access |
| GPT-5.5-Pro | OpenAI | Closed highest-intelligence | Hardest reasoning problems, research-grade analysis | Higher latency and cost; overkill for simple tasks | API access (new) |
| GPT-5.4 | OpenAI | Closed balanced | 1M context, strong coding at lower cost than 5.5 | Not ideal for the hardest reasoning chains | API access |
| GPT-5.4 mini | OpenAI | Closed efficient | 400K context; coding, computer use, subagents at low cost | Smaller context than full 5.4; less depth on hardest tasks | API access (new) |
| Claude Opus 4.8 | Anthropic | Closed frontier | Most capable model for complex reasoning and agentic coding; 1M context, adaptive thinking, 128K output | Premium cost ($5/$25 per MTok); moderate latency | API access (new) |
| Claude Opus 4.7 | Anthropic | Closed (legacy) | Strong agentic tasks and reasoning; 1M context with adaptive thinking | Now legacy; consider migrating to Opus 4.8 | API access |
| Claude Opus 4.6 | Anthropic | Closed (legacy) | Extended thinking, 1M context, strong reasoning | Legacy; same pricing as 4.8 but less capable | API access |
| Claude Sonnet 4.6 | Anthropic | Closed balanced | Best speed/intelligence combo; 1M context, extended thinking, $3/$15 per MTok | Not as capable as Opus on hardest tasks | API access |
| Claude Haiku 4.5 | Anthropic | Closed small | Fastest model with near-frontier intelligence; 200K context, $1/$5 per MTok | Less robust on deepest tasks; smaller context | API access |
| Gemini 3.5 Flash | Google | Closed fast (stable) | Most intelligent for agentic and coding tasks at speed | Lower quality than Pro tier on complex reasoning | API access (stable) |
| Gemini 3.1 Pro | Google | Closed enterprise (preview) | Advanced intelligence, complex problem-solving, agentic coding | Still in preview; higher latency | Preview access |
| Gemini 3 Flash | Google | Closed (preview) | Frontier-class performance at fraction of cost | Preview status; still maturing | Preview access |
| Gemini 2.5 Pro | Google | Closed | Reasoning and multimodal enterprise apps | Being superseded by 3.x series | API access |
| Gemma 4 E2B/E4B | Google | Open weight mobile/edge | Optimized for mobile and IoT; compute-efficient inference | Limited ceiling on complex reasoning; requires edge hardware | Model docs · Weights |
| Gemma 4 26B | Google | Open weight balanced | Advanced reasoning on personal hardware; agentic workflows | Needs good VRAM; careful quantization for consumer GPUs | Model docs · Weights |
| Gemma 4 31B | Google | Open weight high quality | Most capable Gemma for private reasoning and coding | Hardware intensive; best with 24GB+ VRAM or multi-GPU | Model docs · Weights |
| Llama 4 Maverick | Meta | Open MoE multimodal | 17B params, 128 experts (402B total); flagship open-weight reasoning and vision | Full MoE serving requires strong infrastructure | Download · Scout |
| Llama 4 Scout | Meta | Open multimodal efficient | 17B params, 16 experts (109B total); edge inference with vision-text support | Lower ceiling than Maverick; best for volume-optimized deployments | Download |
| Llama 3.1 405B Instruct | Meta | Open weight | Top-end open deployment quality | Heavy infrastructure requirements | Download · 70B · 8B |
| Llama 3.1 70B Instruct | Meta | Open weight | Strong self-hosted quality/cost balance | Needs good inference stack | Download · 405B · 8B |
| Llama 3.1 8B Instruct | Meta | Open weight small | Edge and low-cost deployments | Lower performance on complex tasks | Download · 70B · 405B |
| Llama 3.2 11B Vision | Meta | Open multimodal | Private vision-text pipelines | Requires evals for OCR-heavy cases | Download · 90B |
| Llama 3.2 90B Vision | Meta | Open multimodal | High-capacity multimodal inference | Infrastructure complexity | Download · 11B |
| Llama 3.3 70B Instruct | Meta | Open weight | Efficient self-hosted quality, matches 3.1 405B at much lower cost | Needs good inference stack for throughput | Download |
| Mistral Large 3 | Mistral AI | Open weight multimodal | Advanced general-purpose with open weights available | Smaller ecosystem vs hyperscalers | API + weights |
| Mistral Medium 3.5 | Mistral AI | Closed frontier | Frontier-class multimodal for agentic and coding tasks | Higher cost than smaller variants | API access |
| Mistral Small 4 | Mistral AI | Closed small | Unified instruction-following, reasoning, and coding | Limited depth on advanced reasoning | API access |
| Ministral 3 14B | Mistral AI | Open small | Best-in-class text and vision at compact size | Can trail latest closed models | API access |
| Ministral 3 8B | Mistral AI | Open compact | Efficient text and vision on consumer hardware | Lower ceiling than 14B on complex tasks | API access |
| Ministral 3 3B | Mistral AI | Open tiny | Ultra-compact with multimodal support for edge/mobile | Very limited on complex reasoning | API access |
| Devstral 2 | Mistral AI | Code-specialized open | Software engineering and code review | Narrower general language strength | API access |
| Qwen3.5-35B-A3B | Alibaba | Open weight MoE multimodal | MoE (35B total, 3B active); hybrid thinking mode; 262K native context; multimodal (text, image, video); agentic coding. SWE-bench: 73.4 | Regional compliance review required; thinking mode adds latency for simple tasks | Download · FP8 |
| Qwen3.5-27B | Alibaba | Open weight balanced | Strong reasoning at moderate size; 80K context | Regional compliance review; needs good VRAM | Model hub |
| Qwen3.5-9B | Alibaba | Open weight compact | 64K context; efficient private inference | Lower ceiling than larger Qwen3.5 variants | Model hub |
| Qwen3 32B Instruct | Alibaba | Open weight | Strong open-weight multilingual assistant quality | Regional compliance and policy review required | Model hub |
| DeepSeek V4-Pro | DeepSeek | Open/available flagship | 862B parameters; strong general reasoning and coding | Massive model requires substantial infrastructure | Download (new) |
| DeepSeek V4-Flash | DeepSeek | Open/available efficient | 158B parameters; fast general reasoning at lower cost | Governance review in enterprise; less capable than V4-Pro | Download |
| DeepSeek V3.2 | DeepSeek | Open/available | 685B parameters; mature and well-tested | Superseded by V4 series; large infrastructure needed | Download |

Note: closed models usually provide API access rather than direct weight downloads.
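
The hardware warnings in the catalog for open-weight models can be sanity-checked with a rough, weights-only memory estimate. The sketch below is illustrative only: the function name `weight_memory_gb` and the dictionary are hypothetical, the parameter counts are taken from the model names and catalog entries on this page, and the estimate ignores KV cache, activations, and runtime overhead. It also assumes that MoE models keep all experts resident in memory, so the total parameter count is used rather than the active one.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10**9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real usage is higher.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Total parameter counts in billions, as named or listed in the catalog.
# Assumption: for MoE models the full expert set must be loaded, so the
# total count is used here, not the 17B active parameters.
CATALOG_PARAMS_B = {
    "Gemma 4 26B": 26,
    "Gemma 4 31B": 31,
    "Llama 4 Scout": 109,
    "Llama 4 Maverick": 402,
}

if __name__ == "__main__":
    for name, params_b in CATALOG_PARAMS_B.items():
        sizes = ", ".join(
            f"{bits}-bit: {weight_memory_gb(params_b, bits):.1f} GB"
            for bits in (16, 8, 4)
        )
        print(f"{name}: {sizes}")
```

Under these assumptions, Gemma 4 31B needs roughly 15.5 GB for 4-bit weights, which is in line with the 24GB+ VRAM guidance once cache and overhead are added, while Llama 4 Maverick stays in multi-GPU territory at any common precision.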

Longlist: Best Candidates by Workload

Use this longlist to narrow candidates before final benchmarks. Pick 3-5 models from each category, run your evals, and track quality, latency, and cost per successful task.
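
One possible way to track those three numbers is sketched here. Everything in it is an assumption for illustration: the `EvalRun` record, the `summarize` helper, and the placeholder model names are hypothetical, and "cost per successful task" is taken to mean total spend divided by the number of runs that passed your own success check.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    model: str        # placeholder label, not a real benchmark result
    succeeded: bool   # outcome of your own pass/fail check
    latency_s: float  # wall-clock seconds for the run
    cost_usd: float   # spend for the run, failed runs included


def summarize(runs: list[EvalRun]) -> dict[str, dict[str, float]]:
    """Aggregate success rate, mean latency, and cost per successful task."""
    by_model: dict[str, list[EvalRun]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    summary = {}
    for model, model_runs in by_model.items():
        successes = sum(1 for r in model_runs if r.succeeded)
        total_cost = sum(r.cost_usd for r in model_runs)
        summary[model] = {
            "success_rate": successes / len(model_runs),
            "mean_latency_s": sum(r.latency_s for r in model_runs) / len(model_runs),
            # Failed runs still cost money, so divide total spend by successes.
            "cost_per_success_usd": (
                total_cost / successes if successes else float("inf")
            ),
        }
    return summary


if __name__ == "__main__":
    # Made-up numbers purely to show the shape of the output.
    demo = [
        EvalRun("model-a", True, 1.0, 0.25),
        EvalRun("model-a", False, 3.0, 0.25),
        EvalRun("model-b", True, 2.0, 0.5),
    ]
    for model, stats in summarize(demo).items():
        print(model, stats)
```

Dividing by successes rather than by total runs means a cheap model that fails often can still come out more expensive per useful result than a pricier one.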

Reasoning and Decision Support

  • Claude Opus 4.8
  • GPT-5.5-Pro
  • GPT-5.5
  • Claude Sonnet 4.6
  • Gemini 3.1 Pro
  • Gemini 2.5 Pro
  • Gemma 4 31B
  • Llama 4 Maverick
  • DeepSeek V4-Pro
  • Qwen3.5-35B-A3B

Coding and Developer Assistant Workloads

  • Claude Opus 4.8
  • Claude Sonnet 4.6
  • GPT-5.5
  • GPT-5.4 mini
  • DeepSeek V4-Pro
  • Mistral Medium 3.5
  • Devstral 2
  • Llama 4 Maverick
  • Qwen3.5-35B-A3B
  • Qwen3 32B Instruct
  • Gemini 3.5 Flash

High-Volume, Cost-Efficient Automation

  • GPT-5.4
  • GPT-5.4 mini
  • Claude Haiku 4.5
  • Gemini 3.5 Flash
  • Gemma 4 E2B/E4B
  • Gemma 4 26B
  • Llama 4 Scout
  • Mistral Small 4
  • Ministral 3 8B

Private or Self-Hosted Enterprise Paths

  • Llama 4 Maverick
  • Llama 4 Scout
  • Llama 3.3 70B Instruct
  • Gemma 4 31B
  • Gemma 4 26B
  • Mistral Large 3
  • Qwen3.5-35B-A3B
  • Qwen3 32B Instruct
  • DeepSeek V4-Flash
  • DeepSeek V4-Pro

How to Pick from This List

Do not pick only by benchmark rank. Validate with your own workload: prompt complexity, response latency, failure tolerance, and monthly token budget.
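
For the monthly token budget, a back-of-the-envelope estimate is often enough to rule candidates in or out. The sketch below uses the per-MTok prices quoted in the catalog for the Claude tiers and assumes the first figure is the input price and the second the output price. The workload numbers (requests per month, tokens per request) and the function name `monthly_cost_usd` are hypothetical; substitute your own traffic figures.

```python
# USD per million tokens, from the catalog entries on this page.
# Assumption: (input price, output price).
PRICES_PER_MTOK = {
    "Claude Opus 4.8": (5, 25),
    "Claude Sonnet 4.6": (3, 15),
    "Claude Haiku 4.5": (1, 5),
}


def monthly_cost_usd(
    model: str,
    requests_per_month: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
) -> float:
    """Estimate monthly spend from average token counts per request."""
    input_price, output_price = PRICES_PER_MTOK[model]
    cost_per_request = (
        input_tokens_per_request * input_price
        + output_tokens_per_request * output_price
    ) / 1_000_000
    return requests_per_month * cost_per_request


if __name__ == "__main__":
    # Hypothetical workload: 100,000 requests, 2,000 tokens in, 500 tokens out.
    for model in PRICES_PER_MTOK:
        cost = monthly_cost_usd(model, 100_000, 2_000, 500)
        print(f"{model}: ${cost:,.2f} per month")
```

With this hypothetical workload the estimate comes to $2,250 for Opus 4.8, $1,350 for Sonnet 4.6, and $450 for Haiku 4.5 per month, before any caching or batch discounts a provider may offer.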

Continue with the comparison matrix, then read the clear recommendations.