Small Systems Playbook

For founders, indie hackers, and teams under 20 engineers.

Updated May 23, 2026 shortlist: Start with Claude Opus 4.7 for agentic tasks or Claude Sonnet 4.6 for fast quality; Route high-volume to Gemini 2.5 Flash or Llama 4 Scout (25% faster). Consider Qwen3.5-35B-A3B (efficient MoE), Ministral 3 8B, or Llama 4 Maverick for self-hosted fallback layers when token budget permits.

Architecture That Actually Ships

Use one primary model (GPT-5.5, Claude Haiku 4.5, or Grok-3) and one budget model (Gemini Flash or Llama 4 Scout).
Keep prompts in versioned files with A/B test metadata.
Add retry + timeout + fallback at API boundaries; consider circuit breakers.
Use token-counting middleware to prevent surprise bills (especially with extended context models).
Cache deterministic prompts; query results aggressively to reduce context passing.

Where Small Teams Fail (2026 Edition)

Neglecting to recalibrate evals after pricing cuts and speedups.
Over-engineering model routers before product-market fit.
Ignoring eval datasets and trusting quick demo prompts.
No token budget guardrails; bleeding money on extended-context models unnecessarily.
Shipping without refusal/failure UX; users hit model limits and think your product broke.
Skipping A/B tests; new faster models tempt you to switch globally instead of validating.

Best Early Recommendation

Start with a premium model for customer-facing outputs and only optimize cost after usage patterns stabilize for 2-4 weeks.