Small Systems Playbook

For founders, indie hackers, and teams under 20 engineers.

Updated May 23, 2026 shortlist: Start with Claude Opus 4.7 for agentic tasks or Claude Sonnet 4.6 for fast quality; Route high-volume to Gemini 2.5 Flash or Llama 4 Scout (25% faster). Consider Qwen3.5-35B-A3B (efficient MoE), Ministral 3 8B, or Llama 4 Maverick for self-hosted fallback layers when token budget permits.

Architecture That Actually Ships

  • Use one primary model (GPT-5.5, Claude Haiku 4.5, or Grok-3) and one budget model (Gemini Flash or Llama 4 Scout).
  • Keep prompts in versioned files with A/B test metadata.
  • Add retry + timeout + fallback at API boundaries; consider circuit breakers.
  • Use token-counting middleware to prevent surprise bills (especially with extended context models).
  • Cache deterministic prompts; query results aggressively to reduce context passing.

Where Small Teams Fail (2026 Edition)

  • Neglecting to recalibrate evals after pricing cuts and speedups.
  • Over-engineering model routers before product-market fit.
  • Ignoring eval datasets and trusting quick demo prompts.
  • No token budget guardrails; bleeding money on extended-context models unnecessarily.
  • Shipping without refusal/failure UX; users hit model limits and think your product broke.
  • Skipping A/B tests; new faster models tempt you to switch globally instead of validating.

Best Early Recommendation

Start with a premium model for customer-facing outputs and only optimize cost after usage patterns stabilize for 2-4 weeks.