Small Systems Playbook
For founders, indie hackers, and teams under 20 engineers.
Updated May 23, 2026. Shortlist: start with Claude Opus 4.7 for agentic
tasks or Claude Sonnet 4.6 for a balance of speed and quality; route
high-volume traffic to Gemini 2.5 Flash or Llama 4 Scout (25% faster).
Consider Qwen3.5-35B-A3B (efficient MoE), Ministral 3 8B, or Llama 4
Maverick for self-hosted fallback layers when token budget permits.
Architecture That Actually Ships
- Use one primary model (GPT-5.5, Claude Haiku 4.5, or Grok-3)
  and one budget model (Gemini Flash or Llama 4 Scout).
- Keep prompts in versioned files with A/B test metadata.
- Add retry + timeout + fallback at API boundaries; consider circuit
  breakers.
- Use token-counting middleware to prevent surprise bills
  (especially with extended context models).
- Cache deterministic prompts and query results aggressively to reduce
  context passing.
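
The retry, timeout, fallback, and caching bullets above can be sketched
roughly as follows. This is a minimal illustration, not any vendor's SDK:
ResilientClient, the ModelCall signature, and every parameter name are
hypothetical, and each provider is assumed to be a plain function that
takes a prompt plus a timeout in seconds and raises on failure. A circuit
breaker is left out to keep the sketch short.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Sequence

# Hypothetical signature: a provider call takes (prompt, timeout_seconds),
# returns text, and raises an exception on timeout or API error.
ModelCall = Callable[[str, float], str]


class ResilientClient:
    """Tries providers in order, with retries, a timeout, and a small cache."""

    def __init__(
        self,
        providers: Sequence[ModelCall],
        retries: int = 2,
        timeout_s: float = 10.0,
        backoff_s: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if not providers:
            raise ValueError("at least one provider is required")
        self.providers = list(providers)  # primary first, budget model last
        self.retries = retries
        self.timeout_s = timeout_s
        self.backoff_s = backoff_s
        self.sleep = sleep  # injectable so tests do not actually wait
        self.cache: Dict[str, str] = {}

    def complete(self, prompt: str, deterministic: bool = False) -> str:
        # Only prompts flagged as deterministic are served from the cache.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if deterministic and key in self.cache:
            return self.cache[key]

        last_error: Optional[Exception] = None
        for call in self.providers:
            for attempt in range(self.retries + 1):
                try:
                    result = call(prompt, self.timeout_s)
                except Exception as exc:
                    last_error = exc
                    if attempt < self.retries:
                        # Exponential backoff before retrying this provider.
                        self.sleep(self.backoff_s * (2 ** attempt))
                    continue
                if deterministic:
                    self.cache[key] = result
                return result
        # Every provider exhausted its retries.
        raise RuntimeError("all providers failed") from last_error
```

Passing the providers as an ordered list keeps routing trivial: the primary
model goes first and the budget model acts as the fallback, which avoids
building a full router early on.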
Where Small Teams Fail (2026 Edition)
- Neglecting to recalibrate evals after pricing cuts and speedups.
- Over-engineering model routers before product-market fit.
- Ignoring eval datasets and trusting quick demo prompts.
- No token budget guardrails; bleeding money on extended-context
  models unnecessarily.
- Shipping without refusal/failure UX; users hit model limits and
  think your product broke.
- Skipping A/B tests; new faster models tempt you to switch globally
  instead of validating.
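
As one possible shape for the token budget guardrail and the failure UX
mentioned above, here is a small sketch. All names (TokenBudget,
guarded_call, estimate_tokens) are hypothetical, and the
4-characters-per-token estimate is a rough assumption; a real deployment
would use the provider's own tokenizer or reported usage numbers.

```python
from dataclasses import dataclass
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


class BudgetExceeded(Exception):
    """Raised when a request would break a per-request or daily limit."""


@dataclass
class TokenBudget:
    daily_limit: int
    per_request_limit: int
    used_today: int = 0

    def check(self, prompt: str, max_output_tokens: int) -> int:
        # Worst case cost: estimated input plus the full output allowance.
        needed = estimate_tokens(prompt) + max_output_tokens
        if needed > self.per_request_limit:
            raise BudgetExceeded("request is too large")
        if self.used_today + needed > self.daily_limit:
            raise BudgetExceeded("daily token budget is used up")
        return needed

    def record(self, tokens: int) -> None:
        self.used_today += tokens


def guarded_call(
    budget: TokenBudget,
    prompt: str,
    max_output_tokens: int,
    call: Callable[[str], str],
) -> str:
    """Runs the model call only if the budget allows it."""
    try:
        reserved = budget.check(prompt, max_output_tokens)
    except BudgetExceeded as exc:
        # Failure UX: say what happened instead of failing silently.
        return f"Sorry, this request can't be processed right now ({exc})."
    budget.record(reserved)
    return call(prompt)
```

Reserving the worst-case cost up front is deliberately conservative; a
variant could record the actual usage reported by the API after the call.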
Best Early Recommendation
Start with a premium model for customer-facing outputs and only
optimize cost after usage patterns stabilize for 2-4 weeks.