---
title: "AI operations for small teams: a starter playbook"
description: "Most AIOps writing assumes a 200-person engineering org. Here's what actually works at five, fifteen, and forty people, with specific moves at each tier."
slug: "ai-operations-for-small-teams"
date: "2026-04-02"
updated: "2026-04-02"
author: "Nick Barnard"
authorAvatar: "https://images.unsplash.com/photo-1535713875002-d1d0cf377fde?q=80&w=200&auto=format&fit=crop"
image: "https://images.unsplash.com/photo-1551434678-e076c223a692?q=80&w=2000&auto=format&fit=crop"
category: "Operations"
tags:
  - AI operations
  - AIOps
  - Small teams
  - Startup
primaryKeyword: "AI operations for small teams"
secondaryKeywords:
  - "AIOps small business"
  - "AI for startups"
  - "AI ops playbook"
canonicalUrl: "https://tolly.ai/blog/ai-operations-for-small-teams"
markdownUrl: "https://tolly.ai/blog/ai-operations-for-small-teams.md"
featured: false
published: true
---

If your team is under 50 people, ignore most "AI operations" advice on the internet. It's written for orgs with dedicated platform teams. The version that works for you is shorter, cheaper, and more honest about what one or two engineers can actually maintain. Three tiers, three different playbooks.

## Why "AIOps" needs translation for small teams

The big-company AIOps stack — dedicated MLOps platforms, model registries, multi-region inference clusters, observability across thousands of model calls per second — is designed for problems you don't have. Adopting it would consume the same engineers who are also doing your product work. The right starter playbook for small teams optimizes for: *one engineer can fix it on a Tuesday*.

Each tier below comes with concrete moves: what to do, and what to skip.

## Tier 1: under 10 people

You don't have an "AI strategy." You have a couple of pain points where AI could realistically help, and a budget that won't survive tooling sprawl.

**What to do:**

- Pick **one workflow** — usually email triage, lead qualification, or document extraction. Not three.
- Use a hosted model API (OpenAI, Anthropic) with simple prompt engineering. Skip fine-tuning, RAG, agents, and anything else that adds infrastructure.
- Log every prompt and response to a Postgres table. That's your eval dataset, your audit trail, and your debugging tool, all in one row.
- Put a human in the loop on every output for the first month. After 30 days, decide which outputs are reliable enough to auto-action.
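
The logging and human-review steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: it uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere, and the table name `llm_calls`, its columns, and the helper functions are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Assumption: sqlite3 stands in for Postgres; table and column names are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,
    workflow TEXT NOT NULL,
    model TEXT NOT NULL,
    prompt TEXT NOT NULL,
    response TEXT NOT NULL,
    human_verdict TEXT  -- NULL until a reviewer approves or rejects
)
"""

def open_log(path=":memory:"):
    """Open the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def log_call(conn, workflow, model, prompt, response):
    """Store one prompt/response pair and return its row id."""
    cur = conn.execute(
        "INSERT INTO llm_calls (created_at, workflow, model, prompt, response) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), workflow, model, prompt, response),
    )
    conn.commit()
    return cur.lastrowid

def record_verdict(conn, call_id, verdict):
    """Attach the human reviewer's decision ('approved' or 'rejected')."""
    conn.execute("UPDATE llm_calls SET human_verdict = ? WHERE id = ?", (verdict, call_id))
    conn.commit()

def approval_rate(conn, workflow):
    """Share of reviewed outputs a human approved; None if nothing was reviewed."""
    row = conn.execute(
        "SELECT AVG(human_verdict = 'approved') FROM llm_calls "
        "WHERE workflow = ? AND human_verdict IS NOT NULL",
        (workflow,),
    ).fetchone()
    return row[0]
```

After the first month, something like `approval_rate(conn, "email_triage")` gives you a number to argue about when deciding which outputs to auto-action, and the approved rows double as a starting eval set.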

**What to skip:**

- Vector databases. You almost certainly don't need them yet.
- Custom model hosting. You almost certainly don't need it ever.
- A separate "AI tool" budget line. Keep AI spend bundled with the workflow it powers.

## Tier 2: 10-25 people

You probably have your first dedicated operations or RevOps person now. AI starts to be worth a small layer of infrastructure — not a platform, just enough scaffolding that you can iterate.

**What to do:**

- Add structured outputs (Anthropic's tool use, OpenAI's structured outputs). Parsing free-text responses is a major source of debugging time at this stage.
- Track basic LLM observability: token counts, latency, model version, cost per workflow. PostHog or Langfuse is enough — you don't need an enterprise tool.
- Define an **eval set** of 20-50 real examples per workflow. Run them whenever you change the prompt. This single practice catches most prompt regressions before they reach production.
- Move secrets out of `.env` files into a real secret store (Doppler, 1Password, AWS Secrets Manager). Rotation matters now.
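
An eval set does not need a framework. The sketch below shows one possible shape, assuming the workflow returns JSON (as it would with structured outputs) and that each example pins down only the fields you care about. `EVAL_SET`, `run_evals`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real provider call.

```python
import json

# Hypothetical eval set: in practice, 20-50 real examples pulled from your logs.
EVAL_SET = [
    {"input": "Hi, I'd like pricing for 40 seats.", "expected": {"category": "sales"}},
    {"input": "I can't log in since yesterday.", "expected": {"category": "support"}},
]

def run_evals(call_model, eval_set):
    """Run every example through call_model and report pass rate and failures."""
    failures = []
    for case in eval_set:
        raw = call_model(case["input"])
        try:
            output = json.loads(raw)
        except json.JSONDecodeError:
            failures.append({"input": case["input"], "reason": "invalid JSON"})
            continue
        if not isinstance(output, dict):
            failures.append({"input": case["input"], "reason": "not a JSON object"})
            continue
        # Compare only the fields the example specifies.
        wrong = [k for k, v in case["expected"].items() if output.get(k) != v]
        if wrong:
            failures.append({"input": case["input"], "reason": f"mismatched fields: {wrong}"})
    total = len(eval_set)
    pass_rate = (total - len(failures)) / total if total else 0.0
    return {"pass_rate": pass_rate, "failures": failures}

def fake_model(text):
    """Stand-in for a hosted model call; swap in your provider's client."""
    category = "sales" if "pricing" in text else "support"
    return json.dumps({"category": category})

if __name__ == "__main__":
    report = run_evals(fake_model, EVAL_SET)
    print(report["pass_rate"], report["failures"])
```

Run it by hand after every prompt edit and read the failures, not just the pass rate. The failing inputs usually tell you which edit caused the regression.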

**What to skip:**

- A formal model registry. You have two or three workflows, each with a clear primary model. Nothing to register.
- Multi-region failover for AI calls. Your customers won't notice.
- A dedicated ML engineer. The skills you need are *system design* and *operations literacy*, not modeling.

## Tier 3: 25-50 people

Now AI is doing real work, and the cost of an outage or a regression is starting to be visible. You need slightly more rigor — but still not the enterprise stack.

**What to do:**

- Promote your eval sets from "scripts" to "CI." Every prompt change runs evals before merge. Fail the build on regressions.
- Introduce per-workflow rate limiting and circuit breakers. When the model API has a bad day, your downstream systems shouldn't.
- Track **model drift**: the same prompts can behave differently after a model version bump. Pin model versions explicitly, and treat every version change like a dependency upgrade (planned, with a rollback path).
- Have an explicit on-call escalation for AI workflows. Who gets paged when the qualifier starts mis-routing leads at 2am?
- Document the fallback for every AI workflow. If the model is down, what does the system do? "Queue the work for human review" is a perfectly fine answer — but it has to be designed in, not improvised.
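
The circuit breaker and the "queue for human review" fallback fit together naturally. What follows is a rough sketch under simple assumptions: a single process, a plain list standing in for a real review queue, and hypothetical names throughout (`CircuitBreaker`, `run_workflow`, `review_queue`). The thresholds are illustrative, not recommendations.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cooling-off period."""

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before retrying
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def allow(self):
        """Return True if a call may go through right now."""
        if self.opened_at is None:
            return True
        # After the cooling-off period, let trial calls through again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

def run_workflow(item, call_model, breaker, review_queue):
    """Call the model, or queue the item for human review if that is not possible."""
    if not breaker.allow():
        review_queue.append(item)  # designed-in fallback, no model call made
        return None
    try:
        result = call_model(item)
    except Exception:
        breaker.record_failure()
        review_queue.append(item)
        return None
    breaker.record_success()
    return result
```

The design choice worth copying is that every failure path ends in the same place: the item lands in the review queue. Nothing is dropped, and downstream systems never see a half-finished result.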

**What to skip:**

- Building your own foundation model. Still no.
- A dedicated platform team. You probably want one or two engineers with a strong operations bent, not a separate org.

## Common failure modes at every tier

| Failure mode | What it looks like | The fix |
| --- | --- | --- |
| Prompt drift | Outputs slowly degrade as the prompt is edited piecemeal | Version-control prompts, run evals before deploys |
| Silent model upgrades | Model version changes and outputs shift overnight | Pin model versions explicitly |
| Cost surprise | Monthly bill triples after a code change | Per-workflow cost dashboards, alerts on weekly delta |
| Hallucination in production | Customer sees wrong data confidently asserted | Human-in-the-loop on customer-facing outputs |
| Lost audit trail | Compliance asks "what did the model say?" and you can't answer | Log every input/output, even when you think you won't need to |
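
The "alerts on weekly delta" fix for cost surprises can start as a small script over your call log. The sketch below assumes each logged call carries a workflow name, an ISO week label, and a cost in dollars. Those field names, the function names, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict

def weekly_cost_by_workflow(calls):
    """Sum cost per (workflow, week). Each call is a dict with the
    hypothetical keys 'workflow', 'week', and 'cost_usd'."""
    totals = defaultdict(float)
    for call in calls:
        totals[(call["workflow"], call["week"])] += call["cost_usd"]
    return dict(totals)

def cost_alerts(totals, previous_week, current_week, max_increase=0.5):
    """Return (workflow, fractional_increase) for workflows whose spend grew
    by more than max_increase week over week."""
    alerts = []
    for workflow in sorted({wf for wf, _ in totals}):
        before = totals.get((workflow, previous_week), 0.0)
        after = totals.get((workflow, current_week), 0.0)
        if before == 0.0:
            continue  # no baseline yet, so there is nothing to compare against
        change = (after - before) / before
        if change > max_increase:
            alerts.append((workflow, round(change, 2)))
    return alerts
```

Feeding this from the same table that stores prompts and responses keeps the cost view and the audit trail in one place.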

## Frequently asked questions

### When should we hire our first ML engineer?

Usually around 50 engineers, not 50 employees. Before that, your AI workflows benefit more from generalist engineers with operational discipline than from specialists.

### What's the minimum viable AI ops stack?

Postgres for logging, a hosted model API, an eval script you can run in CI, and version-controlled prompts. That's it. Anything more elaborate at under 25 people is usually overhead, not capability.

### Should we use an AI gateway?

Probably not until you're running three or more model providers in production. A gateway adds latency, debugging surface, and an extra vendor relationship. Tier 3 is a reasonable time to evaluate it.

## The takeaway

The most common AI operations mistake at small companies is adopting the toolchain of a large one. Buy what you can buy. Build what you can maintain. Log everything. Have a human in the loop until you've earned the right to remove them. Most of what makes AI workflows reliable is boring infrastructure done well, not novel ML practice.
