AI · 8 Dec 2025 · 10 min read

AI for SMEs — starting small, staying useful

We've added LLM features to nine projects this year. Here's what worked, what wasted budget, and how to scope your first AI feature.

In the last year we've shipped AI features into nine separate client projects. Some have been substantial — a full AI legal triage agent, a RAG chatbot on twenty years of journalism, an embeddable widget for law firms. Some have been small — a "summarise this thread" button, a smarter search box, an autocomplete that knows the user's project context.

The substantial ones get the press. The small ones are where most SMEs should start.

This is what we've learned about adding AI to existing line-of-business systems without setting fire to the budget.

The default mistake: building a chatbot first

When a business decides it wants to add AI, the first instinct is usually to build a chatbot. There's a reason for this — chatbots are the most visible, most demoable, most "AI-looking" thing. They photograph well in board presentations.

For most SMEs, a chatbot is the wrong place to start.

A chatbot is the most exposed surface AI can have. It is also the most expensive to do well, the most expensive to evaluate, and the most expensive to keep working when prompts and models drift. The chatbots we've built for clients — the LawyerClientConnect AI legal agent, the Gasworld archive chatbot — are substantial pieces of engineering, with months of investment behind them. They earn that investment because they're core to the product. A chatbot as a "let's see if AI helps" experiment will burn budget and ship something the team doesn't trust.

What we suggest to clients instead: start with one low-stakes AI feature inside the workflow they already have. Not a new surface. A small enhancement to an existing one.

What "low-stakes" looks like

Examples of first AI features that have actually shipped and stuck:

  • Auto-summarising long email threads when a team member opens a case. The summary is advisory; the original is one click away.
  • Suggesting tags when an admin uploads a document. The user can accept, edit, or ignore.
  • Drafting a first-pass response to a customer enquiry. The agent never sends; the human reviews.
  • Smarter search that understands intent, not just keywords, across an internal knowledge base.
  • Extracting structured fields from messy free-text input (e.g. addresses, dates, references).

What these have in common:

  1. The user is always in the loop. AI suggests; human decides.
  2. The cost of being wrong is bounded — at worst, a few seconds wasted.
  3. The value is incremental but real, every time it works.
  4. No new UI metaphor. It's a button or a field in software people already use.

These are not impressive AI demos. They are useful AI features. There is a difference.

Cost: the bit nobody plans for

LLM costs are easy to underestimate by an order of magnitude.

The thing that catches people out: the same feature costs vastly different amounts depending on how you implement it. A "summarise this conversation" feature might cost £0.001 per call or £0.10 per call depending on whether you're sending the whole conversation, only the last 50 messages, or a pre-summarised running summary.
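To make that spread concrete, here is a back-of-envelope sketch in Python. Every number in it (token counts, per-token prices, call volume) is an illustrative assumption, not a real tariff or a client figure; substitute your provider's current pricing and your own usage estimates.

```python
from dataclasses import dataclass


@dataclass
class CallShape:
    """Rough size of one LLM call. All figures here are illustrative assumptions."""
    name: str
    input_tokens: int
    output_tokens: int


def monthly_cost(shape: CallShape, calls_per_month: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Tokens x price x expected calls per month, in the currency of the prices."""
    per_call = ((shape.input_tokens / 1000) * price_in_per_1k
                + (shape.output_tokens / 1000) * price_out_per_1k)
    return per_call * calls_per_month


# Hypothetical prices per 1,000 tokens; check your provider's price list.
PRICE_IN, PRICE_OUT = 0.002, 0.008

# Three ways of implementing the same "summarise this conversation" feature.
shapes = [
    CallShape("whole conversation", input_tokens=40_000, output_tokens=300),
    CallShape("last 50 messages", input_tokens=5_000, output_tokens=300),
    CallShape("running summary", input_tokens=800, output_tokens=300),
]

for shape in shapes:
    cost = monthly_cost(shape, calls_per_month=10_000,
                        price_in_per_1k=PRICE_IN, price_out_per_1k=PRICE_OUT)
    print(f"{shape.name:<20} {cost:>10.2f} per month")
```

The point of the exercise is the ratio between the rows, not the absolute figures: the input size you choose to send dominates the bill.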

The rules of thumb we use:

  • Estimate before you build. Tokens × price × expected calls per month. Do this on the back of an envelope before any code.
  • Cache aggressively. If the same input gives the same output, cache the output. We keep our own application-level cache for that, and use Azure OpenAI's prompt caching where available to reduce the cost of repeated prompt prefixes.
  • Use the smallest model that works. GPT-4-class models are excellent and not always necessary. Many of our enhancement features run perfectly well on smaller/cheaper models. Don't reach for the most expensive model first.
  • Separate "experimental" from "production" deployments. Experiments will burn tokens you didn't plan for. Cap them.
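An application-level cache of the kind mentioned above can be very small. The sketch below is a minimal in-memory version, assuming the feature is deterministic enough that identical inputs should return identical outputs. The `ResponseCache` class, the `call_model` signature, and `fake_model` are hypothetical names for illustration, not part of any provider SDK.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """In-memory cache keyed on model + prompt.

    A production system would more likely use a shared store with expiry;
    this is only a sketch of the idea.
    """

    def __init__(self, call_model: Callable[[str, str], str]):
        self._call_model = call_model  # hypothetical: (model, prompt) -> text
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash the inputs so the key stays short however long the prompt is.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt)  # the only paid call
        self._store[key] = result
        return result


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a paid API call."""
    return f"[{model}] summary of {len(prompt)} chars"


cache = ResponseCache(fake_model)
cache.complete("small-model", "Summarise: ...")
cache.complete("small-model", "Summarise: ...")  # served from the cache
print(cache.hits, cache.misses)
```

Including the model name in the key matters: if you later switch models, you do not want to serve answers generated by the old one.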

For one of our clients, the first prototype, before we tuned model selection and caching, was running at roughly 4× our cost estimate for the production feature. Tuning is real work and earns its keep.

RAG: usually right, sometimes overkill

The default architecture for "AI on top of our data" is Retrieval-Augmented Generation — index your documents, retrieve the relevant ones at query time, pass them to the model with the question.
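The three steps (index, retrieve, pass to the model) can be sketched in a few lines of Python. This is a toy: word-count cosine similarity stands in for real embeddings and a vector index, the sample documents are invented, and the function names are hypothetical. It only shows the shape of the pipeline up to the point where the prompt would be sent to a model.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> Counter:
    """Lower-cased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n\n".join(retrieve(question, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented sample documents, for illustration only.
docs = [
    "Helium prices rose sharply after the plant shutdown.",
    "The annual conference will be held in Bath next spring.",
    "New hydrogen storage rules come into force in January.",
]
print(build_prompt("What happened to helium prices?", docs))
```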

This is the right pattern most of the time. It's what powers the Gasworld archive chatbot, where two decades of articles are continuously indexed and queryable. It scales reasonably, costs predictably, and keeps your data out of the model's training set.

But sometimes RAG is overkill.

For a small set of relatively static documents — fewer than a hundred pages, total — you can just put the whole thing in the prompt context. No vector database, no embedding pipeline, no chunking strategy. With large context windows now standard, "stuff it all in" is a legitimate architectural choice for small corpora.

Our rule of thumb: under ~50 pages of source material, put it in context. Over that, build a RAG pipeline. The break-even is approximate and the right answer depends on usage frequency, but the principle holds: don't build infrastructure you don't need.
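Written down as code, the rule of thumb is little more than a threshold check plus a prompt that carries the whole corpus. The sketch below assumes you can count pages up front; `choose_architecture` and `full_context_prompt` are hypothetical helper names, and the threshold is the approximate figure above rather than a hard limit.

```python
PAGE_THRESHOLD = 50  # approximate rule of thumb, not a hard limit


def choose_architecture(page_count: int, threshold: int = PAGE_THRESHOLD) -> str:
    """Small corpora go straight into the prompt; larger ones get a RAG pipeline."""
    return "full-context" if page_count < threshold else "rag"


def full_context_prompt(question: str, pages: list[str]) -> str:
    """The 'stuff it all in' option: no retrieval step, every page in the prompt."""
    corpus = "\n\n".join(pages)
    return f"Documents:\n{corpus}\n\nQuestion: {question}"


print(choose_architecture(30))
print(choose_architecture(400))
```

Usage frequency still matters: a small corpus queried thousands of times a day pays for the full context on every call, which can tip the balance back towards retrieval.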

Evaluations: the unsexy thing that makes AI features actually work

Here is the single biggest difference between AI features that work in production and AI features that demo well and then disappoint: evaluations.

When you ship a normal feature, you write tests. When you ship an AI feature, you also need to write tests — but the tests look different. They look like:

  • A set of representative inputs that the feature has to handle.
  • Expected behaviour for each, in language a human (not a unit-test framework) can judge.
  • A way to run the inputs through the feature and see whether the outputs are acceptable.

For a small AI feature, this might be twenty hand-curated examples. For a substantial one, it's hundreds. The evaluations are the asset you build alongside the feature, and they're how you change models or prompts without breaking things in production.
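A minimal evaluation harness along those lines might look like the sketch below. It is an assumption about structure, not a description of any particular framework: `EvalCase`, `run_evals`, and `fake_summarise` are hypothetical names, and the automated `check` functions are only rough stand-ins for the plain-language expectation a human would judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    input_text: str
    expectation: str              # plain-language description for human reviewers
    check: Callable[[str], bool]  # automated approximation of that expectation


def run_evals(feature: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the feature and return the pass rate (0.0 to 1.0)."""
    passed = 0
    for case in cases:
        output = feature(case.input_text)
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}: {case.expectation}")
    return passed / len(cases) if cases else 0.0


def fake_summarise(text: str) -> str:
    """Stand-in for the real model-backed feature: returns the first sentence."""
    return text.split(".")[0].strip() + "."


cases = [
    EvalCase("short thread", "Invoice 123 is overdue. Please chase the client.",
             "Mentions the invoice number", lambda out: "123" in out),
    EvalCase("stays brief", "A. " * 200,
             "Summary is under 100 characters", lambda out: len(out) < 100),
]

pass_rate = run_evals(fake_summarise, cases)
print(f"pass rate: {pass_rate:.0%}")
```

Because `run_evals` takes the feature as a plain function, the same cases can be run against a new prompt or a new model before anything changes in production.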

Most SME projects skip this and live to regret it. The agentic features we've shipped for clients all have evaluation suites attached. The small enhancement features sometimes don't, and that's the technical debt we pay for over time.

If you're scoping your first AI feature, budget time for the evaluation harness, not just for the feature itself. It's roughly 30% of the engineering effort and 80% of the long-term value.

Picking the model: don't get attached

Models change. The model that was best in April is not the model that's best in November. Prices change. New providers appear. The Azure OpenAI service offers different models at different latency and price points; the OpenAI direct service offers others; Anthropic offers others.

The pattern we now follow in every project:

  • Wrap the model behind an interface. Your application code calls IAiService, not OpenAIClient directly.
  • Configuration, not code, picks the model. Switching models is a setting change.
  • Run the new model through your evaluations before switching.

This is engineering discipline more than AI expertise. It costs an hour to set up at the start of a project. It saves weeks of work two years later when the landscape has changed.
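As a rough illustration of the pattern, here is a Python analogue of the interface-plus-configuration approach. The `AiService` protocol mirrors the IAiService idea; the two fake model classes, the `PROVIDERS` registry, and the `ai_model` config key are all hypothetical. Real implementations would wrap a provider's SDK client behind the same method.

```python
from typing import Protocol


class AiService(Protocol):
    """Python analogue of the IAiService interface described above."""

    def complete(self, prompt: str) -> str: ...


class FakeSmallModel:
    """Stand-in for a wrapper around a smaller, cheaper model."""

    def complete(self, prompt: str) -> str:
        return f"small: {prompt[:20]}"


class FakeLargeModel:
    """Stand-in for a wrapper around a larger, more expensive model."""

    def complete(self, prompt: str) -> str:
        return f"large: {prompt[:20]}"


# Registry of available back ends; real entries would wrap provider SDK clients.
PROVIDERS: dict[str, type] = {
    "small": FakeSmallModel,
    "large": FakeLargeModel,
}


def create_ai_service(config: dict) -> AiService:
    """Pick the implementation from configuration, not from application code."""
    name = config["ai_model"]
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown ai_model: {name!r}") from None


def suggest_tags(service: AiService, document_text: str) -> str:
    """Application code depends only on the AiService interface."""
    return service.complete(f"Suggest tags for: {document_text}")


config = {"ai_model": "small"}  # e.g. loaded from a settings file or environment
service = create_ai_service(config)
print(suggest_tags(service, "Lease agreement, 12 pages"))
```

Switching models then means changing the `ai_model` setting and re-running the evaluation suite, with no change to `suggest_tags` or any other calling code.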

Where AI is genuinely worth substantial investment

When does it make sense to spend real budget on AI features? When the feature is:

  • Central to the product, not a nice-to-have around the edges.
  • Reducing genuine human effort at scale — minutes saved per use, thousands of uses per month.
  • Doing something the business couldn't do before, not just doing the same thing slightly better.

The LawyerClientConnect AI legal agent fits all three: it's the front door of the product, it triages enquiries that would otherwise need a human, and it allows a marketplace model that couldn't exist without it.

The Gasworld archive chatbot fits all three: it's a subscriber-facing feature, it replaces research requests that used to need a human editor, and it unlocks self-service access to twenty years of content.

Most SME AI features don't fit all three. Most should be small, useful, evaluated, and incremental. The substantial bets are for products that are AI-shaped at their core.

The takeaway

If you're an SME thinking about adding AI: don't start with a chatbot. Start with one small feature inside an existing workflow, with the human in the loop, costed honestly, with an evaluation suite. Ship it, learn what your users actually do with it, and then decide whether to invest more.

The temptation to ship something visibly impressive is real. Resist. The right first AI feature is the one your users barely notice when it works and tell you about when it doesn't. That's where the value lives.


Red Owl IT is a Microsoft software consultancy in Bath. We've added AI features to nine line-of-business and consumer projects in the last year, from small in-workflow enhancements to substantial agentic systems. If you're scoping your first AI feature and want a second opinion on where to start, we'd happily talk it through.

Tags: ai, llm, azure-openai, rag, smes