Engineering · LLM Workflow · Claude — March 2026 · Andy

LLM Coding Workflow: Practical Patterns That Actually Work

Most guides on using LLMs for coding are either too vague ("just ask it to write code!") or too tool-specific. This is the workflow I've actually developed after using Claude to build production systems — the mental models, prompt patterns, context management techniques, and honest limits that make the difference between frustrating sessions and genuinely productive ones.

In this article
  1. The mental model shift
  2. The three modes
  3. CLAUDE.md: your project memory
  4. The iterative loop
  5. Prompt patterns that work
  6. What LLMs are bad at
  7. Tool setup
  8. The eval mindset
  9. Context management
  10. Red flags

The mental model shift

The biggest productivity unlock isn't a better prompt or a new tool — it's getting the mental model right. Most people start by treating LLMs like a smarter autocomplete or a fancier Stack Overflow. That framing causes friction: autocomplete is expected to be instant, and a Stack Overflow answer is expected to match your question exactly. An LLM is neither instant nor exact.

The mental model that actually works: the LLM is a junior developer who has read every book, every paper, and every GitHub repo that existed before their training cutoff. They know a lot — more than you do about many specific APIs — but they haven't touched your codebase, don't know your production quirks, and need explicit direction to produce something useful rather than something plausible.

This reframing changes how you interact. You stop fighting the tool when it generates something that needs editing. You expect to direct, not receive. You take responsibility for the spec — a bad output usually traces back to a bad prompt, not a bad model. And you start treating the session like a code review conversation rather than an oracle consultation.

The other shift: you are the architect, it is the drafter. LLMs are genuinely excellent at translating a clear spec into code. They are poor at inventing the spec from nothing. The sharper your mental picture of what you want, the better the output. Blurry requirements produce blurry code.

The three modes

I use LLMs in three distinct modes, and the prompting style for each is quite different. Conflating them is a common mistake.

Exploration
Understand unfamiliar code, concepts, or APIs. Ask broad questions. Invite tangents.
Implementation
Write new code to a spec. Be precise. Constrain scope. Specify exactly what you want.
Review
Audit existing code. Find bugs, suggest improvements, check for edge cases.

Exploration mode

When you're in unfamiliar territory — a new library, a language feature you haven't used, someone else's codebase — exploration mode is about building a mental map. Here, broad prompts are fine: "Explain how Python's asyncio event loop works and when I'd want to use it over threading." You want the LLM to teach you, so invite it to go deeper than you asked.

Useful openers for exploration: "Walk me through how this code works, section by section." "What are the main concepts I need to understand to use X?" "What would a senior engineer think about this design?"

Implementation mode

This is where most of the useful work happens, and also where most prompts fail. Vague implementation prompts produce plausible-looking code that doesn't quite do the right thing. The fix is to front-load your constraints:

# Bad
"Write a function to process user uploads"

# Better
"Write a Python function that accepts a FastAPI UploadFile,
validates it's a JPEG or PNG under 5MB, saves it to ./uploads/
with a UUID filename, and returns the filename.
Use pathlib. Don't use any external libraries beyond FastAPI itself.
Raise HTTPException with appropriate status codes on validation failures."

The more specific the spec, the less editing you do. I've found that the time spent writing a tight prompt is always recovered in reduced back-and-forth.
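For reference, here's a sketch of the logic the "better" prompt describes. This is not from the article and leaves out the FastAPI plumbing (UploadFile, HTTPException) so it stays self-contained — the FastAPI-specific pieces are noted in comments:

```python
import uuid
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024  # 5MB limit from the prompt
# Magic-byte prefixes: content sniffing is more robust than trusting the filename
MAGIC = {b"\xff\xd8\xff": ".jpg", b"\x89PNG\r\n\x1a\n": ".png"}

def save_upload(data: bytes, dest: Path = Path("./uploads")) -> str:
    """Validate an image payload and save it under a UUID filename.

    In the FastAPI version, `data` would come from `await upload.read()`
    and the ValueErrors would be HTTPException(413) / HTTPException(415).
    """
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds 5MB limit")
    ext = next((e for magic, e in MAGIC.items() if data.startswith(magic)), None)
    if ext is None:
        raise ValueError("not a JPEG or PNG")
    dest.mkdir(parents=True, exist_ok=True)
    name = f"{uuid.uuid4().hex}{ext}"
    (dest / name).write_bytes(data)
    return name
```

Notice how every constraint in the prompt maps to a line of code — that's what a tight spec buys you.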

Review mode

Review mode is underused. Paste a function and ask: "What are the edge cases this doesn't handle?" or "What would break under concurrent access?" or "Is there a simpler way to write this?" LLMs are good at this — they've seen thousands of bug patterns and can match your code against them without emotional attachment to the implementation.

I use review mode before committing anything I'm uncertain about. It's cheaper than a human review for catching obvious issues, and it often surfaces concerns I hadn't considered.

CLAUDE.md: your project memory

The single highest-leverage thing I've done for LLM-assisted coding is maintaining a CLAUDE.md file at the root of each project. Claude Code (and other tools) read this file automatically at the start of every session. Everything in it becomes ambient context that you don't have to re-explain.

Here's roughly what I put in a CLAUDE.md for a typical web project:

# Project: My API Service

## Architecture
- FastAPI backend, PostgreSQL (asyncpg), Redis for caching
- Deployed on VPS via systemd, nginx reverse proxy on port 8080
- No Docker in production — plain virtualenv at /opt/myapp/venv

## Conventions
- All DB queries go in db/ folder, never inline in routes
- Use snake_case for Python, kebab-case for URL paths
- Error responses: {"error": "message", "code": "ERROR_CODE"}
- Log with structlog, not print() or logging directly

## Key gotchas
- asyncpg connections are NOT thread-safe — always use pool.acquire()
- Redis keys expire after 1hr by default (see cache.py:DEFAULT_TTL)
- The /health endpoint must NOT require auth (nginx healthcheck uses it)

## What not to touch
- Don't modify migrations/ directly — use alembic revision --autogenerate
- Don't change the nginx config — it's managed separately

This file pays dividends every session. Without it, you spend the first 5 minutes re-establishing context: "This is a FastAPI project, we use asyncpg, here are our conventions..." With it, Claude already knows. It generates code in your style, avoids your known pitfalls, and doesn't suggest solutions that conflict with your architecture.

I use the CLAUDE.md Writer tool to generate a solid starting point for new projects. You describe the project and it produces a structured CLAUDE.md you can then customize. Takes 30 seconds versus writing one from scratch.

What to put in CLAUDE.md
Architecture decisions and why (not just what), naming conventions, known gotchas and workarounds, things NOT to do, file organization rules, how to run tests locally, environment quirks. The more it captures decisions you've already made, the more the LLM acts as a teammate rather than a stranger.

The iterative loop

The worst workflow pattern I see is the "big task" approach: dump five files into context, ask for a complex refactor, try to merge the output. This almost always fails — the output conflicts with something the LLM didn't account for, the changes are too large to review properly, and you end up spending more time untangling than you saved.

The loop that works is: small task → generate → review → commit → repeat.

Specifically:

  1. One thing at a time. "Add input validation to this one function" beats "refactor the entire validation layer."
  2. Small commits. Commit generated code you're happy with before asking for the next thing. This gives you clean rollback points and keeps you honest about what's actually tested.
  3. Test before continuing. Run the tests (or manually verify) after each generated chunk. Don't stack three unverified changes on top of each other.
  4. Fresh context for new problems. When you start a different problem, start a fresh session. Old context from an unrelated task doesn't help and sometimes confuses.

Why "never give an LLM a 5-file task" holds as a rule: the model has no reliable way to know what it doesn't know about your codebase. A single-file task exposes exactly the context needed. A 5-file task requires the model to correctly understand all the interactions between those files, and it will silently make plausible-but-wrong assumptions about anything it's uncertain about.

The test-then-ask rule
Write the test first, then ask the LLM to make it pass. This is the most reliable way to get correct code. The test is the spec. The LLM can't misinterpret what "green test" means, and you can't argue with the result.
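As a toy illustration of test-first (the `slugify` task here is hypothetical, not from the article): the test is written first and becomes the spec, and the function below is what you'd ask the LLM to produce against it.

```python
def slugify(title: str) -> str:
    """Candidate implementation the LLM would be asked to write
    to make the pre-written test pass."""
    # Lowercase, drop anything that isn't alphanumeric or a space,
    # then join the remaining words with hyphens.
    cleaned = "".join(c if c.isalnum() or c == " " else "" for c in title.lower())
    return "-".join(cleaned.split())

def test_slugify():
    # This test existed before the function: it IS the spec.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere ") == "spaces-everywhere"
    assert slugify("") == ""

test_slugify()
```

If the generated function fails the test, the feedback loop is unambiguous: paste the failure back and ask again.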

Prompt patterns that work

Some prompt patterns I use constantly:

"Before writing any code, tell me your plan"

This is the single most useful pattern I've found. When the task is nontrivial, asking for a plan first catches misunderstandings before they're written in code. The LLM explains its approach, you spot the wrong assumption, you correct it in one message rather than after 50 lines of wrong code.

Before writing any code, explain your plan for implementing this.
List the functions you'll create, the data structures you'll use,
and any assumptions you're making about the existing codebase.

Rubber duck debugging

When I'm stuck, I explain the bug to the LLM in detail — not to get an answer, but because explaining it forces me to articulate every assumption. Half the time I find the bug while writing the explanation. The other half, the LLM catches the thing I glossed over.

I'm debugging a race condition. Here's what I know:
- The symptom is [X] but only under concurrent load
- I've checked [A] and [B], they look fine
- I think it might be [C] but I can't see why...
[paste relevant code]

Walk me through where I might be wrong in my assumptions.

"What are 3 different ways to solve this?"

When I'm not sure of the right approach, asking for alternatives forces the LLM to surface tradeoffs rather than just commit to whatever approach came to mind first. I often learn something — a pattern or library I hadn't considered — and the contrast makes the right choice clearer.

Ask it to critique its own output

[After receiving generated code]

Now critique this. What are the weaknesses, edge cases it doesn't handle,
or ways it could fail in production? Be specific and honest.

This works surprisingly well. The LLM is often harsher on its own output than I would be, and it catches things the initial generation missed. It's most useful for security-sensitive or concurrency-sensitive code.

"Explain this to me like I'll need to maintain it in 6 months"

When reviewing unfamiliar code, this framing produces useful explanations. It shifts from "here's what it does" to "here's what you need to know to not break it."

What LLMs are bad at

Honest accounting matters here. The faster you accept what LLMs can't do, the less time you waste fighting those limits.

File system state. The LLM has no idea what's changed in your project since the conversation started. If you create a file mid-session, rename something, or add a dependency, you have to tell it. It will confidently reference the old state unless you update it explicitly.

Your production quirks. "It works on my machine" situations, environment-specific behavior, infrastructure idiosyncrasies — the LLM doesn't know these exist. Code that looks correct will still fail on your specific stack for reasons the LLM couldn't have known. This is normal; it's not a model failure. You have to bridge that gap.

Large codebase navigation. Asking "find all the places we handle auth" on a 100k-line codebase doesn't work well via conversation. The model doesn't have reliable access to files it hasn't seen, and even with a lot of context it can miss things or confuse files. Use real tools for this — grep, an LSP, or MCP servers that give the LLM actual filesystem access. (More on that in tool setup.)

Long session consistency. After a long session, LLMs start to drift — contradicting decisions made earlier, forgetting constraints you mentioned 50 messages ago. This isn't a flaw to fix; it's a property to work around. Fresh sessions with good CLAUDE.md context beat long sessions that rely on the model remembering everything.

Knowing what it doesn't know. This is the dangerous one. LLMs sound just as confident when they're wrong as when they're right. They will rarely say "I'm not sure about this API — check the docs." They'll generate something plausible. Always verify generated code against actual documentation for anything you're unfamiliar with.

Tool setup

The tools matter. Here's what I actually use and why:

Claude Code

Claude Code (the official Anthropic CLI) is my primary tool for in-project coding work. It reads CLAUDE.md automatically, has direct file access so it can actually read and edit your code rather than relying on you to paste it, and supports skills — reusable instructions you can deploy for repeated workflows like "run tests and report failures" or "generate a migration file."

I've built a SKILL.md Generator for creating custom skill files, which can meaningfully accelerate repeated workflows. A skill is essentially a prompt template plus behavioral instructions that can be loaded into any session.

MCP servers for your stack

Model Context Protocol (MCP) servers are the fix for the "large codebase navigation" problem. Instead of pasting files into the conversation, you give the LLM tools it can call: read this file, search for this symbol, run this test. The LLM then navigates your project like a developer would rather than like someone reading a printout.

The practical effect: tasks that require understanding multiple files become tractable. The model can read the interface, find the implementation, check the tests, and come back with a grounded answer — not a guess based on what you pasted.
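This is not the MCP SDK's actual API — just a sketch of the shape of the idea: a registry of named tools the host executes on the model's behalf, feeding results back as context. The tool names here are hypothetical.

```python
from pathlib import Path

# Hypothetical tool registry. In a real MCP server these would be declared
# tools with schemas; here they're plain callables for illustration.
TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    "find_symbol": lambda root, name: [
        str(p) for p in sorted(Path(root).rglob("*.py")) if name in p.read_text()
    ],
}

def dispatch(tool: str, **kwargs):
    """Execute one tool call the model requested and return the result.

    The model navigates by chaining calls: find the symbol, read the file,
    check the tests — instead of guessing from pasted snippets.
    """
    if tool not in TOOLS:
        raise KeyError(f"unknown tool: {tool}")
    return TOOLS[tool](**kwargs)
```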

My guide on how to build an MCP server covers building custom ones for your specific stack. The official MCP servers cover filesystem, git, and databases — good starting points for most projects.

OpenRouter for model flexibility

When I'm running automated pipelines or need to experiment with different models, I use OpenRouter rather than committing to a single provider's API. You get a single API key that routes to Claude, GPT-4, Llama, Mistral, and others. You can swap models without changing code, which is useful when evaluating or when rate limits hit.

See my OpenRouter pipeline guide for details on model routing decisions and fallback chains.
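The fallback-chain logic itself is small. A sketch — the model IDs and the `call` function are placeholders, not OpenRouter's actual client API:

```python
def complete_with_fallback(prompt, models, call):
    """Try each model in priority order; fall back when a call fails.

    `call(model, prompt)` is whatever client function hits your provider
    (for OpenRouter, an OpenAI-compatible chat completion request).
    """
    errors = {}
    for model in models:
        try:
            return model, call(model, prompt)
        except Exception as exc:  # rate limit, timeout, provider outage...
            errors[model] = exc
    raise RuntimeError(f"all models failed: {errors}")
```

Because the routing decision lives in one place, swapping the model list for an eval run means changing a list, not code.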

The eval mindset

Here's how you know if your LLM workflow is actually improving: you measure it.

This sounds more formal than it needs to be. At its simplest, "evals" just means having a consistent set of problems you can use to compare prompt strategies. For coding workflows, that can be as simple as a handful of real tasks from your own projects that you re-run whenever you change your prompts or your setup.

The reason this matters: without evals, you can't tell if a change to your CLAUDE.md improved things or not. You can't tell if a new prompt pattern actually helps or just feels better. Gut feel is unreliable here — the LLM's output varies enough that a single example proves nothing.

I run more formal evals for my AI quality scoring work (see the chatbot eval system in my pipeline guide), but even informal evals — "I tried asking for a plan first on my last 5 tasks, here's what I noticed" — are better than nothing.

My eval practice
I maintain a small set of 10 "benchmark tasks" — real problems from my projects. When I change my CLAUDE.md significantly or try a new prompt strategy, I run 3-5 of them and compare. It catches regressions and surfaces what actually helps versus what just sounds good in theory.
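An informal harness can be this small: each benchmark task pairs an input with a pass/fail check, and you compare pass rates across strategies. (The structure here is my own sketch, not the author's tooling.)

```python
def run_evals(tasks, solve):
    """Run a strategy against a fixed benchmark set.

    tasks: {name: (task_input, check_fn)} where check_fn returns True on pass.
    solve: the strategy under test (e.g. "plan-first prompt" vs "direct prompt").
    Returns (pass_rate, per_task_results) so regressions are visible by name.
    """
    results = {}
    for name, (task_input, check) in tasks.items():
        try:
            results[name] = bool(check(solve(task_input)))
        except Exception:
            results[name] = False  # a crash counts as a failure, not a skip
    return sum(results.values()) / len(results), results
```

Run it once per strategy and diff the per-task results — a single aggregate number hides which tasks regressed.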

Context management

Context is a finite resource. How you manage it determines whether the end of a session is as useful as the beginning.

Structure long sessions deliberately

If a session will span multiple subtasks, front-load the context you'll need throughout: the relevant file contents, the constraints, the end goal. Don't add context mid-session if you can avoid it — earlier additions are more reliably recalled than later ones.

When to use /clear

Use /clear (or your tool's equivalent context reset) when you're switching to an unrelated problem, when the session has started contradicting its own earlier decisions, or when a bad run has polluted the context and you want a clean slate.

Don't use /clear mid-task if the LLM is doing well. The context it has about what you've already done is valuable. The goal is to reset when the accumulated context is hurting more than helping — not just when the context window gets large.

When to continue versus restart fresh

Continue the session when: you're building on work the LLM already knows about, referencing code it wrote, or iterating on a design it understands. Restart fresh when: the problem domain is different, you got a bad output and want a clean slate, or you're starting the next day's work (stale sessions aren't worth loading).

CLAUDE.md makes fresh sessions low-cost. If the project context is in the file, a fresh session isn't starting from zero — it's starting from a good baseline.

Red flags

Signs that you're using LLMs in ways that will cost you:

Red flag

Copy-pasting without reading. If you're moving generated code into your project without reading it, you're accepting unknown behavior. Generated code isn't automatically correct. It's a draft that needs review. Treat it like a code review, not a free pass.

Red flag

No tests on generated code. If the LLM generates a function and you ship it without tests, you now own an untested function. The fact that an LLM wrote it doesn't change the quality bar. Write the test, make sure it passes, then commit.

Red flag

Ignoring hallucination in library calls. LLMs frequently invent API methods that don't exist, or use methods with the wrong signature. This is especially common for newer libraries with less training data. Always check generated library calls against the actual docs before running them in any non-trivial context.
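A cheap sanity check catches the crudest forms of this — the method not existing at all, or the call not matching the real signature. (This helper is my own sketch; it won't catch a real method used with wrong semantics.)

```python
import inspect

def check_call(module, name, *args, **kwargs):
    """Sanity-check a generated library call without executing it:
    the attribute must exist, be callable, and accept these arguments.
    Does NOT verify semantics — only that the call shape is possible."""
    fn = getattr(module, name, None)
    if not callable(fn):
        return False  # invented method
    try:
        inspect.signature(fn).bind(*args, **kwargs)
        return True
    except TypeError:
        return False  # wrong arguments for a real method
    except ValueError:
        return True   # some C builtins expose no signature; can't tell
```

For example, `check_call(json, "dumps", {"a": 1})` passes, while the plausible-sounding `check_call(json, "serialize", {"a": 1})` fails — which is exactly the kind of invention LLMs produce for less common libraries.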

Red flag

Using the LLM as the source of truth for security decisions. "Is this SQL query safe from injection?" is a reasonable LLM question for learning. Shipping the code because the LLM said it was safe is not a sufficient security review. LLMs make confident mistakes on security-sensitive code.

Red flag

Asking for too much at once. If your prompt is three paragraphs describing a complex feature, you're going to get a complex output you'll spend an hour reviewing. Break it down. The friction is a feature — it forces you to think through the steps before generating.

None of these make LLMs not worth using. They make them worth using carefully. The developers I've seen get the most out of LLMs are also the ones who review generated code most critically — not because they distrust the tool, but because they've learned where it fails.