Prompt Engineering Best Practices for Claude and GPT
Most people interact with language models the same way they'd type a Google search — a sentence or two, maybe a question, and then they hope for the best. It works for simple stuff. But the gap between what a model can do and what it actually does for the average user is enormous, and that gap is almost entirely about how you write your prompts.
I spend my days building tools that sit on top of Claude and GPT — everything from the chatbot on this site to evaluation harnesses and content pipelines. The techniques in this article come from that hands-on work, not from reading documentation. Some of them are well-known. Some are things I only figured out after watching hundreds of model responses go sideways.
If you've already read my guide to writing better AI prompts, consider this the next level. That piece covers the fundamentals. This one goes deeper into the specific techniques that separate amateur prompts from production-grade ones.
1. System Prompts: Setting the Stage
The system prompt is the single most important piece of text in your entire interaction. It runs before the user says anything, and it shapes every response that follows. Think of it as the job description you hand someone on their first day — it determines what role they play, what they prioritize, and what they avoid.
A weak system prompt: "You are a helpful assistant." That's technically correct but tells the model almost nothing. A strong system prompt defines the persona, the constraints, the format expectations, and the edge cases.
Weak: You are a helpful coding assistant.
Strong: You are a senior Python developer reviewing code for a production web application.
Rules:
- Point out bugs, security issues, and performance problems
- Suggest fixes with code examples
- If the code looks fine, say so briefly — don't invent problems
- Use Python 3.11+ conventions
- Never suggest print() for debugging — use the logging module
- Keep responses under 300 words unless the issue is complex
The difference in output quality between these two prompts is dramatic. The second one produces focused, consistent responses because the model knows exactly what role it's filling, what matters, and what to skip.
System prompt structure that works
After writing hundreds of system prompts, I've settled on a structure that reliably produces good results:
- Identity. Who the model is and what it does. One or two sentences.
- Core rules. The non-negotiable constraints. Bullet points work best here.
- Output format. How responses should be structured — length, format, tone.
- Edge cases. What to do when the request is ambiguous, outside scope, or potentially harmful.
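That four-part structure lends itself to a small helper if you assemble system prompts in code. A minimal sketch in Python; the function name and section labels are my own conventions, not any SDK's API:

```python
def build_system_prompt(identity, rules, output_format, edge_cases):
    """Assemble a system prompt from the four sections above:
    identity, core rules, output format, and edge-case handling."""
    sections = [
        identity.strip(),
        "Rules:\n" + "\n".join(f"- {r}" for r in rules),
        "Output format:\n" + output_format.strip(),
        "Edge cases:\n" + "\n".join(f"- {e}" for e in edge_cases),
    ]
    return "\n\n".join(sections)

prompt = build_system_prompt(
    identity="You are a senior Python developer reviewing code "
             "for a production web application.",
    rules=[
        "Point out bugs, security issues, and performance problems",
        "Suggest fixes with code examples",
    ],
    output_format="Keep responses under 300 words unless the issue is complex.",
    edge_cases=["If the code looks fine, say so briefly"],
)
```

The string this produces is what you pass as the system parameter of your API call, whichever provider you use.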
If you're working with Claude specifically, you can take this further with a CLAUDE.md file — a configuration file that gives the model persistent instructions across sessions, including memory, project context, and behavioral rules. It's the system prompt concept extended into a full operating manual.
2. Few-Shot Examples: Show, Don't Tell
You can spend paragraphs describing exactly what format you want, or you can just show the model two examples. The examples win almost every time.
Few-shot prompting means including sample input-output pairs in your prompt. The model picks up on the pattern and replicates it. This works for formatting, tone, reasoning style, and content structure.
Extract product info from the review. Format as shown.
Review: "Bought the Anker 737 charger for $89. Charges my MacBook Pro in about 90 minutes. Build quality is solid but it runs warm."
Product: Anker 737
Price: $89
Pros: Fast charging, solid build quality
Cons: Runs warm
Review: "The Sony WH-1000XM5 headphones ($349) have incredible noise cancellation. Battery lasts 30 hours. Only complaint is they don't fold flat anymore."
Product: Sony WH-1000XM5
Price: $349
Pros: Incredible noise cancellation, 30-hour battery
Cons: Don't fold flat
Review: "Just got the Kindle Paperwhite Signature Edition for $189. The warm light adjustment is great for night reading. Storage feels unlimited at 32GB. Wish it had physical page turn buttons though."
Product:
The model will follow the exact format of your examples without you having to describe the format at all. This is especially powerful for tasks where the desired output format is hard to describe in words but obvious from examples.
How many examples?
For most tasks, two to three examples hit the sweet spot. One example might not establish the pattern clearly enough. Five or more start eating into your context window without adding much benefit. The exception is classification tasks with many categories — there, one example per category helps the model understand the boundaries between classes.
Choose diverse examples
Pick examples that cover different scenarios. If all your examples show positive sentiment, the model will bias toward positive classifications. If all your examples are short inputs, the model might struggle with long ones. Deliberately include edge cases in your examples — they teach the model how to handle ambiguity.
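Keeping the examples as data, rather than hard-coding them into a prompt string, makes it easy to swap in diverse ones as you find new edge cases. A minimal sketch; the helper and its Review/Product layout are illustrative, not a library API:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from an instruction, a list of
    (input, expected output) pairs, and the final query."""
    parts = [instruction]
    for review, extraction in examples:
        parts.append(f'Review: "{review}"\n{extraction}')
    # End with the query and the start of the answer pattern,
    # so the model completes it in the established format.
    parts.append(f'Review: "{query}"\nProduct:')
    return "\n\n".join(parts)

examples = [
    ("Bought the Anker 737 charger for $89. Build quality is solid "
     "but it runs warm.",
     "Product: Anker 737\nPrice: $89\nPros: Solid build quality\nCons: Runs warm"),
]
fs_prompt = build_few_shot_prompt(
    "Extract product info from the review. Format as shown.",
    examples,
    "Just got the Kindle Paperwhite Signature Edition for $189.",
)
```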
3. Chain-of-Thought: Making Models Think Out Loud
Chain-of-thought (CoT) prompting is the single technique that most improves accuracy on reasoning tasks. The idea is simple: instead of asking the model to jump straight to an answer, you ask it to show its reasoning step by step.
Q: A store has 45 apples. They sell 60% on Monday and half of what's left on Tuesday. How many remain?
A: 9
Q: A store has 45 apples. They sell 60% on Monday and half of what's left on Tuesday. How many remain?
Let me work through this step by step.
- Start: 45 apples
- Monday: sell 60% of 45 = 27 apples sold, leaving 45 - 27 = 18 apples
- Tuesday: sell half of 18 = 9 apples sold, leaving 18 - 9 = 9 apples
A: 9 apples remain.
The answer is the same, but with chain-of-thought the model is far less likely to make arithmetic errors on harder problems. The step-by-step reasoning acts as a self-check — each intermediate step constrains the next one, reducing the chance of the model confabulating an answer.
When to use CoT
Chain-of-thought helps most on tasks that require multi-step reasoning: math, logic, code debugging, analysis, and planning. It helps less on tasks that are primarily about recall or creative generation. If the task has a clear right answer that requires steps to reach, use CoT. If it's an open-ended creative task, CoT can actually make the output feel stilted.
Triggering CoT
You can trigger chain-of-thought in several ways:
- Add "Think step by step" or "Let's work through this" to your prompt
- Include a worked example that shows the reasoning process (combining CoT with few-shot)
- Ask the model to "explain your reasoning before giving the final answer"
- With Claude, use the extended thinking feature for built-in reasoning that happens before the response
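In code, the first and third triggers amount to a thin wrapper around the question, plus a parser that skips the reasoning and keeps only the answer line. A sketch; the "Final answer:" marker is my own convention, not anything the APIs require:

```python
def with_cot(question, marker="Final answer:"):
    """Wrap a question with a chain-of-thought instruction that asks
    for reasoning first and a clearly marked answer line at the end."""
    return (
        f"{question}\n\n"
        "Think step by step. Explain your reasoning, then give the result "
        f'on its own line starting with "{marker}"'
    )

def extract_final_answer(response, marker="Final answer:"):
    """Pull the answer out of a CoT response, ignoring the reasoning above it."""
    for line in response.splitlines():
        if line.startswith(marker):
            return line[len(marker):].strip()
    return None
```

The marker matters: without it, you end up regex-guessing which part of the verbose response is the actual answer.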
4. Temperature and Sampling: When to Turn the Dial
Temperature controls how "creative" versus "predictable" the model's output is. A temperature of 0 makes the model always pick the most likely next token — effectively deterministic, consistent, but sometimes flat. Higher temperatures (0.7-1.0) introduce randomness, producing more varied and sometimes more interesting output.
Here's when to use what:
- Temperature 0-0.2: Code generation, data extraction, factual Q&A, classification, anything where you want the same input to produce the same output every time. This is where I run most production systems.
- Temperature 0.3-0.6: General conversation, content writing, summarization. Enough variety to feel natural, predictable enough to stay on topic.
- Temperature 0.7-1.0: Brainstorming, creative writing, generating diverse options. You want surprises here.
- Temperature above 1.0: Almost never useful. The output becomes increasingly incoherent. Avoid this unless you're deliberately looking for chaotic randomness.
One thing people miss: temperature interacts with your prompt specificity. A highly constrained prompt (detailed system prompt, strict format requirements, examples) naturally reduces output variance even at higher temperatures. A vague prompt at low temperature still produces mediocre results. Fix your prompt before tweaking temperature.
Top-p (nucleus sampling)
Top-p is temperature's lesser-known sibling. Instead of scaling the probability distribution, it cuts off the tail — the model samples only from the smallest set of tokens whose cumulative probability reaches p. A top-p of 0.9 means the model only considers the most likely tokens that together account for 90% of the probability mass.
In practice, adjusting either temperature or top-p (not both) is usually sufficient. Most API providers default top-p to 1.0, which means it has no effect and temperature does all the work. That default works fine for most use cases.
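One way to keep these ranges consistent across a codebase is a small preset table instead of scattering magic numbers through every call site. A sketch: the task-type names are mine, and the values follow the ranges above.

```python
# Task-type to sampling settings, following the ranges discussed above.
SAMPLING_PRESETS = {
    "extraction": {"temperature": 0.0},   # code, data extraction, classification
    "writing": {"temperature": 0.5},      # conversation, content, summarization
    "brainstorm": {"temperature": 0.9},   # creative work, diverse options
}

def sampling_params(task_type):
    """Return sampling kwargs for a chat-completion call.
    Adjusts temperature only, leaving top-p at its default of 1.0."""
    return SAMPLING_PRESETS.get(task_type, {"temperature": 0.3})
```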
5. Structured Output: Getting Parseable Results
When you need to feed model output into another system — a database, an API, a frontend — you need structured output, usually JSON. Both Claude and GPT support this, but getting reliable structured output takes some care.
The direct approach
The simplest method: describe the JSON schema you want in the system prompt and ask for JSON output.
Analyze the sentiment of the following customer review. Return your analysis as JSON with this exact schema:
{
  "sentiment": "positive" | "negative" | "mixed",
  "confidence": 0.0 to 1.0,
  "key_phrases": ["phrase1", "phrase2"],
  "summary": "one sentence summary"
}
Return only the JSON object, no other text.
Review: "The laptop is blazing fast and the screen is gorgeous, but the keyboard feels cheap and the trackpad is too small."
Using API-level enforcement
Both the Claude and OpenAI APIs support response format constraints. OpenAI's response_format: { type: "json_object" } guarantees syntactically valid JSON (and its stricter json_schema mode enforces a full schema), while Claude's tool-use schema validation lets you define the exact shape of the output. When available, use these — they eliminate parsing failures entirely.
Tips for reliable structured output
- Show the exact schema with example values, not just field names
- Use enum-style values ("positive" | "negative") instead of free text where possible
- Explicitly say "Return only the JSON object" to prevent the model from adding explanatory text around it
- For complex schemas, include one complete example in the prompt
- Always validate the output on your end — no prompt is 100% reliable without schema enforcement
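That last point deserves code. Here's a sketch of a validator for the sentiment schema from earlier; the fence-stripping handles a common failure mode where the model wraps its JSON in a markdown code block despite being told not to. The function name and error messages are my own:

```python
import json

REQUIRED_FIELDS = {"sentiment", "confidence", "key_phrases", "summary"}
ALLOWED_SENTIMENTS = {"positive", "negative", "mixed"}
FENCE = "`" * 3  # a markdown code fence

def parse_sentiment_response(raw):
    """Parse and validate model output against the sentiment schema,
    stripping the markdown fences models sometimes add anyway."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1]     # drop opening fence and language tag
        text = text.rsplit(FENCE, 1)[0]   # drop closing fence
    data = json.loads(text)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data['sentiment']!r}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

Wrap the call in a try/except and decide up front what happens on failure: retry the request, fall back to a default, or flag the item for review.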
6. Common Mistakes That Tank Your Results
I've reviewed a lot of prompts that produce bad output, and the same mistakes come up repeatedly:
Being vague when you need precision
"Write a good product description" leaves too much open. Good for whom? What length? What tone? What should it emphasize? The model will fill in all those blanks with its defaults, which may not match what you wanted.
Vague: Write a product description for this wireless charger.
Specific: Write a product description for the MagCharge Pro wireless charger. Target audience: tech-savvy professionals aged 25-40. Tone: confident, minimal, no hype. Length: 60-80 words. Emphasize: 15W fast charging, MagSafe alignment, and the aluminum build. Don't mention competitors.
Cramming too many tasks into one prompt
A prompt that asks the model to analyze data, generate a report, suggest improvements, write implementation code, and create test cases will produce mediocre results on all five. Break complex workflows into sequential prompts where each step's output feeds into the next.
Ignoring the role of context order
Models are sensitive to where information appears in the prompt. Instructions at the very start and very end get the most attention. Dense reference material in the middle often gets partially ignored on long prompts. Structure your prompts with this in mind: instructions first, reference material in the middle, a restatement of key requirements at the end.
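When you assemble prompts in code, this ordering can be enforced mechanically. A minimal sketch, assuming the three-part layout just described; the section labels are my own:

```python
def order_prompt(instructions, reference_material, key_requirements):
    """Place instructions first, reference material in the middle, and a
    restatement of key requirements at the end, where models attend most."""
    return (
        f"{instructions}\n\n"
        f"Reference material:\n{reference_material}\n\n"
        f"Remember: {key_requirements}"
    )
```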
Not testing with edge cases
Your prompt works great on your ten test examples. Then it hits production and encounters inputs you never considered — empty strings, extremely long inputs, inputs in a different language, adversarial inputs. Test your prompts against weird inputs before shipping them. I built the chatbot on this site through hundreds of rounds of this kind of edge-case testing.
Over-engineering simple tasks
Not every prompt needs a system prompt, five few-shot examples, chain-of-thought instructions, and JSON schema enforcement. If you're asking the model to rewrite a paragraph in a more casual tone, a one-sentence prompt is fine. Match the prompt complexity to the task complexity.
7. Claude vs. GPT: What Actually Differs
I work with both Claude and GPT regularly, and while the prompting fundamentals are the same, there are practical differences worth knowing:
Instruction following. Claude tends to follow instructions more literally, which is a strength for structured tasks but means you need to be careful about unintended constraints. If you tell Claude "never use bullet points," it won't use them even when they'd be the obviously right format. GPT is slightly more likely to interpret instructions flexibly.
System prompts. Claude gives more weight to system prompt instructions relative to user messages. This makes Claude more predictable in production settings where the system prompt defines strict behavior. GPT can sometimes be nudged away from system prompt instructions by persistent user messages.
Extended thinking. Claude's extended thinking feature lets the model reason internally before responding — essentially built-in chain-of-thought that doesn't appear in the output. This is excellent for complex reasoning tasks where you want the accuracy benefit of CoT without the verbose output. GPT doesn't have a direct equivalent.
Long context. Both models support long context windows, but they handle them differently. Claude tends to maintain better recall across very long contexts. For retrieval-heavy tasks with large documents, this matters.
Output length. GPT tends toward longer responses by default. Claude is more concise unless asked to elaborate. If you want thorough responses from Claude, explicitly ask for detail. If you want concise responses from GPT, set a word limit.
8. Putting It All Together
Here's a complete example combining multiple techniques into a single production-ready prompt:
System: You are a customer support classifier for a SaaS company. Your job is to categorize incoming tickets and extract key information.
Rules:
- Classify into exactly one category: billing, technical, feature_request, account, other
- Extract the urgency level: low, medium, high, critical
- Identify the product area if mentioned
- If the ticket mentions data loss or security, always set urgency to critical
- Return JSON only
Example input: "I've been charged twice for my Pro subscription this month. Please refund the duplicate charge."
Example output: {"category": "billing", "urgency": "medium", "product_area": "subscriptions", "summary": "Duplicate charge on Pro subscription, requesting refund"}
Example input: "The export feature has been broken since yesterday. I can't download any reports and my board meeting is tomorrow."
Example output: {"category": "technical", "urgency": "high", "product_area": "exports", "summary": "Export feature broken, blocking report downloads before board meeting"}
Now classify this ticket:
This prompt uses a clear system role, explicit rules (including an edge-case override for security issues), two diverse few-shot examples, structured JSON output, and constraints on the classification values. In practice, a prompt like this produces correct classifications upward of 95% of the time.
The iteration loop
No prompt is perfect on the first draft. The process that actually works:
- Write a first-draft prompt based on the techniques above
- Test it against 10-20 representative inputs
- Identify failure patterns — where does the model get it wrong?
- Add rules or examples that address those specific failures
- Test again on the same inputs plus new ones
- Repeat until accuracy meets your threshold
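The loop above can be sketched as a tiny evaluation harness. Here, model is any callable that takes a prompt string and returns a response; the stub below stands in for a real API call and is purely illustrative:

```python
def evaluate_prompt(model, prompt_template, test_cases):
    """Run a prompt over labeled test cases; return accuracy and the
    failures worth studying for the next iteration."""
    failures = []
    for input_text, expected in test_cases:
        output = model(prompt_template.format(input=input_text))
        if output.strip() != expected:
            failures.append((input_text, expected, output))
    accuracy = 1 - len(failures) / len(test_cases)
    return accuracy, failures

# Stub model for illustration; replace with a real API call.
def stub_model(prompt):
    return "billing" if "charge" in prompt else "technical"

accuracy, failures = evaluate_prompt(
    stub_model,
    "Classify this ticket: {input}",
    [
        ("I was charged twice for my Pro subscription", "billing"),
        ("The export feature has been broken since yesterday", "technical"),
    ],
)
```

Keep the labeled test cases in version control next to the prompt itself, so every prompt change gets re-scored against the same baseline.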
This iterative tightening is how production prompts get built. It's not glamorous, but it works. For a deeper walkthrough of this evaluation process, check out my guide on writing better AI prompts.
Want to keep improving your prompts? Read the complete guide to writing better AI prompts, learn how to write a CLAUDE.md file for persistent agent configuration, or explore how AI agents work to see these techniques applied at scale. You can also try these techniques live with the Andy AI Chat.
— Andy