4 minute read

Written by - Millan Kaul


Tokenization: why “Hello World” becomes 4 tokens, how it affects your prompts, costs, and why context windows matter for testing.

WHY?

  • Billing and cost: LLM requests are charged by tokens, not words. “Hello World” is 2 words but can come out as ~4 tokens, depending on the tokenizer and any special tokens it adds. Understanding this helps you write prompts that fit the budget and avoid surprises on your AI bill.

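  • Context windows: Every model has a maximum token limit (for example, GPT-4 Turbo offers 128K tokens and Claude 3 models offer 200K). Your prompt + response must fit within it. If they don’t, the API may reject the request, or a client library or framework may quietly chop off the start of your data, which is a silent killer for test reliability.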

WHAT?

  • Tokenization is the process of splitting text into small chunks called tokens—which can be letters, words, punctuation, or pieces of words. The model processes your prompt token-by-token, so token count directly affects speed, memory, and cost.
  • Most LLMs use subword tokenization (like Byte Pair Encoding): common words = 1 token, but rare words, long IDs, and code often split into 2+ tokens.
  • A context window is the total number of tokens the model can handle in one request (your prompt + the model’s response). Once you hit the limit, the model can’t add tokens; it either fails or quietly truncates the beginning of your input.
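
To make the subword idea concrete, here is a minimal toy sketch of Byte Pair Encoding in Python. It is not the tokenizer of any real model: production tokenizers work on bytes, are trained on huge corpora, and handle spaces and special tokens. The tiny corpus, the function names (`learn_merges`, `apply_merge`, `tokenize`) and the sample ID `x9k2` are all made up for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:  # every word is already a single token
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy corpus: "error" and "test" are the "common words" here.
corpus = ["error"] * 50 + ["test"] * 30
merges = learn_merges(corpus, num_merges=10)

print(tokenize("error", merges))   # ['error'] -> common word, 1 token
print(tokenize("x9k2", merges))    # ['x', '9', 'k', '2'] -> unseen ID, 4 tokens
print(tokenize("tester", merges))  # unseen word, split into known pieces
```

The common word collapses into a single token because its character pairs were merged during training, while the unseen ID falls back to one token per character. That is the same effect that makes long IDs and code expensive in real tokenizers.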

Examples:

Here are some real-world text samples with their approximate token counts:

  • Wayne Gretzky’s quote “You miss 100% of the shots you don’t take” = 11 tokens
  • The OpenAI Charter = 476 tokens

Note: The examples above are taken from OpenAI’s help article “What are tokens and how to count them?”

Quick rule of thumb: 1 token ≈ 3-4 characters of English text, or about ¾ of a word. So a 1000-word document is roughly 1,300-1,400 tokens, not 1,000.
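
The rule of thumb can be turned into a rough estimator. This is a sketch under the assumption of about 4 characters (or ¾ of a word) per token for English text. The function names are hypothetical, and a real tokenizer will give different numbers for code, IDs, or other languages.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not a property of any tokenizer


def estimate_tokens_from_text(text: str) -> int:
    """Rough token estimate from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from word count (1 token ~ 0.75 words)."""
    return math.ceil(word_count * 4 / 3)


quote = "You miss 100% of the shots you don't take"
print(estimate_tokens_from_text(quote))  # 41 characters -> 11
print(estimate_tokens_from_words(1000))  # -> 1334
```

Use estimates like this for budgeting only. For exact counts, run the tokenizer that belongs to the model you are calling.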

WHEN AND WHERE?

When tokenization matters most

  • Writing AI test suites: When you craft a prompt that includes system instructions + examples + context, the total often exceeds your budget. Example: A guardrail prompt with a long policy list + user query can hit token limits before the model responds.
  • Passing structured data: JSON, error logs, or code snippets tokenize inefficiently. A 50-line stack trace can easily eat several hundred tokens. Understanding this helps you trim or restructure data before sending it.
  • Comparing models: A 4K-context model vs. a 128K-context model changes what you can do. Knowing your typical prompt size helps you pick the right model upfront.
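
A pre-flight check along these lines can tell you whether a prompt fits a given context window before you send it. This is a minimal sketch: the `TokenBudget` class, the 10% margin, and the model names and window sizes in `WINDOWS` are assumptions for illustration, not values from any provider.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_window: int         # total tokens the model accepts (prompt + response)
    reserved_for_response: int  # tokens set aside for the model's answer
    margin_percent: int = 10    # extra buffer, since token estimates are rough

    def max_prompt_tokens(self) -> int:
        usable = self.context_window * (100 - self.margin_percent) // 100
        return usable - self.reserved_for_response

    def fits(self, prompt_tokens: int) -> bool:
        return prompt_tokens <= self.max_prompt_tokens()


def smallest_fitting_model(prompt_tokens, reserved_for_response, windows):
    """Return the name of the smallest context window that fits, or None."""
    for name, window in sorted(windows.items(), key=lambda item: item[1]):
        if TokenBudget(window, reserved_for_response).fits(prompt_tokens):
            return name
    return None


# Hypothetical model names and window sizes, for illustration only.
WINDOWS = {"model-4k": 4096, "model-8k": 8192, "model-128k": 128000}

print(smallest_fitting_model(2000, 1000, WINDOWS))  # model-4k
print(smallest_fitting_model(3000, 1000, WINDOWS))  # model-8k
```

Failing loudly in your own code when `fits` returns False is usually easier to debug than discovering later that part of the input never reached the model.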

Practical example scenarios

  • A log-triage tool that suddenly fails: new log format makes each entry longer, pushing prompts past the context limit and truncating traces.
  • A test harness that works with one model but fails after switching to another: a different tokenizer or a smaller context window means the same prompt now exceeds the limit.
  • An AI assistant where prompts slowly grow (more policies, more context added over time), and you don’t notice token usage doubling until the bill arrives.

HOW?

Practical steps

  1. Estimate your prompt size
    • Measure tokens for the key prompts you’ll reuse (system message, examples, typical input). Use a token counter (OpenAI’s or Hugging Face) or apply the 3-4 characters per token rule.
  2. Choose a model with enough headroom
    • If your typical prompt is 2K tokens and you want room for a long response, pick a model with at least 8K context (leaving buffer). Don’t rely on exactly fitting the limit.
  3. Trim and restructure data
    • Instead of dumping a full error log, send the last 20 lines. Convert JSON pretty-print to compact format. Summarize long policies into bullet points. Small reductions add up.
  4. Monitor tokens in your test suite
    • Log token usage per request (most LLM SDKs report this). Over time, you’ll spot if prompt bloat is happening or if a new prompt format is expensive.
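
Step 3 can be sketched with the standard library alone. The helper names (`compact_json`, `tail_lines`) and the sample payload and log are hypothetical. Character counts are used here as a rough proxy for tokens, so the exact savings will differ per tokenizer.

```python
import json


def compact_json(obj) -> str:
    """Serialize without indentation or spaces after separators."""
    return json.dumps(obj, separators=(",", ":"))


def tail_lines(text: str, n: int = 20) -> str:
    """Keep only the last `n` lines of a log or stack trace."""
    if n <= 0:
        return ""
    return "\n".join(text.splitlines()[-n:])


# Hypothetical payload and log, for illustration only.
payload = {
    "test": "login",
    "status": "failed",
    "errors": [{"code": 401, "msg": "unauthorized"}],
}
pretty = json.dumps(payload, indent=2)
compact = compact_json(payload)
print(len(pretty), "->", len(compact), "characters")

log = "\n".join(f"line {i}: something happened" for i in range(100))
trimmed = tail_lines(log)
print(len(log), "->", len(trimmed), "characters")
```

Both helpers keep the data the model needs (the same JSON values, the most recent log lines) while dropping whitespace and older lines that mostly add tokens.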

Real examples

  • Test harness bloat: You have a system prompt (500 tokens) + test instructions (300 tokens) + 3 examples (1800 tokens total), or 2,600 tokens per request. Dropping one example alone saves about 600 tokens (roughly 23%), and shortening the instructions saves more, making the suite faster and cheaper.
  • Structured data explosion: Sending a full multi-line error traceback costs more tokens than sending just the exception message + final stack frame. You often get nearly the same debugging info for a fraction of the tokens.
  • Silent truncation: Your test prompts work fine with a 32K model, but when you accidentally switch to a 4K model, tests start failing with no obvious error. Monitoring token counts catches this immediately.
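
Monitoring like this can be as simple as recording the prompt token count of every request and flagging suites that exceed a budget or grow over time. The sketch below assumes each response exposes a usage dictionary with a `prompt_tokens` field, as OpenAI-style APIs do. Other SDKs use different field names, and the `TokenUsageLog` class and suite names are hypothetical.

```python
from collections import defaultdict


class TokenUsageLog:
    """Track prompt token counts per test suite across runs."""

    def __init__(self, prompt_budget: int):
        self.prompt_budget = prompt_budget
        self.history = defaultdict(list)  # suite name -> prompt token counts

    def record(self, suite: str, usage: dict) -> None:
        # `usage` is assumed to look like {"prompt_tokens": ..., "completion_tokens": ...}
        self.history[suite].append(usage["prompt_tokens"])

    def over_budget(self) -> list:
        """Suites whose most recent prompt exceeded the budget."""
        return sorted(
            name for name, counts in self.history.items()
            if counts[-1] > self.prompt_budget
        )

    def growth_ratio(self, suite: str) -> float:
        """Latest prompt size divided by the first recorded size."""
        counts = self.history.get(suite)
        if not counts:
            raise KeyError(f"no usage recorded for {suite!r}")
        return counts[-1] / counts[0]


log = TokenUsageLog(prompt_budget=3000)
log.record("guardrail_suite", {"prompt_tokens": 2600, "completion_tokens": 120})
log.record("guardrail_suite", {"prompt_tokens": 5200, "completion_tokens": 118})
log.record("smoke_suite", {"prompt_tokens": 800, "completion_tokens": 40})

print(log.over_budget())                    # ['guardrail_suite']
print(log.growth_ratio("guardrail_suite"))  # 2.0
```

A check such as `assert not log.over_budget()` at the end of a test run turns quiet prompt growth into a visible failure.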

For Leaders

Why tokens matter: Token efficiency directly drives both cost and latency. A team that ignores tokenization can see monthly AI spending double silently as prompts grow (policies added, context expanded, examples accumulated). Early adoption of token budgeting prevents surprises.

Where to watch: Tokenization affects model selection (do we need 128K or is 8K enough?), infrastructure costs (per-request charges scale with tokens), and release timelines (bloated prompts slow down inference). Token metrics belong on your AI KPI dashboard alongside precision and recall.

Risk management: Silent truncation is a classic failure mode: long prompts get chopped somewhere in the pipeline to fit the model’s limit, and you may never see a warning. This can cause wrong answers, missed context in logs, or failed guardrails. Pre-mortems should include “what if our prompts hit the context limit?”
