Prompt Tracking
How to design and run prompt sets that produce reliable, unbiased data. Your measurements are only as good as the prompts behind them, and most teams get this wrong in the same predictable ways.
The short version
- Stop counting prompts. Tracking too many prompts creates noise. You end up measuring your own wording choices, not the model’s actual understanding of your brand.
- Focus on consistent outputs. The signal is which topics, entities, and attributes the AI reliably connects to your brand across many runs.
- Use a small, deliberate core prompt set. A handful of prompts that genuinely represent your offerings. Let the outputs tell you what to look at next.
- Map the gaps. The most useful data is where competitors are consistently clustered but your brand is absent.
Four ways prompt tracking goes wrong
Volume overload
Tracking dozens of prompts creates noise. Each phrasing produces slightly different answers, making it impossible to see clear patterns. More prompts do not mean better data. They usually mean harder-to-read data.
Bias in the prompt
The way you phrase a question shapes the answer. If you write prompts that reference your brand or describe your category using your own marketing language, you are measuring your own wording choices rather than the model’s actual understanding.
Non-reproducibility
AI outputs are probabilistic. Even with the same prompt, you will not always get the same answer. Running a prompt once and treating the result as fact is not data. It is a single observation.
Measuring inputs instead of outputs
The goal is not to track which prompts mention your brand. The goal is to understand what the model consistently connects to your brand. Treating the prompt as the unit of measurement shifts your attention to the wrong thing.
How to do it properly
Choose your core prompts deliberately
Select a small set of prompts that genuinely represent your brand’s core offerings. This step is deliberately hard and should not be automated or delegated. Start with the questions your buyers actually ask before they have a shortlist, not after. Aim for five to ten prompts to start.
Example prompts
- “What are the top workflow automation tools?”
- “Compare AI workflow platforms for small businesses.”
- “Which tools do companies use for [your core use case]?”
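Choosing the prompts is a human job, but checking the finished set against the two rules above (five to ten prompts, no brand references) is easy to script. The sketch below is one way to do that check. The function name `lint_prompt_set`, the brand name `ExampleBrand`, and the third prompt are all hypothetical and exist only for illustration.

```python
import re


def lint_prompt_set(prompts, brand_terms, min_size=5, max_size=10):
    """Return a list of warnings for a core prompt set.

    Checks only two things: the size of the set, and whether any
    prompt mentions one of your own brand terms.
    """
    issues = []
    if not (min_size <= len(prompts) <= max_size):
        issues.append(
            f"Set has {len(prompts)} prompts; expected {min_size} to {max_size}."
        )
    for prompt in prompts:
        for term in brand_terms:
            # Whole-word, case-insensitive match on the brand term.
            if re.search(rf"\b{re.escape(term)}\b", prompt, flags=re.IGNORECASE):
                issues.append(f"Prompt mentions '{term}': {prompt}")
    return issues


core_prompts = [
    "What are the top workflow automation tools?",
    "Compare AI workflow platforms for small businesses.",
    "Is ExampleBrand the fastest workflow tool?",  # hypothetical biased prompt
]
issues = lint_prompt_set(core_prompts, brand_terms=["ExampleBrand"])
for issue in issues:
    print(issue)
```

This catches brand names, not marketing language. Whether a prompt describes your category in your own terms still needs a human read.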
Control the channel before you run
Before running any prompt, decide which channel you are testing and lock it in. For training data measurement, disable browsing explicitly. For grounded search measurement, enable it. If you mix the two within the same prompt set without logging which runs used which setting, you are blending two different signals into one frequency count and the data will not be interpretable.
Log the channel alongside every run: model name, browsing on or off, and date. This becomes essential when you compare results month to month.
Why this matters
A prompt run against training data and the same prompt run with live retrieval can produce completely different brand mentions, topics, and attributes. Treating them as equivalent is one of the most common sources of confusing or contradictory data in AI visibility measurement.
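A minimal sketch of the run log described above, assuming you store runs yourself rather than in a tool. The `RunRecord` fields mirror the three items to log (model name, browsing on or off, date) plus the prompt and the raw response. The class name, field names, model label, and sample date are illustrative assumptions, not a required schema.

```python
import csv
import io
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class RunRecord:
    prompt: str
    model: str       # model name as you choose to label it
    browsing: bool   # True = grounded search, False = training data only
    run_date: date
    response: str


def write_runs_csv(records, fileobj):
    """Write one CSV row per run, with the channel logged on every row."""
    fieldnames = ["prompt", "model", "browsing", "run_date", "response"]
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["run_date"] = record.run_date.isoformat()
        writer.writerow(row)


def split_by_channel(records):
    """Group runs by (model, browsing) so the two signals are never blended."""
    groups = {}
    for record in records:
        groups.setdefault((record.model, record.browsing), []).append(record)
    return groups


prompt = "What are the top workflow automation tools?"
records = [
    RunRecord(prompt, "model-a", False, date(2025, 1, 15), "(response text)"),
    RunRecord(prompt, "model-a", True, date(2025, 1, 15), "(response text)"),
    RunRecord(prompt, "model-a", True, date(2025, 1, 15), "(response text)"),
]
groups = split_by_channel(records)
for (model, browsing), runs in groups.items():
    print(model, "browsing on" if browsing else "browsing off", len(runs), "runs")
```

Grouping by `(model, browsing)` before any counting is the point of the design: a frequency count is only ever computed within one channel.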
Run each prompt multiple times per model
Run each core prompt at least five times per model, and up to ten where you can. Log what appears consistently, not what appeared once. Frequency is the signal. Single-run presence is noise.
Analyse each model separately before you aggregate results across models. Different models have different tuning, different system prompts, and different training data coverage. ChatGPT may produce a very different competitive set than Gemini for the same prompt. Treat each model as its own data source, then look for patterns that hold across all of them.
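The counting step can be sketched as follows. It assumes the entities have already been extracted from each response and that each run is a `(model, entities)` pair. The 60% threshold for "consistent" is an assumption chosen for the example, not a recommended cut-off, and the brand and model names are placeholders.

```python
from collections import Counter, defaultdict


def mention_frequency(runs):
    """Count, per model, how many runs mentioned each entity.

    runs: iterable of (model, entities_in_response) pairs.
    Returns {model: {entity: (times_seen, total_runs)}}.
    """
    totals = Counter()
    seen = defaultdict(Counter)
    for model, entities in runs:
        totals[model] += 1
        # set() so an entity named twice in one response counts once per run.
        for entity in set(entities):
            seen[model][entity] += 1
    return {
        model: {entity: (n, totals[model]) for entity, n in seen[model].items()}
        for model in totals
    }


def consistent(freq_for_model, min_share=0.6):
    """Entities seen in at least min_share of one model's runs (assumed threshold)."""
    return sorted(
        entity
        for entity, (n, total) in freq_for_model.items()
        if n / total >= min_share
    )


runs = [
    ("model-a", ["BrandX", "YourBrand"]),
    ("model-a", ["BrandX"]),
    ("model-a", ["BrandX", "YourBrand", "BrandY"]),
    ("model-a", ["YourBrand"]),
    ("model-a", ["BrandX", "BrandX"]),
    ("model-b", ["BrandY"]),
    ("model-b", ["BrandY", "BrandX"]),
]
freq = mention_frequency(runs)
for model in freq:
    print(model, consistent(freq[model]))
```

Results stay keyed by model throughout. Any cross-model comparison happens afterwards, on the per-model lists.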
Extract entities, topics, and attributes from each response
From each response, record three types of signal: the entities mentioned, the topics they are associated with, and the attributes assigned to each brand. Use a consistent format so the data is comparable across runs and over time.
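One possible shape for that consistent format is sketched below. It covers only how a signal is stored once you have read it out of a response, not the extraction itself. The `Signal` class, its field names, and the normalisation rules are assumptions made for illustration.

```python
from dataclasses import dataclass

SIGNAL_KINDS = {"entity", "topic", "attribute"}


@dataclass(frozen=True)
class Signal:
    kind: str        # "entity", "topic", or "attribute"
    value: str       # e.g. a brand name, a topic, or an attribute such as "fast"
    brand: str = ""  # brand the topic or attribute was attached to, if any

    def __post_init__(self):
        if self.kind not in SIGNAL_KINDS:
            raise ValueError(f"Unknown signal kind: {self.kind!r}")


def normalise(text):
    """Lower-case and collapse whitespace so 'Fast ' and 'fast' compare equal."""
    return " ".join(text.lower().split())


def make_signal(kind, value, brand=""):
    # Normalised values are comparison keys; keep display names elsewhere if needed.
    return Signal(normalise(kind), normalise(value), normalise(brand))


run_1 = {make_signal("Attribute", "Fast", "BrandX"), make_signal("Entity", "BrandX")}
run_2 = {make_signal("attribute", "  fast ", "brandx"), make_signal("Topic", "AI optimisation")}

# Signals recorded the same way in both runs, despite different casing and spacing.
print(run_1 & run_2)
```

Because the records are frozen and normalised, they can be used as set members and dictionary keys, which is what makes month-to-month comparison straightforward.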
Map breadth and identify gaps
Build a table of entities and attributes across all competitors. Highlight where your brand is missing. A competitor consistently appearing alongside an attribute your brand does not own is a specific, traceable gap with a specific fix.
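The gap check can be expressed in a few lines once you have per-brand attribute counts for one model and one channel. In this sketch the input layout, the function name `find_gaps`, and the 60% consistency threshold are all assumptions, and the counts are invented example data.

```python
def find_gaps(attribute_counts, own_brand, total_runs, min_share=0.6):
    """List attributes a competitor holds consistently that your brand never shows.

    attribute_counts: {brand: {attribute: times_seen}} for one model and channel.
    Returns sorted (attribute, competitor, times_seen) tuples.
    """
    own = attribute_counts.get(own_brand, {})
    gaps = []
    for brand, attributes in attribute_counts.items():
        if brand == own_brand:
            continue
        for attribute, times_seen in attributes.items():
            is_consistent = times_seen / total_runs >= min_share
            if is_consistent and own.get(attribute, 0) == 0:
                gaps.append((attribute, brand, times_seen))
    return sorted(gaps)


counts = {
    "YourBrand": {"trusted": 3},
    "BrandX": {"fast": 4, "trusted": 2},
    "BrandY": {"cheap": 1},
}
gaps = find_gaps(counts, own_brand="YourBrand", total_runs=5)
for attribute, competitor, times_seen in gaps:
    print(f"{competitor} owns '{attribute}' ({times_seen} of 5 runs); YourBrand absent")
```

With this data only BrandX's "fast" is reported. "cheap" appeared once, so it is treated as noise rather than a gap.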
Track change over time
Run the same core prompt set monthly. Look for directional change: associations strengthening, new topics emerging, gaps closing. The value is in the trend, not any single snapshot.
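A rough sketch of the month-to-month comparison, assuming the same number of runs each month so that raw counts are directly comparable. The labels and the two-run `min_change` threshold are illustrative assumptions, and the counts are invented.

```python
def trend(previous, current, min_change=2):
    """Label each signal by how its run count moved between two months.

    previous, current: {signal: times_seen}, from the same number of runs.
    min_change: smallest count difference treated as a real shift (assumed).
    """
    changes = {}
    for signal in sorted(set(previous) | set(current)):
        before = previous.get(signal, 0)
        after = current.get(signal, 0)
        delta = after - before
        if before == 0 and after > 0:
            label = "new"
        elif after == 0 and before > 0:
            label = "dropped"
        elif delta >= min_change:
            label = "strengthening"
        elif delta <= -min_change:
            label = "weakening"
        else:
            label = "stable"
        changes[signal] = label
    return changes


last_month = {"trusted": 3, "ai optimisation": 5, "integrations": 2}
this_month = {"trusted": 5, "ai optimisation": 4, "fast": 2}
changes = trend(last_month, this_month)
for signal, label in changes.items():
    print(f"{signal}: {label}")
```

A one-run difference is labelled stable here on purpose. With five to ten runs per prompt, small movements are within the normal variation of probabilistic outputs.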
The entity gap analysis worksheet
Use this structure to record each prompt run consistently. The format matters because you will be comparing runs across months, not just reading a single set of results.
For each signal extracted from the responses, record its type, the signal itself, how often it was seen, and any notes:
| Type | Signal | Times seen | Notes |
|---|---|---|---|
| Entity | Your brand | 3 of 5 runs | Consistently recognised. Core offering confirmed. |
| Topic | AI optimisation | 5 of 5 runs | Appears in every response. Strong topical cohesion. |
| Attribute | Trusted | 3 of 5 runs | Stable attribute. Reinforced across multiple phrasings. |
| Competitor | BrandX | 4 of 5 runs | Dominates the “fast” attribute. Your brand absent from speed comparisons. Gap to address. |
Waikay manages your core prompt set, runs each prompt continuously across models, and extracts entities, topics, and attributes automatically. Rather than managing spreadsheets, you see the patterns directly: which associations are strengthening, which gaps are closing, and where competitors are pulling ahead.
