Citation Data
Which sources is AI drawing on when it talks about your brand? Where are your competitors getting cited that you are not? Citation data tells you why the model says what it says, and where you need to be to change it.
The short version
- Citations in AI responses are often modelled, not retrieved. LLMs approximate what should be cited based on training patterns, so some citations are accurate and some are hallucinated.
- Run two types of query. Commercial queries show which sources dominate live retrieval. Knowledge-based queries reveal what the model has learned about your brand specifically.
- Build a citation profile. Track which domains, page types, and platforms appear consistently across responses.
- Hallucinated citations are a signal. If a model cites a page on your site that does not exist, it is telling you what content it expects you to have.
Two chapters worth reading first
Citation behaviour differs significantly between training data and live retrieval. Understanding which channel you are measuring changes how you interpret what you find and what you do about it.
Ch. 7 – AI Visibility Channels
Citations work differently depending on whether the AI is drawing on training data or live search. Chapter 7 explains how each channel works and why the distinction matters for citation analysis.
Ch. 8 – Prompt Tracking
Getting consistent citation data requires well-designed prompts. Chapter 8 covers how to structure queries that reliably surface citations rather than producing inconsistent results.
How citations actually work
Modelled, not retrieved
When an AI cites a source, it is not always pulling a live URL from the web. In many cases it is generating what it predicts a citation should look like based on patterns in its training data. This is why AI citations are sometimes wrong, broken, or pointing to pages that do not exist.
Understanding this changes how you approach citation analysis. You are not just auditing what the AI links to. You are auditing what it has learned to associate with authoritative sources in your space.
You can influence citations at two points: the live search surface (traditional SEO) and the training data layer (authoritative content on the right platforms). Most teams focus only on the first. The second is often more important for brand-specific queries.
Where you can actually influence citations
| Pipeline stage | Leverage | What you can do |
|---|---|---|
| Query interpretation | Limited | Model likely intent and design content around the questions buyers actually ask. |
| Live search decision | None | A system-level call made by the model. Content owners cannot trigger or influence it. |
| Live search surface | Yes | Traditional SEO applies: ranking well, earning backlinks, structuring metadata clearly. |
| Training data recall | Yes | Publish high-quality, well-linked content on authoritative platforms that the model has learned to trust. |
| Citation tagging | Yes | Track patterns, fix broken links, use clean URL structures to reduce hallucinated citations. |
Building your citation profile
Two types of query, two types of insight
To build a complete picture of your citation profile you need to run two different kinds of query. Each surfaces different information about how and where you are being cited.
Commercial queries
General category-level questions that do not mention your brand. They reflect how a buyer researches before knowing which brands to consider, and they show which sources dominate live retrieval in your space.
Knowledge-based queries
Entity-specific questions that mention your brand directly. They tap into trained memory rather than live retrieval. Ask the model to turn off live search for a cleaner read of what it has actually learned.
Run queries and collect citations
Run both query types across one or more models. Explicitly ask for citations in your prompt. Record every URL returned, including broken ones and hallucinated ones. Both matter.
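As a minimal sketch of the collection step, the snippet below pulls every distinct URL out of a saved response and normalises its domain. The function names and the regular expression are illustrative assumptions, not part of any particular tool, and the pattern is deliberately simple: it stops at closing brackets, so URLs that contain parentheses would be cut short.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_citations(response_text):
    """Return each distinct URL in the response, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(response_text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def domain_of(url):
    """Lower-case the host and strip a leading 'www.' so domains aggregate cleanly."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host
```

Keeping broken and hallucinated URLs in the output is intentional here; whether each link resolves is recorded as a separate step.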
Log, aggregate, and compare against competitors
Record each citation: URL, domain, page type, which query it appeared in, and whether the link actually resolves. Then run the same queries for your main competitors. Platforms appearing consistently for them but not for you are the primary targets for investment.
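The log and the competitor comparison could be sketched along these lines. The `Citation` fields mirror the list above; the extra `brand` field (which brand's query run produced the citation) and both function names are assumptions added for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    url: str
    domain: str
    page_type: str   # e.g. "guide", "review", "forum"
    query: str       # the prompt that produced the citation
    brand: str       # assumed field: whose query run this row belongs to
    resolves: bool   # did the link lead to a live page when checked?


def domain_counts(log, brand):
    """Count how often each domain was cited in one brand's query runs."""
    return Counter(c.domain for c in log if c.brand == brand)


def citation_gap(log, own_brand, competitor):
    """Domains cited in the competitor's runs but never in ours, most cited first."""
    ours = domain_counts(log, own_brand)
    theirs = domain_counts(log, competitor)
    return [(domain, n) for domain, n in theirs.most_common() if domain not in ours]
```

The domains at the top of the `citation_gap` output are the platforms that appear consistently for the competitor but not for you.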
Treat hallucinated citations as content briefs
If the model cites a URL on your domain that does not exist, it is telling you what content it expects your brand to have. That is a content brief. Create the page.
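One way to surface those briefs from a citation log is sketched below. It assumes each citation has already been checked and is stored as a `(url, resolves)` pair; the function name and the domain in the usage example are hypothetical.

```python
from urllib.parse import urlparse


def content_briefs(citations, own_domain):
    """Return the sorted paths on own_domain that were cited but do not resolve.

    citations: iterable of (url, resolves) pairs, where resolves is a bool.
    """
    briefs = set()
    for url, resolves in citations:
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host == own_domain.lower() and not resolves:
            briefs.add(parsed.path or "/")
    return sorted(briefs)


# Hypothetical usage: two responses cited the same missing pricing guide.
checked = [
    ("https://www.acme.example/guides/pricing", False),
    ("https://acme.example/guides/pricing", False),
    ("https://acme.example/blog", True),
]
missing_pages = content_briefs(checked, "acme.example")
```

Each returned path is a candidate page to create, and a path that recurs across many prompts is a stronger signal than a one-off.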
Monitor on a regular cadence
Citation behaviour shifts as models update. Re-run quarterly at minimum and track the direction of change over time rather than treating any single snapshot as definitive.
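To track direction rather than a single snapshot, one simple option is to compare each domain's share of citations between two runs. This is a sketch under the assumption that each snapshot is stored as a plain list of cited domains; the function names are illustrative.

```python
from collections import Counter


def share_of_citations(domains):
    """Map each domain to its fraction of all citations in one snapshot."""
    total = len(domains)
    if total == 0:
        return {}
    return {domain: n / total for domain, n in Counter(domains).items()}


def share_change(previous, current):
    """Change in citation share per domain between two snapshots.

    Positive values mean the domain is cited relatively more often than before.
    """
    prev = share_of_citations(previous)
    cur = share_of_citations(current)
    return {
        domain: round(cur.get(domain, 0.0) - prev.get(domain, 0.0), 3)
        for domain in sorted(set(prev) | set(cur))
    }
```

Using shares instead of raw counts keeps the comparison meaningful when the number of prompts differs between runs.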
Making your citations more accurate and more frequent
Clean URL structures
LLMs hallucinate links based on domain patterns. Predictable, descriptive URL structures narrow the gap between what the model approximates and what actually exists on your site.
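As an illustration of what "predictable and descriptive" can mean in practice, the sketch below derives a URL path directly from a section name and a page title. The `/section/title` scheme is an assumption chosen for the example, not a recommended standard.

```python
import re


def slugify(text):
    """Lower-case the text and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def page_path(section, title):
    """Build a path that can be guessed from the page's section and title."""
    return f"/{slugify(section)}/{slugify(title)}"
```

With this scheme, a guide titled "What Is Citation Data?" in the "Guides" section lives at `/guides/what-is-citation-data`, which is close to the URL a model would be likely to approximate.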
Citation-friendly content formats
Guides, glossaries, and structured explainers tend to be cited more often than marketing pages. Include clear author attribution, structured headings, and internal linking that helps the model understand context.
Authoritative platform presence
Publish on platforms with strong representation in training data: industry publications, structured knowledge bases, well-linked community sites. The model cites what it has been trained to treat as authoritative.
Fix broken links promptly
A broken page that keeps being cited is a missed attribution every time, because the reference leads nowhere. Redirect it or restore it.
Waikay collects citations across every tracked prompt, identifies which domains and page types appear most consistently, and surfaces the gap between where you are cited and where your competitors are. It also flags hallucinated citations so you can turn them into real content rather than letting them remain as dead references.
