🔀
Chapter 7 Layer 4: Methodology

AI Visibility Channels

Training data and grounded search are two completely different systems, and the AI tools your buyers use do not all work the same way. Understanding which channel each tool uses, when each channel fires, and which one you are measuring at any given time is the foundation for everything else in this guide.

TL;DR

The short version

Key points
  • Different AI tools use different channels by default. Perplexity and Google AI Overviews are almost entirely grounded search. ChatGPT without browsing is training data. Knowing which you are measuring matters enormously.
  • For ChatGPT specifically, most brand and category queries hit training data. Grounded lookups cost compute and are triggered selectively, not on every query.
  • Ranking well does not mean the AI represents you accurately. Grounded search retrieves your page and then rewrites it. You can rank first and still be misrepresented.
  • Training data is not fully frozen. Major models extend knowledge cutoffs and incorporate updates without always doing a full retrain. But changes are still slow compared to live retrieval.
  • Publishing content today affects grounded search in days. Training data takes months to over a year. These are fundamentally different timelines requiring different strategies.
💡
Start here

Try this before anything else

The two-channel test

Open ChatGPT with browsing turned off. Ask: “What do you know about [your brand] in the context of [your category]?”

Now turn browsing on and ask the exact same question. Then run the same question in Perplexity. Compare all three responses.

Questions worth asking

Which response feels most accurate? Which feels most dangerous? Does the live version surface the right pages or a competitor’s? Does Perplexity produce something different again? Are there claims in any version you would not want a buyer to read?

The differences between those three responses are what this chapter is about.
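If you want a rough number to go with your side-by-side reading, the three responses can be compared pairwise. The sketch below is a minimal illustration using Python's standard `difflib`; the labels, the placeholder texts and the brand name "Acme" are invented for the example, and character-level similarity is only a crude proxy for "these answers disagree". It does not replace reading the responses.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(responses: dict[str, str]) -> dict[tuple[str, str], float]:
    """Return a 0..1 similarity ratio for every pair of labelled responses."""
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(responses.items(), 2):
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        scores[(name_a, name_b)] = round(ratio, 2)
    return scores


# Placeholder texts: paste in the responses you collected by hand.
responses = {
    "chatgpt_browsing_off": "Acme is a CRM tool for small teams.",
    "chatgpt_browsing_on": "Acme is a CRM tool for small teams with a Salesforce connector.",
    "perplexity": "Acme offers invoicing software.",
}

for pair, score in pairwise_similarity(responses).items():
    print(pair, score)
```

A low score between two channels is a prompt to read both answers closely, not a verdict on which one is right.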

🗂️
Critical context

Not all AI tools work the same way

The single biggest mistake in AI visibility measurement is treating all AI tools as equivalent. They are not. Each major platform has a different default channel, and that changes everything about how you interpret your data and what you do to improve your visibility.

Grounded by default

Perplexity

Almost entirely grounded search. Every query triggers live web retrieval. There is no training data layer you can influence separately. Your SEO and your citation profile are what matter here.

Grounded by default

Google AI Overviews

Defaults to grounded retrieval from Google’s index. Built on top of traditional search infrastructure. Optimising for AI Overviews is substantially an SEO problem, not a training data problem.

Mixed — setting dependent

ChatGPT

Defaults to training data. Browsing can be enabled manually, or it is triggered automatically when the query signals a need for recent information. This is the platform where the training vs grounded distinction matters most for brand visibility work.

Grounded by default

Microsoft Copilot / Bing

Grounded by default via Bing’s index. Similar to Perplexity in that live retrieval is the primary mechanism. Training data plays a supporting role in synthesis but retrieval dominates.

Mixed — context dependent

Gemini

Uses a hybrid approach. Has access to Google Search and will trigger retrieval for queries that benefit from it. Training data knowledge cutoffs are extended more frequently than some competitors, making the static vs live distinction less clean.

Training data primary

Claude

Defaults to training data with a fixed knowledge cutoff. No live retrieval unless connected to a search tool. For brand visibility purposes this is a pure training data channel.

What this means for your measurement

If you are only measuring ChatGPT with browsing off, you are measuring training data for one platform. That matters, but it is a partial picture. Your buyers may be using Perplexity or Google AI Overviews far more, and those are entirely different channels that need different strategies and different measurement approaches.
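One way to keep this straight in a measurement log is to record the default channel per platform and label every test run with the channel it actually exercised. The sketch below encodes the defaults described in this chapter. The names `Channel`, `DEFAULT_CHANNEL` and `measured_channel` are hypothetical, and the mapping reflects this chapter's description rather than any vendor documentation, so re-check it as platforms change.

```python
from enum import Enum
from typing import Optional


class Channel(Enum):
    TRAINING = "training data"
    GROUNDED = "grounded search"
    MIXED = "mixed"


# Defaults as described in this chapter; platforms change, so re-check these.
DEFAULT_CHANNEL = {
    "Perplexity": Channel.GROUNDED,
    "Google AI Overviews": Channel.GROUNDED,
    "Microsoft Copilot / Bing": Channel.GROUNDED,
    "ChatGPT": Channel.MIXED,
    "Gemini": Channel.MIXED,
    "Claude": Channel.TRAINING,
}


def measured_channel(platform: str, retrieval_enabled: Optional[bool] = None) -> Channel:
    """Say which channel a test run on `platform` is measuring.

    retrieval_enabled records whether browsing or a search tool was on for
    the run. None means the setting was not recorded, so the platform
    default is returned (which may be MIXED, i.e. not attributable).
    """
    default = DEFAULT_CHANNEL[platform]
    if default is Channel.GROUNDED or retrieval_enabled is None:
        return default
    return Channel.GROUNDED if retrieval_enabled else Channel.TRAINING


print(measured_channel("ChatGPT", retrieval_enabled=False).value)
print(measured_channel("Perplexity").value)
```

A run that comes back as `MIXED` is a reminder that the browsing setting was not logged and the result cannot be assigned to either channel.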

📡
The channels

How each channel actually works

Channel 01

Training Data

Knowledge baked into the model during pretraining on large web snapshots. It is not fully frozen: major models extend knowledge cutoffs and incorporate updates through lighter processes, such as continued pretraining, without always doing a full retrain. But changes are slow and unpredictable compared to live retrieval.

Key characteristic

Predominantly static. The model’s baseline understanding of your brand, shaped by what existed on the web during its training window.

Channel 02

Grounded Search

Real-time retrieval using retrieval-augmented generation (RAG). The model fetches current web results and synthesises them into a response. You are not just ranking. You are being interpreted. The user sees the model’s synthesis, not your page.

Key characteristic

Dynamic and current. But triggering retrieval does not guarantee accurate or favourable representation of what is on your page.

ChatGPT specifically

When does grounded search fire in ChatGPT?

For platforms that are grounded by default (Perplexity, Google AI Overviews, Copilot), every query triggers retrieval. For ChatGPT, the model decides query by query. The signals in the table below are probabilistic, not rules, but they are consistent enough to plan around.

| Query type | Likely channel | Why |
|---|---|---|
| “What is the latest news about [brand]?” | Grounded | “Latest” signals time-sensitive intent. The model infers its training data may be stale. |
| “What are the best tools for [category]?” | Training data | No recency signal. The model draws on trained market knowledge. This is where your SOV and Topical Presence data comes from. |
| “Does [brand] integrate with Salesforce?” | Training data | Usually training data for established brands. May trigger retrieval for newer brands with thin training coverage where model confidence is low. |
| “What is [brand]’s current pricing?” | Grounded | Pricing changes frequently. The model recognises this category of information as retrieval-appropriate. |
| “Compare [brand] vs [competitor]” | Mixed | Increasingly triggers retrieval as models have been trained to provide current feature and pricing comparisons. Was predominantly training data in 2023 but this has shifted in 2024 and 2025. |
| “Has [brand] had any recent layoffs?” | Grounded | Clearly event-driven. Model will retrieve unless the event predates its training cutoff. |

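For planning a prompt set, the patterns in the table above can be turned into a rough tagging heuristic. The sketch below is an assumption-heavy illustration: the cue words are taken from the example queries, the function name `likely_channel` is hypothetical, and ChatGPT's real trigger logic is not public. Use it to pre-sort prompts, never to assert which channel actually answered.

```python
import re

# Illustrative cue words drawn from the example queries above.
# These are assumptions, not ChatGPT's actual (unpublished) trigger logic.
RECENCY_CUES = re.compile(
    r"\b(latest|current|recent|recently|today|news|pricing|price)\b", re.IGNORECASE
)
COMPARISON_CUES = re.compile(r"\b(vs|versus|compare)\b", re.IGNORECASE)


def likely_channel(prompt: str) -> str:
    """Guess the channel a prompt is most likely to hit in ChatGPT."""
    if RECENCY_CUES.search(prompt):
        return "grounded"
    if COMPARISON_CUES.search(prompt):
        return "mixed"
    return "training data"


prompts = [
    "What is the latest news about Acme?",
    "What are the best tools for invoicing?",
    "Compare Acme vs Globex",
]
for p in prompts:
    print(f"{likely_channel(p):14} {p}")
```

Brand maturity, which the table notes can flip an integration question to retrieval, is deliberately left out because it cannot be read from the prompt text alone.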
These are signals, not rules

ChatGPT’s retrieval decisions are probabilistic. The same prompt can produce a training data response on one run and a grounded response on another. This is one of the reasons running prompts multiple times and averaging results is essential for reliable measurement.
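Averaging over repeated runs can be as simple as counting how often your brand appears. The sketch below assumes you have already collected the response texts for one prompt (by hand or with whatever tooling you use); the brand names and sample responses are invented, and a plain substring match is a simplification that will miss misspellings and abbreviations.

```python
def mention_rate(responses: list[str], brand: str) -> float:
    """Share of runs (0..1) in which the brand name appears at all."""
    if not responses:
        raise ValueError("need at least one response to average over")
    hits = sum(brand.lower() in text.lower() for text in responses)
    return hits / len(responses)


# Four invented runs of the same prompt.
runs = [
    "Acme and Globex are popular choices.",
    "Globex leads the category.",
    "Top picks: Acme, Initech.",
    "Consider Globex or Initech.",
]

print(mention_rate(runs, "Acme"))    # 0.5
print(mention_rate(runs, "Globex"))  # 0.75
```

A single run here would have reported either 0 or 1 for Acme, which is why one-off checks are unreliable.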

⚠️
Critical nuance

Ranking well does not mean the AI represents you accurately

Grounded search retrieves and then rewrites

When a model retrieves your page it does not show it to the user. It reads it, extracts what it considers relevant, and writes a new response. You have no control over what it extracts, how it weights competing sources, or how it frames your brand in the synthesis.

Ranking first gets you into the source pool. It does not guarantee accurate, complete, or favourable representation in what the user actually reads.

Real scenario: the integration question

A buyer asks ChatGPT (browsing on): “Does [your brand] integrate with Salesforce?”

The model retrieves three pages: your integrations page, a competitor comparison article written by a third party, and a community forum thread from two years ago. It synthesises all three and produces this:

“[Your brand] has limited native integrations. Some users report using Zapier as a workaround. Competitor X offers a direct Salesforce connector.”

Your integrations page ranked first. The model read it. The buyer now believes you do not have a Salesforce integration.

The problem was not your ranking. It was how the model weighted an outdated community thread and a competitor’s comparison page over your own documentation. Grounded search is not just an SEO problem. It is a content ecosystem problem.

What to do about it

Structure your key product pages so the most important facts appear in the opening paragraph, not buried in a table or secondary tab. The model extracts what it encounters first and what appears most consistently across multiple sources. Make your clearest, most accurate claims impossible to miss. And audit the third-party content that could be entering the retrieval pool alongside yours.
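A quick way to audit this is to check whether each key claim appears near the top of the page's visible text. The sketch below assumes you have already extracted the page text as a plain string. The 100-word "opening" window and the function name `fact_position` are arbitrary choices for illustration, not a documented extraction limit of any model.

```python
def fact_position(page_text: str, fact: str, opening_words: int = 100) -> str:
    """Classify a fact as 'opening', 'buried' or 'missing' in the page text.

    opening_words is an assumed cut-off for "the opening paragraph".
    Matching is a case-insensitive substring test on whitespace-normalised text.
    """
    words = page_text.split()
    needle = " ".join(fact.split()).lower()
    opening = " ".join(words[:opening_words]).lower()
    if needle in opening:
        return "opening"
    if needle in " ".join(words).lower():
        return "buried"
    return "missing"


# Invented page: one claim up front, one far down the page.
page = (
    "Acme connects natively to Salesforce. "
    + "filler " * 200
    + "Zapier is also supported."
)

print(fact_position(page, "natively to Salesforce"))
print(fact_position(page, "Zapier is also supported"))
print(fact_position(page, "HubSpot"))
```

Anything that comes back as "buried" or "missing" for a claim buyers ask about is a candidate for moving into the first paragraph.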

🕐
Time lag

Publishing today does not update the model today

Two very different timelines

One of the most common misconceptions in AI visibility is that publishing new content will change what a model says about your brand quickly. For grounded search platforms that is broadly true. For training data it is not, though the picture is more nuanced than a simple cutoff date suggests.

Grounded search

Days to weeks

Once a page is indexed it can appear in grounded retrieval quickly. For established domains this typically happens within days to a few weeks. New domains take longer to gain the trust signals needed to enter the retrieval pool reliably.

Best use

Time-sensitive corrections. New product announcements. Pricing updates. Anything the model should surface right now.

Training data

Months to over a year

Models incorporate training updates on irregular cycles. Major base model retrains happen every several months to over a year depending on the provider. Some models extend knowledge cutoffs more frequently through lighter update processes, but this is inconsistent across providers and not publicly documented.

Best use

Long-term association building. Topical authority. The foundational coverage that shapes how the model understands your brand and category over time.

The practical implication

If a model is saying something wrong about your brand right now and you need to fix it quickly, grounded search optimisation is your fastest route. Fix the content that enters the retrieval pool. Training data corrections matter and are worth investing in, but do not expect to see them reflected in responses for months. Plan both timelines simultaneously rather than choosing one over the other.

🧠
Training data

What you can influence in training data

Training Data: Where You Can Act
Before training

Increase inclusion odds

  • Use consistent, canonical phrasing for your brand name across all pages and platforms
  • Publish on high-authority, crawlable domains that training datasets are known to draw from
  • Remove crawl blockers: robots.txt exclusions, login walls, heavy JavaScript rendering
  • Use structured formats: FAQs, semantic headings, clear entity references
  • Build third-party coverage on platforms with strong web presence: Wikipedia, industry publications, review sites
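The robots.txt check in the list above can be automated with Python's standard `urllib.robotparser`. The sketch below parses a robots.txt body you supply and reports which crawlers are blocked from a given URL. `GPTBot` and `CCBot` are used as example user agents; which crawlers feed which training sets changes over time, so treat the list as an assumption and check each provider's documentation. The sample file and URL are invented.

```python
from urllib.robotparser import RobotFileParser

# Example crawler names only; confirm the current list with each provider.
AI_CRAWLERS = ["GPTBot", "CCBot"]


def blocked_crawlers(robots_txt: str, url: str, agents=AI_CRAWLERS) -> list[str]:
    """Return the user agents that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Invented robots.txt: GPTBot is shut out entirely, everyone else only from /admin/.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_crawlers(sample, "https://example.com/integrations"))  # ['GPTBot']
```

This only covers robots.txt rules. Login walls and JavaScript-only rendering need to be checked separately.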
After training

Audit and adapt

  • Run monthly brand prompts with browsing disabled to test trained knowledge
  • Log hallucinations, missing associations, and inaccurate descriptions
  • Publish corrective content, at scale, that addresses each specific misrepresentation
  • Treat hallucinated citations as content briefs: the model expected a page that does not exist

The monthly training data audit

Once a month, run this prompt across multiple models with browsing disabled:

“What do you know about [YourBrand.com] in the context of [your niche]?”

| Issue found | What to do |
|---|---|
| Brand not mentioned | Publish crawlable, entity-rich content on authoritative domains. Build third-party coverage on review and comparison sites. |
| Inaccurate descriptions | Create canonical brand messaging repeated consistently across all assets. Make your About page direct and factual. See Chapter 4 (Factual Accuracy Rate). |
| Weak topic associations | Build topical clusters around your core use cases with clear internal linking and consistent entity references. See Chapter 3 (AI Topical Presence). |
| Hallucinated offerings | Publish corrective content that clearly states what you do and do not offer. Repeat it across multiple authoritative sources so the model encounters the correction consistently. |

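To keep the monthly audit comparable from one month to the next, it helps to log each finding in a fixed shape and attach the matching action from the table above. The sketch below is one possible structure; the class name `AuditFinding`, the issue keys, the shortened action text and the model label are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

# Issue types and (abbreviated) actions taken from the table above.
ISSUE_ACTIONS = {
    "brand not mentioned": "Publish crawlable, entity-rich content; build third-party coverage.",
    "inaccurate descriptions": "Create canonical brand messaging; make the About page direct and factual.",
    "weak topic associations": "Build topical clusters with clear internal linking.",
    "hallucinated offerings": "Publish corrective content stating what you do and do not offer.",
}


@dataclass
class AuditFinding:
    model: str   # which model was prompted (browsing disabled)
    issue: str   # must be one of the ISSUE_ACTIONS keys
    note: str    # what the model actually said
    found_on: date = field(default_factory=date.today)

    def __post_init__(self) -> None:
        if self.issue not in ISSUE_ACTIONS:
            raise ValueError(f"unknown issue type: {self.issue!r}")

    @property
    def action(self) -> str:
        return ISSUE_ACTIONS[self.issue]


finding = AuditFinding("model-a", "hallucinated offerings", "Claims we ship a mobile app.")
print(finding.found_on, finding.issue, "->", finding.action)
```

Rejecting unknown issue types at creation time keeps the log limited to the four categories, which makes month-over-month counts meaningful.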
🌐
Grounded search

What you can influence in grounded search

Grounded Search: Where You Can Act
Your own content

Optimise for retrieval and synthesis

  • Put your most important facts in the opening paragraph of every key page. Models extract early content most reliably.
  • Use clear, direct headings. Models scan structure before reading body text.
  • Target featured snippet formats: concise definitions, bullet-pointed answers, clear comparisons
  • Ensure crawlability: fast load times, mobile-friendly, fully indexable, schema markup
Content ecosystem

Shape what enters the retrieval pool

  • Identify third-party pages about your brand that rank well and may be retrieved alongside yours
  • Correct factual errors on review sites, comparison pages, and community forums
  • Build trust signals across platforms: author bios, third-party reviews, consistent NAP (name, address, phone) data, schema markup
  • Track which competitor pages are being retrieved for your brand queries and understand what they say
A note on trust signals for grounded platforms

If you are absent from grounded results on Perplexity or Google AI Overviews for queries where you should appear, it is often a trust signal problem rather than a content problem. These platforms are cautious about surfacing thin, promotional, or low-authority content. Before optimising copy, check the fundamentals: clear About and Contact pages, author attribution, third-party validation, and structured data.
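As an illustration of the structured data fundamentals mentioned above, the sketch below generates a minimal schema.org `Organization` block as JSON-LD. The helper name `organization_jsonld` and all values are placeholders. A real page would usually carry more properties, and whether any particular platform reads this markup is an assumption, not something this guide can confirm.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Build a minimal schema.org Organization description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # Profiles on other sites that refer to the same organisation.
        "sameAs": same_as,
    }
    return json.dumps(data, indent=2)


# Placeholder values for illustration only.
markup = organization_jsonld(
    name="Acme",
    url="https://www.example.com",
    same_as=["https://www.linkedin.com/company/example"],
)
print(markup)
```

The output is intended to be placed inside a `<script type="application/ld+json">` element in the page head. Keep the name and URL identical to what appears on your About and Contact pages.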

📊
Measurement

Why measurement requires both channels

Running prompts across both channels, and across multiple platforms, gives you a picture that neither alone can provide. Here is what each measurement tells you.

Off

Training data (browsing disabled)

What the model has learned about your brand from its training window. Stable, slow to change, and the primary channel for most ChatGPT and Claude brand queries. This is the baseline you are working to improve over months. Also relevant for how the model synthesises grounded results, since trained knowledge shapes how retrieved content is interpreted.

On

Grounded search (browsing enabled / Perplexity)

What is currently in the retrieval pool for your brand queries: your pages, competitor pages, third-party coverage. This reflects what a buyer sees on grounded platforms in near real time. Faster to influence but also faster to be affected by third-party content you do not control.

Gap

The gap between them

A large gap means training data is thin and live retrieval is compensating. Focus on long-term content and coverage to build the trained layer. A small gap where training data is wrong tells you the model has learned something incorrectly and live retrieval may be masking it. Fix what the model has already learned, not just what ranks.
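The gap reading can be made repeatable by scoring each channel on the same 0 to 1 scale (for example, mention rate or share of accurate answers across repeated runs) and applying fixed thresholds. The sketch below is one possible encoding of the two cases described above. The function name `read_gap` and both threshold values are assumptions chosen for illustration, and cases the text does not cover are returned for manual review rather than guessed at.

```python
def read_gap(
    training_score: float,
    grounded_score: float,
    large_gap: float = 0.3,     # assumed threshold for a "large" gap
    low_training: float = 0.5,  # assumed threshold for "training data is wrong"
) -> str:
    """Interpret browsing-off vs browsing-on scores, both on a 0..1 scale."""
    gap = grounded_score - training_score
    if gap >= large_gap:
        return "large gap: training data is thin, build long-term content and coverage"
    if abs(gap) < large_gap and training_score < low_training:
        return "small gap, weak training data: fix what the model has already learned"
    return "no rule applies: review manually"


print(read_gap(training_score=0.2, grounded_score=0.8))
print(read_gap(training_score=0.3, grounded_score=0.4))
print(read_gap(training_score=0.9, grounded_score=0.9))
```

Tune both thresholds against your own data; the defaults here have no empirical basis.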

How Waikay measures both channels

Waikay’s Brand Tracker topic reports test both channels for every tracked prompt. Each report shows you what the model says about your brand with browsing off and with browsing on, side by side. That comparison is the starting point for knowing which channel needs work, which strategy to prioritise, and whether what you are doing is actually moving the needle.