AI answer engines select sources through a multi-stage process: query → retrieval → re-ranking → excerpt extraction → citation. Each stage filters candidate pages, so a page from your business can be cited only if it survives every stage. The signals that decide whether a page survives differ meaningfully between ChatGPT, Perplexity, and Google AI Overviews (AIO).

The general architecture: retrieval-augmented generation

Modern AI answer engines use Retrieval-Augmented Generation (RAG). When a user submits a query: (1) a search component retrieves candidate documents; (2) a re-ranking model scores the candidates for relevance and quality; (3) the language model reads the top candidates and generates a synthesized answer; (4) source URLs are attached to claims in the answer. The critical insight is that selection happens at step 2 — a page that is retrieved but poorly structured will lose to a page that is both retrieved and densely informative.
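The four steps above can be sketched as a toy pipeline. This is an illustration only, not any engine's actual implementation: the `Page` type, the keyword-overlap retriever, the term-density re-ranker, and the sample URLs are all assumptions made for the example, and the "generation" step is replaced by quoting each top page's first sentence.

```python
from dataclasses import dataclass

PUNCT = ".,;:!?()\"'"


@dataclass
class Page:
    url: str
    text: str


def tokenize(text: str) -> list[str]:
    # Lowercase each word and strip surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def retrieve(query: str, pages: list[Page]) -> list[Page]:
    """Step 1: keep pages that share at least one term with the query."""
    terms = set(tokenize(query))
    return [p for p in pages if terms & set(tokenize(p.text))]


def rerank(query: str, candidates: list[Page], top_k: int = 2) -> list[Page]:
    """Step 2: order candidates by the share of their tokens that match the query."""
    terms = set(tokenize(query))

    def density(page: Page) -> float:
        tokens = tokenize(page.text)
        return sum(t in terms for t in tokens) / len(tokens)

    return sorted(candidates, key=density, reverse=True)[:top_k]


def answer_with_citations(query: str, pages: list[Page]) -> list[tuple[str, str]]:
    """Steps 3-4: stand-in for generation, pairing an excerpt with its source URL."""
    top = rerank(query, retrieve(query, pages))
    return [(p.text.split(". ")[0], p.url) for p in top]


# Hypothetical sample corpus.
pages = [
    Page("https://example.com/padded",
         "Welcome to our blog about many things we love. "
         "Somewhere far below we mention crawler access once."),
    Page("https://example.com/dense",
         "Robots.txt controls crawler access. Allow the crawler to be cited."),
    Page("https://example.com/offtopic", "Our bakery sells sourdough bread."),
]
results = answer_with_citations("crawler access", pages)
```

Both the padded and the dense page are retrieved, but the dense page ranks first because more of its text addresses the query, and the padded page's excerpt is its irrelevant opening sentence. The off-topic page never reaches re-ranking.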

ChatGPT (with web search)

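ChatGPT's web search retrieves candidates from third-party search providers, with Bing the most widely reported, then appears to apply a re-ranking step to select the best passages. OpenAI does not publish its ranking criteria, so the factors below are observations rather than documented rules: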

  • Crawler access: OpenAI documents separate robots.txt user agents: OAI-SearchBot (surfacing pages in ChatGPT search), ChatGPT-User (fetches triggered by a user's request), and GPTBot (collecting training data). Blocking OAI-SearchBot is what removes a page from ChatGPT search; blocking GPTBot alone opts the page out of training, not search. An overly broad block is a common and easily fixed barrier.
  • Answer-first content structure: ChatGPT tends to excerpt the first complete answer it finds. Pages that open with the direct answer tend to be cited with an accurate excerpt; pages that bury the answer yield low-quality excerpts.
  • Bing index presence: because retrieval leans on Bing's index, pages that Bing has not indexed are unlikely to surface. If Bing's coverage of your site is low, submit your sitemap via Bing Webmaster Tools.
  • JSON-LD schema: FAQPage and Article markup appear to improve passage identification.
  • Page freshness signals: recently updated pages (reflected in dateModified in Article schema) appear to be preferred for time-sensitive queries.
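As a minimal sketch of the Article and FAQPage markup mentioned above, the following builds both objects with the standard library and wraps them in a JSON-LD script tag. The property names are schema.org vocabulary; the helper names and sample values are hypothetical, and real pages usually carry more properties (author, publisher, and so on).

```python
import json
from datetime import date


def article_jsonld(headline: str, published: date, modified: date) -> dict:
    """Article markup with the dateModified freshness signal."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }


def faq_jsonld(qa_pairs: list[tuple[str, str]]) -> dict:
    """FAQPage markup: one Question/Answer entry per (question, answer) pair."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


def as_script_tag(data: dict) -> str:
    """Serialize for embedding in HTML; escape "</" so text cannot close the tag early."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


# Hypothetical sample values.
article = article_jsonld("How AI engines pick sources", date(2024, 1, 5), date(2024, 6, 1))
faq = faq_jsonld([("Does robots.txt affect citations?", "Yes, blocked pages cannot be cited.")])
tag = as_script_tag(faq)
```

Keeping dateModified tied to real content changes matters more than the markup itself: a date that moves without the text changing is a claim the page cannot back up.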

Perplexity

Perplexity operates its own web crawler (PerplexityBot) and builds a proprietary index. Its re-ranking is notably citation-aggressive — it frequently cites more sources per answer than ChatGPT and tends to pull from longer-form, structured content. Key signals:

  • PerplexityBot access: like OpenAI's crawlers, PerplexityBot must be allowed in robots.txt. llms.txt and agents.json are proposed conventions, and Perplexity has not documented support for either, so treat them as optional extras rather than ranking inputs.
  • Subheading density: Perplexity's extraction appears to pull from H2/H3 sections specifically. Pages with clear subheadings that match query intent are cited at the section level, not just the page level.
  • Numerical specificity: Perplexity appears to strongly prefer pages with specific data (percentages, dates, named entities). Pages that make only vague qualitative claims tend to lose out to pages with numbers.
  • Source diversity signals: Perplexity appears to avoid over-citing a single domain and to deliberately include diverse source types (news, official docs, expert blogs).
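Section-level citation and numerical specificity can both be checked on your own pages. Perplexity's extraction model is not public, so the sketch below is only an approximation under stated assumptions: it splits HTML at each H2/H3 heading with the standard-library parser and counts numeric tokens per section. The class name, regex, and sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

# Integers, decimals, and percentages such as 2023, 12.50, or 20%.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


class SectionSplitter(HTMLParser):
    """Collect the text that follows each H2/H3 heading as one section."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[list[str]] = []  # each item is [heading, body]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.sections.append(["", ""])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading belongs to no section
        self.sections[-1][0 if self._in_heading else 1] += data


def split_sections(html: str) -> list[tuple[str, str]]:
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so sections compare cleanly.
    return [(h.strip(), " ".join(b.split())) for h, b in parser.sections]


def count_numbers(text: str) -> int:
    return len(NUMBER.findall(text))


sample = (
    "<h2>Pricing</h2><p>Plans start at 12.50 per month, a 20% drop since 2023.</p>"
    "<h3>Support</h3><p>We offer great support.</p>"
)
report = [(heading, count_numbers(body)) for heading, body in split_sections(sample)]
```

A section that scores zero, like the "Support" section here, is a candidate for adding a concrete figure such as a response time or a coverage window.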

Google AI Overviews (AIO)

Google AIO is the most complex system because it integrates directly with Google's existing search infrastructure. Source selection appears to combine traditional search ranking signals (PageRank among them) with AI-specific quality factors:

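  • E-E-A-T signals: Experience, Expertise, Authoritativeness, Trustworthiness. AIO appears to weight heavily pages from demonstrable experts (author credentials, organization affiliation, consistent topical coverage).
  • Featured snippet eligibility: pages that already appear in featured snippets for a query are disproportionately selected for AIO. Featured snippet optimization (direct answers, table formatting, numbered lists) therefore feeds AIO selection as well.
  • Freshness: for time-sensitive queries, AIO appears to prefer recently updated pages, roughly those updated within the last 90 days. Google has not published a threshold, so treat the figure as a rule of thumb.
  • Googlebot access and snippet controls: Google-Extended is a robots.txt control token, not a separate crawler. Per Google's crawler documentation, it governs whether content is used for Gemini model training and grounding, and it does not affect inclusion or ranking in Google Search. AI Overviews draw on the Search index crawled by Googlebot, so blocking Googlebot or restricting snippets (nosnippet, max-snippet) is what limits AIO eligibility.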

What all three systems agree on

Despite their differences, all three AI answer engines tend to favor pages that: (1) provide the direct answer to the query in the first paragraph; (2) use clear H2/H3 semantic structure; (3) are accessible to the engine's own crawler; (4) have valid structured data; (5) come from domains with external references (backlinks, directory listings, citations in other content). These five signals form the practical baseline for AI citation eligibility across all major systems.
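Crawler accessibility, the third signal, is the easiest to verify. The sketch below uses Python's standard `urllib.robotparser` to test a robots.txt body against several crawler tokens. The token list is an assumption drawn from the vendors' published crawler names and should be checked against their current documentation; the function name and sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; verify against each vendor's crawler documentation.
AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "Googlebot")


def crawler_access(robots_txt: str, url: str, agents=AI_CRAWLERS) -> dict[str, bool]:
    """Return, per crawler token, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt that blocks one crawler and allows everyone else.
sample_rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
access = crawler_access(sample_rules, "https://example.com/pricing")
```

To audit a live site, fetch its `/robots.txt` and pass the body in. The parser applies Python's own matching rules, which may differ in edge cases (wildcards, rule precedence) from how a given crawler interprets the file.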

What gets pages excluded

Pages tend to be filtered out when they: block the relevant crawler in robots.txt; require authentication or client-side JavaScript to render (many AI crawlers do not execute JavaScript, so single-page apps can look empty to them); contain thin content (under roughly 300 words); have no external links pointing to them; or sit on domains the engine's crawler has never been able to fetch. Paywalled content is excluded for the same reason as login-gated pages: a crawler that cannot read the text cannot cite it.
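Two of these exclusion conditions, thin content and JavaScript-only rendering, can be approximated by counting the words present in the raw HTML before any script runs. The sketch below is a rough heuristic, not how any engine measures content: the 300-word floor comes from the text above, while the 20-word "empty shell" cutoff, the flag names, and the function names are assumptions.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words outside <script> and <style>, and count script tags."""

    def __init__(self) -> None:
        super().__init__()
        self.words: list[str] = []
        self.script_count = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def exclusion_flags(html: str, min_words: int = 300, shell_words: int = 20) -> list[str]:
    """Flag raw HTML that looks thin or dependent on client-side rendering."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    flags = []
    if len(parser.words) < min_words:
        flags.append("thin_content")
    # Almost no static text plus at least one script suggests a JS-rendered shell.
    if len(parser.words) < shell_words and parser.script_count > 0:
        flags.append("likely_needs_javascript")
    return flags


# Hypothetical inputs: a single-page-app shell and a server-rendered article.
spa_shell = '<html><body><div id="root"></div><script>var app = "lots of words";</script></body></html>'
article = "<html><body><p>" + "word " * 350 + "</p></body></html>"
```

Run this against the response body your server returns to a plain HTTP client, not against the DOM in a browser, since the point is to see what a non-rendering crawler sees.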