GEO · AI Search

Schema markup won't get you cited by AI. Here's what does.

Half the advice circulating about AI search optimization comes down to this: add schema markup, sprinkle in some JSON-LD, and ChatGPT will start citing you. It won't. Schema is worth doing, and I've done it on this site. But it's a hygiene step, not the strategy itself. The things that actually get you cited by ChatGPT, Perplexity, Google AI Overviews, and Claude are different, and most people writing about GEO haven't separated the two.

Here's what I know from building and monitoring this site as a GEO experiment in real time, using my own infrastructure instead of third-party SaaS, and tracking what actually moves. The fundamentals are less mysterious than the consultants want you to think.

Why schema markup alone doesn't make AI models cite you

Schema markup tells search engines' structured-data parsers how to interpret your page. It's useful for rich results and Knowledge Panel eligibility, and it's easy to verify with Google's and Bing's validators. What it doesn't do is make an AI language model trust you as a source.

The reason is basic: large language models don't parse schema at retrieval time. When Perplexity or ChatGPT surfaces a source in a response, it's pulling from an index of page content, not from JSON-LD blocks embedded in your HTML. The schema may have helped the page get crawled and indexed correctly, which matters. But the citation decision is driven by content signals, not schema signals.
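To make the distinction concrete, here is a minimal sketch of a text-extraction step that keeps visible copy and drops script contents. It assumes the content index is built from visible text only, which is an assumption about how such pipelines commonly work and not a documented behavior of any specific AI product. The sample page and the class name are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one JSON-LD block plus a question heading and an answer.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">{"@type": "Person", "name": "Example Person"}</script>
</head><body>
<h1>Does schema markup help you get cited?</h1>
<p>Indirectly, yes.</p>
</body></html>
"""

text = visible_text(SAMPLE_PAGE)
print(text)
```

Under that assumption, the heading and the answer survive extraction and the JSON-LD block does not, which is why the on-page wording carries the citation weight.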

Direct answer
Does schema markup help you get cited by ChatGPT or Perplexity?

Indirectly, yes. Schema helps your pages get correctly indexed by crawlers that feed AI retrieval systems. But schema alone does not make an AI model prefer your content as a citation. The citation signals are content-side: entity consistency, direct-answer structure, and factual authority. Schema is a prerequisite, not a lever.

I run Person JSON-LD with a stable @id anchor, ProfessionalService, FAQPage, and WebSite schema on this site. All of it is worth doing. None of it is the reason a model would quote my "$800K+ in client savings" claim instead of someone else's.

What actually drives AI citations in 2026

After reading the research and running this site as a live experiment, the citation drivers come down to three things. They're not complicated. They're just not as marketable as a new schema type.

1. Entity consistency across surfaces

AI models and their retrieval layers build a picture of who you are from multiple signals. Your LinkedIn headline, your GitHub profile, your site's Person schema, and the bio on every post all need to say the same thing with the same numbers.

Not approximately the same. The same. When a model pulls from multiple indexed sources and sees inconsistent claims, the confidence in any single claim drops. When every surface agrees, the model has a consistent, citable entity to reference.

On this site, I use a knowledge/FACTS.md file as the single source of truth. Every number that appears anywhere on the web in connection with my name traces back to that file. The Person JSON-LD sameAs links point to GitHub and LinkedIn. The proof numbers on the homepage, in /llms.txt, in /llms-full.txt, and in blog posts are identical. That's the point.

Drift is the enemy. A LinkedIn headline that says "$800K saved" and a site that says "$800K+" aren't the same claim to a model trying to verify consistency. Pick the canonical form and enforce it everywhere.
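A drift check like this is easy to automate. The sketch below is one possible approach, not the tooling used on this site: it scans copies of each surface's text for dollar figures and flags any that differ from the canonical form. The surface names and the LinkedIn wording are hypothetical, and the regular expression only covers dollar claims such as "$800K+".

```python
import re

# Matches dollar figures such as "$800K", "$800K+", or "$1.2M".
DOLLAR_CLAIM = re.compile(r"\$\d+(?:\.\d+)?[KMB]\+?")


def find_drift(canonical: str, surfaces: dict) -> dict:
    """Return {surface_name: [variants]} for dollar claims that differ from canonical."""
    drift = {}
    for name, text in surfaces.items():
        variants = [m for m in DOLLAR_CLAIM.findall(text) if m != canonical]
        if variants:
            drift[name] = variants
    return drift


# Hypothetical copies of the text published on each surface.
surfaces = {
    "homepage": "$800K+ in hard-dollar client savings",
    "linkedin_headline": "Automation consultant. $800K saved for clients.",
    "llms_txt": "Proof: $800K+ in client savings",
}

drift = find_drift("$800K+", surfaces)
print(drift)
```

Note the limit of the check: a surface that never mentions the claim at all is not flagged, only one that states it differently.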

2. Sourced direct-answer blocks

Perplexity, Google AI Overviews, and similar systems are retrieval-augmented. They're not summarizing prose. They're looking for the shortest reliable path from a query to a defensible answer.

That means content shaped like a direct answer, not content that buries the answer in paragraph six after three setup sentences. The question-shaped headings on this page are deliberate. The answer blocks aren't a design flourish. They're the format that retrieval systems can extract and surface cleanly.

The FAQ section on my main site has FAQPage schema, but more importantly, each answer is a tight, complete thought. It can stand alone. That's what gets cited: standalone, factually specific sentences that survive context stripping.

Here's the way I think about it. If you took one paragraph out of your page and dropped it raw into a chat response, would it make sense and be credible? If not, it probably won't get surfaced. Write for that extraction condition.
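One way to keep the on-page answer and the FAQPage markup from drifting apart is to generate both from the same question/answer pair. This is a minimal sketch under that assumption. The function names and the h3/p markup are illustrative choices, and the answer text is an excerpt of the direct-answer block earlier in this post.

```python
import json
from html import escape

# Question and an excerpt of the answer from the direct-answer block on this page.
FAQ_PAIRS = [
    (
        "Does schema markup help you get cited by ChatGPT or Perplexity?",
        "Indirectly, yes. Schema helps your pages get correctly indexed "
        "by crawlers that feed AI retrieval systems.",
    ),
]


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def faq_html(pairs):
    """Render the same pairs as visible markup (hypothetical h3/p layout)."""
    return "\n".join(
        f"<h3>{escape(q)}</h3>\n<p>{escape(a)}</p>" for q, a in pairs
    )


structured = faq_jsonld(FAQ_PAIRS)
visible = faq_html(FAQ_PAIRS)
print(json.dumps(structured, indent=2))
print(visible)
```

Because both outputs come from one list, the structured answer cannot say something the visible answer doesn't.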

3. Being a real, citable, consistent authority

This is the uncomfortable one because it can't be hacked with a plugin. AI systems increasingly weigh credibility signals, both in what they're trained on and in what their retrieval layers choose to surface. What makes you credible in this context is similar to what made you credible in traditional SEO: consistent publication, accurate claims, corroboration from external sources, and a stable identity over time.

A page that has been live for six months, with consistent facts, linked from real profiles, with no contradictory claims anywhere, will outperform a freshly published page with perfect schema, every time.

I built this site's GEO foundation in a specific order. First, the owned content layer: Person schema with a stable @id at https://lorenzespinosa.github.io/#lorenz, the same across every page. Then entity links: the sameAs array in the Person block pointing to GitHub and LinkedIn. Then the AI-readable layer: /llms.txt and /llms-full.txt committed to the repo root, with the same FACTS-consistent numbers. Then explicit crawler stanzas in robots.txt for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended.
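For readers who want to see the shape of that entity anchor, here is a minimal sketch that builds a Person block and wraps it in a script tag. The @id is the one quoted above. The name and the two sameAs URLs are placeholders, not the real profile values, and the helper names are made up for this example.

```python
import json

PERSON_ID = "https://lorenzespinosa.github.io/#lorenz"  # stable @id from the post


def person_jsonld(name, same_as):
    """Build a schema.org Person object anchored to a stable @id."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "@id": PERSON_ID,
        "name": name,
        "url": "https://lorenzespinosa.github.io/",
        "sameAs": list(same_as),
    }


def as_script_tag(data):
    """Wrap a JSON-LD object in the script element used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


person = person_jsonld(
    name="Your Name",  # placeholder
    same_as=[
        "https://github.com/example-user",  # hypothetical profile URL
        "https://www.linkedin.com/in/example-user",  # hypothetical profile URL
    ],
)
block = as_script_tag(person)
print(block)
```

The same block, with the same @id, would be pasted into every page so that other schema types can point back to it.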

The explicit AI crawler stanzas didn't change indexing behavior in practice. The wildcard User-agent: * allow already covers all of them. But the intent is legible, and that matters to the extent AI companies read explicit permission signals when deciding what to include in training or retrieval datasets. The cost is two lines per bot. I'll take it.
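The claim that the wildcard already covers these bots can be checked with Python's standard robots.txt parser. The sketch below generates the explicit stanzas and compares them against a wildcard-only file. It assumes an "Allow: /" wildcard rule as described above, and the helper names are illustrative.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"]


def build_robots_txt(bots):
    """Wildcard allow plus one explicit two-line stanza per bot."""
    lines = ["User-agent: *", "Allow: /", ""]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    return "\n".join(lines)


def allowed(robots_txt, user_agent, url="/"):
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


wildcard_only = "User-agent: *\nAllow: /\n"
explicit = build_robots_txt(AI_BOTS)

for bot in AI_BOTS:
    # Both columns agree: the explicit stanzas do not change what is allowed.
    print(bot, allowed(wildcard_only, bot), allowed(explicit, bot))
```

Both files give the same answer for every bot, which matches the observation that the explicit stanzas are a signal of intent and not a change in access.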

What about llms.txt?

The honest assessment of /llms.txt as of mid-2026: it's a 30-minute hygiene step with unclear direct citation impact, and I'd still recommend doing it.

Server-log analysis shows the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) mostly skip /llms.txt and crawl HTML pages directly. No major AI provider has formally committed to using it. An SE Ranking analysis of 300K domains found near-zero correlation between having /llms.txt and being cited by AI models.

But "near-zero correlation" doesn't mean useless. What it means is: /llms.txt isn't the lever. It's the signal that you've thought about this. And the compressed, Markdown-formatted context it provides is genuinely useful if any model is trained or fine-tuned on crawl data that happens to include it.
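If you have access logs, the crawler behavior described above is something you can check yourself. This is a rough sketch that assumes logs in the common "combined" format. The sample lines, IP addresses, and user-agent strings are invented for illustration and are not real traffic.

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")
LLMS_PATHS = ("/llms.txt", "/llms-full.txt")

# Pulls the request path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_hits(lines):
    """Count requests per (bot, bucket), where bucket is 'llms_files' or 'other'."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        bucket = "llms_files" if match.group("path") in LLMS_PATHS else "other"
        for bot in AI_BOTS:
            if bot.lower() in agent:
                hits[(bot, bucket)] += 1
                break
    return hits


# Invented sample lines; a real run would read them from a log file.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /blog/post.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.6 - - [01/Jun/2026:10:01:00 +0000] "GET /llms.txt HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [01/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 7000 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.9 - - [01/Jun/2026:10:03:00 +0000] "GET / HTTP/1.1" 200 7000 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

hits = count_ai_hits(SAMPLE_LOG)
print(hits)
```

Comparing the "llms_files" bucket against "other" for each bot over a month of real logs shows whether a crawler reads the file or goes straight to the HTML.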

My /llms.txt is a 40-line document: who I am, the one-line positioning, proof numbers, services, and links. My /llms-full.txt goes deeper: full case studies, the token-tax thesis, and the method. Both are committed to the repo root. If a crawler reads them, the model gets a clean, context-window-safe digest that matches what every other surface says. That's the point.
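As a rough illustration of the format, the sketch below renders a short llms.txt from a single facts structure, loosely following the proposed layout of an H1 title, a blockquote summary, and Markdown sections. The proof numbers and summary wording come from this post. The section names and the function are assumptions, and the output is not the actual file on this site.

```python
def render_llms_txt(title, summary, proof, links):
    """Render a small Markdown digest: title, summary, proof list, link list."""
    lines = [f"# {title}", "", f"> {summary}", "", "## Proof", ""]
    lines += [f"- {value} {label}" for value, label in proof]
    lines += ["", "## Links", ""]
    lines += [f"- [{name}]({url})" for name, url in links]
    return "\n".join(lines) + "\n"


# Proof numbers as stated in this post; everything is rendered from one structure.
PROOF = [
    ("50+", "operational processes automated in production"),
    ("$800K+", "in hard-dollar client savings"),
    ("+30%", "lead conversion on a capped AI budget"),
    ("-70%", "manual data entry on a 3-system intake pipeline"),
]

llms_txt = render_llms_txt(
    title="lorenzespinosa.github.io",
    summary="Deterministic, self-hosted automation for ops-heavy teams.",
    proof=PROOF,
    links=[
        ("Home", "https://lorenzespinosa.github.io/"),
        ("Full context", "https://lorenzespinosa.github.io/llms-full.txt"),
    ],
)
print(llms_txt)
```

Rendering the file from the same structure that feeds the other surfaces is what keeps the numbers identical.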

Direct answer
Should you add llms.txt to your site in 2026?

Yes. Not because it's proven to drive citations directly, but because the cost is minimal (~30 minutes), it demonstrates GEO awareness, and it provides a clean context-window-safe summary for any model that does read it. Do it after you've handled entity consistency and direct-answer content structure, not instead of them.

The worked example: this site

There's a reason I'm writing this post on my own site instead of publishing it somewhere else. GEO-optimized content about GEO, on a site that practices what it describes, is itself a credibility signal. The meta-credibility isn't accidental.

Here's the full foundation as it exists on this site right now, in the order I built it:

  • Person JSON-LD with stable @id. The anchor is https://lorenzespinosa.github.io/#lorenz, the same on every page. sameAs links to GitHub and LinkedIn. This is the entity anchor that everything else hangs from.
  • ProfessionalService, FAQPage, WebSite schema. All on the main site. The FAQ schema wraps the exact text of the FAQ section, so the same answers that appear on-page are also machine-readable in structured form.
  • Sitemap and IndexNow. sitemap.xml committed to repo root, submitted to Google Search Console and Bing Webmaster Tools. When new content is added, an n8n workflow posts to the IndexNow endpoint to notify Bing and every other engine that participates in the protocol. Google doesn't support IndexNow yet, so for Google the sitemap and Search Console do the work.
  • Explicit AI crawler stanzas in robots.txt. GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended. Each with Allow: /. The wildcard covers them already. The explicit stanzas are the signal.
  • llms.txt and llms-full.txt. Both in repo root. Consistent with FACTS.md numbers. Same entity, same claims, same links.
  • BlogPosting schema on each post. Including this one. Author resolves to the same @id as the Person block on the homepage.
  • OG and Twitter Card tags on every page. Required for social sharing previews and increasingly used by LLM systems to confirm a page's claimed metadata.
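The IndexNow step in the list above runs through n8n on this site. For readers without n8n, the same submission can be sketched in plain Python. This is an illustration under assumptions and not the workflow itself: the key value and the post URL are placeholders, and the key file is assumed to sit at the site root.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow JSON submission for a list of URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumes the key file is hosted at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    host="lorenzespinosa.github.io",
    key="0123456789abcdef0123456789abcdef",  # placeholder key
    urls=["https://lorenzespinosa.github.io/blog/new-post.html"],  # hypothetical URL
)
print(request.get_method(), request.full_url)

# To actually submit, uncomment the next line:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes the payload easy to inspect before anything leaves the machine.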

This is a static site. Vanilla HTML, CSS, JS. No build pipeline, no CMS, no plugin ecosystem. Everything I just described is either committed as a file, written into a <script type="application/ld+json"> block, or triggered by an n8n webhook on git push. The no-build constraint is a feature, not a limitation. It means the foundation is transparent, maintainable by one person, and not at the mercy of plugin updates or framework rewrites.
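With no build pipeline, each post's JSON-LD block has to be written by hand or produced by a small script. Here is a minimal sketch of the second option for a BlogPosting whose author resolves to the Person @id. The headline is this post's title. The post URL and the publication date are placeholders, and the function name is made up.

```python
import json

PERSON_ID = "https://lorenzespinosa.github.io/#lorenz"  # same @id as the Person block


def blogposting_jsonld(headline, url, date_published, author_id=PERSON_ID):
    """Build a schema.org BlogPosting whose author is a reference to an @id."""
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "url": url,
        "datePublished": date_published,
        # Referencing the @id links this post to the existing Person entity
        # instead of defining a second, possibly inconsistent author object.
        "author": {"@id": author_id},
    }


post = blogposting_jsonld(
    headline="Schema markup won't get you cited by AI. Here's what does.",
    url="https://lorenzespinosa.github.io/blog/example-post.html",  # hypothetical URL
    date_published="2026-06-01",  # placeholder date
)
print(json.dumps(post, indent=2))
```

The printed JSON goes inside the script block on the post page.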

The receipt that makes this credible

GEO without verifiable claims is just content. The reason any of this infrastructure is worth building is that I have numbers to put in it.

50+
operational processes automated in production
$800K+
in hard-dollar client savings
+30%
lead conversion on a capped AI budget
-70%
manual data entry on a 3-system intake pipeline

These numbers are in /llms.txt, in /llms-full.txt, in the Person schema, in the FAQ schema answers, in the hero copy, and in every case study. They are the same numbers everywhere because they trace back to a single source file I update first before propagating to any surface.

That's the thing most people skip. They optimize the schema and forget the claims. A model that encounters "$800K+" across six independently indexed surfaces, all pointing back to the same entity anchor, treats that claim with more confidence than a claim that appears once on a freshly published page. Consistency at scale is the actual GEO lever. Schema is the delivery mechanism.

What to actually do, in order

If you're building a personal brand or a consulting presence and you want AI search to work for you, here's the order that makes sense:

  1. Lock your canonical facts first. One source of truth for every claim. Numbers, titles, service descriptions, links. Everything else is downstream of this.
  2. Build entity consistency across surfaces. Website, LinkedIn, GitHub, and any other indexed profile need to agree. Same facts, same numbers, same positioning.
  3. Structure content for extraction. Question-shaped headings, direct-answer first paragraphs, standalone factual blocks. Write for the paragraph that survives context stripping.
  4. Add schema as the machine-readable layer. Person with stable @id, sameAs to corroborating profiles, FAQPage schema on FAQ sections, BlogPosting schema on posts. This is how structured data pays off: as a supplement to good content, not a substitute for it.
  5. Add llms.txt and llms-full.txt. One file for the brief summary, one for the extended context. Consistent with everything else.
  6. Add explicit AI crawler stanzas to robots.txt. Two lines per bot. Do it after the content is worth crawling.
  7. Measure with something real. Google Search Console query performance filtered to name queries, manual prompt tests in ChatGPT and Perplexity logged monthly, GA4 referral source tracking for perplexity.ai and chatgpt.com traffic. Otterly.ai if you want an automated dashboard.

The order matters. Step four without step one is expensive decoration. Step six without step three is inviting crawlers into content that won't get cited anyway.
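For the measurement step, referral tracking comes down to recognizing AI-assistant hostnames in referrer URLs. The sketch below shows that classification on exported referrer strings. It covers only the two domains named above, the sample referrers are invented, and it does not call the GA4 API.

```python
from collections import Counter
from urllib.parse import urlparse

# Only the two referrer domains named in the measurement step.
AI_REFERRER_HOSTS = {
    "perplexity.ai": "Perplexity",
    "chatgpt.com": "ChatGPT",
}


def classify_referrer(referrer):
    """Return the AI source label for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in AI_REFERRER_HOSTS.items():
        # Match the bare domain or any subdomain, but not lookalike domains.
        if host == domain or host.endswith("." + domain):
            return label
    return None


# Invented referrer strings standing in for an analytics export.
SAMPLE_REFERRERS = [
    "https://www.perplexity.ai/search?q=geo",
    "https://chatgpt.com/",
    "https://www.google.com/",
    "https://notchatgpt.com/",
    "",
]

counts = Counter(
    label for label in map(classify_referrer, SAMPLE_REFERRERS) if label
)
print(counts)
```

Logged monthly next to the manual prompt tests, these counts give a trend line instead of an impression.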

The honest state of GEO in mid-2026

This field is only a few years old. The tactics that work are the ones that have always worked in search: real authority, real claims, real consistency, published over time. The GEO-specific layer adds structure to help AI retrieval systems surface that authority more cleanly.

Nobody has fully decoded what makes a model cite one source over another. The research consensus points toward entity authority and direct-answer structure. That's what I'm optimizing for. The schema, the llms.txt, the robots stanzas, the canonical JSON-LD, all of it is in service of the same goal: making it as easy as possible for any retrieval system to say "this person says X, consistently, from a credible, corroborated identity."

The schema is not the strategy. The strategy is building something citable.

Work with me

Need help building AI automation that's actually production-ready?

I build deterministic, self-hosted automation for ops-heavy teams, and add AI only where it earns its keep. 50+ processes automated, $800K+ in client savings. Free 15-min call, no obligation.
