LLMs.txt URL Curation: How to Choose the 30 Links That Define Your Entity to AI


About Will

I run a multi-site content operation on Claude and Notion with autonomous agents — and I write about what we do, including what breaks.


Last week we covered the four-element spec and the robots.txt pairing. This week is the harder problem: assuming you already know how to ship the file, what goes inside it? Curation is where almost every llms.txt implementation falls apart, and it is the only decision in the file that actually affects how AI systems represent you.

This is the URL-selection playbook. No spec recap. No “why llms.txt matters” framing. If you already have a file in production and you suspect it is doing nothing for you, the problem is almost certainly the link list — and this guide is the diagnostic.

The Failure Mode Almost Everyone Hits

The default impulse when building an llms.txt file is to dump the sitemap, mirror the top nav, or copy the breadcrumb hierarchy. All three produce a file that is technically valid and functionally useless. Independent audits documented in the State of llms.txt 2026 report and the Codersera 2026 analysis both flag the same root cause: AI systems weight density, not breadth. A file with 200 URLs of mixed quality signals nothing distinctive; a file with 30 URLs that each define a piece of your entity signals exactly what you are the authority on.

The principle from the official spec is curated context, not full coverage. Treat the file as a one-page editorial brief on what your site is for. Anything that does not contribute to that brief is noise.

The Five Buckets

A working llms.txt link list breaks into five buckets. Aim for 25 to 40 total entries across all five.

Bucket 1: Entity-defining pages (5–8 URLs). The pages where your business defines what it is. Service pages for what you sell. Methodology pages explaining your approach. The “what we do” hub. These are the highest-priority entries and should appear in your first ## Core Resources section.

Bucket 2: Answer-dense reference content (8–12 URLs). Long-form guides that answer a specific question end-to-end. Glossaries. Comparison pages. Technical documentation. The content AI systems are most likely to cite when answering a query.

Bucket 3: Proof and case studies (4–8 URLs). Documented outcomes. Customer stories with specifics. Before-and-after evidence. AI systems weight verifiable claims more heavily; give them something to verify.

Bucket 4: Active editorial (4–8 URLs). Recent articles representing current expertise. Rotate these quarterly. Stale editorial drags entity coherence.

Bucket 5: Optional supporting context (3–5 URLs). About, contact, terms, accessibility. Goes in the final ## Optional section, which the spec explicitly marks as lower priority.

If you cannot place a URL in one of those five buckets, it does not belong in the file.
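
The bucket rule is easy to enforce mechanically. What follows is a minimal sketch, assuming you keep your candidate list as a mapping from URL to bucket name; the bucket keys and the `check_buckets` function are illustrative names, and the ranges are copied from the five buckets above.

```python
from collections import Counter

# Bucket name -> (min, max) entry counts, taken from the ranges above.
BUCKET_RANGES = {
    "entity": (5, 8),
    "reference": (8, 12),
    "proof": (4, 8),
    "editorial": (4, 8),
    "optional": (3, 5),
}
TOTAL_RANGE = (25, 40)


def check_buckets(assignments):
    """Return a list of readable problems for a {url: bucket} mapping."""
    problems = []
    # A URL with no recognized bucket does not belong in the file.
    for url, bucket in assignments.items():
        if bucket not in BUCKET_RANGES:
            problems.append(f"{url}: no bucket, drop it from the file")
    counts = Counter(b for b in assignments.values() if b in BUCKET_RANGES)
    for bucket, (low, high) in BUCKET_RANGES.items():
        n = counts.get(bucket, 0)
        if not low <= n <= high:
            problems.append(f"{bucket}: {n} entries, expected {low}-{high}")
    total = sum(counts.values())
    low, high = TOTAL_RANGE
    if not low <= total <= high:
        problems.append(f"total: {total} entries, expected {low}-{high}")
    return problems
```

An empty list means the draft fits the shape described here. It says nothing about whether the individual URLs were well chosen; that part stays editorial.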

The Curation Worksheet

Here is the decision sheet that turns five buckets into 30 URLs. Run it once, then version-control the output.

| Step | Action | Output |
|------|--------|--------|
| 1 | Pull your 50 highest-traffic pages from GA4. | Raw candidate list. |
| 2 | Cross-reference with your sitemap to surface evergreen pages not in the top 50. | Expanded candidate pool. |
| 3 | Score each URL: does it define a piece of the entity? (Y/N) | Bucket 1 candidates. |
| 4 | Score each URL: does it answer a discrete question end-to-end? (Y/N) | Bucket 2 candidates. |
| 5 | Tag every page with the topical cluster it serves. | Cluster map. |
| 6 | Within each cluster, keep the single strongest representative. | Deduplicated list. |
| 7 | Write a one-sentence description for each URL that describes what it contains, not what it is optimized for. | Final list. |
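
Steps 5 and 6 are the ones most worth scripting. Here is a rough sketch, assuming each candidate already carries its Y/N scores and a cluster tag. The `Candidate` fields and the use of sessions as the measure of "strongest" are assumptions for illustration; substitute whatever strength signal you trust.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    cluster: str            # step 5: topical cluster tag
    defines_entity: bool    # step 3: Y/N
    answers_question: bool  # step 4: Y/N
    sessions: int           # assumed strength signal, e.g. GA4 sessions


def strongest_per_cluster(candidates):
    """Step 6: keep one representative per cluster, sorted by cluster name."""
    best = {}
    for candidate in candidates:
        current = best.get(candidate.cluster)
        # On a tie the first candidate seen is kept.
        if current is None or candidate.sessions > current.sessions:
            best[candidate.cluster] = candidate
    return sorted(best.values(), key=lambda c: c.cluster)
```

Traffic alone is a blunt proxy. Treat the output as a shortlist to review by hand before writing the step 7 descriptions.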

The single most common error in step 7 is reverting to meta-description voice — keyword-stuffed promises instead of literal descriptions. AI systems parse these literally. “This explains our pricing tiers and what each includes” is read as a factual claim about what the page contains. “Affordable enterprise SaaS pricing solutions” is read as marketing copy and discounted.

A Worked Example Across Buckets

Here is a realistically shaped llms.txt for a hypothetical content-marketing agency, showing how the bucket structure looks in production:

# Anchor Studio

> Anchor Studio is a content strategy agency for B2B SaaS companies between
> $5M and $50M in ARR. We build topical authority programs combining
> traditional SEO, GEO, and answer engine optimization across the full
> funnel.

## Core Resources

- [Our Methodology](https://anchor.studio/methodology): The full eight-stage
  process from topic discovery through measurement.
- [Topical Authority Framework](https://anchor.studio/topical-authority): How
  we map content clusters to entity definitions.
- [Service Tiers](https://anchor.studio/services): What we sell at each
  engagement level and what is included.

## Reference Guides

- [B2B SaaS Content Audit Checklist](https://anchor.studio/audit): The
  72-point audit we run before every engagement.
- [GEO Implementation Guide](https://anchor.studio/geo): How to optimize
  content for AI citation across ChatGPT, Claude, and Perplexity.
- [AEO Featured Snippet Playbook](https://anchor.studio/aeo): Structural
  patterns that win the answer box.

## Case Studies

- [SaaS Company A: Citation Lift Case Study](https://anchor.studio/case-a):
  Documented 90-day citation tracking across four AI platforms.
- [SaaS Company B: Editorial Rebuild](https://anchor.studio/case-b): Full
  content architecture rebuild and the traffic outcome.

## Recent Editorial

- [The 2026 GEO Landscape](https://anchor.studio/2026-landscape): Current
  state of AI search optimization and what is changing.
- [Why Most Content Audits Fail](https://anchor.studio/audit-failures):
  The three structural mistakes that invalidate audit findings.

## Optional

- [About Anchor Studio](https://anchor.studio/about): Team, mission, contact.
- [Privacy and Terms](https://anchor.studio/legal): Site policies.

Note what is missing: there is no “Blog” link dumping the full archive. No category landing pages. No tag pages. Every entry is a destination, not a directory.
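
The "destination, not directory" test can be approximated in code. The sketch below pulls the link lines out of a file shaped like the example and flags URLs whose path looks like a listing page. The segment list is an assumed heuristic rather than anything from the spec, so tune it to your own URL scheme.

```python
import re
from urllib.parse import urlparse

# Matches the "- [Title](url)" opening of a link line; wrapped
# description lines are ignored because they do not start with "- [".
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)")

# Assumed heuristic: path segments that usually mean "listing page".
LISTING_LEAF = {"blog", "archive", "category", "categories", "tag", "tags"}
LISTING_PARENT = {"category", "categories", "tag", "tags"}


def parse_links(text):
    """Return (section, title, url) for each link line in an llms.txt body."""
    section = None
    links = []
    for line in text.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = LINK_RE.match(line)
        if match:
            links.append((section, match["title"], match["url"]))
    return links


def looks_like_directory(url):
    """True when the path ends in, or passes through, a listing segment."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    if not segments:
        return False
    if segments[-1] in LISTING_LEAF:
        return True
    return any(s in LISTING_PARENT for s in segments[:-1])
```

Run `looks_like_directory` over the URLs returned by `parse_links` and review anything it flags. A hit is a prompt to check the page by hand, not proof that the entry is wrong.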

The Quarterly Audit

llms.txt is not a deploy-and-forget asset. Set a quarterly review on the calendar with three checks:

  1. Editorial freshness. Replace Bucket 4 entries older than six months with current articles. Stale editorial signals an inactive site.
  2. URL validity. A link in your llms.txt that returns a 404 or a 301 redirect is a credibility hit. Run every link through a crawler each quarter and replace anything that does not resolve directly.
  3. Strategic alignment. Has your business changed? New service line, new vertical, new positioning? The H1 and blockquote should still describe what you actually do today.
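
The first two checks can be partly automated; the third needs a human. This is a minimal sketch, assuming you track each entry as a `(url, bucket, published)` tuple. The tuple shape, the `audit` function, and the 183-day stand-in for "six months" are illustrative choices. The status check uses a HEAD request that deliberately does not follow redirects, so a 301 shows up as a 301.

```python
import urllib.error
import urllib.request
from datetime import date, timedelta


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface 3xx as an HTTPError instead of following it


def head_status(url, timeout=10):
    """Return the HTTP status for url without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def audit(entries, today, fetch_status=head_status, max_age_days=183):
    """entries: iterable of (url, bucket, published); published is a date or None."""
    findings = []
    for url, bucket, published in entries:
        status = fetch_status(url)
        if status != 200:
            findings.append((url, f"returns {status}"))
        is_stale = (
            bucket == "editorial"
            and published is not None
            and today - published > timedelta(days=max_age_days)
        )
        if is_stale:
            findings.append((url, "editorial older than six months"))
    return findings
```

Passing `fetch_status` as a parameter keeps the function testable offline. Network failures such as DNS errors are not caught here, and some servers reject HEAD requests, so expect to adapt the fetcher for real use.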

The AI Rank Lab 2026 best-practices brief puts the quarterly cadence at the center of effective implementation, which matches what mature publishers, such as the developer-tools cohort, are doing in practice.

What This Earns You

To be honest about expected outcomes: major AI providers do not all fetch /llms.txt on every request today, and the file is not a ranking signal in the Google sense. What it does is give you a deterministic answer to the question “what would I want a language model to know about my site if it asked one question?” That answer becomes useful in three forward-leaning scenarios — when AI providers begin weighting it explicitly, when your own AI agents and IDE tools consume it (this is happening now in developer tooling), and when third-party AI-citation tracking services begin scoring it as an authority signal.

The cost is half a day of curation and a quarterly review. The optionality is significant. Ship the file with a real link list, not a dumped sitemap, and move on.


Sources:
The /llms.txt file specification (llmstxt.org)
State of llms.txt 2026: Adoption, Standards, and Practice (Presenc AI)
llms.txt Explained May 2026 (Codersera)
LLMs.txt Best Practices for AI Crawlers 2026 (AI Rank Lab)
