Secure Cookie Configuration in PHP: HttpOnly, Secure, SameSite, and What Each Flag Does


What robots.txt and sitemap.xml Actually Do for Your Website

If you have a website, you probably want search engines to find it, index it properly, and show it to people searching for what you offer. Two files in your site root play a quiet but important role in making that happen: robots.txt and sitemap.xml. These are not optional extras or things only large sites need. They are fundamental parts of how search engines understand and navigate your website.

Understanding what these files do, how they work together, and what mistakes to avoid will help you maintain a healthier online presence whether you manage your own site or work with a web developer.

How Search Engine Crawlers Work

Before looking at the files themselves, it helps to understand how search engines discover and index pages. Search engines use automated programs called crawlers or spiders. These bots follow links from page to page, collecting information about each URL they encounter.

A crawler starts with a known URL and then follows links on that page to find more pages. It stores what it finds and passes that information to the search engine's indexing system. The crawler does not see your site the way a human visitor does. It reads code, follows rules, and builds a map of your website structure based on what it can access.

Sometimes crawlers hit pages that should not be indexed, such as admin areas, staging environments, or duplicate content. Without guidance, they may waste crawl budget on unimportant pages while missing the ones that matter most.

That is where robots.txt and sitemap.xml come in. They give you a way to communicate directly with crawlers and tell them how to approach your site.

What robots.txt Is and How It Works

The robots.txt file sits in the root directory of your website. Its purpose is to give instructions to crawlers. Before fetching pages from a domain, a well-behaved bot requests robots.txt and applies any rules it finds there. Compliance is voluntary: reputable search engine crawlers follow the file, but nothing forces a badly behaved bot to do so.

The file uses a simple text-based syntax. You specify a user agent, which identifies which crawler the rules apply to, and then you list directives that tell that crawler what to do.

Key Directives in robots.txt

The most common directives are Disallow and Allow. Disallow tells crawlers they cannot access a specific URL or path. Allow explicitly permits access, which is useful when you want to block an entire directory but open one page within it.

User-agent: *
Disallow: /admin/
Allow: /admin/public-page/

In this example, all crawlers are blocked from the /admin/ directory but can access a specific public page inside it.
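How a crawler resolves overlapping rules like these can be sketched in Python. This is a minimal illustration, assuming the longest-match precedence described in RFC 9309 (the most specific matching path wins, and Allow wins a tie). It ignores wildcards and user-agent grouping, and the function name is_allowed and the rules list are hypothetical names chosen for this example only.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, path_prefix) tuples for a single
    user-agent group, e.g. ("disallow", "/admin/").
    Assumption: longest matching prefix wins; Allow wins a tie.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_allow:
                best_len = len(prefix)
                allowed = directive == "allow"
    return allowed


# The same rules as the robots.txt example above.
rules = [("disallow", "/admin/"), ("allow", "/admin/public-page/")]

print(is_allowed(rules, "/admin/settings"))      # False
print(is_allowed(rules, "/admin/public-page/"))  # True
print(is_allowed(rules, "/blog/"))               # True
```

Not every parser uses this precedence. Python's built-in urllib.robotparser, for instance, applies rules in file order, so it is worth testing rules against the crawler you actually care about.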

The Crawl-delay directive asks crawlers to wait a certain number of seconds between requests. Support varies: Bing's crawler honours it, but Googlebot ignores it, so it only helps with crawlers that choose to respect it. It can be useful if your server struggles with rapid crawl requests.

User-agent: Bingbot
Crawl-delay: 2

The Sitemap directive points crawlers to your XML sitemap location.

Sitemap: https://example.com/sitemap.xml

This is a clean way to include your sitemap location without needing a separate plugin or tool.
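To see how a client might read these directives, here is a small sketch using Python's standard-library robots.txt parser. The robots.txt content and the crawler name ExampleBot are made up for illustration, and site_maps() assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 2

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")  # seconds, or None if not set
sitemaps = parser.site_maps()             # list of URLs, or None if absent

print(delay)
print(sitemaps)
```

A polite custom crawler could sleep for that many seconds between requests and use the sitemap list as its starting point for discovery.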

Common robots.txt Mistakes

A misconfigured robots.txt file can quietly damage your search visibility. Blocking the entire site by accident is more common than people realise, especially during development.

User-agent: *
Disallow: /

This directive tells every crawler to stay away from everything. If this reaches a live site, compliant crawlers stop fetching your pages, and your listings degrade or drop out of search results until someone notices and fixes it.
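One way to catch this before it goes live is a small pre-deployment check. The sketch below uses Python's standard-library parser. The function name blocks_entire_site is hypothetical, and how you wire it into a deployment process is left as an assumption.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="*"):
    """Return True if the given robots.txt text forbids crawling "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


development_rules = "User-agent: *\nDisallow: /\n"
production_rules = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(development_rules))  # True
print(blocks_entire_site(production_rules))   # False
```

Failing a production build when this returns True is a cheap safeguard against shipping a development robots.txt by accident.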

Another common mistake is using robots.txt to hide sensitive information. The file is publicly accessible. Anyone can view it to see exactly which paths you have tried to hide. Never use it to secure private pages. Use proper authentication and access controls instead.

Case sensitivity also matters. Most crawlers treat paths as case-sensitive. Blocking /Admin/ does not block /admin/. Make sure your directives match the actual URL structure of your site.

What sitemap.xml Is and Why It Matters

A sitemap is an XML file that lists the URLs on your website you want indexed. Unlike robots.txt, which tells crawlers what to avoid, a sitemap tells them what exists. It is essentially a directory of your important pages.

For small sites with a clear internal linking structure, search engines can usually find all relevant pages without a sitemap. However, even then, a sitemap can speed up discovery and ensure nothing gets missed. For larger sites, sites with lots of dynamic content, or sites with complex navigation, a sitemap becomes much more important.

The Structure of an XML Sitemap

A basic sitemap lists URLs with their location and optional metadata. Here is a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/services/</loc>
    <lastmod>2024-01-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

The loc element contains the full URL. The lastmod element shows the last modification date, which helps crawlers know when to revisit a page. The changefreq and priority elements give hints about how often the page changes and how important it is relative to other pages on the site.

It is worth noting that crawlers treat changefreq and priority as hints at most, not commands. Google's documentation states that it ignores both values, and a high priority never makes a page rank better. An accurate lastmod is the field most likely to influence when a page is recrawled.
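If you generate a sitemap yourself rather than relying on a plugin, the structure above is simple to produce. The following is a minimal sketch using Python's standard library. The function name build_sitemap and the page list are hypothetical, and it writes only loc and lastmod to keep the example short.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages, matching the example above.
pages = [
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/services/", "2024-01-10"),
]

print(build_sitemap(pages))
```

ElementTree escapes special characters such as & in URLs automatically, which hand-written string concatenation often gets wrong.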

Image and Video Sitemaps

If your site relies on visual content, you can create separate sitemaps for images and videos, or add image and video entries to your existing sitemap. These help search engines discover and index your media properly. An image sitemap entry lists the image URL; the older tags for caption, title, and license information have since been deprecated by Google. A video sitemap entry includes details such as the title, description, thumbnail, and duration.

Search engines use this media metadata to display rich results, which can improve click-through rates from search results. If you run a portfolio site, a business site with product images, or a platform with video content, these extended sitemaps are worth setting up.

How These Two Files Work Together

Robots.txt and sitemap.xml serve different purposes but they complement each other. Robots.txt handles exclusion and permission. Sitemap.xml handles inclusion and guidance.

Imagine a scenario where you have a staging version of your site at staging.example.com linked from your live site. You do not want search engines indexing the staging site. Because robots.txt applies per host, you would serve a robots.txt file from staging.example.com itself that disallows everything. Meanwhile, your live site at example.com has a sitemap listing all public pages for crawlers to discover.

The Sitemap directive inside robots.txt connects both files. Adding your sitemap location to robots.txt gives crawlers a hint about where to find your sitemap without needing to submit it manually through search engine tools. The example below is for a live site that also keeps a staging copy under a /staging/ path:

User-agent: *
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml

Where to Place These Files

The robots.txt file must live in the root directory of your host. For a site at example.com, it should be accessible at example.com/robots.txt. The sitemap is conventionally placed at example.com/sitemap.xml as well. It can technically live elsewhere, but by default a sitemap only covers URLs at or below its own directory, so the root is the safest location.

Placing robots.txt in a subdirectory will not work. Crawlers look specifically in the root. If your site runs on a subdomain, the files belong on that subdomain. A robots.txt file at example.com does not apply to blog.example.com. Each subdomain needs its own configuration.
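The per-host rule is easy to see in code. This short sketch derives the robots.txt location that governs any given page URL. The function name robots_txt_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to `page_url`.

    Only the scheme and host are kept; the path is always /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/services/web-design?ref=home"))
# https://example.com/robots.txt
print(robots_txt_url("https://blog.example.com/posts/2024/launch"))
# https://blog.example.com/robots.txt
```

The two results differ because the hosts differ, which is why rules on example.com have no effect on blog.example.com.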

If you use a managed hosting platform, a CMS like WordPress, or a website builder, these files may be generated automatically. However, the auto-generated versions are not always optimised. Reviewing and customising them to match your actual site structure is usually worthwhile.

Telling Search Engines About Your Sitemap

Having a sitemap is only part of the process. Search engines need to know it exists. You can submit your sitemap directly through tools like Google Search Console or Bing Webmaster Tools. These platforms also show you crawl errors, index coverage, and any issues detected with your sitemap.

For most websites, using Google Search Console is the most practical step. It is free, straightforward to set up, and gives you direct feedback about how Google sees your site. If your site has significant UK-based traffic, Bing is worth considering as well, though Google remains the dominant search engine in that market.

Regularly checking these tools helps you catch problems early. A sitemap that suddenly reports fewer URLs than expected might indicate pages that have been blocked, moved without redirect, or removed from your site structure.

What Happens When These Files Are Missing

If robots.txt does not exist, crawlers will attempt to crawl everything they can find on your site. That is not necessarily a problem for most sites, but it means you lose the simplest way to steer crawlers away from areas you do not want crawled.

If sitemap.xml is missing, crawlers rely entirely on following links to discover your pages. For sites with deep navigation structures, complex hierarchies, or isolated pages not linked from anywhere obvious, some content may never get discovered and indexed.

For a simple static site with straightforward navigation, missing both files may not cause obvious problems. But as your site grows or your needs become more specific, these files become increasingly important.

Security Considerations When Using These Files

Because robots.txt and sitemap.xml are public files, what you include in them can reveal information about your site structure. Listing every URL in your sitemap tells the world about the full extent of your website. In most cases that is fine, but for sites with admin areas or semi-private sections that rely on other security measures, it is worth thinking carefully about what you expose.

Never assume that blocking a page in robots.txt makes it private. A determined person can still access it directly or use other methods to find it. Proper access control, authentication, and server-side security are the only reliable ways to protect sensitive areas. Think of robots.txt as a polite request to crawlers, not a security mechanism.

Regular Maintenance and Review

These files are not set-and-forget. When you add new sections to your site, launch new products, or restructure your content, your sitemap needs to reflect those changes. When you set up new admin areas, developer tools, or staging environments, your robots.txt needs to keep pace.

Reviewing both files quarterly, or whenever you make significant changes to your site, helps catch misconfigurations before they cause problems. If you work with a developer or IT support provider, include these files in your regular site audits.
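Part of such a review can be automated. The sketch below flags URLs that appear in a sitemap but are blocked by robots.txt, a common sign that the two files have drifted apart. The function name blocked_sitemap_urls and the sample data are hypothetical. It uses Python's standard-library parser, which applies rules in file order, so results for overlapping Allow and Disallow rules may differ from a given search engine's behaviour.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return sitemap URLs that robots.txt forbids `user_agent` to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS)]
    return [url for url in locs if not parser.can_fetch(user_agent, url)]


# Hypothetical file contents for illustration.
robots_txt = "User-agent: *\nDisallow: /staging/\n"
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/staging/preview</loc></url>
</urlset>"""

print(blocked_sitemap_urls(robots_txt, sitemap_xml))
# ['https://example.com/staging/preview']
```

Any URL this reports should either be removed from the sitemap or unblocked in robots.txt, depending on which file reflects your intent.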

A sitemap that is kept up to date helps search engines crawl your site more efficiently. If you have ever published new content and been frustrated that it did not appear in search results for weeks, an outdated or missing sitemap may have been a contributing factor.

How This Fits Into Your Overall Technical Setup

Robots.txt and sitemap.xml are part of a broader technical foundation that supports how your website performs in search results. They work alongside clean URL structures, proper server configuration, fast page load times, and mobile-friendly design. Ignoring the fundamentals while focusing on content alone often leads to disappointing results.

If you are building or maintaining a website and want to understand the technical side better, looking at how these files are configured on your site is a practical starting point. Seeing what is blocked and what is included gives you a direct view into how search engines experience your site.

Frequently Asked Questions

Can I use robots.txt to remove a page from Google?
Blocking a page with robots.txt will stop Google from crawling it, but the URL may still appear in search results if other pages link to it or Google has already indexed it. The removals tool in Google Search Console hides a URL only temporarily. To remove a page for good, let it be crawled and add a noindex directive, delete it so that it returns a 404 or 410 status, or put it behind authentication. Simply blocking a page does not guarantee it disappears from search results.
How many URLs should a sitemap contain?
The sitemap protocol caps each sitemap file at 50,000 URLs and 50 MB uncompressed. If your site exceeds either limit, you can create multiple sitemap files and use a sitemap index file to reference them. Keeping your sitemap lean and focused on important, unique pages is more effective than listing every possible variation of every URL.
Should I block search engines from my development site?
Yes. Development and staging sites should always be blocked from indexing. A robots.txt block prevents accidental crawling, but also ensure your development environment is not publicly accessible or is protected by authentication. If a staging site gets indexed by mistake, it can cause duplicate content issues and confusion in search results.
Do I need both files if my site is small?
For a small site with a simple structure and straightforward navigation, a sitemap is still worth having because it speeds up discovery and gives search engines clear information about your content. A robots.txt file is less critical in that scenario but adding one with a sitemap reference takes a few minutes and is good practice. It costs nothing to include and helps crawlers work more efficiently.
What is the difference between noindex and disallow in robots.txt?
Disallow is a robots.txt rule that tells crawlers not to visit a URL. Noindex is not a robots.txt directive. It is set in a robots meta tag in the page's HTML or in an X-Robots-Tag HTTP header, and it tells crawlers not to index the page. Disallow prevents crawling entirely, which means crawlers never see the page or its noindex directive. If you want a page kept out of search results, leave it crawlable and use noindex instead of disallowing it. If you only want to save crawl budget, disallow it, accepting that the bare URL can still be listed if other sites link to it.
Can I block all crawlers except one from accessing my site?
Yes. Robots.txt supports multiple user-agent blocks. You can allow a specific crawler and disallow others. Most legitimate crawlers respect robots.txt directives, but the file cannot stop a bot that chooses to ignore it. Be specific with user-agent names to avoid accidentally blocking crawlers you want to allow. Major search engine crawlers such as Googlebot and Bingbot use well-documented user-agent names.
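As a rough illustration of that last answer, the sketch below parses a robots.txt that admits one crawler and turns all others away, then checks the outcome with Python's standard-library parser. The file content is an assumed example, and OtherBot is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed example: allow Googlebot everywhere, block every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/services/")
otherbot_ok = parser.can_fetch("OtherBot", "https://example.com/services/")

print(googlebot_ok)  # True
print(otherbot_ok)   # False
```

A crawler uses the group that names it specifically and falls back to the * group only when no named group matches, which is why the order of the two blocks does not matter here.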