Thursday, 12 May 2022

What is a robots.txt file?

Robots.txt is a text file created by webmasters to instruct web robots (usually search engine robots) how to crawl pages on their website. 

The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and deliver that content to users. 


The REP also includes directives such as meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as "follow" or "nofollow").


In practice, robots.txt files indicate whether certain user agents (web crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by "disallowing" or "allowing" the behavior of certain (or all) user agents.


Basic format:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]


Together, these two lines are considered a complete robots.txt file, although a robots.txt file can contain multiple lines of user agents and directives (i.e. disallows, allows, crawl-delays, etc.).


Inside a robots.txt file, each set of user agent directives appears as a discrete group, separated by a line break.



In a robots.txt file with multiple user agent groups, each allow or disallow rule only applies to the user agents specified in that particular newline-separated group. If the file contains a rule that applies to more than one user agent, a crawler will only pay attention to (and follow the directives in) the most specific group of instructions.
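This most-specific-match behavior can be checked with Python's standard-library robots.txt parser; the file content below is a hypothetical sketch, not taken from any real site:

```python
# urllib.robotparser is Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a generic group and a Googlebot-specific group.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own, more specific group, so the generic
# /private/ rule does not apply to it:
print(rp.can_fetch("Googlebot", "http://www.example.com/private/page.html"))    # True
print(rp.can_fetch("Googlebot", "http://www.example.com/no-google/page.html"))  # False

# Any other crawler falls back to the User-agent: * group:
print(rp.can_fetch("OtherBot", "http://www.example.com/private/page.html"))     # False
```

Note that a specific group replaces, rather than extends, the generic group: Googlebot above is free to crawl /private/ because its own group never mentions it.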


Here is an example:


[Example image: a robots.txt file with separate groups of directives for msnbot, discobot, and Slurp, plus a user-agent: * group.]

Msnbot, discobot, and Slurp are each called out specifically, so those user agents will only pay attention to the directives in their own sections of the robots.txt file. All other user agents will follow the directives in the user-agent: * group.

Example robots.txt:

Here are some examples of robots.txt in action for a www.example.com site:


Robots.txt file URL: www.example.com/robots.txt


Blocking all web crawlers from all content


User-agent: *
Disallow: /


Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the home page.


Allowing all web crawlers to access all content

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the home page.
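The difference between "Disallow: /" and an empty "Disallow:" can be verified with Python's standard-library parser (the bot name and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /" blocks everything for the matched agents...
block_all = RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])
print(block_all.can_fetch("AnyBot", "http://www.example.com/"))  # False

# ...while an empty "Disallow:" blocks nothing.
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])
print(allow_all.can_fetch("AnyBot", "http://www.example.com/page.html"))  # True
```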


Blocking a specific web crawler from a specific folder


User-agent: Googlebot
Disallow: /example-subfolder/


This syntax tells only Google's crawler (user agent name Googlebot) not to crawl any page that contains the URL string www.example.com/example-subfolder/.


Blocking a specific web crawler from a specific web page


User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html


This syntax tells only Bing's crawler (user agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.
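Both per-crawler rules can be exercised together with Python's standard-library parser; the combined file below is a hypothetical sketch of the two examples above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining the Googlebot and Bingbot examples.
rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /example-subfolder/",
    "",
    "User-agent: Bingbot",
    "Disallow: /example-subfolder/blocked-page.html",
])

# Googlebot is blocked from the entire subfolder:
print(rp.can_fetch("Googlebot", "http://www.example.com/example-subfolder/any.html"))  # False

# Bingbot is blocked only from the one page:
print(rp.can_fetch("Bingbot", "http://www.example.com/example-subfolder/blocked-page.html"))  # False
print(rp.can_fetch("Bingbot", "http://www.example.com/example-subfolder/other.html"))         # True
```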

How does robots.txt work?

Search engines have two main jobs:


  • Crawl the web to discover content;

  • Index that content so that it can be served to searchers looking for information.


To crawl sites, search engines follow links to get from one site to another; ultimately they crawl many billions of links and websites. This crawling behavior is sometimes referred to as "spidering".


After reaching a website but before crawling it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing with the page. 


Because the robots.txt file contains information about how the search engine should crawl, the information found there will indicate further action by the crawler on this particular site. 


If the robots.txt file does not contain any directives that disallow a user agent's activity (or if the site does not have a robots.txt file), the crawler will proceed to crawl other information on the site.
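This robots-first flow can be sketched with Python's standard library. The helper names below are illustrative, not a real API, and the actual HTTP request is left as a placeholder:

```python
# Minimal sketch of the crawl gate: parse robots.txt once, then consult it
# before every page request. Helper names are hypothetical.
from urllib.robotparser import RobotFileParser

def load_policy(robots_lines):
    """Parse the lines of a fetched robots.txt into a permission checker."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp

def crawl_page(policy, user_agent, url):
    """Check the policy first; only fetch when the URL is allowed."""
    if not policy.can_fetch(user_agent, url):
        return None  # a directive disallows this URL for this agent
    return "GET " + url  # placeholder for the real HTTP request

policy = load_policy(["User-agent: *", "Disallow: /private/"])
print(crawl_page(policy, "AnyBot", "http://www.example.com/private/a.html"))  # None
print(crawl_page(policy, "AnyBot", "http://www.example.com/public.html"))     # GET http://www.example.com/public.html
```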

Other quick insights on robots.txt:

(discussed in more detail below)


  • In order to be found, a robots.txt file must be placed in the top level directory of a website.


  • Robots.txt is case sensitive: the file must be named "robots.txt" (not Robots.txt, robots.TXT, or otherwise).


  • Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with more nefarious crawlers like malware bots or email address scrapers.


  • The /robots.txt file is publicly available: simply add /robots.txt to the end of any root domain to see the directives for that website (if that site has a robots.txt file!). This means that anyone can see which pages you do or don't want crawled, so don't use robots.txt to hide private user information.


  • Each subdomain in a root domain uses separate robots.txt files. This means that both blog.example.com and example.com must have their own robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).


Generally, it's good practice to indicate the location of any sitemaps associated with this domain at the bottom of the robots.txt file. Here is an example:

Sitemap: https://www.example.com/sitemap.xml
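Python's standard-library parser exposes any Sitemap lines it finds (the site_maps() method requires Python 3.8+); the file content below is a hypothetical sketch:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a sitemap location declared at the bottom.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "",
    "Sitemap: https://www.example.com/sitemap.xml",
])

print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```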

Technical syntax of robots.txt

Robots.txt syntax can be thought of as the "language" of robots.txt files. There are five common terms that you are likely to come across in a robots.txt file. They include:


User-agent – The specific web crawler you are giving crawl instructions to (usually a search engine). You can find a list of most user agents here.


Disallow – The command used to tell a user agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.


Allow (only applicable to Googlebot) – The command to tell Googlebot that it can access a page or subfolder even though its parent page or subfolder is disallowed.


Crawl-delay – How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not recognize this command, but crawl rate can be configured in Google Search Console.


Sitemap – Used to indicate the location of any XML sitemap associated with this URL. Please note that this command is only supported by Google, Ask, Bing and Yahoo.
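The Crawl-delay directive is also visible through Python's standard-library parser (Python 3.6+); the bot names and delay value below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical group asking one crawler to wait 10 seconds between requests.
rp = RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Crawl-delay: 10",
    "Disallow: /private/",
])

print(rp.crawl_delay("Bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None (no matching group declares a delay)
```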

