Robots.txt is a special file used to regulate how search engines index a site. It is located in the site's root directory. Its sections contain directives that allow or deny indexing bots access to sections and pages of the site. Note that the robots of different search engines process this file with their own algorithms, which may differ from one another. No robots.txt setting affects how links to the site's pages from other sites are processed.
The main function of this file is to provide instructions to indexing robots. The main robots.txt directives are Allow (permits indexing of a specific file or section), Disallow (prohibits it), and User-agent (specifies which robots the Allow and Disallow directives apply to).
Keep in mind that robots.txt instructions are advisory in nature. This means that robots may ignore them in some cases.
Let's look at some examples.
A file with the following content prohibits indexing of the site for all robots:
User-agent: *
Disallow: /
To prohibit indexing of only the /private/ directory for Google's main search robot, a robots.txt with the following content is used:
User-agent: Googlebot
Disallow: /private/
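If you want to see how a crawler would interpret such rules, Python's standard urllib.robotparser module can apply them to specific URLs. A minimal sketch (the domain site.com is just a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /private/",
])

# Pages under /private/ are blocked for Googlebot; everything else stays allowed.
print(rp.can_fetch("Googlebot", "https://site.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://site.com/index.html"))         # True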
How to create and where to place robots.txt
The file must have the .txt extension. After creating it, upload it to the root directory of the site using any FTP client and check whether it is available at site.com/robots.txt. When this address is opened, the browser should display the file in full.
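One simple way to verify availability is to request the file programmatically. A small sketch using only the Python standard library (site.com is a placeholder for your own domain):

import urllib.request

url = "https://site.com/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.getcode())                      # 200 means the file is reachable
    print(response.read().decode("utf-8")[:200])   # beginning of the file's content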
Robots.txt file requirements
The webmaster should always keep in mind that a missing or incorrectly configured robots.txt file in the site's root directory can harm the site's traffic and its visibility in search.
The standard prohibits the use of Cyrillic characters in the robots.txt file. Therefore, to work with Cyrillic domains you need to use Punycode. The encoding of page addresses must match the encoding of the site structure in use.
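If you need to convert a Cyrillic domain yourself, Python's built-in idna codec produces the Punycode form. A minimal sketch with a made-up example domain:

# "сайт.рф" is a hypothetical Cyrillic domain used only for illustration.
domain = "сайт.рф"
print(domain.encode("idna").decode("ascii"))
# prints the ASCII-compatible (Punycode) form, e.g. xn--80aswg.xn--p1ai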
Other file directives
The Host directive specifies which site mirror should be treated as the main one for indexing. This prevents pages from different mirrors of the same site from being included in the index and appearing as duplicates in search results.
Example of use:
If the main mirror for a group of sites is https://onesite.com, then:
User-Agent: Googlebot
Disallow: /blog
Disallow: /custom
Host: https://onesite.com
If the robots.txt file contains several Host directives, the indexing robot uses only the first one; the rest are ignored.
For fast and correct site indexing, a special Sitemap file (or a group of such files) is used, and the Sitemap directive points robots to it. The directive is intersectional: the robot takes it into account no matter where it is placed in robots.txt, but by convention it is placed at the end.
When processing this directive, the robot remembers and processes the data. This information forms the basis for subsequent sessions of loading site pages for indexing.
Example of use:
User-agent: *
Allow: /catalog
Sitemap: https://mysite.com/my_sitemaps0.xml
Sitemap: https://mysite.com/my_sitemaps1.xml
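For reference, the Sitemap URLs declared this way can also be read programmatically: in Python 3.8+ urllib.robotparser exposes them via site_maps(). A short sketch:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /catalog",
    "Sitemap: https://mysite.com/my_sitemaps0.xml",
    "Sitemap: https://mysite.com/my_sitemaps1.xml",
])

# site_maps() returns the list of declared Sitemap URLs (or None if there are none).
print(rp.site_maps())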
Clean-param is an additional directive for Yandex search engine bots. Modern sites have complex URL structures: content management systems often add dynamic parameters to page addresses, which can carry extra information about referrers, user sessions, and so on.
The standard syntax for this directive is described as follows:
Clean-param: s0[&s1&s2&..&sn] [path]
The first field lists the parameters that should be ignored, separated by the & symbol. The second field contains the path prefix of the pages the rule applies to.
Suppose the engine of a certain forum generates long links such as http://forum.com/index.php?id=128955&topic=55 when users open its pages, the content of those pages is identical, and the id parameter differs for each visitor. To prevent the whole set of pages with different id values from getting into the index, we use the following robots.txt file:
User-agent: *
Disallow:
Clean-param: id /index.php
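To illustrate what the directive asks the robot to do, here is a sketch in Python that strips the ignored parameter from such URLs; strip_params is a hypothetical helper written only for this example, not part of any robots.txt tooling:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_params(url, ignored=("id",)):
    # Drop the query parameters that the Clean-param rule tells robots to ignore.
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(strip_params("http://forum.com/index.php?id=128955&topic=55"))
# -> http://forum.com/index.php?topic=55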
The Crawl-delay directive is intended for cases when indexing robots put too high a load on the site's server. It specifies the minimum time between the end of loading one page and the robot's request for the next one. The period is set in seconds. The Yandex robot also correctly reads fractional values, such as 0.3 seconds.
Example of use:
User-agent: *
Disallow: /cgi
Crawl-delay: 4.1 # timeout of 4.1 seconds for robots
This directive is currently not taken into account by Google’s search engine robots.
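A polite crawler can also honor the directive on its own side. A minimal sketch using the Python standard library (note that urllib.robotparser expects whole-second values, so the sketch uses 4 instead of 4.1; the URLs are placeholders):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi",
    "Crawl-delay: 4",
])

delay = rp.crawl_delay("*") or 0
for url in ["https://site.com/page1", "https://site.com/page2"]:
    # ... fetch the page here ...
    time.sleep(delay)  # pause between requests, as the directive asks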
$ and other special characters
Keep in mind that, by default, a trailing * wildcard is implied at the end of every directive value. As a result, the directive applies to all sections or pages of the site whose addresses begin with the specified combination of characters.
To cancel this default behavior and anchor the rule to the exact end of the address, the special character $ is used.
Example of use:
User-agent: Googlebot
Disallow: /pictures$ # prohibits '/pictures', but does not prohibit '/pictures.html'
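The matching behavior is easy to emulate. A sketch in Python, where rule_matches is a hypothetical helper written only for illustration: '*' acts as a wildcard, and a trailing '$' anchors the rule to the end of the address.

import re

def rule_matches(rule, path):
    # '*' matches any sequence of characters; a trailing '$' anchors the end of the path.
    anchored = rule.endswith("$")
    pattern = re.escape(rule.rstrip("$")).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

print(rule_matches("/pictures$", "/pictures"))       # True  -> blocked
print(rule_matches("/pictures$", "/pictures.html"))  # False -> not blocked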
The robots.txt standard recommends inserting a blank line after each group of User-agent directives. The # symbol is used for comments: robots ignore everything in a line from the # symbol to the end of the line.
How to prohibit indexing of a site or its sections
You can prevent indexing of certain pages, sections, or the entire site using the Disallow directive as follows.
User-agent: *
Disallow: / # blocks access to the entire site

User-agent: Googlebot
Disallow: /bin # blocks access to pages that start with '/bin'
How to check robots.txt for correctness
Validating the robots.txt file is a mandatory step after any change to it: even a single misplaced character can lead to serious problems. At a minimum, check robots.txt in your search engine's webmaster tools; a similar check should be performed for Google. To run the check, you need to register in the webmaster panel and add your site's data to it.
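Besides the webmaster panels, a quick self-check can be scripted: download the live file and test a few URLs against it. A minimal sketch in Python (the URLs are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://site.com/robots.txt")
rp.read()  # downloads and parses the file that the site actually serves

# Spot-check a few addresses against the published rules.
for url in ["https://site.com/", "https://site.com/private/page.html"]:
    print(url, rp.can_fetch("Googlebot", url))

Keep in mind that urllib.robotparser handles only the core User-agent, Allow and Disallow logic and does not understand wildcards such as * and $ inside paths, so this is a sanity check, not a replacement for the search engines' own validators.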