What is a Robots.txt file?
A robots.txt file is a simple text file that should be available at the root level of the website, like the one on the Excellium website. It tells search engine robots which parts of the website they may or may not crawl.
In that example, the robots.txt file provides the website’s sitemap, which helps search engines find all links more easily than browsing each page one by one and discovering links recursively. It also allows search engines to index pages that have no external references pointing to them.
In the following example, robots are allowed to browse all the pages of the website; there is no restriction:
—– CODE —-
—– CODE —-
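As a rough sketch of how a crawler reads such a permissive file, Python's standard `urllib.robotparser` module can parse it. The file content and the `example.com` URLs below are illustrative assumptions, not the actual Excellium file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical permissive robots.txt: no restriction, plus a sitemap.
PERMISSIVE_ROBOTS = """\
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(PERMISSIVE_ROBOTS.splitlines())

# An empty Disallow value means that nothing is forbidden.
allowed = parser.can_fetch("*", "https://example.com/any/page.html")
sitemaps = parser.site_maps()  # available since Python 3.8
print(allowed, sitemaps)
```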
Do robots.txt files serve only robots?
In the previous example, the file is only there to help with the indexing of the website. However, the file can also forbid robots from browsing some pages or some parts of the website.
For example, the following content is an extract of the Google.com robots.txt file.
—– CODE ——-
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /preferences
Disallow: /setprefs
Disallow: /default
Disallow: /m?
Disallow: /m/
——- CODE ——
That file contains the keyword Disallow, which tells the robot that it is not permitted to crawl the matching part of the website. A rule can match the beginning of a URL, a folder name, or the full URL of a particular page, for example.
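The matching of disallow rules can be illustrated with a minimal sketch based on Python's standard `urllib.robotparser` module. The rules, the `example.com` host and the `is_allowed` helper are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one folder and one specific page are disallowed.
RESTRICTIVE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""

rules = RobotFileParser()
rules.parse(RESTRICTIVE_ROBOTS.splitlines())


def is_allowed(path: str) -> bool:
    """Return True if a generic robot may crawl the given path."""
    return rules.can_fetch("*", "https://example.com" + path)


print(is_allowed("/private/notes.txt"))   # matched by the folder rule
print(is_allowed("/drafts/report.html"))  # matched by the full page rule
print(is_allowed("/drafts/other.html"))   # matched by no rule
```

Note that this standard-library parser applies the first matching rule in file order, while some search engines document a most-specific-match policy, so results can differ on files that mix Allow and Disallow rules.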
Such rules exist for several reasons. One of them is that a robot could fall into an infinite loop. For example, if the website returns the same content for the URL /articles/ID when the ID does not exist, the robot can test all IDs from zero to infinity and the indexing will be incorrect. That also generates unwanted traffic on the server and causes some overhead.
In addition, some folders, such as resource folders, are not worth crawling.
Robots.txt & cybersecurity
OK, but why do we address robots.txt in a security-oriented newsletter?
The robots.txt file lets robots discover pages that are not yet referenced elsewhere, and tells them not to browse unimportant pages.
However, some people use it to keep robots away from internal and private data that should not be indexed by search engines. To do so, they add disallow rules to the file.
Such a method works, as these files will not be indexed by search engines. However, it discloses a great deal to attackers. During an intrusion test, the attacker performs reconnaissance steps to gather information about the technologies used, the different features of the application and its content. The robots.txt file can therefore give details to the attacker through its disallow rules: for example, the path of the admin panel, maintenance pages, internal and/or confidential files, etc.
After that, the attacker just needs to browse these paths, and attempt directory listing on them, to find interesting pages or information.
For pentest purposes, one of our teammates (Dominique Righetto) made a custom tool named robots-disallowed-dict-builder that fetches the robots.txt files of the top websites. For each website, a request is made to retrieve the robots.txt file, and all the disallow entries are kept in a file. That file is then used as a dictionary during the pentest.
A test was made by building a partial list of .lu domains and fetching their robots.txt files. A total of 35000 domains/hostnames were tested and over 6700 robots.txt files were retrieved. Within these files, 6600 disallow rules were found.
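The idea of collecting disallow entries into a dictionary can be sketched as follows. This is not the code of robots-disallowed-dict-builder: the function names `extract_disallow_paths` and `build_dictionary` and the sample data are assumptions made for illustration, and the download step is left out.

```python
from collections import Counter


def extract_disallow_paths(robots_txt: str) -> list[str]:
    """Return the non-empty Disallow values found in one robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def build_dictionary(robots_by_host: dict[str, str]) -> list[tuple[str, int]]:
    """Count on how many hosts each disallowed path appears, most common first."""
    counts = Counter()
    for body in robots_by_host.values():
        counts.update(set(extract_disallow_paths(body)))  # one vote per host
    return counts.most_common()


# Made-up sample data, for illustration only.
sample = {
    "a.example": "User-agent: *\nDisallow: /admin/\nDisallow: /tmp/",
    "b.example": "User-agent: *\nDisallow: /admin/  # back office\nDisallow:",
}
dictionary = build_dictionary(sample)
print(dictionary)
```

Counting each path once per host keeps a single large file from dominating the resulting word list.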
That test gave the following interesting results. Some websites disallow paths that contain “admin”, and 273 disallow a PHP page.
—- CODE: Extract lines containing “admin” —–
—- CODE —–
—- CODE: Extract lines containing “php” —–
—- CODE —–
In addition, some API endpoints can be found. For example, some APIs may not be used by the front-end but only by mobile applications. With that discovery, an attacker can take the attack further.
—- CODE: Extract lines containing “api”—–
—- CODE —–
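The extractions above amount to a keyword filter over the collected disallow entries. A minimal sketch, assuming the entries are already held in a list; the `filter_entries` helper and the entries themselves are made up and are not results from the .lu test.

```python
def filter_entries(entries: list[str], keyword: str) -> list[str]:
    """Keep the disallow entries that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [entry for entry in entries if needle in entry.lower()]


# Made-up entries, for illustration only.
collected = ["/admin/", "/wp-admin/", "/login.php", "/api/v1/mobile/", "/images/"]

for keyword in ("admin", "php", "api"):
    print(keyword, filter_entries(collected, keyword))
```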
Depending on the content of the robots.txt file, the attacker can obtain more interesting information than what search engines reference.
How to proceed without the robots.txt file?
As described in the previous section, that file should not be used as a security feature to prevent someone from accessing the data.
Therefore, the application must implement a security check, based on the user’s session for example. For confidential data or pages, the user must be logged in to the application, and the application needs to check that the user holds the access rights required to access the data.
A previous article was created to explain how to implement and check access rights by using an authentication matrix.
That methodology is not limited to page access; it also applies to documents. If some documents must be accessed only by certain people, they should be taken into account in the access rights matrix.
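A session-based check backed by an access rights matrix could look like the sketch below. The roles, the paths, the `Session` class and the `can_access` function are all hypothetical; a real application would rely on its framework's session and authorization mechanisms.

```python
from dataclasses import dataclass

# Hypothetical matrix: role -> resources (pages and documents) it may read.
ACCESS_MATRIX = {
    "admin": {"/admin/panel", "/docs/audit-report.pdf", "/account"},
    "user": {"/account"},
}


@dataclass
class Session:
    authenticated: bool
    role: str = ""


def can_access(session: Session, resource: str) -> bool:
    """Deny by default: require a logged-in session and an explicit grant."""
    if not session.authenticated:
        return False
    return resource in ACCESS_MATRIX.get(session.role, set())


print(can_access(Session(True, "user"), "/docs/audit-report.pdf"))
print(can_access(Session(True, "admin"), "/docs/audit-report.pdf"))
```

The check is the same for a page and for a document, which is why documents belong in the matrix as well.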
In some cases, access to the data should be possible anonymously, because the registration process can be time-consuming and managing accounts adds a layer of difficulty for both the application developers and the users. That is the case when retrieving medical results such as X-rays, for example. For these cases, access to the data could rely on a randomly generated UUID (UUID v4 for example). A disallow rule could be added to the robots.txt file for the base path to avoid these files being indexed, but the main security feature is the randomness of the file name.
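Generating such a random name takes little code. A minimal sketch, assuming a hypothetical `/results` base path and `.pdf` files; Python's `uuid.uuid4()` produces a random version 4 UUID.

```python
import uuid


def new_result_path(base_path: str = "/results") -> str:
    """Build a hard-to-guess path from a random (version 4) UUID."""
    return f"{base_path}/{uuid.uuid4()}.pdf"


# The base path (not the file name) is what a rule such as
# "Disallow: /results/" would mention in robots.txt.
path = new_result_path()
print(path)
```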
The files present in that kind of folder should be removed as soon as possible. The user needs to know that the file will be available for only one day, for example, after which it will be deleted. A light authentication could also be implemented, such as validating the access with a single-use code sent by SMS.
In addition, access to that kind of folder could be monitored to ensure nobody is attempting to brute-force the file names, for example.
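The limited lifetime and the one-time code can be combined as in the sketch below. The `SharedFile` class, the six-digit code and the one-day default are assumptions made for illustration; sending the SMS and deleting the file from storage are not shown.

```python
import hmac
import secrets
from datetime import datetime, timedelta, timezone
from typing import Optional


class SharedFile:
    """A shared file that expires and is unlocked by a single-use code."""

    def __init__(self, lifetime: timedelta = timedelta(days=1)):
        self.expires_at = datetime.now(timezone.utc) + lifetime
        # Six-digit code to be sent to the user by SMS (sending not shown).
        self.code = f"{secrets.randbelow(10**6):06d}"
        self.code_used = False

    def unlock(self, submitted_code: str, now: Optional[datetime] = None) -> bool:
        """Return True only for the first correct code before expiry."""
        now = now or datetime.now(timezone.utc)
        if now >= self.expires_at or self.code_used:
            return False
        # Constant-time comparison avoids leaking the code through timing.
        if not hmac.compare_digest(submitted_code.encode(), self.code.encode()):
            return False
        self.code_used = True  # the code is valid only one time
        return True
```

A real service would probably also limit the number of failed attempts, since a short numeric code can be guessed.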
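One way to monitor such a folder is to count, per client, the requests for files that do not exist. The following is a minimal sketch under assumed thresholds; the `NotFoundMonitor` class is hypothetical, and in practice the events would come from the web server logs.

```python
from collections import defaultdict, deque


class NotFoundMonitor:
    """Flag clients that request too many non-existent files in a time window."""

    def __init__(self, max_misses: int = 10, window_seconds: float = 60.0):
        self.max_misses = max_misses
        self.window_seconds = window_seconds
        self._misses = defaultdict(deque)  # client -> timestamps of 404s

    def record_miss(self, client: str, timestamp: float) -> bool:
        """Record a 404 for the client; return True if it looks like brute force."""
        misses = self._misses[client]
        misses.append(timestamp)
        # Forget misses that fell out of the sliding window.
        while misses and timestamp - misses[0] > self.window_seconds:
            misses.popleft()
        return len(misses) > self.max_misses


monitor = NotFoundMonitor(max_misses=3, window_seconds=10)
flags = [monitor.record_miss("203.0.113.5", t) for t in (0, 1, 2, 3)]
print(flags)
```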
Should I use a robots.txt file?
Yes, you should use a robots.txt file, as it keeps robots from crawling pages that should not be indexed. However, make sure that the content described in that file is not related to confidential data: obscurity is not security. If the data is confidential, it should be protected by being accessible only to authenticated users. If authentication is not possible, the file should be referenced with a strong random name and should not remain available for a long time.
In conclusion, go check your robots.txt file and make the appropriate updates to improve your external security posture.
- Robots exclusion standard – Wikipedia
- Web crawler – Wikipedia
- righettod/robots-disallowed-dict-builder: Script generating a dictionary containing the most common DISALLOW clauses from robots.txt file found on CISCO Top 1 million sites (github.com)
- What is Robots.txt? | Google Search Central | Google Developers
- What is robots.txt? | How a robots.txt file works | Cloudflare
- Robots.txt – Moz