Guide to the robots.txt file: what it is and why it is so important
In this article, we will explore the role of robots.txt, a small file that can make the difference between earning a high ranking and languishing in the lowest depths of the SERP.
What is robots.txt
The role of robots.txt is to tell crawlers which pages they may request from your site. Beware: a page blocked this way can still be seen and indexed (for example, if other sites link to it); the spider simply does not scan it. If you want to hide a page from search results, rely on noindex instructions instead, as specified by Google’s Search Console guide.
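As a minimal example, a noindex instruction is a meta tag placed in the page’s HTML head (Google also accepts the equivalent X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">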
So, why do you need a robots.txt file? Because it makes crawling faster and smoother, saving your server from too many crawler requests. It also lets you exclude duplicate or non-essential pages from scanning, pages that could otherwise hurt your ranking.
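For instance, a site might keep all crawlers away from internal search results or temporary pages with a couple of disallow rules (the directory names below are purely hypothetical):

User-agent: *
Disallow: /internal-search/
Disallow: /tmp/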
Where to put robots.txt
You have to put the robots.txt file in your website’s root directory, so that its URL is http://www.mywebsite.com/robots.txt.
Do not put it elsewhere, or it won’t work.
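For instance, assuming your domain is www.mywebsite.com:

http://www.mywebsite.com/robots.txt (correct: crawlers will find it here)
http://www.mywebsite.com/pages/robots.txt (wrong: crawlers will not look for it here)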
How to create robots.txt file
Create a .txt file in your website’s root directory and call it “robots”. Remember that you can have only one robots.txt file per site.
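A minimal starting point, for example, is a file that lets every crawler scan everything (an empty Disallow blocks nothing):

User-agent: *
Disallow: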
Create a group
Create your first group. A robots.txt file can have one or more groups.
Each group contains one or more instructions (also called rules). Remember to use only one instruction per line.
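As a sketch, a file with two groups could look like this, one instruction per line (the crawler name and directory are illustrative; the instructions themselves are explained in the next section):

# Group 1: applies to Googlebot
User-agent: googlebot
Disallow: /private/

# Group 2: applies to every other crawler
User-agent: *
Allow: /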
Robots.txt instructions
Instructions can be of three types:
- user-agent: the crawler to which the rule applies.
- allow: all the files or directories that the user-agent can access.
- disallow: all the files or directories that the user-agent cannot access.
A group must target one or more user agents (or all of them) and include at least one allow or disallow instruction (or both).
Robots.txt examples
For example, to prevent Googlebot from scanning your entire website, write something like this in your robots.txt file:
# Prevent Googlebot from scanning (this is a comment: you can write whatever you want here)
User-agent: googlebot
Disallow: /
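If you want to double-check what a rule blocks, here is a quick sketch using Python’s standard urllib.robotparser module (the URLs and the bingbot check are just illustrative):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is blocked from every URL; crawlers not covered by the group are unrestricted
print(parser.can_fetch("googlebot", "http://www.mywebsite.com/any-page"))  # False
print(parser.can_fetch("bingbot", "http://www.mywebsite.com/any-page"))    # True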
If instead you want to exclude more than one directory for all crawlers:
User-agent: *
Disallow: /directory1
Disallow: /directory2
(the asterisk means “all”)
Or maybe exclude all directories but one for a specific crawler:
User-agent: specific-crawler
Allow: /directory1
Disallow: /

User-agent: *
Allow: /
In this way, you’re stating that the specific crawler can only scan /directory1, while every other crawler can access the entire website.
Finally, we can prevent the scanning of a specific file format, for example .jpg images:
User-agent: *
Disallow: /*.jpg$
The $ character establishes that the rule applies to all URLs that end with .jpg.
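To make the difference concrete, a hypothetical URL such as /photo.jpg?size=large does not end with .jpg, so the rule above would not block it; a rule without the $ would block it too:

# Matches only URLs that end exactly with .jpg
Disallow: /*.jpg$
# Matches any URL containing .jpg, including /photo.jpg?size=large
Disallow: /*.jpg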
To see more examples, visit Google’s Search Console guide.
Learn more about Technical SEO
Technical SEO is not easy, but it is fundamental to doing SEO the right way.
Learn it by reading our Guide for Beginners to Technical SEO.