When it comes to SEO, most people have a decent understanding of the basics. They know about keywords, and how they should show up in different places throughout their content. They have heard of on-page SEO, and have maybe even given WordPress SEO plugins a whirl.
If you dive down into the nitty gritty of search engine optimization, however, there are a few rather obscure pieces of the puzzle that not everyone knows about—one of them being robots.txt files.
What are robots.txt files and what are they used for?
A robots.txt file is a text file that resides on your server. It contains rules for indexing your website and is a tool to directly communicate with search engines.
Basically, the file says which parts of your site Google’s allowed to index and which parts they should leave alone.
However, why would you ever tell Google not to crawl something on your site? Isn’t that harmful from an SEO perspective? Actually, there are many good reasons to do so.
One of the most common uses of robots.txt is keeping a website that is still in its development stage out of the search results.
The same goes for a staging version of your site, where you try out changes before committing them to the live version.
Or, maybe you have some files on your server that you don’t want to pop up on the Internet because they are only for your users.
Is it absolutely necessary to have a robots.txt?
Do you absolutely have to have a robots.txt in place? No, your WordPress website will be indexed by search engines even without that file present.
In fact, WordPress in itself already contains a virtual robots.txt. That said, I would still recommend creating a physical copy on your server. It will make things much easier.
One thing that you should be aware of, however, is that obedience to robots.txt cannot be enforced. The file will be recognized and respected by major search engines, but malicious crawlers and low-quality search crawlers might disregard it completely.
How do I create one and where do I put it?
Making your own robots.txt is as easy as creating a text file with your editor of choice and calling it robots.txt. Simply save and you are done. Seriously, it’s that easy.
Ok, there is a second step involved: uploading it via FTP. The file is usually placed in your root folder, even if you have moved WordPress to its own directory. A good rule of thumb is to put it in the same place as your index.php, wait for the upload to finish, and you’re done.
Be aware that you will need a separate robots.txt file for each subdomain of your site and for different protocols like HTTPS.
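To make the "one file per subdomain and protocol" point concrete, here is a minimal Python sketch. The helper name robots_url and the example.com addresses are my own placeholders, not part of WordPress or any SEO tool. It simply shows that the robots.txt location follows from the protocol and host of a page, never from its path.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the protocol and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical URLs: each protocol/subdomain combination maps to its own file.
for url in (
    "https://example.com/blog/hello-world/",
    "http://example.com/blog/hello-world/",
    "https://shop.example.com/cart?item=1",
):
    print(url, "->", robots_url(url))
```

The three example pages end up with three different robots.txt addresses, which is why a file uploaded for one of them does not cover the others.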
How to set rules inside robots.txt
Now let’s spend some time talking about the content.
The robots.txt has its own syntax to define rules. These rules are also called “directives.” In the following, we will go over how you can use them to let crawlers know what they can and cannot do on your site.
Basic robots.txt syntax
If you rolled your eyes at the word “syntax,” don’t worry, you don’t have to learn a new programming language. Available commands for directives are few. In fact, knowing just two of them is enough for most purposes:
- User-Agent – Defines the search engine crawler
- Disallow – Tells the crawler to stay away from defined files, pages, or directories
If you are not going to set different rules for different crawlers or search engines, an asterisk (*) can be used to define universal directives for all of them. For example, to block everyone from your entire website, you would configure robots.txt in the following way:
User-agent: *
Disallow: /
This basically says that all directories are off limits for all search engines.
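If you want to see the effect of these two lines without waiting for a crawler, Python ships a small robots.txt parser in its standard library. The following is just a sketch for checking rules locally; the example.com URLs are placeholders, and the parser is only one interpretation of the rules, not what any particular search engine runs.

```python
from urllib.robotparser import RobotFileParser

# The same two-line file as above.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Every crawler is refused for every path.
print(parser.can_fetch("Googlebot", "https://example.com/"))  # False
print(parser.can_fetch("Bingbot", "https://example.com/wp-content/uploads/photo.jpg"))  # False
```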
What’s important to note is that the file uses relative (and not absolute) paths. Since robots.txt resides in your root directory, the forward slash denotes a disallow for this location and everything it contains. To define single directories, such as your media folder, as off limits, you would have to write something like /wp-content/uploads/. Also keep in mind that paths are case sensitive.
If it makes sense for you, you can also allow and disallow parts of your site for certain bots. For example, the following code inside your robots.txt would give only Google full access to your website while keeping everyone else out:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
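Here is the same idea checked with Python's standard-library parser. Again, this is only a local sketch with placeholder URLs: it shows that a crawler with its own group follows that group, while everybody else falls through to the catch-all.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An empty Disallow means "nothing is off limits" for Googlebot.
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))  # True
# Any other crawler only matches the wildcard group and is kept out.
print(parser.can_fetch("Bingbot", "https://example.com/blog/"))  # False
```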
It is common practice to define rules for specific crawlers at the beginning of the robots.txt file. Afterwards you can include a User-agent: * wildcard as a catch-all directive for all spiders that don’t have explicit rules. Keep in mind that a crawler with its own group follows only that group and ignores the wildcard rules.
Noteworthy names of user-agents include:
- Googlebot – Google
- Googlebot-Image – Google Image
- Googlebot-News – Google News
- Bingbot – Bing
- Yahoo! Slurp – Yahoo (great choice in name, Yahoo!)
More can be found here:
Again, let me remind you that Google, Yahoo, Bing, and such will generally honor the directives in your file. However, not every crawler out there will.
User-agent and Disallow are not the only directives available. Here are a few more:
- Allow – Explicitly allows crawling an entity on your server
- Sitemap – Tell crawlers where your sitemap resides
- Host – Defines your preferred domain for a site that has multiple mirrors
- Crawl-delay – Sets the time interval search engines should wait between requests to your server
Let’s talk about allow first. A common misconception is that this rule is used to tell search engines to check out your site and is therefore important for SEO reasons. Because of this you will find the following in some robots.txt files:
User-agent: *
Allow: /
This directive is redundant. Why? Because search engines consider everything that is not specifically disallowed on your site as fair game. Telling them that you allow your entire site to be crawled won’t change much about this.
Instead, the allow directive is used to counteract disallow. This is useful in case you want to block an entire directory but give search engines access to one or more specific files inside of it like so:
User-agent: *
Allow: /my-directory/my-file.php
Disallow: /my-directory/
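The search engines would stay away from my-directory in general, but still access my-file.php. However, it’s important to note that you should place the allow directive first. Some crawlers apply the first rule that matches, so with the order reversed they would hit the disallow and never reach the exception.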
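The exception can be verified with Python's standard-library parser, which happens to be one of those parsers that apply the first matching rule. This is a sketch with placeholder URLs (other-file.php and /about/ are made up for illustration), not a statement about how any specific search engine resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /my-directory/my-file.php
Disallow: /my-directory/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Allow line is listed first, so it wins for this one file.
print(parser.can_fetch("*", "https://example.com/my-directory/my-file.php"))  # True
# Everything else in the directory is still blocked.
print(parser.can_fetch("*", "https://example.com/my-directory/other-file.php"))  # False
# Paths that match no rule at all are fair game.
print(parser.can_fetch("*", "https://example.com/about/"))  # True
```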
Some crawlers support the Sitemap directive. You can use it to tell them where to find the sitemap of your website and it would look like this:
Sitemap: http://mysite.com/sitemap_index.xml
Sitemap: http://mysite.com/post-sitemap.xml
Sitemap: http://mysite.com/page-sitemap.xml
Sitemap: http://mysite.com/category-sitemap.xml
Sitemap: http://mysite.com/post_tag-sitemap.xml
The directive can be anywhere within the robots.txt file. Generally website owners choose to place it either at the beginning or the end. Its usefulness, however, is debatable. For example, Yoast has the following thoughts on it:
“I’ve always felt that linking to your XML sitemap from your robots.txt is a bit nonsense. You should be adding them manually to your Google and Bing Webmaster Tools, and make sure you look at their feedback about your XML sitemap.” – Joost de Valk
Therefore, it’s up to you whether to add it to your file or not.
Host and Crawl-delay are two directives I have personally never used. The former tells search engines which domain is your favorite in case you have several mirrors of your site. The latter sets the number of seconds that crawlers should wait between sweeps.
Since both are not that common, I am not going to go too deep into them but I wanted to include them for completion’s sake.
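For completion's sake, here is how Sitemap and Crawl-delay look from a crawler's point of view, using Python's standard-library parser (the site_maps() method needs Python 3.8 or newer). The file content below is a made-up combination of the examples in this article, and the 10-second delay is an arbitrary value I picked. Host is left out because this parser offers no accessor for it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining a crawl delay, one blocked folder, and two sitemaps.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /out/

Sitemap: http://mysite.com/sitemap_index.xml
Sitemap: http://mysite.com/post-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Seconds a polite crawler should wait between requests.
print(parser.crawl_delay("*"))  # 10
# Sitemap URLs are collected regardless of where they appear in the file.
print(parser.site_maps())
```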
Still with me? Well done. Now it gets a little trickier.
We already know that we can set wild cards via an asterisk for User-agent. However, the same is also true for other directives.
For example, if you wanted to block all folders from access that begin with wp-, you could do it like this:
User-agent: *
Disallow: /wp-*/
Makes sense, doesn’t it? The same also works with files. For example, if my goal was to exclude all PDF files in my media folder from showing up in SERPs, I would use this code:
User-agent: *
Disallow: /wp-content/uploads/*/*/*.pdf
Note that I replaced the year and month directories that WordPress automatically sets up with wildcards as well to make sure all files with this ending are caught no matter when they were uploaded.
While this technique does a good job in most cases, sometimes it is necessary to define a string via its end rather than its beginning. This is where the dollar sign wildcard comes in handy:
User-agent: *
Disallow: /page.php$
The aforementioned rule ensures that only page.php gets blocked and not page.php?id=12 as well. The dollar sign tells search engines that page.php is the very end of the string. Neat, huh?
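If the asterisk and the dollar sign still feel abstract, the following Python sketch translates a rule into a regular expression and tries it against a few paths. The functions rule_to_regex and rule_matches are hand-rolled helpers of my own, and the sample paths are invented. Treat it as an approximation of how wildcard-aware crawlers read these patterns, not as any search engine's actual implementation.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    # re.match starts at the beginning of the path, like a prefix rule does.
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/wp-*/", "/wp-admin/"))  # True
print(rule_matches("/wp-content/uploads/*/*/*.pdf",
                   "/wp-content/uploads/2015/06/guide.pdf"))  # True
print(rule_matches("/page.php$", "/page.php"))  # True
print(rule_matches("/page.php$", "/page.php?id=12"))  # False
print(rule_matches("/page.php", "/page.php?id=12"))  # True
```

The last two lines show the difference the dollar sign makes: without it, the rule is a plain prefix and catches the URL with the query string as well.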
Fine, but what should I put into my robots.txt file now?!
I can see you are getting impatient. Where is the code? Aren’t there some optimized directives I can post here that you can just copy and paste, and be done with this topic?
As much as I would like that, the answer is unfortunately no.
Why? Well, one of the reasons is that the content of your robots.txt really depends on your site. You might have a couple of things you would rather keep away from search engines that others don’t care about.
Secondly, and more importantly, there is no agreed upon standard for best practices and optimal ways to set up your robots.txt in terms of SEO. The whole topic is a bit of a debate.
What the experts are doing
For example, the folks over at Yoast only have the following in their robots.txt:
User-Agent: *
Disallow: /out/
As you can see, the only thing they are disallowing is their “out” directory, which houses their affiliate links. Everything else is fair game. The reason is this:
By now Google looks at your site as a whole. If you block the styling components, it will think your site looks like crap and penalize you for it with devastating effects.
To check how Google sees your site, use “Fetch as Google” and then “Fetch and render” in the Crawl section of Google Webmaster Tools. If your robots.txt is too restrictive, your site will probably not look the way you want it and you will need to make some adjustments.
Yoast also strongly advises not to use robots.txt directives to hide low-quality content such as category, date, and other archives, but to work with noindex, follow meta tags instead. Also note that there is no reference to the sitemap in their file for the reason mentioned above.
WordPress founder Matt Mullenweg takes a similar minimalistic approach:
User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin
You can see he blocks only his dropbox and contact folder plus important admin and login files and folders for WordPress. While some people do the latter for security reasons, hiding the wp-admin folder is something that Yoast actually advises against.
Our next example comes from WPBeginner:
User-Agent: *
Allow: /?display=wide
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
Disallow: /refer/

Sitemap: http://www.wpbeginner.com/post-sitemap.xml
Sitemap: http://www.wpbeginner.com/page-sitemap.xml
Sitemap: http://www.wpbeginner.com/deals-sitemap.xml
Sitemap: http://www.wpbeginner.com/hosting-sitemap.xml
You can see that they block their affiliate links (see the “refer” folder) as well as plugins and the readme.html file. As explained in this article, the latter happens to avoid malicious queries aimed at certain versions of WordPress. By disallowing the file, you might be able to protect yourself from mass attacks.
Blocking the plugins folder is also aimed at keeping hackers from going through vulnerable plugins. Here they take a different approach than Yoast, who changed this not too long ago so that styling within plugin folders doesn’t get lost.
One thing that WPBeginner does differently than the other two examples is set wp-content/uploads explicitly to “allow,” even though it is not blocked by any other directive. They state that this is to make all search engines include this folder in their search.
However, I don’t really see the point in this as the default approach of search engines is to index everything they can get their hands on. Therefore I don’t think that allowing them to crawl something specifically is going to help much.
The final verdict
I am with Yoast when it comes to configuring robots.txt.
From an SEO perspective, it makes sense to give Google as much as you can so they are able to understand your site. However, if there are parts that you would like to keep for yourself (such as affiliate links), disallow those as desired.
This also goes hand in hand with the relevant section in the WordPress Codex:
“Adding entries to robots.txt to help SEO is popular misconception. Google says you are welcome to use robots.txt to block parts of your site but these days prefers you don’t. Use page-level noindex tags instead, to tackle low-quality parts of your site. Since 2009, Google has been evermore vocal in its advice to avoid blocking JS & CSS files, and Google’s Search Quality Team has been evermore active in promoting a policy of transparency by webmasters, to help Google verify we’re not “cloaking” or linking to unsightly spam on blocked pages. Therefore the ideal robots file disallows nothing whatsoever, and may link to an XML Sitemap if an accurate one has been constructed (which itself is rare though!).
WordPress by default only blocks a couple of JS files but is nearly compliant with Google’s guidance here.”
Pretty clear, isn’t it? Keep in mind that if you decide to link to a sitemap, you should definitely submit it to the search engines directly via their Webmaster suites as well.
Whatever you decide to do, don’t forget to test your robots.txt! This can be done in the following ways:
- Go to yoursite.com/robots.txt to see if it shows up
- Run it through a tester tool to find syntax errors (for example this one)
- Fetch and render to check if Google sees what you would like them to see
- Look out for possible error messages from Google Webmaster Tools
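In addition to the checks above, you can run a quick local test before uploading the file. The Python sketch below lists which URLs from a "must stay crawlable" list would be blocked. The helper blocked_urls, the rules, and the example.com URLs (including the my-theme and my-slider names) are all made up for illustration. Also note that Python's built-in parser only does simple prefix matching, so rules that rely on wildcards need a different tester.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given crawler may not fetch under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical draft of a robots.txt file.
robots_txt = """\
User-agent: *
Disallow: /wp-content/plugins/
Disallow: /refer/
"""

# Hypothetical pages and assets that Google needs in order to render the site.
must_stay_crawlable = [
    "https://example.com/",
    "https://example.com/wp-content/themes/my-theme/style.css",
    "https://example.com/wp-content/plugins/my-slider/slider.css",
]

for url in blocked_urls(robots_txt, "Googlebot", must_stay_crawlable):
    print("Blocked:", url)
```

With this draft, the stylesheet inside the plugins folder is reported as blocked, which is exactly the kind of problem that makes a site render badly for Google.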
Robots.txt for WordPress in a nutshell
Setting up a robots.txt file for your website is an important and often disregarded step in search engine optimization. Telling web crawlers which parts of your site to index and which parts to leave alone helps you keep unnecessary content out of search engine results.
On the other hand, as we have seen, blocking Google from too much of your site can seriously harm its performance in the SERPs.
While in the past it was appropriate to disallow a bunch of folders and files from being accessed, today the trend goes more toward having a minimally set up robots.txt.
When configuring your file, make sure to test it thoroughly so it is not hurting you more than it is helping.
How did you set up robots.txt? Any important points you would like to add?