Understanding and optimizing the WordPress robots.txt

When it comes to SEO, most people have a decent understanding of the basics. They know about keywords, and how they should show up in different places throughout their content. They have heard of on-page SEO, and have maybe even given WordPress SEO plugins a whirl.

If you dive down into the nitty-gritty of search engine optimization, however, there are a few rather obscure pieces of the puzzle that not everyone knows about. One of them is the robots.txt file.

What are robots.txt files and what are they used for?

A robots.txt file is a text file that resides on your server. It contains rules for crawling your website and is a tool to communicate directly with search engines.

Basically, the file says which parts of your site Google is allowed to crawl and which parts it should leave alone.

But why would you ever tell Google not to crawl something on your site? Isn’t that harmful from an SEO perspective? Actually, there are several good reasons to do so.

One of the most common uses of robots.txt is keeping a website that is still in development out of the search results.

The same goes for a staging version of your site, where you try out changes before committing them to the live version.

Or, maybe you have some files on your server that you don’t want to pop up on the Internet because they are only for your users.

Is it absolutely necessary to have a robots.txt?

Do you absolutely have to have a robots.txt in place? No, your WordPress website will be indexed by search engines even without that file present.

In fact, WordPress in itself already contains a virtual robots.txt. That said, I would still recommend creating a physical copy on your server. It will make things much easier.

One thing that you should be aware of, however, is that obedience to robots.txt cannot be enforced. The file will be recognized and respected by major search engines, but malicious crawlers and low-quality search crawlers might disregard it completely.

How do I create one and where do I put it?

Making your own robots.txt is as easy as creating a text file with your editor of choice and calling it robots.txt. Simply save and you are done. Seriously, it’s that easy.

Ok, there is a second step involved: uploading it via FTP. The file is usually placed in your root folder, even if you have moved WordPress to its own directory. A good rule of thumb is to put it in the same place as your index.php, wait for the upload to finish, and you’re done.

Be aware that you will need a separate robots.txt file for each subdomain of your site and for different protocols like HTTPS.
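Because crawlers look for the file at the root of each protocol and host combination, it can help to work out which robots.txt actually applies to a given page. Here is a minimal Python sketch of that mapping. The example.com URLs are placeholders, and robots_url is a helper name made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain or port)
    # and replace path, query, and fragment with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A subdomain and a different protocol each get their own file.
print(robots_url("https://blog.example.com/2015/04/hello-world/"))
print(robots_url("http://example.com/shop/cart?item=3"))
```

Each distinct result is a file you would need to create and upload separately.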

How to set rules inside robots.txt

Now let’s spend some time talking about the content.

The robots.txt has its own syntax to define rules. These rules are also called “directives.” In the following, we will go over how you can use them to let crawlers know what they can and cannot do on your site.

Basic robots.txt syntax

If you rolled your eyes at the word “syntax,” don’t worry, you don’t have to learn a new programming language. Available commands for directives are few. In fact, knowing just two of them is enough for most purposes:

User-Agent – Defines the search engine crawler
Disallow – Tells the crawler to stay away from defined files, pages, or directories

If you are not going to set different rules for different crawlers or search engines, an asterisk (*) can be used to define universal directives for all of them. For example, to block everyone from your entire website, you would configure robots.txt in the following way:

User-agent: *
Disallow: /

This basically says that all directories are off limits for all search engines.

What’s important to note is that the file uses relative (and not absolute) paths. Since robots.txt resides in your root directory, the forward slash denotes a disallow for this location and everything it contains. To define single directories, such as your media folder, as off limits, you would have to write something like /wp-content/uploads/. Also keep in mind that paths are case sensitive.
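If you want to see how a rule-abiding crawler would read these directives, Python's standard urllib.robotparser module can evaluate them locally. The following is only a rough sketch: "ExampleBot" and the example.com URLs are invented for the demonstration, and real search engines may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Block everything for everyone.
block_all = build_parser("User-agent: *\nDisallow: /\n")

# Block only the media folder.
block_uploads = build_parser("User-agent: *\nDisallow: /wp-content/uploads/\n")

site = "https://example.com"
print(block_all.can_fetch("ExampleBot", site + "/about/"))
print(block_uploads.can_fetch("ExampleBot", site + "/wp-content/uploads/2015/04/photo.jpg"))
print(block_uploads.can_fetch("ExampleBot", site + "/about/"))
# Paths are case sensitive, so a differently cased path is not covered.
print(block_uploads.can_fetch("ExampleBot", site + "/WP-Content/Uploads/photo.jpg"))
```

Feeding the rules in as a string keeps the check offline, which is handy before you upload anything.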

If it makes sense for you, you can also allow and disallow parts of your site for certain bots. For example, the following code inside your robots.txt would give only Google full access to your website while keeping everyone else out:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

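Major search engines such as Google pick the group that most specifically matches their user-agent, wherever it appears in the file. Some simpler parsers read the file from top to bottom, however, so it is safest to define rules for specific crawlers at the beginning of the robots.txt file. Afterwards you can include a User-agent: * wildcard as a catch-all directive for all spiders that don’t have explicit rules.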
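To double-check a per-crawler setup like the Googlebot example, you could run it through Python's urllib.robotparser. This is a sketch under assumptions: the page URL is a placeholder, and the parser only approximates how each real crawler matches its own name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/some-post/"  # placeholder URL

# Googlebot has its own group with an empty Disallow, so it may crawl.
print("Googlebot:", parser.can_fetch("Googlebot", page))
# Every other crawler falls through to the catch-all group.
print("Bingbot:", parser.can_fetch("Bingbot", page))
```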

Noteworthy names of user-agents include:

Googlebot – Google
Googlebot-Image – Google Image
Googlebot-News – Google News
Bingbot – Bing
Yahoo! Slurp – Yahoo (great choice in name, Yahoo!)

More can be found here:

Again, let me remind you that Google, Yahoo, Bing, and the like will generally honor the directives in your file, but not every crawler out there will.

Additional syntax

Disallow and User-agent are not the only rules available. Here are a few more:

Allow – Explicitly allows crawling an entity on your server
Sitemap – Tells crawlers where your sitemap resides
Host – Defines your preferred domain for a site that has multiple mirrors
Crawl-delay – Sets the time interval search engines should wait between requests to your server

Let’s talk about allow first. A common misconception is that this rule is used to tell search engines to check out your site and is therefore important for SEO reasons. Because of this you will find the following in some robots.txt files:

User-agent: *
Allow: /

This directive is redundant. Why? Because search engines consider everything that is not specifically disallowed on your site as fair game. Telling them that you allow your entire site to be crawled won’t change much about this.

Instead the allow directive is used to counteract disallow. This is useful in case you want to block an entire directory but give search engines access to one or more specific files inside of it like so:

User-agent: *
Allow: /my-directory/my-file.php
Disallow: /my-directory/

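The search engines would stay away from my-directory in general, but still access my-file.php. Google and other crawlers that follow the current standard apply the most specific (longest) matching rule, so the order does not matter to them. Some simpler parsers stop at the first rule that matches, though, so placing the allow directive first is the safest choice.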
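A pairing like this is easy to verify locally with Python's urllib.robotparser. The sketch below reuses the directory and file names from the listing; "ExampleBot" and the example.com host are made up. Keeping the Allow line first works for parsers that stop at the first match as well as for those that pick the longest match.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /my-directory/my-file.php
Disallow: /my-directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://example.com"  # placeholder host

# The explicitly allowed file stays reachable...
print(parser.can_fetch("ExampleBot", base + "/my-directory/my-file.php"))
# ...while everything else in the directory is off limits.
print(parser.can_fetch("ExampleBot", base + "/my-directory/other-file.php"))
# Paths outside the directory are unaffected.
print(parser.can_fetch("ExampleBot", base + "/blog/"))
```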

Some crawlers support the Sitemap directive. You can use it to tell them where to find the sitemap of your website and it would look like this:

Sitemap: http://mysite.com/sitemap_index.xml
Sitemap: http://mysite.com/post-sitemap.xml
Sitemap: http://mysite.com/page-sitemap.xml
Sitemap: http://mysite.com/category-sitemap.xml
Sitemap: http://mysite.com/post_tag-sitemap.xml

The directive can be anywhere within the robots.txt file. Generally website owners choose to place it either at the beginning or the end. Its usefulness, however, is debatable. For example, Yoast has the following thoughts on it:

“I’ve always felt that linking to your XML sitemap from your robots.txt is a bit nonsense. You should be adding them manually to your Google and Bing Webmaster Tools, and make sure you look at their feedback about your XML sitemap.” – Joost de Valk

Therefore, it’s up to you whether to add it to your file or not.
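If you do decide to list your sitemaps, you can confirm that a parser picks them up. Here is a small sketch using the site_maps() method of Python's urllib.robotparser, which needs Python 3.8 or newer. The example.com URLs are placeholders modeled on the listing above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /out/

Sitemap: http://example.com/sitemap_index.xml
Sitemap: http://example.com/post-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the Sitemap URLs in file order,
# or None if the file does not list any.
for sitemap_url in parser.site_maps() or []:
    print(sitemap_url)
```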

Host and Crawl-delay are two directives I have personally never used. The former tells search engines which domain is your favorite in case you have several mirrors of your site. The latter sets the number of seconds that crawlers should wait between requests. Support for both is patchy: Google, for instance, ignores Crawl-delay, and Host has mainly been a Yandex-specific directive.

Since neither is that common, I am not going to go too deep into them, but I wanted to include them for completeness’ sake.
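For the curious, here is a rough Python sketch of how a crawler might read these two values. crawl_delay() is part of urllib.robotparser (Python 3.6 or newer). The module has no accessor for Host, so host_directive is a hypothetical helper written just for this example, as is the example.com domain.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Host: example.com
"""


def host_directive(robots_txt: str):
    """Return the value of the first Host line, or None if there is none."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "host":
            return value.strip()
    return None


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Number of seconds a polite crawler would wait between requests.
print("Crawl-delay:", parser.crawl_delay("ExampleBot"))
print("Host:", host_directive(ROBOTS_TXT))
```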

Advanced stuff

Still with me? Well done. Now it gets a little trickier.

We already know that we can set wild cards via an asterisk for User-agent. However, the same is also true for other directives.

For example, if you wanted to block access to all folders that begin with wp-, you could do so like this:

User-agent: *
Disallow: /wp-*/

Makes sense, doesn’t it? The same also works with files. For example, if my goal was to exclude all PDF files in my media folder from showing up in SERPs, I would use this code:

User-agent: *
Disallow: /wp-content/uploads/*/*/*.pdf

Note that I replaced the year and month directories that WordPress automatically sets up with wildcards as well to make sure all files with this ending are caught no matter when they were uploaded.

While this technique does a good job in most cases, sometimes it is necessary to define a string via its end rather than its beginning. This is where the dollar sign wildcard comes in handy:

User-agent: *
Disallow: /page.php$

The aforementioned rule ensures that only page.php gets blocked and not page.php?id=12 as well. The dollar sign tells search engines that page.php is the very end of the string. Neat, huh?
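To get a feel for how the asterisk and the dollar sign behave, you can translate a rule into a regular expression and try it against a few paths. The Python sketch below follows the matching behavior as it is commonly described for Google-style parsers (a rule is a prefix match, * stands for any run of characters, a trailing $ anchors the end). rule_matches is a hypothetical helper for illustration, not part of any library, and individual crawlers may differ.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern applies to a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing $, the pattern only has to match the start of the path.
    regex = "^" + body + ("$" if anchored else "")
    return re.match(regex, path) is not None


print(rule_matches("/wp-*/", "/wp-admin/"))
print(rule_matches("/wp-content/uploads/*/*/*.pdf", "/wp-content/uploads/2015/04/guide.pdf"))
print(rule_matches("/page.php$", "/page.php"))
print(rule_matches("/page.php$", "/page.php?id=12"))
```

Without the dollar sign, /page.php would also match /page.php?id=12, which is exactly the difference described above.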

Fine, but what should I put into my robots.txt file now?!

I can see you are getting impatient. Where is the code? Aren’t there some optimized directives I can post here that you can just copy and paste, and be done with this topic?

As much as I would like that, the answer is unfortunately no.

Why? Well, one of the reasons is that the content of your robots.txt really depends on your site. You might have a couple of things you would rather keep away from search engines that others don’t care about.

Secondly, and more importantly, there is no agreed upon standard for best practices and optimal ways to set up your robots.txt in terms of SEO. The whole topic is a bit of a debate.

What the experts are doing

For example, the folks over at Yoast only have the following in their robots.txt:

User-Agent: *
Disallow: /out/

As you can see, the only thing they are disallowing is their “out” directory, which houses their affiliate links. Everything else is fair game. The reason is this:

“No longer is Google the dumb little kid that just fetches your sites HTML and ignores your styling and JavaScript. It fetches everything and renders your pages completely. This means that when you deny Google access to your CSS or JavaScript files, it doesn’t like that at all.” – Yoast

By now, Google looks at your site as a whole. If you block the styling components, it cannot render your pages properly, will think your site looks like crap, and your rankings can suffer for it.

To check how Google sees your site, use “Fetch as Google” and then “Fetch and render” in the Crawl section of Google Webmaster Tools (now Google Search Console, where the URL Inspection tool has taken over this job). If your robots.txt is too restrictive, your site will probably not look the way you want it to and you will need to make some adjustments.
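Before reaching for Google's own tools, you can run a quick local check of whether your rules would keep a crawler away from stylesheets and scripts. The Python sketch below is only an approximation: the asset URLs are invented, blocked_assets is a made-up helper, and urllib.robotparser does not replicate everything Googlebot does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/plugins/
"""

# Hypothetical asset URLs that a page on the site might load.
ASSETS = [
    "https://example.com/wp-content/themes/my-theme/style.css",
    "https://example.com/wp-content/plugins/slider/slider.css",
    "https://example.com/wp-content/plugins/slider/slider.js",
]


def blocked_assets(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given rules forbid the agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


for url in blocked_assets(ROBOTS_TXT, ASSETS):
    print("blocked:", url)
```

Anything this prints is a file the crawler could not load when rendering the page.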

Yoast also strongly advises not to use robots.txt directives to hide low-quality content such as category, date, and other archives, but to work with noindex, follow meta tags instead. Also note that there is no reference to the sitemap in their file for the reason mentioned above.

WordPress founder Matt Mullenweg takes a similar minimalistic approach:

User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin

You can see he blocks only his dropbox and contact folder plus important admin and login files and folders for WordPress. While some people do the latter for security reasons, hiding the wp-admin folder is something that Yoast actually advises against.

Our next example comes from WPBeginner:

User-Agent: *
Allow: /?display=wide
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
Disallow: /refer/

Sitemap: http://www.wpbeginner.com/post-sitemap.xml
Sitemap: http://www.wpbeginner.com/page-sitemap.xml
Sitemap: http://www.wpbeginner.com/deals-sitemap.xml
Sitemap: http://www.wpbeginner.com/hosting-sitemap.xml

You can see that they block their affiliate links (see the “refer” folder) as well as plugins and the readme.html file. As explained in this article, the latter happens to avoid malicious queries aimed at certain versions of WordPress. By disallowing the file, you might be able to protect yourself from mass attacks.

Blocking the plugins folder is also aimed at keeping hackers from going through vulnerable plugins. Here they take a different approach than Yoast, who changed this not too long ago so that styling within plugin folders doesn’t get lost.

One thing that WPBeginner does differently than the other two examples is set wp-content/uploads explicitly to “allow,” even though it is not blocked by any other directive. They state that this is to make all search engines include this folder in their search.

However, I don’t really see the point in this as the default approach of search engines is to index everything they can get their hands on. Therefore I don’t think that allowing them to crawl something specifically is going to help much.

The final verdict

I am with Yoast when it comes to configuring robots.txt.

From an SEO perspective, it makes sense to give Google as much as you can so they are able to understand your site. However, if there are parts that you would like to keep for yourself (such as affiliate links), disallow those as desired.

This also goes hand in hand with the relevant section in the WordPress Codex:

“Adding entries to robots.txt to help SEO is popular misconception. Google says you are welcome to use robots.txt to block parts of your site but these days prefers you don’t. Use page-level noindex tags instead, to tackle low-quality parts of your site. Since 2009, Google has been evermore vocal in its advice to avoid blocking JS & CSS files, and Google’s Search Quality Team has been evermore active in promoting a policy of transparency by webmasters, to help Google verify we’re not “cloaking” or linking to unsightly spam on blocked pages. Therefore the ideal robots file disallows nothing whatsoever, and may link to an XML Sitemap if an accurate one has been constructed (which itself is rare though!).

WordPress by default only blocks a couple of JS files but is nearly compliant with Google’s guidance here.”

Pretty clear, isn’t it? Keep in mind that if you decide to link to a sitemap, you should definitely submit it to the search engines directly via their Webmaster suites as well.

Whatever you decide to do, don’t forget to test your robots.txt file! This can be done in the following ways:

Go to yoursite.com/robots.txt to see if it shows up
Run it through a tester tool to find syntax errors (for example this one)
Fetch and render to check if Google sees what you would like them to see
Look out for possible error messages from Google Webmaster Tools
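If you would like a first sanity check before using an online tester, a few lines of Python can catch the most common slips, such as a misspelled directive or a path without a leading slash. This is a deliberately simple, hypothetical linter (lint_robots and its list of known directives are my own assumptions), not a replacement for the tools above.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "host", "crawl-delay"}


def lint_robots(text: str):
    """Return a list of (line_number, message) tuples for suspicious lines."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{field}'"))
        elif name == "user-agent":
            seen_agent = True
        elif name in {"allow", "disallow"}:
            if not seen_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


SAMPLE = "User-agent: *\nDisalow: /out/\nDisallow: wp-admin\n"
for number, message in lint_robots(SAMPLE):
    print(f"line {number}: {message}")
```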

Robots.txt for WordPress in a nutshell

Setting up a robots.txt file for your website is an important and often disregarded step in search engine optimization. Telling web crawlers which parts of your site to index and which parts to leave alone helps you keep unnecessary content out of search engine results.

On the other hand, as we have seen, blocking Google from too much of your site can seriously harm its performance in the SERPs.

While in the past it was appropriate to disallow a bunch of folders and files from being accessed, today the trend goes more toward having a minimally set up robots.txt.

When configuring your file, make sure to test it thoroughly so it is not hurting you more than it is helping.

How did you set up robots.txt? Any important points you would like to add?

There are 24 comments

April 14, 2015 #

Piet

Thanks for the informative article! Still torn between nothing at all or a minimal file. Like if you want to do it the way Joost de Valk does it, but you don’t have a folder for affiliate links, then you pretty much don’t need the robots.txt file, right?
- April 27, 2015 #
  
  Nick
  
  Hey Piet, thanks for the comment. Sorry it took me a while to answer, I was on honeymoon and not online during that time.
  From what I have read, if you want to take control of the content of your robots.txt (even if it’s as good as empty), you should go ahead and create an actual file. If you don’t, the WordPress virtual robots.txt will be used and it might not have the same content as you would like it to. On pretty much all of my sites I have a near empty file residing on the server.
  Cheers!
  - April 30, 2015 #
    
    Piet
    
    Congrats on the marriage! Hope you had a great honeymoon!
    - April 30, 2015 #
      
      Nick
      
      Thanks man!
May 26, 2015 #

Sacha Benda

Hi Nick, thanks for the tips and some clarification.
I know there are quite different points of view, but what do you think about indexing – for a blog – archive pages (categories, tags, date, author, etc)?
Thanks
June 29, 2015 #

Till Vennemann

Hey Nick, super cool. Here I am looking for info and I find your post. 🙂
July 3, 2015 #

SandyT

This is the best article on the topic I have read so far! Thanks a lot for comparing different opinions. I have read through a couple posts about robots.txt before and was very confused, this really cleared things up!
September 30, 2015 #

johnjoepeach

Hi Nick, thank you for imparting your in-depth knowledge about robots.txt. Clearly, this is one of the most important topics and SEO activity that’s usually overlooked. Good thing, its importance is being discussed and being realized by SEO practitioners. Looking forward to reading more of your posts, Nick.
January 4, 2016 #

ScottM

Nice article. I’m new to WP. I have used robots.txt before and this article gave me the understanding of how it can help or hurt WP depending on my goals.
March 26, 2016 #

catalogo unique

Thanks for this article, was well explained, follow these tips, thanks.
- March 28, 2016 #
  
  Nick Schäferhoff
  
  You are welcome. Thanks for the comment!
April 12, 2016 #

lucky

Helpful article, thank you for providing it.
- April 13, 2016 #
  
  Nick Schäferhoff
  
  Sure thing, happy you found it helpful!
August 1, 2016 #

Obat Mata Ikan Di Apotik

This article is helpful, thanks for sharing.
- August 2, 2016 #
  
  Nick Schäferhoff
  
  Thanks for the comment!
October 13, 2016 #

Gawai

Hi, nice article. But how about disallowing ?replytocom in robots.txt?
Can you give me an answer?
Thanks, Nick.
January 27, 2017 #

/

Awesome issues here. I’m very satisfied to look your post.

Thanks so much and I am having a look forward to contact you.

Will you please drop me a e-mail?
August 5, 2017 #

Tecmobs

Thanks bro, this will help me a lot.
- August 5, 2017 #
  
  Nick Schäferhoff
  
  No problem, bro. Glad to be of help.
December 1, 2017 #

Amol Sarise

Thank you, Nick, You help me to learn something new. I like your post and keep up the nice work.
- December 1, 2017 #
  
  Nick Schäferhoff
  
  Thanks for the kind words, Amol! Glad I could help.
March 21, 2019 #

Wakeel Ahmed

Wow, such great info!
I really learned lots of new things about robots.txt.
It means we should carefully define robots.txt to index our pages and disallow private links.
May 27, 2020 #

Pupparazzi Simon

Thanks, that answered my Robots.txt questions. Is it me, or has Yoast removed the robots.txt element from their plugin recently?
- June 1, 2020 #
  
  Nick Schäferhoff
  
  Not that I know, but I haven’t checked in a long time. Can you confirm?