In this post, we will cover how we use Cloudflare Workers and Transform Rules to prevent search engines from crawling and indexing our umbhost.dev domain and its subdomains.
We use the umbhost.dev domain to hold our temporary and pre-live URLs, so it's massively important that these do not get indexed by search engines.
You could also use this as a guide to block search engines from any subdomain (staging.domain.com for example).
Transform Rule
We have added a response header to every request that goes through the umbhost.dev domain: the header is called X-Robots-Tag and has a value of noindex.
This header instructs crawlers not to index a page; you can read more about the X-Robots-Tag header in the Google Search Central documentation.
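For reference, the header that ends up on every response is simply:

```
X-Robots-Tag: noindex
```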
To add this header you will need to log in to the Cloudflare Dashboard and then browse to Domain Zone -> Rules -> Transform Rules -> Modify Response Header.
On this screen click on the + Create rule button.
To target every URL accessed through the domain, configure the rule as follows:
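As a rough sketch (the exact field labels in the dashboard may differ slightly), the rule looks something like this:

```
Rule name:  Add X-Robots-Tag to all responses
If:         All incoming requests
Then:       Set static response header
            Header name:  X-Robots-Tag
            Value:        noindex
```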
If you wish to target only a single subdomain (or several), configure the rule as follows:
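Again as a sketch, using a hypothetical staging.domain.com subdomain, a custom filter expression restricts the rule to specific hostnames:

```
If (custom filter expression):
    (http.host eq "staging.domain.com")
    or, for multiple subdomains:
    (http.host in {"staging.domain.com" "preview.domain.com"})
Then:
    Set static response header
    Header name:  X-Robots-Tag
    Value:        noindex
```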
Cloudflare Worker
Next, we wanted a robots.txt which is handled at the edge and automatically applied to every URL served.
To do this we make use of Cloudflare Workers, which run serverless code deployed directly to the edge.
To create a Cloudflare Worker, browse to Workers & Pages -> Overview.
On the Overview screen, click on the Create Application button and then on the Create Worker button.
On the next screen, give your worker a name such as project-robots-blocker and then click on the Deploy button (the code can't be edited until it has been deployed).
On the next screen click the Edit Code button, and in the editor which opens, replace the default code with the snippet below.
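The exact code we use may differ, but a minimal Worker that serves a blocking robots.txt looks something like this (a sketch using the ES module syntax):

```js
export default {
  async fetch() {
    // Serve a robots.txt that tells compliant crawlers to stay away from every path.
    const body = "User-agent: *\nDisallow: /";
    return new Response(body, {
      headers: { "content-type": "text/plain" },
    });
  },
};
```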
This code will return the robots.txt in the following format:
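With the sketch above, that is the standard pair of directives which blocks every compliant crawler from all paths:

```
User-agent: *
Disallow: /
```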
Next, click Save and Deploy.
Now we need to hook this up to our domain. To do this, browse to Domain Zone -> Worker Routes and then click on the Add Route button.
In the window which pops up, enter the route as follows:
*.domain.com/robots.txt
(Make sure to replace domain.com with your domain)
The asterisk targets all subdomains on the domain; if you wish to target a single subdomain, replace the asterisk with the required subdomain.
Finally, click on Save.
And that's all there is to it: all requests to the domain will now automatically have the X-Robots-Tag header applied and a robots.txt file served at the edge.
No more chance of a pre-production site or temporary URL accidentally ending up indexed by search engines.
(Only the good search engines which obey the rules will be affected)
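If you want to verify the setup, a quick check with curl against a hypothetical subdomain should show both the header and the file:

```bash
# The response headers should now include X-Robots-Tag: noindex
curl -I https://staging.domain.com/

# And robots.txt should be served by the Worker at the edge
curl https://staging.domain.com/robots.txt
```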