Recap
In the previous article in this series, New Domain Names - Part 4: Redirects and Deindexing, I redirected all the pages on the old site that had already been moved to the new site but hadn't yet been redirected, and I decided that the pages not being recreated here would, where appropriate, return a 410 Gone status code.
Because the old site used a CMS, those 410'd pages weren't technically permanently deleted, but it is improbable they will ever be recreated at the same URLs (especially given that I can't even log in to the old site now because the login page is 410'd).
The problem I made for myself is that I hadn't recreated all of the content on the new site, because I had been doing it gradually, section by section. Because of that, I used the Google Change of Address Tool to say that watfordjc.com had moved to watfordjc.co.uk.
While that meant that watfordjc.com started getting deindexed, watfordjc.co.uk's deindexing (from the 301 and 410 status codes) gradually ground to a halt. At the time of writing there are two indexed links for watfordjc.com (1 listed in Webmaster Tools) and 24 for watfordjc.co.uk (of "about 137 results"—Webmaster Tools lists 46 indexed and 48 blocked by robots.txt).
With 44 days until watfordjc.com expires, and 97 days until watfordjc.co.uk expires, I need to do all I can to make sure that Google's index is correct for my sites. That means no more watfordjc.com or watfordjc.co.uk in the search results.
robots.txt and X-Robots-Tag
I have removed all rules that would apply to Googlebot from robots.txt on all of my personal sites. I have, instead, told nginx to return a 404/410 error for files that shouldn't be indexed (e.g. internal files such as those used by Grunt or Git).
I have also removed a site-wide X-Robots-Tag that applied to all pages on one domain.
That means that any page that is accessible from the Web should be accessible to Googlebot now.
Redirects, error pages, everything.
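For illustration, the nginx side of that looks something like the following. This is a sketch rather than a copy of my configuration, and the paths shown are just examples of the sort of internal files I mean:

# Sketch only: return errors for internal files instead of hiding them in robots.txt.
# The paths here (Git metadata, Grunt/npm files) are illustrative examples.
location ~ /\.git {
    return 410;
}
location = /Gruntfile.js {
    return 404;
}
location = /package.json {
    return 404;
}
# add_header X-Robots-Tag "noindex";  # the sort of site-wide header that has now been removed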
With that done, I can now grep my nginx log for particular status codes so I can ensure every redirect that is needed is in place.
Redirect Checking
egrep -h "watfordjc.(co.uk|com)" /var/log/nginx/access.log* | egrep -v " (301|410) " | egrep "(jpg|jpeg|png/i)" | sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \(.*\) \"GET \(.*\) HTTP\/1\.1\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9#\10/g'
That one-liner will search the log files (non-standard format) for any .jpg/.jpeg/.png file requested for which a 301 or 410 was not the response. It then converts the output lines into octothorpe-separated values.
We want the 6th value for the location, so if we make that one-liner a bit longer...
egrep -h "watfordjc.(co.uk|com)" /var/log/nginx/access.log* | egrep "(jpg|jpeg|png/i)" | sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \(.*\) \"GET \(.*\) HTTP\/1\.1\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9#\10/g' | awk -F# '{print $7$6}' | sort | uniq
200/downloads/images/no_image.jpg
200/images/shaving-thumb2.jpg
Two images have been requested and received a 200 response code.
And if we just want to get a list of every page so we can then check them:
egrep -h "(watfordjc\.com|watfordjc\.co\.uk)" /var/log/nginx/access.log* | sed 's/^\(.*\) \(.*\) \(.*\) \[\(.*\)\] \(.*\) \"GET \(.*\) HTTP\/1\.1\" \(.*\) \(.*\) \"\(.*\)\" \"\(.*\)\"$/\1#\2#\3#\4#\5#\6#\7#\8#\9#\10/g' | awk -F# '{print $7$5$6}' | sort | uniq | grep -vE "://web\.(watfordjc|johncook)\.uk"
Add | egrep "^200"
to the end of that one-liner and we've got a list of every file that has returned a 200 response on the old sites from all of my nginx logs.
Because my logs aren't technically in NCSA format (I've added the request protocol, domain, and port together before the request field), the above one-liner actually prints out a list of URLs prepended with the status code. GNOME Terminal likes that: I can Ctrl-click any link and it'll open in my browser.
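For the curious, a log_format along these lines would produce lines in that shape. This is a reconstruction from the fields the sed pattern expects, not my actual directive:

# A guess at the log_format: NCSA combined with $scheme://$host:$server_port
# inserted before the request field, which is why the awk fields above are
# shifted compared with a standard combined log.
log_format site_combined '$remote_addr - $remote_user [$time_local] '
                         '$scheme://$host:$server_port "$request" '
                         '$status $body_bytes_sent "$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log site_combined;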
Here's part of the output from that one-liner:
200https://web.watfordjc.co.uk:443/
200https://web.watfordjc.co.uk:443/images/favicon.ico
200https://web.watfordjc.co.uk:443/?new_message=1
200https://web.watfordjc.co.uk:443/robots.txt
/images/favicon.ico and /favicon.ico on the old sites can become a permanent redirect to https://web.johncook.uk/img/favicon.ico on this site.
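In nginx terms that is just a pair of exact-match locations in the old sites' server blocks; a sketch rather than necessarily how my configuration is written:

# Old favicon locations permanently redirect to the favicon on the new site.
location = /favicon.ico {
    return 301 https://web.johncook.uk/img/favicon.ico;
}
location = /images/favicon.ico {
    return 301 https://web.johncook.uk/img/favicon.ico;
}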
Google Site Move
You can't chain site moves with Google Webmaster Tools. I thought I had made a big error by moving watfordjc.com to watfordjc.co.uk, but I only thought that because the documentation for the facility is lacking.
Basically, it works like this:
- You plan to move domains.
- You move your content to their new home.
- Once all your content (or at least all that you are keeping) is at its new home, you are now ready to use the tool.
- You make sure Googlebot has access to everything it needs. Any robot restrictions on redirects might make Googlebot ignore them.
- You set up 301 redirects for the things you moved, and 404/410 the things you didn't.
- The site structure can be completely different at the new site, just make sure the redirects from the old site point to the new locations.
- You 301 the old / page to the new / page. The new / page must not redirect elsewhere.
- You tell Google about the Change of Address.
- Google gradually crawls your old site and, whenever it follows a redirect to your new site, it remembers for later that it has to change the old URL to the new one in the Google Index.
- If you subsequently move site, you can withdraw the Change of Address Request (for old site 1). You can then repeat the process telling Google both of your old sites (old site 1 and old site 2) have moved address to your new site. You must still have control over old site 1 to withdraw its change of address.
- Any 301 redirects from old site 1 to old site 2 should be modified so they redirect from old site 1 to the new site (a sketch of this follows the list).
- Google does not simply replace the old domain with the new domain in the index. It follows 301 redirects and "moves" the old URL in its index to the new one it was redirected to.
- I believe the redirects from your old sites must point to locations at the domain you've told Google you've moved to.
- Multiple sites can move to a new site. If you have multiple versions of a domain (e.g. http://watfordjc.co.uk, https://watfordjc.co.uk, http://web.watfordjc.co.uk, https://web.watfordjc.co.uk) double-check that you have added all of them to Webmaster Tools and then double-check that the Site Move has been performed for all of the variants (in my case web.watfordjc.co.uk, watfordjc.co.uk, and watfordjc.com).
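As for modifying the redirects from old site 1 so they skip old site 2, the change amounts to pointing old site 1's server block straight at the new site. This is a rough sketch using my domains; the blanket $request_uri redirect stands in for whatever per-page redirects are actually in place:

# Sketch: watfordjc.com (old site 1) now redirects straight to the new site
# instead of hopping via watfordjc.co.uk (old site 2). The catch-all
# $request_uri redirect here stands in for the real per-page redirects,
# and the ssl_certificate directives are omitted.
server {
    listen 80;
    listen 443 ssl;
    server_name watfordjc.com;
    return 301 https://web.johncook.uk$request_uri;
}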
Increasing Caching
I have, until now, prevented CloudFlare from caching my static files. The reason is that CloudFlare removes my Cache-Control headers, including no-transform.
I was originally going to write here about how I had set different headers and started allowing CloudFlare to cache my static content. That is not the case.
Because CloudFlare will not pass on the no-transform header, I have no faith they won't at some point start messing with my JavaScript, CSS, or images. If T-Mobile could do it, so could CloudFlare.
So, instead of switching from headers that only permit private caching to headers that permit public caching and let CloudFlare cache my images and stuff, I have disabled CloudFlare for my web sub-domain and set the DNS TTL to 10 minutes.
My static content now has a max-age of 2,592,000 seconds (30 days) and an s-maxage of 1,296,000 seconds (15 days) with no-transform specified in the headers.
I have also ditched the Expires header due to a flaw in nginx: the expires directive adds another Cache-Control header instead of appending to the existing one. There are probably still some HTTP/1.0 clients out there, but if they want caching they can start supporting HTTP/1.1 or HTTP/2.0.
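In nginx that boils down to a single Cache-Control header on the static locations. A sketch, with the location pattern as an example rather than my exact one:

# Static content: 30 days for private caches, 15 days for shared caches,
# and no-transform so intermediaries shouldn't rewrite anything.
# A single add_header is used rather than the expires directive, since
# expires would emit a second Cache-Control header instead of appending
# to this one.
location ~* \.(css|js|jpg|jpeg|png|gif|ico)$ {
    add_header Cache-Control "max-age=2592000, s-maxage=1296000, no-transform";
}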