Creating Sitemaps

One of the things this site doesn't currently have is a sitemap for robots. Time to fix that.

Current Sitemap

The current Complete Sitemap for this site is purely a multi-tier list.

Although it contains links, and robots/spiders/crawlers are able to read the links in those lists, without intelligent parsing they might not be able to see the site structure from the list.

Unlike the category pages of this site, the current sitemap doesn't use microdata.

Microdata would enable some robots to look at the page and understand the associations between the links.

The problem with the current sitemap page is the complexity of the code for lists nested inside lists inside a list.

Adding microdata is something I will eventually do, since the content is already in a <section> with a schema.org itemtype of CollectionPage, but it currently looks like it will be a little too complex.

XML Sitemaps

XML sitemaps for the sites (JohnCook.UK is SFW, whereas WatfordJC.UK also contains potentially NSFW content) are something I have been putting off creating because they mean more work, and I already have a sitemap on the site.

In an ideal world I could just take the current sitemap and transform it into an XML sitemap. That, however, cannot be done without scripting an automated way of doing it.

Because of the way I've coded the PHP content pages on the site, the only <a href> instances on the sitemap page are the links in the list—all the other hyperlinks are from included files.

Having spent a few hours working on a single line of code, I now have the following one-liners to pump out all the links in the sitemap.

The resulting output gives me the list of relative URLs contained in the sitemap, with each link on a single line.
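
For illustration (this is only an excerpt, not the full list), the output takes this form:

/
/about
/articles
/articles/personal/biography
/blogs
/gallery
/music
/status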

WatfordJC.UK Link List

The following line of code pipes the source code for the complete sitemap to lynx, which generates a list of links, which is then grepped for lines containing file, and then the start of each line is stripped removing the link number and file://.

cat complete-sitemap.php | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,'

JohnCook.UK Link List

The following line of code takes the source code for the complete sitemap, removes all the newline characters, removes all of the NSFW links, pipes it to lynx, which generates a list of links, which is then grepped for lines containing file, and then the start of each line is stripped, removing the link number and file://.

tr -d "\n\r" < complete-sitemap.php | perl -pe 's/<\?php if \(\$site_name != "John Cook UK"\) { \?\>[\s\S]*?} \?>//g' | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,'

XML Sitemap Standard

The XML sitemap standard is rather simple, and I plan on using the following elements inside each <url> tag (a minimal example entry follows the list):

  • <loc>—required, as it is the URL itself.
  • <lastmod>—optional, the last modified time. I already have every last modified time in a Redis database.
  • <priority>—optional, a priority (importance) ranging from 0.0 to 1.0 with a default value of 0.5.
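
For reference, a sitemap is a <urlset> of such <url> entries. A minimal example entry would look something like the following (the lastmod value here is just a placeholder):

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://web.johncook.uk/</loc>
		<lastmod>2015-01-01T00:00:00+00:00</lastmod>
		<priority>1.0</priority>
	</url>
</urlset>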

I already have the URLs from the one-liners above, so that leaves priority and modification time.

Site structure makes priority rather simple. With the exception of /archives/*, everything can be prioritised based on path.

For example, / will have a priority of 1.0, and all other links containing only one / a priority of 0.8. Everything that matches /archives/johncook-[0-9]{4} will have a priority of 0.9, and everything else that matches /archives/ (as well as /gallery/) a priority of 0.5. Everything else with two slashes will have a priority of 0.7, and everything with three slashes 0.5.

In terms of testing logic, it will be similar to the following (sketched as a shell case statement after the list):

  • Contains 3 slashes? 0.5
  • Matches /archives/johncook-[0-9]{4}? 0.9
  • Matches /archives/ or /gallery/? 0.5
  • Contains 2 slashes? 0.7
  • Equals /? 1.0
  • Contains 1 slash? 0.8
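
In a POSIX shell that ordering maps naturally onto a case statement, where the first matching pattern wins. The following is a minimal sketch of the idea (the priority_for function name is purely illustrative):

priority_for() {
	case "$1" in
		/?*/?*/?*) echo "0.5" ;;	# contains 3 slashes
		/archives/johncook-[0-9][0-9][0-9][0-9]) echo "0.9" ;;
		/archives/?*|/gallery/?*) echo "0.5" ;;
		/?*/?*) echo "0.7" ;;	# contains 2 slashes
		/) echo "1.0" ;;
		/?*) echo "0.8" ;;	# contains 1 slash
	esac
}

Running priority_for /archives/johncook-2014, for example, would print 0.9.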

Script Pre-Planning

Getting the last modified time will be slightly complex, but I have already done it in PHP:

switch ($script_name) {
	case "/articles/personal/biography":
		break;
	case "/gallery":
	case "/music":
	case "/links":
	case "/about":
	case "/status":
		cache_headers(true,300,true,1800,false,"includes");
		break;
	case "/":
	case "/archives/johncook-2010":
	case "/archives/johncook-2011":
	case "/archives/johncook-2013":
	case "/archives/johncook-2014":
		cache_headers(true,120,true,180,false,"home");
		break;
	default:
		if (substr($script_name, 0, 9) == "/articles") {
			switch (substr_count($script_name,"/")) {
				case 3:
					cache_headers(true,60,true,120,false,"includes");
					break;
				case 2:
				case 1:
					cache_headers(true,120,true,1200,false,"articles");
					break;
			}
		}
		if (substr($script_name, 0, 6) == "/blogs") {
			switch (substr_count($script_name,"/")) {
				case 3:
					cache_headers(true,60,true,120,false,"includes");
					break;
				case 2:
				case 1:
					cache_headers(true,120,true,1200,false,"blogs");
					break;
			}
		}
		if (substr($script_name, 0, 9) == "/gallery") {
			switch (substr_count($script_name,"/")) {
				case 2:
					cache_headers(true,60,true,120,false,"includes");
					break;
			}
		}
		break;

}

That section of code basically determines which category of file modification times in the last modified database should be used as a "category" modification time.

The "includes" category, for example, means the file modification time will later be compared to the modification time of the most recently modified include file, and whichever is more recent is the real modification time.

The "articles" category on the other hand includes the modification time of all the include files as well as all the article files.

What I'll need to do is emulate the above PHP code to get the category modification time, convert ^/$ to /index.php (and add .php to all the other links) to look up the file modification time, and then take the higher of those two timestamps and convert it to an ISO 8601 formatted string.
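
As a rough sketch of that last step, reusing the johncook.uk:file:<path> Redis hashes that appear in the scripts later in this article, and assuming $CATEGORYMOD already holds the category modification time (the URL and variable names here are only illustrative):

URL="/articles"
FILELOCAL=`echo "$URL" | sed 's,^/$,/index,; s,$,.php,; s,^,.,'`	# "/" becomes ./index.php, "/articles" becomes ./articles.php
FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
# Use whichever of the file and category modification times is more recent.
if [ `date -u -d "$CATEGORYMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
	FILEMOD="$CATEGORYMOD"
fi
LASTMOD=`date -u --iso-8601=seconds -d "$FILEMOD"`	# ISO 8601 string for <lastmod>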

This will be a lot of work, but the end result will be an XML sitemap that is based entirely on data already used on the site including the existing HTML sitemap.

Reducing Workload

The Redis database is updated every minute, and it would be pointless running the full script to generate the sitemap if no files have been modified since the last run.

What I need to do is determine the best way of caching the last modified time of all files.

Coding the Script

I think it will be easiest if I set sitemap.xml.php to have the same timestamp as the last modified file:

#!/bin/sh

LATEST=`redis-cli zrevrange johncook.uk:files 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`
SITEMAPLASTMOD=`redis-cli hget johncook.uk:file:./links/sitemap.xml.php modified`

if [ "$LATESTMODTIME" -eq "$SITEMAPLASTMOD" ]; then
	exit 0
elif [ "$LATESTMODTIME" -lt "$SITEMAPLASTMOD" ]; then
	exit 1
fi

Next, as I'm going to be using the same file for multiple domains, I am going to need to wrap the sitemap for each site in a block:

cat << EOF > sitemap.xml.php
<?php if (\$_SERVER['HTTP_HOST'] == "web.johncook.uk") { ?>
EOF

Now I need to grab the list of links:

tr -d "\n\r" < /home/www/var/www/johncook_co_uk/links/complete-sitemap.php | perl -pe 's/<\?php if \(\$site_name != "John Cook UK"\) { \?\>[\s\S]*?} \?>//g' | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,' > sitemap_links_web_johncook_uk.tmp

I finally need to set the modification time and move the file:

touch -d "$LATESTMODTIME" sitemap.xml.php
mv -f sitemap.xml.php /home/www/var/www/johncook_co_uk/links/

That is a rough idea of what I needed to do. After several hours of coding and debugging (and Googling for the correct syntax for things like non-simple case statements) I ended up with the final script that is now in production.

The Final Script

This is the final script, subject to refinements (such as removing duplication):

#!/bin/sh

LATEST=`redis-cli zrevrange johncook.uk:files 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`
LATESTMODTIMEUNIX=`date -u -d "$LATESTMODTIME" +%s`
SITEMAPLASTMOD=`redis-cli hget johncook.uk:file:./links/sitemap.xml.php modified`

if [ `date -u -d "$LATESTMODTIME" +%s` -eq `date -u -d "$SITEMAPLASTMOD" +%s` ]; then
	exit 0
elif [ `date -u -d "$LATESTMODTIME" +%s` -lt `date -u -d "$SITEMAPLASTMOD" +%s` ]; then
	exit 1
fi

INCLUDESLASTMODFILE=`redis-cli zrevrange johncook.uk:includes 0 0`
INCLUDESLASTMOD=`redis-cli hget johncook.uk:file:"$INCLUDESLASTMODFILE" modified`
HOMELASTMODFILE=`redis-cli zrevrange johncook.uk:home 0 0`
HOMELASTMOD=`redis-cli hget johncook.uk:file:"$HOMELASTMODFILE" modified`
ARTICLESLASTMODFILE=`redis-cli zrevrange johncook.uk:articles 0 0`
ARTICLESLASTMOD=`redis-cli hget johncook.uk:file:"$ARTICLESLASTMODFILE" modified`
BLOGSLASTMODFILE=`redis-cli zrevrange johncook.uk:blogs 0 0`
BLOGSLASTMOD=`redis-cli hget johncook.uk:file:"$BLOGSLASTMODFILE" modified`

cat << EOF > sitemap.xml.php
<?php
\$if_modified = isset(\$_SERVER['HTTP_IF_MODIFIED_SINCE']) ? \$_SERVER['HTTP_IF_MODIFIED_SINCE'] : "nothing";
\$last_modified = gmdate("D, d M Y H:i:s","$LATESTMODTIMEUNIX")." GMT";
header("Cache-Control: Public, max-age=900, must-revalidate, s-maxage=600, proxy-revalidate");
header("X-Robots-Tag: noindex, nosnippet, noarchive, noodp");
if (\$if_modified == \$last_modified) {
	header("HTTP/1.1 304 Not Modified");
	header("Status: 304 Not Modified");
	exit();
}
header("Content-Type: application/xml; charset=utf-8");
header("Last-Modified: ".\$last_modified);
?>
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<?php if (\$_SERVER['HTTP_HOST'] == "web.johncook.uk") { ?>
EOF

tr -d "\n\r" < /home/www/var/www/johncook_co_uk/links/complete-sitemap.php | perl -pe 's/<\?php if \(\$site_name != "John Cook UK"\) { \?\>[\s\S]*?} \?>//g' | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,' > sitemap_links_web_johncook_uk.tmp

# 's,^/$,/index,; s,$,.php,; s,^,.,'

while read line; do
	FILE="$line"
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	PRIORITY="0.0"

	case "$FILE" in
/)
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="1.0"
;;
/archives/johncook-[0-9][0-9][0-9][0-9])
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="0.9"
;;
/gallery|/music|/links|/about|/status|/downloads)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/articles/?*/?*|/blogs/?*/?*|/gallery/?*|/archives/?*|/links/?*|/?*/?*/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.5"
;;
/articles/?*)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.7"
;;
/articles)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/blogs/?*)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.7"
;;
/blogs)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.8"
;;
*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.1"
;;
	esac

	FILEMODW3C=`date -u --iso-8601=seconds -d "$FILEMOD"`
	cat << EOF >> sitemap.xml.php
	<url>
		<loc>https://web.johncook.uk$FILE</loc>
		<lastmod>$FILEMODW3C</lastmod>
		<priority>$PRIORITY</priority>
	</url>
EOF
done < sitemap_links_web_johncook_uk.tmp

cat << EOF >> sitemap.xml.php
<?php } ?>
EOF

rm sitemap_links_web_johncook_uk.tmp

#=========================#

cat << EOF >> sitemap.xml.php
<?php if (\$_SERVER['HTTP_HOST'] == "web.watfordjc.uk") { ?>
EOF

cat /home/www/var/www/johncook_co_uk/links/complete-sitemap.php | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,' > sitemap_links_web_watfordjc_uk.tmp

while read line; do
	FILE="$line"
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	PRIORITY="0.0"

	case "$FILE" in
/)
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="1.0"
;;
/archives/johncook-[0-9][0-9][0-9][0-9])
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="0.9"
;;
/gallery|/music|/links|/about|/status|/downloads)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/articles/?*/?*|/blogs/?*/?*|/gallery/?*|/archives/?*|/links/?*|/?*/?*/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.5"
;;
/articles/?*)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.7"
;;
/articles)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/blogs/?*)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.7"
;;
/blogs)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.8"
;;
*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.1"
;;
	esac

	FILEMODW3C=`date -u --iso-8601=seconds -d "$FILEMOD"`
	cat << EOF >> sitemap.xml.php
	<url>
		<loc>https://web.watfordjc.uk$FILE</loc>
		<lastmod>$FILEMODW3C</lastmod>
		<priority>$PRIORITY</priority>
	</url>
EOF
done < sitemap_links_web_watfordjc_uk.tmp

cat << EOF >> sitemap.xml.php
<?php } ?>
EOF

rm sitemap_links_web_watfordjc_uk.tmp

#=======================#

echo -n "</urlset>" >> sitemap.xml.php

touch -d "$LATESTMODTIME" sitemap.xml.php
mv -f sitemap.xml.php /home/www/var/www/johncook_co_uk/links/

Bug Fixing

As with most code, there are bound to be bugs.

With the above code there was an obvious bug: the sitemap for web.watfordjc.uk included URLs for which web.johncook.uk is canonical.

I'll come back to that in a moment.

Default Case

The second bug is already mitigated in the above code; it relates to new site content being added to areas where the previous site structure didn't expect content to be.

As an example, I have started recreating content from an old MySQL database dump from my old blog site. There is now level 3 content at /music/?*/?* (already covered by the case /?*/?*/?*) and level 2 categories at /music/?* (not covered in the above script).

Rather than generating the sitemap in a way where content must be where it is expected to be, I added a default fallthrough that adds the URL but makes its priority 0.1.

When checking the sitemap after /music/recordings was created, I noticed that the posts inside that category were correctly set as priority 0.5, but the category itself was set to priority 0.1. As a temporary fix I duplicated the default case and replaced * in the first copy with /music/?*.

I say it is temporary because /music includes /music/recordings and I don't yet have a /music category in Redis.

To add /music, I need to modify not only the Redis updating script, but also the sitemap generation script and the PHP last modified time code. I'll do that at a later time.
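
For reference, the sitemap-script side of that change would presumably mirror the existing categories. The fragment below assumes a johncook.uk:music sorted set that the Redis updating script does not yet maintain, and the priority value is only a guess:

MUSICLASTMODFILE=`redis-cli zrevrange johncook.uk:music 0 0`
MUSICLASTMOD=`redis-cli hget johncook.uk:file:"$MUSICLASTMODFILE" modified`

…

/music/?*)
	if [ `date -u -d "$MUSICLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$MUSICLASTMOD"
	fi
	PRIORITY="0.7"	# assumed value, not decided yet
;;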

Non-Canonical URLs in Sitemap

With the current design there are two possible canonical URLs for a page:

  1. https://web.johncook.uk/…
  2. https://web.watfordjc.uk/…

If a page is Suitable For Work (SFW) then its canonical URL will be under web.johncook.uk; if it is Not Suitable For Work (NSFW) it will be under web.watfordjc.uk.

If, however, $NSFW = "NULL" is set in the file, then the pages are possibly not the same, so the canonical URL will be web.johncook.uk or web.watfordjc.uk depending on which domain the client is viewing.

Such pages are those where the content may be different between sites, such as the home and category pages.

The /blogs/rants category, for example, started off as NSFW because it only contained a link to an NSFW post (strong language). It later became NULL when a link to an SFW post was added. If a page is categorised as NSFW, a 404 is returned on web.johncook.uk, making it unlikely that something on johncook.uk would have a canonical URL on watfordjc.uk.

Anyway, I decided the simplest way to remove non-canonical URLs from the web.watfordjc.uk sitemap was to grep the actual files for the NSFW line.

Because the line number might change between files, and the thing I'm grepping for might also be included in a code block, I used head before egrep, setting a variable to what is returned, and then testing the variable against an empty string.

	IGNOREFILE=`head -n37 "/home/www/var/www/johncook_co_uk/$FILELOCAL" | egrep "\$NSFW = \"(NULL|NSFW)\""`
	if [ "$IGNOREFILE" != "" ]; then

		case "$FILE" in

…

EOF

	fi

done < sitemap_links_web_watfordjc_uk.tmp

Obviously that code is only for the web.watfordjc.uk part of the sitemap generator script, and it simply skips over the whole case block (and <url></url> echo block)—in effect it "skips" the file and moves on to the next file in the list.

bash/dash/sh

Ever since the whole Shellshock flaw was discovered, I have tended to use #!/bin/sh (/bin/sh is a symlink to /bin/dash in Debian and Ubuntu) wherever possible when scripting.

While I still include 'bash' when Googling, there is always the chance my code won't function as expected because of the differences between bash and dash. I'm most likely to hit an issue when using test, although it is usually possible to rewrite the test so it works with dash.
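
A typical example is string comparison: bash's test builtin accepts ==, whereas dash only accepts the POSIX = operator, so the fix is usually a one-character change (the HTTP_HOST variable below is just for illustration):

# Works in bash, but dash's test complains about an unexpected operator:
#   [ "$HTTP_HOST" == "web.johncook.uk" ] && echo "SFW site"
# POSIX form that works in both bash and dash:
[ "$HTTP_HOST" = "web.johncook.uk" ] && echo "SFW site"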