Improving Cacheability - Part 2:
Vary by User-Agent

In the previous part I ended on the subject of varying by user-agent. Can I use AJAX & JSON?

$_SERVER['HTTP_USER_AGENT']

Running grep -R USER_AGENT /home/www/var/www/johncook_co_uk/inc/ revealed that footer.php, htmlheader.php, and breadcrumbs.php all contain code that requires varying by user-agent.

  1. htmlheader.php includes an if/else that checks if the client is a search engine. When Googlebot visits, for example, the server generates the page as if the client already has the needed fonts.

  2. Some further code is used by varnish to extend cookie expiry by 30 minutes if the fonts cookie is set. It is done by varnish so that a client always receives a cookie expiring 30 minutes from delivery, regardless of how long the page has been cached.

  3. A second code block sets the fonts cookie to expire in 30 minutes. This block should be conditional on whether varnish is between the server and the client.

  4. breadcrumbs.php contains an if block. If the client is not a search engine, a commented-out <aside> is included for a twitter share button. I could remove this.

  5. footer.php includes a lot of inline JavaScript code that is, technically, placeholder code for tweaking until "just right".

    One of those blocks: if the fonts cookie was not sent by the client, the PreLoadThem(0) function is called to load the fonts using JavaScript, and SetCookieExpiry() is called. The PreLoadThem() function itself is among several code blocks only sent to the client if no fonts cookie is received (i.e. the reason I need to vary by cookie).

  6. Alternatively, if a user-agent string is not sent (so the browser is not outdated, since varnish is configured to strip UA strings from current browsers), or the UA string does not contain "Crawler" (a simplified catch-all for bots), then the JavaScript functions FontsActivated(0) and SetCookieExpiry() are called.

  7. Another block adds JavaScript code, if the UA string doesn't contain "Crawler", that dims the text while the fonts are loading.

  8. Another block contains the FontsActivated function, which undims the text after loading the fonts, slides the top navigation bar down, and adds my twitter profile icon to any of my tweets quoted on the page.

  9. Another block adds the JavaScript that makes the twitter share button work (which currently does nothing, since the button's <aside> is commented out), and a final block adds an HTML comment stating the received User-Agent string.

Breaking Things Down

All of the things in this list can be broken down into two categories:

  1. Fonts cookie.
  2. Font loading.

The problem with removing vary by user-agent is that there will be no way to serve bots and spiders different content from browsers that haven't visited the site recently, since none of them have the fonts cookie.

I will take each of the items in the list at the beginning of this article and comment on them.

Item One

  1. htmlheader.php includes an if/else that checks if the client is a search engine. When Googlebot visits, for example, the server generates the page as if the client already has the needed fonts.

<?php
date_default_timezone_set('Etc/UTC');
$last_modified = '';
$javascript_styles = "yes";
if (isset($_SERVER['HTTP_USER_AGENT']) && strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") !== false) {
	$javascript_styles = "no";
} else if (@htmlspecialchars($_COOKIE["fonts"]) === "1" || @htmlspecialchars($_SERVER['HTTP_X_HAS_FONTS']) === "1") {
	$javascript_styles = "no";
	if (@htmlspecialchars($_SERVER['HTTP_X_HAS_FONTS'])) {
		header("X-Font-Expiry: " . (time() + 1800));
	}
}
?>

The if statement makes sure that crawlers are served the page as if they already have the fonts (i.e. <head> loading instead of JavaScript loading).

The reason for the X-Has-Fonts header is that extending the expiry of a cookie through PHP could result in a page cached in varnish being sent to a client with a cookie that has already expired.

What about Cookie max-age? To quote my varnish configuration, "# All IE browsers do not support cookie max-age". The X-Font-Expiry header is unused because of this, so I can remove that if block.

<?php
date_default_timezone_set('Etc/UTC');
$last_modified = '';
$javascript_styles = "yes";
if (isset($_SERVER['HTTP_USER_AGENT']) && strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") !== false) {
	$javascript_styles = "no";
} else if (@htmlspecialchars($_COOKIE["fonts"]) === "1" || @htmlspecialchars($_SERVER['HTTP_X_HAS_FONTS']) === "1") {
	$javascript_styles = "no";
}
?>

I am left with the inevitable question of whether to treat Googlebot differently from, or identically to, Chrome. If I treat Googlebot identically, Google's cached pages will not be consistent if Googlebot accepts and sends cookies.

Assuming the only proper way to remove vary by user-agent on the frontend is to remove tests for user-agent, the above code will become this:

<?php
date_default_timezone_set('Etc/UTC');
$last_modified = '';
$javascript_styles = "yes";
if (@htmlspecialchars($_COOKIE["fonts"]) === "1" || @htmlspecialchars($_SERVER['HTTP_X_HAS_FONTS']) === "1") {
	$javascript_styles = "no";
}
?>

Item Two

  2. Some further code is used by varnish to extend cookie expiry by 30 minutes if the fonts cookie is set. It is done by varnish so that a client always receives a cookie expiring 30 minutes from delivery, regardless of how long the page has been cached.

…
sub vcl_recv {
	…
	if (req.http.Cookie ~ "fonts=1") {
		unset req.http.X-Has-Fonts;
		set req.http.X-Has-Fonts = 1;
		unset req.http.Cookie;
	} else {
		unset req.http.Cookie;
	}
	…
}

sub vcl_hash {
	hash_data(req.url);
	if (req.http.X-Has-Fonts == "1") {
		hash_data("has-fonts");
	}
	if (req.http.Cookie ~ "fonts=1") {
		set req.http.Cookie = "fonts=1";
	}
	…
	return (lookup);
}

sub vcl_backend_response {
	…
	unset beresp.http.Set-Cookie;
	return (deliver);
}

Unlike the first item, this one is done in varnish.

Where is varnish setting the cookie? It isn't, and neither (effectively) is nginx: although the nginx backend does set a cookie when the UA isn't "Crawler", if varnish is running it strips that Set-Cookie header from the response.

Why? Because I seem to have, at some point, decided that using JavaScript/jQuery to set the cookie was the best way to go for cross-browser support.
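
To illustrate what that looks like (a minimal sketch only; the actual SetCookieExpiry() in footer.php may differ), setting the cookie client-side with an expires date sidesteps the max-age problem quoted above:

// Sketch only: (re)set the fonts cookie so it expires 30 minutes from now.
// Uses "expires" rather than "max-age" because old IE does not support max-age.
function SetCookieExpiry() {
  var expiry = new Date(new Date().getTime() + 1800 * 1000);
  document.cookie = "fonts=1; expires=" + expiry.toUTCString() + "; path=/";
}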

Item Three

  3. A second code block sets the fonts cookie to expire in 30 minutes. This block should be conditional on whether varnish is between the server and the client.

if (strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
	if (isset($cookie_domain)) {
		setcookie("fonts","1",time()+1800,'/',$cookie_domain,false,false);
	}
}

As discussed in item 2, this block sets a cookie for all user-agent strings that do not contain "Crawler". If varnish is in the path, it will strip the cookie from the response.

Given that a UA string is unlikely to be "Crawler" if varnish isn't running, surrounding the code with another if block is unnecessary.

This code should ensure most bots are not sent the fonts cookie.

However, after *

Item Four

  4. breadcrumbs.php contains an if block. If the client is not a search engine, a commented-out <aside> is included for a twitter share button. I could remove this.

		// If visitor is not a search engine, add share links:
		if (!isset($_SERVER['HTTP_USER_AGENT']) || strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
			$str .= <<<EOF
			<!--<aside class="social-links"><p><a class="twitter-share-button" href="https://twitter.com/share" rel="nofollow" data-dnt="true" data-text="$article_name"><span class="fi-social-twitter">Tweet this page.</span></a></p></aside>-->

EOF;
		}

There are a number of ways I can deal with this, but I do eventually want the possibility of adding share buttons back in.

One possibility is to create an empty aside, and later add the content to the aside dependent on user agent. How can that be done?

If I go down the AJAX route, I could create a file such as /api/social-links.php, and then vary that specific URI by User-Agent.

I can then include the content of that specific URL inside the aside, which in the case of search engines will be an empty response. First the modified breadcrumbs.php block, then the new /api/social-links.php file:

                // If visitor is not a search engine, add share links:
//              if (!isset($_SERVER['HTTP_USER_AGENT']) || strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
                $str .= <<<EOF
<aside class="social-links"></aside>
EOF;
//              }
<?php $file_is_included = (isset($script_name));?>
<?php include_once $_SERVER['DOCUMENT_ROOT'].'/inc/htmlheader.php';?>
<?php header("Vary: User-Agent, Accept-Encoding"); ?>
<?php
if (!$file_is_included) {
        cache_headers(true,10,true,30,true,"includes");
}
?>
<?php
$str = "";
// If visitor is not a search engine, add share links:
if (!isset($_SERVER['HTTP_USER_AGENT']) || strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
        $article_name = "";
        if (isset($_GET['twt_name'])) {
                $article_name = @htmlspecialchars($_GET['twt_name']);
                $str = <<<EOF
<p><a class="twitter-share-button" href="https://twitter.com/share" rel="nofollow" data-dnt="true" data-text="$article_name"><span class="fi-social-twitter">Tweet this page.</span></a></p><aside>

EOF;
        }
}
header("X-Robots-Tag: noindex, nofollow, noarchive", true);
echo $str;
?>

Item Nine

  9. Another block adds the JavaScript that makes the twitter share button work (which currently does nothing, since the button's <aside> is commented out), and a final block adds an HTML comment stating the received User-Agent string.


<?php
if (!isset($_SERVER['HTTP_USER_AGENT']) || strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
?>
<script type="text/javascript">
window.twttr=(function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],t=window.twttr||{};if(d.getElementById(id))return;js=d.createElement(s);js.id=id;js.src="https://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);t._e=[];t.ready=function(f){t._e.push(f);};return t;}(document,"script","twitter-wjs"));
</script>
<?php
}
?>
<?php
echo "<!--".$javascript_styles."-->";
if (isset($_SERVER['HTTP_USER_AGENT'])) {
        echo "<!--".htmlspecialchars($_SERVER['HTTP_USER_AGENT'])."-->";
}
?>

Since item 4 dealt with the twitter button's markup, item 9 is the next logical thing to change so that the share button actually works.

<?php
// if (!isset($_SERVER['HTTP_USER_AGENT']) || strpos(@htmlspecialchars($_SERVER['HTTP_USER_AGENT']),"Crawler") === false) {
?>
<script type="text/javascript">
function LoadSocial() {
  if (window.jQuery && window.Foundation) {
    $(".social-links").load("/api/social-links?twt_name=<?=rawurlencode($article_name);?>", function() {
    window.twttr=(function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],t=window.twttr||{};if(d.getElementById(id))return;js=d.createElement(s);js.id=id;js.src="https://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);t._e=[];t.ready=function(f){t._e.push(f);};return t;}(document,"script","twitter-wjs"));
    $(".social-links").fadeIn(500);
});
  } else {
    setTimeout(LoadSocial,1000);
  }
}
</script>
<?php
// }
?>
<?php
echo "<!--".$javascript_styles."-->";
?>
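
Nothing above actually calls LoadSocial(); on the live site it is presumably triggered elsewhere in footer.php, perhaps from FontsActivated(). As a hedged illustration only, the trigger could be as simple as:

<script type="text/javascript">
// Illustration only: start the LoadSocial() polling loop once the DOM is ready.
document.addEventListener("DOMContentLoaded", LoadSocial);
</script>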

Font Loading

As for the loading of fonts, I have now moved it into JavaScript.

If the fonts cookie is set, the stylesheet is added to the head tag, as before.

If the fonts cookie is not set, the JavaScript, which has now been moved to an external file, creates a Zurb alert box.

User input is now required to load the fonts. The "without fonts cookie" alert box has a button that, when pressed, loads the fonts using the font loader script.
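
As a rough sketch of that flow, assuming hypothetical element names and a placeholder stylesheet path (the external file on the live site will differ):

// Sketch only: the stylesheet path, element classes, IDs, and alert wording
// below are assumptions, not the code from the external file.
function InitFonts() {
  if (document.cookie.indexOf("fonts=1") !== -1) {
    // Fonts cookie present: add the font stylesheet to <head>, as before.
    var link = document.createElement("link");
    link.rel = "stylesheet";
    link.href = "/css/fonts.css"; // placeholder path
    document.getElementsByTagName("head")[0].appendChild(link);
  } else {
    // No fonts cookie: show a Zurb alert box asking before loading the fonts.
    var box = document.createElement("div");
    box.className = "alert-box info";
    box.innerHTML = 'Web fonts are not loaded. <a href="#" id="load-fonts">Load fonts</a>';
    document.body.appendChild(box);
    document.getElementById("load-fonts").addEventListener("click", function (e) {
      e.preventDefault();
      PreLoadThem(0);    // font loader, as described in item 5
      SetCookieExpiry(); // set the fonts cookie for 30 minutes
      box.parentNode.removeChild(box);
    });
  }
}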

Additional Changes

There are a few more things that will need doing down the line, but for now things are improved: everything that no longer needs to vary by cookie or user-agent (most pages on this site) is now easier to cache.

I have also made some of the URIs in /api/ use CORS without resorting to Access-Control-Allow-Origin: *, by whitelisting protocol+host combinations.
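
A minimal sketch of that kind of whitelisting, with placeholder origins rather than my actual list:

<?php
// Sketch only: allow CORS for whitelisted protocol+host origins (placeholders).
$allowed_origins = array(
	"https://johncook.co.uk",
	"https://www.johncook.co.uk",
);
if (isset($_SERVER['HTTP_ORIGIN']) && in_array($_SERVER['HTTP_ORIGIN'], $allowed_origins, true)) {
	header("Access-Control-Allow-Origin: " . $_SERVER['HTTP_ORIGIN']);
	header("Vary: Origin", false);
}
?>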

I am considering using a Content Security Policy (CSP) site-wide at some point in the future.

I will also be looking at content hashing (subresource integrity), although if I use it for files on my site I will have to modify how I minify/uglify my CSS/JS files so that cached pages, such as Google's cache, will be able to load an older version of the CSS/JS.
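
As a reference for when I get there, computing the integrity value for a minified file is straightforward; a sketch with a placeholder filename:

<?php
// Sketch only: generate a subresource integrity value for a (placeholder) minified file.
$file = $_SERVER['DOCUMENT_ROOT'] . '/js/site.min.js';
$integrity = 'sha384-' . base64_encode(hash_file('sha384', $file, true));
echo '<script src="/js/site.min.js" integrity="' . $integrity . '" crossorigin="anonymous"></script>';
?>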