HTML5, Microdata, and Rich Snippets

How exactly do I mark up my entire Web site for HTML5, microdata and rich snippets?

What are Microdata and Rich Snippets?

Microdata and Rich Snippets are terms used to describe methods of adding additional metadata to the source code of Web pages so that the content is more easily understood by machines such as Googlebot.

HTML5 and Microdata make working out the correct way of doing things a lot more complicated than it used to be.

How complicated is it? Take, for instance, the breadcrumbs at the top of this article (technically an HTML5 Article and schema.org BlogPosting). At present, schema.org breadcrumbs aren't the recommended way of adding microdata, with data-vocabulary.org being recommended for breadcrumbs.

The issue is that itemscopes from different vocabularies cannot be mixed on the same element. By using data-vocabulary.org's breadcrumbs microdata, I cannot use schema.org's articleSection itemprop on the same tag. Thus, for the time being, I am not using articleSection on my pages (if I were, this page would have articleSection values of Blogs and Website).
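To make the clash concrete, here is a minimal sketch of a single data-vocabulary.org breadcrumb. The URL and link text are placeholders, not markup copied from this site. The point it illustrates is that any itemprop inside the scoped element is read as a property of the Breadcrumb item, not of an enclosing schema.org item.

```html
<!-- Sketch only: the URL and text are placeholders -->
<span itemscope itemtype="http://data-vocabulary.org/Breadcrumb">
  <!-- url and title are Breadcrumb properties defined by data-vocabulary.org -->
  <a href="/blogs" itemprop="url">
    <span itemprop="title">Blogs</span>
  </a>
  <!-- An itemprop="articleSection" placed in here would be attributed to the
       Breadcrumb item, which has no such property. -->
</span>
```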

How Does HTML5 Complicate Things?

HTML5, as with previous versions of (X)HTML, brings in new elements and functionality. One example would be that of the <article> and <section> elements (elements are also commonly referred to as tags).

The problem with new elements is not knowing how to properly use them. Although nowhere near as difficult to understand as microdata, these two new tags did require a lot of Googling before I got my head around them.

Basically, <section> tags mark a section of a document, and <article> tags mark an article. Sounds simple? Well, it is until you get to more complex pages where you don't quite know how you are supposed to use these new tags.
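For the simple case, a minimal sketch might look like the following. The headings and text are placeholders, and this is only one common reading of how the two elements fit together: an <article> that makes sense on its own, divided into <section>s that each carry a heading.

```html
<!-- Sketch only: headings and text are placeholders -->
<article>
  <h1>An article that stands on its own</h1>
  <section>
    <h2>First part of the article</h2>
    <p>...</p>
  </section>
  <section>
    <h2>Second part of the article</h2>
    <p>...</p>
  </section>
</article>
```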

An Example of HTML5, Microdata, and Breadcrumbs

The best illustration of this would be the home page of this site:

...
<body itemscope itemtype="http://schema.org/WebPage">
<section itemscope itemtype="http://schema.org/Article">
...
<header>
...
<h1 itemprop="name headline">John Cook UK</h1>
<div itemprop="description">
<p>Welcome to the new look, and new home, of my websites: JohnCook.co.uk &amp; WatfordJC.co.uk</p>
</div>
</header>
<ul role="menubar" aria-label="breadcrumbs">
<li role="menuitem">
<a href="/">Home</a>
</li>
</ul>
...
<div itemprop="articleBody">
<img ...>
<p>This is the home page...</p>
...
<footer>
<p>Posted by...</p>
</footer>
</div>
...
<article itemscope itemtype="http://schema.org/Article">
...
<header>
<h1 itemprop="name headline">Bringing Telecommunications In-House - Part 5...</h1>
<div itemprop="description">
<p>The fifth part in the series...</p>
</div>
</header>
<ul role="menubar" aria-label="breadcrumbs for item">
<li role="menuitem">
<a href="/">Home</a>
</li>
<li role="menuitem">
<a href="/articles">Articles</a>
</li>
<li role="menuitem">
<a href="/articles/computing">Computing</a>
</li>
<li role="menuitem">
<a href="/articles/computing/bringing-telecommunications-in-house-part-5" itemprop="url">Bringing Telecommunications In-House (Part 5)</a>
</li>
</ul>
...
...
<footer>
<p>Posted by...</p>
</footer>
</article>
<article itemscope itemtype="http://schema.org/Article">
...
<header>
<h1 itemprop="name headline">Switching From O2/Sky to Virgin Media...</h1>
<div itemprop="description">
<p>This article is going to detail...</p>
</div>
</header>
<ul role="menubar" aria-label="breadcrumbs for item">
<li role="menuitem">
<a href="/">Home</a>
</li>
<li role="menuitem">
<a href="/articles">Articles</a>
</li>
<li role="menuitem">
<a href="/articles/computing">Computing</a>
</li>
<li role="menuitem">
<a href="/articles/computing/switching-from-sky-broadband-to-virgin" itemprop="url">Switching From O2/Sky to Virgin Media</a>
</li>
</ul>
...
...
<footer>
<p>Posted by...</p>
</footer>
</article>
...
</section>

Every Web page on this site has an itemtype of http://schema.org/WebPage in the <body> tag. That much is simple, and means the entire body of the document is within that itemscope.

As the Home Page is not itself an article, but rather a collection of external-to-the-page articles (with an introductory section), the entire content is encompassed in a <section> element.

However, because the introductory section can be considered an article itself (it canonically belongs on the Home Page), the encompassing <section> has an itemtype of http://schema.org/Article.

This is where things get rather interesting. The home page of this site contains snippets of content that belong to other pages. Thus, each included article or blog post has an <article> tag with an itemtype of http://schema.org/Article or http://schema.org/BlogPosting.

Because each included article and blog post is its own thing, by giving it an itemscope I am, in essence, creating a new self-contained entity that is separate from the page as a whole.
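A stripped-down sketch of that idea, with placeholder headings and a placeholder URL: properties are attributed to the nearest enclosing itemscope, so the name inside the <article> describes the Article item and not the WebPage.

```html
<!-- Sketch only: headings and the URL are placeholders -->
<body itemscope itemtype="http://schema.org/WebPage">
  <!-- This name belongs to the WebPage item -->
  <h1 itemprop="name">Example home page</h1>

  <article itemscope itemtype="http://schema.org/Article">
    <!-- This name and url belong to the Article item, not the WebPage -->
    <h2 itemprop="name">Example included article</h2>
    <a href="/articles/example" itemprop="url">Read the article</a>
  </article>
</body>
```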

And this is where things get so complicated I myself am confused, and the reason for this article. First, let's cover the ARIA stuff and then come back to my confusion.

ARIA

An aria-label gives an element a text label that a screen reader announces for it, which is useful when the visible content alone does not say what the element is for.

I haven't yet tested, but I believe a screen reader will say something along the lines of "Bringing telecommunications in-house (part 5)... The fifth part in the series... breadcrumbs for item menu, menu item 'home', menu item 'articles', menu item 'computing', menu item 'bringing telecommunications in-house (part 5)'".

Accessibility is another one of those things I still have to investigate. The breadcrumbs for the page itself should, technically, be within <nav></nav> tags, and I believe it is OK to use unordered lists for breadcrumbs that are not for the current page (i.e. snippets of included articles and blog posts). Breadcrumbs are supposed to be for the current page, but I am also using them for navigation to linked content (as opposed to using a "Read More" link as I previously did).
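As a rough sketch of what breadcrumbs for the current page could look like inside <nav>, here is one widely used pattern. It is not what this site currently outputs (the examples above use role="menubar"), and the ordered list, the label text, and aria-current are assumptions about a reasonable structure rather than tested markup.

```html
<!-- Sketch only: an alternative structure, not this site's current markup -->
<nav aria-label="Breadcrumb">
  <ol>
    <li><a href="/">Home</a></li>
    <li><a href="/articles">Articles</a></li>
    <!-- aria-current marks the link that represents the current page -->
    <li><a href="/articles/computing" aria-current="page">Computing</a></li>
  </ol>
</nav>
```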

Accessibility is still something that needs some work.

Confused About Item Scopes

My confusion about itemscopes is that I just don't know how they work. What I really need is an easy to understand explanation of what they are, what they do, and what they are intended for.

In order to get my breadcrumbs working correctly, I used Mindly (a mind mapping iOS app) to create a logic map. Although I don't consider myself a visual person, I was attempting to get 4 tests to create the correct breadcrumbs for the entire site, but a logic map showed there are 8 different types of breadcrumbs needed.

As soon as I could see the different states I needed to test for, I copied the mind map to a new one and extended it by adding sample code to each branch so I could logically go through the different states and see where I could merge logic tests.

I ended up with a number of branches, and each branch had a choice. The Home Page, for example, could be reached by following these choices: Is Not Included, Is Home Page. This page would be reached by following these choices: Is Not Included, Is Not Home Page, Is Not Section, Is Not Tag, Is BlogPosting.

The reason I bring this up is that I need to comprehend how schemas work to be able to think in a way that allows me to do what I want to do. At the moment, I don't even know what I need to Google for.

Structured Data

Microdata and Rich Snippets are encompassed within the term Structured Data. The following video introduces Structured Data:

Video: Google Webmaster Tools Structured Data Part 1

At 9:55 it is explained that the scope (itemscope) of this entire page is (itemtype) http://schema.org/WebPage "and within it we can have many different properties".

Part two of that series of videos explains that a Thing is an item, and that an itemtype is your scope. To determine an itemtype, ask "What am I looking at?"

Thus, on the Home Page, the included articles have an itemtype of Article, and the included blog posts have an itemtype of BlogPosting. The second video also explains that available properties are inherited, thus a BlogPosting (being found in schema.org at Thing > CreativeWork > Article > BlogPosting) has available to it all the itemprops found in Article.
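A small sketch of that inheritance in practice, using placeholder text and a placeholder URL: a BlogPosting item drawing on properties defined at each level above it.

```html
<!-- Sketch only: text and URL are placeholders -->
<article itemscope itemtype="http://schema.org/BlogPosting">
  <!-- name comes from Thing; headline comes from CreativeWork -->
  <h1 itemprop="name headline">Example blog post</h1>
  <!-- url comes from Thing -->
  <a href="/blogs/example-post" itemprop="url">Permalink</a>
  <!-- articleBody comes from Article -->
  <div itemprop="articleBody">
    <p>...</p>
  </div>
</article>
```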

Likewise, a WebPage is a Thing > CreativeWork > WebPage. Inherited from "Thing" is itemprop="url", and the item property url is the "URL of the item." As item=Thing, that means itemprop="url" can be described as "the URL at which this Thing can be found".

As a URL is supposed to be unique, it can be deduced that each Thing can only have one URL. For a WebPage, that would be the canonical URL. For a Breadcrumb (again, I'm not currently using http://schema.org/Breadcrumb) it would be the Breadcrumb URL (e.g. /articles). For an article snippet, it would be the URL for that http://schema.org/Article.

At this point, I have a basic understanding of how things will work. When I include an item such as a link, I can add itemprop="url" to the hyperlink reference (<a href>) provided two conditions hold: the link, or one of its ancestor elements, has an itemscope, and no url itemprop has yet been defined for that item/Thing.

Scoping Example for /articles

As an example, I am going to take the first section of the /articles page.

This page has a number of links, each of which points to an article. The page has a number of <section>s, each of which is a "tag" that I have classed an article as (i.e. a tag in this context being a collection of similarly themed articles).

Now, previously I was thinking "what is the schema.org equivalent to a collection of links?" I now know that the "Thing", which is a link on that page, is in fact an http://schema.org/Article.

Whereas I was previously trying to think top-down, I am now thinking bottom-up. If a Thing is a self-contained item, I need to start at the smallest element and work upwards. As a hyperlink is obviously an itemprop="url", that means it is also an itemscope itemtype="http://schema.org/WebPage" unless (a) a more suitable scope applies, or (b) the hyperlink is not the entirety of the to-be-scoped item.

Thus, on my /articles page, a hyperlink will have an itemprop="url" but the <li> will have the itemscope for that self-contained item. As it has already been established that a list item detailing and linking to an article will have an itemtype of Article and an itemprop url (<a href>) referencing the article, the next question is to look at the <ul>.

What Am I Looking At?

Is the unordered list a self-contained item? In the majority of cases it is not. Let's look at an item: Bringing Telecommunications In-House - Recording Calls - Part 5.

The list item is a self-contained item: an Article.
The unordered list is a self-contained item: a collection of related articles.
The heading, paragraph, and unordered list together are a self-contained item: a collection of related articles.
The section is a self-contained item: a collection of articles in the same category (e.g. Computing).
Is there something above the section and below WebPage?

Scoping of <li>

We have determined that the list item is an Article, thus we will have the following code:

<li itemscope itemtype="http://schema.org/Article">
<a href="/articles/computing/bringing-telecommunications-in-house-recording-calls-part-5" itemprop="url">Bringing Telecommunications In-House - Recording Calls - Part 5</a>
</li>

We also know that it is the fifth part of a series of articles. However, because I have not yet created a CollectionPage for Bringing Telecommunications In-House, I do not have a URL I can use for the item property isPartOf.

The url item property for this item refers to the href, and we want to allow machines to know the name of the article. Thus we add a span and get the following code:

<li itemscope itemtype="http://schema.org/Article">
<a href="/articles/computing/bringing-telecommunications-in-house-recording-calls-part-5" itemprop="url"><span itemprop="name">Bringing Telecommunications In-House - Recording Calls - Part 5</span></a>
</li>

However, I am not going to do that. Why not? Because the name of the article may change. As /articles and /blogs are hand-written, unlike the home page which includes content from other pages, saying "this is an article, and this is its URL" should be sufficient for machines to understand it. If they want to know the name of the article without inferring from the hyperlink text, they can visit the URL.

Google, for instance, appears to make such an inference. In Google Webmaster Tools, if I drill down to Structured Data I can see that Google is not only referencing the hyperlink reference, but is also making it a clickable hyperlink using the same text used on that page without the need to use a name itemprop.

Scope of <section>

Each section is a site "tag" (e.g. Computing, VPS, et cetera). At present I do not have tag-level pages on the site, so I will not be doing anything with them at the moment.

For the time being I will incorporate just the one change: adding itemscope and itemtype to the list items, and adding itemprop to the anchor hypertext references. For /articles that will be itemtypes of http://schema.org/Article, and for /blogs that will be itemtypes of http://schema.org/BlogPosting.
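For /blogs, the same change would look roughly like this. The URL and link text are placeholders, since the real list items are hand-written.

```html
<!-- Sketch only: the URL and link text are placeholders -->
<ul>
  <li itemscope itemtype="http://schema.org/BlogPosting">
    <a href="/blogs/example-post" itemprop="url">Example blog post</a>
  </li>
</ul>
```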

CollectionPage Schema

As CollectionPage is a type of WebPage, and inherits everything from WebPage, but is not a type of Article, I am wondering how best to handle things when a page is better suited for a CollectionPage rather than a WebPage.

As previously mentioned, all pages are regarded as a WebPage. As ImageGallery is also a type of WebPage, I shall have a look at how I handle things for /gallery.

<section itemscope itemtype="http://schema.org/CollectionPage">

It looks like I have surrounded the content of /gallery in a <section> of type CollectionPage, presumably because it will be a collection of ImageGallery items.

For /gallery/3d-gardening-photos (an ImageGallery) I have done the following:

<article itemscope itemtype="http://schema.org/ImageGallery">

To drill down further on that page, each thumbnail (and hyperlink) is part of a figure, which also has a figcaption, like this:

<figure itemscope itemtype="http://schema.org/ImageObject" class="large-6 medium-6 columns">
<a itemprop="contentUrl" class="th" role="button" aria-label="Thumbnail" href="/gallery/3d-gardening-photos/hni_0020_mpo.jpg"><img itemprop="thumbnailUrl" aria-hidden="true" src="/gallery/3d-gardening-photos/hni_0020_mpo_t2.jpg" alt="Parsnip seedlings starting to sprout."></a>
<figcaption itemprop="caption">Parsnip seedlings in toilet roll tubes starting to sprout.</figcaption>
</figure>

The whole reason I am having so much trouble getting my head around doing things the correct way is probably because I am trying to think of things simply when my use case is more complex than that.

A link is a URL. The url is a property of an Article/BlogPosting. That much is known. What is a CollectionPage a collection of? Can a <div> be scoped as a WebPage?

A web page. Every web page is implicitly assumed to be declared to be of type WebPage, so the various properties about that webpage, such as breadcrumb may be used. We recommend explicit declaration if these properties are specified, but if they are found outside of an itemscope, they will be assumed to be about the page.
schema.org, WebPage

I am going to make an assumption that it is perfectly fine to have a WebPage within a WebPage (that is, after all, what some iframes are). It can also be assumed that unless a new scope has been defined for a child element, then the itemprop belongs to whatever parent itemtype has such a property.

For example, a url item property belongs to whatever Thing was defined in the closest parent/grandparent/etc. element because all Things can have a url property. The contentUrl property in the above example, on the other hand, will belong to ImageObject (a property inherited from MediaObject), and if no such itemtype is defined on the element or a parent element it will result in an error, as ImageGallery/WebPage/etc. cannot have a contentUrl property.

Based on these assumptions, my /articles page shall look like the following:

body itemtype=WebPage
- section itemtype=CollectionPage
  - section itemtype=CollectionPage
    - list item itemtype=Article

The first-level section encompasses the entire page, whereas the second-level sections shall encompass each "tag" (e.g. Computing) section.
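Written out as markup, that outline might look something like the sketch below. The headings and description are placeholders. As written, the nested items are declared but not yet connected to one another by any property.

```html
<!-- Sketch only: headings and description are placeholders -->
<body itemscope itemtype="http://schema.org/WebPage">
  <!-- First-level section: the whole /articles page -->
  <section itemscope itemtype="http://schema.org/CollectionPage">
    <h1>Articles</h1>
    <!-- Second-level section: one "tag", e.g. Computing -->
    <section itemscope itemtype="http://schema.org/CollectionPage">
      <h2>Computing</h2>
      <p>...</p>
      <ul>
        <li itemscope itemtype="http://schema.org/Article">
          <a href="/articles/computing/bringing-telecommunications-in-house-recording-calls-part-5" itemprop="url">Bringing Telecommunications In-House - Recording Calls - Part 5</a>
        </li>
      </ul>
    </section>
  </section>
</body>
```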

If a section were to have a longer amount of text, then it might become an <article> instead. Depending on whether or not the content were meatier than just a collection of links, it may also switch to an itemtype of Article, BlogPosting or similar. Were the list of links to be reformatted as paragraphs rather than a list, the second-level section would have similar modifications made.

The reason for the second-level CollectionPage is that microdata is about Things. As /articles/computing will become a page itself at a later time then it is obviously a self-contained thing that although belonging to /articles can be looked at as an individual item. Considering it has a heading and a description that is now rather obvious.

I think the most important thing to understand from this is What Am I Looking At? is both a simple and complex question, and you need to comprehend how microdata works to be able to say "That's What I'm Looking At!"

isPartOf CollectionPage

It now seems so simple. Each list item is to have an itemprop isPartOf making it part of the parent CollectionPage. Each section is to have an itemprop of isPartOf making it a part of the parent (page encompassing) CollectionPage.

After doing that, we can see that everything no longer looks disparate in Google Rich Snippets but rather looks interconnected. The exceptions are the breadcrumbs and the rdfa-node, but that is to be expected because they are not part of the schema.org schemas.

I have added an itemprop of name to the <h2> tags so that the CollectionPages are named, and I think I am satisfied with how things are currently looking.