Explanation for Downtime on 2011-06-25

On 2011-06-25, my VPS experienced several issues. This post details the events.

  1. At 14:59:28 UTC I rebooted the server to ensure the new boot scripts were working.
  2. At 14:59:42 UTC the server started the boot sequence.
  3. During the boot sequence the following failed to start:
    • DomainKeys (signing and verifying)
    • DKIM (signing and verifying)
    • SMTP and ESMTPS (sending and receiving, IPv4 and IPv6)
    • IMAPS (IPv4 and IPv6 connections)
    • HTTP and HTTPS (IPv4 and IPv6)
    • SSH (IPv6)
    • Authoritative DNS (IPv4 TCP and IPv6 TCP/UDP)

    This was caused by an upstart script that failed to add the IPv6 addresses to the interfaces at the correct point in the startup sequence.

    Over the next several hours I worked on these issues, and decided that fixing the boot scripts was urgent to prevent a recurrence.

  4. At 14:59:46 UTC the IPv6 address-adding script ran, exiting with an error.
  5. At 15:44:50 UTC I partly fixed two scripts, namely the one that adds the IPv6 addresses and the one that restores the ip6tables firewall policies.
  6. At 16:07 UTC I checked a temporary fix. The following started fine:
    • SMTP and ESMTPS (sending and receiving, IPv4 and IPv6)
    • IMAPS (IPv4 and IPv6)

    Neither these nor the partly fixed scripts worked on reboot, however.

  7. I rebooted the server again at 14:55:05 UTC to check the startup scripts, and subsequently rebooted a couple of dozen more times over several hours.
  8. At 22:05:36 UTC the DKIM, DomainKeys, and DNS issues were resolved; I had found that they were caused by a completely different issue.
  9. At 22:37 UTC I issued a NOTICE: tweet stating that the server was experiencing a few issues and that services would only be working intermittently "over the next 3.5 hours." I did this because I knew I would need to reboot after each test of the changed scripts, and still needed to sort out the following:
    • Getting the IPv6 IP-adding script to fire at the right point during startup (fixed at 23:09:35 UTC).
    • Saving and restoring the IPv6 firewall rules on shutdown/reboot (restoring fixed, saving not fixed).
    • Changing other upstart scripts (such as for ssh) so that they started after the IPv6 IPs had been added.
  10. At 00:43 UTC on 2011-06-26, I was certain everything had been resolved.
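
As an illustration only, the kind of post-reboot check described above (confirming that each service answers over both IPv4 and IPv6) could be scripted along these lines. This is a minimal Python sketch, not what was run on the server: the service list uses the standard port assignments rather than the server's real configuration, and the names SERVICES, port_open and check_services are hypothetical. It only tests whether a TCP connection can be opened, so it says nothing about DKIM/DomainKeys or DNS over UDP.

```python
import socket

# Hypothetical service list: standard port assignments, not taken from
# the server's actual configuration.
SERVICES = {
    "SMTP": 25,
    "HTTP": 80,
    "HTTPS": 443,
    "IMAPS": 993,
    "SSH": 22,
    "DNS (TCP)": 53,
}

FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def port_open(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the host has no address of this family
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # refused, timed out or unreachable: try the next address
    return False


def check_services(host, services=SERVICES, timeout=3.0):
    """Map (service name, family label) -> bool for every service over IPv4 and IPv6."""
    results = {}
    for name, port in services.items():
        for label, family in FAMILIES.items():
            results[(name, label)] = port_open(host, port, family, timeout)
    return results
```

Calling check_services with the server's hostname after each reboot returns a dictionary such as {("SSH", "IPv6"): False, ...}, which makes it easy to spot a service that came up on IPv4 only.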

The following services were down for most of this period:

  • DomainKeys (signing and verifying)
  • DKIM (signing and verifying)
  • SMTP and ESMTPS (sending and receiving, IPv4 and IPv6)
  • SSH (IPv6)
  • Authoritative DNS (IPv4 TCP and IPv6 TCP/UDP)
  • HTTP and HTTPS (IPv6)

The following services were out intermittently:

  • IMAPS (IPv4 and IPv6 connections)
  • HTTP and HTTPS (IPv4)
  • IRC Bouncer
  • Torrent Tracker

Incoming mail should not have been bounced during this period, as the SMTP server was either down or responding with a temporary error. Well, unless a sender's mail server only tries to deliver for 5 hours before giving up.
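
The reasoning here can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of any particular mail server: the function name sender_action and its parameters are hypothetical, and only the 2yz, 4yz and 5yz reply classes are considered. It relies on the general SMTP convention that a failed connection or a 4yz reply is temporary and retried, while a 5yz reply is permanent. As far as I know, RFC 5321 suggests that senders keep retrying for at least 4-5 days, which is why a 5-hour give-up time would be unusual.

```python
from datetime import timedelta


def sender_action(reply_code, queued_for, give_up_after):
    """Decide what a sending mail server does after one delivery attempt.

    reply_code    -- the receiver's SMTP reply code, or None if the
                     connection could not be made at all
    queued_for    -- how long the message has been in the sender's queue
    give_up_after -- the sender's own retry limit (an assumption here)
    """
    if reply_code is not None and 200 <= reply_code < 300:
        return "delivered"
    if reply_code is not None and 500 <= reply_code < 600:
        return "bounce"  # permanent failure: the sender does not retry
    # Connection failure or 4yz temporary error: the message stays queued
    # until the sender's give-up time has passed.
    if queued_for >= give_up_after:
        return "bounce"
    return "retry later"


# Server down for 6 hours, sender retries for 5 days: mail is kept.
print(sender_action(None, timedelta(hours=6), timedelta(days=5)))
# Temporary error after 6 hours, sender gives up after 5 hours: bounced.
print(sender_action(451, timedelta(hours=6), timedelta(hours=5)))
```

The two calls print "retry later" and "bounce" respectively; the 6-hour figure is only an example value, not a measurement from this outage.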

DNS should still have worked during this period. Apart from delays, only those making queries over TCP (all long queries) and those unable to resolve over IPv4 (anyone?) would have experienced issues with the backup DNS server.

Apologies for any inconvenience caused.