Debian OpenSSL Vulnerability Still Pains Two Years On

In April 2006, someone filed a bug report with Debian complaining that OpenSSL (an open source SSL/TLS implementation) read data from an uninitialised buffer. This was causing Valgrind (a brilliant debugging tool) to report memory usage warnings anywhere the OpenSSL random number generator was used. This was actually the behaviour intended by the OpenSSL developers, as the uninitialised buffer was being fed into an entropy pool used by the random number generator.

A fix was proposed that reset the uninitialised buffer to zero before first use. This would have had minimal security implications as OpenSSL used multiple reliable entropy sources in addition to the buffer. However, the initial patch to do this didn't stop all the Valgrind warnings. Somehow the proposed fix mutated into one that caused almost no entropy to be added to the pool except for the process ID, rendering the random number generator almost entirely useless. This in turn led OpenSSL to create extremely vulnerable SSL certificates, SSH keys and a bunch of other things. The broken OpenSSL version made it into Debian in September 2006. It later propagated into Ubuntu.
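To get a feel for how little entropy the process ID alone provides, here is a minimal Python sketch. It is not OpenSSL's algorithm: toy_keygen is a hypothetical stand-in that derives a value from the PID and nothing else, and PID_MAX assumes the common Linux default pid_max of 32768. In the real bug the output also varied with things like architecture and key type, but the search space for each combination was similarly tiny.

```python
import hashlib

# Assumption: the common Linux default pid_max, so PIDs run from 1 to 32767.
PID_MAX = 32768


def toy_keygen(pid: int) -> str:
    """Hypothetical stand-in for key generation whose only entropy is the PID."""
    return hashlib.sha256(f"toy-seed:{pid}".encode()).hexdigest()


def recover_pid(public_value: str):
    """Try every possible PID until the generated value matches."""
    for pid in range(1, PID_MAX):
        if toy_keygen(pid) == public_value:
            return pid
    return None  # value was not produced by toy_keygen


victim = toy_keygen(4242)
print(recover_pid(victim))  # 4242
```

An attacker in this toy model never needs more than 32,767 guesses, which is why precomputed lists of every possible key were practical.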

Incredibly, no one noticed the vulnerability until 2008. By then, a massive number of packages and users had been affected. Cue scrambling to fix the bug in Debian and Ubuntu, creation of blacklists and scanners for vulnerable OpenSSL and OpenSSH keys, arguments about who was to blame and how open source practices could have failed so badly. Also, cartoons.

Yes, I did feel the need to reiterate the entire bug's history. Reading the bug report is like watching a slow-motion car crash. Even the "fix" for the initial bug was applied incorrectly, which meant it first appeared in Debian version 0.9.8c-1 of OpenSSL (released 17 September) instead of 0.9.8b-1 (released 4 May). I suppose that was a good thing.

Even now, over two years after the bug was discovered and almost four years after it was originally introduced, the damage it caused is still being uncovered. The Electronic Frontier Foundation, which is launching a project to survey all publicly used SSL certificates, has had to delay the launch because it is still disclosing vulnerabilities to websites found to be using weak private keys generated by the broken Debian/Ubuntu OpenSSL versions.

The EFF SSL Observatory page will be updated at some point, but it currently reads:

This project is not fully launched yet, because we are currently engaging in vulnerability disclosure for around 28,000 websites that we observed to be using extremely weak private keys generated by a buggy version of OpenSSL.

Further information can be found in the SSL Observatory DEFCON 18 slides. Of the 28K vulnerable certificates seen, the 530 that validate are the most interesting. The rest were either invalid or among the 12K issued by private certificate authorities. Only 73 of the 530 valid certificates had been revoked. In particular, none of the 140 valid certificates issued by Equifax had been revoked, and only 4 of the 125 issued by Cybertrust.

In conclusion, this all seems rather depressing. A bug in a patch to an important open source security library went unnoticed for two years, despite reducing the effective security of the keys it generated to almost nothing (15 bits). Furthermore, a bunch of private "trusted" companies still haven't taken measures to ensure that SSL certificates they generated using this library have been marked invalid, another two years after the bug was found.

On a positive note, I'm sure Debian has learnt from its mistakes by now, but it would still be nice to find some policy document to show what has changed (I failed to find one). Rather unsettlingly it seems like this was the exact type of bug the Debian Security Audit Project was set up to spot.

Unfortunately, nothing so positive can be said about the state of Certificate Authorities. Until one is sued for not taking proper security precautions, their behaviour is unlikely to change. As for security, there's nothing to stop a CA from collaborating with a government or other entity that wants to eavesdrop on communications (as the EFF warns). Also, no amount of money spent on an SSL certificate from even the most trustworthy CA will protect against a rogue certificate created (or successfully forged) by a different CA that is also trusted by the web browser. In other words, SSL is nowhere near as trustworthy as you think.

Debian Packaging for TrueCrypt

I like to be able to build Debian packages from source. Unfortunately, the released TrueCrypt sources don't contain the Debian/Ubuntu packaging used to build their .deb files. As a result, I've created my own Debian packaging, available here.

As the TrueCrypt license seems to have issues that stop TrueCrypt being included in most distributions, the downloadable files only include the 'debian' folder. You'll still need to download the TrueCrypt sources from here. Advantages over the upstream packaging include an init script to kill TrueCrypt device mappings on shutdown and the generation of a reasonably well formatted man page from the output of 'truecrypt --help'.

DNS resolution delays in Debian (and Ubuntu)

Yesterday, I was pinging a server when I noticed that the output of ping seemed to be rather slow. In fact, I'd noticed it before but never really thought about it until I was pinging another server at the same time and saw the drastic difference in output speeds.

Pinging google.co.uk, there was a ping every second:

fpr@callisto:~$ ping google.co.uk | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:44:34.898 - PING google.co.uk (74.125.77.104) 56(84) bytes of data.
14:44:34.915 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=1 ttl=238 time=32.9 ms
14:44:35.883 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=2 ttl=238 time=35.9 ms
14:44:36.882 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=3 ttl=238 time=33.4 ms

On another server, it was closer to one every five seconds:

fpr@callisto:~$ ping server1.fsckvps.com | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:49:42.389 - PING server1.fsckvps.com (66.71.248.146) 56(84) bytes of data.
14:49:42.408 - 64 bytes from 66.71.248.146: icmp_seq=1 ttl=45 time=122 ms
14:49:47.500 - 64 bytes from 66.71.248.146: icmp_seq=2 ttl=45 time=123 ms
14:49:52.625 - 64 bytes from 66.71.248.146: icmp_seq=3 ttl=45 time=122 ms

This didn't make much sense since the round-trip times were small by comparison and ping sends one request per second by default. A Google search indicated that the problem might lie with my resolv.conf file. Unfortunately, mine seemed to be fine and my local DNS server was completely responsive. However, if I pinged the server by IP address instead of by hostname, the delay was gone.

To see what was blocking, I ran strace on ping.

15:04:26.789 - munmap(0x7f69ed319000, 129482)          = 0
15:04:26.790 - socket(PF_FILE, SOCK_STREAM, 0)         = 4
15:04:26.790 - fcntl(4, F_GETFD)                       = 0
15:04:26.790 - fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
15:04:26.790 - connect(4, {sa_family=AF_FILE, path="/var/run/avahi-daemon/socket"...}, 110) = 0
15:04:26.790 - fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
15:04:26.790 - fstat(4, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
15:04:26.790 - mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f69ed359000
15:04:26.791 - lseek(4, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
15:04:26.791 - write(4, "RESOLVE-ADDRESS 66.71.248.146\n"..., 30) = 30
15:04:31.789 - read(4, "-15 Timeout reached\n"..., 4096) = 20
15:04:31.789 - close(4)                                = 0

It was blocking each ping on a read from /var/run/avahi-daemon/socket, a socket used by Avahi, an implementation of the network auto-configuration and service discovery mechanism Zeroconf that augments DNS. Killing the Avahi daemon solved the problem, but I still wanted to work out why ping was talking to Avahi and why it only occurred on certain hosts, so I ran ltrace:

15:14:33.062 - gettimeofday(0x7fff90ec0550, NULL)               = 0
15:14:33.062 - gettimeofday(0x7fff90ec0520, NULL)               = 0
15:14:33.062 - memcpy(0x00608988, "\311\004\030J", 16)          = 0x00608988
15:14:33.062 - sendmsg(3, 0x6077c0, 2048, 2, 61091)             = 64
15:14:33.184 - recvmsg(3, 0x7fff90ec15f0, 0, 0, 61091)          = 84
15:14:38.193 - gethostbyaddr("BG\370\222T\315(\002", 4, 2)      = NULL
15:14:38.193 - inet_ntoa(0x92f84742)                            = "66.71.248.146"
15:14:38.194 - strcpy(0x006078e0, "66.71.248.146")              = 0x006078e0

I then wrote a little test in C just to check that I could replicate the delay with gethostbyaddr(). I could, and it was then that I finally realised that the delay occurred when pinging hosts that had no PTR record (reverse DNS). Slightly confusingly, ping performs a reverse DNS lookup on the IP address when provided with a hostname, but not when given an IP address.
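A rough Python analogue of that kind of test is sketched below. It is not the C program mentioned above; time_reverse_lookup is a hypothetical helper that wraps socket.gethostbyaddr() (which goes through the same system resolver) and reports how long the call took. The resolver parameter is only there so the function can be exercised with a fake resolver.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        hostname = None  # no PTR record, or the lookup failed
    return hostname, time.monotonic() - start
```

On an affected machine, calling time_reverse_lookup("66.71.248.146") would be expected to return None after roughly five seconds, while an address with a PTR record should come back almost immediately.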

gethostbyaddr() was calling Avahi because it had plugged itself into the glibc resolver using NSS. When an attempt to resolve an IP address to a hostname failed, glibc would then call Avahi to try to find it. For whatever reason, Avahi cannot answer this request instantly and times out after 5 long seconds. Avahi also resolves host names to IP addresses but the delay in looking up unresolvable host names only occurs if the domain is under the .local pseudo top-level domain.

The Debian package dependencies make it a bit difficult to remove Avahi so the easiest way to fix this is to remove references to mdns4 from /etc/nsswitch.conf. If you want to kill the daemon entirely then you can always disable its init script of course.
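As a sketch of what that edit amounts to, the Python below strips the mdns entries from a hosts line. The example line is an assumption about what a typical install with libnss-mdns contains, so check your own /etc/nsswitch.conf first; strip_mdns is a hypothetical helper, not part of any package.

```python
def strip_mdns(line: str) -> str:
    """Remove mdns* services (and their [action] items) from a hosts: line."""
    if not line.lstrip().startswith("hosts:"):
        return line  # leave every other database untouched
    prefix, _, rest = line.partition(":")
    kept = []
    dropping = False
    for token in rest.split():
        if token.startswith("["):
            # An action such as [NOTFOUND=return] belongs to the service before it.
            if not dropping:
                kept.append(token)
            continue
        dropping = token.startswith("mdns")
        if not dropping:
            kept.append(token)
    return f"{prefix}: {' '.join(kept)}"


# Assumed typical default; your file may differ.
example = "hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4"
print(strip_mdns(example))  # hosts: files dns
```

The action in square brackets has to go along with mdns4_minimal, otherwise it would attach itself to the files entry and stop lookups from ever reaching dns.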

Amusingly (or tragically) this bug is listed in Ubuntu and Debian bug reports which are both over two years old. At the time of writing, it still exists in Ubuntu Jaunty and Debian testing. What makes me really angry is that somewhere, someone decided that it would be a great idea to enable this daemon by default on desktop installs and, as a result, the performance of applications is being degraded. Obviously ping doesn't matter that much, but as mentioned in the bug reports, this hits people using ssh and IMAP as well, causing anything from mild delays to almost unusable systems. The worst part is that many people (and there could be a lot) suffering these issues are probably attributing them to a slow network, or packet loss, or ssh key verification, or just about anything else other than a dubiously designed daemon running on their own machine. The fact that it only occurs on certain hosts only reinforces this, and unless they suddenly realise that this delay is their local machine's fault and put a lot of effort into debugging, they'll probably never know.