Microsoft doesn't understand DNS

A few days ago, Microsoft launched a DoS attack against millions of users of the dynamic DNS service NO-IP. Microsoft's aim was to disrupt DNS entries used by the creators of malware, but the action had the side effect of rendering the service effectively useless for anyone using DDNS to resolve servers, home automation systems, remote webcams, etc.

Using a court order, Microsoft were able to replace the nameserver records for 22 domain names used by NO-IP and pointed them to their own nameservers. Microsoft's intent was presumably to answer legitimate DNS queries by forwarding them to the original NO-IP nameservers and block those from malicious entities, rather than completely crippling the service.

Microsoft have since claimed to have fixed the problem, yet many users, myself included, can no longer find the machines they use dynamic DNS to resolve. Many are attributing this to DNS propagation delay, but the reality is that Microsoft seem to lack the technical competence to implement such a filtering system.

As of writing (02/07/2014 11:38:25 BST), cached DNS entries for NO-IP domains are completely broken, so let's use dig to query the nameservers directly (and hence avoid any DNS caching issues):

$ dig -tNS +trace no-ip.org

; <<>> DiG 9.9.2-P1 <<>> -tNS +trace no-ip.org
;; global options: +cmd
.                       163525  IN      NS      f.root-servers.net.
.                       163525  IN      NS      a.root-servers.net.
.                       163525  IN      NS      j.root-servers.net.
.                       163525  IN      NS      k.root-servers.net.
.                       163525  IN      NS      h.root-servers.net.
.                       163525  IN      NS      e.root-servers.net.
.                       163525  IN      NS      m.root-servers.net.
.                       163525  IN      NS      b.root-servers.net.
.                       163525  IN      NS      l.root-servers.net.
.                       163525  IN      NS      i.root-servers.net.
.                       163525  IN      NS      d.root-servers.net.
.                       163525  IN      NS      g.root-servers.net.
.                       163525  IN      NS      c.root-servers.net.
.                       515689  IN      RRSIG   NS 8 0 518400 20140709000000 20140701230000 8230 . p5nawXXuH07BoGUsETH3J3VEj7W6H6V1EzwfRIRkr5qepcJQoqyGuced WTeOWW4kZV8GfB0NPS4Rp8HBfNTR6CsxNf4da92kTtbJKh9P+xEtreyd z3RDMqKDDBHGEl2Taml6J5Yhy89gsbigAZKPammqKh2aZM9+Tz46OmPt GHg=
;; Received 913 bytes from 146.169.1.24#53(146.169.1.24) in 16 ms

org.                    172800  IN      NS      a0.org.afilias-nst.info.
org.                    172800  IN      NS      a2.org.afilias-nst.info.
org.                    172800  IN      NS      b0.org.afilias-nst.org.
org.                    172800  IN      NS      b2.org.afilias-nst.org.
org.                    172800  IN      NS      c0.org.afilias-nst.info.
org.                    172800  IN      NS      d0.org.afilias-nst.org.
org.                    86400   IN      DS      21366 7 1 E6C1716CFB6BDC84E84CE1AB5510DAC69173B5B2
org.                    86400   IN      DS      21366 7 2 96EEB2FFD9B00CD4694E78278B5EFDAB0A80446567B69F634DA078F0 D90F01BA
org.                    86400   IN      RRSIG   DS 8 1 86400 20140709000000 20140701230000 8230 . F7mv0vQZqoDHVnIGnms53kiVT4nEwfxPv7ebMixzb20tI/FjH9nUrgvy PvrkzgXYV+HmO0Xzay/bsdLLhE2nMpFY7aXhbmzav9C126UGEAaheUDB 4MuTltNzmpu01biTVxyfqr6ZueE7QMWaKia/l29KdBab/3UgN7M3UL5e wr0=
;; Received 683 bytes from 192.203.230.10#53(192.203.230.10) in 8 ms

no-ip.org.              86400   IN      NS      ns7.microsoftinternetsafety.net.
no-ip.org.              86400   IN      NS      ns8.microsoftinternetsafety.net.
h9p7u7tr2u91d0v0ljs9l1gidnp90u3h.org. 86400 IN NSEC3 1 1 1 D399EAAB H9PARR669T6U8O1GSG9E1LMITK4DEM0T NS SOA RRSIG DNSKEY NSEC3PARAM
h9p7u7tr2u91d0v0ljs9l1gidnp90u3h.org. 86400 IN RRSIG NSEC3 7 2 86400 20140723104131 20140702094131 21185 org. djVZn31Z2Fbpk8Wnj0QQ2HGfkZj/tU9UWhJIEViDbvPaKHfqHRVYnBLc 0n+s04e1uuZJpxmGOjIw6+aTJrxP/t4H525GtS5YLT/TeMQyK5Tq8dKN fUq0lpUqz0fhz2H8QhfHPpZDBCy1Udh29gPmAbXWb84yhEgra7jFObK5 VGA=
vdarb7crtpe7mq8c176tsr178kc0put9.org. 86400 IN NSEC3 1 1 1 D399EAAB VDBA62E7405UOAVCP3IU953TNPO52T45 A RRSIG
vdarb7crtpe7mq8c176tsr178kc0put9.org. 86400 IN RRSIG NSEC3 7 2 86400 20140722155512 20140701145512 21185 org. UAsAW27kIw8XUKxtz1nD4tsjF2uSf8ERmgZS02s4fb0ATIXbDY95Az+u 3Ai/iGDFjBxHJ1oJpEl8xat4IwHzx1/JEgVIheoOd6lklZbXFQRi7RAu E75yYUnnRyVCk1tqGBT3QpgPAM3U6dBkji0UZ/eAiL7CrTQk2Vzz3z8s cfY=
;; Received 594 bytes from 199.19.54.1#53(199.19.54.1) in 156 ms

no-ip.org.              119749  IN      NS      ns8.microsoftinternetsafety.net.
no-ip.org.              119749  IN      NS      ns7.microsoftinternetsafety.net.
;; Received 117 bytes from 157.56.78.73#53(157.56.78.73) in 142 ms

This shows that the .org parent nameservers and the nameservers Microsoft have introduced both agree that the current nameservers for no-ip.org are located under microsoftinternetsafety.net. Their IPs (157.56.78.93, 157.56.78.73) both belong to Microsoft. This means that Microsoft's “fix” must involve changes to their own nameservers, and that they have not returned nameserver control to NO-IP.

Let's look up a DDNS host by querying the Microsoft nameserver directly (hostname replaced by asterisks for paranoia):

$ dig -tA *****.no-ip.org @ns7.microsoftinternetsafety.net

; <<>> DiG 9.9.2-P1 <<>> -tA *****.no-ip.org @ns7.microsoftinternetsafety.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20400
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;*****.no-ip.org.               IN      A

;; ANSWER SECTION:
*****.no-ip.org.        60      IN      A       31.53.88.84

;; Query time: 217 msec
;; SERVER: 157.56.78.73#53(157.56.78.73)
;; WHEN: Wed Jul  2 11:49:05 2014
;; MSG SIZE  rcvd: 60

It works, that's weird. This indicates that the Microsoft nameserver is successfully forwarding requests to the underlying no-ip.org nameservers. Now let's make the same request to Google's public DNS server:

$ dig -tA *****.no-ip.org @8.8.8.8

; <<>> DiG 9.9.2-P1 <<>> -tA *****.no-ip.org @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 60346
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;*****.no-ip.org.               IN      A

;; Query time: 297 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jul  2 11:52:09 2014
;; MSG SIZE  rcvd: 44

This fails with an error suggesting that the nameserver Google contacted failed to answer. Checking the no-ip.org nameserver entries shows that the data is up-to-date, which means that Microsoft's nameservers are not replying to Google correctly.

My first thought was that this was some weird geographical issue, or maybe Microsoft had mistakenly identified Google's DNS requests as malicious traffic. Instead, the answer is much simpler, and indicative of why Microsoft is simply incompetent.

There are two types of DNS server: recursive and non-recursive. Recursive DNS servers are typically contacted by end-users/clients, and perform the sometimes complicated set of lookups required to resolve a domain name. They almost always perform caching, in order to reduce load on non-recursive nameservers. Non-recursive nameservers serve only local data, and don't perform DNS lookups themselves. Organisations wishing to make DNS entries visible set up non-recursive DNS servers so that they can be contacted by recursive ones.

Within each DNS request, there is a “recursion desired” flag (RD). This tells the DNS server whether or not it should recursively perform the query. A recursive DNS server will perform requests with this bit unset since it's performing the recursion itself and the target DNS server ideally shouldn't support it.
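
To make the flag concrete, here is a minimal Python sketch of my own (it was not part of the original investigation) that builds a raw DNS query by hand with the RD bit either set or unset. The function names and the example hostname are purely illustrative; the header layout and the position of the RD bit follow the standard DNS message format.

```python
import struct

RD_FLAG = 0x0100  # "recursion desired" bit in the 16-bit DNS header flags field


def build_query(name: str, query_id: int, recursion_desired: bool) -> bytes:
    """Build a minimal DNS query packet for an A record (QTYPE=1, QCLASS=IN)."""
    flags = RD_FLAG if recursion_desired else 0
    # Header fields: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


def has_rd(packet: bytes) -> bool:
    """Return True if the RD bit is set in a raw DNS packet."""
    (flags,) = struct.unpack("!H", packet[2:4])
    return bool(flags & RD_FLAG)
```

A client that relies on someone else to do the recursion (which is what dig does by default) would send the first form; a recursive resolver walking the hierarchy itself would send the second, which is what dig's +norecurse option imitates.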

Unfortunately Microsoft's DNS servers fail to deliver a useful reply if this bit is unset. Consequently, djbdns, Google's nameservers and many other recursive DNS servers will never get a useful response from the nameservers Microsoft have inserted.

Let's perform the same lookup directly against Microsoft's nameservers again, but this time with recursion disabled:

$ dig -tA +norecurse *****.no-ip.org @ns7.microsoftinternetsafety.net

; <<>> DiG 9.9.2-P1 <<>> -tA +norecurse *****.no-ip.org @ns7.microsoftinternetsafety.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9751
;; flags: qr ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;*****.no-ip.org.               IN      A

;; AUTHORITY SECTION:
org.                    117492  IN      NS      c0.org.afilias-nst.info.
org.                    117492  IN      NS      a0.org.afilias-nst.info.
org.                    117492  IN      NS      a2.org.afilias-nst.info.
org.                    117492  IN      NS      b0.org.afilias-nst.org.
org.                    117492  IN      NS      b2.org.afilias-nst.org.
org.                    117492  IN      NS      d0.org.afilias-nst.org.

;; ADDITIONAL SECTION:
c0.org.afilias-nst.info. 117492 IN      A       199.19.53.1
a0.org.afilias-nst.info. 117492 IN      A       199.19.56.1
a2.org.afilias-nst.info. 117492 IN      A       199.249.112.1
b0.org.afilias-nst.org. 117492  IN      A       199.19.54.1
b2.org.afilias-nst.org. 117492  IN      A       199.249.120.1
d0.org.afilias-nst.org. 117492  IN      A       199.19.57.1

;; Query time: 149 msec
;; SERVER: 157.56.78.73#53(157.56.78.73)
;; WHEN: Wed Jul  2 12:15:05 2014
;; MSG SIZE  rcvd: 278

We do get a response, but it contains no useful information: the answer section is empty and the authority section merely points back at the .org nameservers. This demonstrates that Microsoft really has no idea what they're doing.
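
As a rough way of comparing the three responses above, here is a small Python sketch (again my own illustration, with hypothetical function names) that decodes the 12-byte DNS header and puts a response into a coarse category. It only looks at the header counts, so "referral" here really means "no answers, only authority records"; a fuller parser would inspect the record types as well.

```python
import struct


def describe_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header into its flags and section counts."""
    _ident, flags, _questions, answers, authority, additional = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    return {
        "qr": bool(flags & 0x8000),  # this is a response
        "aa": bool(flags & 0x0400),  # authoritative answer
        "rd": bool(flags & 0x0100),  # recursion desired (copied from the query)
        "ra": bool(flags & 0x0080),  # recursion available
        "rcode": flags & 0x000F,     # 0 = NOERROR, 2 = SERVFAIL
        "answers": answers,
        "authority": authority,
        "additional": additional,
    }


def classify(packet: bytes) -> str:
    """Coarsely classify a response using only header information."""
    header = describe_header(packet)
    if header["rcode"] != 0:
        return "error"
    if header["answers"] > 0:
        return "answer"
    if header["authority"] > 0:
        # Simplification: authority-only responses are treated as referrals.
        return "referral"
    return "empty"
```

Fed headers matching the dig output above, the query with RD set comes out as "answer", the one via 8.8.8.8 as "error", and the +norecurse one as "referral".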

DNS resolution delays in Debian (and Ubuntu)

Yesterday, I was pinging a server when I noticed that the output of ping seemed to be rather slow. In fact, I'd noticed it before but never really thought about it until I was pinging another server at the same time and saw the drastic difference in output speeds.

Pinging google.co.uk, there was a ping every second:

fpr@callisto:~$ ping google.co.uk | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:44:34.898 - PING google.co.uk (74.125.77.104) 56(84) bytes of data.
14:44:34.915 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=1 ttl=238 time=32.9 ms
14:44:35.883 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=2 ttl=238 time=35.9 ms
14:44:36.882 - 64 bytes from ew-in-f104.google.com (74.125.77.104): icmp_seq=3 ttl=238 time=33.4 ms

On another server, it was closer to five:

fpr@callisto:~$ ping server1.fsckvps.com | perl -ne 'use Time::Format; print "$time{\"hh:mm:ss.mmm\"} - $_"'
14:49:42.389 - PING server1.fsckvps.com (66.71.248.146) 56(84) bytes of data.
14:49:42.408 - 64 bytes from 66.71.248.146: icmp_seq=1 ttl=45 time=122 ms
14:49:47.500 - 64 bytes from 66.71.248.146: icmp_seq=2 ttl=45 time=123 ms
14:49:52.625 - 64 bytes from 66.71.248.146: icmp_seq=3 ttl=45 time=122 ms

This didn't make much sense since the round-trip times were small by comparison and ping sends one request per second by default. A Google search indicated that the problem might lie with my resolv.conf file. Unfortunately, mine seemed to be fine and my local DNS server was completely responsive. However, if I pinged the server by IP address instead of by hostname, the delay was gone.
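
To put numbers on the gap, here is a small Python sketch (not something I ran at the time; the function name is made up) that takes timestamped lines in the "hh:mm:ss.mmm - ..." format produced by the perl one-liner above and returns the interval between consecutive lines.

```python
from datetime import datetime


def gaps_seconds(lines):
    """Return the gaps, in seconds, between consecutive timestamped lines."""
    stamps = [
        # The timestamp is everything before the first " - " separator
        datetime.strptime(line.split(" - ", 1)[0], "%H:%M:%S.%f")
        for line in lines
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


slow = [
    "14:49:42.408 - 64 bytes from 66.71.248.146: icmp_seq=1 ttl=45 time=122 ms",
    "14:49:47.500 - 64 bytes from 66.71.248.146: icmp_seq=2 ttl=45 time=123 ms",
    "14:49:52.625 - 64 bytes from 66.71.248.146: icmp_seq=3 ttl=45 time=122 ms",
]
print(gaps_seconds(slow))
```

For the slow host the gaps come out at a little over five seconds each, against roughly one second for google.co.uk, so something is adding about four seconds per ping that has nothing to do with the round-trip time.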

To see what was blocking, I ran strace on ping.

15:04:26.789 - munmap(0x7f69ed319000, 129482)          = 0
15:04:26.790 - socket(PF_FILE, SOCK_STREAM, 0)         = 4
15:04:26.790 - fcntl(4, F_GETFD)                       = 0
15:04:26.790 - fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
15:04:26.790 - connect(4, {sa_family=AF_FILE, path="/var/run/avahi-daemon/socket"...}, 110) = 0
15:04:26.790 - fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
15:04:26.790 - fstat(4, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
15:04:26.790 - mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f69ed359000
15:04:26.791 - lseek(4, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
15:04:26.791 - write(4, "RESOLVE-ADDRESS 66.71.248.146\n"..., 30) = 30
15:04:31.789 - read(4, "-15 Timeout reached\n"..., 4096) = 20
15:04:31.789 - close(4)                                = 0

It was blocking each ping on a read from /var/run/avahi-daemon/socket, a socket used by Avahi, an implementation of the network auto-configuration and service discovery mechanism Zeroconf that augments DNS. Killing the Avahi daemon solved the problem, but I still wanted to work out why ping was talking to Avahi and why it only occurred on certain hosts, so I ran ltrace:
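
The exchange in the strace output is simple enough to reproduce by hand. The Python sketch below is an illustration only: the socket path, the RESOLVE-ADDRESS request and the "-15 Timeout reached" reply are taken straight from the trace, but the function names are mine, and treating any reply that doesn't start with "-" as a success is an assumption on my part.

```python
import socket

# Path taken from the strace output above
AVAHI_SOCKET = "/var/run/avahi-daemon/socket"


def build_request(ip: str) -> bytes:
    """Build the request line seen in the trace, e.g. RESOLVE-ADDRESS 1.2.3.4."""
    return f"RESOLVE-ADDRESS {ip}\n".encode("ascii")


def parse_reply(reply: bytes):
    """Split a reply into (status, code, text).

    Replies starting with "-" are treated as errors ("-15 Timeout reached").
    Assumption: anything else is a successful result.
    """
    text = reply.decode("ascii", errors="replace").strip()
    if text.startswith("-"):
        code, _, message = text[1:].partition(" ")
        return ("error", int(code), message)
    return ("ok", None, text)


def resolve_address(ip: str, path: str = AVAHI_SOCKET, timeout: float = 10.0):
    """Send one request to the Avahi daemon's Unix socket and parse the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(build_request(ip))
        return parse_reply(sock.recv(4096))
```

Timing a call to resolve_address() for an address with no reverse DNS should, if the trace is anything to go by, show the same five-second wait without ping being involved at all.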

15:14:33.062 - gettimeofday(0x7fff90ec0550, NULL)               = 0
15:14:33.062 - gettimeofday(0x7fff90ec0520, NULL)               = 0
15:14:33.062 - memcpy(0x00608988, "\311\004\030J", 16)          = 0x00608988
15:14:33.062 - sendmsg(3, 0x6077c0, 2048, 2, 61091)             = 64
15:14:33.184 - recvmsg(3, 0x7fff90ec15f0, 0, 0, 61091)          = 84
15:14:38.193 - gethostbyaddr("BG\370\222T\315(\002", 4, 2)      = NULL
15:14:38.193 - inet_ntoa(0x92f84742)                            = "66.71.248.146"
15:14:38.194 - strcpy(0x006078e0, "66.71.248.146")              = 0x006078e0

I then wrote a little test in C just to check that I could replicate the delay with gethostbyaddr(). I could, and it was then that I finally realised that the delay occurred when pinging hosts that had no PTR record (reverse DNS). Slightly confusingly, ping performs a reverse DNS lookup on the IP address when provided with a hostname, but not when given an IP address.
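
My original test was in C and I haven't reproduced it here, but an equivalent check is easy to sketch in Python, whose socket.gethostbyaddr() goes through the same system resolver. The function name and the injectable resolver argument are my own additions so that the timing logic can be exercised without touching the network.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        # No PTR record, or the lookup otherwise failed
        hostname = None
    return hostname, time.monotonic() - start
```

Calling time_reverse_lookup("66.71.248.146") on an affected machine should take about five seconds and return no hostname, while an address that does have a PTR record should come back almost immediately.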

gethostbyaddr() was calling Avahi because Avahi had plugged itself into the glibc resolver using NSS. When an attempt to resolve an IP address to a hostname failed, glibc would then call Avahi to try to find it. For whatever reason, Avahi cannot answer this request instantly and times out after 5 long seconds. Avahi also resolves host names to IP addresses but the delay in looking up unresolvable host names only occurs if the domain is under the .local pseudo top-level domain.

The Debian package dependencies make it a bit difficult to remove Avahi so the easiest way to fix this is to remove references to mdns4 from /etc/nsswitch.conf. If you want to kill the daemon entirely then you can always disable its init script of course.
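
If you'd rather script the edit than do it by hand, something along these lines should work. It's a minimal Python sketch with a made-up function name, and it assumes the hosts line looks like the one I believe Debian and Ubuntu ship with libnss-mdns installed (files mdns4_minimal [NOTFOUND=return] dns mdns4), with any bracketed action written as a single token. Check your own file before trusting it, and keep a backup.

```python
def strip_mdns(nsswitch_text: str) -> str:
    """Return nsswitch.conf text with mdns* services removed from "hosts"."""
    result = []
    for line in nsswitch_text.splitlines():
        database, sep, services = line.partition(":")
        if sep and database.strip() == "hosts":
            kept = []
            after_mdns = False
            for token in services.split():
                if token.startswith("mdns"):
                    after_mdns = True  # drop the service itself
                    continue
                if after_mdns and token.startswith("["):
                    # Drop an action such as [NOTFOUND=return] that belonged
                    # to the mdns service just removed (single-token actions
                    # only; this is an assumption).
                    after_mdns = False
                    continue
                after_mdns = False
                kept.append(token)
            line = "hosts: " + " ".join(kept)
        result.append(line)
    return "\n".join(result) + "\n"
```

The function only transforms text; reading /etc/nsswitch.conf and writing the result back (as root) is left to you. Note that it normalises the spacing on the hosts line and leaves everything else untouched.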

Amusingly (or tragically), this bug is listed in Ubuntu and Debian bug reports which are both over two years old. At the time of writing, it still exists in Ubuntu Jaunty and Debian testing. What makes me really angry is that somewhere, someone decided that it would be a great idea to enable this daemon by default on desktop installs and, as a result, performance of applications is being degraded. Obviously ping doesn't matter that much but, as mentioned in the bug reports, this hits people using ssh and IMAP as well, causing anything from mild delays to almost unusable systems. The worst part of this is that many people (and there could be a lot) suffering these issues are probably attributing them to a slow network, or packet loss, or ssh key verification, or just about anything else other than a dubiously designed daemon running on their own machine. The fact that it only occurs on certain hosts only reinforces this, and unless they suddenly realise that this delay is their local machine's fault and put a lot of effort into debugging, they'll probably never know.