Recently I’ve switched to Vodafone for all my cellular needs, in a twisted turn of events this uncovered a goroutine leak in miekg/dns.

First some background on the setup I have at home (where this first showed up). I have your run-of-the-mill LAN + Wifi and a Raspberry Pi running CoreDNS for my DNS (proxy) needs. This CoreDNS instance forwards all DNS traffic to https://dns.google.com which uses a non-standard DNS protocol implemented as JSON over HTTPS. I use this so that my DNS traffic is encrypted. Note that CoreDNS makes heavy use of miekg/dns.

So one day I am doing a journalctl -fu coredns on the Pi and seeing NAPTR queries show up (and erroring):

Nov 14 08:10:40 raspberrypi coredns[7925]: 192.168.1.124 - [14/Nov/2017:08:10:40 +0000] "NAPTR IN mplus.ims.vodafone.com. udp 40 false 512" SERVFAIL qr,rd 40 79.988891ms
Nov 14 08:10:40 raspberrypi coredns[7925]: 14/Nov/2017:08:10:40 +0000 [ERROR 0 mplus.ims.vodafone.com. NAPTR] unreachable backend: no upstream host

Just seeing NAPTR queries in the wild was odd (they’re not that common). Seeing the error was weird, but I shrugged it off as something Vodafone had been using but turned down or something. (The no upstream host is misleading here).

More than a week later I started working on getting my Grafana instance back in order and while fidding with it I saw this enourmous pile of goroutines being used on the Pi.

In the above image, number:

  1. (1) starting monitoring and seeing an increase
  2. (2) I believe this is my wife swapping her sim for a Vodafone one
  3. (3) The spike to the right is when started to figure out things

The large drop is when I restarted CoreDNS to enable the pprof plugin, then the small increase when I really found the culprit and used dig a lot.

I filled issue 1233 and issue 1235 not knowing yet it was a miekg/dns bug.

I also have another instance, running as an auth server, which didn’t have this issue (pfew!), which narrowed down what plugins could be responsible for this.

Debug

To start debugging in earnest I restarted CoreDNS with the pprof plugin and use the pprof endpoint: localhost:6053/debug/pprof/goroutine?debug=1 to dump the running goroutines.

At the top it says something in the lexer is lingering. I did not believe immediately that miekg/dns was the culprit here, even though this clearly says so, I opened miekg/dns/issues/568 in any case.

goroutine profile: total 27
7 @ 0x3e6c8 0x3e790 0x14124 0x13ea8 0x224060 0x6fd2c
#	0x22405f	github.com/miekg/dns.zlexer+0x347	/home/miek/g/src/github.com/miekg/dns/scan.go:602

2 @ 0x3e6c8 0x39a5c 0x38edc 0xa420c 0xa4280 0xa4e78 0x135948 0x145e04 0x1e48a0 0x1e4d14 0x1e7cd0 0x302a58 0xf8530 0xf868c 0x30337c 0x6fd2c
#	0x38edb		internal/poll.runtime_pollWait+0x43	/home/miek/upstream/go/src/runtime/netpoll.go:173
#	0xa420b		internal/poll.(*pollDesc).wait+0x8b	/home/miek/upstream/go/src/internal/poll/fd_poll_runtime.go:85

During this debugging I happen to notice those NAPTR requests again. And every time one was done an extra scan.go:602 goroutine showed up. From the logs I grabbed the domain name that was being asked and replayed with dig mplus.ims.vodafone.com. NAPTR. Sure enough the goroutine count increased by 3 (dig will retry 3 times by default). Bypassing CoreDNS and issuing this request against 8.8.8.8 would yield the correct results.

So what is wrong here?

JSON

By using dns.google.com everything needs to be put in a JSON payload. Going to: https://dns.google.com/resolve?name=mplus.ims.vodafone.com.&type=NAPTR, it shows:

{"Status": 0,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name":
"mplus.ims.vodafone.com.","type": 35}],"Answer":[ {"name": "mplus.ims.vodafone.com.","type":
35,"TTL": 1521,"data": "10 100 S SIP+D2T  _sip._tcp.mplus.ims.vodafone.com."},{"name":
"mplus.ims.vodafone.com.","type": 35,"TTL": 1521,"data": "10 100 S SIPS+D2T
_sips._tcp.mplus.ims.vodafone.com."},{"name": "mplus.ims.vodafone.com.","type": 35,"TTL":
1521,"data": "10 100 S SIP+D2U  _sip._udp.mplus.ims.vodafone.com."}]}

Looks legit? The problem here is that because all the RDATA is put in the data element, it strips quotes from the text presentation that are mandatory for NAPTR records, i.e.:

"data": "10 100 S SIP+D2T  _sip._tcp.mplus.ims.vodafone.com."

Should have been encoded as:

"data": "10 100 \"S\" \"SIP+D2T\" \"\"  _sip._tcp.mplus.ims.vodafone.com."

dns.NewRR

In CoreDNS this JSON is parsed in plugin/proxy/google_rr.go which calls dns.NewRR. That code is sits in scan.go. The gist of the code is that to spawn a lexer goroutine that sends tokens down on a channel that are consumed by a “grammar” like parser to make sense of the string.

A successful parse means the lexer see an EOF “error”, drops out of the for-loop and exits the function and ends the goroutine.

A non successful parse that stops midway however is an entirely different story, because of the missing quotes around S in the returned NAPTR record, we do bail out early, but this means

  1. the zlexer function does not see an EOF
  2. it consumes the next token and then sits idling, trying to send it on the channel
  3. the reader of the channel is gone, because we’ve stopped parsing

Result: a leaked goroutine.

Fix

The fix was to make sure the zlexer goroutine would exit even when the higher layer saw an error. In this PR implement with a context that gets canceled which triggers the exit from the zlexer function. Is also adds a test that checks for goroutine leaks.

Morale of the story? Don’t switch providers. Dogfood your own stuff and always make sure you know how to stop a goroutine.

Fun fact: this code was committed on December 14 2011!