This is the final report from DENIC on the outage they experienced on the 5th of May 2026.

I appreciate the openness, but I’ve reviewed (and written :/) many post-mortems and this one could be a lot better. Using one of the readily available post-mortem templates would have helped.

The main issues:

  • No timeline.
  • Lack of technical details.
  • 50% of the action items are weakly defined.

Let’s go over it. I’ve put my comments under “Remarks”, after each quoted section.

(I’ve converted the blog post into markdown so I can quote it here.)

Final Report: DNS Outage of 5 May 2026

Summary

On 5 May 2026, a DNS outage occurred during a routine DNSSEC key rollover, which significantly restricted access to .de domains for approximately three hours. The cause was an error in the software code of an in-house development, which resulted in the majority of the delivered DNSSEC signatures being invalid. Normal operations were fully restored during the night of 5 May. The findings of the initial analysis dated 8 May 2026 are confirmed.

Remarks

  • No UTC timestamps.
  • Not linking to the previous doc.
  • Majority? How many? Why hard to estimate?

Background: The Signing System

The DNSSEC signing process for DE utilizes standard software (Knot) as well as in-house developments in conjunction with Hardware Security Modules (HSMs).

In April 2026, the third generation of this system (since the introduction of DNSSEC in 2011) was put into operation. The systems were tested in advance and externally audited. The signing system consists of several HSMs distributed across two geographically and network-technically separate data centers.

Remarks

  • One of the action items is another external audit, why did this one not flag it?
  • Tested? How, what was tested?
  • Why is it important to mention that the HSMs are separated?

Cause of the Failure

Faulty Code in the Rollover Agent

The root cause was an error in the software code of an in-house development that controls a rollover agent. The agent’s task is to generate key material and load it into all connected HSMs.

Remarks

  • A deeper root-cause is why this came about. Sure this was the technical root cause, but how did we get here?

Due to the faulty code:

  • Instead of generating a single key pair and loading it into all HSMs, a separate key pair was generated for each connected HSM and loaded into exactly one of them.
  • All three key pairs generated contained the same identifiers, including the key tag 33834.
  • This was not a classic key tag collision, but rather three different key pairs with identical metadata.

Remarks

  • What, how? Show us code? How did this happen exactly? Was the key tag calculated once and reused?
  • How was this not caught in a unit test?
  • Were there unit tests?
  • Different keys would normally have different key tags… But not here?
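
To make that last point concrete: the key tag is not free-form metadata, it is a checksum over the DNSKEY RDATA (RFC 4034, Appendix B). Below is a minimal sketch of that calculation. The flags, the algorithm number and the dummy 64-byte keys are my own placeholders, not DENIC’s actual parameters.

```python
import struct


def key_tag(flags: int, protocol: int, algorithm: int, public_key: bytes) -> int:
    """Key tag as described in RFC 4034 Appendix B (not valid for algorithm 1)."""
    # The tag is computed over the DNSKEY RDATA: flags, protocol, algorithm, key.
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    acc = 0
    for i, byte in enumerate(rdata):
        # Even offsets are the high byte of a 16-bit word, odd offsets the low byte.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


# Placeholder values for illustration only: ZSK flags (256), protocol 3,
# algorithm 13, and three dummy "public keys" that differ from each other.
dummy_keys = [bytes([n]) * 64 for n in (1, 2, 3)]
for key in dummy_keys:
    print(key_tag(256, 3, 13, key))
# prints 9261, 17485 and 25709
```

Three different keys, three different tags. Real collisions do happen (the tag is only 16 bits), but three different keys all ending up with 33834 suggests the tag was not recomputed for each key. That is my guess; the report does not say.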

As a result, the subsequent logic wrote one of the three ZSKs with key tag 33834 into the zone. Because only one of the three HSMs contained the key matching the published DNSKEY record, only about one-third of all signatures could be validated.

Remarks

  • Fair enough, some pointers to code would be nice though.

Why was the error not detected before the system went live?

The faulty code was incorporated during improvements without existing test scenarios covering this specific error case. The test environment consists of a single HSM at a single location. Because the faulty behavior only manifests when multiple HSMs are connected, the defect was not detected during test runs or ‘cold’ parallel operation.

Remarks

  • “…this specific error case.” What is this specific error case? Future problems are impossible to guard against, but what things did it guard against then?
  • Testing wouldn’t require multiple physical HSMs; those can be faked with SoftHSM… And even so: you’re a TLD, buy another HSM.
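
A rough sketch of the kind of test I mean. Everything here is hypothetical: `FakeHSM`, both rollover functions and the key label are names I made up, and the "buggy" variant is only my reading of the behaviour the report describes. A real test would more likely talk to SoftHSM over PKCS#11; an in-memory fake keeps the example self-contained.

```python
import os
from dataclasses import dataclass, field


@dataclass
class FakeHSM:
    """In-memory stand-in for an HSM: maps a key label to key material."""
    name: str
    keys: dict = field(default_factory=dict)

    def load(self, label: str, material: bytes) -> None:
        self.keys[label] = material


def rollover_per_hsm(hsms, label):
    """My reconstruction of the described bug: a fresh key for every HSM."""
    for hsm in hsms:
        hsm.load(label, os.urandom(32))


def rollover_shared(hsms, label):
    """Intended behaviour: generate once, load the same key everywhere."""
    material = os.urandom(32)
    for hsm in hsms:
        hsm.load(label, material)


def distinct_keys(hsms, label):
    """Count the different keys stored under `label` across all HSMs."""
    return len({hsm.keys[label] for hsm in hsms})


def make_hsms(count):
    return [FakeHSM(f"hsm{i}") for i in range(count)]


for count, rollover in [(1, rollover_per_hsm), (3, rollover_per_hsm), (3, rollover_shared)]:
    hsms = make_hsms(count)
    rollover(hsms, "zsk")
    print(count, rollover.__name__, distinct_keys(hsms, "zsk"))
# prints:
# 1 rollover_per_hsm 1
# 3 rollover_per_hsm 3
# 3 rollover_shared 1
```

With a single HSM the buggy and the correct rollover look identical, which matches the report’s explanation of why the test environment missed it. With three fakes, a one-line assertion that `distinct_keys(...) == 1` fails immediately.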

Why was the non-validatable zone published?

The .de zone is updated incrementally due to its size. While three different testing and validation tools detected the anomalies (missing or non-validatable signatures) as intended, the generated notifications were not processed correctly, preventing timely intervention.

Remarks

  • “…updated incrementally due to its size.” Not really related, but is this still really a problem? Should this not be done differently nowadays?
  • Holy shit, “not processed correctly”. How? Why? Many more words need to be written on this topic. What was tested? How did the notifications work?

Ruled-out Causes

The comprehensive analysis explicitly ruled out the following:

  • Compromise: No signs of attacks on the signing system or DENIC infrastructure.
  • Software Failure: No malfunction identified in the Knot name server.
  • Hardware Failure: No malfunction identified in the HSMs used.
  • Collision: No classic key-tag collision occurred.

Remarks

  • OK, good to include this.

Impact

Remarks

  • Again, good section to have in a post-mortem.

Technical Impact

The validity of DNS responses in a TLD zone depends on signed NSEC3 records, especially when verifying the absence of a DS record in an unsigned child zone. Non-validatable signatures on NSEC3 records caused validating resolvers to classify delegation information as “bogus”. Consequently, even second-level domains that do not use DNSSEC could not be resolved. Non-validating resolvers were unaffected.

Remarks

  • Nothing to add.

Impact on Users and Duration

Restrictions lasted for approximately three hours. The impact was partially mitigated by operators of large resolvers who temporarily suspended DNSSEC validation for .de domains.

Remarks

  • TIMELINE!

Measures

DENIC has identified measures affecting both software development and DNS operations. Improvements to the code review process have already been implemented.

Short-term measures include:

  • Enhanced Alerting: Additional alerts based on improved visibility and expanded metrics from validation tools.
  • Accelerated Switchover: A faster procedure to provide a valid zone backup during emergencies.
  • Pre-deployment Validation: Implementing partial validation of the zone prior to deployment.
  • ZSK Rollover Suspension: Halting further ZSK rollovers until security in the software development process and test environment coverage are improved.
  • External Audit: Conducting a final external security and process analysis.

Remarks

Well, let’s go over these.

  • “Enhanced Alerting” doesn’t say anything in particular. Enhanced what? How? From remote vantage points, after the zone is published??
  • “Accelerated Switchover”, the worst possible action, how is this tested? Will this only be used in an emergency or regularly exercised?
  • “Pre-deployment Validation”, this looks like a sane thing to do.
  • “ZSK Rollover Suspension”, OK, understand this one. An actual timeline would be helpful here. For how long?
  • “External audit”, the current code was also audited…. ’nuff said.
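
On “Pre-deployment Validation”: what I would hope for is a check that blocks publication on its own, instead of sending a notification that someone has to process. A toy sketch under heavy assumptions: the hash-based “signature” below is a stand-in and not real DNSSEC cryptography, the names `publish_gate` and `ZoneRejected` are mine, and signing round-robin across three HSMs is a guess at how the work was distributed.

```python
import hashlib


class ZoneRejected(Exception):
    """Raised when the candidate zone must not be published."""


def toy_sign(key: bytes, data: bytes) -> bytes:
    # Toy stand-in for an RRSIG: NOT real DNSSEC cryptography.
    return hashlib.sha256(key + data).digest()


def toy_verify(published_key: bytes, data: bytes, sig: bytes) -> bool:
    return toy_sign(published_key, data) == sig


def publish_gate(records, published_key):
    """Return normally only if every signature verifies; otherwise raise."""
    bad = [name for name, data, sig in records
           if not toy_verify(published_key, data, sig)]
    if bad:
        raise ZoneRejected(f"{len(bad)} of {len(records)} signatures do not validate")


# Three HSMs, each holding a different key, signing round-robin (assumption).
hsm_keys = [b"key-a", b"key-b", b"key-c"]
records = []
for i in range(9):
    name = f"example{i}.de."
    data = name.encode()
    records.append((name, data, toy_sign(hsm_keys[i % 3], data)))

try:
    # Only the first HSM's key matches the published DNSKEY.
    publish_gate(records, published_key=hsm_keys[0])
except ZoneRejected as exc:
    print("publication blocked:", exc)
# prints: publication blocked: 6 of 9 signatures do not validate
```

The point of the design is that the gate fails closed: a broken zone raises and never reaches the publish step, no matter what happens to the alerts. It also reproduces the “about one-third validates” figure from the report.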