Apparently this is a known design issue with bind-dyndb-ldap, the glue between bind/named and LDAP.

https://bugzilla.redhat.com/show_bug.cgi?id=1071356 mentions this behaviour on startup, and the response was:

> This is "expected" behavior for bind-dyndb-ldap version 4.0 and higher:
> See https://git.fedorahosted.org/cgit/bind-dyndb-ldap.git/tree/NEWS for version 4.0 point [5].

> It simply takes some time to load all the data from LDAP to named.

> If you want to see some other behavior please open a bug against bind-dyndb-ldap component.

I can't find any bug filed against bind-dyndb-ldap for this behaviour of answering NXDOMAIN at startup while data is still loading (rather than, say, SERVFAIL, or not answering at all until it knows the right response).
https://pagure.io/bind-dyndb-ldap/issue/124 mentions this behaviour too, and indicates that it could be solved with caching, but the ticket hasn't moved for some time. There are no workarounds listed.

On Fri, Oct 27, 2017 at 1:04 PM Nicholas Hinds <hindsn@gmail.com> wrote:
This might not be entirely related to a FreeIPA upgrade. I have managed to reproduce this by sending lots of queries at bind/named while it's restarting (sudo service named-pkcs11 restart). Sometimes these queries during startup will get unlucky and return NXDOMAIN with invalid authority information, like I observed during the FreeIPA upgrade.

It's possible the FreeIPA upgrade just loaded my system up so that bind took longer to finish starting up - my test system is running on some pretty low-specced hardware.

Simpler steps to reproduce this:

$ sudo service named-pkcs11 restart; for i in {1..20}; do dig a.cname.in.my.freeipa @localhost|grep status; done
Redirecting to /bin/systemctl restart named-pkcs11.service
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 11664
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15073
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31456
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36166
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36299
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 53211
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 30928
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 10465
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 65318
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 33517
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 35773
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2719
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42969
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28725
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16096
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55018
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54067
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47360
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6057
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20778
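
To make runs like the one above easier to compare, the status lines can be tallied. This is only a rough sketch: `count_statuses` is a name I made up, and it assumes dig's usual ">>HEADER<<" line format as shown in the output above.

```shell
# Hypothetical helper: read dig output on stdin and print one
# "STATUS=count" line per response code seen in the ">>HEADER<<" lines.
count_statuses() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' |
        sort | uniq -c | awk '{print $2 "=" $1}'
}
```

It could be fed from the same loop, e.g. `for i in $(seq 1 20); do dig a.cname.in.my.freeipa @localhost; done | count_statuses`.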


I turned the debug level of named up, and it seems to be answering queries before it has read all of the DNS entries from LDAP.

The queries which were returning NXDOMAIN all occurred before a log entry "general: debug 7: add a.cname.in.my.freeipa. 60 IN CNAME destination.in.my.freeipa.", and the queries which returned NOERROR all occurred after that log entry. There are also log entries between the NXDOMAIN and NOERROR messages where it loads the parent zones of the entry ("general: debug 1: zone my.freeipa/IN: starting load" / "general: debug 1: zone in.my.freeipa/IN: starting load"), so the NXDOMAIN response might be because it hasn't read in the NS records or does not yet understand that it is supposed to be the authoritative nameserver for that zone.
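
One way to tell the two kinds of NXDOMAIN apart might be the "aa" flag: in the dig output from my first mail, the bogus responses had "flags: qr rd ra" while the normal NXDOMAIN had "flags: qr aa rd ra". A minimal sketch of that check follows; `has_aa_flag` is a hypothetical name, and it assumes raw dig output (not the grep-prefixed log lines).

```shell
# Hypothetical helper: succeed if the dig output on stdin carries the
# "aa" (authoritative answer) flag in its ";; flags:" line.
has_aa_flag() {
    grep -Eq '^;; flags:[^;]* aa( |;)'
}
```

For example, `dig a.cname.in.my.freeipa @localhost | has_aa_flag || echo "not authoritative"`.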

Is there a way to make bind/named only respond to queries once it's read its configuration fully from LDAP? Or just to wait e.g. 30 seconds after the bind/named service starts before listening on port 53, to lower the chances of responding to queries while it's still booting?
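
In the meantime, an external check could at least tell me when a known record resolves again after a restart. This does not stop bind from answering other clients early, and NOERROR for one name does not prove every zone has loaded, so treat it as a sketch only. `wait_for_noerror`, `DIG` and `POLL_INTERVAL` are names I invented; `+time` and `+tries` are ordinary dig options.

```shell
# Hypothetical helper, not part of bind or FreeIPA.
# Usage: wait_for_noerror NAME SERVER [TRIES]
# DIG and POLL_INTERVAL are made-up override knobs (handy for testing).
wait_for_noerror() {
    _name=$1; _server=$2; _tries=${3:-30}; _i=0
    while [ "$_i" -lt "$_tries" ]; do
        # Pull the response code out of dig's ">>HEADER<<" line.
        _status=$("${DIG:-dig}" +time=1 +tries=1 "$_name" "@$_server" |
            sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p')
        if [ "$_status" = "NOERROR" ]; then
            return 0
        fi
        _i=$((_i + 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    # Gave up: the record never came back as NOERROR.
    return 1
}
```

Something like `sudo service named-pkcs11 restart; wait_for_noerror a.cname.in.my.freeipa localhost 30` would then return once the test record is being served.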

On Thu, Oct 26, 2017 at 11:43 AM Nicholas Hinds <hindsn@gmail.com> wrote:
On Thu, Oct 26, 2017 at 9:17 AM Rob Crittenden <rcritten@redhat.com> wrote:
Nicholas Hinds wrote:
> I tried running `sudo service named-pkcs11 stop` before the yum update,
> but FreeIPA still returned NXDOMAIN responses temporarily.

You want the service named.

That service does not exist in my FreeIPA installation:
$ sudo service named status
Redirecting to /bin/systemctl status named.service
● named.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)

Running `sudo service named stop` gives no output, and running `sudo ipactl status` afterwards shows that "named" is still running:
$ sudo service named stop
Redirecting to /bin/systemctl stop named.service
$ sudo ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
ntpd Service: RUNNING
pki-tomcatd Service: RUNNING
smb Service: RUNNING
winbind Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful


If I stop named-pkcs11, `sudo ipactl status` shows that "named" is stopped:
$ sudo service named-pkcs11 stop
Redirecting to /bin/systemctl stop named-pkcs11.service
$ sudo ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: STOPPED
httpd Service: RUNNING
ipa-custodia Service: RUNNING
ntpd Service: RUNNING
pki-tomcatd Service: RUNNING
smb Service: RUNNING
winbind Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful


So at least on my machine, stopping the OS service "named-pkcs11" stops "named" for FreeIPA.


> It seems like these responses occur about 10 seconds after the last log
> entry in /var/log/ipaupgrade.log ("The ipa-server-upgrade command was
> successful"). Based on the IPA "posttrans" script from the RPM, it seems
> likely the NXDOMAIN responses are being returned while the
> `/bin/systemctl restart ipa.service` command is running, however I
> cannot reproduce the NXDOMAIN responses by running `/bin/systemctl
> restart ipa.service` on its own. Something in the yum upgrade or
> ipa-server-upgrade process seems to trigger this different behaviour.

As I said, by default right now bind remains running while its backend,
389-ds, is unavailable during the package update process. The ipa
service doesn't reproduce this because of the order in which the
services are restarted.

If I stop the "ipa" service then start only bind ("named-pkcs11"), so the backend isn't running, DNS queries return the "SERVFAIL" status rather than "NXDOMAIN", which makes sense to me. They also do not return any authority information. It does not appear that bind returns "NXDOMAIN" with incorrect authority information if its backend is completely unavailable when it starts.

If I start the "ipa" service and then stop each of its components except bind/named one by one (ipa-dnskeysyncd, winbind, smb, ntpd, ipa-custodia, httpd, kadmin, krb5kdc, pki-tomcatd@pki-tomcat, dirsrv@MY-DOMAIN), the DNS server continues to respond correctly to DNS queries. This could be because I have a pair of replicated FreeIPA instances, and once bind/named has started it knows how to query the secondary server. However, stopping FreeIPA on my second server does not stop DNS queries from being answered either, so perhaps bind has just cached the response for the test query I am using. Either way, stopping all of these services, including dirsrv (which I believe is the 389-ds backend process), does not result in "NXDOMAIN" responses with incorrect authority information.


rob

>
> On Tue, Oct 24, 2017 at 1:45 PM Rob Crittenden <rcritten@redhat.com> wrote:
>
>     Nicholas Hinds via FreeIPA-users wrote:
>     > During an upgrade from 4.5.0-21.el7.centos.1.2
>     > to 4.5.0-21.el7.centos.2.2 on a CentOS 7.4 machine, FreeIPA's DNS server
>     > briefly returned NXDOMAIN for records which existed in FreeIPA. These
>     > invalid responses were returned for a very short amount of time, but
>     > caused long-running issues with Java clients which tend to cache DNS
>     > responses. Upgraded packages included: 389-ds-base, 389-ds-base-libs,
>     > 389-ds-base-snmp, ipa-client, ipa-client-common, ipa-python-compat,
>     > ipa-server, ipa-server-common, ipa-server-dns, ipa-server-trust-ad,
>     > python2-ipa-server, and a dozen sss-related packages.
>     >
>     > I reproduced this in a FreeIPA test environment by running `while true;
>     > do dig some.dns.entry.managed.by.freeipa @ip.address.of.freeipa | tee -a
>     > a-log-file; done` from one server, and running `yum update` on the
>     > FreeIPA machine. The invalid NXDOMAIN responses were returned some time
>     > after the `yum update` logged 'Cleanup' for the RPMs, and seemed to be
>     > during the 'Verifying' phase.
>     >
>     > These NXDOMAIN responses claimed that an upstream nameserver
>     > (a.root-servers.net) was the authority for my zone:
>     >
>     > a-log-file-; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.7 <<>> some.dns.entry.managed.by.freeipa @172.16.0.77
>     > a-log-file-;; global options: +cmd
>     > a-log-file-;; Got answer:
>     > a-log-file:;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 2889
>     > a-log-file-;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>     > a-log-file-
>     > a-log-file-;; QUESTION SECTION:
>     > a-log-file-;some.dns.entry.managed.by.freeipa. IN A
>     > a-log-file-
>     > a-log-file-;; AUTHORITY SECTION:
>     > a-log-file-. 60 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2017102400 1800 900 604800 86400
>     > a-log-file-
>     > a-log-file-;; Query time: 227 msec
>     > a-log-file-;; SERVER: 172.16.0.77#53(172.16.0.77)
>     > a-log-file-;; WHEN: Tue Oct 24 18:30:28 2017
>     > a-log-file-;; MSG SIZE  rcvd: 130
>     >
>     > Usually when querying an invalid DNS entry, the dig output still claims
>     > that my FreeIPA server is authoritative for the zone:
>     > $ dig doesntexist.zone.managed.by.freeipa @172.16.0.77
>     >
>     > ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.7 <<>> doesntexist.zone.managed.by.freeipa @172.16.0.77
>     > ;; global options: +cmd
>     > ;; Got answer:
>     > ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 59953
>     > ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>     >
>     > ;; QUESTION SECTION:
>     > ;doesntexist.zone.managed.by.freeipa. IN A
>     >
>     > ;; AUTHORITY SECTION:
>     > zone.managed.by.freeipa. 30 IN SOA idm01.freeipa. hostmaster.zone.managed.by.freeipa. 1508869828 30 900 1209600 30
>     >
>     > ;; Query time: 0 msec
>     > ;; SERVER: 172.16.0.77#53(172.16.0.77)
>     > ;; WHEN: Tue Oct 24 19:27:12 2017
>     > ;; MSG SIZE  rcvd: 113
>     >
>     >
>     > Is it possible that during a yum update, the FreeIPA DNS server
>     > temporarily forgets what zones it's authoritative for (or forgets all
>     > DNS records) and just delegates to the upstream DNS server for half a
>     > second or so? Or is something else going on here?
>     >
>     > I'm open to suggestions.
>
>     The LDAP server is brought down during upgrades which is likely the
>     issue. bind can't connect to its backend. Why it returns NXDOMAIN I
>     don't know.
>
>     You may be able to work around this by manually stopping bind
>     before updating IPA, then starting it again afterwards.
>
>     rob
>