Can't get any more broken than a segfault at startup :-)
Looking at the carnage in /var/log/messages, it broke so bad that even systemd-coredump choked on itself, and failed to do whatever it wanted to do. Impressive.
Booted back to 4.6.7 to get things going again. Don't really have much to add to bug 1374917, besides the sorry state of affairs from /var/log/messages.
chroot might have something to do with it. I'm running named-chroot.service
On 09/10/16 20:15, Sam Varshavchik wrote:
Can't get any more broken than a segfault at startup :-)
Looking at the carnage in /var/log/messages, it broke so bad that even systemd-coredump choked on itself, and failed to do whatever it wanted to do. Impressive.
Booted back to 4.6.7 to get things going again. Don't really have much to add to bug 1374917, besides the sorry state of affairs from /var/log/messages.
chroot might have something to do with it. I'm running named-chroot.service
FWIW, I'm also running named-chroot.service and
[egreshko@meimei ~]$ uname -r
4.7.2-201.fc24.x86_64
But unable to reproduce your issue.
[egreshko@meimei ~]$ systemctl status named-chroot.service
● named-chroot.service - Berkeley Internet Name Domain (DNS)
   Loaded: loaded (/usr/lib/systemd/system/named-chroot.service; enabled; vendor pre
  Drop-In: /etc/systemd/system/named-chroot.service.d
           └─mynfssetup.conf
   Active: active (running) since Wed 2016-09-07 08:10:11 CST; 3 days ago
 Main PID: 1338 (named)
Ed Greshko writes:
FWIW, I'm also running named-chroot.service and
[egreshko@meimei ~]$ uname -r
4.7.2-201.fc24.x86_64
But unable to reproduce your issue.
All-righty, this must be something about this particular named-chroot configuration…
On Sat, 10 Sep 2016 10:20:57 -0400 Sam Varshavchik wrote:
All-righty, this must be something about this particular named-chroot configuration…
In the "check the dumb stuff first" category, might want to run memtest and check the SMART info on the disk. Always a chance the code got corrupted somehow and isn't running the instructions intended to run :-).
Tom Horsley writes:
On Sat, 10 Sep 2016 10:20:57 -0400 Sam Varshavchik wrote:
All-righty, this must be something about this particular named-chroot configuration…
In the "check the dumb stuff first" category, might want to run memtest and check the SMART info on the disk. Always a chance the code got corrupted somehow and isn't running the instructions intended to run :-).
I copied the chroot to another server that I can play with. On that one, named-chroot also segfaults at startup in the same way, so it looks like I have a weekend project…
Sam Varshavchik writes:
I copied the chroot to another server that I can play with. On that one, named-chroot also segfaults at startup in the same way, so it looks like I have a weekend project…
I have this "options" directive in place for decades:
datasize 20M;
Commenting it out allows named to start up. I'll try it on my main server the next time I reboot it. Something about 4.7.2 makes named blow up with this directive in place.
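For context, the directive sits in the global options block, roughly like this (the rest of the file is site-specific, so treat this as an illustration only):

    options {
            directory "/var/named";
            # ...rest of the options...
            datasize 20M;
    };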
On 09/10/2016 08:09 AM, Sam Varshavchik wrote:
I have this "options" directive in place for decades:
datasize 20M;
Commenting it out allows named to start up. I'll try it on my main server the next time I reboot it. Something about 4.7.2 makes named blow up with this directive in place.
Uhm, since that limits the size of memory that named can use, have you tried increasing it? I agree that a new kernel shouldn't cause it to puke unless there's something wrong with the way RAM is being allocated in the kernel (or the limit is actually being enforced in the new kernel and it wasn't in the older ones).
It'd be interesting if you had a top report of the memory usage of named under the old kernel and the new kernel (with the directive disabled), just to see what the memory footprint differences are. Might point toward something interesting.
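Something as simple as this, run once under each kernel, would capture it (assuming a single named process, so pidof returns just one pid):

    top -b -n 1 -p $(pidof named)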
Rick Stevens writes:
Uhm, since that limits the size of memory that named can use, have you tried increasing it? I agree that a new kernel shouldn't cause it to puke unless there's something wrong with the way RAM is being allocated in the kernel (or the limit is actually being enforced in the new kernel and it wasn't in the older ones).
It'd be interesting if you had a top report of the memory usage of named under the old kernel and the new kernel (with the directive disabled), just to see what the memory footprint differences are. Might point toward something interesting.
Yeah, something's definitely going on.
Freshly restarted named:
4.7.2:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29156 named 20 0 702528 83324 6360 S 12.5 2.1 0:00.23 named
4.6.7:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10208 named 20 0 407084 81908 6828 S 12.5 1.0 0:00.13 named
With 4.7.2, its virtual space is nearly twice as much, and RES is slightly bigger.
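The same numbers can also be pulled straight out of /proc, which breaks out the data segment separately (the pid is whatever named happens to get):

    grep -E 'Vm(Size|RSS|Data)' /proc/$(pidof named)/status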
On 09/13/2016 04:06 AM, Sam Varshavchik wrote:
Yeah, something's definitely going on.
Freshly restarted named:
4.7.2:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29156 named 20 0 702528 83324 6360 S 12.5 2.1 0:00.23 named
4.6.7:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10208 named 20 0 407084 81908 6828 S 12.5 1.0 0:00.13 named
With 4.7.2, its virtual space is nearly twice as much, and RES is slightly bigger.
Yeah, the virtual usage is significantly bigger, the resident part slightly bigger and the shared segment is actually smaller. Weird.
I wonder if it has something to do with the way chroots work in 4.7.x? Is it possible for you to launch it again in both kernels but NOT in a chroot? That might allow you to bugzilla something a bit more focused, but there's SOMETHING weird there.
On Tue, 13 Sep 2016 09:56:08 -0700 Rick Stevens wrote:
Yeah, the virtual usage is significantly bigger, the resident part slightly bigger and the shared segment is actually smaller. Weird.
You could compare the /proc/pid/maps files for both cases and see which memory segment(s) were bigger.
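Something along these lines, saved once under each kernel and then diffed (the file names are just placeholders, and the raw addresses will differ from boot to boot, so the sizes and permission flags are the interesting part):

    cat /proc/$(pidof named)/maps > /tmp/maps-4.6.7    # booted into the old kernel
    cat /proc/$(pidof named)/maps > /tmp/maps-4.7.2    # booted into the new one
    diff /tmp/maps-4.6.7 /tmp/maps-4.7.2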
Tom Horsley writes:
On Tue, 13 Sep 2016 09:56:08 -0700 Rick Stevens wrote:
Yeah, the virtual usage is significantly bigger, the resident part slightly bigger and the shared segment is actually smaller. Weird.
You could compare the /proc/pid/maps files for both cases and see which memory segment(s) were bigger.
Did that. Under either kernel, the named process maps the same shared libraries, and the mappings are identical in size.
The difference is entirely in the process's private mappings: half of them are "rw-p" mappings, and half are "---p" mappings, which I do not understand. A private mapping without read or write privileges?
The "rw-p" mappings are slightly larger under 4.7.2. The "---p" mappings are significantly larger under the 4.7.2 kernel.
Rick Stevens writes:
Yeah, the virtual usage is significantly bigger, the resident part slightly bigger and the shared segment is actually smaller. Weird.
I wonder if it has something to do with the way chroots work in 4.7.x? Is it possible for you to launch it again in both kernels but NOT in a chroot? That might allow you to bugzilla something a bit more focused, but there's SOMETHING weird there.
My named config is set up in the chroot. I do not have a non-chrooted named config, but I can work on it. That's going to be my next weekend's project, I suppose.
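A quick-and-dirty way to try it outside the chroot, without building a whole second setup, might be to run named in the foreground against a copy of the config (the paths here are only placeholders for whatever I end up copying out of the chroot):

    named -g -u named -c /etc/named.conf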
On Tue, 13 Sep 2016 18:46:50 -0400 Sam Varshavchik wrote:
half of them are "---p" mappings, which I do not understand, a private mapping without read and write privileges?
I think that corresponds to an intentional "hole" in the address space where attempted access results in a segfault and no one can map anything new there.
I'm not exactly sure about any of that though. If I look at some maps files, there always seems to be one of these ---p regions somewhere in the middle of each shared library.
You could always guess security geeks are responsible. They seem to be behind all the inexplicable address layout stuff :-).
On Tue, 13 Sep 2016 19:10:13 -0400 Tom Horsley wrote:
I'm not exactly sure about any of that though. If I look at some maps files, there always seems to be one of these ---p regions somewhere in the middle of each shared library.
I found this:
http://unix.stackexchange.com/questions/226283/shared-library-mappings-in-proc-pid-maps
It says the private-only pages are gaps between the read-only and read-write sections, there to keep things aligned to a page boundary.
Did the new kernel change the default page size so the gaps need to be bigger now? (You'd think most of the universe would panic if the page size changed from 4K though :-).
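The page size itself is easy enough to check under each kernel, e.g.:

    getconf PAGE_SIZE

which will almost certainly come back 4096 on x86_64 either way.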
Tom Horsley writes:
On Tue, 13 Sep 2016 19:10:13 -0400 Tom Horsley wrote:
I'm not exactly sure about any of that though. If I look at some maps files, there always seems to be one of these ---p regions somewhere in the middle of each shared library.
I found this:
http://unix.stackexchange.com/questions/226283/shared-library-mappings-in-proc-pid-maps
It says the private-only pages are gaps between the read-only and read-write sections, there to keep things aligned to a page boundary.
The gaps are more than enough for a 4kb page alignment.
I augmented the output of /proc/pid/maps to show the length of each hexadecimal region, and nl-ed the output.
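Roughly like this, with gawk (not the exact one-liner I used, but it produces the same kind of output):

    awk '{ split($1, a, "-");
           kb = (strtonum("0x" a[2]) - strtonum("0x" a[1])) / 1024;
           printf "%s %.2f Kb %s %s %s %s %s\n", $1, kb, $2, $3, $4, $5, $6 }' \
        /proc/$(pidof named)/maps | nl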
The first three mappings are /usr/sbin/named, then the next 19 mappings with kernel 4.6.7 are:
4 560c39d87000-560c39d8b000 16.00 Kb rw-p 00000000 00:00 0
5 560c3a3ae000-560c3a3cf000 132.00 Kb rw-p 00000000 00:00 0 [heap]
6 560c3a3cf000-560c3a43b000 432.00 Kb rw-p 00000000 00:00 0 [heap]
7 7fa2a0000000-7fa2a0085000 532.00 Kb rw-p 00000000 00:00 0
8 7fa2a0085000-7fa2a4000000 65004.00 Kb ---p 00000000 00:00 0
9 7fa2a8000000-7fa2a806f000 444.00 Kb rw-p 00000000 00:00 0
10 7fa2a806f000-7fa2ac000000 65092.00 Kb ---p 00000000 00:00 0
11 7fa2ac000000-7fa2ac05e000 376.00 Kb rw-p 00000000 00:00 0
12 7fa2ac05e000-7fa2b0000000 65160.00 Kb ---p 00000000 00:00 0
13 7fa2b0000000-7fa2b3085000 49684.00 Kb rw-p 00000000 00:00 0
14 7fa2b3085000-7fa2b4000000 15852.00 Kb ---p 00000000 00:00 0
15 7fa2b5bc7000-7fa2b7d4b000 34320.00 Kb rw-p 00000000 00:00 0
16 7fa2b7d4b000-7fa2b7d4c000 4.00 Kb ---p 00000000 00:00 0
17 7fa2b7d4c000-7fa2b854c000 8192.00 Kb rw-p 00000000 00:00 0
18 7fa2b854c000-7fa2b854d000 4.00 Kb ---p 00000000 00:00 0
19 7fa2b854d000-7fa2b8d4d000 8192.00 Kb rw-p 00000000 00:00 0
20 7fa2b8d4d000-7fa2b8d4e000 4.00 Kb ---p 00000000 00:00 0
21 7fa2b8d4e000-7fa2b954e000 8192.00 Kb rw-p 00000000 00:00 0
22 7fa2b954e000-7fa2b954f000 4.00 Kb ---p 00000000 00:00 0
And with kernel 4.7.2, the same 19 mappings are:
4 56099c4a9000-56099c4ad000 16.00 Kb rw-p 00000000 00:00 0
5 56099e11c000-56099e13d000 132.00 Kb rw-p 00000000 00:00 0 [heap]
6 56099e13d000-56099e1a9000 432.00 Kb rw-p 00000000 00:00 0 [heap]
7 7f04dc000000-7f04dc021000 132.00 Kb rw-p 00000000 00:00 0
8 7f04dc021000-7f04e0000000 65404.00 Kb ---p 00000000 00:00 0
9 7f04e379f000-7f04e4000000 8580.00 Kb rw-p 00000000 00:00 0
10 7f04e4000000-7f04e4032000 200.00 Kb rw-p 00000000 00:00 0
11 7f04e4032000-7f04e8000000 65336.00 Kb ---p 00000000 00:00 0
12 7f04e8000000-7f04e8029000 164.00 Kb rw-p 00000000 00:00 0
13 7f04e8029000-7f04ec000000 65372.00 Kb ---p 00000000 00:00 0
14 7f04ec000000-7f04ec02d000 180.00 Kb rw-p 00000000 00:00 0
15 7f04ec02d000-7f04f0000000 65356.00 Kb ---p 00000000 00:00 0
16 7f04f0000000-7f04f002e000 184.00 Kb rw-p 00000000 00:00 0
17 7f04f002e000-7f04f4000000 65352.00 Kb ---p 00000000 00:00 0
18 7f04f4000000-7f04f402d000 180.00 Kb rw-p 00000000 00:00 0
19 7f04f402d000-7f04f8000000 65356.00 Kb ---p 00000000 00:00 0
20 7f04f8000000-7f04f8038000 224.00 Kb rw-p 00000000 00:00 0
21 7f04f8038000-7f04fc000000 65312.00 Kb ---p 00000000 00:00 0
22 7f04fc000000-7f04ff206000 51224.00 Kb rw-p 00000000 00:00 0
The first half of the mappings are roughly comparable. Beyond that, with kernel 4.7.2, the remaining mappings stay more or less the same size, while in 4.6.7 the remaining mappings are much smaller.
Did the new kernel change the default page size so the gaps need to be bigger now? (You'd think most of the universe would panic if the page size changed from 4K though :-).
Note how under 4.7.2, the read/write mapping and the immediately following "---p" mapping tend to add up to 65536 Kb in total.
In 4.7.2, the first few mappings do seem to align to a 64kb boundary, but not all of them; the last half of the shown mappings do not each align to 64kb boundaries.
The question now becomes how this interacts with bind's "datasize" config option.
I originally had "datasize 20M" before 4.7.2.
I upped it to "datasize 64M". named-chroot still failed to start.
I upped it to "datasize 256M". named-chroot started. I looked into /proc/pid/limits, and I saw that "Max data size" was now set. It was set to 256 megabytes.
From that, I conclude that bind does not manage "datasize n" internally; it just uses it to set its own ulimit.
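That's easy enough to confirm from outside the process, since "datasize 20M" then amounts to the same RLIMIT_DATA a shell would set with "ulimit -d 20480" (the ulimit is in 1K units):

    grep 'Max data size' /proc/$(pidof named)/limits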
So, it seems that bind, which ran comfortably with a 20M data ulimit before 4.7.2, can't even fit within a 64M data ulimit in 4.7.2.
How is this reconciled with the changes in the process mappings? Does ulimit just set the upper range of the data segment numerically, in which case the larger gaps would certainly eat into that rather quickly? But other evidence suggests that this is not the case:
The actual row in /proc/pid/limits is:
Max data size 268435456 268435456 bytes
Counting on my fingers, that's 256 megabytes.
Each pair of mappings, above, adds up to 65536kb, which is 64 megabytes. Unless I'm a bit slow mentally today, that's what it looks like to me, so the first four of those pairs would've blown through that limit, and there are quite a few more of them. That would suggest that ulimit does not just set a raw cutoff for virtual memory addresses (which jibes with what I read in mmap(), et al.)
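For what it's worth, a quick way to total up the anonymous private mappings from the same maps file, to see how far past the limit the address space actually goes (gawk again, and only a sketch):

    awk '$2 ~ /p$/ && ($6 == "" || $6 == "[heap]") {
             split($1, a, "-");
             total += strtonum("0x" a[2]) - strtonum("0x" a[1])
         }
         END { printf "%.1f MB of anonymous private mappings\n", total / (1024 * 1024) }' \
        /proc/$(pidof named)/maps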
So, what the heck is going on.
On Tue, 13 Sep 2016 22:05:18 -0400 Sam Varshavchik wrote:
7 7fa2a0000000-7fa2a0085000 532.00 Kb rw-p 00000000 00:00 0
8 7fa2a0085000-7fa2a4000000 65004.00 Kb ---p 00000000 00:00 0
9 7fa2a8000000-7fa2a806f000 444.00 Kb rw-p 00000000 00:00 0
10 7fa2a806f000-7fa2ac000000 65092.00 Kb ---p 00000000 00:00 0
That looks like someone is doing heap protection by putting an inaccessible section between each allocated heap chunk. There might be a glibc malloc debugging trigger of some kind to do that, but I wouldn't think that would change just for the kernel version, so I'm out of ideas.
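glibc does have a few environment knobs that change how it carves up the heap, so one purely speculative experiment, which may well have nothing to do with any of this, would be to restart named with something like the following and see whether the maps come out different:

    MALLOC_ARENA_MAX=1 /usr/sbin/named -g -u named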
Tom Horsley writes:
On Tue, 13 Sep 2016 22:05:18 -0400 Sam Varshavchik wrote:
7 7fa2a0000000-7fa2a0085000 532.00 Kb rw-p 00000000 00:00 0
8 7fa2a0085000-7fa2a4000000 65004.00 Kb ---p 00000000 00:00 0
9 7fa2a8000000-7fa2a806f000 444.00 Kb rw-p 00000000 00:00 0
10 7fa2a806f000-7fa2ac000000 65092.00 Kb ---p 00000000 00:00 0
That looks like someone is doing heap protection by putting an inaccessible section between each allocated heap chunk. There might be a glibc malloc debugging trigger of some kind to do that, but I wouldn't think that would change just for the kernel version, so I'm out of ideas.
I recall that ElectricFence did something like that. But why even bother mapping anything. I would think that accessing an unmapped region would result in a SIGSEGV just as well.
On Wed, 14 Sep 2016 06:47:28 -0400 Sam Varshavchik wrote:
But why even bother mapping anything. I would think that accessing an unmapped region would result in a SIGSEGV just as well.
Ah-HA! A question I can answer :-).
You need to map something because otherwise the program itself might have code to map something of its own and unless you reserve the protected space, it might wind up becoming unprotected.
I also see a new 4.7.3 kernel showed up in the repos. I wonder if the problem goes away with it?
Tom Horsley writes:
On Wed, 14 Sep 2016 06:47:28 -0400 Sam Varshavchik wrote:
But why even bother mapping anything. I would think that accessing an unmapped region would result in a SIGSEGV just as well.
Ah-HA! A question I can answer :-).
You need to map something because otherwise the program itself might have code to map something of its own and unless you reserve the protected space, it might wind up becoming unprotected.
I also see a new 4.7.3 kernel showed up in the repos. I wonder if the problem goes away with it?
Nope, and the issue is being reported by others, too. Not just me.
On 09/16/16 07:09, Sam Varshavchik wrote:
Nope, and the issue is being reported by others, too. Not just me.
Do you happen to know the Bugzilla?
Ed Greshko writes:
Do you happen to know the Bugzilla?
On 09/16/16 09:25, Sam Varshavchik wrote:
Do you happen to know the Bugzilla?
Thanks much.