I'm having a tough time trying to get mysql rebuilt in rawhide: the ppc build keeps failing like this: http://koji.fedoraproject.org/koji/getfile?taskID=335275&name=build.log
Normally what a failure in execution_constants means is that the configuration constant STACK_MIN_SIZE has to be increased, because the error-recovery code needs more stack space than it did before. We have for years had to run that a bit higher than what mysql.com ships, but up to now 16384 has worked fine across all arches (all 7 redhat arches, not just Fedora). Sometime since 13 Dec 2007, however, the behavior of the ppc arch changed in rawhide, and now even boosting the number by 50% (to 24K) doesn't persuade it to work. I could try larger numbers, at the cost of also increasing DEFAULT_THREAD_STACK which is a pretty user-visible number. I am wondering if this isn't a bug in rawhide, though. Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
I can't prove at the moment that the problem affects *only* PPC, but given that that build always dies first it seems pretty likely.
If anyone has a clue what to look at, I'd appreciate it.
regards, tom lane
On Tue, 2008-01-08 at 16:28 -0500, Tom Lane wrote:
> I'm having a tough time trying to get mysql rebuilt in rawhide: the ppc build keeps failing like this: http://koji.fedoraproject.org/koji/getfile?taskID=335275&name=build.log
> Normally what a failure in execution_constants means is that the configuration constant STACK_MIN_SIZE has to be increased, because the error-recovery code needs more stack space than it did before. We have for years had to run that a bit higher than what mysql.com ships, but up to now 16384 has worked fine across all arches (all 7 redhat arches, not just Fedora). Sometime since 13 Dec 2007, however, the behavior of the ppc arch changed in rawhide, and now even boosting the number by 50% (to 24K) doesn't persuade it to work. I could try larger numbers, at the cost of also increasing DEFAULT_THREAD_STACK which is a pretty user-visible number. I am wondering if this isn't a bug in rawhide, though. Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
For a while we used 64KiB pages on ppc64, because IBM insisted on it in RHEL5 and I didn't notice we'd done the same stupid thing in Fedora. I believe it was like that in FC6 but I fixed it again for F7.
Is it possible that the kernel on the build machines is now similarly afflicted?
On Wed, 09 Jan 2008 10:15:59 +0000, David Woodhouse wrote:
> On Tue, 2008-01-08 at 16:28 -0500, Tom Lane wrote:
> > I'm having a tough time trying to get mysql rebuilt in rawhide: the ppc build keeps failing like this: http://koji.fedoraproject.org/koji/getfile?taskID=335275&name=build.log
> > Normally what a failure in execution_constants means is that the configuration constant STACK_MIN_SIZE has to be increased, because the error-recovery code needs more stack space than it did before. We have for years had to run that a bit higher than what mysql.com ships, but up to now 16384 has worked fine across all arches (all 7 redhat arches, not just Fedora). Sometime since 13 Dec 2007, however, the behavior of the ppc arch changed in rawhide, and now even boosting the number by 50% (to 24K) doesn't persuade it to work. I could try larger numbers, at the cost of also increasing DEFAULT_THREAD_STACK which is a pretty user-visible number. I am wondering if this isn't a bug in rawhide, though. Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
> For a while we used 64KiB pages on ppc64, because IBM insisted on it in RHEL5 and I didn't notice we'd done the same stupid thing in Fedora. I believe it was like that in FC6 but I fixed it again for F7.
> Is it possible that the kernel on the build machines is now similarly afflicted?
In December the builders were upgraded to RHEL5.
On Wednesday 09 January 2008, David Woodhouse wrote:
> On Tue, 2008-01-08 at 16:28 -0500, Tom Lane wrote:
> > I'm having a tough time trying to get mysql rebuilt in rawhide: the ppc build keeps failing like this: http://koji.fedoraproject.org/koji/getfile?taskID=335275&name=build.log
> > Normally what a failure in execution_constants means is that the configuration constant STACK_MIN_SIZE has to be increased, because the error-recovery code needs more stack space than it did before. We have for years had to run that a bit higher than what mysql.com ships, but up to now 16384 has worked fine across all arches (all 7 redhat arches, not just Fedora). Sometime since 13 Dec 2007, however, the behavior of the ppc arch changed in rawhide, and now even boosting the number by 50% (to 24K) doesn't persuade it to work. I could try larger numbers, at the cost of also increasing DEFAULT_THREAD_STACK which is a pretty user-visible number. I am wondering if this isn't a bug in rawhide, though. Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
> For a while we used 64KiB pages on ppc64, because IBM insisted on it in RHEL5 and I didn't notice we'd done the same stupid thing in Fedora. I believe it was like that in FC6 but I fixed it again for F7.
> Is it possible that the kernel on the build machines is now similarly afflicted?
When the build system was refreshed in December, the builders were updated to RHEL5, so they are all running RHEL5 kernels. They do have one extra patch for a bug in tux.
Dennis
Dennis Gilmore <dennis@ausil.us> writes:
> On Wednesday 09 January 2008, David Woodhouse wrote:
> > On Tue, 2008-01-08 at 16:28 -0500, Tom Lane wrote:
> > > ... Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
> > For a while we used 64KiB pages on ppc64, because IBM insisted on it in RHEL5 and I didn't notice we'd done the same stupid thing in Fedora. I believe it was like that in FC6 but I fixed it again for F7.
> > Is it possible that the kernel on the build machines is now similarly afflicted?
> The build system when refreshed in December got updates to RHEL5 so are all running RHEL5 kernels. they do have one extra patch for a bug in tux.
Interesting, but would that affect the rate at which userland code consumes stack space?
Since posting, I've verified that it still fails at STACK_MIN_SIZE = 48K, which is 300% of the setting that had worked up through mid-December. So *something* has gone pretty seriously wacko, but it's hard to tell what. I guess I shall have to request buildroot access and start poking at it with a debugger ...
regards, tom lane
On Wed, 2008-01-09 at 11:11 -0500, Tom Lane wrote:
> Dennis Gilmore <dennis@ausil.us> writes:
> > On Wednesday 09 January 2008, David Woodhouse wrote:
> > > On Tue, 2008-01-08 at 16:28 -0500, Tom Lane wrote:
> > > > ... Has gcc started making PPC stack frames a lot larger than before? Maybe glibc has gotten more stack-hungry? I'd guess on the problem being in gettext() or related code, if it is a glibc change, but I haven't tracked it down exactly.
> > > For a while we used 64KiB pages on ppc64, because IBM insisted on it in RHEL5 and I didn't notice we'd done the same stupid thing in Fedora. I believe it was like that in FC6 but I fixed it again for F7.
> > > Is it possible that the kernel on the build machines is now similarly afflicted?
> > The build system when refreshed in December got updates to RHEL5 so are all running RHEL5 kernels. they do have one extra patch for a bug in tux.
> Interesting, but would that affect the rate at which userland code consumes stack space?
It'll certainly affect the way it _allocates_ stack space.
> Since posting, I've verified that it still fails at STACK_MIN_SIZE = 48K,
That's still less than a single page. Did you try 64KiB or 128KiB?
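For readers following the page-size argument, here is a minimal sketch, assuming stack memory is handed out in whole pages, of how the sizes mentioned in this thread round up under 4 KiB and 64 KiB pages. The helper name round_up_to_pages is made up for illustration; mmap.PAGESIZE is the Python standard-library constant for the local page size.

```python
import mmap

KIB = 1024


def round_up_to_pages(nbytes: int, page_size: int) -> int:
    """Round nbytes up to a whole number of pages of page_size bytes."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-nbytes // page_size) * page_size


if __name__ == "__main__":
    print("page size on this machine:", mmap.PAGESIZE)
    # Sizes taken from the thread, in KiB.
    for request_kib in (16, 24, 48, 192, 256, 512):
        request = request_kib * KIB
        small = round_up_to_pages(request, 4 * KIB)
        large = round_up_to_pages(request, 64 * KIB)
        print(f"{request_kib:4d}K -> {small} bytes (4K pages), {large} bytes (64K pages)")
```

Under this model a 48K request grows to one full 64 KiB page, while 192K, 256K and 512K are already multiples of 64 KiB and come out unchanged.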
David Woodhouse <dwmw2@infradead.org> writes:
> On Wed, 2008-01-09 at 11:11 -0500, Tom Lane wrote:
> > Interesting, but would that affect the rate at which userland code consumes stack space?
> It'll certainly affect the way it _allocates_ stack space.
> > Since posting, I've verified that it still fails at STACK_MIN_SIZE = 48K,
> That's still less than a single page. Did you try 64KiB or 128KiB?
I think you're missing the context. The actual requested size of the thread stack is 192K or 256K (and I've also tried 512K, without any better luck). I could see kernel page size affecting the result of that allocation request, but all the values are multiples of 64K already.
The problem I'm having is with some code that tries to detect how much stack space has been consumed, and error out if it's too much, where "too much" means the requested stack size minus STACK_MIN_SIZE. So that constant needs to be set large enough to ensure that the error recovery code itself can execute without overflowing the stack and incurring SIGSEGV.
My first assumption was that something was using a tad more stack space than before, but given the latest failed test I'm starting to think that something's been broken about the stack-depth-testing logic itself.
regards, tom lane
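The stack-depth check described in the last message can be modelled with a few lines of arithmetic. This is only a sketch of the rule as stated in the thread (refuse to go deeper once usage reaches the requested stack size minus STACK_MIN_SIZE), not MySQL's actual code. The function names and the choice of ">=" at the boundary are assumptions made for illustration.

```python
KIB = 1024


def usable_stack(thread_stack: int, stack_min_size: int) -> int:
    """Bytes of stack that may be used before the check trips."""
    if stack_min_size >= thread_stack:
        raise ValueError("margin must be smaller than the thread stack")
    return thread_stack - stack_min_size


def stack_overrun(stack_used: int, thread_stack: int, stack_min_size: int) -> bool:
    """Return True when usage has reached the limit and the caller should error out.

    The margin (STACK_MIN_SIZE in the thread) is what is left over for the
    error-recovery code to run in. Using >= rather than > is an assumption.
    """
    return stack_used >= usable_stack(thread_stack, stack_min_size)


if __name__ == "__main__":
    thread_stack = 192 * KIB  # one of the requested sizes mentioned above
    for margin_kib in (16, 24, 48):
        limit = usable_stack(thread_stack, margin_kib * KIB)
        print(f"margin {margin_kib}K -> check trips at {limit} bytes used")
```

In this model a larger margin only makes the check trip earlier. It cannot help if the measured stack usage is itself wrong, which is the possibility raised at the end of the message above.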