On Mon, 21 Jan 2019 18:48:04 -0500 Nate Pearlstein darknater@gmail.com wrote:
I normally run w/o quiet and rhgb anyway. I added earlyprintk=vga and it’s clear the system panics early. I tried adding boot_delay=500 and also boot_delay=10 to try to capture the spew with my phone camera recording at 60fps. Only with boot_delay left off can I see the panic, but then the output scrolls by faster than 60fps can capture.
From what I can piece together without using a serial console and capturing from another host:
kernel BUG at mm/page_alloc.c:791!
Invalid opcode: 0000 [#10 SMP PTI] (not sure about this too jumbled)
I can’t really see the stack trace either
__free_page_ok
free_all_bootmem
mem_init
start_kernel
secondary_startup_64
[1.860030] free_one_page
RIP: 0010:free_one_page
[1.863221] Code: 08 0e 03 00 0f 0b 48 89 da be 0c 00 00 00 4c 89 ff e8 56 02 00 e9 9c fb ff ff 48 c7 c6 08 86 0d 92 4c 89 f7 e8 e2 0d 03 00 <0f> ob 48 c6 30 86 0d 92 48 89 df e8 d1 0d 03 00 0f 0b 31 d2 e9
[1.872806] RSP: 0000:ffffffff92203e20 EFLAGS: 00010046
.
.
[1.923827] Kernel panic - not syncing
Samuel might be able to decipher this, but I have an off-the-wall idea. Kernels get bigger with each release. I wonder if there is a memory problem that the earlier kernels don't trigger but the larger kernels do. Have you run a memory test?
The other thing to try is re-installing the kernel. A really long shot, but worth a try.
And maybe it is a kernel bug. The line you are referring to is VM_BUG_ON_PAGE(bad_range(zone, page), page); and it occurs when trying to deallocate a page.
static inline void __free_one_page(struct page *page, unsigned long pfn, struct zone *zone, unsigned int order, int migratetype) {
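I interpret the errors as saying that the kernel is trying to deallocate a page and trips the check above. On x86 a kernel BUG is reported as an invalid opcode trap, because the check executes a deliberately invalid instruction (ud2, the 0f 0b bytes in the Code: line); the 0000 is the trap's error code, not the opcode itself. But is the bad page coming from the kernel's own bookkeeping, or is the kernel reading a bad location?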
I think it has to be something about your hardware, because if the kernel were actually having trouble deallocating pages on every boot, this would be a well-known problem. Maybe you have hit a corner case. You could open a bugzilla, but it will be difficult for someone to fix this without your hardware to replicate the crash or the complete crash output.
The 4.20 kernel series is not far away from coming to stable. You could either grab a build from koji (https://koji.fedoraproject.org/koji/packageinfo?packageID=8) or use an older kernel until 4.20 is released. It might fix the issue as a side effect of other changes.