This patch changes the trigger for restarting the kdump service from cpu online/offline events to cpu add/remove events.
Some people have complained that when they run cpu online/offline tests at high frequency, kdump restarts just as frequently and systemd eventually disables the service. As a temporary fix, we committed a patch to never disable the kdump service.
In general it is probably a good idea to restart the kdump service on cpu add/remove events instead.
Toshi Kani confirmed the following.
- The file /sys/devices/system/cpu/cpuX/crash_notes is created before the ADD event goes out. That means we cannot miss creating ELF notes for a newly added cpu.
- For a REMOVE event, the files under /sys/devices/system/cpu/cpuX/ are removed first and then the REMOVE event goes out. That means we will remove the ELF note header for the removed cpu.
- There are some race conditions, such as a cpu being removed and the system crashing before the kdump service restarts. In that case vmcore.c has to be robust enough to inspect ELF notes and discard empty ones.
Also, it is possible that after a cpu is removed, the crash notes memory gets reused for something else, so after a crash vmcore.c might see random data. It does basic size checks and discards ELF notes if the checks don't pass.
The above race conditions can happen even with the OFFLINE event, and there is no good way to eliminate them altogether. So making vmcore.c more robust is the right solution here.
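To illustrate the kind of check described above, here is a minimal userspace sketch in C. It is not the kernel's vmcore.c code; the function names (note_align, checked_note_size, count_valid_notes) are hypothetical, and it assumes 64-bit ELF notes with 4-byte alignment of the name and descriptor. It walks a per-cpu notes buffer, stops at an empty note, and rejects notes whose sizes do not fit in the buffer.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round up to the 4-byte alignment assumed for note name and descriptor. */
static uint64_t note_align(uint64_t n)
{
    return (n + 3) & ~(uint64_t)3;
}

/*
 * Hypothetical helper (not taken from vmcore.c): return the total size of
 * the note starting at buf if it passes basic checks, or 0 if the note is
 * empty or its sizes do not fit in the avail bytes that remain.
 */
static size_t checked_note_size(const unsigned char *buf, size_t avail)
{
    Elf64_Nhdr nhdr;
    uint64_t total;

    if (avail < sizeof(nhdr))
        return 0;
    memcpy(&nhdr, buf, sizeof(nhdr));   /* buf may be unaligned */

    if (nhdr.n_namesz == 0)             /* empty note, never filled in */
        return 0;

    total = sizeof(nhdr) + note_align(nhdr.n_namesz) +
            note_align(nhdr.n_descsz);
    if (total > avail)                  /* sizes look like random data */
        return 0;

    return (size_t)total;
}

/* Count the leading notes in buf that pass the checks above. */
size_t count_valid_notes(const unsigned char *buf, size_t len)
{
    size_t off = 0, count = 0;

    assert(buf != NULL);
    while (off < len) {
        size_t sz = checked_note_size(buf + off, len - off);

        if (sz == 0)
            break;                      /* discard this note and the rest */
        off += sz;
        count++;
    }
    return count;
}
```

With a check of this shape, a zero-filled buffer (a cpu that never wrote its notes) yields zero valid notes, and so does a buffer whose header carries implausible sizes, which is the behaviour the race conditions above call for.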
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 98-kexec.rules | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Index: kexec-tools-fedora/98-kexec.rules
===================================================================
--- kexec-tools-fedora.orig/98-kexec.rules 2014-06-03 13:19:04.813120747 -0400
+++ kexec-tools-fedora/98-kexec.rules 2014-09-04 10:59:59.093304225 -0400
@@ -1,4 +1,4 @@
-SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"
-SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump.service"
+SUBSYSTEM=="cpu", ACTION=="add", PROGRAM="/bin/systemctl try-restart kdump.service"
+SUBSYSTEM=="cpu", ACTION=="remove", PROGRAM="/bin/systemctl try-restart kdump.service"
 SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"
 SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump.service"
Hi Chao,
Can you review and pull in this patch?
Thanks,
Vivek
On 09/05/14 at 04:16pm, Vivek Goyal wrote:
> This patch changes the trigger for restarting the kdump service from cpu online/offline events to cpu add/remove events.
>
> Some people have complained that when they run cpu online/offline tests at high frequency, kdump restarts just as frequently and systemd eventually disables the service. As a temporary fix, we committed a patch to never disable the kdump service.
>
> In general it is probably a good idea to restart the kdump service on cpu add/remove events instead.
>
> Toshi Kani confirmed the following.
>
> - The file /sys/devices/system/cpu/cpuX/crash_notes is created before the ADD event goes out. That means we cannot miss creating ELF notes for a newly added cpu.
> - For a REMOVE event, the files under /sys/devices/system/cpu/cpuX/ are removed first and then the REMOVE event goes out. That means we will remove the ELF note header for the removed cpu.
> - There are some race conditions, such as a cpu being removed and the system crashing before the kdump service restarts. In that case vmcore.c has to be robust enough to inspect ELF notes and discard empty ones.
>
> Also, it is possible that after a cpu is removed, the crash notes memory gets reused for something else, so after a crash vmcore.c might see random data. It does basic size checks and discards ELF notes if the checks don't pass.
>
> The above race conditions can happen even with the OFFLINE event, and there is no good way to eliminate them altogether. So making vmcore.c more robust is the right solution here.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Restarting the kdump service on ADD/REMOVE seems more reliable. And because vmcore can discard empty notes at runtime, we don't have to rebuild the ELF notes.
Acked-by: WANG Chao <chaowang@redhat.com>