[PATCH 0/4] Add lvm2 thin provision support for kdump

[PATCH] kdump-lib: use non-debug...

Adding entry to...

Tao Liu

Sunday, 15 May 2022 Sun, 15 May '22

10:25 p.m.

Thin provision is a mechanism that you can allocate a lvm volume which has a large virtual size for file systems but actually in a small physical size. The physical size can be autoextended in use if thin pool reached a threshold specified in /etc/lvm/lvm.conf. There are 3 works should be handled when enable lvm2 thinp for kdump: 1) Check if the dump target device or directory is thinp device. 2) Monitor the thin pool and autoextend its size when it reached the threshold during kdump. 3) If thin pool size-autoextend fails, the user space program will not know due to buffered IO. So the lvm2 thinp dump target should be mounted with sync flag, making user space program know immediately if write fails. According to my testing, the memory consumption procedure for lvm2 thinp is the thin pool size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should be handled particularly. Please review. Tao Liu (4): Add lvm2 thin provision dump target checker Add lvm2-monitor.service for kdump when lvm2 thinp enabled lvm.conf should be check modified if lvm2 thinp enabled Mount the lvm2 thinp target with sync option dracut-lvm2-monitor.service | 15 +++++++++++++++ dracut-module-setup.sh | 16 ++++++++++++++++ kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ kdumpctl | 1 + kexec-tools.spec | 2 ++ mkdumprd | 5 +++++ 6 files changed, 64 insertions(+) create mode 100644 dracut-lvm2-monitor.service -- 2.33.1

Show replies by date

Tao Liu

Sunday, 15 May Sun, 15 May

10:25 p.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

We need to check if a directory or a device is lvm2 thinp target. First, we use get_block_dump_target() to convert dump path into block device, then we check if the device is in the output of cmd lvs. If the device is lvm2 thinp, whose attribute given by lvs will be '/^.{6}t'. And the device's full path can be one of: 1) /dev/mapper/vg-thinp, 2) /dev/vg/thinp. Signed-off-by: Tao Liu <ltao(a)redhat.com> --- kdump-lib-initramfs.sh | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh index 84e6bf7..149aa49 100755 --- a/kdump-lib-initramfs.sh +++ b/kdump-lib-initramfs.sh @@ -131,3 +131,27 @@ is_fs_dump_target() { [ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix")" ] } + +is_lvm2_thinp_device() +{ + local _device_path=$1 + local _lvm2_thinp_volumes=($(lvs 2>/dev/null | + awk '{if ($3 ~ /^.{6}t/) {printf("%s-%s %s/%s ", $2, $1, $2, $1);}}')) + + for _v in ${_lvm2_thinp_volumes[@]}; do + [[ "/dev/mapper/$_v" == $_device_path || + "/dev/$_v" == $_device_path ]] && + return 0 + done + return 1 +} + +is_lvm2_thinp_dump_target() +{ + local _target=$(get_block_dump_target) + if [ -n $_target ]; then + is_lvm2_thinp_device $_target + else + return 1 + fi +} \ No newline at end of file -- 2.33.1

Coiby Xu

Tuesday, 17 May Tue, 17 May

9:11 p.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

Hi Tao, kdump-lib-initramfs.sh is supposed to be POSIX-compatible. As caught by shellspec [1], we can't use local or arrays in POSIX sh. You may need to fix other small issues caught by shellspec as well. [1] https://gitlab.com/coxu/fedora-kexec-tools/-/jobs/2461492476 On Mon, May 16, 2022 at 11:25:51AM +0800, Tao Liu wrote:

...

-- Best regards, Coiby

Tao Liu

9:37 p.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

On Wed, May 18, 2022 at 10:15 AM Coiby Xu <coxu(a)redhat.com> wrote:

...

OK, I think I will post 2 separate patches for this: one for fix POSIX-compatible issue specifically for this patch set. The other for fix POSIX-compatible issues for the rest. Thanks, Tao Liu

...

[1] https://gitlab.com/coxu/fedora-kexec-tools/-/jobs/2461492476 On Mon, May 16, 2022 at 11:25:51AM +0800, Tao Liu wrote: >We need to check if a directory or a device is lvm2 thinp target. > >First, we use get_block_dump_target() to convert dump path into >block device, then we check if the device is in the output of >cmd lvs. If the device is lvm2 thinp, whose attribute given by >lvs will be '/^.{6}t'. And the device's full path can be one of: >1) /dev/mapper/vg-thinp, 2) /dev/vg/thinp. > >Signed-off-by: Tao Liu <ltao(a)redhat.com> >--- > kdump-lib-initramfs.sh | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > >diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh >index 84e6bf7..149aa49 100755 >--- a/kdump-lib-initramfs.sh >+++ b/kdump-lib-initramfs.sh >@@ -131,3 +131,27 @@ is_fs_dump_target() > { > [ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix")" ] > } >+ >+is_lvm2_thinp_device() >+{ >+ local _device_path=$1 >+ local _lvm2_thinp_volumes=($(lvs 2>/dev/null | >+ awk '{if ($3 ~ /^.{6}t/) {printf("%s-%s %s/%s ", $2, $1, $2, $1);}}')) >+ >+ for _v in ${_lvm2_thinp_volumes[@]}; do >+ [[ "/dev/mapper/$_v" == $_device_path || >+ "/dev/$_v" == $_device_path ]] && >+ return 0 >+ done >+ return 1 >+} >+ >+is_lvm2_thinp_dump_target() >+{ >+ local _target=$(get_block_dump_target) >+ if [ -n $_target ]; then >+ is_lvm2_thinp_device $_target >+ else >+ return 1 >+ fi >+} >\ No newline at end of file >-- >2.33.1 > -- Best regards, Coiby

Tao Liu

Tuesday, 24 May Tue, 24 May

9:31 a.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

Hi Coiby, On Wed, May 18, 2022 at 10:37 AM Tao Liu <ltao(a)redhat.com> wrote:

...

On Wed, May 18, 2022 at 10:15 AM Coiby Xu <coxu(a)redhat.com> wrote: > > Hi Tao, > > kdump-lib-initramfs.sh is supposed to be POSIX-compatible. As caught by > shellspec [1], we can't use local or arrays in POSIX sh. You may need to > fix other small issues caught by shellspec as well.

I see from your gitlab shellcheck results, there are plenty of errors, and I can reproduce them locally by "$ shellcheck *.sh spec/*.sh kdumpctl mk*dumprd". Seems these checking errors existed for a long time, and I remember kasong had sent a big patch set to address the similar issue before. Just for curiosity, should the errors have been noticed then, or did we use a different shellcheck cmdline? Thanks, Tao Liu

...

OK, I think I will post 2 separate patches for this: one for fix POSIX-compatible issue specifically for this patch set. The other for fix POSIX-compatible issues for the rest. Thanks, Tao Liu > > [1] https://gitlab.com/coxu/fedora-kexec-tools/-/jobs/2461492476 > > On Mon, May 16, 2022 at 11:25:51AM +0800, Tao Liu wrote: > >We need to check if a directory or a device is lvm2 thinp target. > > > >First, we use get_block_dump_target() to convert dump path into > >block device, then we check if the device is in the output of > >cmd lvs. If the device is lvm2 thinp, whose attribute given by > >lvs will be '/^.{6}t'. And the device's full path can be one of: > >1) /dev/mapper/vg-thinp, 2) /dev/vg/thinp. > > > >Signed-off-by: Tao Liu <ltao(a)redhat.com> > >--- > > kdump-lib-initramfs.sh | 24 ++++++++++++++++++++++++ > > 1 file changed, 24 insertions(+) > > > >diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh > >index 84e6bf7..149aa49 100755 > >--- a/kdump-lib-initramfs.sh > >+++ b/kdump-lib-initramfs.sh > >@@ -131,3 +131,27 @@ is_fs_dump_target() > > { > > [ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix")" ] > > } > >+ > >+is_lvm2_thinp_device() > >+{ > >+ local _device_path=$1 > >+ local _lvm2_thinp_volumes=($(lvs 2>/dev/null | > >+ awk '{if ($3 ~ /^.{6}t/) {printf("%s-%s %s/%s ", $2, $1, $2, $1);}}')) > >+ > >+ for _v in ${_lvm2_thinp_volumes[@]}; do > >+ [[ "/dev/mapper/$_v" == $_device_path || > >+ "/dev/$_v" == $_device_path ]] && > >+ return 0 > >+ done > >+ return 1 > >+} > >+ > >+is_lvm2_thinp_dump_target() > >+{ > >+ local _target=$(get_block_dump_target) > >+ if [ -n $_target ]; then > >+ is_lvm2_thinp_device $_target > >+ else > >+ return 1 > >+ fi > >+} > >\ No newline at end of file > >-- > >2.33.1 > > > > -- > Best regards, > Coiby >

Coiby Xu

8:39 p.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

On Tue, May 24, 2022 at 10:31:17PM +0800, Tao Liu wrote:

...

Hi Coiby, On Wed, May 18, 2022 at 10:37 AM Tao Liu <ltao(a)redhat.com> wrote: > > On Wed, May 18, 2022 at 10:15 AM Coiby Xu <coxu(a)redhat.com> wrote: > > > > Hi Tao, > > > > kdump-lib-initramfs.sh is supposed to be POSIX-compatible. As caught by > > shellspec [1], we can't use local or arrays in POSIX sh. You may need to > > fix other small issues caught by shellspec as well. > I see from your gitlab shellcheck results, there are plenty of errors, and I can reproduce them locally by "$ shellcheck *.sh spec/*.sh kdumpctl mk*dumprd". Seems these checking errors existed for a long time, and I remember kasong had sent a big patch set to address the similar issue before. Just for curiosity, should the errors have been noticed then, or did we use a different shellcheck cmdline?

Kairui's huge patch only addressed some of the errors. There are still many remaining shellcheck errors. My plan is to pay off the technical debt gradually. As long we don't introduce new errors in new patches, it should be fine.

...

Thanks, Tao Liu > OK, I think I will post 2 separate patches for this: > > one for fix POSIX-compatible issue specifically for this patch set. > The other for fix POSIX-compatible issues for the rest. > > Thanks, > Tao Liu > > > > > [1] https://gitlab.com/coxu/fedora-kexec-tools/-/jobs/2461492476 > > > > On Mon, May 16, 2022 at 11:25:51AM +0800, Tao Liu wrote: > > >We need to check if a directory or a device is lvm2 thinp target. > > > > > >First, we use get_block_dump_target() to convert dump path into > > >block device, then we check if the device is in the output of > > >cmd lvs. If the device is lvm2 thinp, whose attribute given by > > >lvs will be '/^.{6}t'. And the device's full path can be one of: > > >1) /dev/mapper/vg-thinp, 2) /dev/vg/thinp. > > > > > >Signed-off-by: Tao Liu <ltao(a)redhat.com> > > >--- > > > kdump-lib-initramfs.sh | 24 ++++++++++++++++++++++++ > > > 1 file changed, 24 insertions(+) > > > > > >diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh > > >index 84e6bf7..149aa49 100755 > > >--- a/kdump-lib-initramfs.sh > > >+++ b/kdump-lib-initramfs.sh > > >@@ -131,3 +131,27 @@ is_fs_dump_target() > > > { > > > [ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix")" ] > > > } > > >+ > > >+is_lvm2_thinp_device() > > >+{ > > >+ local _device_path=$1 > > >+ local _lvm2_thinp_volumes=($(lvs 2>/dev/null | > > >+ awk '{if ($3 ~ /^.{6}t/) {printf("%s-%s %s/%s ", $2, $1, $2, $1);}}')) > > >+ > > >+ for _v in ${_lvm2_thinp_volumes[@]}; do > > >+ [[ "/dev/mapper/$_v" == $_device_path || > > >+ "/dev/$_v" == $_device_path ]] && > > >+ return 0 > > >+ done > > >+ return 1 > > >+} > > >+ > > >+is_lvm2_thinp_dump_target() > > >+{ > > >+ local _target=$(get_block_dump_target) > > >+ if [ -n $_target ]; then > > >+ is_lvm2_thinp_device $_target > > >+ else > > >+ return 1 > > >+ fi > > >+} > > >\ No newline at end of file > > >-- > > >2.33.1 > > > > > > > -- > > Best regards, > > Coiby > >

-- Best regards, Coiby

Tao Liu

9:15 p.m.

New subject: [PATCH 1/4] Add lvm2 thin provision dump target checker

On Wed, May 25, 2022 at 9:40 AM Coiby Xu <coxu(a)redhat.com> wrote:

...

On Tue, May 24, 2022 at 10:31:17PM +0800, Tao Liu wrote: >Hi Coiby, > >On Wed, May 18, 2022 at 10:37 AM Tao Liu <ltao(a)redhat.com> wrote: >> >> On Wed, May 18, 2022 at 10:15 AM Coiby Xu <coxu(a)redhat.com> wrote: >> > >> > Hi Tao, >> > >> > kdump-lib-initramfs.sh is supposed to be POSIX-compatible. As caught by >> > shellspec [1], we can't use local or arrays in POSIX sh. You may need to >> > fix other small issues caught by shellspec as well. >> > >I see from your gitlab shellcheck results, there are plenty of errors, and I can >reproduce them locally by "$ shellcheck *.sh spec/*.sh kdumpctl mk*dumprd". > >Seems these checking errors existed for a long time, and I remember kasong >had sent a big patch set to address the similar issue before. Just for >curiosity, >should the errors have been noticed then, or did we use a different shellcheck >cmdline? Kairui's huge patch only addressed some of the errors. There are still many remaining shellcheck errors. My plan is to pay off the technical debt gradually. As long we don't introduce new errors in new patches, it should be fine.

OK, thanks for the explanation! I will try to solve them, since I'm not quite familiar with shellcheck, it may take a while for me to figure out the useful or false alarms. Thanks, Tao Liu

...

> >Thanks, >Tao Liu > > >> OK, I think I will post 2 separate patches for this: >> >> one for fix POSIX-compatible issue specifically for this patch set. >> The other for fix POSIX-compatible issues for the rest. >> >> Thanks, >> Tao Liu >> >> > >> > [1] https://gitlab.com/coxu/fedora-kexec-tools/-/jobs/2461492476 >> > >> > On Mon, May 16, 2022 at 11:25:51AM +0800, Tao Liu wrote: >> > >We need to check if a directory or a device is lvm2 thinp target. >> > > >> > >First, we use get_block_dump_target() to convert dump path into >> > >block device, then we check if the device is in the output of >> > >cmd lvs. If the device is lvm2 thinp, whose attribute given by >> > >lvs will be '/^.{6}t'. And the device's full path can be one of: >> > >1) /dev/mapper/vg-thinp, 2) /dev/vg/thinp. >> > > >> > >Signed-off-by: Tao Liu <ltao(a)redhat.com> >> > >--- >> > > kdump-lib-initramfs.sh | 24 ++++++++++++++++++++++++ >> > > 1 file changed, 24 insertions(+) >> > > >> > >diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh >> > >index 84e6bf7..149aa49 100755 >> > >--- a/kdump-lib-initramfs.sh >> > >+++ b/kdump-lib-initramfs.sh >> > >@@ -131,3 +131,27 @@ is_fs_dump_target() >> > > { >> > > [ -n "$(kdump_get_conf_val "ext[234]\|xfs\|btrfs\|minix")" ] >> > > } >> > >+ >> > >+is_lvm2_thinp_device() >> > >+{ >> > >+ local _device_path=$1 >> > >+ local _lvm2_thinp_volumes=($(lvs 2>/dev/null | >> > >+ awk '{if ($3 ~ /^.{6}t/) {printf("%s-%s %s/%s ", $2, $1, $2, $1);}}')) >> > >+ >> > >+ for _v in ${_lvm2_thinp_volumes[@]}; do >> > >+ [[ "/dev/mapper/$_v" == $_device_path || >> > >+ "/dev/$_v" == $_device_path ]] && >> > >+ return 0 >> > >+ done >> > >+ return 1 >> > >+} >> > >+ >> > >+is_lvm2_thinp_dump_target() >> > >+{ >> > >+ local _target=$(get_block_dump_target) >> > >+ if [ -n $_target ]; then >> > >+ is_lvm2_thinp_device $_target >> > >+ else >> > >+ return 1 >> > >+ fi >> > >+} >> > >\ No newline at end of file >> > >-- >> > >2.33.1 >> > > >> > >> > -- >> > Best regards, >> > Coiby >> > > -- Best regards, Coiby

Tao Liu

Sunday, 15 May Sun, 15 May

10:25 p.m.

New subject: [PATCH 2/4] Add lvm2-monitor.service for kdump when lvm2 thinp enabled

If lvm2 thinp is enabled in kdump, lvm2-monitor.service is needed for monitor and autoextend the size of thin pool. Otherwise the vmcore dumped to a no-enough-space target will be incomplete and unable for further analysis. In this patch, lvm2-monitor.service will be started before kdump-capture .service for 2nd kernel, then be stopped in kdump post.d phase. So the thin pool monitoring and size-autoextend can be ensured during kdump. Signed-off-by: Tao Liu <ltao(a)redhat.com> --- dracut-lvm2-monitor.service | 15 +++++++++++++++ dracut-module-setup.sh | 16 ++++++++++++++++ kexec-tools.spec | 2 ++ 3 files changed, 33 insertions(+) create mode 100644 dracut-lvm2-monitor.service diff --git a/dracut-lvm2-monitor.service b/dracut-lvm2-monitor.service new file mode 100644 index 0000000..88e79e1 --- /dev/null +++ b/dracut-lvm2-monitor.service @@ -0,0 +1,15 @@ +[Unit] +Description=Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling +Documentation=man:dmeventd(8) man:lvcreate(8) man:lvchange(8) man:vgchange(8) +After=initrd.target initrd-parse-etc.service sysroot.mount +After=dracut-initqueue.service dracut-pre-mount.service dracut-mount.service dracut-pre-pivot.service +Before=initrd-cleanup.service kdump-capture.service shutdown.target local-fs-pre.target +DefaultDependencies=no +Conflicts=shutdown.target + +[Service] +Type=oneshot +Environment=LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES=1 +ExecStart=/usr/sbin/lvm vgchange --monitor y +ExecStop=/usr/sbin/lvm vgchange --monitor n +RemainAfterExit=yes \ No newline at end of file diff --git a/dracut-module-setup.sh b/dracut-module-setup.sh index c319fc2..f7d1202 100755 --- a/dracut-module-setup.sh +++ b/dracut-module-setup.sh @@ -1016,6 +1016,20 @@ remove_cpu_online_rule() { sed -i '/SUBSYSTEM=="cpu"/d' "$file" } +kdump_install_lvm2_monitor_service() +{ + inst "$moddir/lvm2-monitor.service" "$systemdsystemunitdir/lvm2-monitor.service" + systemctl -q --root "$initdir" add-wants initrd.target lvm2-monitor.service + + # We should stop lvm2-monitor service after kdump. SIGTERM is ignored + # by dmeventd when device is monitored. So before stopping dmevend, devices + # shall be unmonitored. This can save the waiting time between systemd-shutdown + # Sending SIGTERM and SIGKILL to remaining processes. + mkdir -p ${initdir}/etc/kdump/post.d + echo "systemctl stop lvm2-monitor" > ${initdir}/etc/kdump/post.d/stop-lvm2-monitor.sh + chmod +x ${initdir}/etc/kdump/post.d/stop-lvm2-monitor.sh +} + install() { local arch @@ -1058,6 +1072,8 @@ install() { inst "$moddir/kdump.sh" "/usr/bin/kdump.sh" inst "$moddir/kdump-capture.service" "$systemdsystemunitdir/kdump-capture.service" systemctl -q --root "$initdir" add-wants initrd.target kdump-capture.service + is_lvm2_thinp_dump_target && + kdump_install_lvm2_monitor_service # Replace existing emergency service and emergency target cp "$moddir/kdump-emergency.service" "$initdir/$systemdsystemunitdir/emergency.service" cp "$moddir/kdump-emergency.target" "$initdir/$systemdsystemunitdir/emergency.target" diff --git a/kexec-tools.spec b/kexec-tools.spec index 6673000..5f4344d 100644 --- a/kexec-tools.spec +++ b/kexec-tools.spec @@ -60,6 +60,7 @@ Source109: dracut-early-kdump-module-setup.sh Source200: dracut-fadump-init-fadump.sh Source201: dracut-fadump-module-setup.sh +Source202: dracut-lvm2-monitor.service %ifarch ppc64 ppc64le Requires(post): servicelog @@ -240,6 +241,7 @@ cp %{SOURCE102} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpb cp %{SOURCE104} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE104}} cp %{SOURCE106} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE106}} cp %{SOURCE107} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE107}} +cp %{SOURCE202} $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE202}} chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE100}} chmod 755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99kdumpbase/%{remove_dracut_prefix %{SOURCE101}} mkdir -p -m755 $RPM_BUILD_ROOT/etc/kdump-adv-conf/kdump_dracut_modules/99earlykdump -- 2.33.1

Tao Liu

10:25 p.m.

New subject: [PATCH 3/4] lvm.conf should be check modified if lvm2 thinp enabled

lvm2 relies on /etc/lvm/lvm.conf to determine its behaviour. The important configs such as thin_pool_autoextend_threshold and thin_pool_autoextend_percent will be used during kdump in 2nd kernel. So if the file is modified, the initramfs should be rebuild to include the latest. Signed-off-by: Tao Liu <ltao(a)redhat.com> --- kdump-lib-initramfs.sh | 1 + kdumpctl | 1 + 2 files changed, 2 insertions(+) diff --git a/kdump-lib-initramfs.sh b/kdump-lib-initramfs.sh index 149aa49..e14a765 100755 --- a/kdump-lib-initramfs.sh +++ b/kdump-lib-initramfs.sh @@ -8,6 +8,7 @@ DEFAULT_SSHKEY="/root/.ssh/kdump_id_rsa" KDUMP_CONFIG_FILE="/etc/kdump.conf" FENCE_KDUMP_CONFIG_FILE="/etc/sysconfig/fence_kdump" FENCE_KDUMP_SEND="/usr/libexec/fence_kdump_send" +LVM_CONF="/etc/lvm/lvm.conf" # Read kdump config in well formated style kdump_read_conf() diff --git a/kdumpctl b/kdumpctl index 6188d47..b157eb8 100755 --- a/kdumpctl +++ b/kdumpctl @@ -383,6 +383,7 @@ check_files_modified() # HOOKS is mandatory and need to check the modification time files="$files $HOOKS" + is_lvm2_thinp_dump_target && files="$files $LVM_CONF" check_exist "$files" && check_executable "$EXTRA_BINS" || return 2 for file in $files; do -- 2.33.1

Tao Liu

10:25 p.m.

New subject: [PATCH 4/4] Mount the lvm2 thinp target with sync option

If no sync flag specified and mount the dump device with default option, it will be mounted with async flag. User space programs such as makedumpfile, when trying to write vmcore to thinp target, will not know if the autoextend of thin pool fails: [ 3.895165] kdump[703]: saving vmcore [ 4.084815] device-mapper: thin: 253:2: reached low water mark for data device: sending event. [ 4.086282] device-mapper: thin: 253:2: switching pool to out-of-data-space (queue IO) mode [ 66.528180] device-mapper: thin: 253:2: switching pool to out-of-data-space (error IO) mode [ 66.531791] EXT4-fs warning (device dm-3): ext4_end_bio:323: I/O error 3 writing to inode 131075 (offset 5259264 size 2580480 starting block 36608) [ 66.536132] Buffer I/O error on device dm-3, logical block 36100 ... [ 66.557164] JBD2: Detected IO errors while flushing file data on dm-3-8 Copying data : [100.0 %] - eta: 0s [ 67.247209] kdump.sh[704]: The dumpfile is saved to /kdumproot/mnt///127.0.0.1-2022-04-07-17:41:31/vmcore-incomplete. [ 67.248769] kdump.sh[704]: makedumpfile Completed. [ 67.278042] JBD2: Detected IO errors while flushing file data on dm-3-8 [ 67.327795] kdump[709]: saving vmcore complete ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ User space program doesn't know incomplete saving In this patch, if we are dumping to lvm2 thinp target, we will mount the device with sync flag, so the user space program will know immediately if autoextend fails, so it can be handled properly. Signed-off-by: Tao Liu <ltao(a)redhat.com> --- mkdumprd | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mkdumprd b/mkdumprd index 3e250e0..a9c24c0 100644 --- a/mkdumprd +++ b/mkdumprd @@ -81,6 +81,11 @@ to_mount() fi fi + # mount lvm2 thinp target with sync option, so user process will know + # immediately if thin pool autoextend fails. + is_lvm2_thinp_dump_target $_target && + _sed_cmd+='s/$^\|,$async$$\|,$/\1/g;' && + _sed_cmd+='s/$/,sync/;' # mount fs target as rw in 2nd kernel _sed_cmd+='s/$^\|,$ro$$\|,$/\1rw\2/g;' # with 'noauto' in fstab nfs and non-root disk mount will fail in 2nd -- 2.33.1

Tao Liu

Monday, 16 May Mon, 16 May

2:30 a.m.

Add Mike to CC list Hi Mike, This patch set is related to bz2034457, could you please help to review it? Thank you very much! Thanks, Tao Liu [1]: https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.o... On Mon, May 16, 2022 at 11:26 AM Tao Liu <ltao(a)redhat.com> wrote:

...

Mike Snitzer

11:58 a.m.

On Mon, May 16 2022 at 3:30P -0400, Tao Liu <ltao(a)redhat.com> wrote:

...

Hi, I reviewed the 4 patches. The 1st patch doesn't seem ideal, you're parsing output of 'lvs 2>/dev/null' (all lvm2 lvs in the system) even though you already know the thin device you're searching for. Since you already have the device name, I discussed with Zdenek (cc'd) and we arrived at something like the following suggestion: local _device_path=$1 local _lvm2_thin_device=$(lvs -S 'lv_layout=sparse && lv_layout=thin' --nosuffix --noheadings -o vg_name,lv_name $_device_path) [ -n "$_lvm2_thin_device" ] return $? The 4th patch to mount with 'sync' so that failed IO (due to inability to extend thin lv) strikes me as fine, except couldn't other non-thin mounts also fail async writes? Don't you want kdump to fail on async devices too? What I'm saying is: seems async mount isn't safe (at least not in terms of proper reporting of kdump success) regardless of whether lvm2 thin device is used or not. Mike

...

On Mon, May 16, 2022 at 11:26 AM Tao Liu <ltao(a)redhat.com> wrote: > > Thin provision is a mechanism that you can allocate a lvm volume which has > a large virtual size for file systems but actually in a small physical > size. The physical size can be autoextended in use if thin pool reached a > threshold specified in /etc/lvm/lvm.conf. > > There are 3 works should be handled when enable lvm2 thinp for kdump: > > 1) Check if the dump target device or directory is thinp device. > 2) Monitor the thin pool and autoextend its size when it reached the threshold > during kdump. > 3) If thin pool size-autoextend fails, the user space program will not know due to > buffered IO. So the lvm2 thinp dump target should be mounted with sync flag, making > user space program know immediately if write fails. > > According to my testing, the memory consumption procedure for lvm2 thinp is the thin pool > size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But > for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should > be handled particularly. > > Please review. > > Tao Liu (4): > Add lvm2 thin provision dump target checker > Add lvm2-monitor.service for kdump when lvm2 thinp enabled > lvm.conf should be check modified if lvm2 thinp enabled > Mount the lvm2 thinp target with sync option > > dracut-lvm2-monitor.service | 15 +++++++++++++++ > dracut-module-setup.sh | 16 ++++++++++++++++ > kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ > kdumpctl | 1 + > kexec-tools.spec | 2 ++ > mkdumprd | 5 +++++ > 6 files changed, 64 insertions(+) > create mode 100644 dracut-lvm2-monitor.service > > -- > 2.33.1 >

Tao Liu

Tuesday, 17 May Tue, 17 May

1:34 a.m.

Hi Mike, On Tue, May 17, 2022 at 12:58 AM Mike Snitzer <snitzer(a)redhat.com> wrote:

...

On Mon, May 16 2022 at 3:30P -0400, Tao Liu <ltao(a)redhat.com> wrote: > Add Mike to CC list > > Hi Mike, > > This patch set is related to bz2034457, could you please help to > review it? Thank you very much! > > Thanks, > Tao Liu > > [1]: https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.o... Hi, I reviewed the 4 patches. The 1st patch doesn't seem ideal, you're parsing output of 'lvs 2>/dev/null' (all lvm2 lvs in the system) even though you already know the thin device you're searching for. Since you already have the device name, I discussed with Zdenek (cc'd) and we arrived at something like the following suggestion: local _device_path=$1 local _lvm2_thin_device=$(lvs -S 'lv_layout=sparse && lv_layout=thin' --nosuffix --noheadings -o vg_name,lv_name $_device_path) [ -n "$_lvm2_thin_device" ] return $?

Thanks for the suggestion, it looks better! I will send v2 to include the change.

...

The 4th patch to mount with 'sync' so that failed IO (due to inability to extend thin lv) strikes me as fine, except couldn't other non-thin mounts also fail async writes? Don't you want kdump to fail on async devices too? What I'm saying is: seems async mount isn't safe (at least not in terms of proper reporting of kdump success) regardless of whether lvm2 thin device is used or not.

I'm not an expert to fs and IO. If the async IO fails in 2nd kernel for kdump, mostly the reason is insufficient file system space, and it worked well for kdump in the past. However as for the case of lvm2 thinp, I found the userspace program no longer gets informed in async when thin pool autoextend fails. So I turned to the sync flag, which force the userspace program to wait the date been synced to disk before exit, and it works well according to my test. But it does cost more writing time than async... Thanks, Tao Liu

...

Mike > > On Mon, May 16, 2022 at 11:26 AM Tao Liu <ltao(a)redhat.com> wrote: > > > > Thin provision is a mechanism that you can allocate a lvm volume which has > > a large virtual size for file systems but actually in a small physical > > size. The physical size can be autoextended in use if thin pool reached a > > threshold specified in /etc/lvm/lvm.conf. > > > > There are 3 works should be handled when enable lvm2 thinp for kdump: > > > > 1) Check if the dump target device or directory is thinp device. > > 2) Monitor the thin pool and autoextend its size when it reached the threshold > > during kdump. > > 3) If thin pool size-autoextend fails, the user space program will not know due to > > buffered IO. So the lvm2 thinp dump target should be mounted with sync flag, making > > user space program know immediately if write fails. > > > > According to my testing, the memory consumption procedure for lvm2 thinp is the thin pool > > size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But > > for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should > > be handled particularly. > > > > Please review. > > > > Tao Liu (4): > > Add lvm2 thin provision dump target checker > > Add lvm2-monitor.service for kdump when lvm2 thinp enabled > > lvm.conf should be check modified if lvm2 thinp enabled > > Mount the lvm2 thinp target with sync option > > > > dracut-lvm2-monitor.service | 15 +++++++++++++++ > > dracut-module-setup.sh | 16 ++++++++++++++++ > > kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ > > kdumpctl | 1 + > > kexec-tools.spec | 2 ++ > > mkdumprd | 5 +++++ > > 6 files changed, 64 insertions(+) > > create mode 100644 dracut-lvm2-monitor.service > > > > -- > > 2.33.1 > > >

Mike Snitzer

Wednesday, 18 May Wed, 18 May

10:48 a.m.

On Tue, May 17 2022 at 2:34P -0400, Tao Liu <ltao(a)redhat.com> wrote:

...

Hi Mike, On Tue, May 17, 2022 at 12:58 AM Mike Snitzer <snitzer(a)redhat.com> wrote: > > On Mon, May 16 2022 at 3:30P -0400, > Tao Liu <ltao(a)redhat.com> wrote: > > > Add Mike to CC list > > > > Hi Mike, > > > > This patch set is related to bz2034457, could you please help to > > review it? Thank you very much! > > > > Thanks, > > Tao Liu > > > > [1]: https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.o... > > Hi, > > I reviewed the 4 patches. > > The 1st patch doesn't seem ideal, you're parsing output of 'lvs > 2>/dev/null' (all lvm2 lvs in the system) even though you already know > the thin device you're searching for. > > Since you already have the device name, I discussed with Zdenek (cc'd) > and we arrived at something like the following suggestion: > > local _device_path=$1 > local _lvm2_thin_device=$(lvs -S 'lv_layout=sparse && lv_layout=thin' --nosuffix --noheadings -o vg_name,lv_name $_device_path) > > [ -n "$_lvm2_thin_device" ] > return $? > Thanks for the suggestion, it looks better! I will send v2 to include the change. > > The 4th patch to mount with 'sync' so that failed IO (due to inability > to extend thin lv) strikes me as fine, except couldn't other non-thin > mounts also fail async writes? Don't you want kdump to fail on async > devices too? What I'm saying is: seems async mount isn't safe (at > least not in terms of proper reporting of kdump success) regardless of > whether lvm2 thin device is used or not. > I'm not an expert to fs and IO. If the async IO fails in 2nd kernel for kdump, mostly the reason is insufficient file system space, and it worked well for kdump in the past. However as for the case of lvm2 thinp, I found the userspace program no longer gets informed in async when thin pool autoextend fails. So I turned to the sync flag, which force the userspace program to wait the date been synced to disk before exit, and it works well according to my test. But it does cost more writing time than async...

I've consulted Eric Sandeen (cc'd) and he agrees there is a more generic problem in the kdump userspace if it isn't able to detect write failures without using the "sync" mount option. kdump's job is to dump system memory as carefully as possible. Yet you're saying kdump is using buffered IO. Buffered IO creates additional memory use, and associated pages don't get written back until writeback kicks in.. hence the delayed nature of write failures. But those failures can happen with non-thinp block devices too. Seems logical that kdump should be using direct IO to write system memory back (rather than buffered IO). Again, using buffered IO creates more memory use -- so you're needlessly increasing memory reserve for kdump's use by using buffered IO. Please take a closer look at how to properly detect write failures. Doing so properly should make kdump work, as in detect write failure, on any storage if it runs out of space. Please see: https://lwn.net/Articles/457667/ Anything short of that and you're papering over a general kdump problem by making it seem like a thinp specific problem. Mike

...

> > On Mon, May 16, 2022 at 11:26 AM Tao Liu <ltao(a)redhat.com> wrote: > > > > > > Thin provision is a mechanism that you can allocate a lvm volume which has > > > a large virtual size for file systems but actually in a small physical > > > size. The physical size can be autoextended in use if thin pool reached a > > > threshold specified in /etc/lvm/lvm.conf. > > > > > > There are 3 works should be handled when enable lvm2 thinp for kdump: > > > > > > 1) Check if the dump target device or directory is thinp device. > > > 2) Monitor the thin pool and autoextend its size when it reached the threshold > > > during kdump. > > > 3) If thin pool size-autoextend fails, the user space program will not know due to > > > buffered IO. So the lvm2 thinp dump target should be mounted with sync flag, making > > > user space program know immediately if write fails. > > > > > > According to my testing, the memory consumption procedure for lvm2 thinp is the thin pool > > > size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But > > > for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should > > > be handled particularly. > > > > > > Please review. > > > > > > Tao Liu (4): > > > Add lvm2 thin provision dump target checker > > > Add lvm2-monitor.service for kdump when lvm2 thinp enabled > > > lvm.conf should be check modified if lvm2 thinp enabled > > > Mount the lvm2 thinp target with sync option > > > > > > dracut-lvm2-monitor.service | 15 +++++++++++++++ > > > dracut-module-setup.sh | 16 ++++++++++++++++ > > > kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ > > > kdumpctl | 1 + > > > kexec-tools.spec | 2 ++ > > > mkdumprd | 5 +++++ > > > 6 files changed, 64 insertions(+) > > > create mode 100644 dracut-lvm2-monitor.service > > > > > > -- > > > 2.33.1 > > > > > >

Eric Sandeen

10:59 a.m.

On 5/18/22 10:45 AM, Mike Snitzer wrote:

...

On Tue, May 17 2022 at 2:34P -0400, Tao Liu <ltao(a)redhat.com> wrote:

...

> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > for kdump, mostly the reason is insufficient file system space, and it > worked well for kdump in the past. However as for the case of lvm2 > thinp, I found the userspace program no longer gets informed in async > when thin pool autoextend fails. So I turned to the sync flag, which > force the userspace program to wait the date been synced to disk > before exit, and it works well according to my test. But it does cost > more writing time than async... I've consulted Eric Sandeen (cc'd) and he agrees there is a more generic problem in the kdump userspace if it isn't able to detect write failures without using the "sync" mount option. kdump's job is to dump system memory as carefully as possible. Yet you're saying kdump is using buffered IO. Buffered IO creates additional memory use, and associated pages don't get written back until writeback kicks in.. hence the delayed nature of write failures.

(cc: vivek, who may have thoughts on buffered IO vs direct IO)

...

But those failures can happen with non-thinp block devices too.

Exactly.

...

Seems logical that kdump should be using direct IO to write system memory back (rather than buffered IO). Again, using buffered IO creates more memory use -- so you're needlessly increasing memory reserve for kdump's use by using buffered IO. Please take a closer look at how to properly detect write failures. Doing so properly should make kdump work, as in detect write failure, on any storage if it runs out of space. Please see: https://lwn.net/Articles/457667/ Anything short of that and you're papering over a general kdump problem by making it seem like a thinp specific problem.

Yep. If you want to know if your buffered write succeeded or not, you have several options, including calling fsync() and handling any errors. The article above goes into much more detail. This should be done regardless of the storage type; there is no need to single out thinp here. IO can fail for any number of reasons, on any type of storage. You should always make the proper data integrity syscalls and do error handling if you care about the results of your buffered write() calls. Thanks, -Eric

...

Mike

Vivek Goyal

12:03 p.m.

On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote:

...

On 5/18/22 10:45 AM, Mike Snitzer wrote: > On Tue, May 17 2022 at 2:34P -0400, > Tao Liu <ltao(a)redhat.com> wrote: ... >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel >> for kdump, mostly the reason is insufficient file system space, and it >> worked well for kdump in the past. However as for the case of lvm2 >> thinp, I found the userspace program no longer gets informed in async >> when thin pool autoextend fails. So I turned to the sync flag, which >> force the userspace program to wait the date been synced to disk >> before exit, and it works well according to my test. But it does cost >> more writing time than async... > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > generic problem in the kdump userspace if it isn't able to detect > write failures without using the "sync" mount option. > > kdump's job is to dump system memory as carefully as possible. Yet > you're saying kdump is using buffered IO. Buffered IO creates > additional memory use, and associated pages don't get written back > until writeback kicks in.. hence the delayed nature of write failures. (cc: vivek, who may have thoughts on buffered IO vs direct IO)

[ cc Bao and Dave ]

...

> But those failures can happen with non-thinp block devices too. Exactly. > Seems logical that kdump should be using direct IO to write system > memory back (rather than buffered IO). Again, using buffered IO > creates more memory use -- so you're needlessly increasing memory > reserve for kdump's use by using buffered IO. > > Please take a closer look at how to properly detect write failures. > Doing so properly should make kdump work, as in detect write failure, > on any storage if it runs out of space. > > Please see: https://lwn.net/Articles/457667/ > > Anything short of that and you're papering over a general kdump > problem by making it seem like a thinp specific problem. Yep. If you want to know if your buffered write succeeded or not, you have several options, including calling fsync() and handling any errors. The article above goes into much more detail. This should be done regardless of the storage type; there is no need to single out thinp here. IO can fail for any number of reasons, on any type of storage. You should always make the proper data integrity syscalls and do error handling if you care about the results of your buffered write() calls.

Right. I think key thing is to call fsync() after saving vmcore is finished and based on the result of fsync() determine if file could make it to disk or not. If fsync() is not reporting errors properly, then that's an issue we should try to fix. I remember Jeff Layton had done fixes in this area to report errors if page writeback failed. I am not sure if kdump scripts call fsync() or not. I think that's the first thing we should verify. And if we are not doing it, fix it. Bao/Dave you probably are in best position to answer that. direct I/O, O_SYNC I/O these all are slow options. We also have a requirement to save dump ASAP and reboot back into the original kernel so that we don't keep the machine down for longer duration. So if problem is about error detection, fsync() should solve it. Using direct I/O or O_SYNC is an option user should be able to choose if they wish it. I don't think Kdump provides any mechanism to do direct I/O. But it might be allowing passing "sync" mount option so that every I/O is effectively will use O_SYNC. Bao, do I get it right. Thanks Vivek

Dave Young

10:58 p.m.

On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote:

...

On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: > On 5/18/22 10:45 AM, Mike Snitzer wrote: > > On Tue, May 17 2022 at 2:34P -0400, > > Tao Liu <ltao(a)redhat.com> wrote: > > ... > > >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > >> for kdump, mostly the reason is insufficient file system space, and it > >> worked well for kdump in the past. However as for the case of lvm2 > >> thinp, I found the userspace program no longer gets informed in async > >> when thin pool autoextend fails. So I turned to the sync flag, which > >> force the userspace program to wait the date been synced to disk > >> before exit, and it works well according to my test. But it does cost > >> more writing time than async... > > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > > generic problem in the kdump userspace if it isn't able to detect > > write failures without using the "sync" mount option. > > > > kdump's job is to dump system memory as carefully as possible. Yet > > you're saying kdump is using buffered IO. Buffered IO creates > > additional memory use, and associated pages don't get written back > > until writeback kicks in.. hence the delayed nature of write failures. > > (cc: vivek, who may have thoughts on buffered IO vs direct IO) [ cc Bao and Dave ] > > > But those failures can happen with non-thinp block devices too. > > Exactly. > > > Seems logical that kdump should be using direct IO to write system > > memory back (rather than buffered IO). Again, using buffered IO > > creates more memory use -- so you're needlessly increasing memory > > reserve for kdump's use by using buffered IO. > > > > Please take a closer look at how to properly detect write failures. > > Doing so properly should make kdump work, as in detect write failure, > > on any storage if it runs out of space. > > > > Please see: https://lwn.net/Articles/457667/ > > > > Anything short of that and you're papering over a general kdump > > problem by making it seem like a thinp specific problem. > > Yep. If you want to know if your buffered write succeeded or not, you > have several options, including calling fsync() and handling any errors. > The article above goes into much more detail. > > This should be done regardless of the storage type; there is no need to > single out thinp here. IO can fail for any number of reasons, on any type > of storage. You should always make the proper data integrity syscalls > and do error handling if you care about the results of your buffered write() > calls. Right. I think key thing is to call fsync() after saving vmcore is finished and based on the result of fsync() determine if file could make it to disk or not. If fsync() is not reporting errors properly, then that's an issue we should try to fix. I remember Jeff Layton had done fixes in this area to report errors if page writeback failed. I am not sure if kdump scripts call fsync() or not. I think that's the first thing we should verify. And if we are not doing it, fix it. Bao/Dave you probably are in best position to answer that.

Hi Vivek, Checking the kdump scripts, we have below, a separate "sync" command is added after saving vmcore, so it would be good since that cover all the core collectors, if we use fsync then we need to patch makedumpfile and cp, maybe it is not necessary: $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" _dump_exitcode=$? if [ $_dump_exitcode -eq 0 ]; then mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" sync dinfo "saving vmcore complete"

...

direct I/O, O_SYNC I/O these all are slow options. We also have a requirement to save dump ASAP and reboot back into the original kernel so that we don't keep the machine down for longer duration. So if problem is about error detection, fsync() should solve it. Using direct I/O or O_SYNC is an option user should be able to choose if they wish it. I don't think Kdump provides any mechanism to do direct I/O. But it might be allowing passing "sync" mount option so that every I/O is effectively will use O_SYNC. Bao, do I get it right.

Not sure if the sync mount option will cause vmcore saving slow down, since we are the only user in the kdump kernel so I suspect it will be fine, but it may need some actual testing.. Another thing is if we use sync io then the saving progress indicator will be more accurate, let's see how others think about this :)

...

Thanks Vivek

Thanks Dave

Tao Liu

Thursday, 19 May Thu, 19 May

2:42 a.m.

On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote:

...

On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote: > > On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: > > On 5/18/22 10:45 AM, Mike Snitzer wrote: > > > On Tue, May 17 2022 at 2:34P -0400, > > > Tao Liu <ltao(a)redhat.com> wrote: > > > > ... > > > > >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > > >> for kdump, mostly the reason is insufficient file system space, and it > > >> worked well for kdump in the past. However as for the case of lvm2 > > >> thinp, I found the userspace program no longer gets informed in async > > >> when thin pool autoextend fails. So I turned to the sync flag, which > > >> force the userspace program to wait the date been synced to disk > > >> before exit, and it works well according to my test. But it does cost > > >> more writing time than async... > > > > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > > > generic problem in the kdump userspace if it isn't able to detect > > > write failures without using the "sync" mount option. > > > > > > kdump's job is to dump system memory as carefully as possible. Yet > > > you're saying kdump is using buffered IO. Buffered IO creates > > > additional memory use, and associated pages don't get written back > > > until writeback kicks in.. hence the delayed nature of write failures. > > > > (cc: vivek, who may have thoughts on buffered IO vs direct IO) > > [ cc Bao and Dave ] > > > > > > But those failures can happen with non-thinp block devices too. > > > > Exactly. > > > > > Seems logical that kdump should be using direct IO to write system > > > memory back (rather than buffered IO). Again, using buffered IO > > > creates more memory use -- so you're needlessly increasing memory > > > reserve for kdump's use by using buffered IO. > > > > > > Please take a closer look at how to properly detect write failures. > > > Doing so properly should make kdump work, as in detect write failure, > > > on any storage if it runs out of space. > > > > > > Please see: https://lwn.net/Articles/457667/ > > > > > > Anything short of that and you're papering over a general kdump > > > problem by making it seem like a thinp specific problem. > > > > Yep. If you want to know if your buffered write succeeded or not, you > > have several options, including calling fsync() and handling any errors. > > The article above goes into much more detail. > > > > This should be done regardless of the storage type; there is no need to > > single out thinp here. IO can fail for any number of reasons, on any type > > of storage. You should always make the proper data integrity syscalls > > and do error handling if you care about the results of your buffered write() > > calls. > > Right. I think key thing is to call fsync() after saving vmcore is > finished and based on the result of fsync() determine if file could > make it to disk or not. > > If fsync() is not reporting errors properly, then that's an issue > we should try to fix. I remember Jeff Layton had done fixes in > this area to report errors if page writeback failed. > > I am not sure if kdump scripts call fsync() or not. I think that's > the first thing we should verify. And if we are not doing it, fix it. > Bao/Dave you probably are in best position to answer that. Hi Vivek,

Hi all,

...

Checking the kdump scripts, we have below, a separate "sync" command is added after saving vmcore, so it would be good since that cover all the core collectors, if we use fsync then we need to patch makedumpfile and cp, maybe it is not necessary: $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete"

^^^^^^^^^^^^^^^ There are a variety of collectors, user can specify one by setting /etc/kdump.conf , such as makedumpfile, cp etc. It will be impossible to adapt fsync() call for all of them. The doable way to me is to mount the destination device with sync flag, so collectors won't need to be modified for fsync().

...

_dump_exitcode=$?

The exit code is unreliable, 0 doesn't garentee the data been successfully written to disk when mount with async by default.

...

if [ $_dump_exitcode -eq 0 ]; then mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" sync

^^^^ If we have unflushed data, we reached here to flush it, how do we know if the flush is successful? cmd sync will call sync(), which is void according to man2 sync, so no exit code here for analysis. Any suggestions for how to improve the code? Thanks, Tao Liu

...

dinfo "saving vmcore complete" > > direct I/O, O_SYNC I/O these all are slow options. We also have a > requirement to save dump ASAP and reboot back into the original > kernel so that we don't keep the machine down for longer duration. > > So if problem is about error detection, fsync() should solve it. Using > direct I/O or O_SYNC is an option user should be able to choose if they > wish it. I don't think Kdump provides any mechanism to do direct I/O. But > it might be allowing passing "sync" mount option so that every I/O > is effectively will use O_SYNC. Bao, do I get it right. Not sure if the sync mount option will cause vmcore saving slow down, since we are the only user in the kdump kernel so I suspect it will be fine, but it may need some actual testing.. Another thing is if we use sync io then the saving progress indicator will be more accurate, let's see how others think about this :) > > Thanks > Vivek > Thanks Dave

Vivek Goyal

9:06 a.m.

On Thu, May 19, 2022 at 03:42:38PM +0800, Tao Liu wrote:

...

On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote: > On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote: > > > > On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: > > > On 5/18/22 10:45 AM, Mike Snitzer wrote: > > > > On Tue, May 17 2022 at 2:34P -0400, > > > > Tao Liu <ltao(a)redhat.com> wrote: > > > > > > ... > > > > > > >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > > > >> for kdump, mostly the reason is insufficient file system space, and it > > > >> worked well for kdump in the past. However as for the case of lvm2 > > > >> thinp, I found the userspace program no longer gets informed in async > > > >> when thin pool autoextend fails. So I turned to the sync flag, which > > > >> force the userspace program to wait the date been synced to disk > > > >> before exit, and it works well according to my test. But it does cost > > > >> more writing time than async... > > > > > > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > > > > generic problem in the kdump userspace if it isn't able to detect > > > > write failures without using the "sync" mount option. > > > > > > > > kdump's job is to dump system memory as carefully as possible. Yet > > > > you're saying kdump is using buffered IO. Buffered IO creates > > > > additional memory use, and associated pages don't get written back > > > > until writeback kicks in.. hence the delayed nature of write failures. > > > > > > (cc: vivek, who may have thoughts on buffered IO vs direct IO) > > > > [ cc Bao and Dave ] > > > > > > > > > But those failures can happen with non-thinp block devices too. > > > > > > Exactly. > > > > > > > Seems logical that kdump should be using direct IO to write system > > > > memory back (rather than buffered IO). Again, using buffered IO > > > > creates more memory use -- so you're needlessly increasing memory > > > > reserve for kdump's use by using buffered IO. > > > > > > > > Please take a closer look at how to properly detect write failures. > > > > Doing so properly should make kdump work, as in detect write failure, > > > > on any storage if it runs out of space. > > > > > > > > Please see: https://lwn.net/Articles/457667/ > > > > > > > > Anything short of that and you're papering over a general kdump > > > > problem by making it seem like a thinp specific problem. > > > > > > Yep. If you want to know if your buffered write succeeded or not, you > > > have several options, including calling fsync() and handling any errors. > > > The article above goes into much more detail. > > > > > > This should be done regardless of the storage type; there is no need to > > > single out thinp here. IO can fail for any number of reasons, on any type > > > of storage. You should always make the proper data integrity syscalls > > > and do error handling if you care about the results of your buffered write() > > > calls. > > > > Right. I think key thing is to call fsync() after saving vmcore is > > finished and based on the result of fsync() determine if file could > > make it to disk or not. > > > > If fsync() is not reporting errors properly, then that's an issue > > we should try to fix. I remember Jeff Layton had done fixes in > > this area to report errors if page writeback failed. > > > > I am not sure if kdump scripts call fsync() or not. I think that's > > the first thing we should verify. And if we are not doing it, fix it. > > Bao/Dave you probably are in best position to answer that. > > Hi Vivek, > Hi all, > Checking the kdump scripts, we have below, a separate "sync" command > is added after saving vmcore, so it would be good since that cover > all the core collectors, if we use fsync then we need to patch > makedumpfile and cp, maybe it is not necessary: > $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" ^^^^^^^^^^^^^^^ There are a variety of collectors, user can specify one by setting /etc/kdump.conf , such as makedumpfile, cp etc. It will be impossible to adapt fsync() call for all of them. The doable way to me is to mount the destination device with sync flag, so collectors won't need to be modified for fsync().

Why can't we call fsync/syncfs after core collector has finished. core collector is just a too which is copying the dump to destination file (vmcore). Once core collector is finished, we should be able to call fsync/sycnfs on vmcore?

...

> _dump_exitcode=$? The exit code is unreliable, 0 doesn't garentee the data been successfully written to disk when mount with async by default.

This exit code is just signifying whether core collector finished its job or faced errors. It is not guaranteeing that data is on disk. All it says that it could write data and data could still be in various caches in operating system and will need to be flushed out.

...

> if [ $_dump_exitcode -eq 0 ]; then > mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" > sync ^^^^ If we have unflushed data, we reached here to flush it, how do we know if the flush is successful? cmd sync will call sync(), which is void according to man2 sync, so no exit code here for analysis.

I think using "sync" is the core of the problem and needs fixing to detect errors.

...

Any suggestions for how to improve the code?

I wrote in other email, we should try using "fsync vmcore" or "syncfs -f vmcore" and that will writeback all the caches and if thin pool can't be grown, we should see errors (or infinite retry from OS). BTW, configuring system in such a way that you have run out of space and thin pool can't be grown is a really bad idea. LVM team very strongly recommends that never ever run into a situation where thin pool is full. Thanks Vivek

...

Thanks, Tao Liu > dinfo "saving vmcore complete" > > > > > > direct I/O, O_SYNC I/O these all are slow options. We also have a > > requirement to save dump ASAP and reboot back into the original > > kernel so that we don't keep the machine down for longer duration. > > > > So if problem is about error detection, fsync() should solve it. Using > > direct I/O or O_SYNC is an option user should be able to choose if they > > wish it. I don't think Kdump provides any mechanism to do direct I/O. But > > it might be allowing passing "sync" mount option so that every I/O > > is effectively will use O_SYNC. Bao, do I get it right. > > Not sure if the sync mount option will cause vmcore saving slow down, > since we are the only user in the kdump kernel so I suspect it will be > fine, but it may need some actual testing.. Another thing is if we > use sync io then the saving progress indicator will be more accurate, > let's see how others think about this :) > > > > > Thanks > > Vivek > > > > Thanks > Dave >

Vivek Goyal

8:18 a.m.

On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote:

...

Hi Dave, I am not sure why do we need to patch makdeumpfile or cp. Once core collector has saved the file, we can either do "sync -f vmcore" or "fsync vmcore". Isn't it?

...

$CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" _dump_exitcode=$? if [ $_dump_exitcode -eq 0 ]; then mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" sync

I think this "sync" call is not sufficient. It does not return error if there is one. man 2 sync says, sync() is always successful. So if some data can't be written back because underlying device is full, sync() will not return error and we will assume vmcore was saved successfully. So we probably either need to call syncfs (sync -f vmcore) or use fsync (fsync vmcore). And that probably should solve the problem. If storage does not have enough space, we should get an error back and we can display that vmcore could not be saved successfully. Another thing we will have to be careful about is that it does not hang infinitely. I think by default XFS retries I/O infinitely if it gets -ENOSPC from storage (thinking this is a temporary situation and will be resolved at some point of time). XFS had introduced knobs to tune this behavior. One could specify how long to retry upon error and when to give up and return error to user space. I think Carlos did that work and should know more about what are the knobs and how to tune these. Copying carlos. Thinking more about it, If there was not enough space, how come sync returned early without writing all the data back and not hang infinitely. Hmm... not sure.

...

I think it will most likely slow down vmcore saving. Let us give it a try and see how bad it is. Especially on large machines where vmcore can be big, using "sync" can be bad for performance. I would rather prefer buffered write.

...

Another thing is if we use sync io then the saving progress indicator will be more accurate, let's see how others think about this :)

We could create another indicator saying "Syncing vmcore to disk" after "Saving vmcore to disk". So I will not be too worried about it. Thanks Vivek

Mike Snitzer

9:02 a.m.

On Thu, May 19 2022 at 9:18P -0400, Vivek Goyal <vgoyal(a)redhat.com> wrote:

...

On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote: > On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote: > > > > On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: > > > On 5/18/22 10:45 AM, Mike Snitzer wrote: > > > > On Tue, May 17 2022 at 2:34P -0400, > > > > Tao Liu <ltao(a)redhat.com> wrote: > > > > > > ... > > > > > > >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > > > >> for kdump, mostly the reason is insufficient file system space, and it > > > >> worked well for kdump in the past. However as for the case of lvm2 > > > >> thinp, I found the userspace program no longer gets informed in async > > > >> when thin pool autoextend fails. So I turned to the sync flag, which > > > >> force the userspace program to wait the date been synced to disk > > > >> before exit, and it works well according to my test. But it does cost > > > >> more writing time than async... > > > > > > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > > > > generic problem in the kdump userspace if it isn't able to detect > > > > write failures without using the "sync" mount option. > > > > > > > > kdump's job is to dump system memory as carefully as possible. Yet > > > > you're saying kdump is using buffered IO. Buffered IO creates > > > > additional memory use, and associated pages don't get written back > > > > until writeback kicks in.. hence the delayed nature of write failures. > > > > > > (cc: vivek, who may have thoughts on buffered IO vs direct IO) > > > > [ cc Bao and Dave ] > > > > > > > > > But those failures can happen with non-thinp block devices too. > > > > > > Exactly. > > > > > > > Seems logical that kdump should be using direct IO to write system > > > > memory back (rather than buffered IO). Again, using buffered IO > > > > creates more memory use -- so you're needlessly increasing memory > > > > reserve for kdump's use by using buffered IO. > > > > > > > > Please take a closer look at how to properly detect write failures. > > > > Doing so properly should make kdump work, as in detect write failure, > > > > on any storage if it runs out of space. > > > > > > > > Please see: https://lwn.net/Articles/457667/ > > > > > > > > Anything short of that and you're papering over a general kdump > > > > problem by making it seem like a thinp specific problem. > > > > > > Yep. If you want to know if your buffered write succeeded or not, you > > > have several options, including calling fsync() and handling any errors. > > > The article above goes into much more detail. > > > > > > This should be done regardless of the storage type; there is no need to > > > single out thinp here. IO can fail for any number of reasons, on any type > > > of storage. You should always make the proper data integrity syscalls > > > and do error handling if you care about the results of your buffered write() > > > calls. > > > > Right. I think key thing is to call fsync() after saving vmcore is > > finished and based on the result of fsync() determine if file could > > make it to disk or not. > > > > If fsync() is not reporting errors properly, then that's an issue > > we should try to fix. I remember Jeff Layton had done fixes in > > this area to report errors if page writeback failed. > > > > I am not sure if kdump scripts call fsync() or not. I think that's > > the first thing we should verify. And if we are not doing it, fix it. > > Bao/Dave you probably are in best position to answer that. > > Hi Vivek, > > Checking the kdump scripts, we have below, a separate "sync" command > is added after saving vmcore, so it would be good since that cover > all the core collectors, if we use fsync then we need to patch > makedumpfile and cp, maybe it is not necessary: Hi Dave, I am not sure why do we need to patch makdeumpfile or cp. Once core collector has saved the file, we can either do "sync -f vmcore" or "fsync vmcore". Isn't it? > $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" > _dump_exitcode=$? > if [ $_dump_exitcode -eq 0 ]; then > mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" > sync I think this "sync" call is not sufficient. It does not return error if there is one. man 2 sync says, sync() is always successful. So if some data can't be written back because underlying device is full, sync() will not return error and we will assume vmcore was saved successfully. So we probably either need to call syncfs (sync -f vmcore) or use fsync (fsync vmcore). And that probably should solve the problem. If storage does not have enough space, we should get an error back and we can display that vmcore could not be saved successfully. Another thing we will have to be careful about is that it does not hang infinitely. I think by default XFS retries I/O infinitely if it gets -ENOSPC from storage (thinking this is a temporary situation and will be resolved at some point of time). XFS had introduced knobs to tune this behavior. One could specify how long to retry upon error and when to give up and return error to user space. I think Carlos did that work and should know more about what are the knobs and how to tune these. Copying carlos. Thinking more about it, If there was not enough space, how come sync returned early without writing all the data back and not hang infinitely. Hmm... not sure.

Could be due to DM thinp queueing IO waiting only 60 secs for resize (after running out of space) and then, upon timeout, it pivots to erroring the IO. Hmm, nope.. dm-thin will error with BLK_STS_NOSPC in that case (so must be XFS doesn't do infinite retry on -ENOSPC anymore?).

...

> dinfo "saving vmcore complete" > > > > > > direct I/O, O_SYNC I/O these all are slow options. We also have a > > requirement to save dump ASAP and reboot back into the original > > kernel so that we don't keep the machine down for longer duration. > > > > So if problem is about error detection, fsync() should solve it. Using > > direct I/O or O_SYNC is an option user should be able to choose if they > > wish it. I don't think Kdump provides any mechanism to do direct I/O. But > > it might be allowing passing "sync" mount option so that every I/O > > is effectively will use O_SYNC. Bao, do I get it right. > > Not sure if the sync mount option will cause vmcore saving slow down, > since we are the only user in the kdump kernel so I suspect it will be > fine, but it may need some actual testing.. I think it will most likely slow down vmcore saving. Let us give it a try and see how bad it is. Especially on large machines where vmcore can be big, using "sync" can be bad for performance. I would rather prefer buffered write.

I agree, mounting with "sync" is really not a good solution. Your above suggestions should be tried first. Mike

Zdenek Kabelac

9:08 a.m.

Dne 19. 05. 22 v 16:02 Mike Snitzer napsal(a):

...

On Thu, May 19 2022 at 9:18P -0400, Vivek Goyal <vgoyal(a)redhat.com> wrote: > On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote: >> On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote: >>> On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: >>>> On 5/18/22 10:45 AM, Mike Snitzer wrote: >>>>> On Tue, May 17 2022 at 2:34P -0400, >>>>> Tao Liu <ltao(a)redhat.com> wrote: >>>> ... >>>> >>>>>> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel >>>>>> for kdump, mostly the reason is insufficient file system space, and it >>>>>> worked well for kdump in the past. However as for the case of lvm2 >>>>>> thinp, I found the userspace program no longer gets informed in async >>>>>> when thin pool autoextend fails. So I turned to the sync flag, which >>>>>> force the userspace program to wait the date been synced to disk >>>>>> before exit, and it works well according to my test. But it does cost >>>>>> more writing time than async... >>>>> I've consulted Eric Sandeen (cc'd) and he agrees there is a more >>>>> generic problem in the kdump userspace if it isn't able to detect >>>>> write failures without using the "sync" mount option. >>>>> >>>>> kdump's job is to dump system memory as carefully as possible. Yet >>>>> you're saying kdump is using buffered IO. Buffered IO creates >>>>> additional memory use, and associated pages don't get written back >>>>> until writeback kicks in.. hence the delayed nature of write failures. >>>> (cc: vivek, who may have thoughts on buffered IO vs direct IO) >>> [ cc Bao and Dave ] >>> >>>>> But those failures can happen with non-thinp block devices too. >>>> Exactly. >>>> >>>>> Seems logical that kdump should be using direct IO to write system >>>>> memory back (rather than buffered IO). Again, using buffered IO >>>>> creates more memory use -- so you're needlessly increasing memory >>>>> reserve for kdump's use by using buffered IO. >>>>> >>>>> Please take a closer look at how to properly detect write failures. >>>>> Doing so properly should make kdump work, as in detect write failure, >>>>> on any storage if it runs out of space. >>>>> >>>>> Please see: https://lwn.net/Articles/457667/ >>>>> >>>>> Anything short of that and you're papering over a general kdump >>>>> problem by making it seem like a thinp specific problem. >>>> Yep. If you want to know if your buffered write succeeded or not, you >>>> have several options, including calling fsync() and handling any errors. >>>> The article above goes into much more detail. >>>> >>>> This should be done regardless of the storage type; there is no need to >>>> single out thinp here. IO can fail for any number of reasons, on any type >>>> of storage. You should always make the proper data integrity syscalls >>>> and do error handling if you care about the results of your buffered write() >>>> calls. >>> Right. I think key thing is to call fsync() after saving vmcore is >>> finished and based on the result of fsync() determine if file could >>> make it to disk or not. >>> >>> If fsync() is not reporting errors properly, then that's an issue >>> we should try to fix. I remember Jeff Layton had done fixes in >>> this area to report errors if page writeback failed. >>> >>> I am not sure if kdump scripts call fsync() or not. I think that's >>> the first thing we should verify. And if we are not doing it, fix it. >>> Bao/Dave you probably are in best position to answer that. >> Hi Vivek, >> >> Checking the kdump scripts, we have below, a separate "sync" command >> is added after saving vmcore, so it would be good since that cover >> all the core collectors, if we use fsync then we need to patch >> makedumpfile and cp, maybe it is not necessary: > Hi Dave, > > I am not sure why do we need to patch makdeumpfile or cp. Once core > collector has saved the file, we can either do "sync -f vmcore" or > "fsync vmcore". Isn't it? > >> $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" >> _dump_exitcode=$? >> if [ $_dump_exitcode -eq 0 ]; then >> mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" >> sync > I think this "sync" call is not sufficient. It does not return error if > there is one. > > man 2 sync says, sync() is always successful. > > So if some data can't be written back because underlying device is full, > sync() will not return error and we will assume vmcore was saved > successfully. > > So we probably either need to call syncfs (sync -f vmcore) or use > fsync (fsync vmcore). And that probably should solve the problem. If > storage does not have enough space, we should get an error back and > we can display that vmcore could not be saved successfully. > > Another thing we will have to be careful about is that it does not > hang infinitely. I think by default XFS retries I/O infinitely if > it gets -ENOSPC from storage (thinking this is a temporary situation > and will be resolved at some point of time). XFS had introduced knobs > to tune this behavior. One could specify how long to retry upon error > and when to give up and return error to user space. I think Carlos did > that work and should know more about what are the knobs and how to > tune these. Copying carlos. > > Thinking more about it, If there was not enough space, how come sync > returned early without writing all the data back and not hang infinitely. > Hmm... not sure. Could be due to DM thinp queueing IO waiting only 60 secs for resize (after running out of space) and then, upon timeout, it pivots to erroring the IO. Hmm, nope.. dm-thin will error with BLK_STS_NOSPC in that case (so must be XFS doesn't do infinite retry on -ENOSPC anymore?).

Note - lvm2 supports 'lvchange --errorwhenfull y|n vgname/thinpoolname ' - to switch to instant error when pool gets full (avoiding 60s timeout).

...

> a try and see how bad it is. Especially on large machines where vmcore > can be big, using "sync" can be bad for performance. I would rather > prefer buffered write. I agree, mounting with "sync" is really not a good solution. Your above suggestions should be tried first.

Idealy this kdump thing with thin-pools should be using some 'preallocatted & provisioned' thinLV (if it really needs to be solved with thin). Regards Zdenek

Eric Sandeen

9:26 a.m.

On 5/19/22 8:18 AM, Vivek Goyal wrote:

...

> Hi Vivek, > > Checking the kdump scripts, we have below, a separate "sync" command > is added after saving vmcore, so it would be good since that cover > all the core collectors, if we use fsync then we need to patch > makedumpfile and cp, maybe it is not necessary: Hi Dave, I am not sure why do we need to patch makdeumpfile or cp. Once core collector has saved the file, we can either do "sync -f vmcore" or "fsync vmcore". Isn't it?

I'm not aware of any "fsync" shell command, unfortunately. It'd be nice. sync -f syncs the filesystem of the target file, not just the file itself - but that's fine. [1] so sync -f vmcore should be fine, will sync only the filesystem hosting vmcore, and should return an error if it fails. [2] -Eric [1] syncfs() is like sync(), but synchronizes just the filesystem containing file referred to by the open file descriptor fd. [2] ERRORS sync() is always successful. syncfs() can fail for at least the following reasons:

Vivek Goyal

9:37 a.m.

On Thu, May 19, 2022 at 09:26:11AM -0500, Eric Sandeen wrote:

...

On 5/19/22 8:18 AM, Vivek Goyal wrote: >> Hi Vivek, >> >> Checking the kdump scripts, we have below, a separate "sync" command >> is added after saving vmcore, so it would be good since that cover >> all the core collectors, if we use fsync then we need to patch >> makedumpfile and cp, maybe it is not necessary: > Hi Dave, > > I am not sure why do we need to patch makdeumpfile or cp. Once core > collector has saved the file, we can either do "sync -f vmcore" or > "fsync vmcore". Isn't it? I'm not aware of any "fsync" shell command, unfortunately. It'd be nice.

Hi Eric, Agreed, that is no "fsync" shell command. I always felt it will be nice to have one. May be I should open a bug for coreutils people as RFC and ask for either fsync command or extend sync to call fsync. For the time being either we need to write a simple utility say kdump_fsync and ship with kdump package. But even simpler solution is to just use "sync -f vmcore" which will call syncfs() on filesystem where vmcore resides.

...

sync -f syncs the filesystem of the target file, not just the file itself - but that's fine. [1]

Right. That's the idea. Thanks Vivek

...

so sync -f vmcore should be fine, will sync only the filesystem hosting vmcore, and should return an error if it fails. [2] -Eric [1] syncfs() is like sync(), but synchronizes just the filesystem containing file referred to by the open file descriptor fd. [2] ERRORS sync() is always successful. syncfs() can fail for at least the following reasons:

Tao Liu

9:39 a.m.

On Thu, May 19, 2022 at 10:26 PM Eric Sandeen <sandeen(a)redhat.com> wrote:

...

Since we are programming in pure shell script, it would be strange if we insert a c program which call fsync on vmcore specifically for the purpose, is there any best practice on how to achieve this? Thanks, Tao Liu

...

sync -f syncs the filesystem of the target file, not just the file itself - but that's fine. [1] so sync -f vmcore should be fine, will sync only the filesystem hosting vmcore, and should return an error if it fails. [2] -Eric [1] syncfs() is like sync(), but synchronizes just the filesystem containing file referred to by the open file descriptor fd. [2] ERRORS sync() is always successful. syncfs() can fail for at least the following reasons:

Eric Sandeen

9:52 a.m.

On 5/19/22 9:39 AM, Tao Liu wrote:

...

> I'm not aware of any "fsync" shell command, unfortunately. It'd be nice. > Since we are programming in pure shell script, it would be strange if we insert a c program which call fsync on vmcore specifically for the purpose, is there any best practice on how to achieve this?

Yes...

...

Thanks, Tao Liu > sync -f syncs the filesystem of the target file, not just the file itself - > but that's fine. [1]

^^^^ this. -Eric

...

> so sync -f vmcore should be fine, will sync only the filesystem hosting vmcore, > and should return an error if it fails. [2] > > -Eric

Tao Liu

10 a.m.

On Thu, May 19, 2022 at 10:52 PM Eric Sandeen <sandeen(a)redhat.com> wrote:

...

On 5/19/22 9:39 AM, Tao Liu wrote: >> I'm not aware of any "fsync" shell command, unfortunately. It'd be nice. >> > Since we are programming in pure shell script, it would be strange if > we insert a c program which call fsync on vmcore specifically for the > purpose, is there any best practice on how to achieve this? Yes... > Thanks, > Tao Liu > >> sync -f syncs the filesystem of the target file, not just the file itself - >> but that's fine. [1] ^^^^ this.

Right, I get the point. Thank you all for the good suggestions and discussion! I will have a try and test on it, thank you very much! Thanks, Tao Liu

...

-Eric >> so sync -f vmcore should be fine, will sync only the filesystem hosting vmcore, >> and should return an error if it fails. [2] >> >> -Eric

Eric Sandeen

10:48 a.m.

On 5/19/22 10:00 AM, Tao Liu wrote:

...

On Thu, May 19, 2022 at 10:52 PM Eric Sandeen <sandeen(a)redhat.com> wrote: > > On 5/19/22 9:39 AM, Tao Liu wrote: >>> I'm not aware of any "fsync" shell command, unfortunately. It'd be nice. >>> >> Since we are programming in pure shell script, it would be strange if >> we insert a c program which call fsync on vmcore specifically for the >> purpose, is there any best practice on how to achieve this? > > Yes... > >> Thanks, >> Tao Liu >> >>> sync -f syncs the filesystem of the target file, not just the file itself - >>> but that's fine. [1] > > ^^^^ this. Right, I get the point. Thank you all for the good suggestions and discussion! I will have a try and test on it, thank you very much!

Great! One caveat I just noticed ... commit 5679897eb104cec9e99609c3f045a0c20603da4c Author: Darrick J. Wong <djwong(a)kernel.org> Date: Sun Jan 30 08:53:16 2022 -0800 vfs: make sync_filesystem return errors from ->sync_fs Strangely, sync_filesystem ignores the return code from the ->sync_fs call, which means that syscalls like syncfs(2) never see the error. This doesn't seem right, so fix that. Signed-off-by: Darrick J. Wong <djwong(a)kernel.org> Reviewed-by: Jan Kara <jack(a)suse.cz> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Acked-by: Christian Brauner <brauner(a)kernel.org> I'm not sure how long that flaw has existed, but that fix went into v5.17, and also to the stable kernel releases. I think you should still use sync -f (unless you want to write an fsync helper), but be aware that it might still miss errors on older kernels. :( -Eric

Tao Liu

9:03 p.m.

On Thu, May 19, 2022 at 11:48 PM Eric Sandeen <sandeen(a)redhat.com> wrote:

...

On 5/19/22 10:00 AM, Tao Liu wrote: > On Thu, May 19, 2022 at 10:52 PM Eric Sandeen <sandeen(a)redhat.com> wrote: >> >> On 5/19/22 9:39 AM, Tao Liu wrote: >>>> I'm not aware of any "fsync" shell command, unfortunately. It'd be nice. >>>> >>> Since we are programming in pure shell script, it would be strange if >>> we insert a c program which call fsync on vmcore specifically for the >>> purpose, is there any best practice on how to achieve this? >> >> Yes... >> >>> Thanks, >>> Tao Liu >>> >>>> sync -f syncs the filesystem of the target file, not just the file itself - >>>> but that's fine. [1] >> >> ^^^^ this. > > Right, I get the point. Thank you all for the good suggestions and > discussion! I will have a try and test on it, thank you very much! Great! One caveat I just noticed ... commit 5679897eb104cec9e99609c3f045a0c20603da4c Author: Darrick J. Wong <djwong(a)kernel.org> Date: Sun Jan 30 08:53:16 2022 -0800 vfs: make sync_filesystem return errors from ->sync_fs Strangely, sync_filesystem ignores the return code from the ->sync_fs call, which means that syscalls like syncfs(2) never see the error. This doesn't seem right, so fix that. Signed-off-by: Darrick J. Wong <djwong(a)kernel.org> Reviewed-by: Jan Kara <jack(a)suse.cz> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Acked-by: Christian Brauner <brauner(a)kernel.org> I'm not sure how long that flaw has existed, but that fix went into v5.17, and also to the stable kernel releases. I think you should still use sync -f (unless you want to write an fsync helper), but be aware that it might still miss errors on older kernels. :(

OK, I will have a double check on fedora, rhel8 and rhel9 kernel, to verify the patch has been integrated in. Thanks, Tao Liu

...

-Eric

Dave Young

8:51 p.m.

On Thu, 19 May 2022 at 22:26, Eric Sandeen <sandeen(a)redhat.com> wrote:

...

Me too, that's why I replied before. I can not find it, man fsync we can get fsync(2) and fsync(3), maybe it is packaged in some other utils?

...

Agreed..

...

-Eric [1] syncfs() is like sync(), but synchronizes just the filesystem containing file referred to by the open file descriptor fd. [2] ERRORS sync() is always successful. syncfs() can fail for at least the following reasons:

Dave Young

9:56 p.m.

Hi Vivek, On Thu, 19 May 2022 at 21:18, Vivek Goyal <vgoyal(a)redhat.com> wrote:

...

On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote: > On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote: > > > > On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote: > > > On 5/18/22 10:45 AM, Mike Snitzer wrote: > > > > On Tue, May 17 2022 at 2:34P -0400, > > > > Tao Liu <ltao(a)redhat.com> wrote: > > > > > > ... > > > > > > >> I'm not an expert to fs and IO. If the async IO fails in 2nd kernel > > > >> for kdump, mostly the reason is insufficient file system space, and it > > > >> worked well for kdump in the past. However as for the case of lvm2 > > > >> thinp, I found the userspace program no longer gets informed in async > > > >> when thin pool autoextend fails. So I turned to the sync flag, which > > > >> force the userspace program to wait the date been synced to disk > > > >> before exit, and it works well according to my test. But it does cost > > > >> more writing time than async... > > > > > > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more > > > > generic problem in the kdump userspace if it isn't able to detect > > > > write failures without using the "sync" mount option. > > > > > > > > kdump's job is to dump system memory as carefully as possible. Yet > > > > you're saying kdump is using buffered IO. Buffered IO creates > > > > additional memory use, and associated pages don't get written back > > > > until writeback kicks in.. hence the delayed nature of write failures. > > > > > > (cc: vivek, who may have thoughts on buffered IO vs direct IO) > > > > [ cc Bao and Dave ] > > > > > > > > > But those failures can happen with non-thinp block devices too. > > > > > > Exactly. > > > > > > > Seems logical that kdump should be using direct IO to write system > > > > memory back (rather than buffered IO). Again, using buffered IO > > > > creates more memory use -- so you're needlessly increasing memory > > > > reserve for kdump's use by using buffered IO. > > > > > > > > Please take a closer look at how to properly detect write failures. > > > > Doing so properly should make kdump work, as in detect write failure, > > > > on any storage if it runs out of space. > > > > > > > > Please see: https://lwn.net/Articles/457667/ > > > > > > > > Anything short of that and you're papering over a general kdump > > > > problem by making it seem like a thinp specific problem. > > > > > > Yep. If you want to know if your buffered write succeeded or not, you > > > have several options, including calling fsync() and handling any errors. > > > The article above goes into much more detail. > > > > > > This should be done regardless of the storage type; there is no need to > > > single out thinp here. IO can fail for any number of reasons, on any type > > > of storage. You should always make the proper data integrity syscalls > > > and do error handling if you care about the results of your buffered write() > > > calls. > > > > Right. I think key thing is to call fsync() after saving vmcore is > > finished and based on the result of fsync() determine if file could > > make it to disk or not. > > > > If fsync() is not reporting errors properly, then that's an issue > > we should try to fix. I remember Jeff Layton had done fixes in > > this area to report errors if page writeback failed. > > > > I am not sure if kdump scripts call fsync() or not. I think that's > > the first thing we should verify. And if we are not doing it, fix it. > > Bao/Dave you probably are in best position to answer that. > > Hi Vivek, > > Checking the kdump scripts, we have below, a separate "sync" command > is added after saving vmcore, so it would be good since that cover > all the core collectors, if we use fsync then we need to patch > makedumpfile and cp, maybe it is not necessary: Hi Dave, I am not sure why do we need to patch makdeumpfile or cp. Once core collector has saved the file, we can either do "sync -f vmcore" or "fsync vmcore". Isn't it?

Hi Vivek, as we see in other replied, I was thinking we do not have fsync utility.

...

> $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete" > _dump_exitcode=$? > if [ $_dump_exitcode -eq 0 ]; then > mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore" > sync I think this "sync" call is not sufficient. It does not return error if there is one. man 2 sync says, sync() is always successful. So if some data can't be written back because underlying device is full, sync() will not return error and we will assume vmcore was saved successfully. So we probably either need to call syncfs (sync -f vmcore) or use fsync (fsync vmcore). And that probably should solve the problem. If storage does not have enough space, we should get an error back and we can display that vmcore could not be saved successfully.

Agreed. sync -f is better because of the error code.

...

Another thing we will have to be careful about is that it does not hang infinitely. I think by default XFS retries I/O infinitely if it gets -ENOSPC from storage (thinking this is a temporary situation and will be resolved at some point of time). XFS had introduced knobs to tune this behavior. One could specify how long to retry upon error and when to give up and return error to user space. I think Carlos did that work and should know more about what are the knobs and how to tune these. Copying carlos. Thinking more about it, If there was not enough space, how come sync returned early without writing all the data back and not hang infinitely. Hmm... not sure.

I would suggest Tao do some experiment to see if it is a problem..

...

yes, if we can have some more data it would be better, randomly google the "dd" performance, someone said below: https://stackoverflow.com/questions/33485108/why-is-dd-with-the-direct-o-... But in our case it may depends on the "bs" makedumpfile using (same as cyclic buffer size?), it may also depends on the filesystem? anyway I think more testing would be better.

...

> Another thing is if we > use sync io then the saving progress indicator will be more accurate, > let's see how others think about this :) We could create another indicator saying "Syncing vmcore to disk" after "Saving vmcore to disk". So I will not be too worried about it.

Agreed, this is not a big issue.

...

Thanks Vivek

Dave Young

10:10 p.m.

...

yes, if we can have some more data it would be better, randomly google the "dd" performance, someone said below: https://stackoverflow.com/questions/33485108/why-is-dd-with-the-direct-o-...

Above is for directio, seems sync io is different, anyway with below test on laptop, it seems directio is faster: [dyoung]$ dd if=/dev/zero of=testfile bs=1M count=100 oflag=direct 100+0 records in 100+0 records out 104857600 bytes (105 MB, 100 MiB) copied, 0.202895 s, 517 MB/s [dyoung]$ dd if=/dev/zero of=testfile bs=1M count=100 oflag=sync 100+0 records in 100+0 records out 104857600 bytes (105 MB, 100 MiB) copied, 1.13316 s, 92.5 MB/s

Vivek Goyal

Friday, 20 May Fri, 20 May

7:14 a.m.

On Fri, May 20, 2022 at 11:10:11AM +0800, Dave Young wrote:

...

> yes, if we can have some more data it would be better, randomly > google the "dd" performance, someone said below: > https://stackoverflow.com/questions/33485108/why-is-dd-with-the-direct-o-... > Above is for directio, seems sync io is different,

Yes, sync io is different. man 2 open says following for O_SYNC. By the time write(2) (or similar) returns, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below. I think I/O still goes through the page cache but it is immediately written back to disk. As long as size of I/O per write is big (big block size), I think both direct I/O and sync will perform reasonably well. But if your write sizes are small, these both will perform poor. I think sync probably will perform wrose than direct I/O. Given we are not reading back the contents of vmcore, there is not much point in going through the cache.

...

anyway with below test on laptop, it seems directio is faster: [dyoung]$ dd if=/dev/zero of=testfile bs=1M count=100 oflag=direct 100+0 records in 100+0 records out 104857600 bytes (105 MB, 100 MiB) copied, 0.202895 s, 517 MB/s [dyoung]$ dd if=/dev/zero of=testfile bs=1M count=100 oflag=sync 100+0 records in 100+0 records out 104857600 bytes (105 MB, 100 MiB) copied, 1.13316 s, 92.5 MB/s

Yes, O_SYNC is little slower than direct I/O even for bs=1M. So performance will depend on how makedumpfile is writing data. How many writes is it doing and what are the sizes of these writes. Also even if you use O_DIRECT, you will have to issue fsync at the end anyway to ensure file (and any associated metadata actually got persisted on the disk). So while we can play with O_DIRECT and O_SYNC, but that does not seem to be a real requirement. Using "sync -f" is the fastest fix for the issue. Thanks Vivek

Coiby Xu

Tuesday, 17 May Tue, 17 May

9:14 p.m.

Hi Tao, On Mon, May 16, 2022 at 11:25:50AM +0800, Tao Liu wrote:

...

If possible, could we add a test as similar to local kdump for lvm2 thinp?

...

size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should be handled particularly. Please review. Tao Liu (4): Add lvm2 thin provision dump target checker Add lvm2-monitor.service for kdump when lvm2 thinp enabled lvm.conf should be check modified if lvm2 thinp enabled Mount the lvm2 thinp target with sync option dracut-lvm2-monitor.service | 15 +++++++++++++++ dracut-module-setup.sh | 16 ++++++++++++++++ kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ kdumpctl | 1 + kexec-tools.spec | 2 ++ mkdumprd | 5 +++++ 6 files changed, 64 insertions(+) create mode 100644 dracut-lvm2-monitor.service -- 2.33.1

-- Best regards, Coiby

Tao Liu

9:28 p.m.

On Wed, May 18, 2022 at 10:16 AM Coiby Xu <coxu(a)redhat.com> wrote:

...

Hi Tao, On Mon, May 16, 2022 at 11:25:50AM +0800, Tao Liu wrote: >Thin provision is a mechanism that you can allocate a lvm volume which has >a large virtual size for file systems but actually in a small physical >size. The physical size can be autoextended in use if thin pool reached a >threshold specified in /etc/lvm/lvm.conf. > >There are 3 works should be handled when enable lvm2 thinp for kdump: > >1) Check if the dump target device or directory is thinp device. >2) Monitor the thin pool and autoextend its size when it reached the threshold > during kdump. >3) If thin pool size-autoextend fails, the user space program will not know due to > buffered IO. So the lvm2 thinp dump target should be mounted with sync flag, making > user space program know immediately if write fails. > >According to my testing, the memory consumption procedure for lvm2 thinp is the thin pool If possible, could we add a test as similar to local kdump for lvm2 thinp?

Sure, sounds like a good idea, please wait a while before I compose the patch for v2... Thanks, Tao Liu

...

>size-autoextend phase. For fedora and rhel9, the default crashkernel value is enough. But >for rhel8, the default crashkernel value 1G-4G:160M is not enough, so it should >be handled particularly. > >Please review. > >Tao Liu (4): > Add lvm2 thin provision dump target checker > Add lvm2-monitor.service for kdump when lvm2 thinp enabled > lvm.conf should be check modified if lvm2 thinp enabled > Mount the lvm2 thinp target with sync option > > dracut-lvm2-monitor.service | 15 +++++++++++++++ > dracut-module-setup.sh | 16 ++++++++++++++++ > kdump-lib-initramfs.sh | 25 +++++++++++++++++++++++++ > kdumpctl | 1 + > kexec-tools.spec | 2 ++ > mkdumprd | 5 +++++ > 6 files changed, 64 insertions(+) > create mode 100644 dracut-lvm2-monitor.service > >-- >2.33.1 > -- Best regards, Coiby

730

days inactive

734

days old

kexec@lists.fedoraproject.org

Manage subscription

34 comments

7 participants

tags (0)

participants (7)

Coiby Xu
Dave Young
Eric Sandeen
Mike Snitzer
Tao Liu
Vivek Goyal
Zdenek Kabelac

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

[PATCH 0/4] Add lvm2 thin provision support for kdump