Hi Vivek,
On Thu, 19 May 2022 at 21:18, Vivek Goyal <vgoyal(a)redhat.com> wrote:
On Thu, May 19, 2022 at 11:58:42AM +0800, Dave Young wrote:
> On Thu, 19 May 2022 at 01:03, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> >
> > On Wed, May 18, 2022 at 10:58:57AM -0500, Eric Sandeen wrote:
> > > On 5/18/22 10:45 AM, Mike Snitzer wrote:
> > > > On Tue, May 17 2022 at 2:34P -0400,
> > > > Tao Liu <ltao(a)redhat.com> wrote:
> > >
> > > ...
> > >
> > > >> I'm not an expert on fs and IO. If the async IO fails in the 2nd
> > > >> kernel for kdump, mostly the reason is insufficient file system
> > > >> space, and it worked well for kdump in the past. However, as for
> > > >> the case of lvm2 thinp, I found the userspace program no longer
> > > >> gets informed asynchronously when thin pool autoextend fails. So
> > > >> I turned to the sync flag, which forces the userspace program to
> > > >> wait until the data has been synced to disk before exit, and it
> > > >> works well according to my test. But it does cost more writing
> > > >> time than async...
> > > >
> > > > I've consulted Eric Sandeen (cc'd) and he agrees there is a more
> > > > generic problem in the kdump userspace if it isn't able to detect
> > > > write failures without using the "sync" mount option.
> > > >
> > > > kdump's job is to dump system memory as carefully as possible. Yet
> > > > you're saying kdump is using buffered IO. Buffered IO creates
> > > > additional memory use, and associated pages don't get written back
> > > > until writeback kicks in, hence the delayed nature of write
> > > > failures.
> > >
> > > (cc: vivek, who may have thoughts on buffered IO vs direct IO)
> >
> > [ cc Bao and Dave ]
> >
> > >
> > > > But those failures can happen with non-thinp block devices too.
> > >
> > > Exactly.
> > >
> > > > Seems logical that kdump should be using direct IO to write system
> > > > memory back (rather than buffered IO). Again, using buffered IO
> > > > creates more memory use -- so you're needlessly increasing the
> > > > memory reserve for kdump's use by using buffered IO.
> > > >
> > > > Please take a closer look at how to properly detect write
> > > > failures. Doing so properly should make kdump work, as in detect
> > > > write failures, on any storage if it runs out of space.
> > > >
> > > > Please see: https://lwn.net/Articles/457667/
> > > >
> > > > Anything short of that and you're papering over a general kdump
> > > > problem by making it seem like a thinp-specific problem.
> > >
> > > Yep. If you want to know if your buffered write succeeded or not, you
> > > have several options, including calling fsync() and handling any errors.
> > > The article above goes into much more detail.
> > >
> > > This should be done regardless of the storage type; there is no need to
> > > single out thinp here. IO can fail for any number of reasons, on any type
> > > of storage. You should always make the proper data integrity syscalls
> > > and do error handling if you care about the results of your
> > > buffered write() calls.
> >
> > Right. I think the key thing is to call fsync() after saving the
> > vmcore is finished and, based on the result of fsync(), determine
> > whether the file made it to disk or not.
> >
> > If fsync() is not reporting errors properly, then that's an issue
> > we should try to fix. I remember Jeff Layton did fixes in this area
> > to report errors if page writeback failed.
> >
> > I am not sure if the kdump scripts call fsync() or not. I think
> > that's the first thing we should verify, and if we are not doing it,
> > fix it. Bao/Dave, you are probably in the best position to answer
> > that.
>
> Hi Vivek,
>
> Checking the kdump scripts, we have the snippet below: a separate
> "sync" command is run after saving the vmcore. That would be good
> since it covers all the core collectors; if we used fsync() we would
> need to patch makedumpfile and cp, which is maybe not necessary:
Hi Dave,
I am not sure why we would need to patch makedumpfile or cp. Once the
core collector has saved the file, we can just do "sync -f vmcore" or
"fsync vmcore" afterwards. Isn't that right?
Hi Vivek, as we saw in the other reply, I was thinking we do not have
an fsync utility.
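For what it's worth, coreutils sync(1) (since 8.24) does accept file
operands: "sync FILE" fsync()s the file and "sync -f FILE" syncfs()s
the filesystem containing it, and both report writeback errors through
the exit status. A minimal sketch of such a check, with a temp file
standing in for the real vmcore path:

```shell
#!/bin/sh
# Sketch: verify that a saved file actually reached stable storage.
# "sync FILE" (coreutils >= 8.24) calls fsync(FILE) and exits non-zero
# on a writeback error; a bare "sync" always exits 0 and hides errors.
vmcore=$(mktemp)            # stand-in for "$_dump_fs_path/vmcore"
echo "dump data" > "$vmcore"
if sync "$vmcore"; then
    status=saved
else
    status=failed
fi
echo "vmcore $status"
rm -f "$vmcore"
```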
> $CORE_COLLECTOR /proc/vmcore "$_dump_fs_path/vmcore-incomplete"
> _dump_exitcode=$?
> if [ $_dump_exitcode -eq 0 ]; then
>     mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore"
>     sync
I think this "sync" call is not sufficient: it does not return an error
if there is one. man 2 sync says sync() is always successful. So if
some data can't be written back because the underlying device is full,
sync() will not return an error and we will assume the vmcore was saved
successfully.
So we probably need to either call syncfs() ("sync -f vmcore") or
fsync() ("fsync vmcore"), and that should solve the problem. If the
storage does not have enough space, we should get an error back and
can report that the vmcore could not be saved successfully.
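Applied to the quoted script, that could look roughly like the sketch
below. CORE_COLLECTOR and the dump path are stubbed out here so the
sketch is self-contained; in kdump both come from the generated dump
script, and the error message is made up. It assumes coreutils >= 8.24
for "sync -f":

```shell
#!/bin/sh
# Self-contained sketch of the adjusted flow.
_dump_fs_path=$(mktemp -d)
CORE_COLLECTOR="cp /proc/version"      # stand-in for makedumpfile
$CORE_COLLECTOR "$_dump_fs_path/vmcore-incomplete"
_dump_exitcode=$?
if [ $_dump_exitcode -eq 0 ]; then
    mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore"
    # "sync -f" calls syncfs() on the filesystem holding the vmcore
    # and, unlike a bare "sync", reports writeback errors in its exit
    # status.
    if ! sync -f "$_dump_fs_path/vmcore"; then
        echo "sync of vmcore failed, treating the dump as incomplete" >&2
        _dump_exitcode=1
    fi
fi
echo "dump exit code: $_dump_exitcode"
rm -rf "$_dump_fs_path"
```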
Agreed. "sync -f" is better because it reports an error code.
Another thing we will have to be careful about is that it does not
hang indefinitely. I think by default XFS retries I/O forever if it
gets -ENOSPC from storage (assuming this is a temporary situation that
will be resolved at some point). XFS introduced knobs to tune this
behavior: one can specify how long to retry upon error and when to
give up and return an error to user space. I think Carlos did that
work and should know more about what the knobs are and how to tune
them. Copying Carlos.
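For reference, those knobs live under sysfs. A sketch follows; "dm-0"
is a hypothetical device name, and the exact paths and semantics should
be checked against the running kernel's XFS admin guide:

```shell
# Hypothetical device name; the real ones appear under /sys/fs/xfs/.
# max_retries: -1 retries forever (the hang we worry about), while
# N >= 0 gives up after N retries and returns the error.
echo 5 > /sys/fs/xfs/dm-0/error/metadata/ENOSPC/max_retries
# Alternatively, bound the retrying in time (seconds, 0 = no limit):
echo 30 > /sys/fs/xfs/dm-0/error/metadata/ENOSPC/retry_timeout_seconds
```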
Thinking more about it: if there was not enough space, how come sync
returned early without writing all the data back, instead of hanging
indefinitely? Hmm... not sure.
I would suggest Tao run some experiments to see whether this is a
problem.
> dinfo "saving vmcore complete"
>
>
> >
> > Direct I/O and O_SYNC I/O are all slow options. We also have a
> > requirement to save the dump ASAP and reboot back into the original
> > kernel so that we don't keep the machine down for a long duration.
> >
> > So if the problem is about error detection, fsync() should solve it.
> > Using direct I/O or O_SYNC is an option users should be able to
> > choose if they wish. I don't think kdump provides any mechanism to
> > do direct I/O, but it might allow passing the "sync" mount option so
> > that every I/O effectively uses O_SYNC. Bao, do I get it right?
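For illustration, the behavior being described looks like this (device
and mount point are hypothetical; in a kdump kernel the mount would be
set up from its configuration rather than by hand):

```shell
# With -o sync, all I/O on this mount becomes synchronous, as if every
# file were opened with O_SYNC; write errors then surface at write()
# time instead of during later writeback.
mount -o sync /dev/mapper/vg-dumplv /kdumproot/var/crash
```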
>
> Not sure if the sync mount option will slow down vmcore saving. Since
> we are the only user in the kdump kernel, I suspect it will be fine,
> but it needs some actual testing.
I think it will most likely slow down vmcore saving. Let us give it a
try and see how bad it is. Especially on large machines where the
vmcore can be big, using "sync" can be bad for performance. I would
rather prefer buffered writes.
Yes, it would be better if we could get some more data. Randomly
googling "dd" performance, someone said the following:
https://stackoverflow.com/questions/33485108/why-is-dd-with-the-direct-o-...
But in our case it may depend on the "bs" makedumpfile uses (same as
the cyclic buffer size?), and it may also depend on the filesystem.
Anyway, I think more testing would be better.
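One quick way to collect such numbers would be something along these
lines. This is only a sketch with placeholder sizes and a temp file; a
real test would run on the kdump target filesystem with bs matched to
makedumpfile's write size, and could add oflag=direct where the
filesystem supports it:

```shell
#!/bin/sh
# Rough timing sketch, not a benchmark: compare plain buffered writes,
# buffered writes with a trailing fsync, and fully synchronous writes.
testfile=$(mktemp)
for flags in "" "conv=fsync" "oflag=sync"; do
    echo "dd flags: '$flags'"
    # dd prints its timing/throughput summary on stderr; keep the last
    # line of it.
    dd if=/dev/zero of="$testfile" bs=1M count=16 $flags 2>&1 | tail -n 1
done
rm -f "$testfile"
```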
> Another thing is that if we use sync IO then the saving progress
> indicator will be more accurate. Let's see what others think about
> this :)
We could create another indicator saying "Syncing vmcore to disk" after
"Saving vmcore to disk", so I would not be too worried about it.
Agreed, this is not a big issue.
Thanks
Vivek