On 6/26/20 5:44 PM, Matthew Miller wrote:
> On Fri, Jun 26, 2020 at 03:22:07PM -0400, Josef Bacik wrote:
> > I described this case to the working group last week, because it hit
> > us in production this winter. Somebody screwed up and suddenly
> > pushed 2 extra copies of the whole website to everybody's VM. The
> > website is mostly metadata, because of the inline extents, so it
> > exhausted everybody's metadata space. Tens of thousands of machines
> > affected. Of those machines I had to hand boot and run balance on
> > ~20 of them to get them back. The rest could run balance from the
> > automation and recover cleanly.
> Is there a way to mitigate this by reserving space or setting quotas? Users
> running out of space on their laptops because:
> * they downloaded a lot of media
> * they created huge vms
> * some sort of horrible log thing gone awry
> are pretty common in both a) my anecdotal experience helping people
> professionally and personally and b) um, me.
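(For reference, the quota side of that question maps to btrfs qgroups. A rough sketch follows — the mount point and subvolume path are just examples, and note that qgroup limits cap data usage per subvolume rather than reserving metadata space:)

```shell
# Enable quota accounting on the filesystem (one-time setup).
btrfs quota enable /mnt/fs

# Cap the example subvolume at 10GiB of referenced data.
# This limits data usage; it does not set aside metadata space.
btrfs qgroup limit 10G /mnt/fs/home

# Inspect current qgroup limits and usage.
btrfs qgroup show /mnt/fs
```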
There's a difference between data ENOSPC and metadata ENOSPC. And again, this
is a pretty specific failure case. Obviously it's not impossible to hit, but
it's not something that's going to be a common occurrence. The first of the two times
we hit these issues in production was the thing that I mentioned, which had 750GiB
fs's completely full with their 20GiB of metadata completely filled up.
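(To make the data-vs-metadata distinction concrete, here's a small sketch that reads `btrfs filesystem df`-style output; the sample text is invented to mirror the scenario above, where the data chunks still have plenty of room but the metadata chunks are exhausted:)

```python
import re

# Illustrative output of `btrfs filesystem df /` -- numbers are made up
# to resemble the 750GiB-fs / 20GiB-metadata case described above.
SAMPLE = """\
Data, single: total=730.00GiB, used=412.51GiB
System, DUP: total=8.00MiB, used=96.00KiB
Metadata, DUP: total=20.00GiB, used=19.97GiB
GlobalReserve, single: total=512.00MiB, used=512.00MiB
"""

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def to_bytes(s):
    m = re.fullmatch(r"([\d.]+)([KMGT]?i?B)", s)
    return int(float(m.group(1)) * UNITS[m.group(2)])

def parse_df(text):
    rows = {}
    for line in text.splitlines():
        m = re.match(r"(\w+), (\w+): total=([\d.]+\w+), used=([\d.]+\w+)", line)
        if m:
            rows[m.group(1)] = (to_bytes(m.group(3)), to_bytes(m.group(4)))
    return rows

rows = parse_df(SAMPLE)
data_total, data_used = rows["Data"]
meta_total, meta_used = rows["Metadata"]
# Hundreds of GiB of data space free, yet metadata is nearly out --
# this is the shape of a metadata ENOSPC.
print(f"data free: {(data_total - data_used) / 1024**3:.1f} GiB")
print(f"metadata free: {(meta_total - meta_used) / 1024**2:.1f} MiB")
```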
The second was a bad service that was spewing empty files onto the disk, slowly
filling up the metadata chunks, coupled with a bug in how we allocated data and
metadata chunks. The chunk allocation bug has been fixed for a year or two
now. This isn't something a normal user is going to hit most of the time. It
obviously does happen, I'm aware of it, and I've made progress on making it less
likely to get you into a "call Josef" situation. I'm sure there's still
work to be done, but there's continual progress on this particular edge case.
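(For completeness, the recovery I keep referring to is basically a filtered balance: repack mostly-empty data chunks so the freed space can be reallocated as metadata. A sketch — the mount point is an example, and the usage thresholds are common starting points, not gospel:)

```shell
# See how space is split between data and metadata chunks.
btrfs filesystem df /mnt/fs
btrfs filesystem usage /mnt/fs

# Start with completely empty data chunks, which can be reclaimed
# without moving any extents.
btrfs balance start -dusage=0 /mnt/fs

# Then repack data chunks that are <=10% used; the reclaimed
# unallocated space can be handed out as new metadata chunks.
btrfs balance start -dusage=10 /mnt/fs
```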