Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning

Ric Wheeler rwheeler at redhat.com
Mon Jul 29 22:32:03 UTC 2013


On 07/29/2013 05:06 PM, Lennart Poettering wrote:
> On Mon, 29.07.13 16:52, Ric Wheeler (rwheeler at redhat.com) wrote:
>
>>> Oh, we don't assume it's all ours. We recheck regularly, immediately
>>> before appending to the journal files, of course assuming that we are
>>> not the only writers.
>> With thinly provisioned storage (or things like btrfs, writeable
>> snapshots, etc), you will not really ever know how much space is
>> really there.
> Yeah, and that's an API regression.

It is actually not an API regression; this is how file systems have always 
operated on enterprise storage (including writeable snapshots) and, for all 
practical purposes, whenever you are running in a multi-application environment.

In effect, there never was an API that gave you what you want outside of the 
"write(2)" system call :)
>
> On btrfs you can just add/remove device as you wish during runtime and
> statvfs() does reflect this immediately.

btrfs, being copy on write, consumes new space on every write, even when you 
overwrite the same logical block.

If you have a 10GB file system with an existing 5GB log file and overwrite it 
in place twice, you will run out of space.
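
A quick way to see this for yourself is to compare statvfs() before and after 
an in-place overwrite. This is only an illustration I am adding here; the path 
and the 64MB size are arbitrary. On an overwrite-in-place file system the 
free-space numbers barely move; on btrfs (or a thin volume) they can drop by 
roughly the full size of the write:

    /* Overwrite an existing region of a file in place and compare the
     * statvfs() free-space numbers before and after. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    #define CHUNK (64UL * 1024 * 1024)

    static unsigned long long avail_bytes(const char *path)
    {
        struct statvfs sv;
        if (statvfs(path, &sv) < 0) {
            perror("statvfs");
            exit(1);
        }
        return (unsigned long long)sv.f_bavail * sv.f_frsize;
    }

    int main(void)
    {
        const char *path = "/var/tmp/cow-demo.dat";
        char *buf = calloc(1, CHUNK);
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (!buf || fd < 0) {
            perror("setup");
            return 1;
        }
        /* Make sure the file already has CHUNK bytes allocated. */
        if (pwrite(fd, buf, CHUNK, 0) != (ssize_t)CHUNK || fsync(fd) < 0) {
            perror("initial write");
            return 1;
        }
        unsigned long long before = avail_bytes(path);

        /* Overwrite the same logical range in place. */
        memset(buf, 0xab, CHUNK);
        if (pwrite(fd, buf, CHUNK, 0) != (ssize_t)CHUNK || fsync(fd) < 0) {
            perror("overwrite");
            return 1;
        }
        unsigned long long after = avail_bytes(path);

        printf("available before: %llu  after: %llu  delta: %lld\n",
               before, after, (long long)before - (long long)after);
        unlink(path);
        close(fd);
        free(buf);
        return 0;
    }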

>
> thinp should work the same. Of course, this requires that the block
> layer has to pass more metadata up to the file systems than before, but
> there's really nothing intrinsically evil about that, I mean, it could be
> as basic as just passing along a "provisioning percentage" or so which
> the fs will simply multiply into the returned values... (Of
> course it won't be that simple, but you get the concept...)

I would argue that it is working as it should. If you want fully provisioned 
storage and are running a single-application/single-user file system, you can 
configure your box that way.

Thin provisioned storage - by design - has a pool of real storage that is shared 
across all file systems that sit on devices that it serves.  On SAN volumes, 
that means you share the physical storage pool across multiple hosts and all 
of their file systems.

The way it works assumes:

* the system administrator understands thin provisioned storage and the system 
workload to some rough level
* the sys admin sets the water marks appropriately so that when we hit a low 
water mark, we can add physical storage to the pool

There is no magic pony here for you - if you configure thin provisioning, you 
are choosing to lie to the users and their file systems for a valid reason.

Applications can do whatever they want as long as the sys admin monitors the box 
properly and has a way to add storage when needed.

Think "just in time" storage provisioning.

>
>> I am starting to think that this is critical enough that we might
>> want to always fully provision this - just like we would for audit
>> logs....
>>
>> Checking won't hurt anything, but the storage stack will lie to you
>> (and honestly, we always have in many cases :)).
> Well, journald is totally fine if it is lied to in the sense that the
> values returned by statfs()/statvfs() are just estimates, and not
> precise. However, it is assumed that the values are not off by > 100% as
> they might be on thinp...

Or on btrfs or on copy on write LVM (not just ours, but hardware LVM) snapshots, 
etc.

Or if a large application is running that is about to pre-allocate the rest of 
the free space.

The heuristic you assume does not work in any but the most constrained of all 
use cases.
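
To illustrate one such failure mode (again just a sketch I am adding, with a 
made-up path and sizes): read statvfs(), then try to actually reserve part of 
that space. Even the reservation can fail if another writer got there first, 
and on thin or COW storage a successful fallocate still does not guarantee 
physically backed blocks:

    /* statvfs() tells you what was free a moment ago; only acting on the
     * space tells you whether it is still there. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int main(void)
    {
        struct statvfs sv;
        if (statvfs("/var/tmp", &sv) < 0) {
            perror("statvfs");
            return 1;
        }
        unsigned long long avail =
            (unsigned long long)sv.f_bavail * sv.f_frsize;
        printf("statvfs says %llu bytes available\n", avail);

        /* Try to actually reserve half of it.  This can still fail with
         * ENOSPC if another process grabbed the space in the meantime,
         * and on thin/COW storage success does not mean the blocks are
         * physically backed. */
        int fd = open("/var/tmp/reserve.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        int r = posix_fallocate(fd, 0, (off_t)(avail / 2));
        if (r != 0)
            fprintf(stderr, "reservation failed: %s\n", strerror(r));
        else
            printf("reserved %llu bytes (for now)\n", avail / 2);

        unlink("/var/tmp/reserve.dat");
        close(fd);
        return r ? 1 : 0;
    }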

>
> That the values are not perfectly accurate has been known forever. Since
> file systems existed developers knew that book-keeping and stuff means
> the returned values are slightly higher than practically reachable. And
> since compressed file systems they also knew that they might be lower
> than actually reachable. However, it's one thing to return bad
> estimates, and it is another thing to be totally off in the woods as is
> the case for thinp!

This is not new or unique to thinp.

>
>> There are some alerts that we can raise when you hit a low water
>> mark for the device mapper physical pool, it would be interesting to
>> talk about how you might leverage these.
> Well, the point I am making is that it is wrong to ask userspace to
> handle this. Get the APIs right you expose to userspace.
>
> I mean, ultimately for me it doesn't matter I guess, since you say
> neither the fs/block layer nor userspace should care, but that this is
> the admin's problem, but that really sounds like chickening out to
> me...
>

Not chickening out, just working as designed. If you don't like this, you need 
to use traditional, fully provisioned storage and not use copy on write 
technologies (like btrfs or LVM writeable snapshots).

Apparently we have lied to you so well over the years that you just never 
noticed the reality of many other misleading IO stack configurations :)

Ric


