Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning

On 07/29/2013 05:06 PM, Lennart Poettering wrote:
> On Mon, 29.07.13 16:52, Ric Wheeler (rwheeler@xxxxxxxxxx) wrote:

> Oh, we don't assume it's all ours. We recheck regularly, immediately
> before appending to the journal files, of course assuming that we are
> not the only writers.
>
>> With thinly provisioned storage (or things like btrfs, writeable
>> snapshots, etc), you will not really ever know how much space is
>> really there.
>
> Yeah, and that's an API regression.

It is actually not an API regression; this is how file systems have always operated on enterprise storage (including writeable snapshots) and, for all practical purposes, whenever you are running in a multi-application environment.

In effect, there never was an API that gave you what you want, outside of the write(2) system call itself :)
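
To make that concrete - the paths and sizes below are made up - the best any application can do today is treat statvfs() as a hint and handle ENOSPC from the write itself:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vfs;

        /* statvfs() gives you an estimate - nothing more */
        if (statvfs("/var/log", &vfs) == 0)
            printf("estimated free: %llu bytes\n",
                   (unsigned long long)vfs.f_bavail * vfs.f_frsize);

        /* the write itself is the only authoritative answer */
        int fd = open("/var/log/example.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;
        char buf[4096] = { 0 };
        if (write(fd, buf, sizeof(buf)) < 0 && errno == ENOSPC)
            fprintf(stderr, "out of space despite the estimate: %s\n",
                    strerror(errno));
        close(fd);
        return 0;
    }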

> On btrfs you can just add/remove devices as you wish during runtime, and
> statvfs() does reflect this immediately.

btrfs consumes new space on each write, even when writing to the same logical block.

If you have a 10GB file system with an existing 5GB log file and overwrite it twice in place, you can run out of space: each overwrite allocates new blocks first, and the old ones are only reclaimed later (or never, if a snapshot still references them).


> thinp should work the same. Of course, this requires that the block
> layer has to pass more metadata up to the file systems than before, but
> there's really nothing intrinsically evil about that. I mean, it could be
> as basic as just passing along a "provisioning percentage" or so which
> the fs will simply multiply into the returned values... (Of
> course it won't be that simple, but you get the concept...)
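
For illustration only - no such interface exists, and the "factor" below is invented - the idea quoted above would amount to something like:

    #include <sys/statvfs.h>

    /* Hypothetical sketch: nothing exports a "provisioning factor"
     * today.  If the block layer told the fs that, say, only 25% of
     * the advertised pool is physically backed, the fs could scale
     * what statvfs() reports accordingly. */
    void scale_by_provisioning(struct statvfs *vfs, double factor)
    {
        vfs->f_bfree  = (fsblkcnt_t)(vfs->f_bfree  * factor);
        vfs->f_bavail = (fsblkcnt_t)(vfs->f_bavail * factor);
    }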

I would argue that it is working as it should. If you want fully provisioned storage and have a single-application, single-user file system, you can configure your box that way.

Thin provisioned storage - by design - has a pool of real storage that is shared across all file systems sitting on the devices it serves. On SAN volumes, that means precisely that you share the physical storage pool across multiple hosts and all of their file systems.

The way it works assumes:

* the system administrator understands thin provisioned storage and the system workload to some rough level
* the sys admin sets the water marks appropriately, so that when we hit a low water mark we can add physical storage to the pool (see the sketch below)
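
For the water mark piece, dm-thin pools can already be monitored and auto-extended by dmeventd. A rough sketch of the relevant lvm.conf knobs (the threshold values are examples only; tune them to your workload, and the volume group needs free extents to grow into):

    # lvm.conf, activation section - example values only
    activation {
        # have dmeventd watch the thin pool
        monitoring = 1
        # when pool usage crosses 70%, grow the pool by 20%
        thin_pool_autoextend_threshold = 70
        thin_pool_autoextend_percent = 20
    }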

There is no magic pony here - if you configure thin provisioning, you mean to use it to lie to the users and their file systems, for a valid reason.

Applications can do whatever they want as long as the sys admin monitors the box properly and has a way to add storage when needed.

Think "just in time" storage provisioning.


>>> I am starting to think that this is critical enough that we might
>>> want to always fully provision this - just like we would for audit
>>> logs....
>>
>> Checking won't hurt anything, but the storage stack will lie to you
>> (and honestly, we always have in many cases :)).
>
> Well, journald is totally fine if it is lied to, in the sense that the
> values returned by statfs()/statvfs() are just estimates, and not
> precise. However, it is assumed that the values are not off by > 100%, as
> they might be on thinp...

Or on btrfs, or on copy-on-write LVM snapshots (not just ours, but hardware LVM as well), etc.

Or if a large application is running that is about to pre-allocate the rest of the free space.

The heuristic you assume works only in the most constrained of use cases.
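
To make the pre-allocation point concrete, a toy sketch (hypothetical mount point; one process plays both roles here):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    /* Demonstrates that a statvfs() reading goes stale the moment any
     * other application allocates space. */
    int main(void)
    {
        struct statvfs vfs;
        if (statvfs("/srv/data", &vfs) != 0)
            return 1;
        unsigned long long before =
            (unsigned long long)vfs.f_bavail * vfs.f_frsize;
        printf("free before: %llu bytes\n", before);

        /* "the other application" pre-allocates half the free space */
        int fd = open("/srv/data/prealloc", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0 && posix_fallocate(fd, 0, (off_t)(before / 2)) == 0) {
            statvfs("/srv/data", &vfs);
            printf("free after:  %llu bytes - first reading now useless\n",
                   (unsigned long long)vfs.f_bavail * vfs.f_frsize);
        }
        if (fd >= 0)
            close(fd);
        return 0;
    }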


> That the values are not perfectly accurate has been known forever. For as
> long as file systems have existed, developers have known that book-keeping
> and such means the returned values are slightly higher than practically
> reachable. And since compressed file systems appeared, they have also
> known that the values might be lower than actually reachable. However,
> it's one thing to return bad estimates, and it is another thing to be
> totally off in the woods, as is the case for thinp!

This is not new or unique to thinp.


>> There are some alerts that we can raise when you hit a low water
>> mark for the device mapper physical pool, it would be interesting to
>> talk about how you might leverage these.
>
> Well, the point I am making is that it is wrong to ask userspace to
> handle this. Get the APIs you expose to userspace right.
>
> I mean, ultimately for me it doesn't matter, I guess, since you say
> neither the fs/block layer nor userspace should care, but that this is
> the admin's problem - but that really sounds like chickening out to
> me...


Not chickening out, just working as designed. If you don't like this, you need to use traditional, fully provisioned storage and avoid copy-on-write technologies (like btrfs or LVM writeable snapshots).

Apparently we have lied to you so well over the years that you just never noticed the reality of many other misleading IO stack configurations :)

Ric

--
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct