Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning

On 07/29/2013 05:06 PM, Lennart Poettering wrote:
> On Mon, 29.07.13 16:52, Ric Wheeler (rwheeler@xxxxxxxxxx) wrote:

> Oh, we don't assume it's all ours. We recheck regularly, immediately
> before appending to the journal files, of course assuming that we are
> not the only writers.
>
>> With thinly provisioned storage (or things like btrfs, writeable
>> snapshots, etc), you will not really ever know how much space is
>> really there.
>
> Yeah, and that's an API regression.

It is actually not an API regression; this is how file systems have always operated on enterprise storage (including writeable snapshots) and, for all practical purposes, whenever you are running in a multi-application environment.

In effect, there never was an API that gave you what you want, outside of the write(2) system call itself :)
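
To make that concrete - the paths and sizes below are made up - the best any application can do today is treat statvfs() as a hint and handle ENOSPC from the write itself:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vfs;

        /* statvfs() gives you an estimate - nothing more */
        if (statvfs("/var/log", &vfs) == 0)
            printf("estimated free: %llu bytes\n",
                   (unsigned long long)vfs.f_bavail * vfs.f_frsize);

        /* the write itself is the only authoritative answer */
        int fd = open("/var/log/example.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;
        char buf[4096] = { 0 };
        if (write(fd, buf, sizeof(buf)) < 0 && errno == ENOSPC)
            fprintf(stderr, "out of space despite the estimate: %s\n",
                    strerror(errno));
        close(fd);
        return 0;
    }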

> On btrfs you can just add/remove devices as you wish during runtime, and
> statvfs() does reflect this immediately.

btrfs consumes new space on each write, even when writing to the same logical block.

If you have a 10GB file system with an existing 5GB log file and overwrite it twice in place, you can run out of space: each overwrite allocates new blocks first, and the old ones are only reclaimed later (or never, if a snapshot still references them).


> thinp should work the same. Of course, this requires that the block
> layer has to pass more metadata up to the file systems than before, but
> there's really nothing intrinsically evil about that. I mean, it could be
> as basic as just passing along a "provisioning percentage" or so which
> the fs will simply multiply into the returned values... (Of
> course it won't be that simple, but you get the concept...)
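
For illustration only - no such interface exists, and the "factor" below is invented - the idea quoted above would amount to something like:

    #include <sys/statvfs.h>

    /* Hypothetical sketch: nothing exports a "provisioning factor"
     * today.  If the block layer told the fs that, say, only 25% of
     * the advertised pool is physically backed, the fs could scale
     * what statvfs() reports accordingly. */
    void scale_by_provisioning(struct statvfs *vfs, double factor)
    {
        vfs->f_bfree  = (fsblkcnt_t)(vfs->f_bfree  * factor);
        vfs->f_bavail = (fsblkcnt_t)(vfs->f_bavail * factor);
    }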

I would argue that it is working as it should. If you want fully provisioned storage and have a single-application, single-user file system, you can configure your box that way.

Thin provisioned storage - by design - has a pool of real storage that is shared across all file systems sitting on the devices it serves. On SAN volumes, that means precisely that you share the physical storage pool across multiple hosts and all of their file systems.

The way it works assumes:

* the system administrator understands thin provisioned storage and the system workload to some rough level
* the sys admin sets the water marks appropriately, so that when we hit a low water mark we can add physical storage to the pool (see the sketch below)
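
For the water mark piece, dm-thin pools can already be monitored and auto-extended by dmeventd. A rough sketch of the relevant lvm.conf knobs (the threshold values are examples only; tune them to your workload, and the volume group needs free extents to grow into):

    # lvm.conf, activation section - example values only
    activation {
        # have dmeventd watch the thin pool
        monitoring = 1
        # when pool usage crosses 70%, grow the pool by 20%
        thin_pool_autoextend_threshold = 70
        thin_pool_autoextend_percent = 20
    }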

There is no magic pony here - if you configure thin provisioning, you mean to use it to lie to the users and their file systems, for a valid reason.

Applications can do whatever they want as long as the sys admin monitors the box properly and has a way to add storage when needed.

Think "just in time" storage provisioning.


>>> I am starting to think that this is critical enough that we might
>>> want to always fully provision this - just like we would for audit
>>> logs....
>>
>> Checking won't hurt anything, but the storage stack will lie to you
>> (and honestly, we always have in many cases :)).
>
> Well, journald is totally fine if it is lied to, in the sense that the
> values returned by statfs()/statvfs() are just estimates, and not
> precise. However, it is assumed that the values are not off by > 100%, as
> they might be on thinp...

Or on btrfs, or on copy-on-write LVM snapshots (not just ours, but hardware LVM as well), etc.

Or if a large application is running that is about to pre-allocate the rest of the free space.

The heuristic you assume works only in the most constrained of use cases.
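
To make the pre-allocation point concrete, a toy sketch (hypothetical mount point; one process plays both roles here):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    /* Demonstrates that a statvfs() reading goes stale the moment any
     * other application allocates space. */
    int main(void)
    {
        struct statvfs vfs;
        if (statvfs("/srv/data", &vfs) != 0)
            return 1;
        unsigned long long before =
            (unsigned long long)vfs.f_bavail * vfs.f_frsize;
        printf("free before: %llu bytes\n", before);

        /* "the other application" pre-allocates half the free space */
        int fd = open("/srv/data/prealloc", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0 && posix_fallocate(fd, 0, (off_t)(before / 2)) == 0) {
            statvfs("/srv/data", &vfs);
            printf("free after:  %llu bytes - first reading now useless\n",
                   (unsigned long long)vfs.f_bavail * vfs.f_frsize);
        }
        if (fd >= 0)
            close(fd);
        return 0;
    }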


> That the values are not perfectly accurate has been known forever. For as
> long as file systems have existed, developers have known that book-keeping
> and such means the returned values are slightly higher than practically
> reachable. And since compressed file systems appeared, they have also
> known that the values might be lower than actually reachable. However,
> it's one thing to return bad estimates, and it is another thing to be
> totally off in the woods, as is the case for thinp!

This is not new or unique to thinp.


>> There are some alerts that we can raise when you hit a low water
>> mark for the device mapper physical pool, it would be interesting to
>> talk about how you might leverage these.
>
> Well, the point I am making is that it is wrong to ask userspace to
> handle this. Get the APIs you expose to userspace right.
>
> I mean, ultimately for me it doesn't matter, I guess, since you say
> neither the fs/block layer nor userspace should care, but that this is
> the admin's problem - but that really sounds like chickening out to
> me...


Not chickening out, just working as designed. If you don't like this, you need to use traditional, fully provisioned storage and avoid copy-on-write technologies (like btrfs or LVM writeable snapshots).

Apparently we have lied to you so well over the years that you just never noticed the reality of many other misleading IO stack configurations :)

Ric

--
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct