Re: [RFC PATCH 0/9] dm-thin/xfs: prototype a block reservation allocation model

On Mon, Mar 21, 2016 at 02:33:46PM +0100, Carlos Maiolino wrote:
> Hi.
> 
> From my point of view, I like the idea of an interface between the filesystem,
> and the thin-provisioned device, so that we can actually know if the thin
> volume is running out of space or not, but, before we actually start to discuss
> how this should be implemented, I'd like to ask if this should be implemented.

TL;DR: No-brainer, yes.

> After a few days discussing this with some block layer and dm-thin developers,
> what I most hear/read is that a thin volume should be transparent to the
> filesystem. So, the filesystem itself should not know it's running over a
> thin-provisioned volume. And such interface being discussed here, breaks this
> abstraction.

We're adding things like fallocate to block devices to control
preallocation, zeroing and freeing of ranges within the block device
from user space. If filesystems can't directly control and query
block device ranges on thinp block devices, then why should we let
userspace have this capability?

The problem we need to solve is that users want transparency between
filesystems and thinp devices. They don't want the filesystem to
tell them they have lots of space available, and then get unexpected
ENOSPC because the thinp pool backing the fs has run out of space.
Users don't want a write over a region they have run
posix_fallocate() on to return ENOSPC because the thinp pool ran out
of space, even after the filesystem said it guaranteed space was
available. Filesystems want to know that they should run fstrim
passes internally when the underlying thinp pool is running out of
space so that they can free as much unused space as possible.
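
As a rough user-space illustration of the posix_fallocate() case
above (a sketch only, not code from the patchset), the expectation is
that once preallocation succeeds, overwriting that range never fails
with ENOSPC. check_prealloc_write() is a hypothetical helper name;
posix_fallocate(), pwrite() and fsync() are the standard POSIX calls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate the range [0, len) of fd, then overwrite that range.
 * Returns 0 on success or a positive errno value on failure.
 * Hypothetical helper, used here only to illustrate the expectation.
 */
static int check_prealloc_write(int fd, const char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* posix_fallocate() returns the error number directly, not via errno. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err)
        return err;

    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            /* ENOSPC here is exactly the surprise users object to. */
            return errno;
        }
        if (n == 0)
            return EIO; /* no progress; avoid looping forever */
        done += (size_t)n;
    }

    /* Flush so that a failure deferred to writeback is reported here. */
    if (fsync(fd) < 0)
        return errno;
    return 0;
}
```

On a filesystem sitting on an exhausted thinp pool, the failure may
only show up at the pwrite() or fsync() step, well after the
filesystem has promised the space.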

So there are lots of reasons why we need closer functional integration
of the filesystem and block layers, but doing this does not need to
break the abstraction layer between the filesystem and block device.
Indeed, we already have mechanisms to provide block layer
functionality to the filesystems, and this patchset uses one of them -
the bdev ops structure.
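
To make the idea concrete, here is a minimal user-space model of a
reservation interface exposed through an ops table. This is a sketch
under stated assumptions, not the interface the patchset actually
adds: struct thin_pool, struct space_ops and every function name
below are hypothetical. The point it shows is that the filesystem
asks for space when it accepts a write, so ENOSPC is reported up
front rather than at writeback.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a thin pool's space accounting, in blocks. */
struct thin_pool {
    uint64_t free_blocks;     /* neither allocated nor promised */
    uint64_t reserved_blocks; /* promised to the fs, not yet written */
};

/* Hypothetical ops table, loosely analogous to a bdev ops structure. */
struct space_ops {
    int (*reserve)(struct thin_pool *pool, uint64_t nr);
    int (*provision)(struct thin_pool *pool, uint64_t nr);
    void (*release)(struct thin_pool *pool, uint64_t nr);
};

/* Set aside nr blocks, or fail now if the pool cannot back them. */
static int pool_reserve(struct thin_pool *pool, uint64_t nr)
{
    if (nr > pool->free_blocks)
        return -ENOSPC;
    pool->free_blocks -= nr;
    pool->reserved_blocks += nr;
    return 0;
}

/* Turn reserved blocks into allocated ones; never returns -ENOSPC. */
static int pool_provision(struct thin_pool *pool, uint64_t nr)
{
    if (nr > pool->reserved_blocks)
        return -EINVAL;
    pool->reserved_blocks -= nr;
    return 0;
}

/* Hand back an unused reservation, e.g. after a truncate. */
static void pool_release(struct thin_pool *pool, uint64_t nr)
{
    assert(nr <= pool->reserved_blocks);
    pool->reserved_blocks -= nr;
    pool->free_blocks += nr;
}

static const struct space_ops thin_space_ops = {
    .reserve = pool_reserve,
    .provision = pool_provision,
    .release = pool_release,
};

/* Filesystem side: reserve before accepting dirty data. */
static int fs_accept_write(const struct space_ops *ops,
                           struct thin_pool *pool, uint64_t nr)
{
    return ops->reserve(pool, nr);
}

/* Filesystem side: writeback consumes the earlier reservation. */
static int fs_writeback(const struct space_ops *ops,
                        struct thin_pool *pool, uint64_t nr)
{
    return ops->provision(pool, nr);
}
```

The filesystem only ever talks to the ops table; it does not need to
know how the pool manages its space, so the layering stays intact.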

Just because the filesystem knows that the underlying device has
its own space management and it has to interact with it to give
users the correct results does not mean we are "breaking layering
abstractions". Filesystems have long assumed that the LBA space
presented by the block device is a physical representation of the
underlying device.

We know this is not true, and has not been true for a long time.
Most devices really present a virtual LBA space to the higher
layers, and manipulate their underlying "physical" storage in a
manner that suits them best. SSDs do this, thinp does this, RAID
does this, dedupe/compressing/encrypting storage does this, etc.
IOWs, we've got virtual LBA abstractions right through the storage
stack, whether the higher layers realise it or not.

IOWs, we know that filesystems have been using virtual LBA address
spaces for a long time, yet we keep a block device model that
treats them as a physical, unchangeable address space with known
physical characteristics (e.g. seek time is correlated with LBA
distance). We need to stop thinking of block devices as linear
devices and start treating them as they really are - a set of
devices capable of complex management operations, and we need
to start exposing those management operations for the higher layer
to be able to take advantage of.

Filesystems can take advantage of block devices that expose some of
their space management operations. We can make the interactions
users have on these storage stacks much better if we expose smarter
primitives from the block devices to the filesystems. We don't need
to break or change any abstractions - the filesystem is still very
much separate from the block device - but we need to improve the
communications and functionality channels between them.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx