Hi Ted,

Sorry for pinging again so quickly, but this is quite important for the
container on-demand use cases (and maybe other on-demand distribution
use cases as well).

We still prefer the cachefiles way, since its data plane won't cross the
kernel-userspace boundary when the data is ready locally (and that's the
common case once the data has been fetched from the network).

Many thanks again!
Gao Xiang

On Fri, Feb 18, 2022 at 11:18:14AM +0800, Gao Xiang wrote:
> Hi Ted and David,
> 
> On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> > On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> > > 
> > > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > > filesystem, has recently been exploring the use of fscache on top of
> > > ext4 for caching the network filesystem data for some customer
> > > workloads.
> > > 
> > > However, the maintainer of fscache (David Howells) recently warned us
> > > that a few other extent-based filesystem developers pointed out a
> > > theoretical bug in the current implementation of fscache/cachefiles.
> > > It currently does not maintain separate metadata for the cached data
> > > it holds, but instead uses the sparseness of the underlying filesystem
> > > to track the ranges of the data that is being cached.
> > > The bug that has been pointed out with this is that the underlying
> > > filesystems could bridge holes between data ranges with zeroes, or
> > > punch holes in data ranges that contain zeroes. (@David please add if I
> > > missed something.)
> > > 
> > > David has already begun working on the fix to this by maintaining the
> > > metadata of the cached ranges in fscache itself.
> > > However, since it could take some time for this fix to be approved and
> > > then backported by various distros, I'd like to understand if there is
> > > a potential problem in using fscache on top of ext4 without the fix.
> > > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > > way to disable such optimizations, I think we'll be okay to use the
> > > older versions of fscache even without the fix mentioned above.
> > 
> > Yes, the tuning knob you are looking for is:
> > 
> > What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
> > Date:		August 2012
> > Contact:	"Theodore Ts'o" <tytso@xxxxxxx>
> > Description:
> > 		The maximum number of kilobytes which will be zeroed
> > 		out in preference to creating a new uninitialized
> > 		extent when manipulating an inode's extent tree.  Note
> > 		that using a larger value will increase the
> > 		variability of time necessary to complete a random
> > 		write operation (since a 4k random write might turn
> > 		into a much larger write due to the zeroout
> > 		operation).
> > 
> > (From Documentation/ABI/testing/sysfs-fs-ext4)
> > 
> > The basic idea here is that with a random write workload on HDDs, the
> > cost of writing a 16k random write is not much more than the time to
> > write a 4k random write; that is, the cost of HDD seeks dominates.
> > There is also a cost in having many additional entries in the extent
> > tree.  So if we have a fallocated region, e.g.:
> > 
> >     +-------------+---+---+---+----------+---+---+---------+
> > ... + Uninit (U)  | W | U | W |  Uninit  | W | U | Written | ...
> >     +-------------+---+---+---+----------+---+---+---------+
> > 
> > It's more efficient to have the extent tree look like this
> > 
> >     +-------------+-----------+----------+---+---+---------+
> > ... + Uninit (U)  |  Written  |  Uninit  | W | U | Written | ...
> >     +-------------+-----------+----------+---+---+---------+
> > 
> > And just simply write zeros to the first "U" in the above figure.
> > 
> > The default value of extent_max_zeroout_kb is 32k.  This optimization
> > can be disabled by setting extent_max_zeroout_kb to 0.  The downside
> > of this is a potential degradation of a random write workload (using,
> > for example, the fio benchmark program) on that file system.
> 
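> If I read the sysfs documentation above correctly, that optimization
> can be turned off from userspace with something like the rough sketch
> below.  This is only illustrative and untested; "sda1" is just a
> placeholder for whichever block device actually backs the cache:
> 
> 	#include <stdio.h>
> 	#include <stdlib.h>
> 
> 	int main(void)
> 	{
> 		/* Knob from Documentation/ABI/testing/sysfs-fs-ext4;
> 		 * "sda1" is a placeholder for the real device name.
> 		 */
> 		const char *knob = "/sys/fs/ext4/sda1/extent_max_zeroout_kb";
> 		FILE *f = fopen(knob, "w");
> 
> 		if (!f) {
> 			perror("fopen");
> 			return EXIT_FAILURE;
> 		}
> 		/* Writing 0 disables the extent zeroout optimization. */
> 		if (fprintf(f, "0\n") < 0 || fclose(f) == EOF) {
> 			perror(knob);
> 			return EXIT_FAILURE;
> 		}
> 		return EXIT_SUCCESS;
> 	}
> 
> Of course, as Ted notes, setting it to 0 trades away the random-write
> optimization for the whole filesystem, so it is only a stopgap.
> 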
> As far as I understand what cachefiles does, it just truncates a sparse
> file with a big hole and then does direct I/O _only_, all the time, to
> fill the holes.
> 
> But the description above is all about (un)written extents, which
> already have physical blocks allocated, just without data
> initialization, so we could zero out the middle extent and merge
> these extents into one bigger written extent.
> 
> However, IMO, that's not the case with the current cachefiles
> behavior...  I think it's rare for a local fs to allocate blocks for
> real holes under direct I/O and then zero out and merge extents, since
> at least that touches disk quota.
> 
> David pointed to this message yesterday, since we're doing an on-demand
> read feature by using cachefiles as well.  But I still fail to
> understand why the current cachefiles behavior is wrong.
> 
> Could you kindly leave more hints about this?  Many thanks!
> 
> Thanks,
> Gao Xiang
> 
> > Cheers,
> > 
> > 					- Ted
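
P.S. For reference, the cachefiles-style access pattern described above
(a sparse backing file whose holes are filled with direct I/O once data
arrives from the network) can be sketched roughly as follows.  This is
only an illustration, not the actual cachefiles code; the path, offsets
and sizes are made-up placeholders, and O_DIRECT requires suitably
aligned buffers and offsets:

	#define _GNU_SOURCE	/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t blksz = 4096;	/* assume a 4k logical block size */
		void *buf = NULL;
		int fd = open("./backing-file",
			      O_RDWR | O_CREAT | O_DIRECT, 0600);

		if (fd < 0)
			return EXIT_FAILURE;

		/* Size the object up front; the whole file is one big hole. */
		if (ftruncate(fd, 1024 * 1024 * 1024) < 0)
			goto err;

		/* Fill one block-aligned range of the hole, standing in for
		 * data that has just been fetched from the network.
		 */
		if (posix_memalign(&buf, blksz, blksz))
			goto err;
		memset(buf, 0xab, blksz);
		if (pwrite(fd, buf, blksz, 64 * blksz) != (ssize_t)blksz)
			goto err;

		free(buf);
		close(fd);
		return EXIT_SUCCESS;
	err:
		free(buf);
		close(fd);
		return EXIT_FAILURE;
	}

Whether the underlying filesystem later bridges or zero-fills around
such written ranges is exactly the behavior being asked about above.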