On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> Hi Eric,
>
> Thank you for your quick reply.
>
> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> However, I need to falloc for 256K and write for 257K to see this issue.
>
> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> # stat /srv/node/r1/t1.txt | grep Blocks
>   Size: 263168    Blocks: 1536    IO Block: 4096   regular file

Fallocate sets the XFS_DIFLAG_PREALLOC flag on the inode. When you write
*past the preallocated area* and do delayed allocation, the speculative
preallocation beyond EOF is double the size of the extent at EOF, i.e.
512k, leading to 768k being allocated to the file (1536 blocks, exactly).
This is expected behaviour.

> # xfs_io -f -c "pwrite 0 257k" /srv/node/r1/t2.txt
> # stat /srv/node/r1/t2.txt | grep Blocks
>   Size: 263168    Blocks: 520     IO Block: 4096   regular file

So with pure delayed allocation, speculative preallocation starts at a
64k file size, so it would have been (((64k + 64K) + 128K) + 256k) = 768k.

> I waited for around 15 mins before collecting the stat output to give the
> background reclamation logic a fair chance to do its job. I also tried
> changing the value of speculative_prealloc_lifetime from 300 to 10. But it
> was of no use.

The prealloc cleaner skips inodes with XFS_DIFLAG_PREALLOC set on them.
Because the XFS_DIFLAG_PREALLOC flag is not set on the delayed allocation
inode, the EOF blocks cleaner runs, truncates it to EOF, and 260k (520
blocks) remains allocated to the file.

i.e. you are seeing behaviour exactly as designed and intended.

The way swift is using fallocate is actively harmful. You do not want
preallocation for write-once files - this is exactly the workload that
delayed allocation was designed to be optimal for, as delayed allocation
sequentialises the IO from multiple files. Using preallocation means
writeback of the data cannot be optimised across files, as the
preallocation location will not be sequential to the IO that was just
issued, hence writeback will seek the disks back and forth instead of
seeing a nice sequential IO stream.

<sigh>

Yet another way that the swift storage back end tries to be smart but
ends up just making things go slow....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
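
A minimal way to reproduce the fallocate case discussed above and inspect
what XFS actually allocated, assuming an XFS filesystem mounted at
/srv/node/r1 as in the quoted commands; the file name prealloc.txt is
illustrative, and the values in the annotations are expectations that
follow from the explanation above rather than captured output:

# xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/prealloc.txt
# xfs_io -c "stat" /srv/node/r1/prealloc.txt
        (fsxattr.xflags should show the preallocation flag ('p'), i.e.
         XFS_DIFLAG_PREALLOC is set, so the EOF blocks cleaner skips this inode)
# xfs_bmap -v /srv/node/r1/prealloc.txt
        (extent map; the speculative preallocation beyond EOF should show up
         here, with the total allocation around 768k)
# stat -c "size=%s blocks=%b" /srv/node/r1/prealloc.txt
        (expect blocks=1536, i.e. 1536 x 512-byte units = 768k)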
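
And the pure delayed allocation case, again only a sketch with an
illustrative file name; the wait is simply to give the background EOF
blocks cleaner a chance to run, standing in for the 15-minute wait
described in the quoted mail:

# xfs_io -f -c "pwrite 0 257k" /srv/node/r1/delalloc.txt
# stat -c "size=%s blocks=%b" /srv/node/r1/delalloc.txt
        (immediately after the write, blocks may still include the
         speculative preallocation beyond EOF)
# cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
        (defaults to 300 seconds; the quoted mail lowered it to 10)
# sleep 600
# stat -c "size=%s blocks=%b" /srv/node/r1/delalloc.txt
        (no XFS_DIFLAG_PREALLOC on this inode, so the EOF blocks cleaner trims
         it back to EOF; expect blocks=520, i.e. 260k, matching the quoted output)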