Re: fallocate mode flag for "unshare blocks"?

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 31 Mar 2016 13:47:50 -0600

On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
>> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
>>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
>>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
>>>>> Or is it ok that fallocate could block, potentially for a long time as
>>>>> we stream cows through the page cache (or however unshare works
>>>>> internally)?  Those same programs might not be expecting fallocate to
>>>>> take a long time.
>>>> 
>>>> Yes, it's perfectly fine for fallocate to block for long periods of
>>>> time. See what gfs2 does during preallocation of blocks - it ends up
>>>> calling sb_issue_zeroout() because it doesn't have unwritten
>>>> extents, and hence can block for long periods of time....
>>> 
>>> gfs2 fallocate is an implementation that will cause all but the most
>>> trivial users real pain.  Even the initial XFS implementation just
>>> marking the transactions synchronous made it unusable for all kinds
>>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
>>> to gfs2 will probably hang your connection for extended periods of
>>> time.
>>> 
>>> If we need to support something like what gfs2 does we should have a
>>> separate flag for it.
>> 
>> Using fallocate() for preallocation was always intended to
>> be a faster, more efficient method of allocating zeroed space
>> than having userspace write blocks of data. Faster, more efficient
>> does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
>> that if the hardware has zeroing offloads (deterministic trim, write
>> same, etc) it will use them, and that will be much faster than
>> writing zeros from userspace.
>> 
>> IMO, what gfs2 does is definitely within the intended usage of
>> fallocate() for accelerating the preallocation of blocks.
>> 
>> Yes, it may not be optimal for things like NFS servers which haven't
>> considered that a fallocate based offload operation might take some
>> time to execute, but that's not a problem with fallocate. i.e.
>> that's a problem with the nfs server ALLOCATE implementation not
>> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
>> and timeouts while the operation is run....
> 
> That's an interesting idea, but I don't think it's really legal.  I take
> JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> later and it might succeed", not "OK, I'm working on it, try again and
> you may find out I've done it".
> 
> So if the client gets a JUKEBOX error but the server goes ahead and does
> the operation anyway, that'd be unexpected.

Well, the tape continued to be mounted in the background and/or the file
restored from the tape into the filesystem...

> I suppose it's comparable to the case where a slow fallocate is
> interrupted--would it be legal to return EINTR in that case and leave
> the application to sort out whether some part of the allocation had
> already happened?

If the later fallocate() was not re-doing the same work as the first one,
it should be fine for the client to re-send the fallocate() request.  The
fallocate() to reserve blocks does not touch the blocks that are already
allocated, so this is safe to do even if another process is writing to the
file.  If you have multiple processes writing and calling fallocate() with
PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then
the application is in for a world of hurt already.

> Would it be legal to continue the fallocate under the covers even after
> returning EINTR?

That might produce unexpected results in some cases, but it depends on
the options used.  Probably the safest is to not continue, and depend on
userspace to retry the operation on EINTR.  For fallocate() doing prealloc
or punch or zero this should eventually complete even if it is slow.
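
The retry-on-EINTR approach above can be sketched in userspace as follows.
This is only an illustration, assuming Linux with glibc; fallocate_retry()
and its max_retries parameter are hypothetical names, not an existing API.
It is meant for modes where repeating the call is harmless (plain
preallocation, punch, zero). A repeated FALLOC_FL_COLLAPSE_RANGE or
FALLOC_FL_INSERT_RANGE would shift data a second time if the first attempt
had already taken effect, so it should not be used for those.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: call fallocate() and re-issue it when a signal
 * interrupts it (EINTR), up to max_retries additional attempts.
 * Returns 0 on success, -1 on failure with errno set by fallocate().
 * Assumes the chosen mode can safely be repeated over the same range.
 */
static int fallocate_retry(int fd, int mode, off_t offset, off_t len,
                           int max_retries)
{
    assert(offset >= 0 && len > 0);

    for (int attempt = 0; ; attempt++) {
        if (fallocate(fd, mode, offset, len) == 0)
            return 0;
        /* Give up on any error other than EINTR, or when out of retries. */
        if (errno != EINTR || attempt >= max_retries)
            return -1;
    }
}
```

A caller would use something like fallocate_retry(fd, 0, 0, 1 << 20, 8) to
preallocate the first 1 MiB of a file. Bounding the retries is a design
choice for the sketch, so that a process receiving a steady stream of
signals does not loop forever.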

Cheers, Andreas

> But anyway my first inclination is to say that the NFS FALLOCATE
> protocol just wasn't designed to handle long-running fallocates, and if
> we really need that then we need to give it a way to either report
> partial results or to report results asynchronously.
> 
> --b.
