Re: [PATCH/RFC] NFS: Minimize COMMIT calls during writeback.

NeilBrown <neilb@xxxxxxx> · Wed, 18 Mar 2020 10:48:51 +1100

On Tue, Mar 17 2020, Trond Myklebust wrote:

> On Tue, 2020-03-17 at 21:41 +1100, NeilBrown wrote:
>> 
>> Fair enough.  I don't usually think of writeback as due to memory
>> pressure, but I can see that I filesystem wouldn't much care for the
>> distinction.
>> 
>> Still there is no overlap imposed by the VM.  The VM is sending a
>> sensible streak of writepages requests - as dd dirties more pages,
>> the
>> VM asks NFS to write out more pages.  All in a nice orderly fashion.
>
> The question essentially boils down to: "What is the purpose of the
> background writeback?" My answer would be "To free up pages", something
> that does not happen until COMMIT is sent.
>
> IOW: My concern is livelock, not serialisation. Right now, we do start
> commit when a batch sent by the VM has been completely written (same as
> we do when an O_DIRECT batch has been completely written). That ensures
> that the pages can be freed. Your patch is deliberately designed to
> defer that process that allows pages to be freed, and I'm not seeing
> how it is bounded.

In the current code the COMMIT for any given writepages() call only
completes after:
  - all the WRITEs generated by that writeapages call completes, and
    then
  - a subsequent COMMIT completes.

After my patch, the bound is a little more generous, but still a strong
bound.  The COMMIT will be complete after:
  - Any currently running COMMIT completes (there is usually only one),
    then
  - any writes that were submitted through writepages() before that
    COMMIT completed, completed, then
  - a subsequent COMMIT completes.

So we wait extra time to allow for a COMMIT and some more WRITEs to
complete.  Once the pending COMMIT completes, no more work can be added
to the set of things that need to be waited for, so it cannot live-lock.

>> > 
>> > commit_info.rpcs_out is there in order to count the number of
>> > outstanding COMMITs. I'm not seeing why we need to change that
>> > definition, particularly given that there can be processes out
>> > there
>> > that are trying to wait for the count to go to zero.
>> > 
>> > The I/O completion structure already has its own reference counter,
>> > and
>> > I can see nothing stopping you from modifying
>> > nfs_io_completion_commit() to do the same things you are trying to
>> > do
>> > in nfs_commit_end() here.
>> 
>> I don't see how that would work.
>> There are two events that need to happen asynchronously when a
>> counter
>> hits zero.
>> One is that a COMMIT needs to be sent when the number of outstanding
>> writes from a particular set hits zero.  That is the
>> nfs_io_completion
>> counter.
>> The other is that the currently growing set of pending commits needs
>> to
>> be detached and allowed to drain so that another COMMIT will get
>> sent.
>> This is what I'm using rpcs_out for.  It seems quite a natural
>> extension
>> of its current use.
>> 
>
> You're not supposed to call kref_get() on something with a refcount of
> zero. That means the nfs_io_completion counter defines the lifetime of
> the object that your commit_info.ioc points to, and once that refcount
> goes to zero, you should be removing it, which you are doing in
> nfs_writepages().

Certainly kref_get() cannot be called on something with a refcount of
zero - and if you have something that might have just gone to zero you
can use kref_get_unless_zero() to ensure you don't do the wrong thing.
I use kref_get_unless_zero() in nfs_writepages, but there is nowhere
that might take a ref on a kref with refcount of zero.

The io completion stored in commit_info.ioc has an extra refcount
to reflect the reference from commit_info.ioc.

The reference is removed and the reference count dropped using atomic
ops (no locking) so nfs_writepages() might get the pointer and not
be able to increment that refcount because it is already zero.  RCU
makes it safe to look at the refcount, and kref_get_unless_zero() makes
it safe to try an increment it.

>
> commit_info.rpcs_out is non-zero for the lifetime of any COMMIT
> operations. I'm not seeing how that relates to anything in the
> preceding paragraph (other than that nfs_io_completion going to zero
> will usually cause commit_info.rpcs_out to get incremented).
> IOW: commit_info.rpcs_out goes to zero a long time after the object
> pointed to by commit_info.ioc ought to be freed.

Yes, I agree with this.  I confirm that rpcs_out will only go to zero
well after commit_info.ioc is freed.  The purpose of my change to
rpcs_out is to cause it to be non-zero *earlier*.

As soon as there exists a detached io completion (which will eventually
trigger a COMMIT), rpcs_out becomes non-zero.  The count held by the
io completion is transferred to the pending COMMIT and dropped when the
COMMIT completes.

Making it non-zero earlier allows nfs_writepages to detect if there is
already a pending COMMIT *even if it hasn't been sent yet*.

If there is no pending COMMIT (or detached io completion), then it must
create a detached io completion so that a commit will follow.
If there *is* a pending COMMIT or detached io completion, then it can
leave its io completion attached because it can be sure that a COMMIT
will complete in bounded time, and so the io completion that it has
attached will be detached and processed.

Thanks,
NeilBrown
Attachment:
signature.asc

Description: PGP signature