Re: [PATCH v5 3/5] nfsd: rework refcounting in filecache

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Nov 2, 2022, at 2:07 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Nov 2, 2022, at 12:58 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>> 
>> On Tue, 2022-11-01 at 18:42 -0400, Jeff Layton wrote:
>>> On Wed, 2022-11-02 at 09:05 +1100, NeilBrown wrote:
>>>> On Wed, 02 Nov 2022, Jeff Layton wrote:
>>>>> On Wed, 2022-11-02 at 08:23 +1100, NeilBrown wrote:
>>>>>> On Wed, 02 Nov 2022, Jeff Layton wrote:
>>>>>>> The filecache refcounting is a bit non-standard for something searchable
>>>>>>> by RCU, in that we maintain a sentinel reference while it's hashed. This
>>>>>>> in turn requires that we have to do things differently in the "put"
>>>>>>> depending on whether its hashed, which we believe to have led to races.
>>>>>>> 
>>>>>>> There are other problems in here too. nfsd_file_close_inode_sync can end
>>>>>>> up freeing an nfsd_file while there are still outstanding references to
>>>>>>> it, and there are a number of subtle ToC/ToU races.
>>>>>>> 
>>>>>>> Rework the code so that the refcount is what drives the lifecycle. When
>>>>>>> the refcount goes to zero, then unhash and rcu free the object.
>>>>>>> 
>>>>>>> With this change, the LRU carries a reference. Take special care to
>>>>>>> deal with it when removing an entry from the list.
>>>>>> 
>>>>>> The refcounting and lru management all look sane here.
>>>>>> 
>>>>>> You need to have moved the final put (and corresponding fsync) to
>>>>>> different threads.  I think I see you and Chuck discussing that and I
>>>>>> have no sense of what is "right". 
>>>>>> 
>>>>> 
>>>>> Yeah, this is a tough call. I get Chuck's reticence.
>>>>> 
>>>>> One thing we could consider is offloading the SYNC_NONE writeback
>>>>> submission to a workqueue. I'm not sure though whether that's a win --
>>>>> it might just add needless context switches. OTOH, that would make it
>>>>> fairly simple to kick off writeback when the REFERENCED flag is cleared,
>>>>> which would probably be the best time to do it.
>>>>> 
>>>>> An entry that ends up being harvested by the LRU scanner is going to be
>>>>> touched by it at least twice: once to clear the REFERENCED flag, and
>>>>> again ~2s later to reap it.
>>>>> 
>>>>> If we schedule writeback when we clear the flag then we have a pretty
>>>>> good indication that nothing else is going to be using it (though I
>>>>> think we need to clear REFERENCED even when nfsd_file_check_writeback
>>>>> returns true -- I'll fix that in the coming series).
>>>>> 
>>>>> In any case, I'd probably like to do that sort of change in a separate
>>>>> series after we get the first part sorted.
>>>>> 
>>>>>> But it would be nice to explain in
>>>>>> the comment what is being moved and why, so I could then confirm that
>>>>>> the code matches the intent.
>>>>>> 
>>>>> 
>>>>> I'm happy to add comments, but I'm a little unclear on what you're
>>>>> confused by here. It's a bit too big of a patch for me to give a full
>>>>> play-by-play description. Can you elaborate on what you'd like to see?
>>>>> 
>>>> 
>>>> I don't need blow-by-blow, but all the behavioural changes should at
>>>> least be flagged in the intro, and possibly explained.
>>>> The one I particularly noticed is in nfsd_file_close_inode() which
>>>> previously used nfsd_file_dispose_list() which hands the final close off
>>>> to nfsd_filecache_wq.
>>>> But this patch now does the final close in-line so an fsnotify event
>>>> might now do the fsync.  I was assuming that was deliberate and wanted
>>>> it to be explained.  But maybe it wasn't deliberate?
>>>> 
>>> 
>>> Good catch! That wasn't a deliberate change, or at least I missed the
>>> subtlety that the earlier code attempted to avoid it. fsnotify callbacks
>>> are run under the srcu_read_lock. I don't think we want to run a fsync
>>> under that if we can at all help it.
>>> 
>>> What we can probably do is unhash it and dequeue it from the LRU, and
>>> then do a refcount_dec_and_test. If that comes back true, we can then
>>> queue it to the nfsd_fcache_disposal infrastructure to be closed and
>>> freed. I'll have a look at that tomorrow.
>>> 
>> 
>> Ok, I looked over the notification handling in here again and there is a
>> bit of a dilemma:
>> 
>> Neil is correct that we currently just put the reference directly in the
>> notification callback, and if we put the last reference at that point
>> then we could end up waiting on writeback.
> 
> I expect that for an unlink, that is the common case.
> 
> 
>> There are two notification callbacks:
>> 
>> 1/ fanotify callback to tell us when the link count on a file has gone
>> to 0.
>> 
>> 2/ the setlease_notifier which is called when someone wants to do a
>> setlease
>> 
>> ...both are called under the srcu_read_lock(), and they are both fairly
>> similar situations. We call different functions for them today, but we
>> may be OK to unify them since their needs are similar.
>> 
>> When I originally added these, I made them synchronous because it's best
>> if nfsd can clean up and get out the way quickly when either of these
>> events happen. At that time though, the code didn't call vfs_fsync at
>> all, much less always on the last put.
>> 
>> We have a couple of options:
>> 
>> 1/ just continue to do them synchronously: We can sleep under the
>> srcu_read_lock, so we can do both of those synchronously, but blocking
>> it for a long period of time could cause latency issues elsewhere.
>> 
>> 2/ move them to the delayed freeing infrastructure. That should be fine
>> but we could end doing stuff like silly renaming when someone deletes an
>> open file on an NFS reexport.
> 
> Isn't the NFS re-export case handled already by nfsd_file_close_inode_sync() ?
> In that case, the fsync / close is done synchronously before the unlink, but
> the caller is an nfsd thread, so that should be safe.
> 
> 
>> Thoughts? What's the best approach here?
>> 
>> Maybe we should just leave them synchronous for now, and plan to address
>> this in a later set?
> 
> I need to collect some experimental evidence, but we shouldn't be adding
> or removing notification calls with your patch set.

That wasn't clear. I meant: as I read it, your current patch set doesn't
add or remove notification calls. Sorry for the unnecessary subjunctive.


> It ought to be safe
> to leave it for a subsequent fix.
> 
> 
>>>> The movement of flush_delayed_fput() threw me at first, but I think I
>>>> understand it now - the new code for close_inode_sync is much cleaner,
>>>> not needing dispose_list_sync.
>>>> 
>>> 
>>> Yep, I think this is cleaner too.
>>> 
>> 
>> -- 
>> Jeff Layton <jlayton@xxxxxxxxxx>
> 
> --
> Chuck Lever

--
Chuck Lever







[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux