Re: [PATCH v3 2/4] nfsd: rework refcounting in filecache

> On Oct 28, 2022, at 5:03 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:
>> 
>>> On Oct 28, 2022, at 4:13 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> The filecache refcounting is a bit non-standard for something searchable
>>>>> by RCU, in that we maintain a sentinel reference while it's hashed. This
>>>>> in turn requires that we do things differently in the "put" depending on
>>>>> whether it's hashed, which we believe has led to races.
>>>>> 
>>>>> There are other problems in here too. nfsd_file_close_inode_sync can end
>>>>> up freeing an nfsd_file while there are still outstanding references to
>>>>> it, and there are a number of subtle ToC/ToU races.
>>>>> 
>>>>> Rework the code so that the refcount is what drives the lifecycle. When
>>>>> the refcount goes to zero, unhash the object and free it via RCU.
>>>>> 
>>>>> With this change, the LRU carries a reference. Take special care to
>>>>> deal with it when removing an entry from the list.
>>>> 
>>>> I can see a way of making this patch a lot cleaner. It looks like there's
>>>> a fair bit of renaming and moving of functions -- that can go in clean-up
>>>> patches before doing the heavy lifting.
>>>> 
>>> 
>>> Is this something that's really needed? I'm already basically rewriting
>>> this code. Reshuffling the old code around first will take a lot of time
>>> and we'll still end up with the same result.
>> 
>> I did exactly this for the nfs4_file rhash changes. It took just a couple
>> of hours. The outcome is that you can see exactly, in the final patch in
>> that series, how the file_hashtbl -> rhltable substitution is done.
>> 
>> Making sure each of the changes is more or less mechanical and obvious
>> is a good way to ensure no-one is doing something incorrect. That's why
>> folks like to use Coccinelle.
>> 
>> Trust me, it will be much easier to figure out in a year when we have
>> new bugs in this code if we split up this commit just a little.
>> 
> 
> Sigh. It seems pointless to rearrange code that is going to be replaced,
> but I'll do it. It'll probably be next week though.
> 
>> 
>>>> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
>>> 
>>> I think that we need to call this there to ensure that writeback errors
>>> are handled. I worry that if we try to do this piecemeal, we could end up
>>> missing errors when they fall off the LRU.
>>> 
>>>> That feels like a deadlock waiting to happen and quite difficult to
>>>> reproduce because I/O there is rarely needed. It could help to put a
>>>> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
>>>> drive I/O in that path at all.
>>> 
>>> I don't quite grok the potential for a deadlock here. nfsd_file_free
>>> already has to deal with blocking activities due to it effectively doing
>>> a close(). This is just another one. That's why nfsd_file_put has a
>>> might_sleep() in it (to warn its callers).
>> 
>> Currently nfsd_file_put() calls nfsd_file_flush(), which calls
>> vfs_fsync(). That can't be called while holding a spinlock.
>> 
>> 
> 
> nfsd_file_free (and hence, nfsd_file_put) can never be called with a
> spinlock held. That's true even before this set. Both functions can
> block.

Dead horse: in the current code base, nfsd_file_free() can be called
via nfsd_file_close_inode_sync(), which is an API external to
filecache.c. But I agree now that both functions can block.
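
To make that explicit to callers, a might_sleep() annotation is the
kind of thing I had in mind -- a condensed sketch, not the actual
function body:

	static void nfsd_file_free(struct nfsd_file *nf)
	{
		/* catch anyone calling this from atomic context */
		might_sleep();

		if (nf->nf_file)
			filp_close(nf->nf_file, NULL);	/* can block in ->flush */
	}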


>>> What's the deadlock scenario you envision?
>> 
>> OK, filp_close() does call f_op->flush(). So we have this call
>> here and there aren't problems today. I still say this is a
>> problem waiting to occur, but I guess I can live with it.
>> 
>> If filp_close() already calls f_op->flush(), why do we need an
>> explicit vfs_fsync() there?
>> 
>> 
> 
> ->flush doesn't do anything on some filesystems, and while it does
> return an error code today, it should have been a void-returning
> function. The error from it can't be counted on.

OK. The goal is detecting writeback errors, and ->flush is not a
reliable way to do that.
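
If we do keep the explicit fsync, then the check has to look roughly
like this sketch -- the nfsd_reset_write_verifier() call reflects my
reading of the intent, not necessarily the patch as posted:

	static void nfsd_file_fsync(struct nfsd_file *nf)
	{
		struct file *file = nf->nf_file;

		if (!file || !(file->f_mode & FMODE_WRITE))
			return;
		/* unlike ->flush, vfs_fsync() reliably reports
		 * writeback errors via the file's errseq_t */
		if (vfs_fsync(file, 1) != 0)
			nfsd_reset_write_verifier(net_generic(nf->nf_net,
							      nfsd_net_id));
	}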


> vfs_fsync is what ensures that everything gets written back and returns
> info about writeback errors. I don't see a way around calling it at
> least once before we close a file, if we want to keep up the "trick" of
> resetting the verifier when we see errors.

Fair enough, but maybe it's not worth the complexity and performance
impact. Writeback errors are supposed to be rare compared to I/O.

If we can find another way (especially, a more reliable way) to
monitor for writeback errors, that might be an improvement over
using vfs_fsync, which seems heavyweight no matter how you cut it.
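
For example -- and this is only a thought experiment on my part, not
something in this series -- the errseq_t machinery can answer "did
writeback fail since we set this file up?" without driving any I/O:

	/* sample when the nfsd_file is first set up... */
	static errseq_t nfsd_file_sample_wb_err(struct file *file)
	{
		return errseq_sample(&file->f_mapping->wb_err);
	}

	/* ...and compare at teardown, instead of a synchronous fsync;
	 * this only observes errors writeback has already hit */
	static bool nfsd_file_saw_wb_err(struct file *file, errseq_t since)
	{
		return filemap_check_wb_err(file->f_mapping, since) != 0;
	}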


> IMO, the goal ought to be to ensure that we don't end up having to do
> any writeback when we get to GC'ing it, and that's what patch 4/4 should
> do.

Understood. I don't disagree with the goal, since filp_close()
won't be able to avoid a potentially costly ->flush in some cases.

The issue is even a SYNC_NONE flush has appreciable cost.
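For reference, the "cheap" variant under discussion is roughly this
sketch: it kicks off WB_SYNC_NONE writeback without waiting, but it
still walks the mapping and queues I/O, which is the cost I mean.

	static void nfsd_file_flush_async(struct nfsd_file *nf)
	{
		if (nf->nf_file)
			filemap_flush(nf->nf_file->f_mapping);
	}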


--
Chuck Lever






