Re: I can't get no readdir satisfaction

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Wed, 24 Aug 2016 14:19:49 +0000

> On Aug 24, 2016, at 10:16, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> 
> On 24 Aug 2016, at 10:02, Trond Myklebust wrote:
> 
>>> On Aug 24, 2016, at 09:56, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>>>> 
>>>>> On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
>>>>> 
>>>>>>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Hi linux-nfs,
>>>>>>> 
>>>>>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l'
>>>>>>> situations") changed when nfs_readdir() decides to revalidate the
>>>>>>> directory's mapping, which contains all the entries.  In addition to just
>>>>>>> checking if the attribute cache has expired, it includes a check to see if
>>>>>>> NFS_INO_INVALID_DATA is set on the directory.
>>>>>>> 
>>>>>>> Well, customers that have directories with very many dentries and that same
>>>>>>> directory's attributes are frequently updated are now grouchy that `ls -l`
>>>>>>> takes so long since any update of the directory causes the mapping to be
>>>>>>> invalidated and we have to start over filling the directory's mapping.
>>>>>>> 
>>>>>>> I actually haven't put real hard thought into it yet (because often for me
>>>>>>> that just wastes a lot of time), so I am doing the lazy thing by asking this
>>>>>>> question:
>>>>>>> 
>>>>>>> Can we go back to just the using the attribute cache timeout, or should we
>>>>>>> get all heuristical about readdir?
>>>>>>> 
>>>>>> 
>>>>>> We are all heuristical at this point. How are the heuristics failing?
>>>>>> 
>>>>>> The original problem those heuristics were designed to solve was that all
>>>>>> the stat() calls took forever to complete, since they are all synchronous;
>>>>>> Tigran showed some very convincing numbers for a large directory where the
>>>>>> difference in performance was an order of magnitude improved by using
>>>>>> readdirplus instead of readdir…
>>>>> 
>>>>> I'll try to present a better explanation.  While `ls -l` is walking through
>>>>> a directory repeatedly entering nfs_readdir(), a CREATE response send us
>>>>> through nfs_post_op_update_inode_locked():
>>>>> 
>>>>> 1531     if (S_ISDIR(inode->i_mode))
>>>>> 1532         invalid |= NFS_INO_INVALID_DATA;
>>>>> 1533     nfs_set_cache_invalid(inode, invalid);
>>>>> 
>>>>> Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping(),
>>>>> which will do nfs_invalidate_mapping() for the directory, and so we have to
>>>>> start over with cookie 0 sending READDIRPLUS to re-fill the directory's
>>>>> mapping to get back to where we are for the current nfs_readdir().
>>>>> 
>>>>> This process repeats for every entry into nfs_readdir() if the directory
>>>>> keeps getting updated, and it becomes more likely that it will be updated as
>>>>> each pass takes longer and longer to re-fill the mapping as the current
>>>>> nfs_readdir() invocation is further along.
>>>>> 
>>>>> READDIRPLUS isn't the problem, the problem is invalidating the directory
>>>>> mapping in the middle of a series of getdents() if we do a CREATE.  Also, I
>>>>> think a similar thing happens if the directory's mtime or ctime is updated -
>>>>> but in that case we set NFS_INO_INVALID_DATA because the change_attr
>>>>> updates.
>>>>> 
>>>>> So, for directories with a large number of entries that updates often, it can
>>>>> be very slow to list the directory.
>>>>> 
>>>>> Why did 311324ad1713 change nfs_readdir from
>>>>> 
>>>>> if (nfs_attribute_cache_expired(inode))
>>>>>  nfs_revalidate_mapping(inode, file->f_mapping);
>>>>> to
>>>>> 
>>>>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_INO_INVALID_DATA)
>>>>>  nfs_revalidate_mapping(inode, file->f_mapping);
>>>> 
>>>> As the commit message says, the whole purpose was to use READDIRPLUS as a substitute for multiple GETATTR calls when the heuristic tells us that the user is performing an ‘ls -l’ style of workload.
>>>> 
>>>>> 
>>>>> .. and can we go back to the way it was before?
>>>> 
>>>> Not without slowing down ‘ls -l’ on large directories.
>>>> 
>>>>> 
>>>>> OK.. I understand why -- it is more correct since if we know the directory has
>>>>> changed, we might as well fetch the change.  Otherwise, we might be creating
>>>>> files and then wondering why they aren't listed.
>>>>> 
>>>>> It might be nicer to not invalidate the mapping we're currently using for
>>>>> readdir, though.  Maybe there's a way to keep the mapping for the currently
>>>>> opened directory and invalidate it once it's closed.
>>>> 
>>> 
>>>> POSIX requires that you revalidate on opendir() and rewinddir(), and
>>>> leaves behaviour w.r.t. file addition and removal after the call to
>>>> opendir()/rewinddir() undefined
>>>> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
>>> 
>>> It is only undefined whether the added or removed entry is returned.
>>> Other entries still need to returned exactly once.
>>> 
>>> In this case we're restarting the read of the directory from scratch--I
>>> don't understand how that's possible while avoiding skipped or
>>> duplicated entries.
>>> 
>>> Surely the only safe thing to do is to continue reading using the last
>>> cookie returned from the server.
>> 
>> Why? The client should be able to restart using any cookie at any time,
>> and we rely on the cookies being unique to each entry. If you want more
>> relaxed cookie semantics then be prepared to have to set up a stateful NFS
>> readdir protocol.
> 
> It looks to me as if I send in a non-zero position to nfs_readdir(), but the mapping
> has been invalidated, then the client starts over at cookie 0 sending
> READDIRs to the server and reading the directory entries into the page cache
> in order to figure out which position maps to which entry/cookie.

Yes.

> It seems possible to get duplicated or skipped entries.

There is a logical step missing here. How do you get from “we can end up re-filling the cache from scratch” to “there can be duplicate or skipped entries”?

��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥