> On Aug 24, 2016, at 10:16, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > On 24 Aug 2016, at 10:02, Trond Myklebust wrote: > >>> On Aug 24, 2016, at 09:56, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote: >>> >>> On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote: >>>> >>>>> On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>> >>>>> >>>>> >>>>> On 23 Aug 2016, at 11:36, Trond Myklebust wrote: >>>>> >>>>>>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>>> >>>>>>> Hi linux-nfs, >>>>>>> >>>>>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l' >>>>>>> situations") changed when nfs_readdir() decides to revalidate the >>>>>>> directory's mapping, which contains all the entries. In addition to just >>>>>>> checking if the attribute cache has expired, it includes a check to see if >>>>>>> NFS_INO_INVALID_DATA is set on the directory. >>>>>>> >>>>>>> Well, customers that have directories with very many dentries and that same >>>>>>> directory's attributes are frequently updated are now grouchy that `ls -l` >>>>>>> takes so long since any update of the directory causes the mapping to be >>>>>>> invalidated and we have to start over filling the directory's mapping. >>>>>>> >>>>>>> I actually haven't put real hard thought into it yet (because often for me >>>>>>> that just wastes a lot of time), so I am doing the lazy thing by asking this >>>>>>> question: >>>>>>> >>>>>>> Can we go back to just the using the attribute cache timeout, or should we >>>>>>> get all heuristical about readdir? >>>>>>> >>>>>> >>>>>> We are all heuristical at this point. How are the heuristics failing? >>>>>> >>>>>> The original problem those heuristics were designed to solve was that all >>>>>> the stat() calls took forever to complete, since they are all synchronous; >>>>>> Tigran showed some very convincing numbers for a large directory where the >>>>>> difference in performance was an order of magnitude improved by using >>>>>> readdirplus instead of readdir… >>>>> >>>>> I'll try to present a better explanation. While `ls -l` is walking through >>>>> a directory repeatedly entering nfs_readdir(), a CREATE response send us >>>>> through nfs_post_op_update_inode_locked(): >>>>> >>>>> 1531 if (S_ISDIR(inode->i_mode)) >>>>> 1532 invalid |= NFS_INO_INVALID_DATA; >>>>> 1533 nfs_set_cache_invalid(inode, invalid); >>>>> >>>>> Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping(), >>>>> which will do nfs_invalidate_mapping() for the directory, and so we have to >>>>> start over with cookie 0 sending READDIRPLUS to re-fill the directory's >>>>> mapping to get back to where we are for the current nfs_readdir(). >>>>> >>>>> This process repeats for every entry into nfs_readdir() if the directory >>>>> keeps getting updated, and it becomes more likely that it will be updated as >>>>> each pass takes longer and longer to re-fill the mapping as the current >>>>> nfs_readdir() invocation is further along. >>>>> >>>>> READDIRPLUS isn't the problem, the problem is invalidating the directory >>>>> mapping in the middle of a series of getdents() if we do a CREATE. Also, I >>>>> think a similar thing happens if the directory's mtime or ctime is updated - >>>>> but in that case we set NFS_INO_INVALID_DATA because the change_attr >>>>> updates. >>>>> >>>>> So, for directories with a large number of entries that updates often, it can >>>>> be very slow to list the directory. >>>>> >>>>> Why did 311324ad1713 change nfs_readdir from >>>>> >>>>> if (nfs_attribute_cache_expired(inode)) >>>>> nfs_revalidate_mapping(inode, file->f_mapping); >>>>> to >>>>> >>>>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_INO_INVALID_DATA) >>>>> nfs_revalidate_mapping(inode, file->f_mapping); >>>> >>>> As the commit message says, the whole purpose was to use READDIRPLUS as a substitute for multiple GETATTR calls when the heuristic tells us that the user is performing an ‘ls -l’ style of workload. >>>> >>>>> >>>>> .. and can we go back to the way it was before? >>>> >>>> Not without slowing down ‘ls -l’ on large directories. >>>> >>>>> >>>>> OK.. I understand why -- it is more correct since if we know the directory has >>>>> changed, we might as well fetch the change. Otherwise, we might be creating >>>>> files and then wondering why they aren't listed. >>>>> >>>>> It might be nicer to not invalidate the mapping we're currently using for >>>>> readdir, though. Maybe there's a way to keep the mapping for the currently >>>>> opened directory and invalidate it once it's closed. >>>> >>> >>>> POSIX requires that you revalidate on opendir() and rewinddir(), and >>>> leaves behaviour w.r.t. file addition and removal after the call to >>>> opendir()/rewinddir() undefined >>>> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html). >>> >>> It is only undefined whether the added or removed entry is returned. >>> Other entries still need to returned exactly once. >>> >>> In this case we're restarting the read of the directory from scratch--I >>> don't understand how that's possible while avoiding skipped or >>> duplicated entries. >>> >>> Surely the only safe thing to do is to continue reading using the last >>> cookie returned from the server. >> >> Why? The client should be able to restart using any cookie at any time, >> and we rely on the cookies being unique to each entry. If you want more >> relaxed cookie semantics then be prepared to have to set up a stateful NFS >> readdir protocol. > > It looks to me as if I send in a non-zero position to nfs_readdir(), but the mapping > has been invalidated, then the client starts over at cookie 0 sending > READDIRs to the server and reading the directory entries into the page cache > in order to figure out which position maps to which entry/cookie. Yes. > It seems possible to get duplicated or skipped entries. There is a logical step missing here. How do you get from “we can end up re-filling the cache from scratch” to “there can be duplicate or skipped entries”? ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥