Hi all,

Hoping to revive the discussion for $SUBJECT since we ran out of time when
Boaz brought it up at LSF.

Summary of what was discussed:
- readdirplus syscall can be modeled after NFS' internal readdirplus
  implementation.
- Need for a directory version counter (change count)
- Need for each entry to have an opaque resume key
  - The linux_dirent.d_off in getdents(2) does this somewhat.
- Header at top of the returned data with bits to signify what's inside.
- What data to return? entries + stat + xattrs/acls?

The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend
to agree, after having worked on a draft readdirplus syscall on GFS2 that
does xattrs in addition to stat. The potentially large amount of variable
length data to handle and the alloc/realloc/dealloc of said data makes the
code quite complicated and hence, difficult to maintain. I had to write a new
page-backed resizeable buffer to make this worthwhile (performance was
actually worse with kmalloc & friends and kmap/kunmap compared to simply
doing getdents()+stat()+getxattr()).

For those who are interested, here are the patches (description in previous
email below):
https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
There's an interesting seekwatcher graph on there too that compares the two
cases.

With a cold cache, almost all the speedup obtained by readdirplus is by being
able to order all the disk reads. I've seen a 2x speedup (cold cache) with my
test directories, but not much more. When the relevant disk blocks are in
cache, readdirplus is about 3x faster - I attribute it to the minimal
allocing and user/kernel mode switching that goes on.

We might also get decent performance by simply having a system call that
takes the directory as argument and goes off and pre-fetches all the relevant
blocks required to do subsequent getdents()+stat()+getxattr() efficiently.

Thoughts?

Cheers!
--Abhi

----- Original Message -----
> From: "Abhijith Das" <adas@xxxxxxxxxx>
> To: "Boaz Harrosh" <bharrosh@xxxxxxxxxxx>
> Cc: "Steven Whitehouse" <swhiteho@xxxxxxxxxx>, "Steve Dickson" <steved@xxxxxxxxxx>, "Jeff Layton"
> <jlayton@xxxxxxxxxx>, lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx, "linux-fsdevel" <linux-fsdevel@xxxxxxxxxxxxxxx>, "Ganesha
> NFS List" <nfs-ganesha-devel@xxxxxxxxxxxxxxxxxxxxx>, "Frank S Filz" <ffilz@xxxxxxxxxx>, "J. Bruce Fields"
> <bfields@xxxxxxxxxx>, "Jim Lieb" <jlieb@xxxxxxxxxxx>, "Venkateswararao Jujjuri" <jvrao@xxxxxxxxxxxxxxxxxx>, "DENIEL
> Philippe" <philippe.deniel@xxxxxx>
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
>
> Hi Boaz/All,
>
> ----- Original Message -----
> > From: "Boaz Harrosh" <bharrosh@xxxxxxxxxxx>
> > To: "Steven Whitehouse" <swhiteho@xxxxxxxxxx>, "Steve Dickson"
> > <steved@xxxxxxxxxx>, "Jeff Layton"
> > <jlayton@xxxxxxxxxx>, lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx, "linux-fsdevel"
> > <linux-fsdevel@xxxxxxxxxxxxxxx>, "Ganesha
> > NFS List" <nfs-ganesha-devel@xxxxxxxxxxxxxxxxxxxxx>, "Frank S Filz"
> > <ffilz@xxxxxxxxxx>, "J. Bruce Fields"
> > <bfields@xxxxxxxxxx>, "Jim Lieb" <jlieb@xxxxxxxxxxx>, "Venkateswararao
> > Jujjuri" <jvrao@xxxxxxxxxxxxxxxxxx>, "DENIEL
> > Philippe" <philippe.deniel@xxxxxx>
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> >
> > By: Steven Whitehouse <swhiteho@xxxxxxxxxx>)
> >
> > I repeat below Steve's original mail. Steve you said you have
> > some experimental code, could you post an header and a git URL
> > so we can have a look?
>
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
>
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered samba over GFS2.
> The system call simply collects dirents along with stat and extended
> attributes and copies the info out to the user buffer. This patchset is a
> first-attempt at tackling this problem from a GFS2 perspective and is mainly
> a way to get us talking about possible implementations.
>
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
>
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat
> system calls that David Howells proposed here :
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
>
> There are 4 parts to my readdirplus (xgetdents()) patches:
>
> Patch 1of4 adds the xgetdents() syscall interface, xreaddir() f_op and the
> linux_xdirent structure that specifies how the collected data is packaged to
> the user. From the caller's perspective, it behaves very much like the
> getdents() syscall except for the -EAGAIN return code. This would require
> the caller to re-issue the syscall with the same parameters.
>
> Patch 2of4 is a gfs2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
>
> Patch 3of4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort(). Only difference is that it takes an additional (void *) opaque
> context pointer and passes it to the compare() and swap() functions. I
> needed this to be able to sort pointers stored in the vector of pages
> buffer.
>
> Patch 4of4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
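For readers who have not opened the patches, the idea behind ctx_sort() in
patch 3of4 can be sketched in userspace C roughly as follows. This is not the
code from the patch. The name ctx_sort_sketch, the callback typedefs,
struct blk_ctx and the insertion-sort loop are all assumptions made for
illustration (lib/sort.c itself uses a heapsort); the only part taken from
the description above is the extra (void *) context pointer that is handed to
compare() and swap().

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed callback shapes: lib/sort.c-style cmp/swap plus an opaque ctx. */
typedef int (*ctx_cmp_fn)(const void *a, const void *b, void *ctx);
typedef void (*ctx_swap_fn)(void *a, void *b, size_t size, void *ctx);

/*
 * Illustrative O(n^2) insertion sort built only from cmp() and swap().
 * The kernel's sort() is a heapsort; the algorithm is swapped here to
 * keep the sketch short.
 */
static void ctx_sort_sketch(void *base, size_t num, size_t size,
                            ctx_cmp_fn cmp, ctx_swap_fn swap, void *ctx)
{
    char *p = base;

    for (size_t i = 1; i < num; i++) {
        for (size_t j = i; j > 0; j--) {
            char *lo = p + (j - 1) * size;
            char *hi = p + j * size;

            if (cmp(lo, hi, ctx) <= 0)
                break;              /* already in order */
            swap(lo, hi, size, ctx);
        }
    }
}

/* Hypothetical context: the block numbers that the sorted indices refer to. */
struct blk_ctx {
    const uint64_t *blkno;
    unsigned long swaps;
};

/* Compare two indices by the block number each one points at in the context. */
static int cmp_by_blkno(const void *a, const void *b, void *ctx)
{
    const struct blk_ctx *c = ctx;
    uint64_t x = c->blkno[*(const uint32_t *)a];
    uint64_t y = c->blkno[*(const uint32_t *)b];

    return (x > y) - (x < y);
}

/* Swap two uint32_t indices and count the swap in the context. */
static void swap_u32(void *a, void *b, size_t size, void *ctx)
{
    struct blk_ctx *c = ctx;
    uint32_t t = *(uint32_t *)a;

    (void)size;
    *(uint32_t *)a = *(uint32_t *)b;
    *(uint32_t *)b = t;
    c->swaps++;
}
```

The array being sorted holds only small indices, and the comparison looks up
the real keys through the context pointer, which is the situation that a
plain sort() without a context argument cannot express without globals.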
>
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
>
> Cheers!
> --Abhi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
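For reference, the getdents()+stat() baseline that the numbers above are
compared against looks roughly like this from userspace. It is only a sketch,
assuming POSIX readdir() (which glibc implements on top of getdents on Linux)
and fstatat(). The function name walk_dir_stat and its out-parameters are
made up for illustration, and the getxattr() step is left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: list a directory, then stat each entry one by one.
 * Returns 0 on success, -1 if the directory cannot be opened.
 */
static int walk_dir_stat(const char *path, unsigned long *nentries,
                         unsigned long long *total_bytes)
{
    DIR *dir = opendir(path);
    struct dirent *de;

    if (!dir)
        return -1;

    *nentries = 0;
    *total_bytes = 0;

    while ((de = readdir(dir)) != NULL) {
        struct stat st;

        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;

        /* One extra syscall (and possibly a disk read) per entry. */
        if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;   /* entry vanished or is unreadable; skip it */

        (*nentries)++;
        *total_bytes += (unsigned long long)st.st_size;
    }

    closedir(dir);
    return 0;
}
```

Each entry costs its own fstatat() call here, which is the per-entry
user/kernel switch that a readdirplus-style call would fold into the
directory read itself.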