Re: [1/8] readdir-plus system call - LSF/MM follow up

Hi all,

Hoping to revive the discussion for $SUBJECT since we ran out of time when Boaz brought it up at LSF.
Summary of what was discussed:

- readdirplus syscall can be modeled after NFS' internal readdirplus implementation.
- Need for a directory version counter (change count)
- Need for each entry to have an opaque resume key - The linux_dirent.d_off in getdents(2) does this somewhat.
- Header at top of the returned data with bits to signify what's inside.
- What data to return? entries + stat + xattrs/acls?
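
To make the header and resume-key points above concrete, here is a minimal userspace sketch of one possible buffer layout. Every name in it (`xdir_hdr`, `xdir_ent`, `XD_HAS_STAT`, `XD_HAS_XATTR`, `xdir_last_key`) is hypothetical and is not taken from the patches or from NFS; it only shows a header with content bits, a directory change count, and a per-entry opaque cookie.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits saying which optional data accompanies entries. */
#define XD_HAS_STAT  (1u << 0)
#define XD_HAS_XATTR (1u << 1)

/* Hypothetical header at the top of the returned buffer. */
struct xdir_hdr {
    uint64_t dir_version; /* directory change count when the buffer was filled */
    uint32_t flags;       /* XD_HAS_* bits */
    uint32_t nr_entries;  /* number of records that follow */
};

/* Hypothetical fixed part of each record; name and payload follow it. */
struct xdir_ent {
    uint64_t resume_key;  /* opaque cookie, similar in spirit to d_off */
    uint16_t reclen;      /* total record length in bytes */
    uint16_t namelen;     /* length of the name that follows */
};

/* Return the resume key of the last complete record in buf, or 0 if there
 * is none. Records are copied out with memcpy to avoid unaligned access. */
uint64_t xdir_last_key(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    struct xdir_hdr hdr;
    size_t off = sizeof(hdr);
    uint64_t key = 0;

    assert(buf != NULL);
    if (len < sizeof(hdr))
        return 0;
    memcpy(&hdr, p, sizeof(hdr));

    for (uint32_t i = 0; i < hdr.nr_entries; i++) {
        struct xdir_ent ent;

        if (len - off < sizeof(ent))
            break;                      /* truncated record header */
        memcpy(&ent, p + off, sizeof(ent));
        if (ent.reclen < sizeof(ent) + ent.namelen || ent.reclen > len - off)
            break;                      /* malformed or truncated record */
        key = ent.resume_key;
        off += ent.reclen;
    }
    return key;
}
```

A caller would pass the returned key back on the next call to continue the listing, and compare `dir_version` across calls to detect that the directory changed in between.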

The fs/kernel guys were opposed to tossing xattrs/acls into the mix - I tend to agree, after having worked on a draft readdirplus syscall on GFS2 that does xattrs in addition to stat.

The potentially large amount of variable-length data to handle, and the alloc/realloc/dealloc of that data, make the code quite complicated and hence difficult to maintain. I had to write a new page-backed resizeable buffer to make this worthwhile: with kmalloc & friends and kmap/kunmap, performance was actually worse than simply doing getdents()+stat()+getxattr().

For those who are interested, here are the patches (description in previous email below): https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
There's an interesting seekwatcher graph on there too that compares the two cases. With a cold cache, almost all of the speedup obtained by readdirplus comes from being able to order all the disk reads. I've seen a 2x speedup (cold cache) with my test directories, but not much more. When the relevant disk blocks are in cache, readdirplus is about 3x faster. I attribute that to the reduced allocation and user/kernel mode switching that goes on.

We might also get decent performance by simply having a system call that takes the directory as argument and goes off and pre-fetches all the relevant blocks required to do subsequent getdents()+stat()+getxattr() efficiently.

Thoughts?

Cheers!
--Abhi

----- Original Message -----
> From: "Abhijith Das" <adas@xxxxxxxxxx>
> To: "Boaz Harrosh" <bharrosh@xxxxxxxxxxx>
> Cc: "Steven Whitehouse" <swhiteho@xxxxxxxxxx>, "Steve Dickson" <steved@xxxxxxxxxx>, "Jeff Layton"
> <jlayton@xxxxxxxxxx>, lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx, "linux-fsdevel" <linux-fsdevel@xxxxxxxxxxxxxxx>, "Ganesha
> NFS List" <nfs-ganesha-devel@xxxxxxxxxxxxxxxxxxxxx>, "Frank S Filz" <ffilz@xxxxxxxxxx>, "J. Bruce Fields"
> <bfields@xxxxxxxxxx>, "Jim Lieb" <jlieb@xxxxxxxxxxx>, "Venkateswararao Jujjuri" <jvrao@xxxxxxxxxxxxxxxxxx>, "DENIEL
> Philippe" <philippe.deniel@xxxxxx>
> Sent: Monday, April 8, 2013 2:02:40 PM
> Subject: Re: [1/8] readdir-plus system call
> 
> Hi Boaz/All,
> 
> ----- Original Message -----
> > From: "Boaz Harrosh" <bharrosh@xxxxxxxxxxx>
> > To: "Steven Whitehouse" <swhiteho@xxxxxxxxxx>, "Steve Dickson"
> > <steved@xxxxxxxxxx>, "Jeff Layton"
> > <jlayton@xxxxxxxxxx>, lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx, "linux-fsdevel"
> > <linux-fsdevel@xxxxxxxxxxxxxxx>, "Ganesha
> > NFS List" <nfs-ganesha-devel@xxxxxxxxxxxxxxxxxxxxx>, "Frank S Filz"
> > <ffilz@xxxxxxxxxx>, "J. Bruce Fields"
> > <bfields@xxxxxxxxxx>, "Jim Lieb" <jlieb@xxxxxxxxxxx>, "Venkateswararao
> > Jujjuri" <jvrao@xxxxxxxxxxxxxxxxxx>, "DENIEL
> > Philippe" <philippe.deniel@xxxxxx>
> > Sent: Monday, April 8, 2013 5:22:46 AM
> > Subject: [1/8] readdir-plus system call
> > 
> > By: Steven Whitehouse <swhiteho@xxxxxxxxxx>
> > 
> > I repeat below Steve's original mail. Steve, you said you have
> > some experimental code; could you post a header and a git URL
> > so we can have a look?
> 
> The patchset I'm working on is in a local tree, but the latest bits are
> available in this Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=850426#c14
> 
> From a GFS2 perspective, the need for such a system call arose from our talks
> with Samba folks to better support clustered Samba over GFS2. The system
> call simply collects dirents along with stat and extended attributes and
> copies the info out to the user buffer. This patchset is a first attempt at
> tackling this problem from a GFS2 perspective and is mainly a way to get us
> talking about possible implementations.
> 
> As the patches stand right now, the VFS bits are just hooks and all the real
> work is done in the GFS2 filesystem. However, there are some bits that could
> be moved into the VFS so other filesystems can utilize them.
> 
> For obtaining stat info, I'm making use of VFS bits of the xstat and fxstat
> system calls that David Howells proposed here:
> https://lists.samba.org/archive/samba-technical/2012-April/082906.html
> 
> There are 4 parts to my readdirplus (xgetdents()) patches:
> 
> Patch 1 of 4 adds the xgetdents() syscall interface, the xreaddir() f_op and
> the linux_xdirent structure that specifies how the collected data is packaged
> for the user. From the caller's perspective, it behaves very much like the
> getdents() syscall except for the -EAGAIN return code, which requires
> the caller to re-issue the syscall with the same parameters.
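
The re-issue-on-EAGAIN behaviour described above could be wrapped by a caller along these lines. This is only a sketch: `xgetdents_fn` is a hypothetical stand-in for the proposed call (its real signature is in the patches, not here), the kernel-style negative return value is an assumption, and `stub_xgetdents` exists purely so the loop can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed call: returns bytes filled,
 * 0 at end of directory, or a negative errno value. */
typedef long (*xgetdents_fn)(int fd, void *buf, size_t len);

/* Re-issue the call with the same parameters while it returns -EAGAIN,
 * giving up after max_tries attempts. */
long xgetdents_retry(xgetdents_fn fn, int fd, void *buf, size_t len,
                     int max_tries)
{
    long ret = -EAGAIN;

    assert(fn != NULL);
    for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
        ret = fn(fd, buf, len);
    return ret;
}

/* Demonstration stub only: fails with -EAGAIN twice, then "fills" 64 bytes. */
int stub_calls;

long stub_xgetdents(int fd, void *buf, size_t len)
{
    (void)fd; (void)buf; (void)len;
    return ++stub_calls <= 2 ? -EAGAIN : 64;
}
```

Bounding the number of attempts keeps a caller from spinning forever if the call keeps asking for a retry.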
> 
> Patch 2 of 4 is a GFS2 patch that adds a data structure that is a resizeable
> buffer backed by a vector of pages. This is used to collect all the
> intermediate data before writing it out to the user buffer.
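
The idea of a resizeable buffer backed by a vector of pages can be illustrated in userspace as follows. This is not the GFS2 code: the names (`page_buf`, `pb_append`, `pb_read`, `pb_free`) and the fixed 4096-byte page size are assumptions, and malloc() stands in for kernel page allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PB_PAGE_SIZE 4096u

struct page_buf {
    unsigned char **pages; /* vector of page pointers */
    size_t nr_pages;
    size_t len;            /* bytes written so far */
};

/* Add pages until the buffer can hold `need` bytes. Only the pointer
 * vector is reallocated; pages already filled are never moved. */
static int pb_reserve(struct page_buf *pb, size_t need)
{
    while (pb->nr_pages * PB_PAGE_SIZE < need) {
        unsigned char **v = realloc(pb->pages, (pb->nr_pages + 1) * sizeof(*v));

        if (!v)
            return -1;
        pb->pages = v;
        v[pb->nr_pages] = malloc(PB_PAGE_SIZE);
        if (!v[pb->nr_pages])
            return -1;
        pb->nr_pages++;
    }
    return 0;
}

/* Append n bytes, splitting the copy at page boundaries. */
int pb_append(struct page_buf *pb, const void *src, size_t n)
{
    const unsigned char *s = src;

    assert(pb != NULL);
    if (pb_reserve(pb, pb->len + n))
        return -1;
    while (n) {
        size_t off = pb->len % PB_PAGE_SIZE;
        size_t chunk = PB_PAGE_SIZE - off;

        if (chunk > n)
            chunk = n;
        memcpy(pb->pages[pb->len / PB_PAGE_SIZE] + off, s, chunk);
        pb->len += chunk; s += chunk; n -= chunk;
    }
    return 0;
}

/* Copy n bytes starting at logical offset pos; -1 if out of range. */
int pb_read(const struct page_buf *pb, size_t pos, void *dst, size_t n)
{
    unsigned char *d = dst;

    if (pos > pb->len || n > pb->len - pos)
        return -1;
    while (n) {
        size_t off = pos % PB_PAGE_SIZE;
        size_t chunk = PB_PAGE_SIZE - off;

        if (chunk > n)
            chunk = n;
        memcpy(d, pb->pages[pos / PB_PAGE_SIZE] + off, chunk);
        pos += chunk; d += chunk; n -= chunk;
    }
    return 0;
}

void pb_free(struct page_buf *pb)
{
    for (size_t i = 0; i < pb->nr_pages; i++)
        free(pb->pages[i]);
    free(pb->pages);
    memset(pb, 0, sizeof(*pb));
}
```

Growing by whole pages avoids copying already-collected data on every resize, which is the cost a single contiguous realloc-style buffer would pay.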
> 
> Patch 3 of 4 is a simple port of the sort() function from lib/sort.c called
> ctx_sort(). The only difference is that it takes an additional (void *)
> opaque context pointer and passes it to the compare() and swap() functions.
> I needed this to be able to sort pointers stored in the vector of pages
> buffer.
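
A sort that threads an opaque context pointer through its callbacks, as described above, might look like this in userspace. The signature is an assumption rather than the one in the patch, and a plain insertion sort is swapped in for brevity; it is not the algorithm used by lib/sort.c. The helpers `cmp_idx_by_key` and `swap_bytes` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int  (*ctx_cmp_fn)(const void *a, const void *b, void *ctx);
typedef void (*ctx_swap_fn)(void *a, void *b, size_t size, void *ctx);

/* Sort num elements of `size` bytes each, passing ctx to every callback.
 * Insertion sort: simple, O(n^2), for illustration only. */
void ctx_sort(void *base, size_t num, size_t size, void *ctx,
              ctx_cmp_fn cmp, ctx_swap_fn swap)
{
    char *b = base;

    assert(cmp != NULL && swap != NULL);
    for (size_t i = 1; i < num; i++)
        for (size_t j = i;
             j > 0 && cmp(b + (j - 1) * size, b + j * size, ctx) > 0;
             j--)
            swap(b + (j - 1) * size, b + j * size, size, ctx);
}

/* Elements are indices; ctx is the key table they refer to. */
int cmp_idx_by_key(const void *a, const void *b, void *ctx)
{
    const uint64_t *keys = ctx;
    uint64_t ka = keys[*(const uint32_t *)a];
    uint64_t kb = keys[*(const uint32_t *)b];

    return (ka > kb) - (ka < kb);
}

/* Byte-wise swap; ctx is unused here. */
void swap_bytes(void *a, void *b, size_t size, void *ctx)
{
    unsigned char *x = a, *y = b;

    (void)ctx;
    while (size--) {
        unsigned char t = *x;
        *x++ = *y;
        *y++ = t;
    }
}
```

Here the array being sorted holds small indices while the keys live in a separate table reached through the context pointer, which is the kind of indirection a context-free compare() cannot express without globals.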
> 
> Patch 4 of 4 has GFS2's implementation of the xreaddir() f_op and all its
> supporting functions. gfs2_xreaddir() tries to collect the requested data
> efficiently by ordering disk block accesses based on the filesystem's
> on-disk layout and also by adjusting the resizeable buffer as needed.
> 
> In my quick testing with a 50,000 file directory, xgetdents() is at least
> twice as fast as getdents()+stat()+getxattr() with a cold cache and nearly
> thrice as fast when the disk blocks have been cached.
> 
> Cheers!
> --Abhi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 