Hi all, The topic of a readdirplus-like syscall had come up for discussion at last year's LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 implementations to get at a directory's entries as well as stat() info on the individual inodes. I'm presenting these patches and some early test results on a single-node GFS2 filesystem. 1. dirreadahead() - This patchset is very simple compared to the xgetdents() system call below and scales very well for large directories in GFS2. dirreadahead() is designed to be called prior to getdents+stat operations. In it's current form, it only speeds up stat() operations by caching the relevant inodes. Support can be added in the future to cache extended attribute blocks as well. This works by first collecting all the inode numbers of the directory's entries (subject to a numeric or memory cap). This list is sorted by inode disk block order and passed on to workqueues to perform lookups on the inodes asynchronously to bring them into the cache. 2. xgetdents() - I had posted a version of this patchset some time last year and it is largely unchanged - I just ported it to the latest upstream kernel. It allows the user to request a combination of entries, stat and xattrs (keys/values) for a directory. The stat portion is based on David Howells' xstat patchset he had posted last year as well. I've included the relevant vfs bits in my patchset. xgetdents() in GFS2 works in two phases. In the first phase, it collects all the dirents by reading the directory in question. In phase two, it reads in inode blocks and xattr blocks (if requested) for each entry after sorting the disk accesses in block order. All of the intermediate data is stored in a buffer backed by a vector of pages and is eventually transferred to the user supplied buffer. Both syscalls perform significantly better than a simple getdents+stat with a cold cache. The main advantage lies in being able to sort disk accesses for a bunch of inodes in advance compared to seeking all over the disk for inodes one entry at a time. This graph (https://www.dropbox.com/s/fwi1ovu7mzlrwuq/speed-graph.png) shows the time taken to get directory entries and their respective stat info by 3 different sets of syscalls: 1) getdents+stat ('ls -l', basically) - Solid blue line 2) xgetdents with various buffer size and num_entries limits - Dotted lines Eg: v16384 d10000 means a limit of 16384 pages for the scratch buffer and a maximum of 10000 entries to collect at a time. 3) dirreadahead+getdents+stat with various num_entries limits - Dash-dot lines Eg: d10000 implies that it would fire off a max of 10000 inode lookups during each syscall. numfiles: 10000 50000 100000 500000 -------------------------------------------------------------------- getdents+stat 1.4s 220s 514s 2441s xgetdents 1.2s 43s 75s 1710s dirreadahead+getdents+stat 1.1s 5s 68s 391s Here is a seekwatcher graph from a test run on a directory of 50000 files. (https://www.dropbox.com/s/fma8d4jzh7365lh/50000-combined.png) The comparison is between getdents+stat and xgetdents. The first set of plots is of getdents+stat, followed by xgetdents() with steadily increasing buffer sizes (256 to 262144) and num_entries (100 to 1000000) limits. One can see the effect of ordering the disk reads in the Disk IO portion of the graphs and the corresponding effect on seeks, throughput and overall time taken. This second seekwatcher graph similarly shows the dirreadahead()+getdents()+stat() syscall-combo for a 500000-file directory with increasing num_entries (100 to 1000000) limits versus getdents+stat. (https://www.dropbox.com/s/rrhvamu99th3eae/500000-ra_combined_new.png) The corresponding getdents+stat baseline for this run is at the top of the series of graphs. I'm posting these two patchsets shortly for comments. Cheers! --Abhi Red Hat Filesystems -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html