Re: [RFC PATCH 0/2] dirreadahead system call

Abhijith Das <adas@xxxxxxxxxx> · Thu, 31 Jul 2014 22:11:05 -0400 (EDT)

----- Original Message -----
> From: "Dave Chinner" <david@xxxxxxxxxxxxx>
> To: "Andreas Dilger" <adilger@xxxxxxxxx>
> Cc: "Abhijith Das" <adas@xxxxxxxxxx>, "LKML" <linux-kernel@xxxxxxxxxxxxxxx>, "linux-fsdevel"
> <linux-fsdevel@xxxxxxxxxxxxxxx>, cluster-devel@xxxxxxxxxx
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > 
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@xxxxxxxxxx> wrote:
> > >>> OnJuly 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@xxxxxxxxx> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch entries in
> > >>>> readdir() order?  It isn't clear to me what benefit there is of
> > >>>> returning
> > >>>> the entries to userspace instead of just doing the statahead
> > >>>> implicitly
> > >>>> in the kernel?
> > >>>> 
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential access
> > >>>> pattern for readdir() + stat() in readdir() order (taking into account
> > >>>> if
> > >>>> ".*"
> > >>>> entries are being processed or not) and starts fetching the inode
> > >>>> attributes asynchronously with a worker thread.
> > >>> 
> > >>> Does this heuristic work well in practice? In the use case we were
> > >>> trying to
> > >>> address, a Samba server is aware beforehand if it is going to stat all
> > >>> the
> > >>> inodes in a directory.
> > >> 
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server.  I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >> 
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>> 
> > >>> Were you thinking arbitrary inodes across the filesystem or just a
> > >>> subset
> > >>> from a directory? Arbitrary inodes may potentially throw up locking
> > >>> issues.
> > >> 
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > > 
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> > 
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched.  If the inode
> > number is bad the filesystem can just ignore it.
> 
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
> 
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that is it allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
> 
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inodenumbers are untrusted, we can't avoid it.
> 
> So, again, it's way more overhead than userspace just calling
> stat() asycnhronously on many files at once as readdir/gentdents
> returns dirents from the kernel to speed up cache population.
> 
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if
> there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
> 
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
> 

This was next on my to-do list to see how an all-userspace solution compares
with doing this in the kernel. I'll post some results as soon as I can find
some time to play with this stuff again.

Cheers!
--Abhi
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html