----- Original Message ----- > From: "Dave Chinner" <david@xxxxxxxxxxxxx> > To: "Andreas Dilger" <adilger@xxxxxxxxx> > Cc: "Abhijith Das" <adas@xxxxxxxxxx>, "LKML" <linux-kernel@xxxxxxxxxxxxxxx>, "linux-fsdevel" > <linux-fsdevel@xxxxxxxxxxxxxxx>, cluster-devel@xxxxxxxxxx > Sent: Thursday, July 31, 2014 6:53:06 PM > Subject: Re: [RFC PATCH 0/2] dirreadahead system call > > On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote: > > On Jul 31, 2014, at 6:49, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote: > > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@xxxxxxxxxx> wrote: > > >>> OnJuly 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@xxxxxxxxx> wrote: > > >>>> Is there a time when this doesn't get called to prefetch entries in > > >>>> readdir() order? It isn't clear to me what benefit there is of > > >>>> returning > > >>>> the entries to userspace instead of just doing the statahead > > >>>> implicitly > > >>>> in the kernel? > > >>>> > > >>>> The Lustre client has had what we call "statahead" for a while, > > >>>> and similar to regular file readahead it detects the sequential access > > >>>> pattern for readdir() + stat() in readdir() order (taking into account > > >>>> if > > >>>> ".*" > > >>>> entries are being processed or not) and starts fetching the inode > > >>>> attributes asynchronously with a worker thread. > > >>> > > >>> Does this heuristic work well in practice? In the use case we were > > >>> trying to > > >>> address, a Samba server is aware beforehand if it is going to stat all > > >>> the > > >>> inodes in a directory. > > >> > > >> Typically this works well for us, because this is done by the Lustre > > >> client, so the statahead is hiding the network latency of the RPCs to > > >> fetch attributes from the server. I imagine the same could be seen with > > >> GFS2. I don't know if this approach would help very much for local > > >> filesystems because the latency is low. > > >> > > >>>> This syscall might be more useful if userspace called readdir() to get > > >>>> the dirents and then passed the kernel the list of inode numbers > > >>>> to prefetch before starting on the stat() calls. That way, userspace > > >>>> could generate an arbitrary list of inodes (e.g. names matching a > > >>>> regexp) and the kernel doesn't need to guess if every inode is needed. > > >>> > > >>> Were you thinking arbitrary inodes across the filesystem or just a > > >>> subset > > >>> from a directory? Arbitrary inodes may potentially throw up locking > > >>> issues. > > >> > > >> I was thinking about inodes returned from readdir(), but the syscall > > >> would be much more useful if it could handle arbitrary inodes. > > > > > > I'm not sure we can do that. The only way to safely identify a > > > specific inode in the filesystem from userspace is via a filehandle. > > > Plain inode numbers are susceptible to TOCTOU race conditions that > > > the kernel cannot resolve. Also, lookup by inode number bypasses > > > directory access permissions, so is not something we would expose > > > to arbitrary unprivileged users. > > > > None of these issues are relevant in the API that I'm thinking about. > > The syscall just passes the list of inode numbers to be prefetched > > into kernel memory, and then stat() is used to actually get the data into > > userspace (or whatever other operation is to be done on them), > > so there is no danger if the wrong inode is prefetched. If the inode > > number is bad the filesystem can just ignore it. > > Which means the filesystem has to treat the inode number as > potentially hostile. i.e. it can not be trusted to be correct and so > must take slow paths to validate the inode numbers. This adds > *significant* overhead to the readahead path for some filesystems: > readahead is only a win if it is low cost. > > For example, on XFS every untrusted inode number lookup requires an > inode btree lookup to validate the inode is actually valid on disk > and that is it allocated and has references. That lookup serialises > against inode allocation/freeing as well as other lookups. In > comparison, when using a trusted inode number from a directory > lookup within the kernel, we only need to do a couple of shift and > mask operations to convert it to a disk address and we are good to > go. > > i.e. the difference is at least 5 orders of magnitude higher CPU usage > for an "inode number readahead" syscall versus a "directory > readahead" syscall, it has significant serialisation issues and it > can stall other modification/lookups going on at the same time. > That's *horrible behaviour* for a speculative readahead operation, > but because the inodenumbers are untrusted, we can't avoid it. > > So, again, it's way more overhead than userspace just calling > stat() asycnhronously on many files at once as readdir/gentdents > returns dirents from the kernel to speed up cache population. > > That's my main issue with this patchset - it's implementing > something in kernelspace that can *easily* be done generically in > userspace without introducing all sorts of nasty corner cases that > we have to handle in the kernel. We only add functionality to the kernel if > there's a > compelling reason to do it in kernelspace, and right now I just > don't see any numbers that justify adding readdir+stat() readahead > or inode number based cache population in kernelspace. > > Before we add *any* syscall for directory readahead, we need > comparison numbers against doing the dumb multithreaded > userspace readahead of stat() calls. If userspace can do this as > fast as the kernel can.... > This was next on my to-do list to see how an all-userspace solution compares with doing this in the kernel. I'll post some results as soon as I can find some time to play with this stuff again. Cheers! --Abhi -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html