On Thu, May 14, 2015 at 4:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> IIRC, ext4 readdir is not slow because of the use of the buffer
> cache, it's slow because of the way it hashes dirents across blocks
> on disk. i.e. it has locality issues, not a caching problem.

No, you're just worrying about IO. Natural for a filesystem guy, but a
lot of loads cache really well, and IO isn't an issue. Yes, there's a
bad cold-cache case, but that's not when you get inode semaphore
contention.

You get contention when you have lots of concurrent accesses to the
same directory, and then the data is all nice and hot in the caches.
But readdir() _still_ sucks donkey ass by the bucket-load for that
case. And that's the case I'm talking about.

Using the buffer cache for readdir() is a complete disaster, because
it means that

 (a) you have to go down to the filesystem, wasting CPU resources, and
     more importantly, going into code that by definition hasn't been
     optimized as well and cannot ever be, because it's not common code
     that everybody sees.

 (b) you have to look up the physical block number, wasting even
     *more* CPU resources, because the buffer heads are physically
     indexed

 (c) you then use the buffer head lookup, which itself isn't horrible,
     but it's not as well optimized as the page cache is.

 (d) and because we call into the filesystem, not only is the code not
     getting as much attention as the vfs layer, we generally can't
     trust filesystem guys to get locking right (because 90% of the
     filesystems don't get the attention they need even _without_
     locking, and the 10% that does is maintained by people who worry
     mainly about IO). So the VFS layer has no real choice except to
     use a big-hammer "lock the whole damn directory" approach.

End result: readdir() wastes a *lot* of time on stupid stuff (just
that physical block number lookup is generally more expensive than
readdir itself should be), and it does so with excessive locking,
serializing everything.
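To make points (a) through (c) concrete, here is a rough user-space sketch of the two lookup shapes. It is a toy model, not kernel code: every name in it (toy_dir, toy_cache, lookup_physical, lookup_logical and friends) is made up for illustration, the "block map" is a flat array standing in for whatever extent or indirect-block walk a real filesystem does, and the counters only tally which steps each path has to take.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS     8   /* blocks in the toy directory */
#define CACHE_SLOTS 64  /* slots per toy cache */

/* Hypothetical directory: holds the logical -> physical block map. */
struct toy_dir {
    uint64_t ino;
    uint64_t phys[NBLOCKS];
};

/* One cache slot; key2 is always 0 in the physically indexed cache. */
struct toy_slot {
    int valid;
    uint64_t key1, key2;
    const char *data;
};

struct toy_cache { struct toy_slot slot[CACHE_SLOTS]; };

/* Tally of the work a single lookup had to do. */
struct toy_stats { unsigned fs_calls, map_lookups, cache_probes; };

struct toy_slot *toy_slot_for(struct toy_cache *c, uint64_t k1, uint64_t k2)
{
    return &c->slot[(k1 * 31u + k2) % CACHE_SLOTS];
}

void toy_insert(struct toy_cache *c, uint64_t k1, uint64_t k2, const char *data)
{
    struct toy_slot *s = toy_slot_for(c, k1, k2);
    s->valid = 1;
    s->key1 = k1;
    s->key2 = k2;
    s->data = data;
}

const char *toy_probe(struct toy_cache *c, uint64_t k1, uint64_t k2,
                      struct toy_stats *st)
{
    struct toy_slot *s = toy_slot_for(c, k1, k2);
    st->cache_probes++;
    return (s->valid && s->key1 == k1 && s->key2 == k2) ? s->data : NULL;
}

/* Buffer-cache shape: the cache is keyed by physical block number. */
const char *lookup_physical(struct toy_cache *bufs, const struct toy_dir *dir,
                            unsigned lblk, struct toy_stats *st)
{
    st->fs_calls++;                      /* (a) enter filesystem-specific code */
    if (lblk >= NBLOCKS)
        return NULL;
    st->map_lookups++;                   /* (b) logical -> physical mapping */
    uint64_t pblk = dir->phys[lblk];
    return toy_probe(bufs, pblk, 0, st); /* (c) probe the physical index */
}

/* Page-cache shape: the cache is keyed by (inode, logical index). */
const char *lookup_logical(struct toy_cache *pages, uint64_t ino,
                           unsigned lblk, struct toy_stats *st)
{
    return toy_probe(pages, ino, lblk, st); /* one probe, no filesystem call */
}
```

With the same directory block cached in both toy caches, lookup_physical() records one filesystem call, one mapping lookup and one cache probe, while lookup_logical() records a single probe. The real costs are obviously nothing like a counter increment, but the extra indirection on the physical path is the part that matters here.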
Both readdir() and path component lookup are technically read
operations, so why the hell do we use a mutex, rather than just get a
read-write lock for reading? Yeah, it's that (d) above. I might trust
xfs and ext4 to get their internal exclusions for allocations etc
right when called concurrently for the same directory. But the others?

I saw you talk about how the aio IO paths are "better" than the
regular page cache paths just a few days ago (when talking about
persistent memory). You're completely and utterly out to lunch,
*especially* with things like persistent memory, where the IO paths
wouldn't even *exist*, because things never get out of the cache.

And being that out to lunch on this comes from your total fixation
with IO. The page cache is one studly mf in the normal cases when
things are cached, BUT YOU NEVER EVEN SEE THAT. Why? Because your
filesystem code never gets called for it, and the page cache ends up
having almost perfect behavior. It scales perfectly, and it scales
with good performance.

I understand where you are coming from, but caching really really
works. You ignore that, because you don't see those things, and the
caching case never affects you.

The readdir path? It sucks. And it sucks exactly because it's done in
the filesystem, and not in some VFS caches that we could actually make
go fast. We can't cache it well.

Basically, in computer science, pretty much all performance work is
about caching. And readdir is the one area where the VFS layer doesn't
do well, falls on its face and punts back to the filesystem.

                Linus