On Thu, May 14, 2015 at 08:51:12AM -0700, Linus Torvalds wrote:
> On Thu, May 14, 2015 at 4:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > IIRC, ext4 readdir is not slow because of the use of the buffer
> > cache, it's slow because of the way it hashes dirents across blocks
> > on disk. i.e. it has locality issues, not a caching problem.
>
> No, you're just worrying about IO. Natural for a filesystem guy, but a
> lot of loads cache really well, and IO isn't an issue. Yes, there's a
> bad cold-cache case, but that's not when you get inode semaphore
> contention.

Right, because it's cold cache performance that everyone complains
about. e.g. workloads like gluster, ceph, fileservers, openstack
(e.g. swift) etc are all mostly cold cache directory workloads with
*extremely high* concurrency. Nobody is complaining about cached
readdir performance - concurrency in cold cache directory operations
is what everyone has been asking me for.

In case you missed it, recently the Ceph developers have been
talking about storing file handles in a userspace database and then
using open_by_handle_at() so they can avoid the pain of cold cache
directory lookup overhead (see the O_NOMTIME thread). We have a
serious cold cache lookup problem on directories when people are
looking to bypass the directory structure entirely....

[snip a bunch of rhetoric lacking in technical merit]

> End result: readdir() wastes a *lot* of time on stupid stuff (just
> that physical block number lookup is generally more expensive than
> readdir itself should be), and it does so with excessive locking,
> serializing everything.

Most of the overhead in readdir is calling filldir over and over
again for every dirent to copy it into the user buffer. The overhead
is not from looking up the buffer in the cache.

So, I just created close to a million dirents in a directory, and
ran the xfs_io readdir command on it (look, a readdir performance
measurement tool!).
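(For anyone without xfs_io to hand, a rough userspace analogue of this
kind of measurement might look like the sketch below. It is not what
was run here: the function name time_readdir and the default path are
made up for illustration, and os.scandir() goes through the C library's
readdir() and skips "." and "..", so interpreter overhead means the
absolute numbers will not be comparable to the xfs_io figures.)

```python
import os
import sys
import time


def time_readdir(path):
    """Iterate one directory with os.scandir() and time the full pass.

    Returns (entry_count, elapsed_seconds, entries_per_second).
    """
    start = time.perf_counter()
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small directories.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


if __name__ == "__main__":
    # Hypothetical default: measure the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, elapsed, rate = time_readdir(target)
    print(f"{count} dirents in {elapsed:.4f} sec ({rate:.1f} dirents/sec)")
```

Running it once after dropping caches and then again gives a cold
versus hot comparison in the same spirit as the numbers in this mail.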
I used a ram disk to take IO out of the picture for the first
read, the system has E5-4620 0 @ 2.20GHz CPUs, and I dropped caches
to ensure that there was no cached metadata:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (111.011 MiB/sec and 3637694.9201 ops/sec)
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (189.864 MiB/sec and 6221628.5056 ops/sec)
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (190.156 MiB/sec and 6231201.6629 ops/sec)
$

Reading, decoding and copying dirents at 190MB/s? That's roughly 6
million dirents/second being pulled from cache, and it's doing
roughly 4 million/second cold cache. That's not slow at all.

What *noticeable* performance gains are there to be had here for the
average user? Anything that takes less than a second or two to
complete is not going to be noticeable to a user, and most people
don't have 8-10 million inodes in a directory....

So, what did the profile look like?

  10.07%  [kernel]  [k] __xfs_dir3_data_check
   9.92%  [kernel]  [k] copy_user_generic_string
   7.44%  [kernel]  [k] xfs_dir_ino_validate
   6.83%  [kernel]  [k] filldir
   5.43%  [kernel]  [k] xfs_dir2_leaf_getdents
   4.56%  [kernel]  [k] kallsyms_expand_symbol.constprop.1
   4.38%  [kernel]  [k] _raw_spin_unlock_irqrestore
   4.26%  [kernel]  [k] _raw_spin_unlock_irq
   4.02%  [kernel]  [k] __memcpy
   3.02%  [kernel]  [k] format_decode
   2.36%  [kernel]  [k] xfs_dir2_data_entsize
   2.28%  [kernel]  [k] vsnprintf
   1.99%  [kernel]  [k] __do_softirq
   1.93%  [kernel]  [k] xfs_dir2_data_get_ftype
   1.88%  [kernel]  [k] number.isra.14
   1.84%  [kernel]  [k] _xfs_buf_find
   1.82%  [kernel]  [k] ___might_sleep
   1.61%  [kernel]  [k] strnlen
   1.49%  [kernel]  [k] queue_work_on
   1.48%  [kernel]  [k] string.isra.4
   1.21%  [kernel]  [k] __might_sleep

Oh, I'm running CONFIG_XFS_DEBUG=y, so internal runtime consistency
checks consume a large share of the CPU (__xfs_dir3_data_check,
xfs_dir_ino_validate). IOWs, real world readdir performance will be
much, much faster than I've demonstrated.

Other than that, the most CPU is spent on copying dirents into the
user buffer (copy_user_generic_string), passing dirents to the user
buffer (filldir) and extracting dirents from the on-disk buffer
(xfs_dir2_leaf_getdents). Then we have lock contention, ramdisk IO
(memcpy), some vsnprintf stuff (includes format_decode, probably
debug code) and some more dirent information extraction functions.

It's not until we get to _xfs_buf_find() that we see a buffer cache
lookup function, and that's actually consuming less CPU than the
__might_sleep/___might_sleep() debug annotations. That puts into
perspective just how little overhead readdir buffer caching actually
has compared to everything else.

IOWs, these numbers indicate that readdir caching overhead has no
real impact on the performance of hot cache readdir operations.

So, back to the question I asked that you didn't answer: exactly
what are you proposing to cache in the VFS readdir cache?
Without knowing that, I can't make any sane comment on the technical
merit of your proposal....

> Both readdir() and path component lookup are technically read
> operations, so why the hell do we use a mutex, rather than just
> get a read-write lock for reading? Yeah, it's that (d) above. I
> might trust xfs and ext4 to get their internal exclusions for
> allocations etc right when called concurrently for the same
> directory. But the others?

They just use a write lock for everything and *nothing changes* -
this is a simple problem to solve. The argument "filesystem
developers are stupid" is not a compelling argument against changing
locking. You're just being insulting, even though you probably don't
realise it.

[snip more rhetoric about the page cache being the only solution]

> Basically, in computer science, pretty much all performance work
> is about caching. And readdir is the one area where the VFS layer
> doesn't do well, falls on its face and punts back to the
> filesystem.

Caching is used to hide the problems of the lower layers. If the
lower layers don't have a problem, then another layer of caching is
not necessary.

Linus, what you haven't put together is a clear statement of the
problem another layer of readdir caching is going to solve. What
workload is having problems? Where are the profiles demonstrating
that readdir caching is the issue, or the solution to the issue you
are seeing? We know about plenty of workloads where directory access
concurrency is a real problem, but I'm not seeing the problem you
are trying to address...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
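
As a footnote to the profile earlier in this mail, a quick tally makes
the buffer cache comparison concrete. This is only a sketch: the
percentages are copied from the profile, while the grouping, the group
labels and the helper name tally are illustrative assumptions that
follow the description given in the text.

```python
# Percentages copied from the perf profile in the mail (symbol -> % of samples).
profile = {
    "__xfs_dir3_data_check": 10.07,
    "copy_user_generic_string": 9.92,
    "xfs_dir_ino_validate": 7.44,
    "filldir": 6.83,
    "xfs_dir2_leaf_getdents": 5.43,
    "_xfs_buf_find": 1.84,
    "___might_sleep": 1.82,
    "__might_sleep": 1.21,
}

# Illustrative grouping (an assumption), following the prose description.
groups = {
    "XFS debug consistency checks": ["__xfs_dir3_data_check", "xfs_dir_ino_validate"],
    "dirent extraction and copy-out": [
        "copy_user_generic_string",
        "filldir",
        "xfs_dir2_leaf_getdents",
    ],
    "might_sleep debug annotations": ["___might_sleep", "__might_sleep"],
    "buffer cache lookup": ["_xfs_buf_find"],
}


def tally(profile, groups):
    """Sum the profile percentages for each named group of symbols."""
    return {
        name: round(sum(profile[sym] for sym in syms), 2)
        for name, syms in groups.items()
    }


totals = tally(profile, groups)
# Print the groups from largest to smallest share.
for name, pct in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{pct:6.2f}%  {name}")
```

With these figures the buffer cache lookup comes to 1.84%, against
3.03% for the two might_sleep annotations, 17.51% for the debug checks
and 22.18% for dirent extraction and copy-out.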