On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 07:26:31PM GMT, Linus Torvalds wrote:
> > On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > However, the problematic workload is cold cache operations where
> > > the dentry cache repeatedly misses. This places all the operational
> > > concurrency directly on the inode hash as new inodes are inserted
> > > into the hash. Add memory reclaim and that adds contention as it
> > > removes inodes from the hash on eviction.
> >
> > Yeah, and then we spend all the time just adding the inodes to the
> > hashes, and probably fairly seldom use them. Oh well.
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
> >
> > So it's all a bit nasty.
> >
> > But I also assume most of the bad issues end up mainly showing up on
> > just fairly synthetic benchmarks with ramdisks, because even with a
> > good SSD I suspect the IO for the cold cache would still dominate?
>
> Not for bcachefs, because filling into the vfs inode cache doesn't
> require a disk read - they're cached in the inodes btree and much
> smaller there. We use a varint encoding so they're typically 50-100
> bytes, last I checked.
>
> Compare to ~1k, plus or minus, in the vfs inode cache.
>
> Thomas Bertschinger has been working on applications at LANL where
> avoiding pulling into the vfs inode cache seems to make a significant
> difference (file indexing in his case) - it turns out there's an xattr
> syscall that's missing, which I believe he'll be submitting a patch
> for.
>
> But stat/statx always pulls into the vfs inode cache, and that's
> likely worth fixing.

No, let's not even consider going there.

Unlike most people, old time XFS developers have direct experience
with the problems that "uncached" inode access for stat purposes
causes.

XFS has had the bulkstat API for a long, long time (i.e. since 1998 on
Irix). When it was first implemented on Irix, it was VFS cache
coherent. But in the early 2000s, that caused problems with HSMs
needing to scan billions of inodes indexing petabytes of stored data
with certain SLA guarantees (i.e. needing to scan at least a million
inodes a second). The CPU overhead of cache instantiation and teardown
was too great to meet those performance targets on 500MHz MIPS CPUs.

So we converted bulkstat to run directly out of the XFS buffer cache
(i.e. uncached from the perspective of the VFS). This reduced the CPU
overhead per inode substantially, allowing bulkstat rates to increase
by a factor of 10. However, it introduced all sorts of coherency
problems between cached inode state vs what was stored in the buffer
cache. It was basically O_DIRECT for stat() and, as you'd expect from
that description, the coherency problems were horrible. Detecting
iallocated-but-not-yet-updated and unlinked-but-not-yet-freed inodes
was a particularly consistent source of issues.

The only way to fix these coherency problems was to check the inode
cache for a resident inode first, which basically defeated the entire
purpose of bypassing the VFS cache in the first place.
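To make that concrete, the coherent fallback ends up looking something
like the sketch below. This is purely illustrative - it is not the old
Irix code, and fill_stat_from_cached_inode() and
fill_stat_from_buffer_cache() are made-up helpers - but it shows why
the "uncached" path still has to probe the VFS cache for every inode:

#include <linux/fs.h>
#include <linux/stat.h>

/* Hypothetical helpers for the sketch - not real kernel/XFS functions. */
int fill_stat_from_cached_inode(struct inode *inode, struct kstat *stat);
int fill_stat_from_buffer_cache(struct super_block *sb, unsigned long ino,
				struct kstat *stat);

static int bulkstat_one_uncached(struct super_block *sb, unsigned long ino,
				 struct kstat *stat)
{
	struct inode *inode;
	int error;

	/*
	 * Probe the VFS inode cache first. If the inode is resident, its
	 * in-core state is newer than anything in the buffer cache, so it
	 * has to be used as the source of truth.
	 */
	inode = ilookup(sb, ino);
	if (inode) {
		error = fill_stat_from_cached_inode(inode, stat);
		iput(inode);
		return error;
	}

	/*
	 * Not resident: read the on-disk copy out of the buffer cache.
	 * This is where allocated-but-not-yet-updated and
	 * unlinked-but-not-yet-freed inodes bite you.
	 */
	return fill_stat_from_buffer_cache(sb, ino, stat);
}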
So we went back to using coherent lookups once we stopped caring about
the old SGI HSM SLA requirements back in 2010:

commit 7dce11dbac54fce777eea0f5fb25b2694ccd7900
Author: Christoph Hellwig <hch@xxxxxx>
Date:   Wed Jun 23 18:11:11 2010 +1000

    xfs: always use iget in bulkstat

    The non-coherent bulkstat versions that look directly at the inode
    buffers cause various problems with performance optimizations that
    make increased use of just logging inodes. This patch makes
    bulkstat always use iget, which should be fast enough for normal
    use with the radix-tree based inode cache introduced a while ago.

    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>

Essentially, we now always bring inodes into cache as fully
instantiated VFS inodes, and mark them with I_DONTCACHE so they get
torn down immediately after we've used them if they weren't already
cached. (i.e. all accesses are done in a single task context with the
inode hot in the CPU caches.)

Without the patchset I pointed to, bulkstat maxes out on the
sb->s_inode_list_lock at about 700,000 inodes/sec cycling through the
VFS inode cache. With that patchset, it maxed out the bandwidth of the
NVMe SSDs I was using at ~7GB/s of read IO, cycling about 6 million
inodes/sec through the VFS cache.

IOWs, we can make the VFS inode cache scale way beyond what most users
actually need right now.

The lesson in all this? Don't hack around VFS scalability issues if it
can be avoided. If we fix the problems properly in the first place,
then most fs developers (let alone users!) won't even realise there
was a scalability problem in the VFS to begin with. Users will,
however, notice every inconsistency and/or corruption caused by fs
developers trying to work around VFS cache limitations...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
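P.S. For anyone who hasn't seen the DONTCACHE pattern before, here is
a minimal sketch of what's described above using generic VFS calls
only. example_iget() is a made-up stand-in for the filesystem's own
coherent inode lookup, and the real bulkstat implementation is rather
more involved:

#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/stat.h>

/* Hypothetical stand-in for the filesystem's own coherent inode lookup. */
struct inode *example_iget(struct super_block *sb, unsigned long ino);

static int bulkstat_one_coherent(struct super_block *sb, unsigned long ino,
				 struct kstat *stat)
{
	struct inode *inode;

	/* Fully instantiate the VFS inode through the normal, coherent path. */
	inode = example_iget(sb, ino);
	if (IS_ERR(inode))
		return PTR_ERR(inode);

	/*
	 * Mark the inode (and any dentry aliases) I_DONTCACHE so the VFS
	 * tears it down as soon as the last reference is dropped. The real
	 * code only does this on a cache miss, so inodes that were already
	 * cached stay cached.
	 */
	d_mark_dontcache(inode);

	/* ... copy the stat data out into *stat while the inode is hot ... */

	iput(inode);
	return 0;
}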