On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > I've picked Linus patch and your for testing into the vfs.inode.rcu branch. > > Btw, if you added [patch 2/2] too, I think the exact byte counts in > the comments are a bit misleading. > > The actual cacheline details will very much depend on 32-bit vs 64-bit > builds, but also on things like the slab allocator debug settings. > I indeed assumed "x86_64 production", with lines just copied from pahole. However, to the best of my understanding the counts are what one should expect on a 64-bit kernel. That said: > I think the important part is to keep the d_lockref - that is often > dirty and exclusive in the cache - away from the mostly RO parts of > the dentry that can be shared across CPU's in the cache. > > So rather than talk about exact byte offsets, maybe just state that > overall rule? > I would assume the rule is pretty much well known and instead an indicator where is what (as added in my comments) would be welcome. But if going that route, then perhaps: "Make sure d_lockref does not share a cacheline with fields used by RCU lookup, otherwise it can result in a signification throughput reduction. You can use pahole(1) to check the layout." [maybe a link to perfbook or something goes here?] Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap (only active without whatever runtime debug messing with the layout). -- Mateusz Guzik <mjguzik gmail.com>