I already mentioned these to Al, so he has seen most of them, because I
wanted to make sure he was ok with the link_path_walk updates. But since
he was ok (with a few comments), I cleaned things up and separated
things into branches, and here's a heads-up for a wider audience in case
anybody cares.

This all started from me doing profiling on arm64, and just being
annoyed by the code generation and some - admittedly mostly pretty darn
minor - performance issues. It started with the arm64 user access code,
moved on to __d_lookup_rcu(), and then extended into link_path_walk(),
which together end up being the most noticeable parts of path lookup.

The user access code is mostly for strncpy_from_user() - which is the
main way the vfs layer gets the pathnames.

vfs people probably don't really care - arm people cc'd, although
they've seen most of this in earlier iterations (the minor
word-at-a-time tweak is new). Same goes for x86 people for the minor
changes on that side.

I've pushed out four branches based on 6.10-rc4, because I think it's
pretty ready. But I'll rebase them if people have commentary that needs
addressing, so don't treat them as some kind of stable base yet. My
plan is to merge them during the next merge window unless somebody
screams.

The branches are:

  arm64-uaccess:
      arm64: access_ok() optimization
      arm64: start using 'asm goto' for put_user()
      arm64: start using 'asm goto' for get_user() when available

  link_path_walk:
      vfs: link_path_walk: move more of the name hashing into hash_name()
      vfs: link_path_walk: improve may_lookup() code generation
      vfs: link_path_walk: do '.' and '..' detection while hashing
      vfs: link_path_walk: clarify and improve name hashing interface
      vfs: link_path_walk: simplify name hash flow

  runtime-constants:
      arm64: add 'runtime constant' support
      runtime constants: add x86 architecture support
      runtime constants: add default dummy infrastructure
      vfs: dcache: move hashlen_hash() from callers into d_hash()

  word-at-a-time:
      arm64: word-at-a-time: improve byte count calculations for LE
      x86-64: word-at-a-time: improve byte count calculations

The arm64-uaccess branch is just what it says, and makes a big
difference in strncpy_from_user(). The "access_ok()" change is
certainly debatable, but I think it needs to be done for sanity. I
think it's one of those "let's do it, and if it causes problems we'll
have to fix things up" things.

The link_path_walk branch is the one that changes the vfs layer the
most, but it's really mostly just a series of "fix calling conventions
of 'hash_name()' to be better".

The runtime-constants thing most people have already seen, it just
makes d_hash() avoid all indirect memory accesses. And word-at-a-time
just fixes code generation for both arm64 and x86-64 to use better
sequences.

None of this should be a huge deal, but together they make the
profiles for __d_lookup_rcu(), link_path_walk() and
strncpy_from_user() look pretty much optimal. And by "optimal" I mean
"within the confines of what they do". For example, making d_hash()
avoid indirection just means that now pretty much _all_ the cost of
__d_lookup_rcu() is in the cache misses on the hash table itself.
Which was always the bulk of it.

And on my arm64 machine, it turns out that the best optimization for
the load I tested would be to make that hash table smaller to actually
be a bit denser in the cache. But that's such a load-dependent
optimization that I'm not doing it. Tuning the hash table size or data
structure cacheline layouts might be worthwhile - and likely a bigger
deal - but is _not_ what these patches are about.

            Linus