On Tue, Jul 20, 2021 at 05:58:39PM +0300, Nikolay Borisov wrote:
>
>
> On 16.07.21 г. 1:33, Dave Chinner wrote:
> > On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
> >> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> >>> I was wondering the same thing, but AFAICS it seems to be possible i.e.
> >>> if userspace passes bad offsets, while all kinds of internal fs
> >>> synchronization ops are going to be performed on aligned offsets, that
> >>> doesn't mean the original ones, passed from userspace are themselves
> >>> aligned explicitly.
> >>
> >> Ah, I thought it'd be failed before we got to this point.
> >>
> >> But honestly, I think x86-64 needs to be fixed to either use
> >> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> >> tried to find the gcc implementation of __builtin_memcmp() on
> >> x86-64, but I can't.
> >
> > Yup, this. memcmp() is widely used in hot paths through all the
> > filesystem code and the rest of the kernel. We should fix the
> > generic infrastructure problem, not play whack-a-mole with custom
> > one-off fixes that avoid the problem just where it shows up in some
> > profile...
>
> I ported glibc's implementation of memcmp to the kernel and after
> running the same workload I get the same performance as with the basic
> memcmp implementation of doing byte comparison ...

That's bizarre because the glibc memcmp that you pointed to earlier
basically does what your open-coded solution did.  Is it possible you
have a bug in one of the tests and it's falling back to the byte loop?

Specifically for the dedup case, we only need the optimisation that

	if ((p1 | p2 | length) & 7)
		... do the byte loop ...
	... do the long-based comparison ...

so another possibility is that memcmp is doing too many tests.
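
For illustration, the alignment test described above could be sketched in
userspace C roughly as follows.  This is only a sketch under stated
assumptions, not the kernel's or glibc's memcmp: the names memcmp_sketch()
and cmp_bytes() are made up for this example, p1 and p2 are cast to
uintptr_t so they can be ORed with the length, and a mismatching word is
re-compared bytewise so the sign of the result does not depend on
endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time comparison with memcmp()-style return values. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (; n; a++, b++, n--) {
		if (*a != *b)
			return *a < *b ? -1 : 1;
	}
	return 0;
}

/* Hypothetical helper name; an illustration of the single up-front test. */
static int memcmp_sketch(const void *p1, const void *p2, size_t length)
{
	const unsigned char *a = p1, *b = p2;

	/*
	 * If either pointer or the length has any of its low three bits
	 * set, something is not 8-byte aligned: use the byte loop.
	 */
	if (((uintptr_t)p1 | (uintptr_t)p2 | length) & 7)
		return cmp_bytes(a, b, length);

	/* Everything is 8-byte aligned: compare one 64-bit word at a time. */
	for (; length; a += 8, b += 8, length -= 8) {
		uint64_t x, y;

		/* memcpy() sidesteps strict-aliasing concerns in userspace. */
		memcpy(&x, a, sizeof(x));
		memcpy(&y, b, sizeof(y));
		if (x != y)
			return cmp_bytes(a, b, 8); /* locate the differing byte */
	}
	return 0;
}
```

For the dedup case only equality matters, so the bytewise re-check of the
mismatching word could presumably be replaced by returning any non-zero
value; it is kept here so the sketch behaves like a general memcmp().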