On Thu, Dec 16, 2010 at 8:53 PM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote: > On 12/15/2010 08:00 PM, David Miller wrote: >> From: Boaz Harrosh <bharrosh@xxxxxxxxxxx> >> Date: Wed, 15 Dec 2010 15:15:09 +0200 >> >>> I agree that the byte-compare or long-compare should give you very close >>> results in modern pipeline CPUs. But surly 12 increments-and-test should >>> show up against 3 (or even 2). I would say it must be a better plan. >> >> For strings of these lengths the setup code necessary to initialize >> the inner loop and the tail code to handle the sub-word ending cases >> eliminate whatever gains there are. >> > > You miss understood me. I'm saying that we know the beggining of the > string is aligned and Nick offered to pad the last long, so surly > a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3 > should give you an optimization? Masking is still going to take a bit more code, but perhaps. We're talking a handful of cycles here, so if you add a branch mispredict, or icache miss, you'll kill your savings. This is what I've got at the moment, which adds only 8 bytes over the rep cmp variant, in the __d_lookup_rcu function. static inline int dentry_memcmp(const unsigned char *cs, const unsigned char *ct, size_t count) { int ret; do { ret = (*cs != *ct); if (ret) break; cs++; ct++; count--; } while (count); return ret; } Now if we pad and zero fill the dentry name, then we can compare with the path string, but we still have to mask that guy (unfortunately, I didn't consider that earlier) so it's not trivial and adds quite a bit of code size and branches: static inline int dentry_memcmp_long(const unsigned char *cs, const unsigned char *ct, ssize_t count) { const unsigned long *ls = (const unsigned long *)cs; const unsigned long *lt = (const unsigned long *)ct; int ret; do { unsigned long t = *lt; unsigned long s = *ls; int shift = 0; if (count < 8) shift = (8 - count) * 8; t = t & (0xffffffffffffffff >> shift); ret = (s != t); if (ret) break; ls++; lt++; count-=8; } while (count > 0); return ret; } Something like this should work on little endian. You'd have to coax gcc to generate a cmov to get rid of that branch I think, because it won't be predictable for small string lengths. But then you have to think about icache... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html