On Mon, Dec 13, 2010 at 6:29 PM, J. R. Okajima <hooanon05@xxxxxxxxxxx> wrote:
>
> Nick Piggin:
>> It's not scaling but just single threaded performance. gcc turns memcmp
>> into rep cmp, which has quite a long latency, so it's not appropriate
>> for short strings.
>
> Honestly speaking I doubt how this 'long *' approach is effective
> (Of course it never means that your result (by 'char *') is doubtful).

Well, let's see what turns up. We certainly can try the long * approach.
I suspect on architectures where byte loads are very slow, gcc will block
the loop into larger loads, so it should be no worse than a normal memcmp
call, but if we do explicit padding we can avoid all the problems
associated with tail handling.

Doing name padding and long * comparison will be practically free
(because the slab allocator will align out to sizeof(long long) anyway),
so if any architecture prefers to do the long loads, I'd be interested
to hear and we could whip up a patch.

> But is the "rep cmp has quite a long latency" issue generic for all x86
> architecture, or Westmere system specific?

I don't believe it is Westmere specific. Intel and AMD have been
improving these instructions in the past few years, so Westmere is
probably as good as or better than any. That said, rep cmp may not be as
heavily optimized as the set and copy string instructions.

In short, I think the change should be suitable for all x86 CPUs, but
I would like to hear more opinions or see numbers for other cores.

Thanks,
Nick
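
For illustration only, here is a minimal userspace sketch of the two comparison strategies discussed in this message: a byte-at-a-time loop (the 'char *' approach) and a word-at-a-time loop over zero-padded names (the 'long *' approach). It is not code from any actual patch. The function names (name_cmp_bytes, name_cmp_words, name_alloc_padded, padded_len) are hypothetical, calloc() stands in for the slab allocator, and the word loads go through memcpy() so the sketch stays portable C. A kernel version would presumably use plain long loads on suitably aligned names.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Byte-at-a-time compare (the 'char *' approach).
 * Returns 0 if the first len bytes match, nonzero otherwise. */
static int name_cmp_bytes(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    while (len--) {
        if (*a++ != *b++)
            return 1;
    }
    return 0;
}

/* Round len up to a multiple of sizeof(unsigned long). */
static size_t padded_len(size_t len)
{
    const size_t w = sizeof(unsigned long);

    return (len + w - 1) / w * w;
}

/* Hypothetical helper: copy a name into a zero-padded buffer whose size
 * is a whole number of words. calloc() stands in for the slab allocator
 * here (an assumption of this sketch). */
static unsigned char *name_alloc_padded(const char *name, size_t len)
{
    size_t plen = padded_len(len);
    unsigned char *buf;

    if (plen == 0)
        plen = sizeof(unsigned long);   /* always allocate one word */
    buf = calloc(plen, 1);
    if (buf)
        memcpy(buf, name, len);
    return buf;
}

/* Word-at-a-time compare (the 'long *' approach). Both names must come
 * from name_alloc_padded() with the same len. Because the padding bytes
 * are zero in both buffers, the loop needs no special tail handling. */
static int name_cmp_words(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    size_t i, plen = padded_len(len);

    for (i = 0; i < plen; i += sizeof(unsigned long)) {
        unsigned long x, y;

        /* memcpy keeps the loads free of alignment and aliasing
         * assumptions; compilers typically emit a single load. */
        memcpy(&x, a + i, sizeof(x));
        memcpy(&y, b + i, sizeof(y));
        if (x != y)
            return 1;
    }
    return 0;
}
```

Both helpers return 0 for equal names, so either could sit behind the same call site. Which one is faster on a given CPU is exactly the open question above, and this sketch makes no claim about it.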