Re: [PATCH] fast-import: use hashcmp() for SHA1 hash comparison

Am 19.07.2014 01:57, schrieb Jeff King:
> On Fri, Jul 18, 2014 at 09:14:05PM +0200, René Scharfe wrote:
> 
>> Whether inlining is really better is another matter; I don't understand how
>> 1a812f3a (hashcmp(): inline memcmp() by hand to optimize) could have made
>> git gc 18% faster, as it claimed.  I would expect memcmp(), which can
>> compare more than a byte at a time, to be significantly faster -- or at
>> least just as fast as whatever the compiler does with the inlined version.
> 
> I looked into this a while ago[1]. I think with glibc 2.13 and up, the
> memcmp is a win. We should consider switching back if that is what is
> common now.
> 
> -Peff
> 
> [1] http://article.gmane.org/gmane.comp.version-control.git/218396

On Linux this seems to improve rev-list performance by ca. 10% for me, but
on FreeBSD it slows it down by ca. 25%.  That's less surprising after
looking at their memcmp() implementation:

http://svnweb.freebsd.org/base/release/10.0.0/lib/libc/string/memcmp.c?revision=260789&view=co

It's basically our one-byte-at-a-time inlined version, sans inlining.
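For reference, the inlined version currently in master is essentially this
(the same loop that the patch at the end of this mail removes):

  static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
  {
  	int i;

  	/* compare the 20-byte hashes one byte at a time */
  	for (i = 0; i < 20; i++, sha1++, sha2++) {
  		if (*sha1 != *sha2)
  			return *sha1 - *sha2;
  	}

  	return 0;
  }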

I'd say if a platform doesn't bother optimizing memcmp() then they
deserve the resulting performance.  And it's probably not too bad a
penalty because such comparisons probably won't make up a significant
part of most applications.

But perhaps we can do better.  First, here are the numbers I got for
"/usr/bin/time git rev-list --all --objects v2.0.0 >/dev/null" (best
of five runs):

Debian testing amd64 with gcc 4.9.0 and libc 2.19 on Hyper-V on Intel i5-4670:

  master (398dd4bd)
  2.14user 0.02system 0:02.16elapsed 99%CPU (0avgtext+0avgdata 71004maxresident)k
  0inputs+0outputs (0major+21985minor)pagefaults 0swaps

  using memcmp
  1.92user 0.02system 0:01.95elapsed 99%CPU (0avgtext+0avgdata 71032maxresident)k
  0inputs+0outputs (0major+21987minor)pagefaults 0swaps

  four bytes at a time
  1.93user 0.03system 0:01.97elapsed 99%CPU (0avgtext+0avgdata 71240maxresident)k
  0inputs+0outputs (0major+22024minor)pagefaults 0swaps

Debian testing amd64 with clang 3.3 and libc 2.19 on Hyper-V on Intel i5-4670:

  master (398dd4bd)
  2.12user 0.04system 0:02.17elapsed 99%CPU (0avgtext+0avgdata 70976maxresident)k
  0inputs+0outputs (0major+21995minor)pagefaults 0swaps

  using memcmp
  1.94user 0.01system 0:01.95elapsed 100%CPU (0avgtext+0avgdata 71012maxresident)k
  0inputs+0outputs (0major+21952minor)pagefaults 0swaps

  four bytes at a time
  1.86user 0.01system 0:01.88elapsed 99%CPU (0avgtext+0avgdata 70788maxresident)k
  0inputs+0outputs (0major+21909minor)pagefaults 0swaps

FreeBSD 10 amd64 with clang 3.3 on Hyper-V on Intel i5-4670:

  master (398dd4bd)
        1.83 real         1.80 user         0.03 sys

  using memcmp
        2.32 real         2.32 user         0.00 sys

  four bytes at a time
        1.56 real         1.53 user         0.03 sys

Performance measurements in virtual machines are not very accurate, so
take them with a bag of salt, but the numbers were relatively stable
across runs.

How come rev-list runs so much faster on FreeBSD?  Would it make sense
to apply the four-bytes-at-a-time patch?  How would it perform on other
platforms, especially ARM?  Can we do even better?  And more
importantly: Does it matter in the first place?

The memcmp version just replaces the body of hashcmp() with
"return memcmp(sha1, sha2, 20);".
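Spelled out, that variant is simply:

  static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
  {
  	/* let memcmp() compare all 20 bytes at once */
  	return memcmp(sha1, sha2, 20);
  }

Here's the four-bytes-at-a-time patch: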
---
 cache.h | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 92fc9f1..c6cd28e 100644
--- a/cache.h
+++ b/cache.h
@@ -732,13 +732,18 @@ extern const unsigned char null_sha1[20];
 
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
+	const uint32_t *p1 = (const uint32_t *)sha1;
+	const uint32_t *p2 = (const uint32_t *)sha2;
 	int i;
 
-	for (i = 0; i < 20; i++, sha1++, sha2++) {
-		if (*sha1 != *sha2)
-			return *sha1 - *sha2;
+	for (i = 0; i < 20 / sizeof(uint32_t); i++, p1++, p2++) {
+		uint32_t v1 = htonl(*p1);
+		uint32_t v2 = htonl(*p2);
+		if (v1 < v2)
+			return -1;
+		if (v1 > v2)
+			return 1;
 	}
-
 	return 0;
 }
 
-- 
2.0.2
