Re: Review of current github code [Re: [GSoC] Designing a faster index format - Progress report week 6]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio C Hamano <gitster@xxxxxxxxx> writes:

> Thomas Rast <trast@xxxxxxxxxxxxxxx> writes:
>
>>   Test                      this tree      
>>   -----------------------------------------
>>   0002.1: v[23]: ls-files   0.13(0.11+0.02)
>>   0002.4: v4: ls-files      0.11(0.08+0.02)
>>   0002.5: v5: ls-files      0.09(0.06+0.02)
>>
>> I made up a hacky perf script on the spot, it's pasted at the far end of
>> this email.  It would most likely still be slower than v4 if we didn't
>> switch away from SHA1, though -- we haven't really spent much time
>> looking into the speed, except for one particular avoidance of name
>> copies that translated into a roughly 30% speedup.
>
> Do you mean by "switch away from SHA-1" that your suspicion is a
> large part of the speed-up may be coming from the fact that the
> index file as a whole is no longer hashed?

Yes.  Since the v5 index is only slightly smaller than v4 one, the
reduction in data read cannot explain the difference alone.

I tried to quantify this a little.  For SHA1 and the v2/v4 index
(25MB/14MB, resp.), I get about 70ms/44ms for

  time git hash-object --stdin <.git/index

On the other hand I get about 35ms/22ms for

  time ~/g/test-crc32 .git/index

I do have a system crc32 utility, but it uses read() in 8k blocks
instead of mmap() and takes about 87ms.

So we can see that the switch from 25MB to 14MB fully explains the
speedup for v2->v4, and the switch from SHA1 to CRC32 explains the
speedup for v4->v5.

However, aside from gaining 20ms here, CRC32 is also suitable for
checking very short chunks of data, as is planned for the partial
loading support in v5.

> As long as the new format allows us to notice corruption in the file
> to a similar degree of confidence by some other means, I personally
> do not see it as a regression in safety.
> 
> We however eventually would need to hook the logic to check for
> index corruption into fsck.  Actually adding such a code to fsck can
> and probably should remain outside the GSoC project, but please make
> sure you have necessary checksums in the format to allow us to do so
> in the future.

I actually expect that a full loading of the index will verify all
checksums that are present in the file.  Since file additions and such
will still need a full rewrite, and thus a full read, I expect this to
happen every so often as a matter of normal operations.  fsck could of
course still learn to load the index at some point, for good measure.


diff --git i/Makefile w/Makefile
index 63eacda..76856bc 100644
--- i/Makefile
+++ w/Makefile
@@ -481,6 +481,7 @@ X =
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
 TEST_PROGRAMS_NEED_X += test-chmtime
+TEST_PROGRAMS_NEED_X += test-crc32
 TEST_PROGRAMS_NEED_X += test-credential
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-date
diff --git i/test-crc32.c w/test-crc32.c
index e69de29..092de48 100644
--- i/test-crc32.c
+++ w/test-crc32.c
@@ -0,0 +1,32 @@
+#include "git-compat-util.h"
+#include <zlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+int main (int argc, char *argv[])
+
+{
+	unsigned int crc;
+	struct stat st;
+	int fd;
+	void *map;
+
+	if (argc != 2)
+		die("usage: %s <file>\n", argv[0]);
+	fd = open(argv[1], O_RDONLY);
+	if (fd < 0)
+		die_errno("open");
+	if (fstat(fd, &st) < 0)
+		die_errno("fstat");
+	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (map == MAP_FAILED)
+		die_errno("mmap");
+
+	crc = crc32(0, map, st.st_size);
+	printf("%8x\n", crc);
+
+	return 0;
+}


-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]