On 2/14/2019 5:14 AM, Duy Nguyen wrote:
On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
Take a look at stat data: st_dev, st_uid, st_gid and st_mode are the same
most of the time. ctime is often the same (or differs only slightly), and
sometimes mtime is the same as well. st_ino is also always zero on Windows.
We're storing a lot of duplicate values. Index v5 handles this.
This looks really promising.
I was going to reply to Junio, but it turns out I underestimated the
"varint" encoding overhead and it increases read time too much. I might
get back to it and try some optimization when I'm bored, but until then
this is yet another failed experiment.
As a result of this, v5 reduces the index file size by 30% (git.git) to
36% (webkit.git) compared to v4. Compared to v2, the webkit.git index file
size is reduced by 63%! An 8.4MB index file is _almost_ acceptable.
Just for kicks, I tried this out on a couple of repos I have handy.
files  version   index size   %savings
200k   2          25,033,758     0.00%
       3          25,033,758     0.00%
       4          15,269,923    39.00%
       5           9,759,844    61.01%
3m     2         446,123,848     0.00%
       3         446,123,848     0.00%
       4         249,631,640    44.04%
       5          82,147,981    81.59%
The 81% savings is very impressive. I didn't measure performance, but not
writing out an extra 167MB to disk has to help.
I'm definitely also interested in your 'sparse index' format ideas, as in
our 3M-file repos there are typically only a few thousand entries that
don't have the skip-worktree bit set. I'm not sure if that is the same
'sparse' you had in mind, but it would sure be nice!
I've also contemplated multi-threading the index write code path. My
thought was to have the primary thread allocate a buffer and, when it is
full, have a background thread compute the SHA and write it to disk while
the primary thread fills the next buffer.
I'm not sure how much it will buy us, as I don't know the relative cost of
computing the SHA and writing to disk vs. filling the buffer. I've
suspected the buffer-filling thread would end up blocked on the background
thread most of the time, which is why I haven't tried it yet.
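A minimal sketch of that double-buffer idea (hypothetical helper names,
not git code; hash_update stands in for updating the running index
checksum, write_or_die for the actual write): the primary thread fills one
buffer while a worker flushes the other, and if flushing is faster than
filling, the primary thread just waits in pthread_join, which is exactly
the concern above.

	#include <pthread.h>
	#include <string.h>
	#include <stddef.h>

	#define BUF_SIZE (8 * 1024 * 1024)

	struct chunk {
		unsigned char data[BUF_SIZE];
		size_t len;
		int fd;                 /* destination index file */
	};

	/* Placeholders for the real work: update the running SHA, then write. */
	extern void hash_update(const void *data, size_t len);
	extern void write_or_die(int fd, const void *data, size_t len);

	static void *flush_chunk(void *arg)
	{
		struct chunk *c = arg;

		hash_update(c->data, c->len);
		write_or_die(c->fd, c->data, c->len);
		c->len = 0;
		return NULL;
	}

	/*
	 * Called repeatedly by the primary thread as it serializes entries;
	 * assumes entry_len <= BUF_SIZE. The final partial buffer needs one
	 * last flush and join at end-of-write, omitted here.
	 */
	void pipeline_write(int fd, const void *entry, size_t entry_len)
	{
		static struct chunk bufs[2];
		static int cur;
		static pthread_t worker;
		static int worker_running;
		struct chunk *c = &bufs[cur];

		if (c->len + entry_len > BUF_SIZE) {
			c->fd = fd;
			/* wait for the previous flush so its buffer is reusable */
			if (worker_running)
				pthread_join(worker, NULL);
			pthread_create(&worker, NULL, flush_chunk, c);
			worker_running = 1;
			cur ^= 1;                 /* switch to the free buffer */
			c = &bufs[cur];
		}
		memcpy(c->data + c->len, entry, entry_len);
		c->len += entry_len;
	}

Because only one worker runs at a time and each flush is joined before the
next starts, the hash is still updated in order; the parallelism is limited
to overlapping one flush with filling the next buffer.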
Of course we trade off storage against CPU. We now need to spend more
cycles writing, or even reading (but still plenty fast compared to zlib).
For reading, I'm counting on multi-threading to hide all of this even if
it becomes significant.
This would be a bigger change, but have we/you ever done a POC
experiment to see how much of this time is eaten up by zlib that
wouldn't be eaten up with some of the newer "fast but good enough"
compression algorithms, e.g. Snappy and Zstandard?
I'm quite sure I tried zlib at some point; the only lasting impression I
have is "not good enough". Other algorithms might improve things a bit,
perhaps on the uncompress/read side, but I find it unlikely we could
reasonably compress a hundred megabytes or so in a few dozen milliseconds
(a quick google says Snappy compresses at about 250MB/s, so roughly 400ms
per 100MB: too long). Splitting the files and compressing in parallel
might help, but I will probably focus on the "sparse index" approach
before going in that direction.
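For what it's worth, the kind of POC throughput check being discussed is
only a few lines against zlib itself; the sketch below times compress2()
on a ~100MB in-memory buffer and prints MB/s (no particular numbers are
claimed here, and the synthetic input is only a rough stand-in for index
data). Swapping in another library gives the comparison point.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <zlib.h>

	int main(void)
	{
		const size_t size = 100 * 1024 * 1024;  /* ~100MB, as in the estimate above */
		unsigned char *src = malloc(size);
		uLongf dst_len = compressBound(size);
		unsigned char *dst = malloc(dst_len);
		struct timespec t0, t1;

		if (!src || !dst)
			return 1;
		/* mildly repetitive input, loosely resembling stat data */
		for (size_t i = 0; i < size; i++)
			src[i] = (unsigned char)(i & 0xff);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (compress2(dst, &dst_len, src, size, 1) != Z_OK)
			return 1;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double secs = (t1.tv_sec - t0.tv_sec) +
			      (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("compressed %zu -> %lu bytes in %.3fs (%.1f MB/s)\n",
		       size, (unsigned long)dst_len, secs, size / 1e6 / secs);
		return 0;
	}

Build with something like "cc -O2 bench.c -lz" and run it on the machine
in question; the same harness works for comparing per-thread throughput if
the splitting/parallel-compression route is ever tried.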