On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
> > Take a look at stat data: st_dev, st_uid, st_gid and st_mode are the
> > same most of the time. ctime should often be the same (or differ
> > just slightly). And sometimes mtime is the same as well. st_ino is
> > also always zero on Windows. We're storing a lot of duplicate values.
> >
> > Index v5 handles this
>
> This looks really promising.

I was going to reply to Junio. But it turns out I underestimated the
"varint" encoding overhead, and it increases read time too much (see
the PS below for a sketch of the encoding). I might get back to this
and try some optimization when I'm bored, but until then this is yet
another failed experiment.

> > As a result of this, v5 reduces file size by 30% (git.git) to
> > 36% (webkit.git) compared to v4. Compared to v2, webkit.git's index
> > file size is reduced by 63%! An 8.4MB index file is _almost_
> > acceptable.
> >
> > Of course we trade storage for CPU. We now need to spend more
> > cycles writing and even reading (but still plenty fast compared to
> > zlib). For reading, I'm counting on multi-threading to hide all
> > this even if it becomes significant.
>
> This would be a bigger change, but have we/you ever done a POC
> experiment to see how much of this time is eaten up by zlib that
> wouldn't be eaten up with some of the newer "fast but good enough"
> compression algorithms, e.g. Snappy and Zstandard?

I'm quite sure I tried zlib at some point; the only lasting impression
I have is "not good enough". Other algorithms might improve things a
bit, perhaps on the uncompress/read side, but I find it unlikely we
could reasonably compress something like a hundred megabytes in a few
dozen milliseconds (a quick google says Snappy compresses at about
250MB/s, so roughly 400ms per 100MB, which is too long). Splitting the
file and compressing the chunks in parallel might help. But I will
probably focus on the "sparse index" approach before going in that
direction.
--
Duy
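
PS: a minimal sketch of the kind of encoding involved, assuming the
experiment used the same chunked base-128 scheme as Git's existing
varint.c (the one index v4 already uses for prefix-compressed path
names); overflow checks omitted, and the v5 code may differ in detail:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Chunked base-128 encoding along the lines of Git's varint.c,
 * assumed here to be what the v5 experiment used. Small values take
 * one byte instead of a fixed four or eight, which is where the
 * on-disk savings come from. Overflow checks omitted for brevity.
 */
static int encode_varint(uintmax_t value, unsigned char *buf)
{
	unsigned char varint[16];
	unsigned pos = sizeof(varint) - 1;

	varint[pos] = value & 127;
	while (value >>= 7)
		varint[--pos] = 128 | (--value & 127);
	memcpy(buf, varint + pos, sizeof(varint) - pos);
	return sizeof(varint) - pos;
}

/*
 * The read-time cost is here: a byte-at-a-time loop with a
 * data-dependent branch per byte, instead of a single fixed-width
 * load. Multiply by several stat fields per entry and millions of
 * index entries and it shows up in the numbers.
 */
static uintmax_t decode_varint(const unsigned char **bufp)
{
	const unsigned char *buf = *bufp;
	unsigned char c = *buf++;
	uintmax_t val = c & 127;

	while (c & 128) {
		c = *buf++;
		val = ((val + 1) << 7) + (c & 127);
	}
	*bufp = buf;
	return val;
}

int main(void)
{
	unsigned char buf[16];
	const unsigned char *p = buf;
	int len = encode_varint(300, buf);

	/* 300 fits in two bytes; decodes back to the same value */
	printf("len=%d decoded=%ju\n", len, decode_varint(&p));
	return 0;
}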