Store refreshed stat info in a separate file?

With git status, writing the refreshed index takes 252ms out of 1s
total, 361ms out of 1.4s, and 86ms out of 360ms on gentoo-x86, webkit
and linux-2.6 respectively (*). That is a significant share of the
"git status" run time. And this happens whenever you touch a single
tracked file, then do "git status". We tried to solve this with index
v5, but it's been years(?) since its start as a GSoC project. So I'm
thinking of another way around..

The major cost of writing an index is the SHA-1 hashing. The bigger
the written part is, the higher the cost we pay. So what if we write
stat-only data to a separate file? Think of it as an index extension,
only it stays outside the index. On webkit with 182k files, the stat
data size would be about 6MB (its index v4 is 15MB for comparison). But
with stat-only data we could employ some cheap but effective
compression: sd_dev, sd_uid and sd_gid are likely the same for every
entry. And we could store the stat data of updated entries only. So
I'm hoping to get that 6MB down to a few hundred KB. That makes
hashing lightning fast.
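
To make the size argument concrete, here is a minimal sketch of the
"updated entries only, shared fields stored once" packing. Everything
about the layout is an assumption for illustration: the record format,
the function names pack_updated/unpack_updated, and the simplification
of timestamps to whole seconds (git's stat data also keeps
nanoseconds). It is not an existing or proposed git format.

```python
import struct

# Fields assumed to be identical for (almost) every entry.
SHARED = ("dev", "uid", "gid")
HEADER = ">IIII"    # dev, uid, gid, record count
RECORD = ">IIIII"   # entry position, ctime, mtime, ino, size
MASK = 0xFFFFFFFF   # this sketch keeps only the low 32 bits of each field


def pack_updated(stats, updated):
    """Pack stat data for the updated entry positions only.

    stats   -- list of dicts, one per index entry
    updated -- positions of the entries whose stat data changed
    """
    if not updated:
        return struct.pack(HEADER, 0, 0, 0, 0)
    first = stats[updated[0]]
    # dev/uid/gid are written once in the header, not per record.
    out = [struct.pack(HEADER, first["dev"], first["uid"], first["gid"],
                       len(updated))]
    for pos in updated:
        s = stats[pos]
        if any(s[k] != first[k] for k in SHARED):
            # A real format would need an escape for such entries.
            raise ValueError("entry %d does not share dev/uid/gid" % pos)
        out.append(struct.pack(RECORD, pos, s["ctime"] & MASK,
                               s["mtime"] & MASK, s["ino"] & MASK,
                               s["size"] & MASK))
    return b"".join(out)


def unpack_updated(blob):
    """Return {entry position: stat dict} from a pack_updated() blob."""
    dev, uid, gid, count = struct.unpack_from(HEADER, blob, 0)
    offset = struct.calcsize(HEADER)
    result = {}
    for _ in range(count):
        pos, ctime, mtime, ino, size = struct.unpack_from(RECORD, blob, offset)
        offset += struct.calcsize(RECORD)
        result[pos] = {"ctime": ctime, "mtime": mtime, "dev": dev,
                       "ino": ino, "uid": uid, "gid": gid, "size": size}
    return result
```

With this layout each updated entry costs 20 bytes plus a 16-byte
header, so the amount of data to hash scales with the number of touched
files rather than with the size of the work tree.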

So the idea is, when we do refresh, we note which entries have their
stat data updated. Then we write $GIT_DIR/index.stat (and leave
$GIT_DIR/index alone), which is a valid index except that it has zero
entries and only one (new) extension storing (maybe compressed) stat
data of the updated entries. The extension also contains the trailing
SHA-1 of $GIT_DIR/index for verification later. When we read
$GIT_DIR/index, we check for the existence of index.stat. If it exists
and its attached SHA-1 matches, we overwrite some stat data with the
info from index.stat.
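
The write/verify flow above could be sketched roughly like this. The
"DIRC" header, the extension framing (4-byte signature, 4-byte size)
and the trailing SHA-1 follow the existing index file layout; the
extension signature "STAT", the function names, and the choice to put
the index SHA-1 first in the extension data are assumptions made up for
this sketch.

```python
import hashlib
import os
import struct

EXT_SIG = b"STAT"   # hypothetical extension signature, not a real git one
SHA1_LEN = 20


def trailing_sha1(path):
    """Return the last 20 bytes of a file (the index trailer checksum)."""
    with open(path, "rb") as f:
        return f.read()[-SHA1_LEN:]


def write_index_stat(index_path, stat_path, payload):
    """Write a zero-entry index holding one extension with the stat payload."""
    ext_data = trailing_sha1(index_path) + payload
    body = struct.pack(">4sII", b"DIRC", 2, 0)          # version 2, 0 entries
    body += EXT_SIG + struct.pack(">I", len(ext_data)) + ext_data
    with open(stat_path, "wb") as f:
        f.write(body + hashlib.sha1(body).digest())     # only this is hashed


def read_index_stat(index_path, stat_path):
    """Return the stat payload, or None if index.stat is missing or stale."""
    if not os.path.exists(stat_path):
        return None
    with open(stat_path, "rb") as f:
        data = f.read()
    body, checksum = data[:-SHA1_LEN], data[-SHA1_LEN:]
    if len(body) < 20 or hashlib.sha1(body).digest() != checksum:
        return None
    sig, _version, nr_entries = struct.unpack_from(">4sII", body, 0)
    if sig != b"DIRC" or nr_entries != 0 or body[12:16] != EXT_SIG:
        return None
    (size,) = struct.unpack_from(">I", body, 16)
    ext_data = body[20:20 + size]
    if size < SHA1_LEN or len(ext_data) != size:
        return None
    # index.stat only applies to the exact index it was written against.
    if ext_data[:SHA1_LEN] != trailing_sha1(index_path):
        return None
    return ext_data[SHA1_LEN:]
```

A None result means the reader just uses the stat data already in
$GIT_DIR/index, so a stale or missing index.stat costs nothing but the
lost speedup.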

Back to the original question, I'm hoping to reduce the significant
numbers above to less than 10ms with this. So far I see all good
points but no bad ones. Time to ask git@vger to give some. I'm actually
trying this idea in my untracked cache because I can't afford to lose
50% of the gain from untracked cache, just because I have to save some
bits in the giant $GIT_DIR/index and take the cost of rehashing.

(*) this is with the "untracked cache" enabled and total time is about
40% less than upstream "git status". The numbers against upstream "git
status" are actually less significant. But I have to think positive
that one day untracked cache may be merged :)
-- 
Duy