Michael Haggerty wrote 2012-05-04 09.12:
On 05/03/2012 08:16 PM, Thomas Rast wrote:
Thomas Gummerer <t.gummerer@xxxxxxxxx> writes:
32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
[...]
[7] Since all stat data (except mtime and ctime) is just used for
checking if a file has changed, a checksum of the data is enough.
In addition to that Thomas Rast suggested ctime could be ditched
completely (core.trustctime=false) and thus included in the
checksum. This would save 24 bytes per index entry, which would
be about 4 MB on the Webkit index.
(Thanks for the suggestion to Michael Haggerty)
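For reference, the 24-byte figure follows if each of the seven listed fields is stored as a 32-bit word and all of them are replaced by a single 32-bit CRC. The sketch below is only an illustration of that arithmetic: the field order, the big-endian packing and the name stat_crc are assumptions made here, not part of the proposal.

```python
import struct
import zlib

# Assumed field order, taken from the list in the proposal. Each value is
# truncated to 32 bits, matching 32-bit stat fields in an index entry.
FIELDS = ("ctime_sec", "ctime_nsec", "ino", "size", "dev", "uid", "gid")


def stat_crc(values):
    """Hypothetical helper: CRC-32 over the seven fields as 32-bit words."""
    words = [values[name] & 0xFFFFFFFF for name in FIELDS]
    packed = struct.pack(">7I", *words)
    return zlib.crc32(packed) & 0xFFFFFFFF


bytes_before = 4 * len(FIELDS)                # seven 32-bit fields = 28 bytes
bytes_after = 4                               # one 32-bit checksum
saved_per_entry = bytes_before - bytes_after  # 24 bytes per entry

# Made-up stat values, purely for illustration.
example = {"ctime_sec": 1336115520, "ctime_nsec": 0, "ino": 1234,
           "size": 4096, "dev": 2049, "uid": 1000, "gid": 1000}
checksum = stat_crc(example)
```

Only equality of the stored and freshly computed checksum would be tested; the individual values are not recoverable from it.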
This is the part I'm most curious about. Are we missing anything?
Michael brought it up on IRC: the stat() results are only used to test
whether they are still the same, with the exception of the mtime (which
also undergoes raciness checks).
As far as I can see, none of st_{ino,dev,uid,gid} are useful for
anything. st_size might conceivably be used as a hint for a buffer
size, but nobody actually does that. The ctime undergoes stricter
checks, but AFAICS it's also all about whether it has changed, and
besides that can be turned off. We think all of those fields can be
replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
should be plenty, probably even if we just xor the values together.
XOR is definitely *not* adequate; for example, changing uid=gid="you" to uid=gid="me"
would not affect the XOR of the values (assuming, as is often the case, that each user
has his own uid/gid with the same numerical values).
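The uid/gid case can be checked with a small sketch. The values, the field subset and the helper names xor_fold and crc_fold are made up for illustration; the only point is that equal uid and gid cancel under XOR, while CRC-32 over the packed fields does not cancel them.

```python
import struct
import zlib
from functools import reduce
from operator import xor


def xor_fold(fields):
    """XOR all 32-bit field values together (the scheme being questioned)."""
    return reduce(xor, fields, 0) & 0xFFFFFFFF


def crc_fold(fields):
    """CRC-32 over the fields packed as big-endian 32-bit words."""
    packed = struct.pack(">%dI" % len(fields), *fields)
    return zlib.crc32(packed) & 0xFFFFFFFF


# (ino, size, dev, uid, gid); made-up values where uid == gid for each user.
you = (1234, 4096, 2049, 1000, 1000)
me = (1234, 4096, 2049, 1001, 1001)

xor_same = xor_fold(you) == xor_fold(me)  # uid ^ gid is 0 for both owners
crc_same = crc_fold(you) == crc_fold(me)
```

Here xor_same is true because the ownership change is invisible to XOR, while crc_same is false.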
If you change uid/gid, that has no relevance for the content that git tracks. If the CRC
is equal you have to check the content. Ideally a change that does not change the content
should not change the CRC either, so there is really no absolute need to see that change.
I assume the idea is that if you do "tar xvf" or something like that, then changes in file,
mtime etc could be picked up by looking at these attributes, but it seems that those that
mess with mtime such that it goes back in time are out of luck with git anyway.
Which hash to use depends on some estimate of the likelihood that the hashes collide and
simultaneously that the other metadata coincide. It seems to me that CRC-32 would
be adequate. But if not, a longer hash could be used (albeit with less space savings).
Michael
JGit simply ignores ctime, ino, dev, uid and gid. The real reason is of course that
standard Java does not have an API for these extra attributes. On the other hand
nobody is going to fix this bug. The reason is that if you follow the rule that mtime
must always change to "now" if the content changes, then all changes will be found simply
by looking at mtime or performing a content check for the racy case. Those that mess
with mtime tend to be unhappy anyway.
Then there is the issue of how often we can detect changes without checking content. Ino
usually changes, but when it changes mtime usually does too, so how often does checking
it actually speed things up?
Has anyone instrumented git to see how much the different attributes actually
contribute to performance and accuracy?
I'd like to extend the size field to 64 bits. We rarely need the extra bits, but we
cannot distinguish between 3 bytes and 4294967299 bytes, so avoiding the very expensive
content check there would be welcome, even if it's a rare event. I haven't thought
too much about this though. I just felt uncomfortable when looking at the code and
knowing that performing a content check of a 4 GB file could take a minute or two.
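To spell out the size example: if only the low 32 bits of the file size are kept, sizes that differ by a multiple of 2^32 become indistinguishable. A minimal sketch, where stored_size is a hypothetical stand-in for that truncation:

```python
MASK32 = 0xFFFFFFFF


def stored_size(real_size):
    """Hypothetical stand-in: keep only the low 32 bits of a file size."""
    return real_size & MASK32


small = 3
huge = 4294967299  # 2**32 + 3

sizes_look_equal = stored_size(small) == stored_size(huge)
```

With a 64-bit field the two sizes would compare as different and the content check could be skipped.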
-- robin