Re: [GSoC] Designing a faster index format - Progress report week 13

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Tue, 24 Jul 2012 13:54:02 +0200

Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> writes:

> Junio C Hamano skrev 2012-07-22 23.08:
>> Thomas Rast <trast@xxxxxxxxxxxxxxx> writes:
>>
>>> What is the status quo?  I take it JGit does not have any of ctime, dev,
>>> ino etc., and either leaves the existing value or puts a 0....
>>> an argument in favor of splitting stat_crc into its fields again?
>>
>> A difference is that JGit already has such code, and we would be
>> adding a burden to do so yet again.  It also may not just be JGit,
>> but anything that wants to be "compatible" with systems whose
>> filesystem interface does not give enough data by omitting fields
>> the current index pays attention to.
>>
>> It isn't really a discussion about splitting again, but more about
>> not squishing them into a new field in the first place---IIUC, even
>> outside Windows, ctime is already problematic on some systems where
>> background processes muck with extended attributes Git does not pay
>> attention to. If the patch makes us lose the ability to selectively
>> ignore changes to certain fields (e.g. changes to dev and ino are
>> noticed but ctime are ignored) by squishing them into one new field,
>> wouldn't removing them without adding such a useless field be a
>> simpler way to go?
> 
> I wasn't thinking of splitting, but now that I read it again, I do
> think it should be split.

Aren't you two going off in different directions?  I read Junio as
implying that if size/ctime/dev/ino are a pain to deal with, we should
just drop them altogether.  You seem to be saying the opposite:

> Having size accessible is a good thing, and even
> better if it is a 64-bit value so we don't have the modulo-4G problem
> when looking at it. Current size is 4G + 33 bytes, index says 33. Did
> the file change or not?
>
> Having access to size makes the need for actually
> invoking the racy git logic and comparing file content less likely.

I don't think this is true.  Racy git logic is needed every time that
the file *looks* unchanged, but isn't.  In the case where the file is
certified (by mtime) unchanged, we don't go checking.  But in the case
where it *looks* changed, we still have to go and read it to know if,
perhaps, the only thing the user did was hit "save" again.
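
To make the three cases concrete, here is a rough sketch in Python.  It
is not how git implements this; the names Entry and classify, the
32-bit size mask, and the simplified rule "mtime >= index timestamp
means racy" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: the index stores the size in a 32-bit field, as in the
# "4G + 33 bytes" example quoted above.
SIZE_MASK = 0xFFFFFFFF


@dataclass
class Entry:
    """Hypothetical, stripped-down index entry."""
    mtime: int  # mtime recorded when the entry was written
    size: int   # size recorded in the index, truncated to 32 bits


def classify(entry, file_mtime, file_size, index_mtime):
    """Return 'clean', 'racy' or 'looks-changed' for one file."""
    same_stat = (entry.mtime == file_mtime
                 and entry.size == (file_size & SIZE_MASK))
    if not same_stat:
        # Looks changed: the content still has to be read, because the
        # user may only have hit "save" again.
        return "looks-changed"
    if file_mtime >= index_mtime:
        # Looks unchanged, but the file could have been modified after
        # the entry was recorded: the racy case, content must be compared.
        return "racy"
    # Certified unchanged by mtime: no content check.
    return "clean"
```

In this sketch a wider size field only affects the same_stat test.  The
"racy" and "looks-changed" branches both still end in reading the file.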

Not to mention that this really hurts in terms of index size.  Our
benchmark for index-v5 is the WebKit project, which stands at 180k
files.  So every 6 bytes per entry adds about 1 MB to the final size,
which needs to be loaded, hashed (or crc'd), then hashed/crc'd again
and written.
Junio's index-v4 was a speed boost mainly because it cuts down on the
size of the index.  Do we want to throw that out?
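
For reference, the back-of-the-envelope arithmetic behind the "about
1 MB" figure, as a tiny Python sketch.  The function name is made up;
the 180k entry count and the 6 bytes per entry are the numbers from the
paragraph above.

```python
def extra_index_bytes(num_entries, bytes_per_entry):
    """Bytes added to the index when every entry grows by bytes_per_entry."""
    return num_entries * bytes_per_entry


WEBKIT_ENTRIES = 180_000  # approximate file count quoted above

print(extra_index_bytes(WEBKIT_ENTRIES, 6))  # 1080000 bytes, roughly 1 MB
```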

-- 
Thomas Rast
trast@{inf,student}.ethz.ch