Re: [PATCH 3/6] Stop producing index version 2

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 6 Feb 2012 19:09:15 -0800

2012/2/5 Junio C Hamano <gitster@xxxxxxxxx>:
> Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:
>
>> read-cache.c learned to produce version 2 or 3 depending on whether
>> extended cache entries exist in 06aaaa0 (Extend index to save more flags
>> - 2008-10-01), first released in 1.6.1. The purpose is to keep
>> compatibility with older git. It's been more than three years since
>> then and git has reached 1.7.9. Drop support for older git.
>
> Cc'ing this, as I suspect this would surely raise eyebrows of some people
> who wanted to get rid of the version 3 format.

Version 3 was a mistake because of the variable length record sizes.
Saving 2 bytes on some records that don't use the extended flags makes
the index file *MUCH* harder to parse. So much so that we should take
version 3 and kill it, not encourage it as the default!

IMHO, when these extended flags were added to make version 3 the
following should have happened:

- All records use the larger structure format with 4 bytes for the
flags, not 2 bytes.

- Change the trailing padding after the name to be a *SINGLE* \0 byte,
and do not pad out to an 8 byte boundary.

Both make it really hard to process the file, and the latter happens
only for direct mmap usage, which we don't do anymore.

We also have to consider the EGit and JGit user base as part of the
ecosystem. We can't just kill a file format because git-core has been
capable of reading its alternative since some arbitrary YYYY-MM-DD
release date. We need to also consider when did some other major tools
catch up and also support this format?

FWIW JGit released index version 3 support in version 0.9.1, which
shipped Sep 15, 2010. JGit/EGit were more than 2 years behind here.

<thinking type="wishful" probability="never-happen"
probably-inflating-flame-from="linus">

I have long wanted to scrap the current index format. I unfortunately
don't have the time to do it myself. But I suspect there may be a lot
of gains by making the index format match the canonical tree format
better by keeping the tree structure within a single file stream,
nesting entries below their parent directory, and keeping tree SHA-1
data along with the directory entry. For one thing the index would be
able to register an empty subdirectory, rather than ignoring them. It
would also better line up with the filesystem's readdir() handling,
giving us more sane logic to compare what readdir() tells us exists
against what the index thinks should be in the same file. And the
overall index should be smaller, because we don't have to repeat the
same path/to/a/file/for/every/file/in/that/same/directory/tree.
Reconstructing the path strings at read time into a flat list should
be pretty trivial, and still keep the parallel lstat calls running off
a flat list working well for fast status operations.

</thinking>
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html