Re: [PATCH v3 01/20] sparse-index: design doc and format update

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 23 Mar 2021 07:16:33 -0400

On 3/19/2021 7:43 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>>
>> This begins a long effort to update the index format to allow sparse
>> directory entries. This should result in a significant improvement to
>> Git commands when HEAD contains millions of files, but the user has
>> selected many fewer files to keep in their sparse-checkout definition.
> 
> This compromise makes sense.
> 
> In the past, we often dreamed of recording trees in the index
> (instead of using a bolted on extension like cache-tree, treating
> trees as first-class citizens) and lazily expanding it only when the
> user starts modifying the paths within the subdirectory.
> 
> But such an optimization never materialized, as the dual and
> conflicting nature of the index to keep track of the contents for
> the "next" commit (for which it is sufficient to just record trees
> for parts that have not been modified) and to cache stat information
> to detect which working tree paths may possibly have modifications
> (for which, we used the one-entry-per-path nature of the cache
> entries so far) was never resolved.
> 
> But if we limit the use of trees-in-index for sparse/cone checkout
> case, we do not even have to worry about having to cache the stat
> information for those paths that we are not going to populate in the
> working tree at all.  It is a great simplification of the problem.

Thanks. I appreciate your input here.

>> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
>> +  the path ends in a directory separator.
>> +
> 
> Why leading two 0's?  At the tree object level, we do not 0-pad blob
> mode word, and if you are writing for C programmers, you need only
> one '0' prefix to signal that it is in octal (in the on-disk index
> file, the blob mode word is stored in a be16 word).

Fixed.
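
For readers following along, here is a minimal sketch of the three
properties the doc text lists for a sparse-directory entry. It is a
Python model, not Git's C code: the `Entry` class, the
`SKIP_WORKTREE` bit position, and `is_sparse_directory()` are all
hypothetical names chosen for illustration.

```python
import stat
from dataclasses import dataclass

# Hypothetical bit position; only the flag's existence comes from the doc.
SKIP_WORKTREE = 1 << 0


@dataclass
class Entry:
    path: str
    mode: int       # octal mode word, e.g. 0o40000 for a directory
    flags: int = 0


def is_sparse_directory(entry: Entry) -> bool:
    """True when the entry has all three properties named in the doc."""
    return (
        stat.S_ISDIR(entry.mode)                  # directory mode
        and bool(entry.flags & SKIP_WORKTREE)     # SKIP_WORKTREE bit set
        and entry.path.endswith("/")              # trailing separator
    )
```

Note that `0o040000` and `0o40000` are the same value, which is why
the extra leading zero carries no information.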

>> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
>> new file mode 100644
>> index 000000000000..aa116406a016
>> --- /dev/null
>> +++ b/Documentation/technical/sparse-index.txt
>> @@ -0,0 +1,173 @@
>> +Git Sparse-Index Design Document
>> +================================
>> +
>> +The sparse-checkout feature allows users to focus a working directory on
>> +a subset of the files at HEAD. The cone mode patterns, enabled by
>> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
>> +discover which files at HEAD belong in the sparse-checkout cone.
>> +
>> +Three important scale dimensions for a Git worktree are:
> 
> s/worktree/working tree/; The former is the thing the "git worktree"
> command deals with.  The latter is relevant even when "git worktree"
> is not used (the traditional "git clone and you get a working tree
> to work in").

I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
directory" is more specific and hence better.

>> +These dimensions are also ordered by how expensive they are per item: it
>> +is more expensive to detect a modified file than it is to write one that we
>> +know must be populated; changing `HEAD` only really requires updating the
>> +index.
> 
> This is a bit too dense to grok.  Among Populated, there are some
> Modified but it takes lstat(2) per path or fsmonitor listening to
> inotify to know which ones are in the Modified set.  Is that the
> "expensive" you are referring to here?  I am not sure how you
> compared the cost to know if a path is modified or merely populated
> with the cost of "write one that we know must be populated" (which I
> take as "given a populated file, make modification to it"). 

I could rearrange things here. The important things to note are:

1. Updating index entries is very fast, but adds up at large scale.

2. It is faster to write a file to disk from Git's object database
   than it is to compare a file on disk to the copy in the database,
   which is frequently necessary when the mtime on disk doesn't match
   the mtime in the index.
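
To illustrate point 2, here is a rough sketch of why an mtime
mismatch is costly: the cheap stat comparison fails, so the whole
file has to be read and hashed. This is a simplified Python model
with hypothetical function names; the real check in Git looks at
more than mtime, and the sketch assumes SHA-1 object IDs.

```python
import hashlib
import os


def blob_id(data: bytes) -> str:
    """Object ID of a blob: SHA-1 over a 'blob <size>\\0' header plus content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def is_modified(path: str, cached_mtime_ns: int, cached_oid: str) -> bool:
    """Hypothetical check: trust a matching mtime, otherwise compare content."""
    st = os.stat(path)
    if st.st_mtime_ns == cached_mtime_ns:
        return False  # cheap path: one stat call, no file read
    # Expensive path: read the whole file and hash it.
    with open(path, "rb") as f:
        return blob_id(f.read()) != cached_oid
```

A file whose mtime changed but whose content did not still pays for
the full read and hash before it can be declared unmodified.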

> Also it
> is unclear what you mean by "changing HEAD only require updating the
> index".  Certainly when "git switch" flips HEAD from one commit to
> another, you'd update the index and update the files in the working
> tree (in the Populated part that is in the sparse-checkout cone) to
> match, no?

That was unclear of me. I was thinking more along the lines of "git reset"
(soft mode), which updates HEAD without changing the files on disk.

After all of this postulating, I think that the offending sentences
are better off deleted. They don't add clarity over what can be
inferred by an interested reader.

>> In addition, they expect to see all files at `HEAD`.
> 
> It is not clear to me what this means.  After "git add", "git
> ls-files" would expect to see a file that may not even be in HEAD.
> After "git rm", it would expect to see some file missing from the
> set of paths in HEAD.  While I do not think that is what you meant
> here, it is hard to guess what you wanted to say.

I'm mixing terms incorrectly. I think what I really mean is

  In fact, these loops expect to see a reference to every
  staged file.

>> One
>> +way to handle this is to parse trees to replace a sparse-directory entry
>> +with all of the files within that tree as the index is loaded. However,
>> +parsing trees is slower than parsing the index format, so that is a slower
>> +operation than if we left the index alone.
> 
> Besides, that would leave in-core index fully populated, so I would
> suspect that you'd lose a lot of benefit that comes from having to
> keep much fewer entries in the in-core index than what is in HEAD.
> It would be nice for "git diff-index --cached" (which is part of
> "git status") to be able to skip a single "tree" entry in the sparse
> index as "known to be untouched", rather than skipping thousands of paths
> in that single subdirectory (in a mega monorepo project) as "these
> are marked with SKIP_WORKTREE so ignore what is in the working tree".

Absolutely! I'm burying the lede here, so I should get to the real
point by adding this to the end:

 The plan is to make all of these integrations "sparse aware" so
 this expansion through tree parsing is unnecessary and they use
 fewer resources than when using a full index.

>> +Phase I: Format and initial speedups
>> +------------------------------------
>> +
>> +During this phase, Git learns to enable the sparse-index and safely parse
>> +one. Protections are put in place so that every consumer of the in-memory
>> +data structure can operate with its current assumption of every file at
>> +`HEAD`.
> 
> IOW, before they iterate over the in-core index, tree entries are expanded
> into a bunch of individual entries with SKIP_WORKTREE bit?  Makes sense.
> 
>> +At first, every index parse will expand the sparse-directory entries into
>> +the full list of paths at `HEAD`. This will be slower in all cases. The
>> +only noticeable change in behavior will be that the serialized index file
>> +contains sparse-directory entries.
> 
> Hmph, do you mean that the expansion is done by not replacing each
> "tree" entry with blob entries for the contents of the directory,
> but the original "tree" entry is still left in the in-core index?

What I meant by "serialized index file" is that the file written to disk
has the sparse directory entries, but the in-core copy will not (except
for a very brief moment in time, during do_read_index()).

The intention at this point in time is that all code behaves identically
to the full index case, except that the index file itself is smaller due
to these sparse directory entries.
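
As a rough sketch of that expand-on-read behavior: the on-disk list
holds one entry per sparse directory, and reading it replaces each
such entry with the files beneath it. This is a Python model under
stated assumptions, not the actual implementation: paths ending in
"/" stand for sparse-directory entries, and `trees` is a
hypothetical stand-in for the object database.

```python
def expand_dir(dirpath, trees):
    """Recursively list every file path under a directory entry."""
    files = []
    for name in trees[dirpath]:
        child = dirpath + name
        if child.endswith("/"):
            files.extend(expand_dir(child, trees))  # nested tree
        else:
            files.append(child)
    return files


def expand_entries(entries, trees):
    """Replace each sparse-directory entry with the files it contains."""
    full = []
    for path in entries:
        if path.endswith("/"):
            full.extend(expand_dir(path, trees))
        else:
            full.append(path)
    return full


# On-disk view: three entries. In-core view after reading: four entries.
on_disk = ["README", "docs/", "src/main.c"]
trees = {"docs/": ["api/", "intro.txt"], "docs/api/": ["index.txt"]}
in_core = expand_entries(on_disk, trees)
```

The file stays small while every in-core consumer still sees one
entry per file, which is the "behaves identically" property above.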

> It is not immediately clear what we are trying to gain by leaving it
> in, but let's read on.  Perhaps we can get rid of cache-tree
> extension and replace its use with these "tree" entries whose
> content paths are populated in the index?

This is an interesting idea, but not one I plan to pursue with this work.

>> +Next, consumers of the index will be guarded against operating on a
>> +sparse-index by inserting calls to `ensure_full_index()` or
>> +`expand_index_to_path()`. After these guards are in place, we can begin
>> +leaving sparse-directory entries in the in-memory index structure.
> 
> It is unclear why "we can begin leaving"; an iterator that only
> expects to see blobs would need to be updated to skip them, too, no?
> They would probably be already skipping blob entries that are marked
> with the SKIP_WORKTREE bit, so it may be just a matter of skipping
> more things than the current code.
> 
> Or did I misread the design presented earlier, and when a directory
> that is outside the cone is expanded into the paths of blobs in the
> directory, the "tree" entry is removed from the in-core index?

I will make this more explicit.
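
In the meantime, here is a minimal sketch of the guard pattern the
doc describes, again as a Python model rather than Git's code. The
`Index` class, its `expand` callback, and `count_files()` are
hypothetical; only the idea of calling an `ensure_full_index()`-style
guard before a loop that expects one entry per file comes from the
doc.

```python
class Index:
    def __init__(self, entries, expand):
        self.entries = list(entries)
        self.expand = expand      # hypothetical: dir path -> list of file paths
        self.expansions = 0       # counts how often we paid for an expansion

    def is_sparse(self):
        return any(path.endswith("/") for path in self.entries)

    def ensure_full(self):
        """Guard: expand sparse-directory entries; no-op on a full index."""
        if not self.is_sparse():
            return
        full = []
        for path in self.entries:
            full.extend(self.expand(path) if path.endswith("/") else [path])
        self.entries = full
        self.expansions += 1


def count_files(index):
    """A consumer that still assumes one entry per file."""
    index.ensure_full()           # guard inserted before the loop
    return sum(1 for _ in index.entries)
```

Once every such consumer has a guard, the sparse-directory entries
can stay in memory, and only the guarded callers pay for expansion.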

Thanks for your help improving this doc! Hopefully the plan is a
little clearer now.

-Stolee