Re: Index format v5

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Wed, 09 May 2012 10:37:03 +0200

On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
* The details of the extension data blocks are described in the first
   (overview) section, whereas it seems like they should be described
   in their own section following the "conflict data" section.  But
   wouldn't the presence of extension data blocks prevent the addition
   of conflict data?

Only the details that should be there for every extension are described
in the overview (the header of the extension), to make sure every
extension has the same header format, and thus a reader which doesn't
understand a specific extension still can read its header and know
what's going on.

They won't prevent the addition of conflicted data, since when a
conflict is created, other files were probably added and the index has
to be rewritten anyway. Once the conflict is resolved however only a
bit has to be flipped, so there is no rewrite necessary.

In other words, the presence of extensions *does indeed* prevent the 
addition of conflict data, but you don't think that it is a problem.

Moving the conflict data to after the extensions, on the other hand, 
would mean that conflict data can sometimes be added without a rewrite. 
 I cannot judge whether this would be useful.

Handling conflict data *as* an extension would allow the conflict data 
to be added at any time without rewriting.  I cannot judge whether this 
would be useful.

* Does the index file include directory entries for empty directories?
   What about directories that contain only other directories?

In theory the index is able to include empty directories. I'm however
not sure if this should be implemented. I'd be happy to get more
feedback there.

Currently git does not keep track of empty directories.  Even though 
there have been proposals to fix this, it is far beyond the scope of 
your project to implement the handling of empty directories.  The 
question is whether your format definition *forbids* the presence of 
empty directories in the index file (in the interest of definiteness, 
and it might make the reader implementation a little bit simpler, but it 
imposes a constraint on the writer).  Obviously empty directories, even 
if present, mustn't have an effect on the SHA1 of the trees containing them.

Directory entry
===============

* "4-byte number of entries in the index that is covered by the tree
   this entry represents."  What does this include?
   Files/directories/both?  Recursive or non-recursive?

This is from the cache-tree. I'm not sure but I think it includes both
files and directories, recursively.

Please figure this out for the final spec.

File entry
==========
[...]

* Are file entries sorted by entire path or only by the basename?

They are sorted by the basename, in the respective block of their
directories.
Example: paths: a/a a/z b/b
File entries in the index:
a ...
z ...
b ...

OK, so in other words, the file entries of all files in a directory (not 
including files in subdirectories) are stored contiguously, sorted by 
basename.  (The thing that wasn't immediately clear is whether files 
from subdirectories are intermingled with those of the parent directory.)

Flat loading
============

* I found the explanation pretty incomprehensible.  Perhaps some
   pseudo-code would make it clearer?
[...]
[...] I have changed the flat loading in the documentation,
hope it's more understandable now.

Maybe it's just be, but I still don't think it is very clear.  Here is 
version fbf8add1b026:

== Flat loading

Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:

1. Start with the root directory, and read also the name of the
  first subdirectory (=next directory in the list).

1a. Use the next directory (the one against which the filenames
  were checked previously), and read the next directory name,
  to check the files against.

2. Check the stack if the element at the top is < then the current
  directoryname.

  If it's < then current directory name, add files from the stack
    to the entry list, until the file name is > then the
    directory name.

2. While filename < directoryname add the filenames to the entry
  list

3. Add the rest of the files to a stack.

4. Continue with 1a, if there are more directories left.

5. Add the rest of the files from the stack to the end of the
  entry list.

Aside from the fact that there are two number (2)s,

* What does "Use the next directory (the one against which the filenames 
were checked previously)" mean?  What does it mean to "use a directory"? 
 Does it mean to recurse into the directory?  Is the stack preserved 
passed down to the recursive function calls, or does each level of the 
recursion have its own stack?  What does "against which the filenames 
were checked previously" mean (there are no filenames mentioned in the 
earlier steps)?

* You talk about a stack, and "Add the rest of the files to a stack". 
But when you retrieve entries from a stack, they come out in reverse 
order.  So are you imagining that each element of the stack is an array 
of file entries?  Or do you push the files onto the stack in reverse 
order?  Or do you really mean a queue rather than a stack?

* Are the file entries read before they are put on the stack, or does 
the stack just remember where to read them from when their turn comes?

* "Continue with 1a, if there are more directories left": I assume you 
mean subdirectories of the current directory, but maybe you are talking 
about all directories?

There is a reason that I asked for pseudocode, namely because it forces 
you to be more precise in your description.  I can certainly imagine 
several workable algorithms for reading the index file, and the 
different algorithms have different tradeoffs particularly regarding the 
amount of temporary space needed and locality of reference in the index 
file (which, I understand, will be mmapped when practical but it is not 
practical on all platforms).  Once you express the algorithm in 
pseudocode it is possible to be sure which variant you have chosen and 
consider whether it is really workable.

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html