Re: Index format v5

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
* The details of the extension data blocks are described in the first
   (overview) section, whereas it seems like they should be described
   in their own section following the "conflict data" section.  But
   wouldn't the presence of extension data blocks prevent the addition
   of conflict data?

Only the details that should be there for every extension are described
in the overview (the header of the extension), to make sure every
extension has the same header format, and thus a reader which doesn't
understand a specific extension still can read its header and know
what's going on.

They won't prevent the addition of conflicted data, since when a
conflict is created, other files were probably added and the index has
to be rewritten anyway. Once the conflict is resolved however only a
bit has to be flipped, so there is no rewrite necessary.

In other words, the presence of extensions *does indeed* prevent the addition of conflict data, but you don't think that it is a problem.

Moving the conflict data to after the extensions, on the other hand, would mean that conflict data can sometimes be added without a rewrite. I cannot judge whether this would be useful.

Handling conflict data *as* an extension would allow the conflict data to be added at any time without rewriting. I cannot judge whether this would be useful.

* Does the index file include directory entries for empty directories?
   What about directories that contain only other directories?

In theory the index is able to include empty directories. I'm however
not sure if this should be implemented. I'd be happy to get more
feedback there.

Currently git does not keep track of empty directories. Even though there have been proposals to fix this, it is far beyond the scope of your project to implement the handling of empty directories. The question is whether your format definition *forbids* the presence of empty directories in the index file (in the interest of definiteness, and it might make the reader implementation a little bit simpler, but it imposes a constraint on the writer). Obviously empty directories, even if present, mustn't have an effect on the SHA1 of the trees containing them.

Directory entry
===============

* "4-byte number of entries in the index that is covered by the tree
   this entry represents."  What does this include?
   Files/directories/both?  Recursive or non-recursive?

This is from the cache-tree. I'm not sure but I think it includes both
files and directories, recursively.

Please figure this out for the final spec.

File entry
==========
[...]

* Are file entries sorted by entire path or only by the basename?

They are sorted by the basename, in the respective block of their
directories.
Example: paths: a/a a/z b/b
File entries in the index:
a ...
z ...
b ...

OK, so in other words, the file entries of all files in a directory (not including files in subdirectories) are stored contiguously, sorted by basename. (The thing that wasn't immediately clear is whether files from subdirectories are intermingled with those of the parent directory.)

Flat loading
============

* I found the explanation pretty incomprehensible.  Perhaps some
   pseudo-code would make it clearer?
[...]
[...] I have changed the flat loading in the documentation,
hope it's more understandable now.

Maybe it's just be, but I still don't think it is very clear. Here is version fbf8add1b026:

== Flat loading

Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:

1. Start with the root directory, and read also the name of the
  first subdirectory (=next directory in the list).

1a. Use the next directory (the one against which the filenames
  were checked previously), and read the next directory name,
  to check the files against.

2. Check the stack if the element at the top is < then the current
  directoryname.

  If it's < then current directory name, add files from the stack
    to the entry list, until the file name is > then the
    directory name.

2. While filename < directoryname add the filenames to the entry
  list

3. Add the rest of the files to a stack.

4. Continue with 1a, if there are more directories left.

5. Add the rest of the files from the stack to the end of the
  entry list.

Aside from the fact that there are two number (2)s,

* What does "Use the next directory (the one against which the filenames were checked previously)" mean? What does it mean to "use a directory"? Does it mean to recurse into the directory? Is the stack preserved passed down to the recursive function calls, or does each level of the recursion have its own stack? What does "against which the filenames were checked previously" mean (there are no filenames mentioned in the earlier steps)?

* You talk about a stack, and "Add the rest of the files to a stack". But when you retrieve entries from a stack, they come out in reverse order. So are you imagining that each element of the stack is an array of file entries? Or do you push the files onto the stack in reverse order? Or do you really mean a queue rather than a stack?

* Are the file entries read before they are put on the stack, or does the stack just remember where to read them from when their turn comes?

* "Continue with 1a, if there are more directories left": I assume you mean subdirectories of the current directory, but maybe you are talking about all directories?

There is a reason that I asked for pseudocode, namely because it forces you to be more precise in your description. I can certainly imagine several workable algorithms for reading the index file, and the different algorithms have different tradeoffs particularly regarding the amount of temporary space needed and locality of reference in the index file (which, I understand, will be mmapped when practical but it is not practical on all platforms). Once you express the algorithm in pseudocode it is possible to be sure which variant you have chosen and consider whether it is really workable.

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]