Re: [GSoC] Designing a faster index format - Progress report

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Thu, 24 May 2012 22:01:59 +0200

Thomas asked for feedback on IRC, and Michael requested that I put it in
email for posterity, so here goes.

This is referring to 0c9214a from git://github.com/tgummerer/git.git,
where Thomas put the very early beginnings of wiring v5 parsing code
into read_index_from().

* On the high-level history making side of things: You structured your
  commits by "steps of implementation", but that sometimes does not give
  very useful intermediate steps. for example, now your index-v5~1 just
  prints "Header verified". That means it cannot be meaningfully tested,
  or judged by tests, or some such.

  That's ok as long as you are still drafting up things, but from
  experience the most workable approach is refactor-patch-improve (or
  sometimes patch-refactor-improve).  That is, you proceed in up to
  three steps, each spelled as one or more commits:

  - (Optionally, but in this case crucially) refactor the existing code
    into a shape where your feature patches become simpler.  This should
    be a true refactoring, i.e., not changing any observable behavior.

  - Optionally fix any bugs you encounter along the way.

  - Patch in your shiny new feature.

  (Tests should of course go with the last two steps.)

  This tends to make the feature patches simpler, and thus easier to
  review.  For a long project like yours, you may end up having several
  feature patches or even several iterations of this approach.

  For a great example of just this happening, consider Junio's index-v4
  series:

  - Preparations:
    varint: make it available outside the context of pack
    cache.h: hide on-disk index details
    read-cache.c: allow unaligned mapping of the index file
    read-cache.c: make create_from_disk() report number of bytes it consumed

  - Bugfix (ok, it only makes the error better)
    read-cache.c: report the header version we do not understand

  - Refactoring:
    read-cache.c: move code to copy ondisk to incore cache to a helper function
    read-cache.c: move code to copy incore to ondisk cache to a helper function

  - The feature itself:
    read-cache.c: read prefix-compressed names in index on-disk version v4
    read-cache.c: write prefix-compressed names in the index

  - The supporting mini-features required to use it:
    update-index: upgrade/downgrade on-disk index version
    unpack-trees: preserve the index file version of original

  - The docs:
    index-v4: document the entry format

  Ok, you can already see that the reality is not quite as simple as my
  initial explanations...

* The code structure itself:

  I like the split you made in the verify_hdr_* family, like so:

    -static int verify_hdr(struct cache_header *hdr, unsigned long size)
    +static int verify_hdr_version(struct cache_header *hdr, unsigned long size)
     {
    -       git_SHA_CTX c;
    -       unsigned char sha1[20];
            int hdr_version;

            if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
                    return error("bad signature");
            hdr_version = ntohl(hdr->hdr_version);
    -       if (hdr_version < 2 || 4 < hdr_version)
    +       if (hdr_version < 2 || 5 < hdr_version)
                    return error("bad index version %d", hdr_version);
    +       return 0;
    +}
    +
    +static int verify_hdr(struct cache_header *hdr, unsigned long size)
    +{
    +       git_SHA_CTX c;
    +       unsigned char sha1[20];
    +
            git_SHA1_Init(&c);
            git_SHA1_Update(&c, hdr, size - 20);
            git_SHA1_Final(sha1, &c);

  since it decouples version checking from the checksumming.  (Without
  the change to make 5 an acceptable version, this could actually go in
  a refactoring patch.)

  On the other hand the read_index_from() changes show none of that
  "function modularity".  Your results would be much better if you first
  refactored the meat of read_index_from to read_index_v2_from(),
  so that read_index_from() is essentially:

    mmap();
    verify_hdr();
    read_index_v2_from(...);
    munmap();

  Then, in another patch, wire in your shiny new read_index_v5_from()
  function.  Then change required to read_index_from() will be a
  two-liner, which makes it *much* easier to see what is going on,
  especially if you compare with your f50c747b

     read-cache.c | 118 ++++++++++++++++++++++++++++++++++--------------------

  80% of that is just reindentation because you are wrapping things in a
  version test!

* There's little code at this point, but I like the general direction
  you are taking: for now, start with a flat array of index entries and
  slurp everything in there, like the v3 code does.  Later patches can
  then improve on that to use the features of the new format.

* The struct{} layout should be such that it is not a pure coincidence
  that hdr_version has the same position in all structures.  You
  basically have two options to achieve this:

  - You write it as a concatenation of two structs, such as

      struct cache_version_header { unsigned hdr_signature, hdr_version; }
      struct cache_header_v2      { unsigned hdr_entries; }

    and similarly for v5.  The code then just reads the first struct,
    and dispatches to read the second one depending.

  - You write it as a struct-within-a-struct, so that the generic header
    becomes a part of the version-specific ones, like

      struct cache_header_v2 {
        struct cache_version_header foo;
        unsigned hdr_entries;
      }

    Finding a good substitute for "foo" is left as an exercise for the
    reader.

* Finally, on style: Your indentation is off in parts, please use a
  tab-width of 8 spaces and use tabs for the leading indentation.

Cheers,
Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html