On 5/26/07, Junio C Hamano <junkio@xxxxxxx> wrote:
"Shawn O. Pearce" <spearce@xxxxxxxxxxx> writes: > This conflicts (in a subtle way) with Dana How's > "sha1_file.c:rearrange_packed_git() should consider packs' object > sizes" patch as we now have num_objects = 0 for any indexes we > have not opened. In the case of Dana's patch this would cause > those packfiles to have very high ranks, possibly sorting much > later than they should have. I am keeping that rearrange stuff on hold, partly because I am moderately hesitant to do the fp, which feels overkill at that low level of code.
Oh, I thought the fp might cause a gag reflex -- I had to add -lm.
Unfortunately, when trying to automatically detect and grade outliers,
which is what I was trying to do, (datum - mean) / std_dev is hard to
beat, and I needed sqrt() for std_dev; all of the other fp could easily
have been written out.
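Just to make "detect and grade outliers" concrete, here is the same idea
in throwaway shell/awk form, using raw pack file sizes as stand-in data
(the patch does the equivalent in C, which is where sqrt() and -lm come
in); this is only an illustration, not the patch itself:

    ls -l .git/objects/pack/pack-*.pack | awk '
        { size[NR] = $5; name[NR] = $NF; sum += $5; sumsq += $5 * $5 }
        END {
            mean = sum / NR
            var = sumsq / NR - mean * mean
            sd = (var > 0) ? sqrt(var) : 0   # sqrt is why the C needs -lm
            for (i = 1; i <= NR; i++)
                if (sd && (size[i] - mean) / sd > 2)
                    print name[i], "looks like an outlier"
        }'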
> Also, I am hoping that we can discard the object density criteria
> altogether by making the default repack behaviour friendlier to the
> pathological cases, e.g. by emitting huge blobs at the end of the
> packstream, potentially pushing them out to later parts of split
> packs by themselves and automatically marking them with the .keep
> flag. Until that kind of improvement materializes, people with
> pathological cases could (1) handcraft a pack that contains only the
> megablob, (2) place that on a central alternate, and (3) touch it
> with an artificially old timestamp, which hopefully is a good enough
> workaround.
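Spelled out, I believe that workaround amounts to roughly the following
(untested; the blob id and the /shared/objects path are only
placeholders):

    # (1) handcraft a pack holding only the megablob
    echo $megablob_sha1 | git-pack-objects /shared/objects/pack/pack

    # (2) let repositories see it through a central alternate
    echo /shared/objects >>.git/objects/info/alternates

    # (3) give the pack an artificially old timestamp
    touch -t 200001010000 /shared/objects/pack/pack-*.pack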
I think we should do what we can to make the timestamp as meaningful as
possible, which is why I submitted that stamping patch. I think there
are two interesting strategies compatible with maximally-informative
timestamps:

(1) git-repack -a -d repacks everything on each call. You would need:

    (1a) Rewrite builtin-pack-objects.c so that only the object_ix hash
         accesses the "objects" array directly; everything else goes
         through a pointer table.
    (1b) Sort the new pointer table by object type, in the order
         tag -> commit -> tree -> nice blob -> naughty blob. The sort
         is stable, so the order within each group is unchanged.
    (1c) Do not deltify naughty blobs. Naughty blobs are those marked
         "nodelta" as well as very large blobs.
    (1d) Write out objects in the new pointer table order. Splitting
         will then put the metadata in the first pack, while naughty
         blobs tend to end up in the last pack.
    (1e) When done writing all the packs, swap their timestamps so that
         the current timestamp-based sorting looks at naughty blobs
         last.

(2) git-repack -a -d runs in two passes and maintains .keep files
    (a rough sketch of the resulting loop is below):

    (2a) Add a new flag --types=[gctb]+ to pack-objects, to be supplied
         by git-repack. This means only the listed taGs/Commits/Trees/
         Blobs are passed through; all others are dropped.
    (2b) Put a new loop around the core of git-repack: in the first
         iteration pack with --types=b, in the second with --types=gct.
         Thus the metadata gets the more recent timestamp.
    (2c) If packs are split, also swap timestamps as in (1e), within
         each iteration.
    (2d) If an iteration produces split packs, automatically mark all
         but the last in the sequence with a .keep file. These .keep
         files contain the string "repack".
    (2e) Add a new option to repack: -A. If specified, the first thing
         repack does is remove any .keep file containing "repack".
    (2f) The existing response of repack to .keep files -- do not
         repack those packs -- is retained, so that on each -a (but
         not -A) repack we only repack the tail of each set of packs:
         metadata and data. The metadata set will probably only ever
         contain one pack and will always be repacked.

I've (badly) implemented (1b) and confirmed it had no impact on the
linux-2.6 repo. I've also implemented (2a), (2b), (2d), and (2f), but
have not fully measured them. I'd like to finish this work, but
"megapacks" are very time-consuming to manipulate, and with the loose
megablob approach they are not as useful for me.

Finally, some people might want more esoteric repacking strategies than
what I've listed above. We could add a --packed flag to pack-objects to
help them. This means that

    git pack-objects --packed --unpacked=<pack1> --unpacked=<pack2>

would repack only pack1 and pack2, and would not absorb any loose
blobs. This would allow you to define any number of packfile classes
and maintain them yourself, each indicated by something different in a
.keep file. (To newly absorb loose blobs into a class, you would run
"cat object-list | git-pack-objects --incremental" with an object-list
you built following your own rules.) These strategies would be too
special-purpose to be in git, but adding --packed is a small and useful
change.

Shawn: when I first saw the index-loading code, my first thought was
that all the index tables should be merged (easy, since they are
sorted) so that callers only need to do one search. With indices loaded
lazily, either you can't merge, or you merge sequentially, raising the
merge cost from (total entries) to almost (index files) * (total
entries). What do you think about merging the SHA-1 tables, and how
would/should it interact with lazy index file loading?
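To make (2b) and (2d) a bit more concrete, the core of git-repack might
end up looking roughly like the sketch below. The --types flag is the
one proposed above, --max-pack-size and $PACKDIR/$max are stand-ins,
and none of this is tested:

    # rough sketch only; --types is the proposed flag, $PACKDIR and
    # $max are placeholders
    for types in b gct          # (2b): naughty blobs first, metadata last
    do
        git-rev-list --all --objects |
        git-pack-objects --types=$types --max-pack-size=$max \
            "$PACKDIR/pack" >pack-names || exit
        # (2d): mark all but the last split pack with a "repack" .keep,
        # so plain "repack -a" leaves them alone and "repack -A" can
        # drop them again
        sed '$d' pack-names |
        while read name
        do
            echo repack >"$PACKDIR/pack-$name.keep"
        done
    done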
BTW, if it's not apparent, I think my object density patch should be
dropped. It has served its purpose as a thought experiment.

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell