Re: A look at some alternative PACK file encodings

Junio C Hamano <junkio@xxxxxxx> · Thu, 07 Sep 2006 10:40:02 -0700

Richard Curnow <rc@xxxxxxxxxx> writes:

>  <linux <at> horizon.com> writes:
>
>> For regular packs, such objects wouldn't even be present, because
>> all base objects are in the pack itself.
>
> It would actually be useful if this restriction were lifted.

The primary reason why we require the base object to be present
in the same pack is because our assumption has always been we
mmap a pack as a whole, and we assumed that we can make sure
that the address space of a process to be enough to mmap at
least one pack (otherwise the user will be given a way to keep
size of each pack down to a size that is a mmapable piece) but
not necessarily more than one.  In other words, while unpacking
one object that is deltified, we wanted to make sure everything
that is necessary is availble in memory.  Otherwise, when
multiple packs are needed to reconstitute a single object
because the object is a delta in one pack and its base is in the
other pack, we might not be able to mmap both of them into the
address space at the same time.

The "mmap only parts of a huge packfile into map windows" update
Shawn Pearce is working on would lift that worry, so it would be
less of a reason.

Having said that, people must realize that keeping thin packs on
disk, indexed, is a major undertaking.  It can be made to work,
but it will be painful.

People can grab a thin pack via rsync or http without realizing
it is thin, try to use it and then later (possibly much later,
when a delta that is against a missing object happens to be
needed) discover some of the pack data is unusable.  To avoid
this, the dumb transports need to walk the ancestry chains of
commits and their associated objects while fetching as they do
right now, but also need to walk the delta chains to make sure
their base objects are available.

I like the proposal by Linus to have stub objects that records
what the missing base object is in the thin pack stream.  It
simplifies things for the offset representation of bases.  But
if we were to do so, I suspect indexing such a thing would need
more surgery to the .idx format.  The question is if we should
record the stub object in the .idx file.

Right now we record the beginning of data for each object
(either base or delta) without storing anything to say about
where it ends -- the pack is assumed to be dense, so the point
where next object begins should be where the current one ends.
This means that we need to build a reverse index to list the
objects in the pack in offset order, when we want to find the
end of each object efficiently.

We work this around when using a pack by not doing the full
verification.  We start decoding the header from the beginning
and in either base or delta the deflated stream sits at the end.
We inflate that deflated stream until zlib says it got enough,
and we do not check if that "enough" point stepped into the
location that is supposed to be inside the "next" object (if it
did so then it is an error in the packfile we missed to detect).
If things inflate well and delta applies cleanly, we are happy.

But pack-objects, when reusing the data from an existing pack,
wants to be careful, so it _does_ build a reverse index itself
and makes sure that inflate() ends at the location we expect it
to.  For that process to remain the correct way to determine
where each object ends, the .idx file must record the presense
of stub objects.

However, if we write the names of stub objects in the .idx file,
we have another problem.  Currently, the existence of an object
name in the .idx means that such an object exists in the
corresponding .pack; this will not hold true anymore.

So if we were to do this, we need to give one extra bit to each
of the .idx entries to mark which ones are stub and which ones
are not.  This would allow the reverse index to properly notice
where each object ends, and allow the dumb transports to walk
the .idx for missing base objects efficiently.  The code that
looks at .idx at runtime to determine which pack has needed
object needs to notice this bit.

Again, all of this can be made to work, but it will be painful.

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html