Missing Promisor Objects in Partial Repo Design Doc

It seems that we've reached a standstill on the various designs that
could solve this problem, so I decided to write up a design document
to discuss the ideas we've come up with so far, along with new ones.
Hopefully this will get us closer to a viable implementation we can
agree on.

Missing Promisor Objects in Partial Repo Design Doc
===================================================

Basic Reproduction Steps
------------------------

 - Partial clone a repository
 - Create a local commit and push it
 - Fetch new changes
 - Run garbage collection

State After Reproduction
------------------------

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2b ---- T2b -- B2b (created locally, in non-promisor pack)
  |
  C2a ---- T2a -- B2a (created locally, in non-promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Explanation of the Problem
--------------------------

In a partial clone repository, non-promisor commits are locally
committed as children of promisor commits and then pushed up to the
server. Fetches of new history can result in promisor commits that have
non-promisor commits as ancestors. During garbage collection, objects
are repacked in 2 steps. In the first step, if there is more than one
promisor packfile, all objects in promisor packfiles are repacked into a
single promisor packfile. In the second step, a revision walk is made
from all refs (and some other things like HEAD and reflog entries) that
stops whenever it encounters a promisor object. In the example above, if
a ref pointed directly to C2a, it would be returned by the walk (as an
object to be packed). But if we only had a ref pointing to C3, the
revision walk immediately sees that it is a promisor object, does not
return it, and does not iterate through its parents.

(C2b is a bit of a special case. Despite not being in a promisor pack,
it is still considered to be a promisor object since C3 directly
references it.)
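
As an illustration only, the walk described above can be modeled on the
example graph. This is a toy sketch in Python, not Git's actual code:
the LINKS table, PROMISOR_PACK set and both function names are made up
for this example, and "promisor object" is approximated as "in a
promisor pack, or directly referenced by an object in one".

```python
# Toy model of the object graph from "State After Reproduction".
# Each object maps to the objects it references directly
# (commit -> parent commit and tree, tree -> blob).
LINKS = {
    "C3": ["C2b", "T3"], "T3": ["B3"], "B3": [],
    "C2b": ["C2a", "T2b"], "T2b": ["B2b"], "B2b": [],
    "C2a": ["C1", "T2a"], "T2a": ["B2a"], "B2a": [],
    "C1": ["T1"], "T1": ["B1"], "B1": [],
}
# Objects that were fetched from the remote.
PROMISOR_PACK = {"C3", "T3", "B3", "C1", "T1", "B1"}


def promisor_objects():
    """Objects in a promisor pack plus everything they reference directly."""
    result = set(PROMISOR_PACK)
    for obj in PROMISOR_PACK:
        result.update(LINKS[obj])
    return result


def walk_for_repack(refs):
    """Objects the second repack step would return, stopping at promisor objects."""
    promisor = promisor_objects()
    seen, keep, stack = set(), [], list(refs)
    while stack:
        obj = stack.pop()
        if obj in seen or obj in promisor:
            continue  # promisor objects are not returned and not walked through
        seen.add(obj)
        keep.append(obj)
        stack.extend(LINKS[obj])
    return sorted(keep)
```

In this model, walk_for_repack(["C2a"]) returns C2a, T2a and B2a, while
walk_for_repack(["C3"]) returns nothing, so none of the locally created
objects would be repacked.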

If we think this is a bad state, we should propagate the “promisor-ness”
of C3 to its ancestors. Git commands should either prevent this state
from occurring or tolerate it and fix it when we can. If we did run into
this state unexpectedly, then it would be considered a BUG.

If we think it is a valid state, we should NOT propagate the
“promisor-ness” of C3 to its ancestors. Git commands should respect that
this is a possible state and be able to work around it. In that case,
this bug would be caused strictly by garbage collection.


Bad State Solutions
===================

Fetch negotiation
-----------------
Implemented at
https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@xxxxxxxxxx/

During fetch negotiation, if a commit is not in a promisor pack and is
therefore local, do not declare it as a "have", so that it can be
fetched into a promisor pack.
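
A rough sketch of the idea, in Python rather than Git's C, and not
taken from the linked patch: haves_for_negotiation and its parameters
are hypothetical names, and real negotiation is incremental rather than
a single full walk.

```python
def haves_for_negotiation(local_tips, parents, promisor_pack_objects):
    """Return the commits that would be advertised as "have".

    Commits that are not in a promisor pack are skipped, but the walk
    continues through them so their promisor ancestors are advertised.
    """
    haves, seen, stack = [], set(), list(local_tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit in promisor_pack_objects:
            haves.append(commit)
        stack.extend(parents.get(commit, []))
    return haves


# Example history: C2b -> C2a -> C1, where only C1 is in a promisor pack.
parents = {"C2b": ["C2a"], "C2a": ["C1"], "C1": []}
haves = haves_for_negotiation(["C2b"], parents, {"C1"})
```

Here only C1 ends up in haves, so the server would re-send C2a and C2b
together with any new history.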

Cost:
- Creation of set of promisor pack objects (by iterating through every
  .idx of promisor packs)
- Refetching of the local commits

Pros: Implementation is simple, client doesn’t have to repack, prevents
state from ever occurring in the repository.

Cons: Network cost of refetching could be high if many local commits
need to be refetched.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, refetched into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Fetch repack
------------
Not yet implemented.

Enumerate the objects in the freshly fetched promisor packs, checking
every outgoing link to see whether it references a non-promisor object
that we have, to get a list of tips where local objects are parents of
promisor objects ("bad history"). After collecting these "tips of bad
history", start another traversal from them, stopping whenever an
object in a promisor pack is reached. The objects visited by this
second traversal are the local objects to be repacked into a promisor
pack.
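
The two traversals can be sketched as follows. This is a toy Python
model under stated assumptions, not an implementation: the function
name, the links table and the sets are all hypothetical, and an object
counts as "one we have" when it appears as a key in links.

```python
def objects_to_repack(new_promisor_objects, links, promisor_pack_objects):
    """Return the local objects that should move into a promisor pack."""
    # First traversal: collect tips of bad history, i.e. non-promisor
    # objects we have that a freshly fetched object links to.
    tips = []
    for obj in new_promisor_objects:
        for target in links.get(obj, []):
            if target in links and target not in promisor_pack_objects:
                tips.append(target)

    # Second traversal: walk from the tips, stopping at promisor packs.
    to_repack, stack = set(), tips
    while stack:
        obj = stack.pop()
        if obj in to_repack or obj in promisor_pack_objects:
            continue
        if obj not in links:
            continue  # missing object, nothing local to repack
        to_repack.add(obj)
        stack.extend(links[obj])
    return to_repack


# The graph from "State After Reproduction"; C3, T3 and B3 were just fetched.
links = {
    "C3": ["C2b", "T3"], "T3": ["B3"], "B3": [],
    "C2b": ["C2a", "T2b"], "T2b": ["B2b"], "B2b": [],
    "C2a": ["C1", "T2a"], "T2a": ["B2a"], "B2a": [],
    "C1": ["T1"], "T1": ["B1"], "B1": [],
}
promisor_pack = {"C3", "T3", "B3", "C1", "T1", "B1"}
result = objects_to_repack({"C3", "T3", "B3"}, links, promisor_pack)
```

In this model, result contains exactly the six locally created objects
(C2a, T2a, B2a, C2b, T2b, B2b).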

Cost:
- Traversal through newly fetched promisor trees and commits
- Creation of set of promisor pack objects (for tips of bad history
  traversal to stop at a promisor object)
- Traversal through all local commits and check existence in promisor
  pack set
- Repack all pushed local commits

Pros: Prevents state from ever occurring in the repository, no network
cost.

Cons: Additional cost of repacking is incurred during fetch, more
complex implementation.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack
-------------------------
Not yet implemented.

Same concept as “fetch repack”, but happens during garbage collection
instead. The traversal is more expensive since we no longer have access
to what was recently fetched so we have to traverse through all promisor
packs to collect tips of “bad” history.

Cost:
- Creation of set of promisor pack objects
- Traversal through all promisor commits
- Traversal through all local commits and check existence in promisor
  object set
- Repack all pushed local commits

Pros: Can be run in the background as part of maintenance, no network
cost.

Cons: More expensive than “fetch repack”, state isn’t fixed until
garbage collection, more complex implementation.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack all
-----------------------------
Implemented at
https://lore.kernel.org/git/20240925072021.77078-1-hanyang.tony@xxxxxxxxxxxxx/ 

Repack all local commits into promisor packs during garbage collection.

Both of the following are valid scenarios:
commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

commit  tree  blob
  C3 ---- T3 -- B3 (created locally, packed into promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Repack all local commits

Pros: Can be run in the background as part of maintenance, no network
cost, less complex implementation, and less expensive than “garbage
collection repack”.

Cons: Packing local objects into promisor packs means that it is no
longer possible to detect if an object is missing due to repository
corruption or because we need to fetch it from a promisor remote.
It also means that garbage collection will no longer remove unreachable
local objects.

Valid State Solutions
=====================

Garbage Collection check
------------------------
Not yet implemented.

Currently during the garbage collection rev walk, whenever a promisor
commit is reached, it is marked UNINTERESTING, and then subsequently all
ancestors of the promisor commit are traversed and also marked
UNINTERESTING. The proposal is to add a check for whether a commit is
local during the promisor commit ancestor traversal, and to not mark
local commits as UNINTERESTING.
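
A minimal sketch of that check, as a toy Python model rather than Git's
revision-walk code. The function name and the parents and
local_commits structures are hypothetical, and UNINTERESTING is modeled
as membership in a returned set.

```python
def mark_uninteresting(promisor_commit, parents, local_commits):
    """Mark a promisor commit and its ancestors, except local commits."""
    uninteresting, seen, stack = set(), set(), [promisor_commit]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        # The additional check: local commits stay interesting so that
        # garbage collection still packs them.
        if commit not in local_commits:
            uninteresting.add(commit)
        stack.extend(parents.get(commit, []))
    return uninteresting


# History C3 -> C2 -> C1, where C2 was created locally.
parents = {"C3": ["C2"], "C2": ["C1"], "C1": []}
marked = mark_uninteresting("C3", parents, {"C2"})
```

In this model marked contains C3 and C1 but not C2, so C2 would still
be picked up by the walk.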

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Adds an additional check to every ancestor of a promisor commit.

This is practically the only solution if the state is valid. Fsck would
also have to start checking for validity of ancestors of promisor
commits instead of ignoring them as it currently does.

Optimizations
=============

The “creation of set of promisor pack objects” can be replaced with
“creation of set of non-promisor objects” since the latter is almost
always cheaper and we can check for non-existence rather than existence.
This does not work for “fetch negotiation” since if we have a commit
that's in both a promisor pack and a non-promisor pack, the algorithm's
correctness relies on the fact that we report it as a promisor object
(because we really need the server to re-send it).
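
To illustrate why the two checks can disagree, here is a small Python
sketch. The set contents and function names are invented for the
example, and the complement check assumes the object is already known
to exist locally.

```python
def is_promisor_by_existence(obj, promisor_pack_objects):
    """Check membership in the set of promisor pack objects."""
    return obj in promisor_pack_objects


def is_promisor_by_non_existence(obj, non_promisor_objects):
    """Check absence from the set of non-promisor objects."""
    return obj not in non_promisor_objects


# C2 is present in both a promisor pack and a non-promisor pack.
promisor_pack_objects = {"C1", "C2"}
non_promisor_objects = {"C2", "C3"}

for obj in ["C1", "C2", "C3"]:
    by_existence = is_promisor_by_existence(obj, promisor_pack_objects)
    by_non_existence = is_promisor_by_non_existence(obj, non_promisor_objects)
    print(obj, by_existence, by_non_existence)
```

The two checks agree for C1 and C3 but differ for C2: the existence
check reports it as a promisor object, which is what “fetch
negotiation” relies on, while the non-existence check does not.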




