Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Fri, 22 Sep 2017 17:32:00 -0400

On 9/21/2017 7:04 PM, Jonathan Tan wrote:
On Thu, 21 Sep 2017 14:00:40 -0400
Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

(part 3)

Additional overall comments on:
https://github.com/jonathantanmy/git/commits/partialclone2

{} WRT the code in is_promised() [1]

[1]
https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960

     {} it looked like it was adding ALL promisor- and
promised-objects to the "promised" OIDSET, rather than just
promised-objects, but I could be mistaken.

As far as I can tell, it is just adding the promised objects (some of
which may also be promisor objects). If you're saying that you expected
it to add the promisor objects as well, that might be a reasonable
expectation...I'm thinking of doing that.

It looked like it was adding both types.  I was concerned that that
it might be doing too much.  But I haven't run the code, that was from
an observation.

     {} Is this iterating over ALL promisor-packfiles?

Yes.

     {} It looked like this was being used by fsck and rev-list.  I
have concerns about how big this OIDSET will get and how it will
scale, since if we start with a partial-clone all packfiles will be
        promisor-packfiles.

It's true that scaling is an issue. I'm not sure if omitting the oidset
will solve anything, though - as it is, Git maintains an object hash and
adds to it quite liberally.

I guess I'm afraid that the first call to is_promised() is going
cause a very long pause as it loads up a very large hash of objects.

Perhaps you could augment the OID lookup to remember where the object
was found (essentially a .promisor bit set).  Then you wouldn't need
to touch them all.

One thing that might help is some sort of flushing of objects in
promisor packfiles from the local repository - that way, the oidset
won't be so large.

     {} When iterating thru a tree object, you add everything that it
        references (everything in that folder).  This adds all of the
        child OIDs -- without regard to whether they are new to this
        version of the tree object. (Granted, this is hard to compute.)

The oidset will deduplicate OIDs.

Right, but you still have an entry for each object.  For a repo the
size of Windows, you may have 25M+ objects your copy of the ODB.

        My concern is that this will add too many objects to the
OIDSET. That is, a new tree object (because of a recent change to
something in that folder) will also have the OIDs of the other
*unchanged* files which may be present in an earlier non-provisor
packfile from an earlier commit.

        I worry that this will grow the OIDSET to essentially include
        everything.  And possibly defeating its own purpose.  I could
be wrong here, but that's my concern.

Same answer as above (about flushing of objects in promisor packfiles).

{} I'm not opposed to the .promisor file concept, but I have concerns
     that in practice all packfiles after a partial clone will be
     promisor-packfiles and therefore short-cut during fsck, so fsck
     still won't gain anything.

     It would help if there are also non-promisor packfiles present,
     but only for objects referenced by non-promisor packfiles.

     But then I also have to wonder whether we can even support
non-promisor packfiles after starting with a partial clone -- because
of the properties of received thin-packs on a non-partial fetch.

Same reply as to your other e-mail - locally created objects are not in
promisor packfiles. (Or were you thinking of a situation where locally
created objects are immediately uploaded to the promisor remote, thus
making them promisor objects too?)

Thanks,
Jeff