On 17 May 2016, at 14:13, Jeff King <peff@xxxxxxxx> wrote: > On Tue, May 17, 2016 at 10:07:16AM +0200, Lars Schneider wrote: > >> I think that is pretty much the problem. Here is what is happening: >> >> 1. git-p4 imports all changelists for the "main" branch >> >> 2. git-p4 starts to import a second branch and creates a fast-import >> "progress checkpoint". This triggers: >> >> --> fast-import.c: cycle_packfile >> --> fast-import.c: end_packfile >> --> fast-import.c: loosen_small_pack >> >> At this point we have no packfile anymore. >> >> 3. git-p4 sends the command "commit refs/remotes/p4/depot/branch2" >> to fast-import in order to create a new branch. This triggers: >> >> --> fast-import.c: parse_new_commit >> --> fast-import.c: load_branch >> --> fast-import.c: load_tree >> >> "load_tree" calls "find_object" and the result has a "pack_id" of 0. >> I think this indicates that the object is in the packfile. Shouldn't >> the "pack_id" be "MAX_PACK_ID" instead? >> >> myoe = find_object(sha1); >> if (myoe && myoe->pack_id != MAX_PACK_ID) { > > Thanks for the analysis. I think this is definitely the problem. After > fast-import finalizes a packfile (either at the end of the program or > due to a checkpoint), it never discards its internal mapping of objects > to pack locations. It _has_ to keep such a mapping before the pack is > finalized, because git's regular object database doesn't know about it. > But afterwards, it should be able to rely on the object database. > > Retaining the mapping at the end of the program is obviously OK because > we won't make any more lookups in it. > > Retaining it at a checkpoint when we keep the packfile is OK, because we > can (usually) still access that packfile (the exception would be if > somebody runs "git repack" while fast-import is running). > > But obviously a checkpoint where we throw away the pack needs to clear > that mapping. > > The patch below probably makes your case work, but there are a lot of > open questions: Confirmed. The offending tests pass with your patch. > 1. Should we always discard the mapping, even when not loosening > objects? I can't think of a real downside to always using git's > object lookups. > > 2. Can we free memory associated with object_entry structs at this > point? They won't be accessible via the hash, but might other bits > of the code have kept pointers to them? > > I suspect it may screw up the statistics that fast-import prints at > the end, but that's a minor point. > > 3. I notice that a few other structs (branch and tag) hold onto the > pack_id, which will now point to a pack we can't access. Does this > matter? I don't think so, because checkpoint() seems to dump the > branches and tags. > 4. In general, should we be loosening objects at checkpoints at all? > > I guess it is probably more efficient than creating a bunch of > little packs. And it should probably not happen much at all outside > of tests (i.e., callers should generally checkpoint after an > appreciable amount of work is done). I am not too familiar with the code and therefore I can't give a good answer to your questions. However, I noticed that Shawn (CC'ed) wrote a lot of fast-import code. Maybe he can help? >From my point of view little packs are no problem. I run fast-import on a dedicated migration machine. After fast-import completion I run repack [1] before I upload the repo to its final location. Thanks, Lars [1] https://gcc.gnu.org/ml/gcc/2007-12/msg00165.html > > diff --git a/fast-import.c b/fast-import.c > index 0e8bc6a..9bfbfb0 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -597,6 +597,22 @@ static struct object_entry *insert_object(unsigned char *sha1) > return e; > } > > +static void clear_object_table(void) > +{ > + unsigned int h; > + for (h = 0; h < ARRAY_SIZE(object_table); h++) { > + /* > + * We can't individually free objects here > + * because they are allocated from a pool. > + */ > + object_table[h] = NULL; > + } > + /* > + * XXX maybe free object_entry_pool here, > + * or might something still be referencing them? > + */ > +} > + > static unsigned int hc_str(const char *s, size_t len) > { > unsigned int r = 0; > @@ -1035,6 +1051,9 @@ discard_pack: > pack_data = NULL; > running = 0; > > + /* The objects are now available via git's regular lookups. */ > + clear_object_table(); > + > /* We can't carry a delta across packfiles. */ > strbuf_release(&last_blob.data); > last_blob.offset = 0; -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html