On Tue, May 17, 2016 at 10:07:16AM +0200, Lars Schneider wrote: > I think that is pretty much the problem. Here is what is happening: > > 1. git-p4 imports all changelists for the "main" branch > > 2. git-p4 starts to import a second branch and creates a fast-import > "progress checkpoint". This triggers: > > --> fast-import.c: cycle_packfile > --> fast-import.c: end_packfile > --> fast-import.c: loosen_small_pack > > At this point we have no packfile anymore. > > 3. git-p4 sends the command "commit refs/remotes/p4/depot/branch2" > to fast-import in order to create a new branch. This triggers: > > --> fast-import.c: parse_new_commit > --> fast-import.c: load_branch > --> fast-import.c: load_tree > > "load_tree" calls "find_object" and the result has a "pack_id" of 0. > I think this indicates that the object is in the packfile. Shouldn't > the "pack_id" be "MAX_PACK_ID" instead? > > myoe = find_object(sha1); > if (myoe && myoe->pack_id != MAX_PACK_ID) { Thanks for the analysis. I think this is definitely the problem. After fast-import finalizes a packfile (either at the end of the program or due to a checkpoint), it never discards its internal mapping of objects to pack locations. It _has_ to keep such a mapping before the pack is finalized, because git's regular object database doesn't know about it. But afterwards, it should be able to rely on the object database. Retaining the mapping at the end of the program is obviously OK because we won't make any more lookups in it. Retaining it at a checkpoint when we keep the packfile is OK, because we can (usually) still access that packfile (the exception would be if somebody runs "git repack" while fast-import is running). But obviously a checkpoint where we throw away the pack needs to clear that mapping. The patch below probably makes your case work, but there are a lot of open questions: 1. Should we always discard the mapping, even when not loosening objects? I can't think of a real downside to always using git's object lookups. 2. Can we free memory associated with object_entry structs at this point? They won't be accessible via the hash, but might other bits of the code have kept pointers to them? I suspect it may screw up the statistics that fast-import prints at the end, but that's a minor point. 3. I notice that a few other structs (branch and tag) hold onto the pack_id, which will now point to a pack we can't access. Does this matter? I don't think so, because checkpoint() seems to dump the branches and tags. 4. In general, should we be loosening objects at checkpoints at all? I guess it is probably more efficient than creating a bunch of little packs. And it should probably not happen much at all outside of tests (i.e., callers should generally checkpoint after an appreciable amount of work is done). diff --git a/fast-import.c b/fast-import.c index 0e8bc6a..9bfbfb0 100644 --- a/fast-import.c +++ b/fast-import.c @@ -597,6 +597,22 @@ static struct object_entry *insert_object(unsigned char *sha1) return e; } +static void clear_object_table(void) +{ + unsigned int h; + for (h = 0; h < ARRAY_SIZE(object_table); h++) { + /* + * We can't individually free objects here + * because they are allocated from a pool. + */ + object_table[h] = NULL; + } + /* + * XXX maybe free object_entry_pool here, + * or might something still be referencing them? + */ +} + static unsigned int hc_str(const char *s, size_t len) { unsigned int r = 0; @@ -1035,6 +1051,9 @@ discard_pack: pack_data = NULL; running = 0; + /* The objects are now available via git's regular lookups. */ + clear_object_table(); + /* We can't carry a delta across packfiles. */ strbuf_release(&last_blob.data); last_blob.offset = 0; -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html