On Tue, Oct 22, 2019 at 9:48 PM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote: > > First, the commit message could probably be reordered to start with the > main point (reusing of packfiles at object granularity instead of > allowing reuse of only one contiguous region). To specific points: Done in my current version. > > The dynamic array of `struct reused_chunk` is useful > > because we need to know not just the number of zero bits, > > but the accumulated offset due to missing objects. > > The number of zero bits is irrelevant, I think. (I know I introduced > this idea, but at that time I was somehow confused that the offset was a > number of objects and not a number of bytes.) Ok, I removed "not just the number of zero bits,". > > If a client both asks for a tag by sha1 and specifies > > "include-tag", we may end up including the tag in the reuse > > bitmap (due to the first thing), and then later adding it to > > the packlist (due to the second). This results in duplicate > > objects in the pack, which git chokes on. We should notice > > that we are already including it when doing the include-tag > > portion, and avoid adding it to the packlist. > > I still don't understand this part. Going off what's written in the > commit message here, it seems to me that the issue is that (1) an object > can be added to the reuse bitmap without going through to_pack, and (2) > this is done when the client asks for a tag by SHA-1. But all objects > when specified by SHA-1 go through to_pack, don't they? That's how Peff explained it in response to one of your previous messages. Peff, would you mind answering Jonathan's questions? At the end of the commit message I added the following which might help though: >> No tests, because git does not actually exhibit this "ask >> for it and also include-tag" behavior. We do one or the >> other on clone, depending on whether --single-branch is set. >> However, libgit2 does both. So my wild guess sould be that maybe the reason is that some of this code was shared (or intended to be shared) with libgit2? > On to the review of the code. The ordering of - and + lines are > confusing, so I have just removed all - lines. > > > +/* > > + * Record the offsets needed in our reused packfile chunks due to > > + * "gaps" where we omitted some objects. > > + */ > > +static struct reused_chunk { > > + off_t start; > > + off_t offset; > > +} *reused_chunks; > > +static int reused_chunks_nr; > > +static int reused_chunks_alloc; > > A chunk is a set of objects from the original packfile that we are > reusing; all objects in a chunk have the same relative position in the > original packfile and the generated packfile. (This does not mean that > the bytes are exactly the same or, even, that the last object in the > chunk has the same size in the original packfile and the generated > packfile.) > > "start" is the offset of the first object of the chunk in the original > packfile. > > "offset" is "start" minus the offset of the first object of the chunk in > the generated packfile. (I think it is easier to understand if this was > "new minus old" instead of "old minus new", that is, the negation of its > current value.) > > So I would prefer: > > /* > * A reused set of objects. All objects in a chunk have the same > * relative position in the original packfile and the generated > * packfile. > */ > struct reused_chunk { > /* The offset of the first object of this chunk in the original > * packfile. */ > off_t original; > /* The offset of the first object of this chunk in the generated > * packfile minus "original". */ > off_t difference; > } Ok, I have used that though it introduces some discrepancies, as record_reused_object() for example still has an "offset" argument. > > +static size_t write_reused_pack_verbatim(struct hashfile *out, > > + struct pack_window **w_curs) > > +{ > > + size_t pos = 0; > > + > > + while (pos < reuse_packfile_bitmap->word_alloc && > > + reuse_packfile_bitmap->words[pos] == (eword_t)~0) > > + pos++; > > + > > + if (pos) { > > + off_t to_write; > > + > > + written = (pos * BITS_IN_EWORD); > > + to_write = reuse_packfile->revindex[written].offset > > + - sizeof(struct pack_header); > > + > > + record_reused_object(sizeof(struct pack_header), 0); > > Probably worth a comment, since we're recording one chunk, not one > object. Ok, added a comment. > The code is correct, though - one big chunk with original offset > as written, and no difference between original and generated offsets. > > @@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth) > > > > if (nr_deltas && n > 1) { > > unsigned nr_done = 0; > > + > > if (progress) > > progress_state = start_progress(_("Compressing objects"), > > nr_deltas); > > Unnecessary added line - please remove. We often separate variable declaration from the rest of function code with a blank line, so I don't think the added line is completely unnecessary. It helps make the code more standard and therefore more readable, so I left it. You might say that it should be in a preparatory patch, but a preparatory patch for just such a small change is a bit wasteful. I think it's ok to add this blank line "while at it" in this patch. > On to pack-bitmap.c. > > > +static void try_partial_reuse(struct bitmap_index *bitmap_git, > > + size_t pos, > > + struct bitmap *reuse, > > + struct pack_window **w_curs) > > { > > + struct revindex_entry *revidx; > > + off_t offset; > > + enum object_type type; > > + unsigned long size; > > + > > + if (pos >= bitmap_git->pack->num_objects) > > + return; /* not actually in the pack */ > > Is this case possible? try_partial_reuse() is only called when there is > a 1 at the bit location. I'd like Peff to answer here too, as I don't think I fully understand what's going on here. > > + /* Don't mark objects not in the packfile */ > > + if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD) > > + i = bitmap_git->pack->num_objects / BITS_IN_EWORD; > > Why is this necessary? Let's say we have 42 fully-1 words, and therefore > i is 42. Shouldn't we allocate 42 words in our reuse bitmap? I'd also prefer Peff to answer here. [...] > And thanks for this patch - other than the difficult-to-understand parts > I described above, I found the overall flow easy to discern. > > I presume tests would also need to be written. One way is to introduce > yet another test variable, but maybe that is overkill. This part of the commit message might again help understand why there is no test: >> No tests, because git does not actually exhibit this "ask >> for it and also include-tag" behavior. We do one or the >> other on clone, depending on whether --single-branch is set. >> However, libgit2 does both. But I'd be again happy if Peff could confirm. Thanks for another review, Christian.