Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits

Jeff King <peff@xxxxxxxx> · Fri, 6 Sep 2019 13:04:17 -0400

On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:

> > diff --git a/revision.h b/revision.h
> > index 4134dc6029..5c0b831b37 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -33,7 +33,7 @@
> >  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
> >  
> >  #define TOPO_WALK_EXPLORED	(1u<<27)
> > -#define TOPO_WALK_INDEGREE	(1u<<28)
> > +#define TOPO_WALK_INDEGREE	(1u<<24)
> 
> As an aside, these flag bit modifications look fine, but would need to
> be explained. I'm guessing that since you are adding a bit of data
> to struct object you want to avoid increasing the struct size across
> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
> the cache-entry flags, so this seems safe.)

Yeah, I'd definitely break this up into several commits with explanation
(though see an alternate I posted that just uses the parsed flag without
any new bits).

Bit 24 isn't used according to the table in object.h, which is
_supposed_ to be the source of truth, though of course there's no
compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
part of ALL_REV_FLAGS?)
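
To illustrate what a compiler-level check could look like, here is a
minimal sketch using C11's _Static_assert. The two flag values come from
the diff quoted above; the FLAGS_OVERLAP macro and the
check_topo_flags() helper are hypothetical names for this example and do
not exist in git.

```c
#include <assert.h>

/* Values as proposed in the quoted diff. */
#define TOPO_WALK_EXPLORED	(1u<<27)
#define TOPO_WALK_INDEGREE	(1u<<24)

/* Hypothetical helper: true if two flag masks share any bit. */
#define FLAGS_OVERLAP(a, b)	(((a) & (b)) != 0)

/* Fails the build if the two flags ever collide. */
_Static_assert(!FLAGS_OVERLAP(TOPO_WALK_EXPLORED, TOPO_WALK_INDEGREE),
	       "TOPO_WALK_* flags must not share bits");

/* Runtime version of the same check, handy in a test. */
static void check_topo_flags(void)
{
	assert(!FLAGS_OVERLAP(TOPO_WALK_EXPLORED, TOPO_WALK_INDEGREE));
}
```

This only catches collisions between flags that are listed in the same
assertion, so it complements a central allocation table rather than
replacing it.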

And yes, the goal was to keep things to the 32-bit boundary. But in the
course of this, I discovered something interesting: 64-bit systems are
now padding this up to the 8-byte boundary!

The culprit is the bump of GIT_MAX_RAWSZ to accommodate sha256. Before
then, our object_id was 20 bytes for sha1. Adding 4 bytes of flags still
left us at 24 bytes, which is both 4- and 8-byte aligned.

With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes
takes us to 36, and then 8-byte aligning the struct takes us to 40
bytes, with 4 bytes of wasted padding.
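
The arithmetic above can be sketched in C. The struct names here are
hypothetical stand-ins, not git's actual definitions, and bitfield
layout is implementation-defined, so the sizes assume a common ABI where
the three bitfields pack into 4 bytes. In this sketch the padding shows
up when the header is followed by a pointer-aligned member.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an object header; only the sizes matter. */
struct hdr_sha1 {
	unsigned parsed:1, type:3, flags:28;	/* 4 bytes on common ABIs */
	unsigned char hash[20];
};

struct hdr_sha256 {
	unsigned parsed:1, type:3, flags:28;
	unsigned char hash[32];
};

/* A pointer directly after the header, as in a commit-like struct. */
struct commit_sha1 { struct hdr_sha1 object; void *next; };
struct commit_sha256 { struct hdr_sha256 object; void *next; };

/* Round "size" up to the next multiple of "align" (a power of two). */
static size_t round_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/* Padding the compiler inserts between the header and "next". */
static size_t pad_sha1(void)
{
	return offsetof(struct commit_sha1, next) - sizeof(struct hdr_sha1);
}

static size_t pad_sha256(void)
{
	return offsetof(struct commit_sha256, next) - sizeof(struct hdr_sha256);
}

static void check_layout(void)
{
	assert(round_up(24, 8) == 24);	/* sha1: no padding needed */
	assert(round_up(36, 8) == 40);	/* sha256: 4 bytes wasted */
	assert(pad_sha1() == 0);
}
```

On a typical 64-bit build pad_sha256() would come out as 4, and on a
32-bit build as 0.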

I'm sorely tempted to use this as an opportunity to move commit->index
into "struct object". That would actually shrink commit object sizes by
4 bytes, and would let all object types do the commit-slab trick to
store object data with constant-time lookup. This would make it possible
to migrate some uses of flags to per-operation bitfields (so e.g., two
traversals would have their _own_ flag data, and wouldn't risk stomping
on each other's bits).
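
A minimal sketch of that idea, assuming every object carries an index:
each traversal owns a side array keyed by that index, so its flag bits
are private. All names here (struct obj, flag_slab, slab_set, slab_get)
are hypothetical, and this uses one flat array for brevity rather than
the actual commit-slab implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object header carrying a unique index. */
struct obj {
	unsigned index;
};

static unsigned next_obj_index;

static void obj_init(struct obj *o)
{
	o->index = next_obj_index++;
}

/* One of these per traversal: flag storage private to that operation. */
struct flag_slab {
	unsigned char *flags;
	size_t alloc;
};

/* Grow the slab so that "index" is a valid, zero-initialized slot. */
static void slab_grow(struct flag_slab *s, size_t index)
{
	size_t n = s->alloc ? s->alloc : 64;
	unsigned char *p;

	if (index < s->alloc)
		return;
	while (n <= index)
		n *= 2;
	p = realloc(s->flags, n);
	if (!p)
		abort();
	memset(p + s->alloc, 0, n - s->alloc);
	s->flags = p;
	s->alloc = n;
}

static void slab_set(struct flag_slab *s, const struct obj *o,
		     unsigned char mask)
{
	slab_grow(s, o->index);
	s->flags[o->index] |= mask;
}

/* Slots that were never touched read as zero. */
static unsigned char slab_get(const struct flag_slab *s, const struct obj *o)
{
	return o->index < s->alloc ? s->flags[o->index] : 0;
}

/* Two traversals marking the same object do not see each other's bits. */
static void demo_two_walks(void)
{
	struct obj o;
	struct flag_slab walk_a = { NULL, 0 }, walk_b = { NULL, 0 };

	obj_init(&o);
	slab_set(&walk_a, &o, 0x1);
	assert(slab_get(&walk_a, &o) == 0x1);
	assert(slab_get(&walk_b, &o) == 0);
	free(walk_a.flags);
	free(walk_b.flags);
}
```

Lookup is a single array access, which is the constant-time property
mentioned above.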

The one downside would be that the index space would become more sparse.
I.e., right now if you're only storing things for commits in a slab, you
know that every slot you allocate is for a commit. But if we allocate an
index for each object, then the commits are less likely to be together
(so wasted memory and worse cache performance). That might be solvable
by assigning a per-type index (with a few hacks to handle OBJ_NONE).
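
The per-type index could be sketched as one counter per object type, so
that commit indices stay dense no matter how many trees or blobs are
allocated in between. The names below are hypothetical, and the sketch
assumes the type is known at allocation time, so it does not address
the OBJ_NONE case.

```c
#include <assert.h>

/* Hypothetical type enum; TYPE_NR is the number of real types. */
enum obj_type { TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG, TYPE_NR };

/* One "next index" counter per object type. */
struct index_alloc {
	unsigned next[TYPE_NR];
};

static unsigned alloc_index(struct index_alloc *a, enum obj_type type)
{
	return a->next[type]++;
}

/* Commits get 0 and 1 even though a tree and a blob came in between. */
static void demo_per_type_index(void)
{
	struct index_alloc a = { { 0 } };

	assert(alloc_index(&a, TYPE_COMMIT) == 0);
	assert(alloc_index(&a, TYPE_TREE) == 0);
	assert(alloc_index(&a, TYPE_BLOB) == 0);
	assert(alloc_index(&a, TYPE_COMMIT) == 1);
}
```

A commit-only slab indexed this way would need two slots for the four
objects in the demo, where a single shared counter would need four.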

Anyway, all of that is rather off the topic of this discussion.

-Peff