Re: [RFC/PATCH] replace: add --graft option

Jeff King <peff@xxxxxxxx> · Mon, 19 May 2014 13:35:05 -0400

On Mon, May 19, 2014 at 10:25:10AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > It might make sense to just teach parse_commit_header to be a little
> > more thorough; it has to read past those lines anyway to find the author
> > and committer lines, so it would not be much more expensive to note
> > them.  And then of course the code needs to be pulled out of the
> > pretty-printer and made generally accessible.
> 
> I notice that you said "might" above.

Yeah. I think having a reusable parser is definitely a good thing. I'm
not sure if it's worth the massive amount of refactoring that would be
required for this particular use case.

> > That's more or less what Christian's function is doing, though it
> > assumes things like that the parents come immediately after the tree,
> > and that they are all in a group. Those are all true for objects created
> > by git, but I think we can be a little flexible.
> 
> The headers up to committer are cast in stone in their ordering, and
> I do not immediately see how loosening it would be beneficial.
> 
> Unless you are trying to give users a new way to record exactly the
> same commit in twenty-four (or more) ways with their own object
> names, that is ;-)

Sorry, I didn't mean to imply that people can do what they want with
that ordering. Implementations that reorder the headers are stupid and
wrong, and should be fixed.

BUT. I really don't like making these assumptions or doing ad-hoc
parsing all through the code. We _do_ see quirky, wrong objects, and
want to handle them sanely and consistently. That's hard to do when
there are parsers sprinkled throughout the code which handle each case
slightly differently. I don't recall seeing this with header ordering,
but I know that broken ident lines have caused us headaches in the past,
and I'm happy that we've (mostly) settled on the code in
split_ident_line to handle this.

Things like:

  +       /* find existing parents */
  +       parent_start = buf.buf;
  +       parent_start += 46; /* "tree " + "hex sha1" + "\n" */
  +       parent_end = parent_start;

scare me. Is buf actually 46 bytes long? If not, we read past the end of
the buffer. What if it contains something besides "tree <sha1>\n" at the
beginning? We don't even notice!

The version I posted elsewhere in the thread at least operates on a
line-by-line basis, and tries to verify its assumptions (and barfs if
not). It's still doing its own parsing, but at least it's keeping its
assumptions on the object format to a minimum ("drop old parent lines",
"add new ones after tree, and barf if there's not exactly one tree
line").

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html