Re: Compatibility between git.git and jgit

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Fri, 1 May 2009 18:59:50 -0700

Nicolas Pitre <nico@xxxxxxx> wrote:
> On Fri, 1 May 2009, Shawn O. Pearce wrote:
> 
> > On an unrelated note, someone asked me recently, how do we ensure
> > compatibility in implementations between git.git and jgit?
> 
> Well... this is not exactly easy.  As I said in the past 
> (http://marc.info/?l=git&m=121035043412788&w=2), I think that the C 
> version must remain the reference with regards to protocols and on-disk 
> data structures.

I agree fully.

> If people go wild with JGit and start making changes 
> to data structures then it simply won't be Git compatible anymore and 
> the user base will get fragmented.

Agree.  We may see some prototyping happen in JGit first on some
topics, and JGit may even support something earlier than git.git,
e.g. JGit has an amazon-s3:// transport that git.git doesn't have.
But it also isn't widely used.

> A formal compatibility test suite would imply that every Git 
> reimplementation should be compatible with the reference C version.  
> You could add some tests in your test suite which are performed in 
> parallel using JGit and the C git, and make sure that the produced 
> results are identical, etc.

Yea, and to some extent we try to do that already in JGit, but our
tests aren't complete enough in that area.

> But to which extent should the C version remain backward compatible with 
> other implementations?  Let's suppose a future protocol extension is 
> made and old unsuspecting C clients work just fine but some other 
> implementation crashes with it?

This is what I think scares both me and the folks that have
recently asked me about compatibility.

If JGit gets a broader user base, and suddenly it stops working
against a newer C git-daemon because of a protocol change, those
users are going to be pissed.  It's no worse than the "github can't
ever upgrade past 1.6.1" issue we had not too long ago.

I think we're doing better these days about embedding file format
version numbers into files (e.g. pack idx v2) to help alert older
clients that the format is different.  But we also have something
of a history of looking for "holes" in older C git parsers in
order to wedge in new features where we didn't plan for them in
the first place.  E.g. the protocol capability slots we have now.
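
To make the "hole" concrete, here is a rough sketch (Python, just for
brevity) of how the capability list rides along on the first advertised
ref line, after a NUL byte.  As far as I recall, a parser that treats
the line as a C string stops at the NUL and never sees the extra data.
The function names and the sample ref are made up for illustration;
this is not code from git.git or JGit.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_line(buf: bytes, pos: int = 0):
    """Return (payload, next_pos); payload is None for a flush packet (0000)."""
    length = int(buf[pos:pos + 4], 16)
    if length == 0:
        return None, pos + 4
    return buf[pos + 4:pos + length], pos + length


def split_ref_and_caps(payload: bytes):
    """Split '<sha> <ref>\\0<cap> <cap>...' into (sha, ref, set of caps)."""
    line = payload.rstrip(b"\n")
    ref_part, _, cap_part = line.partition(b"\0")
    sha, _, name = ref_part.partition(b" ")
    caps = set(cap_part.decode("ascii").split())
    return sha.decode("ascii"), name.decode("ascii"), caps


def c_string_view(payload: bytes) -> bytes:
    """What a strlen()-style parser would see: everything before the NUL."""
    return payload.split(b"\0", 1)[0]


# Hypothetical first line of a ref advertisement.
first = pkt_line(b"1" * 40 + b" refs/heads/master\0multi_ack thin-pack\n")
payload, _ = read_pkt_line(first)
sha, name, caps = split_ref_and_caps(payload)
```

A client that knows about the slot reads the capabilities from `caps`;
one that only looks at `c_string_view(payload)` carries on as if nothing
was there.  That works only as long as every implementation happens to
ignore the trailing bytes in the same way.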

I think that as reimplementations become more popular, we need to
rely less on extending things by exploiting parser quirks in older
C git.git code, and rely more on at least explicit version markers
that everyone can work with.
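
By way of contrast, an explicit version marker is trivial for any
implementation to check.  A minimal sketch, assuming the pack idx v2
layout of a 4-byte magic ("\377tOc") followed by a 4-byte big-endian
version number, with v1 files having no header at all.  The function
names and the set of supported versions are illustrative only.

```python
import struct

# Magic that introduces a versioned pack index ("\377tOc").
PACK_IDX_MAGIC = b"\xfftOc"


def pack_idx_version(header: bytes) -> int:
    """Return the pack index version implied by the first 8 bytes."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of the index file")
    if header[:4] != PACK_IDX_MAGIC:
        # No magic: assume the original v1 layout, which starts
        # directly with the fan-out table.
        return 1
    (version,) = struct.unpack(">I", header[4:8])
    return version


def check_supported(header: bytes, supported=(1, 2)) -> int:
    """Refuse cleanly instead of misparsing a format we don't know."""
    version = pack_idx_version(header)
    if version not in supported:
        raise ValueError(f"unsupported pack index version {version}")
    return version
```

An older reader that hits a future version fails with a clear error
rather than depending on whatever its parser happens to tolerate.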

> And the reference implementation cannot be held back because 
> of bugs in all alternative implementations.

I agree.  A bug is a bug.  But I'd really like to get away from the
trend where we exploit bugs in older C git.git implementations to
add new functionality, because maybe JGit doesn't have that same
bug and will fall flat on its face with that exploit.

> As long as they're futzing^Wdeveloping on top of Jgit then 
> interoperability shouldn't be at risk.  If people would start adding new 
> object types and pack formats and the like without obtaining a consensus 
> with people around the C version then I might get extremely worried (and 
> pissed) though.

That's why JGit is BSD, so everyone can use the one f'king library
and not risk fragmenting the Java market further.

But yea, I'd be really pissed too if someone hacked up JGit and made
it incompatible with anything else.  It's a risk that the liberal
BSD license permits.

I'm really sort of hoping that the development momentum around
git.git, with JGit trying to keep up, will keep such people coming
back to the canonical JGit for updates, forcing them to give back any
hacks^Wimprovements they have made.  If the improvements really are
worthwhile, they can be easily ported over to C before they become
widely used in JGit.

> One defensive approach we could adopt is to use a capability slot to 
> identify the software version of each peer involved in the network 
> communication.  The advantage would be for a later Git version to avoid 
> doing some things that are known to break with client X or Y.  Of course 
> even such a scheme can be abused and misused, like on some web sites if 
> you don't have the "right" browser, leading some of them to allow faking 
> the User-Agent string, etc.  But maybe the upsides are more important 
> than the downsides.  This doesn't help with on-disk interoperability, 
> but this is probably less important than communication interoperability.

Blargh.  I'm with you about the whole User-Agent mess.

Asking clients and servers to identify with implementation and
version markers might be useful for analysis of who-is-using-what,
but I don't think it's a good way for the peers to negotiate what
functionality to enable or disable, or what bug workarounds
to use.  Reminds me of the Apache hack during output to work around
an HTTP header parsing bug in Netscape 2 when the "\r\n" pair was
exactly at byte 256 in the stream.  *shudder*

FWIW, an EGit user recently complained that some random Git hosting
site they were using couldn't work with EGit, but EGit worked fine
with other sites, e.g. GitHub.  Apparently this site's SSH forced command
filter script didn't like EGit asking for "git upload-pack 'path.git'".

It's not strictly a Git protocol issue, how the client launches
the remote process over SSH, but this random hosting site was
apparently relying on C git's current calling convention of
"git-upload-pack 'path.git'".

Long story short, I claimed it was the hosting site's bug.  :-)
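
For what it's worth, a filter script can accept both spellings without
much effort.  The following is only a sketch of the idea, not what that
hosting site (or anyone else) actually runs; the function name and the
allowed-service list are assumptions made for the example.

```python
import shlex

# Services a hypothetical filter is willing to launch.
ALLOWED = {"git-upload-pack", "git-receive-pack", "git-upload-archive"}


def parse_git_ssh_command(original: str):
    """Return (service, path) for a Git command line, or raise ValueError."""
    try:
        words = shlex.split(original)
    except ValueError as exc:
        raise ValueError(f"unparsable command: {exc}") from None

    # Normalise "git upload-pack 'path.git'" to the dashed form
    # "git-upload-pack 'path.git'" so both spellings are treated alike.
    if len(words) == 3 and words[0] == "git":
        words = ["git-" + words[1], words[2]]

    if len(words) != 2 or words[0] not in ALLOWED:
        raise ValueError("command not allowed")
    return words[0], words[1]


dashed = parse_git_ssh_command("git-upload-pack 'path.git'")
spaced = parse_git_ssh_command("git upload-pack 'path.git'")
```

With OpenSSH forced commands the requested command line arrives in the
SSH_ORIGINAL_COMMAND environment variable, so a real script would feed
that string in and would still need to validate the repository path
before running anything.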

-- 
Shawn.