If I were redoing git from scratch...

Junio C Hamano <junkio@xxxxxxx> · Sat, 04 Nov 2006 03:34:42 -0800

I've been thinking about these for a while in the back of my
head, and thought it might be better to start writing them down.

A lot of these issues involve the UI, which means fixes will not
materialize without breaking existing uses, but if we know in
advance what we will be aiming for, maybe we will find a smoother
path to reach there.

* Core data structure

I consider the on-disk data structures and on-wire protocol we
currently use to be sane, and there is not much to fix.  There are
certainly things to be enhanced (64-bit .idx offset, for
example), but I do not think there is anything fundamentally
broken that needs to be reworked.

I have the same feeling for in-core data structures in general,
except for a few issues.

The biggest one is that we use too many static (worse, function
scope static) variables that live for the life of the process,
which makes many things very nice and easy ("run-once and let
exit clean up the mess" mentality), but because of this it
becomes awkward to do certain things.  Examples are:

 - Multiple invocations of merge-bases (needs clearing the
   marks left on commit objects by earlier traversal),

 - Creating a new pack and immediately starting to use it inside
   the process itself (prepare_packed_git() is call-once, and we
   have hacks to cause it to re-read the packs in many places),

 - Visiting more than one repository within one process
   (many per-repository variables in sha1_file.c are static
   variables and there is no "struct repository" that we can
   re-initialize in one go),

 - The object layer holds onto all parsed objects
   indefinitely.  Because the object store at the philosophical
   level represents the global commit ancestry DAG, there is
   no inherent reason to have more than one instance of
   object.c::obj_hash even if we visit more than one
   repository in a process, but if the two repositories are
   unrelated, objects from the repository we were looking at
   only waste memory after switching to a different
   repository.

 - The diffcore is not run-once but it is run-one-at-a-time.
   This is easy to fix if needed, though.
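
To make the "struct repository" point above a bit more concrete,
here is a minimal sketch of gathering per-repository state into one
structure that can be reset in one go.  The field names and the
helpers (repository_open, repository_close, repository_prepare_packs,
repository_reprepare_packs) are assumptions made up for
illustration and are not taken from git's source; the actual pack
scanning is elided and replaced by a count passed in by the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-repository pack bookkeeping. */
struct pack_state {
	int nr;        /* number of packs currently known */
	int prepared;  /* replaces a call-once static flag */
};

/* Hypothetical container for what is now file-scope static state. */
struct repository {
	char *git_dir;            /* e.g. "/path/to/repo/.git" */
	struct pack_state packs;  /* per-repository, not per-process */
};

/* Point "repo" at a repository, discarding any earlier contents. */
int repository_open(struct repository *repo, const char *git_dir)
{
	size_t len;

	assert(repo && git_dir);
	memset(repo, 0, sizeof(*repo));
	len = strlen(git_dir) + 1;
	repo->git_dir = malloc(len);
	if (!repo->git_dir)
		return -1;
	memcpy(repo->git_dir, git_dir, len);
	return 0;
}

/* Forget everything we learned so the struct can be reused. */
void repository_close(struct repository *repo)
{
	assert(repo);
	free(repo->git_dir);
	memset(repo, 0, sizeof(*repo));
}

/* Stand-in for a call-once pack scan; only the flag logic is shown. */
void repository_prepare_packs(struct repository *repo, int packs_found)
{
	assert(repo);
	if (repo->packs.prepared)
		return;
	repo->packs.nr = packs_found;
	repo->packs.prepared = 1;
}

/* After writing a new pack: drop the flag and scan again. */
void repository_reprepare_packs(struct repository *repo, int packs_found)
{
	assert(repo);
	repo->packs.prepared = 0;
	repository_prepare_packs(repo, packs_found);
}
```

With something along these lines, visiting a second repository
would be a repository_close() followed by another
repository_open(), instead of hunting down each static variable.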

There are some other minor details but they are not as
fundamental.  Examples are:

 - The revision traversal is nicely done but one gripe I have is
   that it is focused on painting commits into two (and only
   two) classes: interesting and uninteresting.  If we allowed
   more than one (especially, an arbitrary number of) kinds of
   interesting, answering questions like "which branches does
   this commit belong to?  which tagged versions is this commit
   already included in?"  would become easier and more efficient.
   show-branch has machinery to do that for a handful but it
   could be unified with the revision.c traversal machinery.

 - We have at least three independent implementations of
   pathspec match logic and two different semantics (one is
   component-prefix match, the other is shell glob), and they
   should be unified.  You can say "git grep foo -- 't/t5*'" but
   not "git diff otherbranch -- 't/t5*'".
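
As an illustration of the "more than one kind of interesting" idea
from the first item above, here is a minimal sketch that gives each
branch or tag tip its own bit and paints that bit down the
ancestry.  The toy_commit structure and the function names are
hypothetical and much simpler than what revision.c actually uses;
a real implementation would walk with a work queue rather than
recursion, and would not be limited to the number of bits in an
unsigned long.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2

/* Hypothetical, simplified commit; not git's struct commit. */
struct toy_commit {
	struct toy_commit *parent[MAX_PARENTS];  /* NULL when absent */
	unsigned long colors;  /* bit i set: reachable from tip i */
};

/*
 * Paint "c" and all of its ancestors with "color".  The walk stops
 * where the color is already present, so shared history is visited
 * only once per color.
 */
void paint_down(struct toy_commit *c, unsigned long color)
{
	int i;

	if (!c || (c->colors & color) == color)
		return;
	c->colors |= color;
	for (i = 0; i < MAX_PARENTS; i++)
		paint_down(c->parent[i], color);
}

/* Give each of the "nr" tips its own bit and paint its history. */
void paint_tips(struct toy_commit **tips, int nr)
{
	int i;

	assert(nr >= 0 && (size_t)nr <= sizeof(unsigned long) * 8);
	for (i = 0; i < nr; i++)
		paint_down(tips[i], 1UL << i);
}

/* Does the history of tip number "tip" include commit "c"? */
int contained_in(const struct toy_commit *c, int tip)
{
	return (int)((c->colors >> tip) & 1UL);
}
```

After one paint_tips() call over all branch tips, "which branches
does this commit belong to?" is answered by reading the bits on
that commit, without a separate traversal per branch.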

* Fetch/Push/Pull/Merge confusion

Everybody hates the fact that the inverse of push is fetch, not
pull, and that merge is not a usual Porcelain (while it _is_ usable
as a regular UI command, it was originally done as a lower layer
helper to the "pull" Porcelain and has a strange parameter order
with a seemingly useless HEAD parameter in the middle).

If I were doing git from scratch, I would probably avoid any of
the above words that have loaded meanings from other SCMs.
Perhaps...

 - "git download" would download changes made in the other end
   since we contacted them the last time and would not touch our
   branches or working tree (associate the word with getting
   tarballs -- people would not expect the act of downloading a
   tarball to touch their working tree or local history;
   untarring it does).  Whether the end-user should be required
   to explicitly say "download" is a different story; I am
   leaning towards making it more or less transparent.

 - "git upload" to upload our changes to the other end -- that
   is what "git push" currently does.

 - "git join" to merge another branch into the current branch,
   with the "per branch configuration" frills to decide what the
   default for "another branch" is based on what the current
   branch is, etc.

* Less visible "remoteness" of remote branches

If I were doing git from scratch, I would probably have made
separate remotes _the_ only layout, except I might have opted to
make "remotes" even less visible and treat it as merely a
cache of "the branch tips and tags we saw when we connected over
the network to look at them the last time".

So "git branch --list $remote" might contact the remote over the
network or use a cached version.  When you think about it, it is
not all that different from always contacting the remote end --
the remote end may have mirror propagation delays, and your
local instance of git caching and not contacting the remote all
the time introduces a similar delay on your end which is (1) not
a big deal, and (2) unlike the remote mirror delay, controllable
on your end.  For example, you could force it to update the
cache by "git download $remote; git branch --list $remote".
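
The "use the cache unless it is too stale" decision could be as
small as the sketch below.  The remote_cache structure, the
max_age knob and both function names are assumptions for
illustration only; nothing like them is proposed above beyond the
idea of a controllable delay.  The arithmetic treats time_t as a
count of seconds, as POSIX does.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical cache record for one remote. */
struct remote_cache {
	time_t last_contact;  /* last successful contact; 0 means never */
	time_t max_age;       /* staleness we tolerate, in seconds */
};

/*
 * Return 1 if a listing may be answered from the cache at time
 * "now", 0 if the remote should be contacted first.
 */
int cache_is_fresh(const struct remote_cache *rc, time_t now)
{
	assert(rc);
	if (rc->last_contact == 0)
		return 0;
	return now - rc->last_contact <= rc->max_age;
}

/* Record that we just refreshed the cache from the remote. */
void cache_mark_refreshed(struct remote_cache *rc, time_t now)
{
	assert(rc);
	rc->last_contact = now;
}
```

Forcing an update then amounts to refreshing unconditionally and
calling cache_mark_refreshed(), which is what an explicit
"git download $remote" would do in this picture.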

* Unified "fetch" and "push" across backends

I was rediscovering git-cvsimport today and wished I could
just have said (syntax aside):

	URL: cvs;/my.re.po/.cvsroot
        Pull: HEAD:remotes/cvs/master
        Pull: experiment:remotes/cvs/experiment

to cause "git fetch" to run git-cvsimport to update the remotes/cvs/
branches (and "git pull" to merge CVS changes to my branches).
The same thing should be possible for SVN and other foreign SCM
backends.

Also it should be possible to use git-cvsexportcommit as a
backend for "git push" into the cvs repository.
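
One way to picture this is a small table that maps a URL scheme to
the helper programs "git fetch" and "git push" would run.  The
sketch below is only that: the backend structure, the
backend_for_url() function and the "scheme followed by a colon"
URL convention are assumptions, since the syntax is explicitly
left aside above.  The helper names are the existing programs
mentioned in this message, plus git-fetch-pack and git-send-pack
for the native case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one foreign-SCM (or native) backend. */
struct backend {
	const char *scheme;        /* URL prefix, e.g. "cvs" */
	const char *fetch_helper;  /* program "git fetch" would run */
	const char *push_helper;   /* program "git push" would run */
};

static const struct backend backends[] = {
	{ "cvs", "git-cvsimport", "git-cvsexportcommit" },
	{ "git", "git-fetch-pack", "git-send-pack" },
};

/*
 * Pick the backend whose scheme, followed by ':', starts the URL.
 * Returns NULL when no backend claims the URL.
 */
const struct backend *backend_for_url(const char *url)
{
	size_t i;

	assert(url);
	for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
		size_t len = strlen(backends[i].scheme);

		if (!strncmp(url, backends[i].scheme, len) && url[len] == ':')
			return &backends[i];
	}
	return NULL;
}
```

Supporting SVN or another foreign SCM would then mean adding a row
to the table rather than teaching the user a separate command.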

That's it for tonight...
