Re: Weird shallow-tree conversion state, and branches of shallow trees

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 16 Apr 2007 08:58:17 -0700 (PDT)

On Mon, 16 Apr 2007, Andy Parkins wrote:
> 
> It's not an accurate analogy at all.  Your conclusion is your supposition - 
> it's stupid because it's stupid.  I don't understand what the huge problems 
> are - all you've done is say again that it's a problem to have keyword 
> expansion.  Why?  What problem does it actually cause?

The easiest way to explain it is that keyword expansion is like crlf, just 
a million times worse (but if you were to do it in git, you'd literally 
do it in the same path that does crlf expansion).

Like crlf:
 - it requires you to be careful about binary vs non-binary, and corrupts 
   binary files subtly.
 - it never appears to be a problem as long as you stay inside the "same 
   system", because everybody just agrees.

But why did I actually implement auto-CRLF, if I'm so against it? Because 
keyword expansion has a lot of problems that CRLF does *not* have:

 - pretty much every single tool out there actually handles CRLF 
   automatically. When you send emails from a CRLF system to a non-CRLF 
   system, the CRLF will just be removed. Why? Because tools *outside* the 
   SCM already know about "text vs binary", and while you can certainly 
   screw it up (use a CRLF system to generate a kernel patch and send it 
   as a binary attachment, and it won't apply for me, for example), you 
   actually have to work at it a bit.

 - A transformation like LF<->CRLF is "stateless". Anybody can translate a 
   file between CRLF and LF without having to know anything at all, so 
   even *if* somebody sends me a patch with CRLF (and it actually happens: 
   the amount of whitespace damage that people can do with email is just 
   surprisingly high, and people occasionally use Windows machines to send 
   me kernel patches, probably because they send email from some other 
   machine than the one they did development on).

 - Related to the statelessness: CRLF is a "global" operation, and doesn't 
   depend on file history or placement. Keyword expansion explicitly does
   *not* work that way, since the whole *point* of keywords is to make it 
   depend on its place in history!

An example of real-world problems with that lack of statelessness of 
keywords is something as simple as "git rebase". Think about what it does: 
it moves a commit around in history. But then think about *how* it does 
that.

[ Ok, take a break here, and think about why "keyword expansion" might be 
  a problem for "git rebase" in a way that CRLF is not, before you read on ]

Hint: the reason statefulness is broken for things like "git rebase" is 
that the natural operation for something like that is to generate a patch, 
and carry it forward. Now, what is in the patch? Keywords. Will the patch 
apply to the target? Yes? No?

See? Keywords means that you suddenly have merge problems with something 
as simple as patches. Does this matter in CVS? Not often. CVS is so 
limited that you cannot much do those operations anyway, but if you've 
ever done a merge in CVS, keyword expansion tends to be one of the things 
that just make it more complicated. So now you have to remember flags like 
like "-kk" that disable keywords.

(Not a lot of people actually do merges in CVS - branches are hard to use 
to begin with, so the only people who do branches tend to be pretty 
hardcode CVS people, and once you've learnt enough to do a branch, keyword 
expansion is the least of your problems. But it's *one* reason - however 
small compared to the other reasons - that doing things like merging in 
CVS is just more painful than it should be)

Or what about generating a diff between two branches? Keywords are a total 
*nightmare*. Do you realize just how *fast* git is in diff generation. 
Have you ever done "cvs diff"? Have you ever *thought* about how git can 
be so fast? Hint: we don't even *look* at the contents for most files. But 
if the content is "generated" depending on history, you just screwed that 
up too.

Or what about something as seemingly unrelated as "git grep". You may not 
even *realize* how nasty a problem it is when you have two different 
representations of the same data: one that has keywords in it and is 
checked out, and one that does not. Which one should you choose? Which one 
is the right one? What about the git optimization of using the checked-out 
data because it doesn't need any unpacking?

Again, none of these things are problems with CRLF: CRLF is an issue that 
is pretty much *defined* to not matter for text-files. If you do a "grep", 
it doesn't matter if lines end in LF or CRLF. If you do a diff, line 
ending differences (a) shouldn't exist in the first place because they are 
stateless and (b) even if they were to exist, they shouldn't change the 
diff, because LF and CRLF are the same in text.

And the whole keyword issue gets *worse* when you move between 
repositories. If you stay "inside" the SCM, you can generally teach it to 
ignore them. For example, going back to the "git rebase" example (or the 
"git grep" one, for that matter), you can just define that it's done 
without keyword expansion.

But when you move the data between people? That's exactly where keyword 
expansion is enabled, and now you not only make things like "git diff" 
fundamentally broken and much much slower (in fact, it *cannot*work* in 
the git model, because we don't even *have* tree history, so you cannot 
add keywords to a tree!), you also guarantee that the end result is much 
less useful, because now when you send the patch to others, they'll have 
all the same issues that you had to work around locally.

I don't know if I can convince you, but take it from me, keyword expansion 
is fundamentally broken in the first place, but it's *more* so with git 
than with CVS, for example.

In CVS, the reason you can do keyword expansion in the first place is:

 - it's file-based to begin with. A file actually *has* history in CVS, in 
   a way it fundmanentally does *not* have in git. So when you generate a 
   diff on a file, the revision information is "just there". That's simply 
   not true in git. There *is* no per-file revision information. You 
   cannot know who touched the file last, for example, without starting 
   from a commit, and doing very expensive things.

 - it's slow to begin with. This is related to the above thing: exactly 
   because CVS is file-based and not content-based, when you do things 
   like "cvs diff" you will walk files individually anyway. People 
   *accept* (and I cannot imagine why) that an empty "cvs diff" on some 
   big project will take minutes. And the problems aren't even about 
   keyword expansion - keyword expansion is just a small detail.

 - it's centralized in more ways than one. You are simply not expected to 
   work by applying patches between two unrelated CVS trees. It's not 
   done. It cannot work. The closest you get is 
	(a) merging. Which is *hell*. Again, keyword expansion is just a
	    small detail in why it's hell, and people don't generally pick 
	    it up exactly because the merge problems are so much bigger.
	(b) applying patches from the outside from people who do *not* use 
	    CVS, and thus don't generally touch things around the 
	    keywords (but even here, you actually end up having problems
	    occasionally).

 - CVS really fundamentally has so many other problems that keyword
   expansion just isn't on peoples radar. Yeah, it can corrupt data, but 
   you're more likely to corrupt data with binary files other ways, so 
   it's just not an issue.

So basically, other (more fundamental) design mistakes in CVS make 
keywords seem like a better idea there, but all the keyword problems are 
just magnified ten-fold by the fact that git doesn't make those _other_ 
mistakes that CVS does.

And don't get me wrong: I think RCS was a great step forward, and CVS was 
too. A few decades ago. But in git, we sometimes have to teach people to 
*not* make the mistakes they did with CVS. Keyword expansion is a small 
detail, and happily few enough people used it in CVS that it's so far not 
been a huge problem to teach people not to do it.

We had to teach people that there's a difference between doing a local 
repository commit, and pushing that commit to a shared central point. 
That's a much more fundamental difference, and it's a lot harder to get 
your brain to accept that kind of change. In contrast, keywords look 
"trivial", but they really aren't. It's a fundamentally broken notion, 
even if it *sounds* like a small detail.

I'll finish off trying to explain the problem in fundamental git terms: 
say you have a repository with two branches, A and B, and different 
history  on a file "xyzzy" in those two branches, but because they both 
ended up applying the same patches, the actual file contents do end up 
being 100% identical. So they have the same SHA1.

What is

	git diff A..B -- xyzzy

supposed to print?

And *I* claim that if you don't get an immediate and empty diff, your 
system is TOTALLY BROKEN.

And now think about what keywords do. And realize that keywords are 
TOTALLY BROKEN!

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html