Re: On Tabs and Spaces

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 17 Oct 2007 16:53:24 -0700 (PDT)

On Thu, 18 Oct 2007, Christer Weinigel wrote:
> 
> BTW, how serious is the problem with deltifying when there are a lot of
> spaces that David Kastrup mentioned?

I suspect it works quite well in practice.

But we've had to tweak the xdiff code before, and the hash calculations 
for bucket size limits. If somebody actually points out a problem case, we 
can probably tweak it again.

> Wouldn't it be a problem when people put ASCII graphics in comments or 
> just have comments like /*********************************/ in their 
> code?

In general, *any* situation where you have tons of character sequences 
that are the same (and here it's not the characters *themselves* that have 
to be the same - it's the *sequence* that has to be the same, so it's not 
about repeating the same character over and over per se: it's about 
repeating a certain block of characters many many times in the source 
code) will be problematic for pretty much any similarity analysis.

Why? Because you just have a lot of the same sequence, and to get a good 
delta you want to find common "sequences of these sequences" (call them 
supersequences) in order to find the biggest common chunk.

So the badly performing cases for any delta algorithm (and I do want to 
point out that this has nothing whatsoever to do with the particular one 
that git uses) tend to be exactly the ones where you have lots and lots 
of smaller chunks that match in two files, and that then makes it costlier 
to find the *bigger* chunks that are built up of those smaller chunks.

And generally you tend to have two situations: you either (a) take *much* 
longer to find the common areas (the algorithms are often quadratic or 
worse) or (b) you decide to ignore chunks that are so common that they 
don't really add any real information when it comes to finding truly 
common chunks. That second choice generally means that you can miss some 
cases where you *could* have found a good match for deltification.

In fact, usually you have a combination of the above two effects: certain 
deltas may be more expensive to find but there is also a limit that kicks 
in and means that you never spend *too* much time on finding them if the 
pattern space is not amenable to it.
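
A minimal sketch of that trade-off, in Python and not taken from git or 
xdiff: the names BLOCK, build_index, longest_match and max_bucket are all 
made up for illustration. It indexes every fixed-size chunk of a source 
buffer and caps how many offsets each bucket may hold, so a very common 
chunk (a run of spaces, say) cannot make the search arbitrarily expensive, 
at the cost of possibly missing the best match.

```python
from collections import defaultdict

BLOCK = 4  # hypothetical chunk size, chosen only for this example


def build_index(source, max_bucket):
    """Map each BLOCK-byte chunk of source to offsets where it occurs.

    At most max_bucket offsets are kept per chunk: this is choice (b),
    ignoring further copies of chunks that are too common.
    """
    buckets = defaultdict(list)
    for off in range(len(source) - BLOCK + 1):
        bucket = buckets[source[off:off + BLOCK]]
        if len(bucket) < max_bucket:
            bucket.append(off)
    return buckets


def longest_match(index, source, target, pos):
    """Return (source_offset, length) of the longest match for target[pos:].

    Every candidate offset in the bucket is extended byte by byte, so the
    cost grows with the bucket size: this is where choice (a) gets slow.
    """
    best_off, best_len = -1, 0
    for off in index.get(target[pos:pos + BLOCK], ()):
        n = 0
        while (off + n < len(source) and pos + n < len(target)
               and source[off + n] == target[pos + n]):
            n += 1
        if n > best_len:
            best_off, best_len = off, n
    return best_off, best_len
```

With src = b" " * 40 + b"int x;" and tgt = b"    int x;", an effectively 
unlimited bucket (max_bucket=1000) finds the full 10-byte match at offset 
36, while max_bucket=2 only ever looks at offsets 0 and 1 and settles for 
the 4 spaces.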

Would lots of spaces be such a pattern? I personally doubt it would really 
matter. In general, source code is easy to delta: the bulk of any common 
sequences in most files will be found by the trivial "look for common 
sequences in the beginning and the end". The really *bad* cases tend to be 
rather odd, and often generated files.
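
The "common sequences in the beginning and the end" step can be sketched 
as follows. This is an illustration in Python, not git's implementation, 
and trim_common_ends is a hypothetical name. It strips the shared prefix 
and suffix from two sequences (lines or bytes) and leaves only the middle 
parts for a more expensive comparison.

```python
def trim_common_ends(old, new):
    """Return (prefix_len, suffix_len, old_middle, new_middle).

    prefix_len and suffix_len count the leading and trailing items the
    two sequences share; the middles are what remains in between.
    """
    limit = min(len(old), len(new))

    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # The suffix must not overlap the prefix already claimed.
    suffix = 0
    while (suffix < limit - prefix
           and old[-1 - suffix] == new[-1 - suffix]):
        suffix += 1

    return (prefix, suffix,
            old[prefix:len(old) - suffix],
            new[prefix:len(new) - suffix])
```

For a typical source edit that touches a few lines in the middle of a 
file, the two middles are tiny compared to the file, which is why source 
code tends to be cheap to delta.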

So no, I don't think deltification is a huge deal for spaces. But it does 
boil down to the same kind of issues: if you blow up the source base by 
20%, you often slow down things by 20% or more, simply because there is 
more data to process at all stages. It just slows down everything - 
totally unnecessarily.

			Linus