On Thu, 18 Oct 2007, Christer Weinigel wrote:
>
> BTW, how serious is the problem with deltifying when there are a lot of
> spaces that David Kastrup mentioned?

I suspect it works quite well in practice. But we've had to tweak the xdiff code before, and the hash calculations for bucket size limits. If somebody actually points out a problem case, we can probably tweak it again.

> Wouldn't it be a problem when people put ASCII graphics in comments or
> just have comments like /*********************************/ in their
> code?

In general, *any* situation where you have tons of character sequences that are the same will be problematic for pretty much any similarity analysis. And here it's not the characters *themselves* that have to be the same - it's the *sequence* that has to be the same. So it's not about repeating the same character over and over per se: it's about repeating a certain block of characters many, many times in the source code.

Why? Because you just have a lot of the same sequence, and to get a good delta you want to find common "sequences of these sequences" (call them supersequences) in order to find the biggest common chunk.

So the badly performing cases for any delta algorithm (and I do want to point out that this has nothing whatsoever to do with the particular one that git uses) tend to be exactly the ones where you have lots and lots of smaller chunks that match in two files, which then makes it costlier to find the *bigger* chunks that are built up of those smaller chunks.

And generally you tend to have two situations: you either

 (a) take *much* longer to find the common areas (the algorithms involved are often quadratic or worse), or

 (b) decide to ignore chunks that are so common that they don't really add any real information when it comes to finding truly common chunks.

That second choice generally means that you can miss some cases where you *could* have found a good match for deltification.
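To make option (b) concrete, here is a toy sketch of that "cap the bucket" idea - this is illustrative pseudocode in Python, not git's actual xdiff/xdelta code, and the block size and cap are made-up numbers. It indexes fixed-size blocks of the source file, but refuses to remember more than a handful of positions for any one block, so a block that repeats thousands of times (runs of spaces, /****/ banners) can no longer blow up the match search:

```python
# Toy illustration (NOT git's real delta code) of strategy (b):
# index source blocks by content, but cap how many positions we keep
# per identical block, trading missed matches for bounded search time.

from collections import defaultdict

BLOCK = 16        # block size used for indexing (assumed, illustrative)
BUCKET_CAP = 8    # max source positions remembered per identical block


def build_index(src: bytes) -> dict:
    """Map each BLOCK-sized substring of src to (at most BUCKET_CAP of)
    the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(src) - BLOCK + 1):
        bucket = index[src[pos:pos + BLOCK]]
        if len(bucket) < BUCKET_CAP:   # overly common blocks get dropped
            bucket.append(pos)
    return index


def longest_match(src: bytes, dst: bytes, dpos: int, index: dict):
    """Find the longest copy from src matching dst[dpos:], trying only
    the (capped) candidate positions for the block starting at dpos."""
    best_len, best_src = 0, -1
    for spos in index.get(dst[dpos:dpos + BLOCK], []):
        n = 0
        while (spos + n < len(src) and dpos + n < len(dst)
               and src[spos + n] == dst[dpos + n]):
            n += 1
        if n > best_len:
            best_len, best_src = n, spos
    return best_src, best_len
```

With the cap in place, a source file consisting of nothing but identical blocks costs at most BUCKET_CAP extension attempts per target position instead of one per occurrence - that is exactly the time/quality trade-off described above: bounded work, at the price of occasionally missing the best possible copy.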
In fact, usually you have a combination of the above two effects: certain deltas may be more expensive to find, but there is also a limit that kicks in and means that you never spend *too* much time on finding them if the pattern space is not amenable to it.

Would lots of spaces be such a pattern? I personally doubt it would really matter. In general, source code is easy to delta: the bulk of any common sequences in most files will be found by the trivial "look for common sequences at the beginning and the end". The really *bad* cases tend to be rather odd, and often generated files.

So no, I don't think deltification is a huge deal for spaces. But it does boil down to the same kind of issue: if you blow up the source base by 20%, you often slow things down by 20% or more, simply because there is more data to process at all stages. It just slows down everything - totally unnecessarily.

		Linus