Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes: > On Mon, 3 Dec 2007, Linus Torvalds wrote: > >> On Tue, 4 Dec 2007, Dmitry V. Levin wrote: >> > >> > Average file size in the linux-2.6.23.9 kernel tree is 10944 bytes, >> >> Don't do "average" sizes. That's an almost totally meaningless number. >> >> "Average" makes sense if you have some kind of gaussian distribution or >> similar. > > To enhance on that: Gaussian is symmetric, which cannot be the proper > distribution for anything that is non-negative. This reasoning is nonsense, since Gaussians are not necessarily symmetric about zero, and for example the equally distributed probability between 0 and 1 is both symmetric and non-negative. And the trivial distribution of "always zero" is even _both_ symmetric around zero and non-negative. What is true for Gaussians is that their probability is non-zero everywhere. But with a meaning of "non-zero" that should let the kind of people using hashes for unique file identification sleep well. > I see so many mis-applications of statistics/probability theory in my > day job that I cannot resist pointing people to the Poisson > distribution here (in whose context "average" actually makes kind of > sense). The main point of Gaussians is that they approximate a distribution coming from a sum of independent random sources pretty well, even if the individual sources are not Gaussians themselves. Poisson distributions come about as the sum of independent exponential distributions, and yes, when the number of summands grows, the result is quite well modeled by a Gaussian: the impossible outliers predicted by the Gaussian approximation take a negligible probability of the total. If the distribution is more like coming from a product of independent sources, the results will be better modeled by log-Gaussian distributions (which happen to be non-negative). The exponential distribution happens to be the log-distribution of the (0,1) equal probability distribution, so for lower order Poisson distributions, approximation with log-Gaussians may seem more straightforward. > But back to the problem: if you have a truly binary file, then _every_ > byte (absent further information, of course) has a probability of > 1/256 of being 0. Absent any information, there is no reason to assume equal probability as more likely than other probabilities. > Which means that if a file is binary, is _random_ binary. Few people version random binary files. It is rather pointless. > but is unusual enough to have that property only for half of the first > 8192 bytes, you get a probability of 1 - 1 / 256^4096 = 1 - 1 / 2 ^ > 32768 that the current test succeeds. Rather (1-1/256)^4096 ~= exp(-4096/256) = about 1 in 10 million Which means that the test would yield a false negative (with regard to the data being binary) about once in 10000000 times when done on 4096 bytes, even given your rather absurd assumption of random binary data being versioned. To make this somewhat more obvious: let's take a look at the probability of 128 random bytes being all non-zero. According to your model, that should be (1-1/256^128)=1-1/32768. According to mine, (1-1/256)^128 ~= exp(-0.5) ~= 60%. Why is this even more than 50%, which would be the naive assumption? Because it is likely that we will have duplicate bytes among our 128 bytes, so the probability for an occurence of 0 becomes less. > I fail to see how this test can possibly fail for the average case. You start with a nonsensical mathematical model, and don't even do the math right that would follow from it. The "average case" is not random equidistributed uncorrelated data, anyway. > So if it fails only for special cases, we are probably (in the common, > not the mathematical, sense) better off asking those people > encountering them to add git-attributes for the files. > > IMHO that is not asking for too much. That the problem has actually been encountered in real life does not exactly do much to support your assumptions and your conclusions, does it? -- David Kastrup, Kriemhildstr. 15, 44793 Bochum - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html