On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh  | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
>
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks. I wondered why I didn't notice, but it's because the other 6114
also just made it into "seen". :)

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean? Was the development effort
> that went into the v2.28 release almost half the size of v2.29, and
> have we already done about the same amount of work for this cycle?
>
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

As Taylor mentioned, this is only hitting the commits. So you might as
well just be looking at commit counts as a measure of work, I'd think
(and indeed v2.28 has about half as many commits as v2.29!).

Adding --objects gets you a rougher estimate of "bytes changed", which
helps account for commits of different sizes. But there I think you'd
do just as well to look at the actual number of lines changed with
"git diff --numstat".

I'd expect the number of on-disk bytes to _roughly_ correspond to the
size of the changes. But you are working against the heuristics of the
delta chains there.
It may well be that we would store a base object in the v2.28..v2.29
range, and a delta against it in v2.27..v2.28. And that would attribute
most of the bytes to v2.29, even though they should be shared roughly
with v2.28.

I'm sure one could devise a scheme for "sharing" the bytes from a delta
family across all of its objects. That might even be worth implementing
on top (I don't even think it would be too expensive; you just have to
collect the delta chains for any objects you're reporting, and then
average the total size among each chain).

But in practice, we've found this kind of naive --disk-usage useful for
answering questions like:

  - do I need all of these objects? Comparing "rev-list --disk-usage
    --objects --all", "rev-list --disk-usage --objects --all --reflog",
    and "du objects/pack/*.pack" will tell you if a prune/repack might
    help, and whether expiring reflogs makes a difference.

  - the size of the shared alternates repo for a set of forks has
    jumped. Comparing "rev-list --disk-usage --objects
    --remotes=$fork --not --remotes=$base" will tell you what's
    reachable from a fork but not from the base (we use
    "refs/remotes/$id/*" to keep track of fork refs in our alternates
    repo). This can be junk like somebody forking git/git and then
    uploading a bunch of pirated video files. :)

  - likewise, the size of cloning a single repo may jump. Comparing
    "rev-list --disk-usage --objects HEAD..$branch" for each branch
    might show that one branch is an outlier (e.g., because somebody
    accidentally committed a bunch of build artifacts).

In those kinds of cases, it's not usually "oh, this version is twice as
big as this other one". It's more like "wow, this branch is 100x as big
as the other branches", and little decisions like delta direction are
just noise.

I imagine that in those cases the uncompressed object sizes would
probably produce similar patterns and answers. But it's actually faster
to produce the on-disk sizes. :)

-Peff