On Tue, Mar 8, 2016 at 9:51 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Tue, Mar 08, 2016 at 04:08:21PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> What I really want is something for git-log more like
>> git-for-each-ref, so I could emit the following info for each file
>> being modified, delimited by some binary marker:
>>
>>  - file name before
>>  - file name after
>>  - is rename?
>>  - is binary?
>>  - size in bytes before
>>  - size in bytes after
>>  - removed lines
>>  - added lines
>
> If you get the full sha1s of each object (e.g., by adding --raw), then
> you can dump them all to a single cat-file invocation to efficiently get
> the sizes.
>
> I'm not quite sure I understand why you want to know about renames and
> added/removed lines if you are just blocking binary files. If I were
> implementing this[1], I'd probably just block based on blob size, which
> you can do with:

I want to know about renames because if you're just moving an existing
binary file around, that's fine; it's not adding a new big blob to the
repo. The hook also has a facility to commit binary stuff if you add
"yes I know what I'm doing and want to commit N bytes to the repo" to
the commit message. Mostly when people do this it's an accident.

I wanted to know about added/removed lines because I was looking into
extending this to non-binary files. Today at work someone committed
300MB of text files to a branch; we could delete it in that case, but
it would also be nice to have limits on that sort of thing too.

>     git rev-list --objects $old..$new |
>     git cat-file --batch-check='%(objectsize) %(objectname) %(rest)' |
>     perl -alne 'print if $F[0] > 1_000_000; # or whatever' |
>     while read size sha1 file; do
>       echo "Whoops, $file ($sha1) is too big"
>       exit 1
>     done
>
> You can also use %(objectsize:disk) to get the on-disk size (which can
> tell you about things that don't compress well, which tend to be the
> sorts of things you are trying to keep out).
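One shell subtlety in the quoted pipeline: the `exit 1` only leaves the subshell running the `while` loop, so the surrounding hook still has to check the pipeline's exit status itself. A minimal sketch that instead collects the offenders into a variable first, so the caller can test the function's return value directly; the function name `check_blob_sizes`, the `awk` filter, and the 1MB limit below are my own illustrative assumptions, not part of the original hook:

```shell
#!/bin/sh
# Sketch of a pre-receive-style size check. check_blob_sizes and the
# awk filter are illustrative assumptions; the rev-list/cat-file
# pipeline itself is the one from the quoted mail.
check_blob_sizes() {
    old=$1 new=$2 limit=$3
    # One batch cat-file invocation sizes every object old..new introduces.
    offenders=$(git rev-list --objects "$old..$new" |
        git cat-file --batch-check='%(objectsize) %(objectname) %(rest)' |
        awk -v limit="$limit" '$1 + 0 > limit + 0')
    if [ -n "$offenders" ]; then
        printf 'Whoops, these objects exceed %s bytes:\n%s\n' \
            "$limit" "$offenders" >&2
        return 1
    fi
}
```

Because the offending lines are gathered outside the pipeline, a hook can simply do `check_blob_sizes "$old" "$new" 1000000 || exit 1` and report every over-limit object, not just the first one.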
>
> You can't ask about binary-ness, but I don't think it would be unreasonable
> for cat-file to have a "would git consider this content binary?"
> placeholder for --batch-check.
>
> The other things are properties of the comparison, not of individual
> objects, so you'll have to get them from "git log". But with some clever
> scripting, I think you could feed those sha1s (or $commit:$path
> specifiers) into a single cat-file invocation to get the before/after
> sizes.
>
> -Peff
>
> [1] GitHub has hard and soft limits for various blob sizes, and at one
>     point the implementation looked very similar to what I showed here.
>     The downside is that for a large push, the rev-list can actually
>     take a fair bit of time (e.g., consider pushing up all of the kernel
>     history to a brand new repo), and this is on top of the similar work
>     already done by index-pack and check_everything_connected().
>
>     These days I have a hacky patch to notice the too-big size directly
>     in index-pack, which is essentially free. It doesn't know about the
>     file path, so we pull that out later in the pre-receive hook. But we
>     only have to do so in the uncommon case that there _is_ actually a
>     too-big file, so normal pushes incur no penalty.

All good tips / insights. I'll definitely check some of this out.
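The "$commit:$path specifiers into a single cat-file invocation" idea above can be sketched concretely: emit two specs per changed path (one per side of the comparison) and pipe them all into one `cat-file --batch-check` call. The helper name `compare_sizes` is my own illustrative assumption; note that a path added or deleted between the two commits comes back as "<spec> missing" on the side where it doesn't exist:

```shell
#!/bin/sh
# Sketch of getting before/after sizes from one batch cat-file call;
# compare_sizes is an illustrative name, not an existing tool.
compare_sizes() {
    old=$1 new=$2
    git diff --name-only "$old" "$new" |
    while IFS= read -r path; do
        # Two specs per path: the blob as of $old and as of $new.
        printf '%s:%s\n%s:%s\n' "$old" "$path" "$new" "$path"
    done |
    git cat-file --batch-check='%(objectsize) %(objectname)'
}
```

The output alternates old-size/new-size lines in the same order as the changed paths, so a wrapper can pair them back up without ever spawning more than one `cat-file` process.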