Re: Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 8 Mar 2016 21:58:59 +0100

On Tue, Mar 8, 2016 at 9:51 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Tue, Mar 08, 2016 at 04:08:21PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> What I really want is something for git-log more like
>> git-for-each-ref, so I could emit the following info for each file
>> being modified delimited by some binary marker:
>>
>>     - file name before
>>     - file name after
>>     - is rename?
>>     - is binary?
>>     - size in bytes before
>>     - size in bytes after
>>     - removed lines
>>     - added lines
>
> If you get the full sha1s of each object (e.g., by adding --raw), then
> you can dump them all to a single cat-file invocation to efficiently get
> the sizes.
>
> I'm not quite sure I understand why you want to know about renames and
> added/removed lines if you are just blocking binary files. If I were
> implementing this[1], I'd probably just block based on blob size, which
> you can do with:

I want to know about renames because if you're just moving an existing
binary file around, that's fine: it's not adding a new big blob to the
repo.
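
A minimal sketch of that distinction, assuming the hook only compares
the two endpoints $old and $new (a blob added and then deleted inside
the range would be missed). The function name list_added_blobs is
hypothetical; the git options are existing diff-tree and cat-file ones.

```shell
#!/bin/sh
# list_added_blobs OLD NEW  (hypothetical helper name)
# Prints "<size> <sha1> <path>" for every blob that is new in NEW.
list_added_blobs () {
	old=$1 new=$2
	# -M turns on rename detection, so a moved file is reported with
	# status R; --diff-filter=A then keeps only genuine additions.
	git diff-tree -r -M --diff-filter=A --no-abbrev --raw "$old" "$new" |
	while read -r _srcmode _dstmode _srcsha dstsha _status path
	do
		# cat-file splits its input at the first whitespace and
		# echoes everything after it back as %(rest).
		printf '%s %s\n' "$dstsha" "$path"
	done |
	git cat-file --batch-check='%(objectsize) %(objectname) %(rest)'
}
```

The output has the same shape as the rev-list pipeline quoted below, so
the same size filter can be applied to it. Paths with unusual
characters come out C-quoted because this sketch does not use -z.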

The hook also has a facility to commit binary stuff if you add "yes I
know what I'm doing and want to commit N bytes to the repo" to the
commit message. Mostly when people commit binaries it's an accident.

I wanted to know about added/removed lines because I was looking into
extending this to non-binary files. Today at work someone committed 300MB
of text files to a branch. We could delete it in that case, but it
would be nice to have limits on that sort of thing too.
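
A rough sketch of such a limit for text files, again comparing only the
endpoints. It relies on --numstat printing "-" instead of line counts
for files Git considers binary. The function name check_added_lines
and the limit argument are made up for illustration.

```shell
#!/bin/sh
# check_added_lines OLD NEW LIMIT  (hypothetical helper name)
# Fails if any text file gains more than LIMIT lines between OLD and NEW.
check_added_lines () {
	old=$1 new=$2 limit=$3
	git diff-tree -r -M --numstat "$old" "$new" |
	while read -r added _removed path
	do
		# Binary files show "-" for both counts; leave those to
		# the size-based check.
		test "$added" = "-" && continue
		if test "$added" -gt "$limit"
		then
			echo "Whoops, $path adds $added lines (limit $limit)"
			# Ends the pipeline's subshell; the function then
			# returns this status to its caller.
			exit 1
		fi
	done
}
```

In a hook this could be called as "check_added_lines $old $new 100000
|| exit 1"; the number is only a placeholder.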

>   git rev-list --objects $old..$new |
>   git cat-file --batch-check='%(objectsize) %(objectname) %(rest)' |
>   perl -alne 'print if $F[0] > 1_000_000; # or whatever' |
>   while read size sha1 file; do
>         echo "Whoops, $file ($sha1) is too big"
>         exit 1
>   done
>
> You can also use %(objectsize:disk) to get the on-disk size (which can
> tell you about things that don't compress well, which tend to be the
> sorts of things you are trying to keep out).
>
> You can't ask about binary-ness, but I don't think it would be
> unreasonable for cat-file to have a "would git consider this content
> binary?" placeholder for --batch-check.
>
> The other things are properties of the comparison, not of individual
> objects, so you'll have to get them from "git log". But with some clever
> scripting, I think you could feed those sha1s (or $commit:$path
> specifiers) into a single cat-file invocation to get the before/after
> sizes.
>
> -Peff
>
> [1] GitHub has hard and soft limits for various blob sizes, and at one
>     point the implementation looked very similar to what I showed here.
>     The downside is that for a large push, the rev-list can actually
>     take a fair bit of time (e.g., consider pushing up all of the kernel
>     history to a brand new repo), and this is on top of the similar work
>     already done by index-pack and check_everything_connected().
>
>     These days I have a hacky patch to notice the too-big size directly
>     in index-pack, which is essentially free. It doesn't know about the
>     file path, so we pull that out later in the pre-receive hook. But we
>     only have to do so in the uncommon case that there _is_ actually a
>     too-big file, so normal pushes incur no penalty.

All good tips / insights. I'll definitely check some of this out.
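
For the before/after sizes, the "feed the sha1s into a single cat-file
invocation" idea might look something like the sketch below. It takes
both blob names from the raw diff and skips the all-zero placeholder
used for added or deleted files. The function name sizes_before_after
is hypothetical, and only the endpoints are compared.

```shell
#!/bin/sh
# sizes_before_after OLD NEW  (hypothetical helper name)
# Prints "<size> before <path>" and "<size> after <path>" per changed file.
sizes_before_after () {
	old=$1 new=$2
	git diff-tree -r -M --no-abbrev --raw "$old" "$new" |
	while read -r _srcmode _dstmode srcsha dstsha _status path
	do
		# An all-zero name means "no blob on this side"; only
		# pass names containing a non-zero digit to cat-file.
		case $srcsha in
		*[!0]*) printf '%s before %s\n' "$srcsha" "$path" ;;
		esac
		case $dstsha in
		*[!0]*) printf '%s after %s\n' "$dstsha" "$path" ;;
		esac
	done |
	git cat-file --batch-check='%(objectsize) %(rest)'
}
```

For a rename, the raw format lists both the old and the new path
separated by a tab, so both end up in the path column here.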