git-diff memory/speed/disk impacts (was: being nice to patch(1))

Some more experiments:

David Kastrup <dak@xxxxxxx> writes:

> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:
>
>>> > >
>>> > > I guess the second choice generally isn't an option, but dammit, 
>>> > > "git-apply" really is the better program here.
>>> > 
>>> > Why not?  git-apply works outside of a git repo ;-)
>>> 
>>> I was more thinking that people are not necessarily willing to install git 
>>> just to get the "git-apply" program..
>>
>> But maybe they would be willing to install git to get that wonderful
>> git-apply program, and that wonderful rename-and-mode-aware
>> git-diff, and the git-merge-file program, all of which can operate
>> outside of a git repository. (Take that, hg!)
>
> Well, hmph!  I just rewrote my git-diff-using script to not check
> stuff into a throw-away git repository, and guess what: with real-life
> use cases (diffing trees of about 500MB size), git-diff runs out of
> memory (the machine probably has something like 1.5GB of virtual memory
> size) when operating outside of a git repository.
>
> So the usefulness still seems limited, even now that the output format
> of --name-status has been fixed.
>
> Any idea whether this is a bug, sloppy programming, or an inherent
> restriction/necessity?
>
> Also an idea which of the following scenarios would be best for
> catching all of moves/renames/deletes/adds?  Note: any repository is
> strictly throw-away.
>
> Experiments are somewhat time-consuming, so every hunch helps.
>
> a) diff directories outside of git (works, but fatal memory footprint
>                                     for large cases)
> b) diff index against work directory

Fatal memory footprint.

> c) diff revision against work directory

Fatal memory footprint.

> d) diff revision against index

Does not detect copies/renames.

> e) diff revision against revision (works, but high disk footprint and
>                                    likely slower than alternatives)

So it seems that option e) is the only feasible one.  Overall, git-add
is by far the slowest operation, followed by git-commit.  git-diff on
revisions is quite fast and has a moderate memory footprint.
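
For reference, option e) can be sketched roughly as below.  This is only
an illustration under assumptions, not the script discussed in this
thread: the function names, the commit identity and the idea of building
the command list first are all hypothetical, and it assumes a reasonably
recent git that accepts the --git-dir, --work-tree and -c options.  Both
trees are committed into one throw-away bare repository, and the two
revisions are then diffed with rename and copy detection enabled.

```python
import subprocess

# Hypothetical identity so that the throw-away commits do not depend on
# any user configuration.
IDENT = ["-c", "user.name=tmp", "-c", "user.email=tmp@example.invalid"]


def plan_tree_diff(repo, old_tree, new_tree):
    """Return the git commands for option e) as a list of argv lists."""

    def git(*args, tree=None):
        cmd = ["git", "--git-dir=" + repo]
        if tree is not None:
            cmd.append("--work-tree=" + tree)
        return cmd + list(args)

    plan = [["git", "init", "--quiet", "--bare", repo]]
    for tree, label in ((old_tree, "old"), (new_tree, "new")):
        # Stage everything in the tree, including deletions relative to
        # whatever the index held before.
        plan.append(git("add", "--all", tree=tree))
        plan.append(git(*IDENT, "commit", "--quiet", "--allow-empty",
                        "-m", label, tree=tree))
    # Revision against revision: -M detects renames, -C detects copies.
    plan.append(git("diff", "-M", "-C", "--name-status", "HEAD~1", "HEAD"))
    return plan


def run_plan(plan):
    """Run each command in order; return the output of the last one."""
    out = ""
    for cmd in plan:
        out = subprocess.run(cmd, check=True, capture_output=True,
                             text=True).stdout
    return out
```

Calling run_plan(plan_tree_diff(repo, old, new)) with a fresh repo path
would return the --name-status listing.  Keeping the plan separate from
its execution makes it easy to time the add, commit and diff steps
individually.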

Committing itself does not seem to add much disk space: adding into
the index seems to be the main disk space allocation.

So while the behavior of d) appears puzzling, doing another commit
before the diff is cheap, so I have little motivation to ask people to
track down the problem with d).

It is somewhat unsatisfactory, though, that my script, rewritten to use
the repository-less variant of git-diff, fails on seriously large use
cases due to out-of-memory conditions.

I suppose that's life.

-- 
David Kastrup
