Re: Should I store large text files on Git LFS?

Andrew Ardill <andrew.ardill@xxxxxxxxx> · Mon, 24 Jul 2017 14:58:38 +1000

Hi Farshid,

On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@xxxxxxxxx> wrote:
> I'll probably test this myself, but would modifying and committing a 4GB
> text file actually add 4GB to the repository's size? I anticipate that it
> won't, since Git keeps track of the changes only, instead of storing a copy
> of the whole file (whereas this is not the case with binary files, hence the
> need for LFS).

I decided to do a little test myself. I add three versions of the same
data set (sometimes slightly different cuts of the parent data set,
which I don't have) each between 2 and 4GB in size.
Each time I added a new version it added ~500MB to the repository, and
operations on the repository took 35-45 seconds to complete.
Running `git gc` compressed the objects fairly well, saving ~400MB of
space. I would imagine that even more space would be saved
(proportionally) if there were a lot more similar files in the repo.
The time to checkout different commits didn't change much, I presume
that most of the time is spent copying the large file into the working
directory, but I didn't test that. I did test adding some other small
files, and sometimes it was slow (when cold I think?) and other times
fast.

Overall, I think as long as the files change rarely, and the
repository remains responsive, having these large files in the
repository is ok. They're still big, and if most people will never use
them it will be annoying for people to clone and checkout updated
versions of the files. If you have a lot of the files, or they update
often, or most people don't need all the files, using something like
LFS will help a lot.

$ git version  # running on my windows machine at work
git version 2.6.3.windows.1

$ git init git-csv-test && cd git-csv-test
$ du -h --max-depth=2  # including here to compare after large data
files are added
35K     ./.git/hooks
1.0K    ./.git/info
0       ./.git/objects
0       ./.git/refs
43K     ./.git
43K     .

$ git add data.txt  # first version of the data file, 3.2 GB
$ git commit
$ du -h --max-depth=2  # the data gets compressed down to ~580M of
objects in the git store
35K     ./.git/hooks
1.0K    ./.git/info
2.0K    ./.git/logs
580M    ./.git/objects
1.0K    ./.git/refs
581M    ./.git
3.7G    .

$ git add data.txt  # second version of the data file, 3.6 GB
$ git commit
$ du -h --max-depth=1  # an extra ~520M of objects added
1.2G    ./.git
4.7G    .

$ time git add data.txt  # 42.344s - second version of the data file, 2.2 GB
$ git commit  # takes about 30 seconds to load editor
$ du -h --max-depth=1
1.7G    ./.git
3.9G    .

$ time git checkout HEAD^  # 36.509s
$ time git checkout HEAD^  # 44.658s
$ time git checkout master  # 38.267s

$ git gc
$ du -h --max-depth=1
1.3G    ./.git
3.4G    .

$ time git checkout HEAD^  # 34.743s
$ time git checkout HEAD^  # 41.226s

Regards,

Andrew Ardill