Re: Git chokes on large file

On Tue, May 27, 2014 at 11:47 PM, Dale R. Worley <worley@xxxxxxxxxxxx> wrote:
> I've discovered a problem using Git.  It's not clear to me what the
> "correct" behavior should be, but it seems to me that Git is failing
> in an undesirable way.
>
> The problem arises when trying to handle a very large file.  For
> example:
>
>     $ git --version
>     git version 1.8.3.1
>     $ mkdir $$
>     $ cd $$
>     $ git init
>     Initialized empty Git repository in /common/not-replicated/worley/temp/5627/.git/
>     $ truncate --size=20G big_file
>     $ ls -l
>     total 0
>     -rw-rw-r--. 1 worley worley 21474836480 May 27 11:59 big_file
>     $ time git add big_file
>
>     real        4m48.752s
>     user        4m31.295s
>     sys 0m16.747s
>     $
>
> At this point, either 'git fsck' or 'git commit' fails:
>
>     $ git fsck --full --strict
>     notice: HEAD points to an unborn branch (master)
>     Checking object directories: 100% (256/256), done.
>     fatal: Out of memory, malloc failed (tried to allocate 21474836481 bytes)

Backtrace for this one:

#3  0x000000000055cf39 in xmalloc (size=21474836481) at wrapper.c:49
#4  0x000000000055cffd in xmallocz (size=21474836480) at wrapper.c:73
#5  0x0000000000537858 in unpack_compressed_entry (p=0x858ac0,
w_curs=0x7fffffffc0f8, curpos=18, size=21474836480) at
sha1_file.c:1924
#6  0x0000000000538364 in unpack_entry (p=0x858ac0, obj_offset=12,
final_type=0x7fffffffc1e4, final_size=0x7fffffffc1d8) at
sha1_file.c:2206
#7  0x00000000004fb0a2 in verify_packfile (p=0x858ac0,
w_curs=0x7fffffffc320, fn=0x43f5f2 <fsck_obj_buffer>,
progress=0x858a90, base_count=0) at pack-check.c:119
#8  0x00000000004fb3f4 in verify_pack (p=0x858ac0, fn=0x43f5f2
<fsck_obj_buffer>, progress=0x858a90, base_count=0) at
pack-check.c:177
#9  0x00000000004401d7 in cmd_fsck (argc=0, argv=0x7fffffffd650,
prefix=0x0) at builtin/fsck.c:677

This one is not easy to fix. I started working on converting fsck to
use the index-pack code for pack verification. index-pack handles large
files well, so in the end that might fix this (as well as speed up
fsck). But that work has been stalled for a long time.

>
>     $ git commit -m Test.
>     [master (root-commit) 3df3655] Test.
>     fatal: Out of memory, malloc failed (tried to allocate 21474836481 bytes)

And the backtrace:

#11 0x00000000004b9da0 in read_sha1_file (sha1=0x8558a0
"\256/s\324\370\304\344\212\304I\v\342\334MS\002\352\214\061\222",
type=0x7fffffffc6c4, size=0x8558d0) at cache.h:820
#12 0x00000000004c1b98 in diff_populate_filespec (s=0x8558a0,
size_only=0) at diff.c:2749
#13 0x00000000004c0110 in diff_filespec_is_binary (one=0x8558a0) at diff.c:2188
#14 0x00000000004c0f0b in builtin_diffstat (name_a=0x858530
"big_file", name_b=0x0, one=0x8584e0, two=0x8558a0,
diffstat=0x7fffffffc8a0, o=0x7fffffffce88, p=0x855910) at diff.c:2435
#15 0x00000000004c2fd4 in run_diffstat (p=0x855910, o=0x7fffffffce88,
diffstat=0x7fffffffc8a0) at diff.c:3168
#16 0x00000000004c603a in diff_flush_stat (p=0x855910,
o=0x7fffffffce88, diffstat=0x7fffffffc8a0) at diff.c:4081
#17 0x00000000004c70e4 in diff_flush (options=0x7fffffffce88) at diff.c:4520
#18 0x00000000004e5d59 in log_tree_diff_flush (opt=0x7fffffffcaf0) at
log-tree.c:715
#19 0x00000000004e5e5a in log_tree_diff (opt=0x7fffffffcaf0,
commit=0x8585b0, log=0x7fffffffc9a0) at log-tree.c:747
#20 0x00000000004e60b1 in log_tree_commit (opt=0x7fffffffcaf0,
commit=0x8585b0) at log-tree.c:810
#21 0x000000000042c45c in print_summary (prefix=0x0,
sha1=0x7fffffffd300 ".&Gȑ\360\243\202\351&!\035\312q\374\345\314LL)",
initial_commit=1) at builtin/commit.c:1426
#22 0x000000000042d213 in cmd_commit (argc=0, argv=0x7fffffffd650,
prefix=0x0) at builtin/commit.c:1750

If read_sha1_file had an option to read at most <n> bytes (enough for
binary detection), it would fix this. Another option is to declare all
files larger than core.bigFileThreshold binary. That is easier to
implement, at the cost of being a looser check.
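
To make the first idea concrete, here is a rough sketch in shell rather
than in Git's C code. It decides "binary or not" from only the first <n>
bytes of a file, so the cost does not grow with file size. The function
name looks_binary is made up for this example, and the 8000-byte default
and the "contains a NUL byte" test are assumptions about a reasonable
heuristic, not a copy of what diff.c does.

```shell
# Hypothetical helper: report whether FILE looks binary, reading at
# most LIMIT bytes (default 8000, an assumed value).
# Returns 0 (true) if a NUL byte appears in that prefix, 1 otherwise.
looks_binary() {
    file=$1
    limit=${2:-8000}
    # Count the bytes in the prefix, then count them again with NULs
    # removed. head stops after LIMIT bytes, so a 20G file costs no
    # more than a small one.
    total=$(head -c "$limit" -- "$file" | wc -c)
    non_nul=$(head -c "$limit" -- "$file" | tr -d '\000' | wc -c)
    [ "$((total))" -ne "$((non_nul))" ]
}
```

With the sparse big_file from the report, which is all zero bytes,
"looks_binary big_file" should return true after reading only the first
8000 bytes.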

> Even doing a 'git reset' does not put the repository in a state where
> 'git fsck' will complete:
>
>     $ git reset
>     $ git fsck --full --strict
>     notice: HEAD points to an unborn branch (master)
>     Checking object directories: 100% (256/256), done.
>     fatal: Out of memory, malloc failed (tried to allocate 21474836481 bytes)

I don't know how many commands are hit by this. If you have time and
gdb, please put a breakpoint on the die_builtin() function and send
backtraces for the commands that fail. You could speed up the process
by creating a smaller file and setting the environment variable
GIT_ALLOC_LIMIT (in kilobytes) to a number lower than that size. If
git attempts to allocate a block larger than that limit, it'll die.
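
The procedure above could be scripted roughly as follows. This is only
a sketch: the helper names limit_kb_below and backtrace_on_die are made
up, the limit is taken to be in kilobytes as stated above (other Git
versions may parse the value differently, so check yours), and the git
binary must still have its symbols for the breakpoint to resolve.

```shell
# Hypothetical helpers, not part of Git.

# Print a GIT_ALLOC_LIMIT value, in kilobytes, equal to half the size
# of a file of SIZE_BYTES bytes, so reading the whole file exceeds it.
limit_kb_below() {
    echo $(( $1 / 1024 / 2 ))
}

# Run "git ARGS..." under gdb with the allocation limit set, stop in
# die_builtin() and print the backtrace, all non-interactively.
backtrace_on_die() {
    limit_kb=$1
    shift
    GIT_ALLOC_LIMIT=$limit_kb gdb -batch \
        -ex 'break die_builtin' -ex run -ex bt \
        --args git "$@"
}
```

For example, with a 20M file instead of the 20G one:

    $ truncate --size=20M big_file
    $ git add big_file
    $ backtrace_on_die "$(limit_kb_below 20971520)" fsck --full --strict

A file this small is below the default core.bigFileThreshold, so it may
take a different code path from the 20G case and the traces may not
match the ones above exactly.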
-- 
Duy