Re: [GSOC 2012] Some questions regarding a possible project to improve big file support

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Mon, 26 Mar 2012 08:21:53 +0700

On Mon, Mar 26, 2012 at 3:48 AM, Peter C. <th3flyboy@xxxxxxxxx> wrote:
> My first question is about the low-level details of how Git diffs
> files. In the diff process, does Git just parse the file and look for
> differences, or does it first use something like hashing to tell
> whether the file has been modified at all, and only run the diff if
> the hash is different? As an extension to this question: does Git's
> internal database set any kind of flag to say that a file is binary?

If hashes are available, we compare them first (e.g. in diff-tree) and
skip entries whose hashes match. We can mark a file as binary with
gitattributes (e.g. a "*.iso binary" line in .gitattributes). I think
the binary detection code, buffer_is_binary, could just be moved up a
little, before we unpack file contents. But I'm not really familiar
with this area.
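
To illustrate the "compare hashes first" step, here is a minimal
sketch. It is not git's actual code: struct entry and
needs_content_diff are hypothetical names, and a 20-byte SHA-1 object
name is assumed.

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20 /* assumed: SHA-1 object names are 20 bytes */

/* Hypothetical stand-in for one side of a diff: an object name only. */
struct entry {
	unsigned char hash[HASH_LEN];
};

/*
 * Return 1 if the two entries need a content-level diff, 0 if their
 * hashes match and the contents can be skipped without being read.
 */
int needs_content_diff(const struct entry *a, const struct entry *b)
{
	assert(a && b);
	return memcmp(a->hash, b->hash, HASH_LEN) != 0;
}
```

Only when this returns 1 would the caller go on to load the contents
and decide whether they are text or binary.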

> My thought process for the implementation involves checking the hash:
> if the hash is the same, skip the file; if it is different, check the
> MIME type, possibly using libmagic, and if it matches a known binary
> format, just commit the new version rather than running a whole diff
> and loading the whole file in the process.

Overkill, compared to how binary is detected today :)

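#include <string.h>

/* Only the first FIRST_FEW_BYTES bytes of the buffer are inspected. */
#define FIRST_FEW_BYTES 8000
int buffer_is_binary(const char *ptr, unsigned long size)
{
	if (FIRST_FEW_BYTES < size)
		size = FIRST_FEW_BYTES;
	/* A NUL byte anywhere in that prefix means "binary". */
	return !!memchr(ptr, 0, size);
}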
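
A small sketch of how such a check might be used to decide whether a
textual diff is worth attempting. buffer_is_binary is repeated so the
example is self-contained; should_text_diff is a hypothetical helper,
not something that exists in git.

```c
#include <assert.h>
#include <string.h>

#define FIRST_FEW_BYTES 8000

/* A NUL byte in the first FIRST_FEW_BYTES bytes means "binary". */
int buffer_is_binary(const char *ptr, unsigned long size)
{
	if (FIRST_FEW_BYTES < size)
		size = FIRST_FEW_BYTES;
	return !!memchr(ptr, 0, size);
}

/* Hypothetical helper: return 1 if the buffer looks like text. */
int should_text_diff(const char *buf, unsigned long size)
{
	assert(buf);
	return !buffer_is_binary(buf, size);
}
```

Note that a NUL byte after the first 8000 bytes goes unnoticed, so a
buffer like that is still treated as text. The check is a cheap
heuristic, not a format detector.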

If you are interested in big file support, I think you should focus
on the "Many large files do not delta well..." item on the wiki page.
Junio has already done the framework for it. That can make git manage
gigabyte files just fine (the way "bup" does).
-- 
Duy