On Wed, Feb 18, 2009 at 12:31:35AM +1000, Ben Hoskings wrote:

> This patch adds a proof-of-concept implementation of streaming SHA1
> calculation in sha1_file.c, as demoed with `git hash-object <large
> input file>`. Instead of the command's memory footprint being equal
> to the input file's size, this caps it at SHA1_CHUNK_SIZE (currently
> 64MB).

This might be nice to reduce the memory footprint for _some_ operations
that only need to look at the data once, but...

> Capping memory use this easily seems like a win, but then all this
> code does is stream-calculate a SHA1 and print it to stdout. There
> seem to be a lot of disparate places throughout the codebase where
> objects have their SHA1 calculated.

...as you noticed, there are a lot of other places where we need the
whole object. So you are not suddenly making git not suck with a 700MB
file on a 512MB system.

But more importantly, hash-object (via index_fd) already mmaps the
input when it is possible to do so. So on memory-tight systems, won't
the OS already release memory when appropriate (i.e., your virtual
size is large, but the OS is free to optimize what is actually paged
in)? I suppose this is an extra hint to the OS that we don't care
about past chunks. On systems which support it, perhaps using madvise
with MADV_SEQUENTIAL would be another useful hint.

> Then again, I presume most of these are working with blobs and not
> entire files, and hence wouldn't require streaming anyway. (I'm
> assuming blobs don't grow large enough to warrant it - is that
> necessarily true?)

Either I don't understand what you are saying here, or you are missing
something fundamental about how git works: blobs _are_ entire files.

> On my machine, the original implementation hashed a 700MB file in
> 5.8sec. My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB.

Hmph. So it's slower? I'm not sure what the advantage is, then. On a
low-memory machine, the OS's paging strategy should be reasonable,
especially with MADV_SEQUENTIAL (though I haven't tested it). And on a
machine with enough memory, it's slower. I guess you are evicting
fewer pages from other processes on the system in the meantime.

> +inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
> +                                        unsigned long offset,
> +                                        git_SHA_CTX *c, void *buf)
> +{
> +	buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
> +	git_SHA1_Update(c, buf, len);
> +	munmap(buf, len);
> +}

What is the point of the buf input parameter, which is promptly
overwritten?

> +int hash_sha1_fd(int fd, unsigned long len, const char *type,
> +                 unsigned char *sha1)
> +{
> +	char hdr[32];
> +	int hdrlen;
> +	write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen);
> +	return 0;
> +}

What is the point of the hdr and hdrlen variables being passed in to
receive values, when we never actually look at them? I would think you
are conforming to an existing interface, except that you just added
the function earlier in the patch.

-Peff
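
P.S. To make the madvise suggestion above concrete, here is roughly the
kind of thing I mean. It is an untested sketch, not a patch:
hash_fd_sequential is a made-up name, the 8MB step is arbitrary, and it
hashes only the raw file contents (the "<type> <len>" header that git
prepends for object ids is left out to keep it short). The git_SHA1_*
calls are the same wrappers your patch already uses; the rest is plain
POSIX.

#include <sys/mman.h>

static int hash_fd_sequential(int fd, size_t len, unsigned char *sha1)
{
	git_SHA_CTX c;
	size_t off = 0;
	const size_t step = 8 * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	if (buf == MAP_FAILED)
		return -1;

#ifdef MADV_SEQUENTIAL
	/* advisory: we will walk the mapping front to back */
	madvise(buf, len, MADV_SEQUENTIAL);
#endif

	git_SHA1_Init(&c);
	while (off < len) {
		size_t n = (len - off < step) ? len - off : step;

		git_SHA1_Update(&c, (char *)buf + off, n);
#ifdef MADV_DONTNEED
		/* advisory: pages we have already hashed can be dropped;
		 * off is always a multiple of step, so it stays page-aligned */
		madvise((char *)buf + off, n, MADV_DONTNEED);
#endif
		off += n;
	}
	git_SHA1_Final(sha1, &c);

	return munmap(buf, len);
}

The mapping stays a single mmap, so there is no extra copying, and the
two madvise calls are purely hints that can be dropped on systems that
do not have them.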
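
P.P.S. And to be explicit about the buf comment: I would expect the
chunk helper to look more like the below (the body is unchanged from
your patch; buf is just a local instead of a throwaway parameter).

static inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
					       unsigned long offset,
					       git_SHA_CTX *c)
{
	/*
	 * Map only this chunk, hash it, and drop the mapping again.
	 * Note that offset must be a multiple of the page size, which
	 * 64MB chunks satisfy.
	 */
	void *buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);

	git_SHA1_Update(c, buf, len);
	munmap(buf, len);
}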