On Wed, Feb 18, 2009 at 12:31:35AM +1000, Ben Hoskings wrote:

> This patch adds a proof-of-concept implementation of streaming SHA1
> calculation in sha1_file.c, as demoed with `git hash-object <large
> input file>`. Instead of the command's memory footprint being equal
> to the input file's size, this caps it at SHA1_CHUNK_SIZE (currently
> 64MB).

This might be nice to reduce the memory footprint for _some_ operations
that only need to look at the data once, but...

> Capping memory use this easily seems like a win, but then all this
> code does is stream-calculate a SHA1 and print it to stdout. There
> seem to be a lot of disparate places throughout the codebase where
> objects have their SHA1 calculated.

...as you noticed, there are a lot of other places where we need the
whole object. So you are not suddenly making git not suck with a 700MB
file on a 512MB system.

But more importantly, hash-object (via index_fd) already mmaps the
input when it is possible to do so. So on memory-tight systems, won't
the OS already release memory when appropriate (i.e., your virtual
size is large, but the OS is free to optimize what is actually paged
in)? I suppose this is an extra hint to the OS that we don't care
about past chunks. On systems which support it, perhaps using madvise
with MADV_SEQUENTIAL would be another useful hint.

> Then again, I presume most of these are working with blobs and not
> entire files, and hence wouldn't require streaming anyway. (I'm
> assuming blobs don't grow large enough to warrant it - is that
> necessarily true?)

Either I don't understand what you are saying here, or you are missing
something fundamental about how git works: blobs _are_ entire files.

> On my machine, the original implementation hashed a 700MB file in
> 5.8sec. My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB.

Hmph. So it's slower? I'm not sure what the advantage is, then. On a
low-memory machine, the OS's paging strategy should be reasonable,
especially with MADV_SEQUENTIAL (though I haven't tested it). And on a
machine with enough memory, it's slower. I guess you are evicting
fewer pages from other processes on the system in the meantime.

> +inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
> +                                        unsigned long offset,
> +                                        git_SHA_CTX *c, void *buf)
> +{
> +	buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
> +	git_SHA1_Update(c, buf, len);
> +	munmap(buf, len);
> +}

What is the point of the buf input parameter, which is promptly
overwritten?

> +int hash_sha1_fd(int fd, unsigned long len, const char *type,
> +                 unsigned char *sha1)
> +{
> +	char hdr[32];
> +	int hdrlen;
> +	write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen);
> +	return 0;
> +}

What is the point of the hdr and hdrlen variables being passed in to
receive values, when we never actually look at them? I would think you
are conforming to an existing interface, except that you just added
the function earlier in the patch.

-Peff
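
P.S. To make the madvise suggestion above concrete, here is roughly the
kind of thing I mean. It is an untested sketch, not a patch:
hash_fd_sequential is a made-up name, the 8MB step is arbitrary, and it
hashes only the raw file contents (the "<type> <len>" header that git
prepends for object ids is left out to keep it short). The git_SHA1_*
calls are the same wrappers your patch already uses; the rest is plain
POSIX.

#include <sys/mman.h>

static int hash_fd_sequential(int fd, size_t len, unsigned char *sha1)
{
	git_SHA_CTX c;
	size_t off = 0;
	const size_t step = 8 * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	if (buf == MAP_FAILED)
		return -1;

#ifdef MADV_SEQUENTIAL
	/* advisory: we will walk the mapping front to back */
	madvise(buf, len, MADV_SEQUENTIAL);
#endif

	git_SHA1_Init(&c);
	while (off < len) {
		size_t n = (len - off < step) ? len - off : step;

		git_SHA1_Update(&c, (char *)buf + off, n);
#ifdef MADV_DONTNEED
		/* advisory: pages we have already hashed can be dropped;
		 * off is always a multiple of step, so it stays page-aligned */
		madvise((char *)buf + off, n, MADV_DONTNEED);
#endif
		off += n;
	}
	git_SHA1_Final(sha1, &c);

	return munmap(buf, len);
}

The mapping stays a single mmap, so there is no extra copying, and the
two madvise calls are purely hints that can be dropped on systems that
do not have them.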
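
P.P.S. And to be explicit about the buf comment: I would expect the
chunk helper to look more like the below (the body is unchanged from
your patch; buf is just a local instead of a throwaway parameter).

static inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
					       unsigned long offset,
					       git_SHA_CTX *c)
{
	/*
	 * Map only this chunk, hash it, and drop the mapping again.
	 * Note that offset must be a multiple of the page size, which
	 * 64MB chunks satisfy.
	 */
	void *buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);

	git_SHA1_Update(c, buf, len);
	munmap(buf, len);
}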