On Mon, Feb 22, 2010 at 8:31 PM, Zygo Blaxell <zblaxell@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> If you're read()ing a chunk at a time into a fixed size buffer, and
> doing sha1 and deflate in chunks, the data should be copied once into CPU
> cache, processed with both algorithms, and replaced with new data from
> the next chunk.

Currently, we calculate the SHA-1, then look up whether an object with
this SHA-1 already exists, and only if it does not do we deflate the data
and write it to the object storage. So we avoid the deflate and write
costs when the object already exists.

Moreover, when we deflate the data, we create the temporary file in the
same directory where the target object will be stored, thus avoiding a
cross-directory rename (which is important for some reason, but I don't
remember why). So, creating the temporary file requires knowing the first
two digits of the SHA-1, which you cannot know without calculating the
SHA-1 first.

So, the idea of processing the file in chunks is very attractive, but it
has two drawbacks:

1. extra cost (deflating+writing) when the object is already stored
2. some issues with cross-directory renaming

Dmitry
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
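
[Editor's illustration] The hash-first ordering described in the email can be sketched in Python. This is a simplified model, not git's actual C implementation: the function name `write_loose_object` and the non-streaming (whole-buffer) interface are made up for illustration, while the `"<type> <size>\0"` header and the `objects/xx/yyyy...` layout match git's loose-object format.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir, data, obj_type="blob"):
    # Hash first: git hashes "<type> <size>\0" followed by the payload.
    header = f"{obj_type} {len(data)}\0".encode()
    sha1 = hashlib.sha1(header + data).hexdigest()

    # Loose objects live at objects/<first 2 hex digits>/<remaining 38>.
    subdir = os.path.join(objects_dir, sha1[:2])
    path = os.path.join(subdir, sha1[2:])

    # If the object already exists, skip the deflate and write entirely.
    if os.path.exists(path):
        return sha1

    os.makedirs(subdir, exist_ok=True)

    # Deflate into a temporary file in the *target* directory, so the
    # final rename never crosses a directory boundary.
    fd, tmp = tempfile.mkstemp(dir=subdir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(zlib.compress(header + data))
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return sha1
```

The point of the sketch is the ordering constraint: `subdir` depends on `sha1`, so the temporary file cannot be created until the full SHA-1 is known, which is exactly why deflating in chunks alongside hashing does not fit this layout.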