Re: Problem with large files on different OSes

On Thu, May 28, 2009 at 6:07 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> On Wed, 27 May 2009, Jeff King wrote:
>>
>> Linus' "split into multiple objects" approach means you could perhaps
>> split intelligently into metadata and "uninteresting data" sections
>> based on the file type.
>
> I suspect you wouldn't even need to. A regular delta algorithm would just
> work fairly well to find the common parts.
>
> Sure, if the offset of the data changes a lot, then you'd miss all the
> deltas between two (large) objects that now have data that traverses
> object boundaries, but especially if the split size is pretty large (ie
> several tens of MB, possibly something like 256M), that's still going to
> be a pretty rare event.
>
> IOW, imagine that you have a big file that is 2GB in size, and you prepend
> 100kB of data to it (that's why it's so big - you keep prepending data to
> it as some kind of odd ChangeLog file). What happens? It would still delta
> fairly well, even if the deltas would now be:
>
>  - 100kB of new data
>  - 256M - 100kB of old data as a small delta entry
>
> and the _next_ chunk would be:
>
>  - 100kB of "new" data (old data from the previous chunk)
>  - 256M - 100kB of old data as a small delta entry
>
> .. and so on for each chunk. So if the whole file is 2GB, it would be
> roughly 8 256MB chunks, and it would delta perfectly well: except for the
> overlap, that would now be 8x 100kB "slop" deltas.
>
> So even a totally unmodified delta algorithm would shrink down the two
> copies of a ~2GB file to one copy + 900kB of extra delta.
>
> Sure, a perfect xdelta thing that would have treated it as one huge file
> would have had just 100kB of delta data, but 900kB would still be a *big*
> saving over duplicating the whole 2GB.
>
>> That would make things like rename detection very fast. Of course it has
>> the downside that you are cementing whatever split you made into history
>> for all time. And it means that two people adding the same content might
>> end up with different trees. Both things that git tries to avoid.
>
> It's the "I can no longer see that the files are the same by comparing
> SHA1's" that I personally dislike.
>
> So my "fixed chunk" approach would be nice in that if you have this kind
> of "chunkblob" entry, in the tree (and index) it would literally be one
> entry, and look like that:
>
>   100644 chunkblob <sha1>
>
> so you could compare two trees that have the same chunkblob entry, and
> just see that they are the same without ever looking at the (humongous)
> data.
>
> The <chunkblob> type itself would then look like just an array of SHA1's,
> ie it would literally be an object that only points to other blobs. Kind
> of a "simplified tree object", if you will.
>
> I think it would fit very well in the git model. But it's a nontrivial
> amount of changes.
>
>                        Linus
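
To make the chunkblob idea above a little more concrete, here is a
minimal C sketch of how such an object might be represented in memory.
This is purely illustrative: the struct and function names, the 256MB
chunk size, and the assumption that the object body is just raw
20-byte SHA-1s are all hypothetical; no such object type exists in
git.  The key property is that the body is nothing but an ordered list
of chunk SHA-1s, so two trees carrying the same chunkblob SHA-1
compare equal without ever touching the huge data, while deltas are
still found per chunk (hence the roughly 8x 100kB of "slop" in the
prepend example above).

#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (256 * 1024 * 1024)	/* assumed fixed split size */
#define SHA1_RAWSZ 20			/* raw (binary) SHA-1 length */

/*
 * Hypothetical "chunkblob": an object whose body is just the SHA-1s
 * of its fixed-size chunk blobs, in order -- a simplified tree object.
 */
struct chunkblob {
	unsigned long nr;			 /* number of chunks */
	unsigned char (*chunk_sha1)[SHA1_RAWSZ]; /* one SHA-1 per chunk */
};

/* Parse a chunkblob body, assumed to be nr * 20 bytes of raw SHA-1s. */
static struct chunkblob *parse_chunkblob(const unsigned char *buf,
					 unsigned long size)
{
	struct chunkblob *cb;

	if (size % SHA1_RAWSZ)
		return NULL;		/* malformed object body */
	cb = malloc(sizeof(*cb));
	if (!cb)
		return NULL;
	cb->nr = size / SHA1_RAWSZ;
	cb->chunk_sha1 = malloc(size);
	if (!cb->chunk_sha1) {
		free(cb);
		return NULL;
	}
	memcpy(cb->chunk_sha1, buf, size);
	return cb;
}

The tree (and index) entry itself would stay a single line, exactly as
quoted above ("100644 chunkblob <sha1>"), so tree comparison and
rename detection never need to look past that one SHA-1.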

I'd like to pitch in that our mother company uses Subversion, and they
consistently push very large binaries into their Subversion
repositories (I know it's not a good idea; they do it nevertheless.
The very large binary is a description of a design in a proprietary
format produced by a proprietary tool; they don't want to keep running
that tool because of licensing and other issues, so they archive its
output in Subversion).

I'm trying to convince the mother company to switch to git, mostly
because our company (the daughter company) doesn't have direct access
to their Subversion repo (we're in another country), and I've become
convinced that distributed repos like git are the way to go.  But the
fact that large binaries require me to turn off gc.auto and otherwise
avoid packing large files makes my case a harder sell; quite a bit of
the mother company's workflow is already integrated with Subversion.
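
A minimal sketch of that workaround, assuming the big binaries match a
placeholder pattern like "*.bin": turn off automatic repacking and
tell git not to attempt delta compression on those paths.

  $ git config gc.auto 0
  $ echo '*.bin -delta' >> .gitattributes

That keeps day-to-day operation workable, but it is exactly the kind
of manual tuning that makes the switch a harder sell.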

Note that in my case the "large binary" is really a 164 MB file, but
my work system is a dual-core machine with 512 MB of RAM, so I suppose
my hardware is really the limitation; still, some of the computers at
the mother company are even lousier.

If you'd prefer someone else to hack on it, can you at least give me
some pointers on which source files to start looking at?  I'd really
like to have proper large-file-packing support, where "large file"
means anything much bigger than a megabyte or so.

Admittedly I'm not a filesystems guy, and I can just barely grok
git's blobs (they're the actual file contents, right? except they're
named by their hash), but not packs (err, a bunch of files?) or trees
(brown and green stuff you plant?).  Still, I can try to learn it.

Sincerely,
AmkG
