Sergio came, saw, said, 09.09.2008 11:02:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
>
>> Peter Krefting wrote:
>>> Since OpenOffice documents are just zipped xml files, I wondered how
>>> difficult it would be to create some hooks/hack git to track the files
>>> inside the archives instead?
>> You could write a "clean" filter that "recompresses" the archive with
>> level 0 upon git-add.
>>
>
> A couple of notes:
>
> 1) For Openoffice documents whose size is dominated by embedded images and
> other large objects, the git delta mechanism already performs reasonably
> well, since OO files are Zip archives where each file is compressed
> separately. If you do not change an image, then that image remains stored
> in the same way and the delta can be done.
>
> 2) For OO documents whose size is dominated by plain content, the git delta
> mechanism cannot work, since the zip compression introduces "mixing" and a
> small change in the document is converted into a very large change in the
> zip file.
>
> It could be possible to write a clean filter to uncompress before commit.
> However there is a trick with the complementary smudge filter to be used at
> checkout. If you do not smudge properly, git always shows the file as
> changed wrt the index. Smudging correctly would mean using the very same
> compression ratio and compress method that OO uses, which can be a little
> tricky. I have tried using the zip binary both in the clean and the smudge
> phases and it does not work nicely. The smudged file is always different
> from the original one. One should probably work at a lower level to have a
> finer control on what is happening (libzip) and prepend to the uncompressed
> file the compression parameters to be restored on smudging.
>
> The bigger issue is however that the clean/smudge thing can be really slow
> when dealing with large OO files.

I made similar observations when I experimented with tracking pdf and
sqlite (FF profile) files.
Problems encountered so far:

PDF: on compressing/uncompressing with pdftk there seems to be a random
order of objects. We need something bijective, i.e. a round trip that
reproduces the original file byte for byte.

sqlite files for FF profiles: uncompressing (i.e. dumping) and
recompressing gives something different from what FF writes. FF seems to
write out "holes" in the db to be filled in later.

I know, you and I will be told that git is not meant to track OO, PDF,
sql. Anyway, I think it all comes down to finding a strictly bijective and
reasonably efficient compress/uncompress pair.

It turns out that when I have a choice between tracking larger or smaller
formats, such as ps/dvi vs pdf, it's often better to track the larger one
if it's mostly clear text.

On a side note, gc'ing helps a lot with binary files.

Michael
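
For illustration, here is a minimal sketch of the "recompress with level 0" clean step suggested earlier in the thread, written in Python with the standard zipfile module. It is not anyone's tested filter: the function name store_zip, the filter name "zipstore" and the script name are made up, and the sketch assumes it is acceptable to drop zip extra fields and comments. It only covers the clean side. It does not solve the smudge problem described above, i.e. it makes no attempt to reproduce the exact bytes that OO wrote.

```python
import io
import sys
import zipfile

# Hypothetical wiring (filter and script names are made up for this sketch):
#   git config filter.zipstore.clean "python3 zipstore.py"
#   echo "*.odt filter=zipstore" >> .gitattributes


def store_zip(data: bytes) -> bytes:
    """Rewrite a zip archive so that every member is stored uncompressed."""
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():  # keep the original member order
            # Build a fresh ZipInfo that keeps only name, timestamp and
            # permission bits, so the output depends on little else than
            # the member contents. Extra fields and comments are dropped
            # (an assumption of this sketch).
            clean = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            clean.compress_type = zipfile.ZIP_STORED
            clean.external_attr = info.external_attr
            dst.writestr(clean, src.read(info))
    return out.getvalue()


def main() -> None:
    # git feeds the working-tree file on stdin and stores what we print.
    sys.stdout.buffer.write(store_zip(sys.stdin.buffer.read()))
```

A real filter script would call main() at the end. Since the members are stored as plain bytes, a small edit to content.xml should show up as a small change in the blob, which is what the delta mechanism needs. Whether this is fast enough for large documents is exactly the open question raised above.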