On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
<sebastianspublicaddress@xxxxxxxxxxxxxx> wrote:
> Do you store everything in a single file and configure git to use
> special diff- and merge-tools?
> Do you use XML for this purpose?

XML is terrible for most data storage purposes. Data exchange, maybe,
but IMHO the best thing you can do when you get XML data is to put it
into some other format ASAP.

As it happens, I've been working on a project where we store a bunch
of stuff in csv format in git, and it works fairly well. We made a
special merge driver that can merge csv data (based on knowing which
columns should be treated as the "primary key"). There's a rough
sketch of such a driver at the end of this message.

> Do you take care that the contents of your file is as stable as possible
> when it's saved or do you let your diff tools cope with issues like
> reordering, reassignment of identifiers (for example when identifiers
> are offsets in the file), ...?

A custom merge driver is better, by far, than the built-in ones (which
were designed for source code) if you have any kind of structured data
that you don't want to have to merge by hand.

That said, you should still try to make your files as stable as
possible, because:

 - if your program outputs the data in random order, it's just being
   sloppy anyway;

 - 'git diff' doesn't work usefully otherwise (for examining the data
   and debugging).

Of course, all bets are off if your file is actually binary; merging
and diffing are mostly impossible unless you use a totally custom
engine. And if your file contains byte offsets, then it's a binary
file, no matter what it looks like in your text editor. Adding a byte
in the middle would turn such a file into complete nonsense, which is
not how a text file behaves.

> Do you store one object/record per file (with filename=id, for example
> with GUID-s) and hope that git will not mess them up when it merges
> them?
>
> Do you store records as directories, with very small files which contain
> single attributes (because records can be considered sets of
> key-value-pairs and the same applies to directories)? Do you configure
> git to do a scalar merge on non-text "attributes" (with special file
> extensions)?

In git, you have to balance between its different limitations. If you
have a tonne of small files, it'll take you longer to retrieve a large
amount of data. If you have one big huge file, git will suck a lot of
memory when repacking. The best you can do is strike a reasonable
balance.

One trick that I've been using lately is to split large files
according to a rolling checksum:

  http://alumnit.ca/~apenwarr/log/?m=200910#04

This generally keeps diffs useful, but keeps individual file sizes
down. Obviously the implementation pointed to there is just a toy, but
the idea is sound. There's a toy splitter sketch at the end of this
message too.

> When you don't store everything in a single, binary file: Do you use git
> hooks to update an index for efficient queries on your structured data?
> Do you update the whole index for every change? Or do you use git hashes
> to decide which segment of your index needs to be updated?

We keep a separate index file that's not part of git. When the git
repo is updated, we note which rows have changed, then update the
index.

Avery
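
P.S. Here's a minimal sketch of the kind of csv merge driver I mean,
assuming the first column is the primary key and every file has a
header row. It's not the driver we actually use, and real conflict
handling needs to be smarter than "keep our version and exit nonzero",
but it shows the shape of the thing. Git hands the driver the ancestor
(%O), our version (%A) and their version (%B), expects the merged
result to be written back over %A, and treats a nonzero exit status as
a conflict:

    #!/usr/bin/env python
    # Hook it up with:
    #   .gitattributes:  *.csv merge=csvmerge
    #   git config merge.csvmerge.driver "csv-merge.py %O %A %B"
    import csv
    import sys

    def load(path):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []
        return header, {r[0]: r for r in rows[1:]}   # key -> whole row

    def merge(base_path, ours_path, theirs_path):
        _, base = load(base_path)
        header, ours = load(ours_path)
        _, theirs = load(theirs_path)
        merged, conflict = {}, False
        for key in set(base) | set(ours) | set(theirs):
            b, o, t = base.get(key), ours.get(key), theirs.get(key)
            if o == t:            # same on both sides (or deleted on both)
                pick = o
            elif o == b:          # only their side changed this row
                pick = t
            elif t == b:          # only our side changed this row
                pick = o
            else:                 # both sides changed it differently
                conflict = True
                pick = o
            if pick is not None:
                merged[key] = pick
        with open(ours_path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)
            for key in sorted(merged):
                w.writerow(merged[key])
        return 1 if conflict else 0

    if __name__ == "__main__":
        sys.exit(merge(*sys.argv[1:4]))

Writing the rows out sorted by key also keeps the file "stable" in the
sense above: two people adding different rows won't shuffle the rest
of the file around.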
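
And here's a toy illustration of the rolling-checksum splitting idea.
It's not the implementation from the blog post above; the window size,
chunk size and file naming are arbitrary choices. The point is that
split points depend only on the last few bytes of content, so inserting
data near the start of a big file only moves the chunk boundaries near
the insertion point, and most of the chunk files stay byte-identical:

    # rsync-style rolling sum over the last WINDOW bytes; we split
    # wherever its low CHUNK_BITS bits are all ones, giving chunks of
    # roughly 2**CHUNK_BITS bytes on average.
    WINDOW = 64
    CHUNK_BITS = 13
    MASK = (1 << CHUNK_BITS) - 1

    def split_chunks(data):
        chunks, start = [], 0
        a = b = 0
        window = []
        for i, x in enumerate(data):
            if len(window) == WINDOW:
                out = window.pop(0)
                a -= out
                b -= WINDOW * out
            window.append(x)
            a += x
            b += a
            if (b & MASK) == MASK and i + 1 - start >= WINDOW:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    if __name__ == "__main__":
        with open("bigfile.csv", "rb") as f:
            for n, chunk in enumerate(split_chunks(f.read())):
                with open("bigfile.csv.%04d" % n, "wb") as out:
                    out.write(chunk)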
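
Finally, a hypothetical sketch of the index-update step, assuming the
index lives in sqlite, the first csv column is the key, and the script
runs right after a pull (so ORIG_HEAD..HEAD is the update you just
received). Our real setup tracks individual changed rows; this version
just re-indexes every row of each changed file, which is cruder but
shows the idea:

    import sqlite3
    import subprocess

    def changed_csv_files(old, new):
        # Ask git which files differ between the two commits.
        out = subprocess.check_output(["git", "diff", "--name-only", old, new])
        return [p for p in out.decode().splitlines() if p.endswith(".csv")]

    def reindex(db, rev, path):
        # Read the new version of the file straight out of git.
        blob = subprocess.check_output(["git", "show", "%s:%s" % (rev, path)])
        db.execute("DELETE FROM rows WHERE file = ?", (path,))
        for line in blob.decode().splitlines()[1:]:   # skip the header row
            key = line.split(",", 1)[0]               # first column as key
            db.execute("INSERT INTO rows (file, key, line) VALUES (?, ?, ?)",
                       (path, key, line))

    if __name__ == "__main__":
        db = sqlite3.connect("index.db")
        db.execute("CREATE TABLE IF NOT EXISTS rows"
                   " (file TEXT, key TEXT, line TEXT)")
        for path in changed_csv_files("ORIG_HEAD", "HEAD"):
            reindex(db, "HEAD", path)
        db.commit()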