Re: How do you best store structured data in git repositories?

On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
<sebastianspublicaddress@xxxxxxxxxxxxxx> wrote:
> Do you store everything in a single file and configure git to use
> special diff- and merge-tools?
> Do you use XML for this purpose?

XML is terrible for most data storage purposes.  Data exchange, maybe,
but IMHO the best thing you can do when you get XML data is to put it
in some other format ASAP.

As it happens, I've been doing a project where we store a bunch of
stuff in csv format in git, and it works fairly well.  We made a
special merge driver that can merge csv data (based on knowing which
columns should be treated as the "primary key").
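
The core of a driver like that is only a screenful of code.  Here's a
toy version (not our actual driver; the key column name and the
conflict handling are simplified for the example):

#!/usr/bin/env python
# csvmerge.py -- toy three-way csv merge, keyed on a primary key column.
# git runs it as: csvmerge.py <ancestor> <ours> <theirs>, and expects
# the result to be written over <ours>.  Exit 0 = clean, nonzero =
# conflict.
import csv, sys

KEY = 'id'   # assumed primary key column name

def load(path):
    with open(path, newline='') as f:
        return {row[KEY]: row for row in csv.DictReader(f)}

def main(ancestor, ours, theirs):
    base, a, b = load(ancestor), load(ours), load(theirs)
    merged = dict(a)
    for key, row in b.items():
        if row == base.get(key):
            continue                 # unchanged on their side
        if key in a and a[key] != base.get(key) and a[key] != row:
            return 1                 # both sides changed it: conflict
        merged[key] = row            # take their addition/change
    for key in base:
        if key not in b:             # deleted on their side (a real
            merged.pop(key, None)    # driver would catch modify/delete
                                     # conflicts here too)
    fields = sorted({f for row in merged.values() for f in row})
    with open(ours, 'w', newline='') as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        for key in sorted(merged):   # canonical row order
            w.writerow(merged[key])
    return 0

if __name__ == '__main__':
    sys.exit(main(*sys.argv[1:4]))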

> Do you take care that the contents of your file is as stable as possible
> when it's saved or do you let your diff tools cope with issues like
> reordering, reassignment of identifiers (for example when identifiers
> are offsets in the file), ...?

A custom merge driver is better, by far, than the builtin ones (which
were designed for source code) if you have any kind of structured data
that you don't want to have to merge by hand.
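
For reference, wiring a driver like that into git takes just two bits
of configuration (here assuming the toy csvmerge.py above is on your
PATH):

# in .gitattributes:
*.csv merge=csvmerge

# in .git/config (or ~/.gitconfig):
[merge "csvmerge"]
        name = three-way csv merge
        driver = csvmerge.py %O %A %B

git substitutes %O, %A, and %B with temp files holding the ancestor,
current, and other versions of the file, and expects the driver to
leave the merged result in %A.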

That said, you should still try to make your files as stable
as possible, because:

- If your program outputs the data in random order, it's just being
sloppy anyway

- 'git diff' doesn't work usefully otherwise (for examining the data
and debugging)

Of course, all bets are off if your file is actually binary; merging
and diffing are mostly impossible unless you use a totally custom
engine.  And if your file contains byte offsets, then it's a binary
file, no matter what it looks like in your text editor.  Adding a byte
in the middle would turn such a file into complete nonsense, which is
not an attribute of a text file.

> Do you store one object/record per file (with filename=id, for example
> with GUID-s) and hope that git will not mess them up when it merges
> them?
>
> Do you store records as directories, with very small files which contain
> single attributes (because records can be considered sets of
> key-value-pairs and the same applies to directories)? Do you configure
> git to do a scalar merge on non-text "attributes" (with special file
> extensions)?

In git, you have to balance its different limitations against each
other.  If you have a tonne of small files, it'll take you longer to
retrieve a large amount of data.  If you have one huge file, git will
suck a lot of memory when repacking.  The best you can do is strike a
reasonable balance.

One trick that I've been using lately is to split large files
according to a rolling checksum:
http://alumnit.ca/~apenwarr/log/?m=200910#04

This generally keeps diffs useful, but keeps individual file sizes
down.  Obviously the implementation pointed to there is just a toy,
but the idea is sound.
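
The guts of it fit in a dozen lines, too.  A toy version in python
(the constants are invented, and a real implementation would want a
stronger rolling hash, like bup's):

# Toy content-defined splitting: keep a rolling hash of the last
# WINDOW bytes and cut a chunk wherever its low BITS bits are all
# ones, so boundaries depend only on nearby content.
WINDOW = 64               # bytes of context the hash "sees"
BITS = 13                 # all-ones in 13 bits => ~8KB average chunks
MASK = (1 << BITS) - 1
BASE, MOD = 257, 1 << 32  # polynomial rolling hash, kept in 32 bits
POW = pow(BASE, WINDOW, MOD)   # precomputed for removing old bytes

def split(data):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD
        if (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

Since the hash only looks at the last 64 bytes, inserting a byte in
the middle only changes the chunk it lands in; the boundaries
re-synchronize within one window and everything downstream stays
identical.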

> When you don't store everything in a single, binary file: Do you use git
> hooks to update an index for efficient queries on your structured data?
> Do you update the whole index for every change? Or do you use git hashes
> to decide which segment of your index needs to be updated?

We keep a separate index file that's not part of git.  When the git
repo is updated, we note which rows have changed, then update the
index.
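
Schematically it's something like this (heavily simplified: the table
layout is invented, and it reindexes whole files rather than
individual rows):

import csv, io, sqlite3, subprocess

def changed_files(old, new):
    out = subprocess.check_output(
        ['git', 'diff', '--name-only', old, new], text=True)
    return [p for p in out.splitlines() if p.endswith('.csv')]

def file_at(commit, path):
    # read a file's contents straight out of git at a given commit
    return subprocess.check_output(
        ['git', 'show', '%s:%s' % (commit, path)], text=True)

def update_index(db, old, new):
    # db = sqlite3.connect('index.db'), with tables
    # rows(file, id, data) and meta(last_commit) assumed to exist
    for path in changed_files(old, new):
        db.execute('DELETE FROM rows WHERE file = ?', (path,))
        for row in csv.DictReader(io.StringIO(file_at(new, path))):
            db.execute(
                'INSERT INTO rows (file, id, data) VALUES (?, ?, ?)',
                (path, row['id'], repr(row)))
    db.execute('UPDATE meta SET last_commit = ?', (new,))
    db.commit()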

Avery