In my company we generate test data that we want coupled with test code, and despite the size, we have historically kept our test data in our code base. This is becoming a problem: 95% of our 500 MB "code" base is actually test data, and the amount of test data is likely to grow, perhaps radically. We are contemplating files on the order of 500 MB apiece. Many of our developers have multiple copies of the code base checked out, duplicating the test data, so we would like a solution that minimizes the amount of data we have to check out.

Personally, I dislike having separate test-data and code repos; keeping the two synchronized seems like a real pain. I like to be able to do things like:

    cd component_x
    [muck muck muck on part "y"]
    mkdir testsuite/component_x.part_y
    cd testsuite/component_x.part_y
    [muck muck muck]
    git commit -a -m "Finished mucking with part y of component x"

where the directory structure is, essentially:

    component_x/
        testsuite/component_x.part_y

If we separate out the test data, the above would require two commits in two repos, switching directories, etc. (sketched in the P.S. below). And then there is the issue of ensuring that checkouts of the code also pull in the associated data. I can see this becoming a real nightmare.

Have others on the list grappled with this and come up with good solutions with git? I know there has been some talk of submodules, but I'm not sure whether that works yet or is even a viable option here.

Bill
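
P.S. To make the two-repo pain concrete, here is roughly what the dual-commit dance would look like with a separate test-data repo. This is just a sketch of what I mean; the "testdata" checkout location and commit messages are made up:

    cd component_x
    [muck muck muck on part "y"]
    git commit -a -m "Finished mucking with part y of component x"
    cd ../testdata/component_x.part_y    # sibling checkout of the data repo
    [muck muck muck]
    git commit -a -m "Matching test data for part y of component x"
    # ...and nothing records which code commit goes with which data commit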
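
P.P.S. For the submodule idea, my understanding (possibly wrong, so treat this as a sketch; the URLs and paths are made up) is that the setup would look something like:

    # in the superproject (the code repo), record the data repo at a path:
    git submodule add git://example.com/testdata.git testsuite
    git commit -m "Add test data as a submodule"

    # what a fresh checkout would then need:
    git clone git://example.com/code.git
    cd code
    git submodule init
    git submodule update    # fetches the pinned test-data revision

That at least records which test-data revision goes with which code revision, though it is still two commits in two repos when both change.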