Re: Git as electronic lab notebook

"Ciprian Dorin, Craciun" <ciprian.craciun@xxxxxxxxx> · Sat, 19 Dec 2009 15:38:32 +0200



On Sat, Dec 19, 2009 at 2:23 PM, Thomas Johnson
<thomas.j.johnson@xxxxxxxxx> wrote:
> Hello group,
>
> I've been using git on a few different projects over the last couple of months,
> and as a former svn user I really like it. Recently, I've been using it as an
> 'electronic lab notebook' for an empirical project. My workflow looks like this:
> 1. Start with the stable code base on head
> 2. Create  and change to branch 'Experiment123'
> 3. Make some changes
> 4. Run the program, which generates a giant (10MB-4G) output text file,
> Experiment123.log. Update my LabNotebook.txt file.
> 5. Were the new changes helpful?
> 5.yes: Bzip Experiment123.log, and commit it on the branch. Merge the
> Experiment123 branch to head and goto 1.
> 5.no: Bzip Experiment123.log, and commit it on the branch. Merge LabNotebook.txt
> and Experiment123.log back to head. Switch back to head and goto 1.
>
> The thing is, Experiment123.log is going to be very similar to Experiment122.log
> and Experiment124.log except for a few details. My understanding is that git is
> great at compressing groups of files like this, is that correct? Should I not be
> bzipping them myself? On the other hand, I don't want HEAD to contain hundreds
> of gigs of uncompressed files that bzip down to only a few hundred megs.
>
> Any thoughts on the workflow itself would also be very welcome.


    I have used a similar workflow myself for parametric studies
on some genetic algorithms, and below are my observations related to
your question:
    * saving the entire log file (either zipped or not) in the
repository has some drawbacks when cloning the repository; (in my
setup I ran the tests in parallel on a different machine, and used Git
to synchronize between the development machine and the test machine;)
the problem is that when I wanted to "clean" the test machine and
start over I had to clone the repository, which also held all the
unneeded log files;
    * (actually I've used two Git repositories -- one for the actual
source code where I make the commits by hand, and another one which I
use for the synchronization;)
    * even if you prefer having the logs, it's best to let Git handle
the compression; even if only some small parts of the original txt
file change, I would guess that the BZip-ped file looks quite
different, so Git cannot store it as a small delta against the
previous log;
    * maybe it would be better, instead of keeping the experiment
log, to keep just a summary of it (only the important stuff); and even
if you do need the entire log, you could always recreate it by running
the code again; (this was the road I took in the end, by keeping a
small SQLite database of each experiment;)
    * (and of course there is also another little trick I've used:
just put the log files in a `log` directory which is "git-ignored";
that way you can switch between branches, and Git won't touch the
`log` directory unless you force it by issuing `git clean -f -d -x`;)
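
    To make the compression point above a bit more concrete, here is
a minimal sketch in Python (my choice for illustration; nothing in the
thread prescribes it). It builds a synthetic log in a made-up
`step=... fitness=...` format, changes a single character, and then
compares the two versions byte by byte, both raw and after bzip2. The
helper names `make_log` and `positional_similarity` are hypothetical,
and position-by-position comparison is only a crude stand-in for what
a real delta algorithm does.

```python
import bz2
import random


def make_log(seed: int, lines: int = 20000) -> bytes:
    """Build a synthetic experiment log (hypothetical format, ~580 KB)."""
    rng = random.Random(seed)
    rows = [f"step={i:06d} fitness={rng.random():.6f}\n" for i in range(lines)]
    return "".join(rows).encode("ascii")


def positional_similarity(a: bytes, b: bytes) -> float:
    """Fraction of byte positions at which a and b hold the same value."""
    longest = max(len(a), len(b))
    same = sum(x == y for x, y in zip(a, b))
    return same / longest


log_a = make_log(seed=1)

# Same log with exactly one character changed near the start
# (index 5 is the first digit of the step counter on the first line).
changed = bytearray(log_a)
changed[5] = ord("9")
log_b = bytes(changed)

raw_similarity = positional_similarity(log_a, log_b)
bz_similarity = positional_similarity(bz2.compress(log_a), bz2.compress(log_b))

print(f"raw text:   {raw_similarity:.4%} of byte positions identical")
print(f"bzip2 data: {bz_similarity:.4%} of byte positions identical")
```

    The raw logs differ in one byte only, while the compressed
streams would be expected to agree in far fewer positions; that is the
situation in which Git has little shared content left to exploit
between successive compressed logs.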
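
    The idea of keeping only a summary in a small SQLite database
could look roughly like the sketch below. Everything specific in it is
an assumption made for illustration: the `step=... fitness=...` log
format, the `experiments` table and its columns, and the choice of
count, best and mean as "the important stuff". It is not the schema
actually used in the setup described above, and it uses an in-memory
database where a real setup would pass a file path.

```python
import sqlite3


def summarize(log_lines):
    """Reduce a log to (number of steps, best fitness, mean fitness)."""
    values = [
        float(line.split("fitness=")[1])
        for line in log_lines
        if "fitness=" in line
    ]
    return len(values), max(values), sum(values) / len(values)


def record(conn, name, log_lines):
    """Store one summary row per experiment (hypothetical schema)."""
    steps, best, mean = summarize(log_lines)
    conn.execute(
        "INSERT OR REPLACE INTO experiments "
        "(name, steps, best_fitness, mean_fitness) VALUES (?, ?, ?, ?)",
        (name, steps, best, mean),
    )
    conn.commit()


# In-memory database for the example; use a file name to keep the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS experiments ("
    "name TEXT PRIMARY KEY, steps INTEGER, "
    "best_fitness REAL, mean_fitness REAL)"
)

sample_log = ["step=000000 fitness=0.25\n", "step=000001 fitness=0.75\n"]
record(conn, "Experiment123", sample_log)

row = conn.execute(
    "SELECT steps, best_fitness, mean_fitness FROM experiments WHERE name = ?",
    ("Experiment123",),
).fetchone()
print(row)
```

    A database like this stays small no matter how large the logs
are, and the full log can still be regenerated by rerunning the code
from the corresponding commit.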

    Hope I've been useful,
    Ciprian.