On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope <me@xxxxxxxxxxxxxx> wrote:
> As a software developer I've used git for years and have found it the
> perfect solution for source control.
>
> Lately I have found myself using git in a unique use case - modifying
> DNA/RNA sequences and storing them in git, which are essentially
> software/source code for cells/life. For bacteria and viruses the
> repos are very small (<10 MB) and compress nicely.
>
> However, on the extreme end of the spectrum a human genome can run in
> at 50 GB, or say ~1 GB per file/chromosome.

Interesting. Unfortunately, not everything is used like source code. :)

Git does source code well. I don't know enough to judge whether DNA/RNA
sequence storage is similar enough to source code to benefit from things
like `git log -p` showing deltas over time, or whether some other
algorithm would be more effective.

> From my understanding the largest problem revolves around git's delta
> discovery method, holding 2 files in memory at once - is there a
> reason this could not be adapted to page/chunk the data in a sliding
> window fashion?

During delta discovery Git holds about 11 files in memory at once. One,
T, is the target file that you are trying to delta compress. The other
10 are in a window, and Git compares T to each of them in turn,
selecting the file that produces the smallest delta instruction sequence
to recreate T. Because T is compared to about 10 other files (the window
size is tunable), Git needs a full copy of T in memory for the entire
compare step.

For any single compare, T is scanned through only once, so if you were
doing a single compare (window size of "1"), T could stay on disk and be
paged through sequentially. The files in the window, however, need to be
held entirely in memory, along with a matching index.

The actual delta compression algorithm is a Rabin-Karp sliding window
hash function. Copies can be made from any part of the source file with
no regard to ordering, which makes paging/chunking the source file at
both compression and decompression time nearly impossible. Git jumps
around the source file many times, but this allows for efficient storage
of moves of long sequences within a file (e.g. moving function foo()
later in the file).

Maybe if you limited the window to 1 and limited the hash function to
avoid backing up in the source file so it could be paged, you could get
somewhere. But you mentioned the files are O(1 GiB). Just buy more RAM?
Modern workstations have pretty good memory capacity.
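
To make the window step concrete, here is a rough sketch of the selection
loop. This is not git's code: pick_delta_base() and delta_cost() are
made-up names, and the cost metric is a toy stand-in for the real delta
sizing done in diff-delta.c. Only the "compare T against every window
entry, keep the cheapest" shape is the point.

#include <stddef.h>

struct candidate {
    const unsigned char *buf;
    size_t size;
};

/* toy cost: bytes that differ, plus whatever one side has extra; the real
 * thing is the size of the copy/insert instruction stream (diff-delta.c) */
static size_t delta_cost(const struct candidate *src,
                         const unsigned char *trg, size_t trg_size)
{
    size_t n = src->size < trg_size ? src->size : trg_size;
    size_t cost = (src->size > trg_size ? src->size : trg_size) - n;
    for (size_t i = 0; i < n; i++)
        cost += (src->buf[i] != trg[i]);
    return cost;
}

/*
 * Compare T against every entry in the window and return the index of the
 * cheapest base, or -1 if storing T whole is cheaper.  T is read once per
 * candidate, i.e. about 10 times with the default window, which is why it
 * has to stay resident for the whole step.
 */
int pick_delta_base(const struct candidate *window, int window_size,
                    const unsigned char *trg, size_t trg_size,
                    size_t *best_cost)
{
    int best = -1;
    *best_cost = trg_size;
    for (int i = 0; i < window_size; i++) {
        size_t c = delta_cost(&window[i], trg, trg_size);
        if (c < *best_cost) {
            *best_cost = c;
            best = i;
        }
    }
    return best;
}

(The "window size is tunable" bit above is pack.window, or --window on
git repack.)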
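
And a toy illustration of the copy detection, to show why the source side
can't be paged either. Again, this is not diff-delta.c: the hash here is a
plain polynomial recomputed at each position rather than a rolled Rabin
fingerprint, and the window/table sizes are invented. What it does show is
that copy instructions can point at any source offset, in any order, so
the whole source has to stay addressable during both compression and
decompression.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define W     16          /* match window width, invented for the toy */
#define TSIZE (1 << 16)   /* hash buckets, also invented               */

/* plain polynomial hash of W bytes; git rolls a Rabin fingerprint
 * incrementally instead of recomputing like this */
static uint32_t window_hash(const unsigned char *p)
{
    uint32_t h = 0;
    for (int i = 0; i < W; i++)
        h = h * 31 + p[i];
    return h;
}

static void delta_sketch(const unsigned char *src, size_t ssize,
                         const unsigned char *trg, size_t tsize)
{
    static long index[TSIZE];
    memset(index, -1, sizeof(index));

    /* remember where each W-byte block of the source lives */
    for (size_t i = 0; i + W <= ssize; i += W)
        index[window_hash(src + i) % TSIZE] = (long)i;

    /* walk the target: hash hit -> copy from the source, miss -> literal */
    for (size_t i = 0; i < tsize; ) {
        long s = (i + W <= tsize) ? index[window_hash(trg + i) % TSIZE] : -1;
        if (s >= 0 && !memcmp(src + s, trg + i, W)) {
            size_t len = W;             /* extend the verified match */
            while ((size_t)s + len < ssize && i + len < tsize &&
                   src[s + len] == trg[i + len])
                len++;
            printf("copy   src@%ld len %zu\n", s, len);
            i += len;
        } else {
            printf("insert 0x%02x\n", trg[i]);
            i++;
        }
    }
}

int main(void)
{
    /* same two 16-byte blocks, in a different order */
    const unsigned char *a = (const unsigned char *)
        "ACGTACGTACGTACGTTTTTGGGGCCCCAAAA";
    const unsigned char *b = (const unsigned char *)
        "TTTTGGGGCCCCAAAAACGTACGTACGTACGT";
    delta_sketch(a, strlen((const char *)a), b, strlen((const char *)b));
    return 0;
}

Running it prints two copy instructions that reference the source out of
order, which is exactly the access pattern that defeats sequential paging
of the source file.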