On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope <me@xxxxxxxxxxxxxx> wrote:
> As a software developer I've used git for years and have found it the
> perfect solution for source control.
>
> Lately I have found myself using git in a unique use case - modifying
> DNA/RNA sequences and storing them in git, which are essentially
> software/source code for cells/life. For bacteria and viruses the
> repos are very small (<10 MB) and compress nicely.
>
> However, on the extreme end of the spectrum a human genome can run in
> at 50 GB, or say ~1 GB per file/chromosome.

Interesting. Unfortunately, not everything is used like source code. :)

Git does source code well. I don't know enough to judge whether DNA/RNA
sequence storage is similar enough to source code to benefit from things
like `git log -p` showing deltas over time, or whether some other
algorithm would be more effective.

> From my understanding the largest problem revolves around git's delta
> discovery method, holding 2 files in memory at once - is there a
> reason this could not be adapted to page/chunk the data in a sliding
> window fashion?

During delta discovery Git holds about 11 files in memory at once. One,
T, is the target file that you are trying to delta compress. The other
10 are in a window, and Git compares T to each of them in turn,
selecting the file that produces the smallest delta instruction sequence
to recreate T. Because T is compared to about 10 other files (the window
size is tunable), Git needs a full copy of T in memory for the entire
compare step.

For any single compare, T is scanned through only once, so if you were
doing a single compare (window size of "1"), T could stay on disk and be
paged through sequentially. The files in the window, however, need to be
held entirely in memory, along with a matching index.

The actual delta compression algorithm is a Rabin-Karp sliding window
hash function. Copies can be made from any part of the source file with
no regard to ordering, which makes paging/chunking the source file at
both compression and decompression time nearly impossible. Git jumps
around the source file many times, but this allows for efficient storage
of moves of long sequences within a file (e.g. moving function foo()
later in the file).

Maybe if you limited the window to 1 and limited the hash function to
avoid backing up in the source file so it could be paged, you could get
somewhere. But you mentioned the files are O(1 GiB). Just buy more RAM?
Modern workstations have pretty good memory capacity.
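
To make the window step concrete, here is a rough sketch of the selection
loop. This is not git's code: pick_delta_base() and delta_cost() are
made-up names, and the cost metric is a toy stand-in for the real delta
sizing done in diff-delta.c. Only the "compare T against every window
entry, keep the cheapest" shape is the point.

#include <stddef.h>

struct candidate {
    const unsigned char *buf;
    size_t size;
};

/* toy cost: bytes that differ, plus whatever one side has extra; the real
 * thing is the size of the copy/insert instruction stream (diff-delta.c) */
static size_t delta_cost(const struct candidate *src,
                         const unsigned char *trg, size_t trg_size)
{
    size_t n = src->size < trg_size ? src->size : trg_size;
    size_t cost = (src->size > trg_size ? src->size : trg_size) - n;
    for (size_t i = 0; i < n; i++)
        cost += (src->buf[i] != trg[i]);
    return cost;
}

/*
 * Compare T against every entry in the window and return the index of the
 * cheapest base, or -1 if storing T whole is cheaper.  T is read once per
 * candidate, i.e. about 10 times with the default window, which is why it
 * has to stay resident for the whole step.
 */
int pick_delta_base(const struct candidate *window, int window_size,
                    const unsigned char *trg, size_t trg_size,
                    size_t *best_cost)
{
    int best = -1;
    *best_cost = trg_size;
    for (int i = 0; i < window_size; i++) {
        size_t c = delta_cost(&window[i], trg, trg_size);
        if (c < *best_cost) {
            *best_cost = c;
            best = i;
        }
    }
    return best;
}

(The "window size is tunable" bit above is pack.window, or --window on
git repack.)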
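
And a toy illustration of the copy detection, to show why the source side
can't be paged either. Again, this is not diff-delta.c: the hash here is a
plain polynomial recomputed at each position rather than a rolled Rabin
fingerprint, and the window/table sizes are invented. What it does show is
that copy instructions can point at any source offset, in any order, so
the whole source has to stay addressable during both compression and
decompression.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define W     16          /* match window width, invented for the toy */
#define TSIZE (1 << 16)   /* hash buckets, also invented               */

/* plain polynomial hash of W bytes; git rolls a Rabin fingerprint
 * incrementally instead of recomputing like this */
static uint32_t window_hash(const unsigned char *p)
{
    uint32_t h = 0;
    for (int i = 0; i < W; i++)
        h = h * 31 + p[i];
    return h;
}

static void delta_sketch(const unsigned char *src, size_t ssize,
                         const unsigned char *trg, size_t tsize)
{
    static long index[TSIZE];
    memset(index, -1, sizeof(index));

    /* remember where each W-byte block of the source lives */
    for (size_t i = 0; i + W <= ssize; i += W)
        index[window_hash(src + i) % TSIZE] = (long)i;

    /* walk the target: hash hit -> copy from the source, miss -> literal */
    for (size_t i = 0; i < tsize; ) {
        long s = (i + W <= tsize) ? index[window_hash(trg + i) % TSIZE] : -1;
        if (s >= 0 && !memcmp(src + s, trg + i, W)) {
            size_t len = W;             /* extend the verified match */
            while ((size_t)s + len < ssize && i + len < tsize &&
                   src[s + len] == trg[i + len])
                len++;
            printf("copy   src@%ld len %zu\n", s, len);
            i += len;
        } else {
            printf("insert 0x%02x\n", trg[i]);
            i++;
        }
    }
}

int main(void)
{
    /* same two 16-byte blocks, in a different order */
    const unsigned char *a = (const unsigned char *)
        "ACGTACGTACGTACGTTTTTGGGGCCCCAAAA";
    const unsigned char *b = (const unsigned char *)
        "TTTTGGGGCCCCAAAAACGTACGTACGTACGT";
    delta_sketch(a, strlen((const char *)a), b, strlen((const char *)b));
    return 0;
}

Running it prints two copy instructions that reference the source out of
order, which is exactly the access pattern that defeats sequential paging
of the source file.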