Hi Jonathan,

(+CC: Git List, Junio)

This is the proposed design for a new "fast-export helper library".
Normally, this would be an RFC patch; however, I'm not happy with the
design, and I'd like some input before starting off.

The Problem
-----------
svn-fi, a program to convert a git-fast-import stream to a Subversion
dumpfile, has hit a dead end [1]. This is because it doesn't know how
to handle any `<dataref>` except 'inline' (a `<dataref>` appears
inside `filemodify`, `notemodify`, `ls` and `cat-blob` commands). The
other two kinds of `<dataref>` that exporters can produce are:

1. A mark reference (`:<idnum>`) set by a prior `blob` command
2. A full 40-byte SHA-1 of an existing Git blob object.

The most naive solution is to modify svn-fi to persist every blob in
memory as soon as it sees one after a `blob` command (this is the
approach that git2svn uses). However, this solution is both expensive
in memory and highly unscalable. Also, svn-fi's job is lightweight
parsing and conversion from one format to another; I don't want to
clutter it up with this additional complexity.

The alternative that svn-fi currently relies on is --inline-blobs [2],
a modification to git-fast-export so that it only ever produces
inlined blobs. However, this has severe drawbacks, the main one being
that every exporter must implement it for it to become accepted. You
also pointed out another problem: one blob may be referenced multiple
times in the same stream, especially when dealing with cherry-picks
and rebases (once branch support is added to svn-fi), and writing it
out explicitly that many times bloats the stream with a lot of
redundant data. At best, --inline-blobs can serve as a hint to
git-fast-export to minimize the work that the helper has to do.

Junio suggested a fast-import-filter that can convert a fast-import
stream from the current format to one that contains only inlined
blobs [3]. My proposal differs, because I don't like the idea of
having to parse the data twice and do the same error handling in two
different places (svn-fi and the fast-import-filter).

The library's API
-----------------
I've thought of building a small library which applications can link
to. The API is as follows:

  int write_blob(uint32_t, char *, size_t, FILE *);
  int fetch_blob_mark(uint32_t, struct strbuf *);
  int fetch_blob_sha1(char *sha1, struct strbuf *); /* sha1[20] */

The svn-fi parser should call write_blob when it encounters some data
that it wants to persist. The arguments are:

1. A mark through which the blob can later be recalled with
   fetch_blob_mark (optional: use 0 to omit).
2. The terminator, in the case of the delimited format. Should be
   NULL when the format is non-delimited.
3. In the case of the delimited format, the size of the delimiter
   itself; otherwise, the size of the blob itself.
4. The FILE * to parse the blob from, already seeked to the right
   position and ready to read.

The library then parses this data and dumps it into a storage backend
(described later) after computing its SHA-1. fetch_blob_mark and
fetch_blob_sha1 can then be used to fetch blobs by their mark or
SHA-1. Fetching a blob by its mark should be O(1), while locating an
exact SHA-1 will require a bisect of sorts: slightly better than
O(log n).
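To make the intended call sequence concrete, here's a rough sketch of
how svn-fi's parser might drive these three calls. Only the three
declarations above are the actual proposal: the two handle_* helpers
are hypothetical, the error handling is a placeholder, and I'm
assuming the usual 0-on-success return convention.

  #include "git-compat-util.h"  /* FILE, uint32_t, die() */
  #include "strbuf.h"

  int write_blob(uint32_t mark, char *delim, size_t len, FILE *in);
  int fetch_blob_mark(uint32_t mark, struct strbuf *sb);
  int fetch_blob_sha1(char *sha1, struct strbuf *sb); /* sha1[20] */

  /* Called after parsing `blob`, `mark :<idnum>` and a non-delimited
   * `data <count>` header: persist <count> bytes from the stream. */
  static void handle_blob(uint32_t mark, size_t count, FILE *in)
  {
          /* Non-delimited format: no terminator, so the second
           * argument is NULL and the third is the blob size. */
          if (write_blob(mark, NULL, count, in))
                  die("write_blob failed for mark :%u", (unsigned) mark);
  }

  /* Called when a later filemodify says `M 100644 :<idnum> <path>`:
   * recall the persisted blob by its mark, in O(1). */
  static void handle_markref(uint32_t mark)
  {
          struct strbuf blob = STRBUF_INIT;

          if (fetch_blob_mark(mark, &blob))
                  die("unknown mark :%u", (unsigned) mark);
          /* ... write blob.buf (blob.len bytes) into the SVN dumpfile ... */
          strbuf_release(&blob);
  }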
How the library works
---------------------
It maintains a sorted list of (SHA-1, off_t, off_t) triplets in a
buffer, along with a 256-entry fanout table (call this the blob_index)
-- this is mmap'ed, and only munmap'ed at the end of the program. When
write_blob is invoked, the blob is read from the FILE and its SHA-1 is
computed. The blob is then written to another big buffer (call this a
blob_buffer; there are many blob_buffers) after an inexpensive zlib
deflate, along with its deflated size, and its (SHA-1, offset1,
offset2) triplet is written to the blob_index -- the first number
refers to the blob_buffer number, and the second to the offset within
that blob_buffer. No doubt this is an expensive table to maintain, but
we don't have a choice in the matter -- there's nothing in the spec
preventing a dataref from referring to blobs by their marks and SHA-1s
interchangeably. For marks, there is another marks_buffer which stores
(uint32_t, off_t, off_t) triplets. (A rough sketch of these record
layouts is appended after the references.)

So, what do you think?

-- Ram

[1]: http://thread.gmane.org/gmane.comp.version-control.git/170290
[2]: http://thread.gmane.org/gmane.comp.version-control.git/170290/focus=170292
[3]: http://thread.gmane.org/gmane.comp.version-control.git/165237/focus=165289
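P.S.: Here's the record-layout sketch promised above. It's purely
illustrative: the struct and field names are mine, and I'm assuming
the fanout table behaves like the one in Git's pack .idx format
(fanout[i] = number of entries whose first SHA-1 byte is <= i).

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>

  /* One blob_index record, kept sorted by SHA-1 so that
   * fetch_blob_sha1() can bisect it. The two offsets mirror the
   * (SHA-1, off_t, off_t) triplet above: which blob_buffer the
   * deflated blob lives in, and where it starts within that buffer. */
  struct blob_index_entry {
          unsigned char sha1[20];
          off_t buffer_nr;
          off_t offset;
  };

  /* The mmap'ed index: a 256-entry fanout table in front of the
   * sorted entries; fanout[i] holds the number of entries whose
   * first SHA-1 byte is <= i. */
  struct blob_index {
          uint32_t fanout[256];
          struct blob_index_entry *entries;
          size_t nr;
  };

  /* Lookup: the fanout narrows the bisect to a single bucket, which
   * is what makes it "slightly better than O(log n)". */
  static struct blob_index_entry *lookup_sha1(struct blob_index *idx,
                                              const unsigned char *sha1)
  {
          uint32_t lo = sha1[0] ? idx->fanout[sha1[0] - 1] : 0;
          uint32_t hi = idx->fanout[sha1[0]];

          while (lo < hi) {
                  uint32_t mi = lo + (hi - lo) / 2;
                  int cmp = memcmp(idx->entries[mi].sha1, sha1, 20);
                  if (!cmp)
                          return &idx->entries[mi];
                  if (cmp < 0)
                          lo = mi + 1;
                  else
                          hi = mi;
          }
          return NULL;
  }

  /* One marks_buffer record: the (uint32_t, off_t, off_t) triplet.
   * Marks are small dense integers, so an array indexed by mark
   * would give the O(1) fetch_blob_mark lookup. */
  struct mark_entry {
          uint32_t mark;
          off_t buffer_nr;
          off_t offset;
  };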