Hi,

Ramkumar Ramachandra wrote:

> The other two kinds of `<dataref>` that exporters can produce are:
> 1. A mark reference (`:<idnum>`) set by a prior `blob` command
> 2. A full 40-byte SHA-1 of an existing Git blob object.

The above is very git-specific --- arbitrary foreign vcs-es are
unlikely to all use 40-byte hashes as <dataref>.  So far I've been
assuming that a <dataref> is sufficiently "nice" (not containing
spaces, NULs, quotation marks, or newlines, nor starting with a
colon).  It would be better to come up with a more formal rule and
document it.

> The most naive solution will involve modifying svn-fi to persist blobs
> in-memory as soon as it sees one after a `blob` command (this is the
> approach that git2svn uses). However, this solution is both expensive
> in memory and highly unscalable.

I suppose it also means trouble with large (>4 GiB) files on 32-bit
architectures.  Of course git proper does not support those anyway.

[...]
> The other alternative that svn-fi currently uses: --inline-blobs [2].
> This is a modification to the git-fast-export so that it only ever
> produces inlined blobs. However, this has severe drawbacks

[... explanation of drawbacks to relying on "inline" data only ...]

> In the best
> case, this can simply be a way to hint the git-fast-export to minimize
> the work that the helper has to do.

Yes.

> The library's API
> -----------------
> I've thought of building a sort of library which applications can link
> to. The API is as follows:
>
> int write_blob(uint32_t, char *, size_t, FILE *);
> int fetch_blob_mark(uint32_t, struct strbuf *);
> int fetch_blob_sha1(char *sha1, struct strbuf *); /* sha1[20] */

Hmm.  What the first function does is not obvious without argument
names.  I had guessed it would write a blob to a file.  More nits
(rough sketch below):

 - the type of marks in fast-import is uintmax_t (yech) --- they're
   not restricted to 32-bit integers as far as I know.

 - missing "const" before char *. :)

 - lengths of subsets of files should be of type off_t, if we want to
   allow 64-bit lengths on 32-bit architectures.  If this is
   git-specific, it would be more consistent to use "unsigned long".

 - the size argument that toggles between the size of a delimiter and
   the length of the file seems strange to me.  I'd rather have two
   separate functions.

 - this requires the frontend to think in terms of git blob object
   names, which doesn't fit well with most applications I'm thinking
   of (think: fast-import backends in C that, if written in python,
   would use python-fastimport).

I assume the delimited format works as in fast-import's "data"
command (and only supports blobs ending with LF)?

[...]
> The library then parses this data and dumps it into a storage backend
> (described later) after computing its SHA1.

Right, so a nice aspect of this exercise is that the nature of the
key/value store is hidden from the caller.

> fetch_blob_mark and fetch_blob_sha1 can then be used to fetch blobs
> using their mark or SHA1. Fetching blobs using their mark should be
> O(1), while locating the exact SHA1 will require a bisect of sorts:
> slightly better than O(log (n)).

http://fanf.livejournal.com/101174.html

> How the library works

I wonder if it would be sensible to make it run as a separate
process.  The upside: writing to and from pipes is easy in a variety
of programming languages (including the shell), even easier than
calling C code.  So in particular that would make testing it easier.
But performance considerations might outweigh that.

I also wonder if it is possible or makes sense to make the API less
git-specific.
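To make the nits and the "less git-specific" musing above a little
more concrete, here is roughly the shape I would have expected.  All
names and types here are invented on the spot --- a sketch of how I
read the intent, not a counter-proposal:

        /*
         * Store a blob under an arbitrary <dataref> key, reading
         * exactly <len> bytes from <in>.
         */
        int store_blob(const char *dataref, FILE *in, off_t len);

        /*
         * Same, but the blob is terminated by <delim>, as in
         * fast-import's delimited "data" format.
         */
        int store_blob_delimited(const char *dataref, FILE *in,
                                 const char *delim);

        /* Retrieve a previously stored blob into <sb>. */
        int fetch_blob(const char *dataref, struct strbuf *sb);

With something along those lines the frontend would not need to care
whether a particular <dataref> happens to be a mark or a 40-hex-digit
object name --- it would just be an opaque key.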
If the buffers were in-memory, something like

        set(key, value);
        value = get(key);

would do.  Since they are not, maybe something vaguely like

        FILE *f = kvstore_fopen(key, O_WRONLY);
        fwrite(value, sz, 1, f);
        kvstore_fclose(f);

        FILE *f = kvstore_fopen(key, O_RDONLY);
        strbuf_fread(&value, SIZE_MAX, f);
        kvstore_fclose(f);

would be something to aim for.  For the getter case, fmemopen is
portable (in case one wants to just put the value in memory) but
fopencookie (in case one doesn't) is not, so the idea does not work
as nicely as one might like.

And it's not quite the right abstraction --- for a fast-import
backend, I suppose the operations needed are (rough sketch in the
postscript below):

 * get length

 * dump the value to a caller-specified FILE * or fd

 * let the caller read the value one chunk or line at a time to
   transform it (e.g., to escape special characters)

Is there prior art that this could mimic or reuse (so we can learn
from others' mistakes and make sure the API feels familiar)?

Hope that helps,
Jonathan
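P.S. To put the "operations needed" list above in code, I am picturing
very roughly the following --- again, every name here is made up:

        /* Total size, in bytes, of the value stored under <key>. */
        off_t kvstore_value_length(const char *key);

        /* Copy the whole value to an already-open stream. */
        int kvstore_dump(const char *key, FILE *out);

        /*
         * Read the value incrementally, so the caller can transform
         * it (e.g., escape special characters) as it goes.
         */
        struct kvstore_reader *kvstore_open_read(const char *key);
        ssize_t kvstore_read(struct kvstore_reader *reader,
                             char *buf, size_t len);
        void kvstore_close(struct kvstore_reader *reader);

A caller that only needs to stream a blob verbatim would use
kvstore_dump(); one that needs to escape or reencode the data would
loop over kvstore_read() instead.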