Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Mon, 6 May 2013 14:02:13 -0500

On Mon, May 6, 2013 at 7:31 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Sun, May 05, 2013 at 05:38:53PM -0500, Felipe Contreras wrote:
>
>> We don't care about blobs, or any object other than commits, but in
>> order to find the type of object, we are parsing the whole thing, which
>> is slow, especially in big repositories with lots of big files.
>
> I did a double-take on reading this subject line and first paragraph,
> thinking "surely fast-export needs to actually output blobs?".

If you think that, then you are not familiar with the code. From the
documentation:

--export-marks=<file>::
	Dumps the internal marks table to <file> when complete.
	Marks are written one per line as `:markid SHA-1`. Only marks
	for revisions are dumped; marks for blobs are ignored.

And the code that writes the marks file:

		if (deco->base && deco->base->type == 1) {
			mark = ptr_to_mark(deco->decoration);
			if (fprintf(f, ":%"PRIu32" %s\n", mark,
				sha1_to_hex(deco->base->sha1)) < 0) {
			    e = 1;
			    break;
			}
		}
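
To make the effect of that check concrete, here is a minimal standalone sketch of the same filter. It is not git's code: `struct mark_entry` and `dump_commit_marks()` are hypothetical names used only for illustration, and the enum mirrors git's object type numbering, where commits are type 1 (which is what the `type == 1` test above checks).

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mirrors git's numbering: a commit is type 1. */
enum obj_type { OBJ_COMMIT = 1, OBJ_TREE = 2, OBJ_BLOB = 3, OBJ_TAG = 4 };

/* Hypothetical stand-in for one entry of the internal marks table. */
struct mark_entry {
	int type;         /* one of enum obj_type */
	uint32_t mark;    /* the numeric mark id */
	const char *hex;  /* 40-character hex SHA-1 */
};

/*
 * Writes one ":markid SHA-1" line per commit entry into `out` and
 * returns the number of lines written. Non-commit entries are skipped,
 * so no blob mark ever reaches the output.
 */
static size_t dump_commit_marks(char *out, size_t len,
				const struct mark_entry *entries, size_t n)
{
	size_t used = 0, dumped = 0, i;

	assert(out && len > 0);
	out[0] = '\0';
	for (i = 0; i < n; i++) {
		int r;

		if (entries[i].type != OBJ_COMMIT)
			continue; /* blobs, trees and tags get no line */
		r = snprintf(out + used, len - used, ":%" PRIu32 " %s\n",
			     entries[i].mark, entries[i].hex);
		if (r < 0 || (size_t)r >= len - used) {
			out[used] = '\0'; /* drop the truncated line */
			break;
		}
		used += (size_t)r;
		dumped++;
	}
	return dumped;
}
```

Given a table holding a blob, a commit, a tree and another commit, only the two commit lines end up in the buffer, whatever marks the other objects were given.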

> Reading the patch, I see that this is only about not bothering to load
> blob marks from --import-marks. It might be nice to mention that in the
> commit message, which is otherwise quite confusing.

The commit message says it exactly like it is: we don't care about blobs.

If an object is not a commit, we *already* skip it. But as the commit
message already says, we do so by parsing the whole thing.
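
To illustrate why parsing the whole thing is unnecessary just to learn the type: an inflated loose object starts with a short header of the form `<type> <size>\0`, followed by the body. The sketch below is not the patch and not git's API; `type_from_header()` and `is_commit()` are hypothetical helpers, and the sketch assumes the buffer has already been inflated (it ignores zlib and packfiles entirely).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not git's API. Copies the type name from an
 * already-inflated loose object ("<type> <size>\0<body>") into `type`.
 * Only the bytes before the first space are examined; the body is
 * never read. Returns 0 on success, -1 if no type name is found.
 */
static int type_from_header(const char *buf, size_t len,
			    char *type, size_t type_len)
{
	size_t i;

	assert(buf && type && type_len > 0);
	for (i = 0; i < len && i < type_len - 1; i++) {
		if (buf[i] == ' ') {
			memcpy(type, buf, i);
			type[i] = '\0';
			return i > 0 ? 0 : -1;
		}
	}
	return -1;
}

/* Returns 1 if the header names a commit, 0 otherwise. */
static int is_commit(const char *buf, size_t len)
{
	char type[16];

	return type_from_header(buf, len, type, sizeof(type)) == 0 &&
	       strcmp(type, "commit") == 0;
}
```

However large the blob is, the decision is made from the first few bytes, which is the kind of shortcut that makes skipping non-commits cheap.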

> I'm also not sure why your claim "we don't care about blobs" is true,
> because naively we would want future runs of fast-export to avoid having
> to write out the whole blob content when mentioning the blob again.

Because it's pointless to keep hundreds of thousands of blob marks
that are *never* going to be used, just for the sake of an extremely
tiny minority that would be.

> Does that match your reasoning?

It doesn't matter; it has been that way since --export-marks was introduced.

>> Before this, loading the objects of a fresh emacs import with 260598
>> blobs took 14 minutes; after this patch, it takes 3 seconds.
>
> Presumably most of that speed improvement comes from not parsing the
> blob objects. I wonder if you could get similar speedups by applying the
> "do not bother parsing" rule from your patch 3. You would still incur
> some cost to create a "struct blob", but it may or may not be
> measurable.  That would mean we get the "case not worth worrying about"
> from above for free. I doubt it would make that big a difference,
> though, given the rarity of it. So I am OK with it either way.

How would I know if it's a blob or a commit, if not by the code this
patch introduces?

-- 
Felipe Contreras