Re: [PATCH 0/6] address packed-refs speed regressions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Apr 05, 2015 at 03:41:39PM +0200, René Scharfe wrote:

> > The main culprits seem to be d0f810f (which introduced some extra
> > expensive code for each ref) and my 10c497a, which switched from fgets()
> > to strbuf_getwholeline. It turns out that strbuf_getwholeline is really
> > slow.
> 
> 10c497a changed read_packed_refs(), which reads *all* packed refs.
> Each is checked for validity.  That sounds expensive if the goal is
> just to look up a single (non-existing) ref.
> 
> Would it help to defer any checks until a ref is actually accessed?
> Can a binary search be used instead of reading the whole file?

Yes, but addressing that is much more invasive.

Right now we parse all of the packed-refs file into an in-memory cache,
and then do single lookups from that cache. Doing an mmap() and a binary
search is way faster (and costs less memory) for doing individual
lookups. It relies on the list being sorted. This is generally true, but
not something we currently rely on (however, it would be easy to add a
"sorted" flag to top of the file and have the readers fall back when the
flag is missing). I've played with a patch to do this (it's not entirely
trivial, because you jump into the middle of a line, and then have to
walk backwards to find the start of the record).

For traversals, it's more complicated. Obviously if you are traversing
all refs, you have to read the whole thing anyway. If you are traversing
a subset of the refs, you can binary-search the start of the subset, and
then walk forward. But that's where it gets tricky with the current
code.

The ref_cache code expects to fill in from outer to inner. So if you
have "refs/foo", you should also have filled in all of "refs/" (but not
necessarily "refs/bar"). This matches the way we traverse loose ref
directories; we opendir "refs/", find out that it has "foo" and "bar",
and the descend into "foo", and so forth. But reading a subset of the
packed-ref file is "inside out". You fill in all of "refs/foo", but you
have no idea what else is in "refs/".

So going in that direction would involve some surgery to the ref_cache
code. It might even involve throwing it out entirely (i.e., just mmap
the packed-refs file and look through it directly, without any kind of
in-memory cache; we don't tend to do more than one ref-iteration per
program anyway, so I'm not sure the caching is buying us much anyway).
My big concern there would be that there are a lot of subtle race issues
between packed and loose refs, and the current state is the result of a
lot of tweaking. I'd be worried that a heavy rewrite there would risk
introducing subtle and rare corruptions.

Plus it would be a lot of work, which leads me to...

> I wonder if pluggable reference backends could help here.  Storing refs
> in a database table indexed by refname should simplify things.

...this. I think that effort might be better spent on a ref storage
format that's more efficient, simpler (with respect to subtle races and
such), and could provide other features (e.g., transactional atomicity).

The big plus side of packed-refs improvements is that they "just work"
without worrying about compatibility issues. But ref storage is local,
so I'm not sure how big a deal that is in practice.

> Short-term, can we avoid the getc()/strbuf_grow() dance e.g. by mapping
> the packed refs file?  What numbers do you get with the following patch?

It's about 9% faster than my series + the fgets optimization I posted
(or about 25% than using getc).  Which is certainly nice, but I was
really hoping to just make strbuf_getline faster for all callers, rather
than introducing special code for one call-site. Certainly we could
generalize the technique (i.e., a struct with the mmap data), but then I
feel we are somewhat reinventing stdio. Which is maybe a good thing,
because stdio has a lot of rough edges (as seen here), but it does feel
a bit like NIH syndrome.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]