Re: What's cooking in git.git (Jun 2016, #05; Thu, 16)

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 14 Jul 2016 11:17:44 -0700

Torsten Bögershausen <tboegi@xxxxxx> writes:

>> There's less redundant work going on than at first seems, because
>> .gitattribute files are only read once and cached. Verified by strace.
>>
> OK, I think I missed that work (not enough time for Git at the moment)
> Junio, please help me out, do we have a cache here now?

The call convert_attrs() makes to the attribute subsystem is:

    git_check_attr(path, NUM_CONV_ATTRS, ccheck)

Conceptually, this call has to do the following, and for the very
first call that is actually what it does:

 1. Read all the relevant attrubutes files.  If you are asking for
    "a/x/1", we need to read $GIT_DIR/info/attributes,
    ".gitattributes", "a/.gitattributes" and "a/x/.gitattributes".

 2. Find matching patterns that cover "a/x/1", and pick up the
    attribute definition from the above.

If you have asked for "a/x/1", it is very likely that you would next
ask for "a/x/2" (think: "git checkout a/x"), and we can realize that
exactly the same set of attributes files apply to that path.  So an
obvious optimization is to cache the result of the first step.

In addition, it is also likely that you would later ask for "a/y/3"
before asking for "b/z/4" (think: "git add .").  A part of the step
1. that was done when you asked for "a/x/1" and then was reused when
you asked for "a/x/2" can further be reused for this request, by
discarding only what was read from "a/x/.gitattributes" and reading
only from "a/y/.gitattributes".

The above two optimizations are done in prepare_attr_stack().

Unfortunately this is one of the three reasons why the attribute
subsystem is not thread-ready.  I.e. there is only one global cache,
so if you spawn two threads to speed up "git add ." by handing paths
[a-m]* to one and [n-z]* to the other, they would thrash the cache
and making it ineffective (even if we protect the cache with mutex,
which obviously has not been done).

I earlier started looking into this, but the effort stalled.  But
for a single-thread use, the attributes read from the filesystem are
cached, and the cache is designed to perform well as long as you
make requests in-order.

To make the attribute look-up thread-ready, the attribute cache
needs to become per-thread.

Orthogonal of the threading issue, there is another posssible
optimization that is not there yet.  The cache can be tied to what
is in ccheck[] to further reduce the size of the cache, making step
2. a lot cheaper.  Currently in step 1. we read and keep everything,
but if we tie the cache to the contents of ccheck[], we can read and
ignore entries we read in step 1. that does not talk about the
attributes ccheck[] is interested in.  My plan is to either

 (1) make the cache per-thread, limit the reading done in 1. to
     ccheck[], but invalidate the cache when a different set of
     attributes are asked; or

 (2) make the cache per <thread, ccheck[]>.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html