[PATCH v2 00/12] revamping git_check_attr() API

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 16 May 2016 14:05:33 -0700

A common pattern to check N attributes for many paths is to

 (1) prepare an array A of N git_attr_check_elem items;
 (2) call git_attr() to intern the N attribute names and fill A;
 (3) repeatedly call git_check_attrs() for path with N and A;

A look-up for these N attributes for a single path P scans the
entire attr_stack, starting from the .git/info/attributes file and
then .gitattributes file in the directory the path P is in, going
upwards to find .gitattributes file found in parent directories.

An earlier commit 06a604e6 (attr: avoid heavy work when we know the
specified attr is not defined, 2014-12-28) tried to optimize out
this scanning for one trivial special case: when the attribute being
sought is known not to exist, we do not have to scan for it.  While
this may be a cheap and effective heuristic, it would not work well
when N is (much) more than 1.

What we would want is a more customized way to skip irrelevant
entries in the attribute stack, and the definition of irrelevance
is tied to the set of attributes passed to git_check_attrs() call,
i.e. the set of attributes being sought.  The data necessary for
this optimization needs to live alongside the set of attributes, but
a simple array of git_attr_check_elem simply does not have any place
for that.

Introduce "struct git_attr_check" that contains N, the number of
attributes being sought, and A, the array that holds N
git_attr_check_elem items, and a function git_check_attr() that
takes a path P and this structure as its parameters.  This structure
can later be extended to hold extra data necessary for optimization.

    Side Note: While in this series, I am not interested in
    specifying the exact optimization, it would help readers to give
    what is envisioned.  Even when the caller is interested in only
    a few attributes (say, "diff", "text" and "group-doc"), we'd walk
    the attribute entries in various .gitattributes files, perform
    the pattern match and collect attributes for the path.  An
    earlier commit 06a604e6 (attr: avoid heavy work when we know the
    specified attr is not defined, 2014-12-28) tried to optimize out
    this scanning for one trivial special case: where an asked-for
    attribute does not appear anywhere in these files, we know we do
    not have to scan at all.

    We can do much better.  We can scan these attribute entries that
    came from .gitattributes files once and mark the ones that affect
    one or more of the attributes we know the caller is interested
    in.  Then we only go through the marked entries.  E.g. if the
    caller is interested in "diff", "text" and "group-doc"
    attributes, we can mark entries that talks about any of "diff",
    "text", "group-doc" and "binary" attributes.  The last one is
    because it is an attribute macro that expands to affect "diff"
    and "text" (these attribute are unset by having "binary").

    The way git_attr_check() API is structured expects that the
    caller expresses the set of attributes it is interested in
    upfront, and reuses to ask the same question about many paths,
    and it is already optimized for the case where it does so in
    in-order recursive descent of a tree hierarchy by having an
    attr_stack that reads, pre-parses and caches contents of the
    .gitattributes files found in each directory hierarchy.  The
    cached optimization data that sits in "struct git_attr_check"
    would be effective for this expected access pattern.

    The optimization data cannot be file-scope static to attr.c,
    even though that might be easier to implement, in anticipation
    of having more than one "struct git_attr_check" working
    alternatingly, e.g. having more than one pathspec with
    ":(attr=X)" pathspec magic, i.e.

        git cmd ":(attr=X)dir/" ":(attr=Y)dir/ectory/"

    Each pathspec element is expected to keep its own "struct
    git_attr_check" in the custom data to support pathspec magic
    to match with attributes, and the optimization data kept there
    would survive repeated calls to git_check_attr() for the same
    path with different set of attributes.

The patches in the earliest part of the series have been sent to the
list already; there is no substantial change (I think I made a
typofix in the commit log message found by Eric).

  commit.c: use strchrnul() to scan for one line
  attr.c: use strchrnul() to scan for one line
  attr.c: update a stale comment on "struct match_attr"
  attr.c: explain the lack of attr-name syntax check in parse_attr()
  attr.c: complete a sentence in a comment
  attr.c: mark where #if DEBUG ends more clearly
  attr.c: simplify macroexpand_one()

Step 8 is to make sure I can catch all the existing callers and
update the API incrementally.

  attr: rename function and struct related to checking attributes

Step 9 is the most important one.  By updating the API to allow the
callers hold piece of information more than just a simple array of
<attr, value> pair, it paves the way to add data structures that
will help optimizing the lookup.  As a demonstration of the new API,
it converts one caller.  The immediate effect to the callers is that
they can lose moderate amount of code, but that is not the point of
this change.

  attr: (re)introduce git_check_attr() and struct git_attr_check

The two patches that follow converts remaining callers to the enw
API.

  attr: convert git_all_attrs() to use "struct git_attr_check"
  attr: convert git_check_attrs() callers to use the new API

The last step retires the old one and updates the document.

  attr: retire git_check_attrs() API

 Documentation/technical/api-gitattributes.txt |  62 ++++++-----
 archive.c                                     |  24 ++---
 attr.c                                        | 143 +++++++++++++++++++-------
 attr.h                                        |  31 ++++--
 builtin/check-attr.c                          |  55 +++++-----
 builtin/pack-objects.c                        |  19 +---
 commit.c                                      |   3 +-
 convert.c                                     |  26 ++---
 ll-merge.c                                    |  33 +++---
 userdiff.c                                    |  19 ++--
 ws.c                                          |  19 ++--
 11 files changed, 236 insertions(+), 198 deletions(-)

-- 
2.8.2-748-gfb85f76

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html