Re: [PATCH 0/5] Start of a journey: drop NO_THE_INDEX_COMPATIBILITY_MACROS

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Tue, 2 May 2017 10:05:00 -0400

On 5/2/2017 12:17 AM, Stefan Beller wrote:
On Mon, May 1, 2017 at 6:36 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
Stefan Beller <sbeller@xxxxxxxxxx> writes:

This applies to origin/master.

For better readability and understandability for newcomers, it is a good idea
not to offer two APIs that do the same thing, with one being a #define of the other.

In the long run we may want to drop the macros guarded by
NO_THE_INDEX_COMPATIBILITY_MACROS. This converts a couple of them.

Thank you for bringing this up and making this proposal.
I started a similar effort internally last fall, but
stopped because of the footprint size.

Why?  Why should we keep typing &the_index, when most of the time we
are given _the_ index and working on it?

As someone knowledgeable about the code base, you know that the cache_*
and index_* functions differ only by an index argument. A newcomer may not
know this, so they wonder (A) why we have so many functions [and which is the
right function to use]; it is an issue of ease of use of the code base.
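
To make (A) concrete, here is a minimal sketch of the pattern, not the actual
definitions in cache.h. The struct is a toy stand-in, and the names
index_entry_count() and cache_entry_count() are made up for illustration; only
the shape (a macro that silently fills in &the_index) is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the index state; the real struct has many more fields. */
struct index_state {
	size_t cache_nr;	/* number of entries currently in the index */
};

/* The single global index that the cache_* names implicitly refer to. */
static struct index_state the_index;

/* Explicit API (hypothetical name): the caller says which index to use. */
static size_t index_entry_count(const struct index_state *istate)
{
	assert(istate != NULL);
	return istate->cache_nr;
}

/* Compatibility macro (hypothetical name): same call, &the_index implied. */
#define cache_entry_count() index_entry_count(&the_index)
```

Both spellings end up in the same function, which is why a newcomer sees two
names for one operation.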

Anything you do in submodule land today needs to spawn new processes in
the submodule. This is cumbersome and not performant. So in the far future
we may want to have an abstraction of a repo (B), i.e. all repository state in
one struct/class. That way we can open a submodule in-process and perform
the required actions without spawning a process.

The road to (B) is a long road, but we have to start somewhere. And this seemed
like a good place, by introducing a dedicated argument for the repository. In a
follow-up in the future we may want to replace &the_index by "the_main_repo.its_index"
and then could also run the commands on other (submodule) indexes. But more
importantly, all these commands would operate on a repository object.

In such a far future we would have functions like the cmd_* functions
that would take a repository object instead of doing the setup and discovery
on their own.
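
A rough sketch of what (B) could look like, purely as an assumption about the
direction: struct repository, its fields, and count_tracked() are all
hypothetical names, and the index struct is a toy stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the index state. */
struct index_state {
	size_t cache_nr;
};

/* Hypothetical: all per-repository state gathered in one object. */
struct repository {
	const char *gitdir;		/* where this repository lives */
	struct index_state index;	/* this repository's own index */
};

/* Hypothetical command body: it is handed a repository, it does not go looking for one. */
static size_t count_tracked(const struct repository *repo)
{
	assert(repo != NULL);
	return repo->index.cache_nr;
}
```

With something of this shape, the superproject and a submodule could each be a
struct repository in the same process, and the same function could be called
on either one.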

Another reason may be the code base's current velocity (or absence of it) w.r.t. these
functions, such that fewer merge conflicts may arise.

In addition to (eventually) allowing multiple repos to be open at
the same time for submodules, it would also help with various
multi-threading efforts.  For example, we have loops that do a
"for (k = 0; k < active_nr; k++) {...}".  There is no visual clue
in that code that it references "the_index" and therefore should
be subject to the same locking.  Granted, this is a trivial example,
but it goes to the argument that the code has lots of subtle global
variables and macros that make it difficult to reason about the
code.

This step would help un-hide this.
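
A small self-contained sketch of the difference. The structs are toy
stand-ins; the two macros are modeled on the compatibility macros being
discussed, and the two count_named_*() functions are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real structs. */
struct cache_entry {
	const char *name;
};

struct index_state {
	struct cache_entry **cache;
	unsigned int cache_nr;
};

static struct index_state the_index;

/* Macros in the style of the compatibility layer: they hide the global. */
#define active_cache (the_index.cache)
#define active_nr (the_index.cache_nr)

/* Reads as if it touched only locals, yet it walks the global index. */
static size_t count_named_implicit(void)
{
	size_t n = 0;
	unsigned int k;

	for (k = 0; k < active_nr; k++)
		if (active_cache[k]->name)
			n++;
	return n;
}

/* Same loop with the dependency spelled out in the signature. */
static size_t count_named_explicit(const struct index_state *istate)
{
	size_t n = 0;
	unsigned int k;

	assert(istate != NULL);
	for (k = 0; k < istate->cache_nr; k++)
		if (istate->cache[k]->name)
			n++;
	return n;
}
```

In the second version anyone reviewing a call site can see which index is
being read, and so which lock (if any) ought to be held.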

In a much longer future, we could also consider building an
improved API around the in-memory index data.  For example,
currently we have a simple array of cache_entry pointers and
the entire code base uses "for" loops like the above to iterate.
If we could hide that fact, then we could consider alternative
representations for various reasons.
() bulk alloc the cache_entries from a pool, rather than individually.
() cluster cache_entries linearly by parent directory, rather
   than linearly over the whole tree.
() efficient alternative iterator methods on the index, such as
   non-recursive breadth-first

Things like this would be difficult with the current set of
globals and macros.
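
As a sketch of the kind of API I mean (every name here is hypothetical, and
the structs are toy stand-ins), callers would go through an iterator and never
index into the array themselves:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real structs. */
struct cache_entry {
	const char *name;
};

struct index_state {
	struct cache_entry **cache;
	unsigned int cache_nr;
};

/* Hypothetical callback type: return nonzero to stop the iteration early. */
typedef int (*each_entry_fn)(const struct cache_entry *ce, void *data);

/* Hypothetical iterator: the only code that knows the entries sit in an array. */
static int for_each_index_entry(const struct index_state *istate,
				each_entry_fn fn, void *data)
{
	unsigned int k;

	assert(istate != NULL && fn != NULL);
	for (k = 0; k < istate->cache_nr; k++) {
		int ret = fn(istate->cache[k], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count the entries visited. */
static int count_entry(const struct cache_entry *ce, void *data)
{
	(void)ce;	/* the entry itself is not needed for counting */
	++*(size_t *)data;
	return 0;
}
```

If callers only ever used something like for_each_index_entry(), the storage
behind it could change (pooled allocation, per-directory clustering, a
different traversal order) without touching them.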

Thanks,
Jeff

---
This discussion is similar to the "free memory at the end of cmd_*" discussion,
as it aims to make code reusable and accepts a minor drawback for it.
Typing "the_index" reinforces the object thinking model and may make people
think twice about whether they want to declare yet another global variable.

Thanks,
Stefan