Re: [PATCH v6 03/19] index-helper: new daemon for caching index and related stuff

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 28 Apr 2016 11:58:37 -0700

David Turner <dturner@xxxxxxxxxxxxxxxx> writes:

> From: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ...
> The biggest gain is not having to verify the trailing SHA-1, which
> takes lots of time especially on large index files. But this also
> opens doors for further optimiztions:

optimizAtion

> Git can poke the daemon via unix domain sockets to tell it to refresh
> the index cache, or to keep it alive some more minutes. It can't give
> any real index data directly to the daemon. Real data goes to disk
> first, then the daemon reads and verifies it from there. Poking only
> happens for $GIT_DIR/index, not temporary index files.

Is this limited to "poking", or the helper daemon is not involved in
codepaths that handle temporary index at all?  It makes sense if it
is the latter, and it doesn't if it were the former, but it is
unclear in this paragraph.

> $GIT_DIR/index-helper.sock is the socket for the daemon process. The
> daemon reads from the socket and executes commands.
>
> Named pipes were considered for portability reasons, but then commands
> that need replies from the daemon would have open their own pipes,

"would have TO open"?

> since a named pipe should only have one reader.  Unix domain sockets
> don't have this problem.

> diff --git a/Documentation/git-index-helper.txt b/Documentation/git-index-helper.txt
> new file mode 100644
> index 0000000..77687c0
> --- /dev/null
> +++ b/Documentation/git-index-helper.txt
> @@ -0,0 +1,47 @@
> +git-index-helper(1)
> +===================
> +
> +NAME
> +----
> +git-index-helper - A simple cache daemon for speeding up index file access
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git index-helper' [options]
> +
> +DESCRIPTION
> +-----------
> +Keep the index file in memory for faster access. This daemon is per
> +repository.

Not an objection but a question.  Does this mean "per index file"?

What the users would need to be aware of is that it is not like
running a single daemon instance helps the toplevel project with its
entire 600 submodules checked out you have--you would need that many
instances of the helper, which is already conveyed well with "per
repository".  But I was wondering if it helps users experimenting
with the multiple worktree feature if we said "per index file".  It
would make it more clear that you would run two instances of the
helper daemon when you use another worktree in addition to your
primary repository.

> +... The following commands are used to control the daemon:
> +
> +"refresh"::
> +	Reread the index.
> +
> +"poke":
> +	Let the daemon know the index is to be read. It keeps the
> +	daemon alive longer, unless `--exit-after=0` is used.
> +
> +All commands and replies are terminated by a 0 byte.

s/0/NUL/

> diff --git a/cache.h b/cache.h
> index 4180e2b..43fb314 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -334,6 +334,8 @@ struct index_state {
>  	struct cache_time timestamp;
>  	unsigned name_hash_initialized : 1,
>  		 keep_mmap : 1,
> +		 from_shm : 1,
> +		 to_shm : 1,

keep_mmap bit had its own commit, which made it more-or-less
unnecessary to have a separate explanation, but it is unclear how
these two bits are envisioned to be used...

> diff --git a/index-helper.c b/index-helper.c
> new file mode 100644
> index 0000000..976e913
> --- /dev/null
> +++ b/index-helper.c
> @@ -0,0 +1,285 @@
> +#include "cache.h"
> +#include "parse-options.h"
> +#include "sigchain.h"
> +#include "strbuf.h"
> +#include "exec_cmd.h"
> +#include "split-index.h"
> +#include "lockfile.h"
> +#include "cache.h"
> +#include "unix-socket.h"
> +#include "pkt-line.h"
> +
> +struct shm {
> +	unsigned char sha1[20];
> +	void *shm;
> +	size_t size;
> +};
> +
> +static struct shm shm_index;
> +static struct shm shm_base_index;
> +
> +static void release_index_shm(struct shm *is)
> +{
> +	if (!is->shm)
> +		return;
> +	munmap(is->shm, is->size);
> +	unlink(git_path("shm-index-%s", sha1_to_hex(is->sha1)));
> +	is->shm = NULL;
> +}
> +
> +static void cleanup_shm(void)
> +{
> +	release_index_shm(&shm_index);
> +	release_index_shm(&shm_base_index);
> +}
> +
> +static void cleanup(void)
> +{
> +	unlink(git_path("index-helper.sock"));
> +	cleanup_shm();
> +}
> +
> +static void cleanup_on_signal(int sig)
> +{
> +	/* We ignore sigpipes -- that's just a client being broken. */

Is it really "a broken" client, in which case we would want to know
about the breakage so that we can fix it?  Or is it a case of "a
client wasn't interested in what we had to say, which is perfectly
fine"?

> +	if (sig == SIGPIPE)
> +		return;
> +	cleanup();
> +	sigchain_pop(sig);
> +	raise(sig);
> +}
> +
> +static int shared_mmap_create(int file_flags, int file_mode, size_t size,
> +			      void **new_mmap, int mmap_prot, int mmap_flags,
> +			      const char *path)
> +{

I can understand size, new_mmap and path, but do all others need the
customizability?  It is hard to tell at this point in the series as
there is only one caller, which gives a reasonable set of values to
these parameters.  And each of all these "reasonable values" is the
only reasonable value for each of these parameters-- you would ever
want to have mmap_prot without PROT_WRIT as this is creating an
empty region that is mapped to a newly created file, for example.
It is inconceivable that a function named "shared_mmap_create" ever
passes mmap_flags that lack MAP_SHARED, for another example, or
file_flags that is not O_CREAT|O_EXCL|O_RDWR for that matter.

> + ...
> +	if (ftruncate(fd, size))
> +		goto done;
> +	*new_mmap = mmap(NULL, size, mmap_prot, mmap_flags, fd, 0);
> ...
> +static void share_index(struct index_state *istate, struct shm *is)
> +{
> +	void *new_mmap;
> +	if (istate->mmap_size <= 20 ||
> +	    hashcmp(istate->sha1,
> +		    (unsigned char *)istate->mmap + istate->mmap_size - 20) ||
> +	    !hashcmp(istate->sha1, is->sha1) ||
> +	    shared_mmap_create(O_CREAT | O_EXCL | O_RDWR, 0700,
> +			       istate->mmap_size, &new_mmap,
> +			       PROT_READ | PROT_WRITE, MAP_SHARED,
> +			       git_path("shm-index-%s",
> +					sha1_to_hex(istate->sha1))) < 0)
> +		return;
> +	release_index_shm(is);
> +	is->size = istate->mmap_size;
> +	is->shm = new_mmap;
> +	hashcpy(is->sha1, istate->sha1);

So "is" is expected to be a valid instance of shm that was obtained
previously.  When we fail to map and return, does it make sense to
retain it?  Because this function does not return any value, the
caller wouldn't be able to tell that shared-mmap-create failed, and
will keep using that stale instance of "is"--I cannot tell if that
is OK in the caller of this thing or not from what I read so far in
this series...

> +static void refresh(void)
> +{
> +	discard_index(&the_index);
> +	the_index.keep_mmap = 1;
> +	the_index.to_shm    = 1;

Again, mysterious to_shm appears here...

> +		/* Wait for a request */
> +		FD_ZERO(&readfds);
> +		FD_SET(fd, &readfds);
> +		result = select(fd + 1, &readfds, NULL, NULL, timeout_p);

I admit that I am old fashioned and am more familiar with select(2)
than poll(2), but I think we try to avoid select(2) in new code.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html