Re: [PATCH v6 03/19] index-helper: new daemon for caching index and related stuff

David Turner <dturner@xxxxxxxxxxxxxxxx> · Fri, 29 Apr 2016 14:46:57 -0400

On Thu, 2016-04-28 at 11:58 -0700, Junio C Hamano wrote:
> David Turner <dturner@xxxxxxxxxxxxxxxx> writes:
> 
> > From: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> > ...
> > The biggest gain is not having to verify the trailing SHA-1, which
> > takes lots of time especially on large index files. But this also
> > opens doors for further optimiztions:
> 
> optimizAtion
> 
> > Git can poke the daemon via unix domain sockets to tell it to
> > refresh
> > the index cache, or to keep it alive some more minutes. It can't
> > give
> > any real index data directly to the daemon. Real data goes to disk
> > first, then the daemon reads and verifies it from there. Poking
> > only
> > happens for $GIT_DIR/index, not temporary index files.
> 
> Is this limited to "poking", or the helper daemon is not involved in
> codepaths that handle temporary index at all?  It makes sense if it
> is the latter, and it doesn't if it were the former, but it is
> unclear in this paragraph.

It is in fact the latter.  Will clarify.

> > $GIT_DIR/index-helper.sock is the socket for the daemon process.
> > The
> > daemon reads from the socket and executes commands.
> > 
> > Named pipes were considered for portability reasons, but then
> > commands
> > that need replies from the daemon would have open their own pipes,
> 
> "would have TO open"?

Oops.

> > since a named pipe should only have one reader.  Unix domain
> > sockets
> > don't have this problem.
> 
> > diff --git a/Documentation/git-index-helper.txt b/Documentation/git
> > -index-helper.txt
> > new file mode 100644
> > index 0000000..77687c0
> > --- /dev/null
> > +++ b/Documentation/git-index-helper.txt
> > @@ -0,0 +1,47 @@
> > +git-index-helper(1)
> > +===================
> > +
> > +NAME
> > +----
> > +git-index-helper - A simple cache daemon for speeding up index
> > file access
> > +
> > +SYNOPSIS
> > +--------
> > +[verse]
> > +'git index-helper' [options]
> > +
> > +DESCRIPTION
> > +-----------
> > +Keep the index file in memory for faster access. This daemon is
> > per
> > +repository.
> 
> Not an objection but a question.  Does this mean "per index file"?
>
> What the users would need to be aware of is that it is not like
> running a single daemon instance helps the toplevel project with its
> entire 600 submodules checked out you have--you would need that many
> instances of the helper, which is already conveyed well with "per
> repository".  But I was wondering if it helps users experimenting
> with the multiple worktree feature if we said "per index file".  It
> would make it more clear that you would run two instances of the
> helper daemon when you use another worktree in addition to your
> primary repository.

Per working tree/repository, yes (not quite per-index because working
trees with split indexes have two).  But yeah, that's confusing; will
clarify.

> > +... The following commands are used to control the daemon:
> > +
> > +"refresh"::
> > +	Reread the index.
> > +
> > +"poke":
> > +	Let the daemon know the index is to be read. It keeps the
> > +	daemon alive longer, unless `--exit-after=0` is used.
> > +
> > +All commands and replies are terminated by a 0 byte.
> 
> s/0/NUL/

Will fix.

> > diff --git a/cache.h b/cache.h
> > index 4180e2b..43fb314 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -334,6 +334,8 @@ struct index_state {
> >  	struct cache_time timestamp;
> >  	unsigned name_hash_initialized : 1,
> >  		 keep_mmap : 1,
> > +		 from_shm : 1,
> > +		 to_shm : 1,
> 
> keep_mmap bit had its own commit, which made it more-or-less
> unnecessary to have a separate explanation, but it is unclear how
> these two bits are envisioned to be used...

Will add comments.

> > diff --git a/index-helper.c b/index-helper.c
> > new file mode 100644
> > index 0000000..976e913
> > --- /dev/null
> > +++ b/index-helper.c
> > @@ -0,0 +1,285 @@
> > +#include "cache.h"
> > +#include "parse-options.h"
> > +#include "sigchain.h"
> > +#include "strbuf.h"
> > +#include "exec_cmd.h"
> > +#include "split-index.h"
> > +#include "lockfile.h"
> > +#include "cache.h"
> > +#include "unix-socket.h"
> > +#include "pkt-line.h"
> > +
> > +struct shm {
> > +	unsigned char sha1[20];
> > +	void *shm;
> > +	size_t size;
> > +};
> > +
> > +static struct shm shm_index;
> > +static struct shm shm_base_index;
> > +
> > +static void release_index_shm(struct shm *is)
> > +{
> > +	if (!is->shm)
> > +		return;
> > +	munmap(is->shm, is->size);
> > +	unlink(git_path("shm-index-%s", sha1_to_hex(is->sha1)));
> > +	is->shm = NULL;
> > +}
> > +
> > +static void cleanup_shm(void)
> > +{
> > +	release_index_shm(&shm_index);
> > +	release_index_shm(&shm_base_index);
> > +}
> > +
> > +static void cleanup(void)
> > +{
> > +	unlink(git_path("index-helper.sock"));
> > +	cleanup_shm();
> > +}
> > +
> > +static void cleanup_on_signal(int sig)
> > +{
> > +	/* We ignore sigpipes -- that's just a client being
> > broken. */
> 
> Is it really "a broken" client, in which case we would want to know
> about the breakage so that we can fix it?  Or is it a case of "a
> client wasn't interested in what we had to say, which is perfectly
> fine"?

It's usually a client that's killed at an inopportune moment e.g.:
git diff 
^C 
#oops I meant 
git diff --cached

I guess I'll reword to "suddenly dying".

> > +	if (sig == SIGPIPE)
> > +		return;
> > +	cleanup();
> > +	sigchain_pop(sig);
> > +	raise(sig);
> > +}
> > +
> > +static int shared_mmap_create(int file_flags, int file_mode,
> > size_t size,
> > +			      void **new_mmap, int mmap_prot, int
> > mmap_flags,
> > +			      const char *path)
> > +{
> 
> I can understand size, new_mmap and path, but do all others need the
> customizability?  It is hard to tell at this point in the series as
> there is only one caller, which gives a reasonable set of values to
> these parameters.  And each of all these "reasonable values" is the
> only reasonable value for each of these parameters-- you would ever
> want to have mmap_prot without PROT_WRIT as this is creating an
> empty region that is mapped to a newly created file, for example.
> It is inconceivable that a function named "shared_mmap_create" ever
> passes mmap_flags that lack MAP_SHARED, for another example, or
> file_flags that is not O_CREAT|O_EXCL|O_RDWR for that matter.

Yeah, good point.

> > + ...
> > +	if (ftruncate(fd, size))
> > +		goto done;
> > +	*new_mmap = mmap(NULL, size, mmap_prot, mmap_flags, fd,
> > 0);
> > ...
> > +static void share_index(struct index_state *istate, struct shm
> > *is)
> > +{
> > +	void *new_mmap;
> > +	if (istate->mmap_size <= 20 ||
> > +	    hashcmp(istate->sha1,
> > +		    (unsigned char *)istate->mmap + istate
> > ->mmap_size - 20) ||
> > +	    !hashcmp(istate->sha1, is->sha1) ||
> > +	    shared_mmap_create(O_CREAT | O_EXCL | O_RDWR, 0700,
> > +			       istate->mmap_size, &new_mmap,
> > +			       PROT_READ | PROT_WRITE, MAP_SHARED,
> > +			       git_path("shm-index-%s",
> > +					sha1_to_hex(istate
> > ->sha1))) < 0)
> > +		return;
> > +	release_index_shm(is);
> > +	is->size = istate->mmap_size;
> > +	is->shm = new_mmap;
> > +	hashcpy(is->sha1, istate->sha1);
> 
> So "is" is expected to be a valid instance of shm that was obtained
> previously.  When we fail to map and return, does it make sense to
> retain it?  Because this function does not return any value, the
> caller wouldn't be able to tell that shared-mmap-create failed, and
> will keep using that stale instance of "is"--I cannot tell if that
> is OK in the caller of this thing or not from what I read so far in
> this series...

Honestly, if we fail to mmap, we should probably just die.  That's not
a normal circumstance, and there's not much point in keeping a non
-functioning index-helper around.

> > +static void refresh(void)
> > +{
> > +	discard_index(&the_index);
> > +	the_index.keep_mmap = 1;
> > +	the_index.to_shm    = 1;
> 
> Again, mysterious to_shm appears here...

I expect that the explanation of to_shm will help here.

> > +		/* Wait for a request */
> > +		FD_ZERO(&readfds);
> > +		FD_SET(fd, &readfds);
> > +		result = select(fd + 1, &readfds, NULL, NULL,
> > timeout_p);
> 
> I admit that I am old fashioned and am more familiar with select(2)
> than poll(2), but I think we try to avoid select(2) in new code.

I don't see that in Documentation/CodingGuidelines (although I admit
that I did not even consider checking it before writing this code, so
if it had been there, I would have missed it).

Why do we try to avoid select(2)?  It seems marginally less efficient
for small numbers of fds, and might not scale to large numbers of fds,
but we're not using it in a performance-critical way, and we only care
about one fd at a time.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html