Re: [PATCH v14 00/21] index-helper/watchman

Ben Peart <peartben@xxxxxxxxx> · Fri, 15 Jul 2016 01:20:47 +0000 (UTC)

Duy Nguyen <pclouds <at> gmail.com> writes:

> 
> On Wed, Jul 13, 2016 at 11:59 PM, David Turner <novalis <at> novalis.org> wrote:
> > On 07/12/2016 02:24 PM, Duy Nguyen wrote:
> >>
> >> Just thinking out loud. I've been thinking more about this.
> >> After the move from signal-based to unix socket for communication, we
> >> probably are better off with a simpler design than the shm-alike one
> >> we have now.
> >>
> >> What if we send everything over a socket or a pipe? Sending 500MB over
> >> a unix socket takes 253ms, that's insignificant when operations on an
> >> index that size usually take seconds. If we send everything over
> >> socket/pipe, we can trust data integrity and don't have to verify,
> >> even the trailing SHA-1 in shm file.
> >
> >
> > I think it would be good to make index operations not take seconds.
> >
> > In general, we should not need to verify the trailing SHA-1 for shm data.
> > So the index-helper verifies it when it loads it, but the git (e.g.) status
> > should not need to verify.
> >
> > Also, if we have two git commands running at the same time, the index-helper
> > can only serve one at a time; with shm, both can run at full speed.
> 
> We still have an option to send a (shm, possibly) path to git to pick
> up and skip verification. If we can exchange capabilities then sending
> the index some way else is always possible.
> 
> >> So, what I have in mind is this, at read index time, instead of open a
> >> socket, we run a separate program and communicate via pipes. We can
> >> exchange capabilities if needed, then the program sends the entire
> >> current index, the list of updated files back (and/or the list of dirs
> >> to invalidate). The design looks very much like a smudge/clean filter.
> >
> >
> > This seems very complicated.  Now git status talks to the separate program,
> > which talks to the index-helper, which talks to watchman.  That is a lot of
> > steps!
> 
> I was suggesting this because I think it would simplify things, not
> complicate stuff further. Yes the separate program plays the role of
> our unix client, if we keep the index-helper. But we don't have to.
> 
> Do you remember Junio once suggested to put the index on tmpfs? That's
> what I imagine in common, medium scale setups. We don't need an extra
> daemon:
> 
> 1) when git needs the index, the script looks at its tmpfs mount, if
> found, pass the path back
> 2) when git announces the index has been updated, the script reads the
> index and saves it in tmpfs
> 3) when git refreshes and asks for watchman support, the script simply
> runs "watchman" command, post processes the output a bit and send the
> file list to git
> 
> Because there is no separate daemon in this case, we don't need
> --kill, we don't need --autorun. We still need WAMA extension but it
> can contain just an arbitrary clock string, this is completely opaque
> to git. If we can get rid of the index-helper (with an example script
> probably landed in contrib folder), that's a lot of less headache down
> the road.
> 
> For giant-scale repos, you probably want something more efficient than
> a script like this. And the good thing is you have freedom to do
> whatever you want. You can run one daemon per repo, you can run one
> daemon per system... In some previous mail exchange with Dscho, it was
> mentioned that something other than watchman may be desired. This
> opens up that door without much headache from outside.
> 
> > I think the daemon also has the advantage that it can reload the index as
> > soon as it changes.  This is not quite implemented, but it would be pretty
> > easy to do.  That would save a lot of time in the typical workflow.
> 
> A script has the same advantage, that is if git notifies it (like we
> do now). You can also do it using watchman trigger, which does not
> need any special support from git.

Taking a step back, git needs (at least) three things to update the index quickly: 
the old index, the list of potentially modified files, and the list of 
directories with changes.  Looked at this way, the question becomes how we 
provide this data to git in the fastest, simplest, most compatible way.

It makes sense to put this behind separate code that git can call, abstracting 
away the potentially platform-dependent aspects of how we store and retrieve 
this data quickly.  For example, we would like to use something other than 
Watchman to track changes, as it isn’t well supported on Windows.

It would be nice to simplify this for git as much as possible by eliminating the 
need for it to manage a daemon.  No need for --autorun, --kill, poking sockets, 
etc - just run the program and get back the data.

It also makes sense to allow git to negotiate for the data it needs.  It may 
only need the index; it may also need the list of updated files; and if 
untrackedcache is turned on, it would need the list of directories to 
invalidate.  The separate code should also be able to say which parts it can 
provide quickly, so that git can fall back to the old code path if something 
isn’t available.
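
To make the negotiation concrete, here is a rough sketch in Python.  The part 
names and function names are made up for illustration; nothing here is an 
existing git interface.  It only shows how git could split what it wants into 
"served by the separate code" and "fall back to the old code path".

```python
# Hypothetical part names; not an existing git protocol.
def wanted_parts(untracked_cache_enabled):
    """List the parts git would ask for in a given configuration."""
    wanted = ["index", "updated-files"]
    if untracked_cache_enabled:
        # Only needed when the untracked cache is turned on.
        wanted.append("invalid-dirs")
    return wanted


def negotiate(wanted, offered):
    """Split the wanted parts by whether the separate code offers them.

    Returns (from_helper, fallback): parts the separate code says it can
    provide quickly, and parts git must compute via the old code path.
    """
    from_helper = [part for part in wanted if part in offered]
    fallback = [part for part in wanted if part not in offered]
    return from_helper, fallback
```

With a helper that offers only the index and the updated file list, git would 
still compute the directories to invalidate itself.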

Another thing to consider is compatibility moving forward and versioning so that 
git and the separate code can be revised independently.  

I think the smudge/clean filter model could work.  It provides a way to register 
different commands for each part of the data git needs.  This could be part of 
the "negotiation" process: if a command is registered, then that part is 
supported; if no command is registered, git falls back to the old code path for 
that data.  It would be nice if you could get all the information in a single 
step.
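
As a sketch of what "registered means supported" could look like (Python, with 
hypothetical names; the `registered` mapping stands in for whatever 
configuration git would actually read, and the part names are assumptions):

```python
import subprocess


def fetch_part(part, registered, fallback):
    """Return the bytes for one part of the data git needs.

    `registered` maps a part name to the argv of its command (a stand-in
    for real configuration).  `fallback` is the old code path, called
    with the part name when no command is registered for it.
    """
    command = registered.get(part)
    if command is None:
        # Nothing registered, so this part is unsupported: use the old path.
        return fallback(part)
    # Run the registered command and take its stdout as the data.
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return result.stdout
```

This runs one process per part, which is the cost a single-step variant would 
avoid.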