Re: [PATCH 18/19] index-helper: autorun

Duy Nguyen <pclouds@xxxxxxxxx> · Fri, 18 Mar 2016 07:50:27 +0700

On Thu, Mar 17, 2016 at 9:43 PM, Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
> Hi Duy,
>
> On Thu, 17 Mar 2016, Duy Nguyen wrote:
>
>> On Thu, Mar 17, 2016 at 1:27 AM, Johannes Schindelin
>> <Johannes.Schindelin@xxxxxx> wrote:
>> > I am much more concerned about concurrent accesses and the communication
>> > between the Git processes and the index-helper. Writing to the .pid file
>> > sounds very fragile to me, in particular when multiple processes can poke
>> > the index-helper in succession and some readers are unaware that the index
>> > is being refreshed.
>>
>> It's not that bad.
>
> Well, the way I read the code it is possible that:
>
> 1. Git process 1 starts, reading the index
> 2. Git process 2 starts, poking the index-helper
> 3. The index-helper updates the .pid file (why not set a bit in the shared
>    memory?) with a prefix "W"
> 4. Git process 2 reads the .pid file and waits for the "W" to go away
>    (what if index-helper is not fast enough to write the "W"?)
> 5. Git process 1 accesses the index, happily oblivious that it is being
>    updated and the data is in an inconsistent state

No, if process 1 reads the index file, then that file will remain
consistent/unchanged all the time. index-helper is not allowed to
touch that file at all.

Process 2 gets the index content from shm (cached by the index
helper) and verifies that it's good (using the signature at the end of
the shm). If watchman is used, process 2 can also read the list of
modified files from another shm, combine it with the in-core index,
then write it out the normal way. Only then can process 1 (or process
3) see the new index content from the file.
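
To make the "verify, else fall back" idea concrete, here is a rough
sketch in Python. It assumes the signature is a trailing SHA-1 over
everything before it, the same way the on-disk index ends with one;
the check in the actual patches may differ. trailer_is_valid() and
load_index() are made-up names for illustration only.

```python
import hashlib

SHA1_LEN = 20  # size of a raw SHA-1 digest in bytes


def trailer_is_valid(buf):
    """Return True if the last 20 bytes are the SHA-1 of everything before them."""
    if len(buf) < SHA1_LEN:
        return False
    body, trailer = buf[:-SHA1_LEN], buf[-SHA1_LEN:]
    return hashlib.sha1(body).digest() == trailer


def load_index(cached, read_file):
    """Prefer the cached copy; read the file when the cache is missing or suspicious.

    `cached` stands in for the bytes taken from shm (or None if there is none),
    `read_file` is a callable that reads the index file the normal way.
    """
    if cached is not None and trailer_is_valid(cached):
        return cached, "shm"
    return read_file(), "file"
```

A reader that gets a truncated or half-written buffer simply fails the
check and ends up on the slow path, which is the fallback discussed
below.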

>> We should have protection in place to deal with this and fall back to
>> reading directly from file when things get suspicious.
>
> I really want to prevent that. I know of use cases where the index weighs
> 300MB, and falling back to reading it directly *really* hurts.

For crying out loud, what do you store in that repo? What I have in
mind for all this work is indexes in the 10MB range, or maybe 50MB max.

Very unscientifically, the git.git index is about 274KB and contains
~3000 entries, so about 94 bytes per entry on average. With a 300MB
index, the extrapolated number of entries is about 3 million! At
around 1 million index entries, I think it's time to just use a
database as the index.
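
The back-of-the-envelope extrapolation, spelled out in Python. The
inputs are the rough figures above (the entry count is approximate),
and KB/MB are taken as powers of two, which is my assumption;
estimate_entries() is just an illustrative helper.

```python
def estimate_entries(sample_bytes, sample_entries, target_bytes):
    """Scale a known index size and entry count up to a larger index size."""
    bytes_per_entry = sample_bytes / sample_entries
    return bytes_per_entry, target_bytes / bytes_per_entry


# git.git: ~274KB for ~3000 entries; target: a 300MB index
per_entry, entries = estimate_entries(274 * 1024, 3000, 300 * 1024 * 1024)
print(f"{per_entry:.0f} bytes/entry, roughly {entries / 1e6:.1f} million entries")
```

Average entry size depends heavily on path lengths, so this only
gives an order of magnitude.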

>> But I agree that sending UNIX signals (or PostMessage) is not really
>> good communication.
>
> Yeah, I really would like two-way communication instead. Named pipes?
> They'd have the advantage that you could use the full path to the index as
> identifier.

Yep.

> The way I read the current code, we would actually create a different
> shared memory every time the index changes because its checksum is part of
> the shared memory's "path"...

Yep. shm objects are "immutable", pretty much like git objects. But
now that I think of it, I don't know how cheap or expensive shm
creation is on Windows.
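
A toy model of that "immutable" behaviour, in Python. A dict stands in
for the shm namespace, and the "git-index-<hex>" name format is made
up for the sketch, as are shm_name_for() and publish(). The only point
is that the name is derived from the index's trailing checksum, so a
changed index gets a new object and an existing one is never
rewritten.

```python
SHA1_LEN = 20  # the index ends with a 20-byte SHA-1 checksum


def shm_name_for(index_bytes):
    """Derive an object name from the trailing checksum (hypothetical format)."""
    return "git-index-" + index_bytes[-SHA1_LEN:].hex()


def publish(store, index_bytes):
    """Add the index under its checksum-derived name; never overwrite an object."""
    name = shm_name_for(index_bytes)
    store.setdefault(name, bytes(index_bytes))
    return name
```

Publishing the same index twice is a no-op, and every index change
costs one object creation, which is where the Windows question comes
in.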
-- 
Duy