Re: Loop, glib and spice-server

> 
> Just a few initial comments, questions, below
> 
> On Mon, 2016-01-25 at 07:35 -0500, Frediano Ziglio wrote:
> > Hi,
> >   I spent quite some time trying to understand the main loop of
> >   spice-server.
> > (Perhaps I'm tired of rebasing the glib loop patches after 250 patches!)
> > 
> > There were quite some questions on the loop and the patches left:
> > - why new code using glib is doing much more iterations?
> > - where the client blocking problems came? See [1], [2] and [3].
> > - why all that polling code in red-worker is needed?
> > 
> > Glib and iterations
> > -------------------
> > Using strace (poll and other debugging) I tried to understand where all
> > these
> > iterations came.
> > The reason is multi-threading (and some problems on Glib).
> > In our implementation multithreading is simple: we only have a thread!
> 
> Here do you mean we only have *one* thread?
> 

All timer/watch functions and the loop itself are always called from a single
thread. It's true that during initialization some functions are called from another
thread, but that thread then stops making those calls and all code runs in the
red worker thread. So, in effect, all functions are executed from a single thread.
In the case of glib, by contrast, you have to handle the case where one thread is
inside g_main_loop and another thread wants to add a source to that loop's context.

> > But Glib supports multithreading! So assume one thread is doing the poll
> > and another wants to add a file descriptor or change a timeout: how do you
> > stop the polling thread and make it consider the modified information?
> > The solution (which I have even coded for similar stuff!) is adding a file
> > descriptor to wake up the other thread. You can see quite clearly from strace
> > that spice-server-replay was woken up by a file descriptor 10 which turned
> > out to be an eventfd.
> > These are the events which should make the poll stop:
> > - timeout changes;
> > - file descriptor added or removed;
> > - file descriptor event types (like read and/or write) changed;
> > - other special events (loop exit, explicit wakeups, signals).
> > But which kinds of the above events actually happen in spice-server?
> > Our current implementation of timers/watches adds and removes source events
> > for many changes (timer started again, or event mask changed, or even set
> > with the same value). It turns out, however, that these events do not happen
> > that frequently.
> > It turns out that, to handle recursion of loops, Glib does a lot of file
> > descriptor addition/removal during the loop. This happens every time, so
> > every poll call is done twice.
> 
> I don't understand what you mean here by "Glib does a lot of file descriptor
> addition/removal during the loop". For what purpose?
> 

I found that in the Fedora 22 version glib blocks/unblocks the sources while
processing. One reason is to support recursion: basically, you call g_main_loop
again inside a source's dispatch function. The sources are removed before the
dispatch call and added again afterwards (more precisely, the source is blocked,
causing its poll_fd to be removed, and unblocked after the dispatch call,
causing the poll_fd to be added again).

> > I then added some debugging on when the loop happened (worker_source_prepare
> > is called), when the poll happened and when the eventfd is cleared.
> > What usually happened was:
> > - poll, one of our events was triggered
> > - prepare called
> > - poll, eventfd triggered
> > - prepare called
> > - poll, eventfd triggered
> > - eventfd cleared
> > So each event needed two prepare calls and 3 polls (so 3 glib iterations).
> 
> I don't see anything mentioned here about the dispatch function. Is
> worker_source_dispatch() called in each of these 3 iterations?
> 

The prepare is definitely called twice. I can try again (I have installed
my patched version, which is not affected by this problem).

> > One thing to consider: we run everything in one thread, so why should changes
> > to a file descriptor behave as if they are waking the poll even when the poll
> > is definitely not running? The reason is that, despite this comment in Glib:
> > 
> > /* Now wake up the main loop if it is waiting in the poll() */
> > 
> > the wakeup line:
> > 
> > g_wakeup_signal (context->wakeup);
> > 
> > is executed most of the time!
> > Second consideration: why is the eventfd cleared the second time the event is
> > triggered and not the first time? The reason is that in these lines
> > 
> > if (context->wake_up_rec.revents)
> >   g_wakeup_acknowledge (context->wakeup);
> > 
> > revents is not set and you have to wait for another poll round!
> 
> If I understand correctly, context->wake_up_rec is a cached struct rather than
> the struct passed to poll(), so context->wake_up_rec.revents will never
> actually be non-0, right?
> 

It is updated after 2 loop iterations. Basically, when a loop change is
detected, no events are handled and no revents are updated; on the next poll
there are no changes, so revents is updated and the wake_up event is reset.

> > 
> > So how to fix these Glib issues? I wrote this patch (still to send):
> > http://pastebin.com/XT0W7xnQ. It is working quite well on my Fedora system
> > (I replaced the system library!).
> > In my tests (using spice-server with the replay utility) the iterations
> > went from 700 to 350!
> > 
> > 
> > Spice-server loop
> > -----------------
> > So why is the client sometimes blocking?
> > First let's see the events the worker loop is handling:
> > 1- data from dispatcher (this includes wakeups from guest);
> > 2- data from channels (so clients);
> > 3- timeouts on streams;
> > 4- other timeouts (like pings);
> > 5- polling from the last ring-empty event (200 times with a poll of
> >    10 milliseconds).
> > About this last point, why is it needed?
> > Also note that the loops are counted, so those 200 times can expire quite
> > fast if the guest (for instance) keeps sending wakeups (see [2])!
> > 
> > What happens if:
> > - the client does not read data for a while and the channel pipes are full
> >   (network buffer full and pipe items maxed out);
> > - the guest fills the command ring, as command processing is not running
> >   due to the first condition;
> > - the guest sends a lot of wakeups (see [2])?
> > Simple: the loop stops entirely! Let's look at the above conditions:
> > 1- if nothing special (like the user trying to resize the client window)
> >    happens, no data is sent to the dispatcher, as the guest is not
> >    appending to the ring and so wakeups are not sent;
> > 2- the channels have no reason to send data;
> > 3- there are no streams, so no stream timeouts;
> > 4- these take quite long to fire; we can't wait minutes;
> 
> I'm afraid I can't make sense of this sentence.
> 

Item 4 refers to the special timeouts. These timeouts are quite long (even seconds).

> > 5- the counter has run out, so the polling has stopped.
> > What if, after a while, the client reads data and the network buffer becomes
> > available again? Nothing! There is no check for this (1 and 2 only check
> > for received data).
> > 
> > Happily I managed to reproduce this with this command:
> > 
> > xzcat win7-boot-record2.xz | SPICE_DEBUG_LEVEL=3 strace -o trace.txt -f -e
> > poll,close,execve,clone,write,writev,read,recv -s 100 ./libtool --mode=execute
> > ./server/tests/spice-server-replay -p 1235 -c 'remote-viewer
> > spice://localhost:1235' -w - -C 5
> > 
> > Maybe this command is helpful for somebody who wants to try: the server
> > hangs when large portions of the screen are updated (this was a recording
> > of a machine booting, as the name suggests).
> > 
> > So why is the polling helpful? Well... to try to avoid these conditions!
> > However, as I and others have demonstrated, they are still possible!
> > 
> > How to fix this? My current fix sets event_timeout when we exit from
> > red_process_display at the last return inside the loop (see also [1]) and
> > at the last return.
> > 
> > What would, in my opinion, be a better solution?
> > I think that instead of polling we should change the code to wait until we
> > can send data to the client (see my "implement no pushing on RedWorker"
> > patch on the ML for a possible solution, see [3]). This would solve most of
> > the problem, but still consider this:
> > - we are in the previous condition (no wakeups, ring full, channel pipes
> >   full, sending network buffers full);
> > - the client receives data and the buffers become available;
> > - the poll is woken up (now we detect the above condition):
> >   - we try to check commands but the channel pipes are still full;
> >   - we send all data to the network, so we now stop polling for network
> >     buffers;
> > - the poll runs again.
> > Let's check the above poll conditions:
> > 1- the ring is still full, so no wakeups;
> > 2- no data from channels;
> > 3- no streams;
> > 4- far from firing;
> > 5- the counter has already expired;
> > 6- (new, writing) no more reason to poll for writing.
> > We are stuck again. I think this can easily be fixed with:
> > - during worker_source_prepare (before the poll):
> >   if (worker->was_stuck && !is_stuck(worker))
> >     return TRUE;
> >   so the event is triggered and the poll resumes;
> > - during worker_source_dispatch (after the poll):
> >   worker->was_stuck = is_stuck(worker);
> >   if (!worker->was_stuck)
> >     red_process_XXX(...);
> > 
> > Hope I didn't forget anything (it took me quite a while to write this
> > mail), and sorry for the possible thousand mistakes.
> > 
> > Oh... what about iterations before and after all of this?
> > Before the glib loop: about 500 iterations;
> > after the glib loop: about 700 iterations;
> > after all these fixes: about 350 iterations (probably some of the
> > difference is due to slowdown and iteration collapsing caused by the
> > debugging code).
> > 
> > Frediano
> > 
> > [1] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=6131c1942a80c4fb909697f5f4c44fe010fc6f92
> > [2] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=aef334e6e8abc629a4603c890a2982d542b627f4
> > [3] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=ad51aeec0913e6d5ecf7e403e96612658f3b13f2

Frediano
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel



