On Mon, 2015-12-21 at 10:17 -0500, Frediano Ziglio wrote:
> > On Fri, 2015-12-18 at 06:10 -0500, Frediano Ziglio wrote:
> > > Hi
> > > 
> > > ----- Original Message -----
> > > > > On Wed, 2015-12-16 at 18:16 -0500, Marc-André Lureau wrote:
> > > > > > Hi
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > I actually think that was it. The worker thread may currently
> > > > > > block because the client is too slow. But then, when we release
> > > > > > an item, the pipe is hopefully going to be smaller, and thus we
> > > > > > can try to dispatch again without waiting for the timeout.
> > > > > > (Waiting for timers can really degrade performance in some
> > > > > > cases, as you see today with the qemu virgl timer, but that's
> > > > > > another story where I can show you figures when we merge gl
> > > > > > scanouts in spice ;)
> > > > > 
> > > > > Then I'm still confused. In the old loop, we had a very explicit
> > > > > sequence of behavior. For each loop:
> > > > > 
> > > > > 1 calculate timeout for poll() function
> > > > > 2 poll() fds with timeout from step 1
> > > > > 3 service expired timers
> > > > > 4 handle fd watch events
> > > > > 5 process cursor commands from guest
> > > > > 6 process display commands from guest
> > > > > 7 push queued messages to clients
> > > > > 8 GOTO 1
> > > > > 
> > > > > After pushing the queued messages to the client, we did not go
> > > > > back and handle cursor and display commands again. We went back to
> > > > > the top of the loop and did a poll with timeout again. So I fail
> > > > > to see why the glib loop needs to short-circuit the timeout and
> > > > > return to step 5 when the old design didn't.
> > > > > 
> > > > > Or is this change unrelated to the new glib loop, and has the
> > > > > worker simply always been too slow, even with the old loop
> > > > > implementation?
> > > > 
> > > > I wouldn't say "too slow", and I wouldn't try to mimic the old code
> > > > either. For me the goal is simply to process the pipe/commands as
> > > > fast as possible and, if possible, get rid of timeouts.
> > > 
> > > I can understand that not mimicking the old code can be better, but at
> > > the moment this is causing a bit of a regression, which this patch was
> > > trying to avoid (I think).
> > > 
> > > I looked more deeply at the glib code. The sources are stored
> > > (appended) in multiple lists, one per priority; at the moment we are
> > > using a single priority, which is similar to the former code anyway.
> > > Considering that the processing code is registered at worker
> > > initialization, and that watches are recreated whenever their event
> > > masks are updated, the watch sources can end up after the processing
> > > source, turning the loop into:
> > > 
> > > 1 calculate timeout for poll() function
> > > 2 poll() fds with timeout from step 1
> > > 3 process cursor commands from guest
> > > 4 process display commands from guest
> > > 5 push queued messages to clients
> > > 6 service expired timers
> > > 7 handle fd watch events
> > > 8 GOTO 1
> > > 
> > > (I'm not sure about the timers, but they are not that frequent.) This
> > > obviously makes things worse: after step 7 you have to wait for the
> > > poll (step 2) before commands are processed again. I'm not saying the
> > > patch doesn't also address some other issues, but the glib code
> > > actually makes this patch more useful (that is, glib has regressions).
> > > 
> > > Looking at this more broadly: what are this patch (and possibly
> > > "worker: avoid server hanging when no client are connected.") trying
> > > to fix? Basically, handling slow clients.
> > > On fast clients you don't have many issues: the pipe queue is (or
> > > should be) quite small and all commands get processed. But on slow
> > > clients we only process until the queue reaches a given size. I
> > > honestly don't have full knowledge of what happens next. When you do a
> > > push on a RedChannel, you try to send items until the network queue is
> > > full. If the client is simply not reading, you can only add items to
> > > the queue. After a while the limit is reached and, to avoid queuing
> > > indefinitely, you stop processing commands. Now I have some questions:
> > > - What happens in the guest? Does the screen just stop getting updates
> > >   after a while? And what if the client asks for an update?
> > > - Does item collapsing occur only when no client is connected? I think
> > >   what should happen is that if there are too many items, they should
> > >   be collapsed (drawing reduced onto surfaces) and processing should
> > >   continue.
> > > - What happens if the client keeps sending requests that trigger
> > >   appending items but never reads? Can a DoS be triggered?
> > > 
> > > Frediano
> > 
> > So I did some very simplistic testing here. Basically I looked at the
> > red_process_display() function to see how often we encountered a full
> > pipe queue (since that is the situation that Marc-André thinks this
> > patch is trying to improve). I was curious how often this happened, so
> > I instrumented the code to count how many times we called
> > red_process_display() and how many of those times we returned from this
> > function due to a full queue (rather than because there were no more
> > commands to process).
> > Here's the patch:
> > 
> > diff --git a/server/red-worker.c b/server/red-worker.c
> > index dfaf4ba..dec635d 100644
> > --- a/server/red-worker.c
> > +++ b/server/red-worker.c
> > @@ -229,7 +235,10 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
> >      QXLCommandExt ext_cmd;
> >      int n = 0;
> >      uint64_t start = spice_get_monotonic_time_ns();
> > +    static int total = 0;
> > +    static int full = 0;
> >  
> > +    total++;
> >      if (!worker->running) {
> >          *ring_is_empty = TRUE;
> >          return n;
> > @@ -335,6 +344,9 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
> >              return n;
> >          }
> >      }
> > +    full++;
> > +    fprintf(stderr, "display channel exceeded max pipe size: %i/%i (%g)\n",
> > +            full, total, (double)full / total * 100.0);
> >      return n;
> >  }
> > 
> > This was my procedure:
> > 
> > - Boot RHEL7 guest
> > - wait for boot sequence to finish
> > - log into GNOME
> > - start web browser, open a page
> > - open another page in another tab
> > - open video player and play video in full screen
> > - enable second display
> > - resize second display
> > - disable second display
> > - exit video player
> > - shut down guest
> > 
> > I did this same test with the current master branch, with the glib loop
> > patch applied, and also with this patch on top of the glib loop patch.
> > 
> > What I noticed is that during the boot sequence and the shutdown
> > sequence (when we're in a VT and text is scrolling up the screen), we
> > hit the queue limit quite frequently. In fact, nearly 30% of the time
> > that we call red_process_display(), we return because our pipe queue is
> > full. However, after X has started and we're in the desktop, we never
> > hit a full-queue condition. When I shut down the guest, it briefly
> > returns to the VT until it powers down, and we again hit some
> > full-queue conditions.
> > 
> > This was true of both the old loop and the new glib loop (with and
> > without the patch).
> > 
> > There were some differences between the old and new loop. In
> > particular, the new glib loop called red_process_display() more times
> > in roughly the same elapsed time. It also had a higher percentage of
> > calls that returned because the queue was full (but again, this did not
> > happen during "normal" operation, only during the boot and shutdown
> > phases). The glib loop with this patch on top had roughly the same
> > percentage as the glib loop on its own, but the total number of calls
> > to red_process_display() was higher, unsurprisingly.
> > 
> > Here is some example output:
> > 
> > Old loop: http://ur1.ca/ocnyn
> > Glib loop: http://ur1.ca/ocnyp
> > Glib loop with this patch applied: http://ur1.ca/oco1u
> > 
> > Notice that lines such as this indicate that GDM has started or I've
> > logged into GNOME:
> > 
> > main_channel_handle_parsed: agent start
> > 
> > Then I use the desktop for about a minute before shutting down again.
> > No more log messages are printed until shutdown begins.
> > 
> > I also did another test where I used the guest "normally" (browsing the
> > internet, watching videos, writing text in LibreOffice, etc.) for quite
> > a while, and still never triggered the full-queue condition.
> > 
> > I'm not sure exactly what conclusions to draw from this. There is
> > probably some better way to test it. Perhaps we should measure how long
> > a pipe item stays in the queue before it is sent.
> > 
> > Jonathon
> 
> That's great!
> 
> I'm quite scared about the loop counts using glib. Mainly, it looks
> like the loop is executed twice as much (with a possible CPU usage
> increase).

Yeah, with the plain glib loop it's about 1.5x, but with the additional
wakeup-after-releasing-a-pipe-item patch, it's closer to double.

> I agree it would be helpful to see the latency of the items.
> 
> Did you try increasing client latency or decreasing bandwidth?
> I think the replay utility could be useful for these tests (didn't try).
> 
> Frediano

Nope, I didn't try any latency or bandwidth adjustments yet. This was
just a first step to get a rough comparison and then figure out exactly
what we should be testing.

I haven't actually used the replay utility much yet, so I may have some
questions about it when I start using it. But that's a good suggestion.

Jonathon

_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel