Re: [PATCH 8/9] worker: make sure we dispatch after releasing items

> 
> On Fri, 2015-12-18 at 06:10 -0500, Frediano Ziglio wrote:
> > > Hi
> > > 
> > > ----- Original Message -----
> > > > On Wed, 2015-12-16 at 18:16 -0500, Marc-André Lureau wrote:
> > > > > Hi
> > > > > 
> > > > > ----- Original Message -----
> > > > > I actually think that was it. The worker thread may currently block
> > > > > because the client is too slow. But then, when we release an item, the
> > > > > pipe is hopefully going to be smaller, and thus we can try to dispatch
> > > > > again without waiting for the timeout. (Waiting for timers can really
> > > > > degrade performance in some cases, like you see today with the qemu
> > > > > virgl timer, but that's another story where I can show you figures when
> > > > > we merge gl scanouts in spice ;)
> > > > 
> > > > Then I'm still confused. In the old loop, we had a very explicit
> > > > sequence of behavior. For each loop:
> > > > 
> > > >  1 calculate timeout for poll() function
> > > >  2 poll() fds with timeout from step 1
> > > >  3 service expired timers
> > > >  4 handle fd watch events
> > > >  5 process cursor commands from guest
> > > >  6 process display commands from guest
> > > >  7 push queued messages to clients
> > > >  8 GOTO 1
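
For reference, the eight steps above can be sketched roughly like this in C
(hypothetical helper names; this is only an illustration of the ordering, not
the actual red-worker.c code):

#include <poll.h>

/* Hypothetical stand-ins for the steps of the old worker loop. */
extern int  calculate_poll_timeout(void);                      /* step 1 */
extern void service_expired_timers(void);                      /* step 3 */
extern void handle_watch_events(struct pollfd *fds, int nfds); /* step 4 */
extern void process_cursor_commands(void);                     /* step 5 */
extern void process_display_commands(void);                    /* step 6 */
extern void push_queued_messages(void);                        /* step 7 */

static void old_worker_loop(struct pollfd *fds, int nfds)
{
    for (;;) {                                       /* step 8: GOTO 1 */
        int timeout_ms = calculate_poll_timeout();   /* step 1 */
        poll(fds, nfds, timeout_ms);                 /* step 2 */
        service_expired_timers();                    /* step 3 */
        handle_watch_events(fds, nfds);              /* step 4 */
        process_cursor_commands();                   /* step 5 */
        process_display_commands();                  /* step 6 */
        push_queued_messages();                      /* step 7 */
    }
}
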
> > > > 
> > > > After pushing the queued messages to the client, we did not go back and
> > > > handle cursor and display commands again. We went back to the top of the
> > > > loop and did a poll with timeout again. So I fail to see why the glib
> > > > loop needs to short-circuit the timeout and return to step 5 when the old
> > > > design didn't.
> > > > 
> > > > Or is this change unrelated to the new glib loop, and the worker has
> > > > simply always been too slow even with the old loop implementation?
> > > > 
> > > 
> > > I wouldn't say "too slow", and I wouldn't try to mimic the old code
> > > either.
> > > For me the goal is simply to process the pipe/command as fast as possible
> > > and if possible, get rid of timeouts.
> > 
> > I can understand that not mimicking the old code can be better, but at the
> > moment this is causing some regressions, and this patch was (I think) trying
> > to avoid some of them.
> > 
> > I looked more deeply at the glib code. The sources are stored (appended) in
> > multiple lists, one for every priority, and at the moment we are using a
> > single priority, which is similar to the former code anyway. Considering
> > that the processing code is registered at worker initialization and that
> > watches are recreated when event masks are updated, this can leave the
> > watches after the processing sources, making the loop:
> > 
> >  1 calculate timeout for poll() function
> >  2 poll() fds with timeout from step 1
> >  3 process cursor commands from guest
> >  4 process display commands from guest
> >  5 push queued messages to clients
> >  6 service expired timers
> >  7 handle fd watch events
> >  8 GOTO 1
> > (not sure about the timers, but they are not that frequent).
> > This obviously makes things worse, as after step 7 you have to wait for the
> > poll (2) to get commands processed again.
> > I'm not saying that the patch doesn't also address some other issues, but
> > the glib code actually makes this patch more useful (that is, the glib loop
> > has regressions).
> > 
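
A minimal sketch of the attach-order point above, using hypothetical source
names rather than the actual red-worker.c sources: within one priority glib
keeps sources in the order they were attached, so attaching the fd watch
source before the command-processing source would keep fd events ahead of
command processing in each iteration.

#include <glib.h>

/* Sketch only: "watch_source" and "process_source" stand in for the
 * worker's fd watch and command-processing sources. In the real worker
 * the watches are recreated when event masks change, which is what can
 * push them behind the processing source. */
static void attach_worker_sources(GMainContext *context,
                                  GSource *watch_source,
                                  GSource *process_source)
{
    g_source_set_priority(watch_source, G_PRIORITY_DEFAULT);
    g_source_set_priority(process_source, G_PRIORITY_DEFAULT);

    /* Same priority, so attach order decides dispatch order. */
    g_source_attach(watch_source, context);
    g_source_attach(process_source, context);
}
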
> > Looking at this more broadly: what is this patch (and possibly "worker:
> > avoid server hanging when no client are connected.") trying to fix?
> > Basically, handling slow clients. With fast clients you don't have many
> > issues: the pipe queue is (or should be) quite small and all commands get
> > processed. But with slow clients we only process commands until the queue
> > reaches a given size. I honestly don't have full knowledge of what happens
> > next. When you do a push on RedChannel you try to send as many items as
> > possible until the network queue is full. If the client is just not
> > processing, you can only add items to the queue. After a while the limit is
> > reached and, to avoid queuing indefinitely, you stop processing commands.
> > Now I have some questions:
> > - what happens in the guest? Does the screen simply get no more updates
> >   after a while? And what if the client asks for an update?
> > - does item collapsing occur only if no client is connected? I think that
> >   what should happen is that if there are too many items, items should be
> >   collapsed (reduced drawing on surfaces) and processing should continue.
> > - what happens if the client keeps sending requests that trigger appending
> >   items but avoids reading? Can a DoS be triggered?
> > 
> > Frediano
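
A rough sketch of the back-pressure described above (commands are processed
only while the client pipe is below the limit), with made-up types and helper
names rather than the real red-worker.c code:

/* Made-up types and helpers, for illustration only. */
typedef struct Worker Worker;                 /* opaque worker state */
typedef struct Command { int type; } Command; /* placeholder payload */

extern unsigned pipe_size(Worker *w);               /* items queued for the client */
extern int      pop_command(Worker *w, Command *c); /* 0 when the guest ring is empty */
extern void     handle_command(Worker *w, Command *c);

static void process_commands(Worker *worker, unsigned max_pipe_size)
{
    Command cmd;

    /* Stop as soon as the pipe reaches the limit so items do not pile up
     * indefinitely for a slow client; the guest ring is then left alone
     * until items are released and processing is triggered again. */
    while (pipe_size(worker) < max_pipe_size && pop_command(worker, &cmd)) {
        handle_command(worker, &cmd);
    }
}
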
> 
> 
> 
> So I did some very simplistic testing here. Basically I looked at the
> red_process_display() function to see how often we encountered a full pipe
> queue (since that is the situation that Marc-Andre thinks this patch is
> trying to improve). I was curious how often this happened, so I just
> instrumented the code to count how many times we called red_process_display()
> and how many of those times we returned from this function due to a full
> queue (rather than because there were no more commands to process). Here's
> the patch:
> 
> diff --git a/server/red-worker.c b/server/red-worker.c
> index dfaf4ba..dec635d 100644
> --- a/server/red-worker.c
> +++ b/server/red-worker.c
> @@ -229,7 +235,10 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
>      QXLCommandExt ext_cmd;
>      int n = 0;
>      uint64_t start = spice_get_monotonic_time_ns();
> +    static int total = 0;
> +    static int full = 0;
>  
> +    total++;
>      if (!worker->running) {
>          *ring_is_empty = TRUE;
>          return n;
> @@ -335,6 +344,9 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
>              return n;
>          }
>      }
> +    full++;
> +    fprintf(stderr, "display channel exceeded max pipe size: %i/%i (%g)\n",
> +            full, total, (double)full / total * 100.0);
>      return n;
>  }
>  
> 
> This was my procedure:
> 
> - Boot RHEL7 guest
> - wait for boot sequence to finish
> - log into GNOME
> - start web browser, open a page
> - open another page in another tab
> - open video player and play video in full screen
> - enable second display
> - resize second display
> - disable second display
> - exit video player
> - shut down guest
> 
> I did this same test with the current master branch, with the glib loop patch
> applied, and also with this patch on top of the glib loop patch.
> 
> What I noticed is that during the boot sequence and the shutdown sequence
> (when we're in a VT and text is scrolling up the screen), we hit the queue
> limit quite frequently. In fact, nearly 30% of the time that we call
> red_process_display(), we return because our pipe queue is full. However,
> after X has started and we're in the desktop, we never hit a full-queue
> condition. When I shut down the guest, it briefly returns to the VT until it
> powers down, and we again hit some full-queue conditions.
> 
> This was true of both the old loop and the new glib loop (with and without
> the patch).
> 
> There were some differences between the old and new loop. In particular, the
> new glib loop called red_process_display() more times in roughly the same
> elapsed time. It also had a higher percentage of calls that returned because
> the queue was full (but again, this did not happen during "normal" operation,
> but only during the boot and shutdown phase). The glib loop with this patch
> on top had roughly the same percentage as the glib loop on its own, but the
> total number of calls to red_process_display() was higher, unsurprisingly.
> 
> 
> Here is some example output:
> 
> Old loop: http://ur1.ca/ocnyn
> Glib loop: http://ur1.ca/ocnyp
> glib loop with this patch applied: http://ur1.ca/oco1u
> 
> Notice that lines such as this indicate that GDM has started or I've logged
> into GNOME:
> 
>    main_channel_handle_parsed: agent start
> 
> Then I use the desktop for about a minute before shutting down again. No more
> log messages are printed until shutdown begins.
> 
> I also did another test where I used the guest "normally" (browsing the
> internet, watching videos, writing text in libreoffice, etc) for quite a
> while, and still never triggered the full-queue condition.
> 
> I'm not sure what conclusions to draw from this exactly. There is probably
> some better way to test this. Perhaps we should test how long a pipe item
> stays in the queue before it is sent.
> 
> Jonathon
> 

That's great!

I'm quite concerned about the loop counts with the glib loop. Mainly, it looks
like the loop is executed about twice as often (with a possible increase in
CPU usage).

I agree it would be helpful to see the latency of the items.
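
A rough sketch of what such a measurement could look like (assumed hook points
and names, nothing that exists in the tree today): stamp the item when it is
queued and log its age when it is finally written to the client.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical wrapper: the real pipe item would carry this timestamp. */
typedef struct TimedItem {
    uint64_t queued_ns;
} TimedItem;

static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call when the item is added to the channel pipe. */
static void item_queued(TimedItem *item)
{
    item->queued_ns = monotonic_ns();
}

/* Call when the item is actually written to the client socket. */
static void item_sent(const TimedItem *item)
{
    fprintf(stderr, "pipe item latency: %.3f ms\n",
            (monotonic_ns() - item->queued_ns) / 1e6);
}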

Did you try increasing client latency or decreasing bandwidth?
I think the replay utility could be useful for these tests (I didn't try it).

Frediano
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel



