On Mon, 2015-12-21 at 10:17 -0500, Frediano Ziglio wrote:
> > On Fri, 2015-12-18 at 06:10 -0500, Frediano Ziglio wrote:
> > > Hi
> > > 
> > > ----- Original Message -----
> > > > > On Wed, 2015-12-16 at 18:16 -0500, Marc-André Lureau wrote:
> > > > > > Hi
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > I actually think that was it. The worker thread may currently
> > > > > > block because the client is too slow. But then, when we release
> > > > > > an item, the pipe is hopefully going to be smaller, and thus we
> > > > > > can try to dispatch again without waiting for the timeout.
> > > > > > (Waiting for timers can really degrade performance in some
> > > > > > cases, as you see today with the qemu virgl timer, but that's
> > > > > > another story where I can show you figures when we merge gl
> > > > > > scanouts in spice ;)
> > > > > 
> > > > > Then I'm still confused. In the old loop, we had a very explicit
> > > > > sequence of behavior. For each loop:
> > > > > 
> > > > > 1 calculate timeout for poll() function
> > > > > 2 poll() fds with timeout from step 1
> > > > > 3 service expired timers
> > > > > 4 handle fd watch events
> > > > > 5 process cursor commands from guest
> > > > > 6 process display commands from guest
> > > > > 7 push queued messages to clients
> > > > > 8 GOTO 1
> > > > > 
> > > > > After pushing the queued messages to the client, we did not go
> > > > > back and handle cursor and display commands again. We went back to
> > > > > the top of the loop and did a poll with timeout again. So I fail
> > > > > to see why the glib loop needs to short-circuit the timeout and
> > > > > return to step 5 when the old design didn't.
> > > > > 
> > > > > Or is this change unrelated to the new glib loop, and has the
> > > > > worker simply always been too slow, even with the old loop
> > > > > implementation?
> > > > 
> > > > I wouldn't say "too slow", and I wouldn't try to mimic the old code
> > > > either. For me the goal is simply to process the pipe/commands as
> > > > fast as possible and, if possible, get rid of timeouts.
> > > 
> > > I can understand that not mimicking the old code can be better, but at
> > > the moment this is causing a bit of a regression, which this patch was
> > > trying to avoid (I think).
> > > 
> > > I looked more deeply at the glib code. The sources are stored
> > > (appended) in multiple lists, one per priority; at the moment we are
> > > using a single priority, which is similar to the former code anyway.
> > > Considering that the processing code is registered at worker
> > > initialization, and that watches are recreated whenever their event
> > > masks are updated, the watch sources can end up after the processing
> > > source, turning the loop into:
> > > 
> > > 1 calculate timeout for poll() function
> > > 2 poll() fds with timeout from step 1
> > > 3 process cursor commands from guest
> > > 4 process display commands from guest
> > > 5 push queued messages to clients
> > > 6 service expired timers
> > > 7 handle fd watch events
> > > 8 GOTO 1
> > > 
> > > (I'm not sure about the timers, but they are not that frequent.) This
> > > obviously makes things worse: after step 7 you have to wait for the
> > > poll (step 2) before commands are processed again. I'm not saying the
> > > patch doesn't also address some other issues, but the glib code
> > > actually makes this patch more useful (that is, glib has regressions).
> > > 
> > > Looking at this more broadly: what are this patch (and possibly
> > > "worker: avoid server hanging when no client are connected.") trying
> > > to fix? Basically, handling slow clients.
> > > On fast clients you don't have many issues: the pipe queue is (or
> > > should be) quite small and all commands get processed. But on slow
> > > clients we only process until the queue reaches a given size. I
> > > honestly don't have full knowledge of what happens next. When you do a
> > > push on a RedChannel, you try to send items until the network queue is
> > > full. If the client is simply not reading, you can only add items to
> > > the queue. After a while the limit is reached and, to avoid queuing
> > > indefinitely, you stop processing commands. Now I have some questions:
> > > - What happens in the guest? Does the screen just stop getting updates
> > >   after a while? And what if the client asks for an update?
> > > - Does item collapsing occur only when no client is connected? I think
> > >   what should happen is that if there are too many items, they should
> > >   be collapsed (drawing reduced onto surfaces) and processing should
> > >   continue.
> > > - What happens if the client keeps sending requests that trigger
> > >   appending items but never reads? Can a DoS be triggered?
> > > 
> > > Frediano
> > 
> > So I did some very simplistic testing here. Basically I looked at the
> > red_process_display() function to see how often we encountered a full
> > pipe queue (since that is the situation that Marc-André thinks this
> > patch is trying to improve). I was curious how often this happened, so
> > I instrumented the code to count how many times we called
> > red_process_display() and how many of those times we returned from this
> > function due to a full queue (rather than because there were no more
> > commands to process).
> > Here's the patch:
> > 
> > diff --git a/server/red-worker.c b/server/red-worker.c
> > index dfaf4ba..dec635d 100644
> > --- a/server/red-worker.c
> > +++ b/server/red-worker.c
> > @@ -229,7 +235,10 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
> >      QXLCommandExt ext_cmd;
> >      int n = 0;
> >      uint64_t start = spice_get_monotonic_time_ns();
> > +    static int total = 0;
> > +    static int full = 0;
> >  
> > +    total++;
> >      if (!worker->running) {
> >          *ring_is_empty = TRUE;
> >          return n;
> > @@ -335,6 +344,9 @@ static int red_process_display(RedWorker *worker, uint32_t max_pipe_size, int *r
> >              return n;
> >          }
> >      }
> > +    full++;
> > +    fprintf(stderr, "display channel exceeded max pipe size: %i/%i (%g)\n",
> > +            full, total, (double)full / total * 100.0);
> >      return n;
> >  }
> > 
> > This was my procedure:
> > 
> > - Boot RHEL7 guest
> > - wait for boot sequence to finish
> > - log into GNOME
> > - start web browser, open a page
> > - open another page in another tab
> > - open video player and play video in full screen
> > - enable second display
> > - resize second display
> > - disable second display
> > - exit video player
> > - shut down guest
> > 
> > I did this same test with the current master branch, with the glib loop
> > patch applied, and also with this patch on top of the glib loop patch.
> > 
> > What I noticed is that during the boot sequence and the shutdown
> > sequence (when we're in a VT and text is scrolling up the screen), we
> > hit the queue limit quite frequently. In fact, nearly 30% of the time
> > that we call red_process_display(), we return because our pipe queue is
> > full. However, after X has started and we're in the desktop, we never
> > hit a full-queue condition. When I shut down the guest, it briefly
> > returns to the VT until it powers down, and we again hit some
> > full-queue conditions.
> > 
> > This was true of both the old loop and the new glib loop (with and
> > without the patch).
> > 
> > There were some differences between the old and new loop. In
> > particular, the new glib loop called red_process_display() more times
> > in roughly the same elapsed time. It also had a higher percentage of
> > calls that returned because the queue was full (but again, this did not
> > happen during "normal" operation, only during the boot and shutdown
> > phases). The glib loop with this patch on top had roughly the same
> > percentage as the glib loop on its own, but the total number of calls
> > to red_process_display() was higher, unsurprisingly.
> > 
> > Here is some example output:
> > 
> > Old loop: http://ur1.ca/ocnyn
> > Glib loop: http://ur1.ca/ocnyp
> > Glib loop with this patch applied: http://ur1.ca/oco1u
> > 
> > Notice that lines such as this indicate that GDM has started or I've
> > logged into GNOME:
> > 
> > main_channel_handle_parsed: agent start
> > 
> > Then I use the desktop for about a minute before shutting down again.
> > No more log messages are printed until shutdown begins.
> > 
> > I also did another test where I used the guest "normally" (browsing the
> > internet, watching videos, writing text in LibreOffice, etc.) for quite
> > a while, and still never triggered the full-queue condition.
> > 
> > I'm not sure exactly what conclusions to draw from this. There is
> > probably some better way to test it. Perhaps we should measure how long
> > a pipe item stays in the queue before it is sent.
> > 
> > Jonathon
> 
> That's great!
> 
> I'm quite scared about the loop counts using glib. Mainly, it looks
> like the loop is executed twice as much (with a possible CPU usage
> increase).

Yeah, with the plain glib loop it's about 1.5x, but with the additional
wakeup-after-releasing-a-pipe-item patch, it's closer to double.

> I agree it would be helpful to see the latency of the items.
> 
> Did you try increasing client latency or decreasing bandwidth?
> I think the replay utility could be useful for these tests (didn't try).
> 
> Frediano

Nope, I didn't try any latency or bandwidth adjustments yet. This was
just a first step to get a rough comparison and then figure out exactly
what we should be testing.

I haven't actually used the replay utility much yet, so I may have some
questions about it when I start using it. But that's a good suggestion.

Jonathon

_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel