> > Hi, > I spent quite some time trying to understand the main loop of spice-server. > (Perhaps I'm tired of rebasing the glib loop patches after 250 patches!) > > There were quite some questions on the loop and the patches left: > - why new code using glib is doing much more iterations? > - where the client blocking problems came? See [1], [2] and [3]. > - why all that polling code in red-worker is needed? > > Glib and iterations > ------------------- > Using strace (poll and other debugging) I tried to understand where all these > iterations came. > The reason is multi-threading (and some problems on Glib). > In our implementation multithreading is simple: we only have a thread! > But Glib supports multithreading! So assume one thread is doing the poll > and another what to add a file descriptor or change a timeout how to stop > the other thread and make it consider the modified information? > The solution (which I even coded for similar stuff!) is adding a file > descriptor > to wake up the other thread. You can see quite clearly from strace, > spice-server-replay > was waken up by a file descriptor 10 which turned out to be a eventfd. > These are the events which should make the poll stop: > - timeout changes; > - file descriptor added or removed; > - file descriptor event types (like read and/or write) changed; > - other special events (loop exit, explicit wakeups, signals). > But which kind of events of the above happens in spice-server? > Our current implementation of timers/watches add and remove source events > for many changes (timer started again or event mask changed or even set with > same value). Turn out however that these events are not happening that > frequently. > Turn out that to handle recursion of loops Glib does a lot of file descriptor > addition/removal during the loop. This happen every time so every poll call > is > done twice. > I then added some debugging on when the loop happened (worker_source_prepare > is > called), when the poll happened and when the eventfd is cleared. > What usually happened was > - poll, one of our events was triggered > - prepare called > - poll, eventfd triggered > - prepare called > - poll, eventfd triggered > - eventfd cleared > So each event needed two prepare call and 3 polls (so 3 glib iteration) > happened. > One thing to consider: we run everything in a thread so why changes to file > descriptor should behave like they are waking the poll even if the poll > was surely not run? The reason is that although this comment in Glib: > > /* Now wake up the main loop if it is waiting in the poll() */ > > the wakeup line: > > g_wakeup_signal (context->wakeup); > > is most of the time executed! > Second consideration: why eventfd is cleared the second time the event is > triggered and not at the first time? The reason is that in these lines > > if (context->wake_up_rec.revents) > g_wakeup_acknowledge (context->wakeup); > > revents is not set and you have to wait another poll round! > > So how to fix these Glib issues? I wrote this patch (still to send): > http://pastebin.com/XT0W7xnQ. Is working quite fine on my Fedora system > (I replaced the system library!). > On my tests (using spice-server with replay utility) the iteration went from > 700 to 350! > > > Spice-server loop > ----------------- > So why is client sometimes blocking? > First let's see the events the worker loop is handling: > 1- data from dispatcher (this includes wakeups from guest); > 2- data from channels (so clients); > 3- timeouts on streams; > 4- other timeouts (like pings); > 5- polling from last ring empty (200 times with a poll of > 10 milliseconds). > About this last point why is this needed? > Also note that loops are counted so that 200 times can expire quite > fast if guest (for instance) keeps sending wakeups (see [2])! > > What happens if: > - client does not read data for a while and channel pipes are full > (network buffer full and items maxed out) > - guest fills the command ring as command processing is not running > due to first condition > - guest send a lot of wakeups (see [2]) > ? > Simple: the loop stop entirely! Let's look at the above conditions: > 1- if nothing special (like user trying to resize client windows) > happens no data are sent to dispatcher as guest is not appending > to ring and so wakeups are not sent; > 2- channels have no reasons to send data; > 3- there are no stream so no timeouts; > 4- there are quite long to happen, we can't wait minutes; > 5- the counter run out so the polling stopped. > What about if after a while client read data and network buffer became > available again? Nothing! There is no check on this (1 and 2 are just > checking from data received). > > Happily I managed to reproduce this event with this command > > xzcat win7-boot-record2.xz | SPICE_DEBUG_LEVEL=3 strace -o trace.txt -f -e > poll,close,execve,clone,write,writev,read,recv -s 100 ./libtool > --mode=execute ./server/tests/spice-server-replay -p 1235 -c 'remote-viewer > spice://localhost:1235' -w - -C 5 > > Could be that this command is helpful for somebody that wants to try, the > server > hangs when large portion of screen are updated (this was a recording of a > machine > booting as name suggests). > > So why the polling is helpful? Well... to try avoiding these condition to > happen! However as me and others demonstrate this is still possible! > > How to fix this? My actual fix is using setting event_timeout when we exit > from > red_process_display from last return inside the loop (see also [1]) and last > return. > > What should be in my opinion a better solution? > I think that instead of using polling we should change code in order to wait > if > we can send data to client (see my "implement no pushing on RedWorker" patch > on the ML for a possible solution, see [3]). This will solve most of the > problem > but still consider this: > - we are in previous condition (no wakeup, ring full, channel pipes full, > full > sending network buffers); > - client receive data and buffers became available; > - poll is waken up (now we detect above condition): > - we try to check commands but channel pipes are still full; > - we send all data to network so now we stop polling for network buffers; > - poll is run again. > Let's see the above poll conditions: > 1- ring is still full so no wakeups; > 2- no data from channels; > 3- no streams; > 4- far to happen; > 5- counter are already expired; > 6- (new, writing) no more reason to poll for writing. > We are stuck again. I think this can be easyly be fixed with: > - during worker_source_prepare (before poll): > if (worker->was_stuck && !is_stuck(worker)) > return TRUE; > so event is triggered and poll resume > - during worker_source_dispatch (after poll): > worker->was_stuck = is_stuck(worker); > if (!worker->was_stuck) > red_process_XXX(...); > > Hope I didn't forget nothing (took me quite a while to write this mail) and > sorry > for the possible thousand mistakes. > > Oh... what about iterations before and after all of this? > Before glib loop: about 500 iterations; > After glib loop: about 700 iterations; > After all these fixes: about 350 iterations (probably some are due to > slowdown > and iteration collapsing due to debugging stuff). > > Frediano > > [1] > http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=6131c1942a80c4fb909697f5f4c44fe010fc6f92 > [2] > http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=aef334e6e8abc629a4603c890a2982d542b627f4 > [3] > http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=ad51aeec0913e6d5ecf7e403e96612658f3b13f2 Would be really helpful if someone instead of digging directly into the code could ask me what is not clear to have a good explanation understandable so to have a good documentation/commit log. I have a small roadmap that (not in strictly order) contains: - finally glib integration - loop opimizations - glib patches (just opened https://bugzilla.gnome.org/show_bug.cgi?id=761102) - cork flag usage to reduce network traffic decreasing fragmentation - more automatic tests on loop (test-loop already existing) currently mainly testing timers but not watches - changes to remove mostly polling (usleep calls) - handle cursor and display commands at same time during flushing - use direct poll for limited loops - try to reproduce automatically some issues - better handling of window acknowledge - automatic bandwidth detection Frediano _______________________________________________ Spice-devel mailing list Spice-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/spice-devel