> > On Tue, 2016-01-26 at 11:14 -0500, Frediano Ziglio wrote:
> > >
> > > Hi,
> > > I spent quite some time trying to understand the main loop of
> > > spice-server. (Perhaps I'm tired of rebasing the glib loop patches
> > > after 250 patches!)
> > >
> > > There were quite a few open questions on the loop and the remaining
> > > patches:
> > > - why is the new code using GLib doing many more iterations?
> > > - where did the client blocking problems come from? See [1], [2]
> > >   and [3].
> > > - why is all that polling code in red-worker needed?
> > >
> > > Glib and iterations
> > > -------------------
> > > Using strace (poll and other debugging) I tried to understand where
> > > all these iterations came from.
> > > The reason is multi-threading (and some problems in GLib).
> > > In our implementation multithreading is simple: we only have one
> > > thread! But GLib supports multithreading! So assume one thread is
> > > doing the poll and another wants to add a file descriptor or change
> > > a timeout: how do you stop the polling thread and make it consider
> > > the modified information?
> > > The solution (which I have even coded for similar situations!) is to
> > > add a file descriptor that wakes up the other thread. You can see
> > > quite clearly from strace that spice-server-replay was woken up by
> > > file descriptor 10, which turned out to be an eventfd.
> > > These are the events which should make the poll stop:
> > > - timeout changes;
> > > - file descriptor added or removed;
> > > - file descriptor event types (like read and/or write) changed;
> > > - other special events (loop exit, explicit wakeups, signals).
> > > But which kinds of events from the above actually happen in
> > > spice-server?
> > > Our current implementation of timers/watches adds and removes source
> > > events for many changes (a timer started again, an event mask
> > > changed, or even set to the same value). It turns out, however, that
> > > these events do not happen that frequently.
> > > It turns out that, to handle loop recursion, GLib does a lot of file
> > > descriptor addition/removal during the loop. This happens every
> > > time, so every poll call is done twice.
> > > I then added some debugging on when the loop iteration happened
> > > (worker_source_prepare is called), when the poll happened and when
> > > the eventfd was cleared.
> > > What usually happened was:
> > > - poll, one of our events was triggered;
> > > - prepare called;
> > > - poll, eventfd triggered;
> > > - prepare called;
> > > - poll, eventfd triggered;
> > > - eventfd cleared.
> > > So each event needed two prepare calls and three polls (so three
> > > GLib iterations).
> > > One thing to consider: we run everything in a single thread, so why
> > > should changes to file descriptors behave as if they have to wake up
> > > the poll even when the poll is surely not running? The reason is
> > > that, despite this comment in GLib:
> > >
> > >   /* Now wake up the main loop if it is waiting in the poll() */
> > >
> > > the wakeup line:
> > >
> > >   g_wakeup_signal (context->wakeup);
> > >
> > > is executed most of the time!
> > > Second consideration: why is the eventfd cleared the second time the
> > > event is triggered, and not the first time? The reason is that in
> > > these lines:
> > >
> > >   if (context->wake_up_rec.revents)
> > >     g_wakeup_acknowledge (context->wakeup);
> > >
> > > revents is not set, and you have to wait for another poll round!
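As an aside: this prepare/poll/dispatch pattern is easy to observe
outside spice-server too. Below is a small self-contained program I put
together to follow along (plain GLib, not spice-server code and not the
pastebin patch; all the names in it are mine): a custom GSource wrapping
an eventfd that logs its prepare and dispatch calls, so they can be
lined up against the poll() calls seen in strace, the same kind of
tracing described above.

  /* Sketch: custom GSource around an eventfd, logging prepare/dispatch
   * so iterations can be correlated with poll() in strace.
   * Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0) */
  #include <glib.h>
  #include <sys/eventfd.h>
  #include <unistd.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      GSource source;
      int fd;
      gpointer fd_tag;   /* tag returned by g_source_add_unix_fd() */
  } EventFdSource;

  static gboolean efd_prepare(GSource *source, gint *timeout)
  {
      (void) source;
      *timeout = -1;     /* let poll() block until something fires */
      fprintf(stderr, "prepare\n");
      return FALSE;      /* never ready before polling */
  }

  static gboolean efd_check(GSource *source)
  {
      EventFdSource *efd = (EventFdSource *) source;
      return (g_source_query_unix_fd(source, efd->fd_tag) & G_IO_IN) != 0;
  }

  static gboolean efd_dispatch(GSource *source, GSourceFunc cb, gpointer data)
  {
      EventFdSource *efd = (EventFdSource *) source;
      uint64_t value;
      (void) cb; (void) data;
      fprintf(stderr, "dispatch\n");
      if (read(efd->fd, &value, sizeof value) < 0)   /* clear the eventfd */
          perror("read");
      return G_SOURCE_CONTINUE;
  }

  static GSourceFuncs efd_funcs = { efd_prepare, efd_check, efd_dispatch, NULL };

  static gboolean poke(gpointer data)   /* periodic timer: fire the eventfd */
  {
      uint64_t one = 1;
      if (write(GPOINTER_TO_INT(data), &one, sizeof one) < 0)
          perror("write");
      return G_SOURCE_CONTINUE;
  }

  int main(void)
  {
      int fd = eventfd(0, EFD_NONBLOCK);
      EventFdSource *efd =
          (EventFdSource *) g_source_new(&efd_funcs, sizeof *efd);
      efd->fd = fd;
      efd->fd_tag = g_source_add_unix_fd((GSource *) efd, fd, G_IO_IN);
      g_source_attach((GSource *) efd, NULL);
      g_timeout_add(100, poke, GINT_TO_POINTER(fd));
      g_main_loop_run(g_main_loop_new(NULL, FALSE));
      return 0;
  }

Running this under strace -e poll lets you count the polls per dispatch
and compare against the numbers reported above.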
> > > So how to fix these GLib issues? I wrote this patch (still to
> > > send): http://pastebin.com/XT0W7xnQ. It is working quite fine on my
> > > Fedora system (I replaced the system library!).
> > > In my tests (using spice-server with the replay utility) the
> > > iterations went from 700 to 350!
> >
> > Some additional information.
> >
> > reds streams and reds channels.
> >
> > reds streams are the base. They mostly resemble sockets. You can
> > write bytes, read bytes, and check if you can write or if there is
> > data to read; they have a file descriptor to use for select/poll.
> > At the lower level, a reds stream sends data using a file descriptor
> > (either an internet socket or a unix one), so it uses the kernel's
> > network buffers.
> >
> > reds channels use reds streams, but they send and receive messages.
> > Messages are defined by the protocol. When you want to send a message
> > you append a pipe item to the channel. When the reds channel can send
> > data, it takes one pipe item from the queue; the item marshals one or
> > more messages into a buffer, and the reds channel then starts sending
> > the buffer. Once the buffer is entirely sent, it can dequeue another
> > pipe item, use it to marshal more messages into another buffer, and
> > so on.
> > All writes to the reds stream are non-blocking. If the reds channel
> > detects that it cannot write because there is no space, it stops
> > sending data and the reds channel becomes blocked.
> >
> > command processing (in red worker, implemented in red-worker.c)
> >
> > on every red worker loop iteration the worker tries to get and handle
> > as many commands as possible. There are however some limitations:
> > - if there are too many items queued on the reds channel it stops.
> >   The maximum number of items is MAX_PIPE_SIZE;
> > - if some processing is taking too much (currently defined as 10ms)
> >   it stops;

From reading the code, it looks like it just processes as many commands
as it can within 10ms (assuming the other conditions are met). Maybe
this is what you meant, but your statement made it sound like it was
checking whether one particular operation was taking too much time,
which is not really the case (see also the sketch below). Would this be
better?

  There are however some limitations on the loop:
  - if there are too many items queued on the reds channel it stops.
    The maximum number of items is MAX_PIPE_SIZE;
  - if the loop processing is taking too much time (currently defined
    as 10ms) it stops;

> > - if all clients are blocked it stops.
> >
> > From this explanation, "channel pipes are full" means that more than
> > MAX_PIPE_SIZE items are queued in the channel.
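To make sure we are talking about the same thing, here is the loop as I
read it, as a sketch only (the helper names and the MAX_PIPE_SIZE value
are invented for illustration; this is not the actual red-worker code):

  #include <glib.h>

  #define MAX_PIPE_SIZE 50          /* assumed value, for illustration */
  #define COMMAND_BUDGET_US (10 * G_TIME_SPAN_MILLISECOND)

  typedef struct RedWorker RedWorker;   /* opaque here */

  /* Assumed helpers standing in for the real red-worker internals. */
  static guint pipe_item_count(RedWorker *worker);
  static gboolean all_clients_blocked(RedWorker *worker);
  static gboolean fetch_and_handle_one_command(RedWorker *worker);

  static void process_commands(RedWorker *worker)
  {
      gint64 start = g_get_monotonic_time();   /* microseconds */

      for (;;) {
          /* 1. too many pipe items already queued on the channel */
          if (pipe_item_count(worker) > MAX_PIPE_SIZE)
              break;
          /* 2. the whole loop has used up its 10ms budget */
          if (g_get_monotonic_time() - start > COMMAND_BUDGET_US)
              break;
          /* 3. every client is blocked (cannot write to its stream) */
          if (all_clients_blocked(worker))
              break;
          /* 4. the guest's command ring is empty */
          if (!fetch_and_handle_one_command(worker))
              break;
      }
  }

In other words, the 10ms check bounds the whole drain loop, not any
single command.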
> > > Spice-server loop
> > > -----------------
> > > So why does the client sometimes block?
> > > First let's see the events the worker loop is handling:
> > > 1- data from the dispatcher (this includes wakeups from the guest);
> > > 2- data from channels (so clients);
> > > 3- timeouts on streams;
> > > 4- other timeouts (like pings);
> > > 5- polling since the last time the ring was empty (200 times with a
> > >    poll of 10 milliseconds).
> > > About this last point: why is this needed?
> > > Also note that loop iterations are what is counted, so those 200
> > > times can expire quite fast if the guest (for instance) keeps
> > > sending wakeups (see [2])!
> > >
> > > What happens if:
> > > - the client does not read data for a while and the channel pipes
> > >   are full (network buffers full and items maxed out);
> > > - the guest fills the command ring, as command processing is not
> > >   running due to the first condition;
> > > - the guest sends a lot of wakeups (see [2])?
> > > Simple: the loop stops entirely! Let's look at the above conditions:
> > > 1- if nothing special happens (like the user trying to resize the
> > >    client window), no data is sent to the dispatcher, as the guest
> > >    is not appending to the ring, so wakeups are not sent;
> > > 2- channels have no reason to send data;
> > > 3- there are no streams, so no timeouts;
> > > 4- these take quite long to happen; we can't wait minutes;
> > > 5- the counter has run out, so the polling has stopped.
> > > What if, after a while, the client reads data and network buffer
> > > space becomes available again? Nothing! There is no check on this
> > > (1 and 2 only check for received data).
> > >
> > > Luckily, I managed to reproduce this situation with this command:
> > >
> > > xzcat win7-boot-record2.xz | SPICE_DEBUG_LEVEL=3 strace -o trace.txt -f \
> > >   -e poll,close,execve,clone,write,writev,read,recv -s 100 ./libtool \
> > >   --mode=execute ./server/tests/spice-server-replay -p 1235 \
> > >   -c 'remote-viewer spice://localhost:1235' -w - -C 5
> > >
> > > This command may be helpful for somebody who wants to try it; the
> > > server hangs when large portions of the screen are updated (this
> > > was a recording of a machine booting, as the name suggests).
> > >
> > > So why is the polling helpful? Well... it tries to avoid these
> > > conditions! However, as I and others have demonstrated, they can
> > > still happen!
> > >
> > > How to fix this? My current fix sets event_timeout when we exit
> > > from red_process_display, both at the last return inside the loop
> > > (see also [1]) and at the final return.
> > >
> > > What would, in my opinion, be a better solution?
> > > I think that instead of using polling we should change the code to
> > > wait until we can send data to the client (see my "implement no
> > > pushing on RedWorker" patch on the ML for a possible solution, see
> > > [3]). This will solve most of the problem, but still consider this:
> > > - we are in the previous condition (no wakeups, ring full, channel
> > >   pipes full, full sending network buffers);
> > > - the client receives data and buffers become available;
> > > - the poll is woken up (now we detect the above condition);
> > > - we try to check commands, but the channel pipes are still full;
> > > - we send all data to the network, so now we stop polling for
> > >   network buffers;
> > > - the poll is run again.
> > > Let's see the above poll conditions:
> > > 1- the ring is still full, so no wakeups;
> > > 2- no data from channels;
> > > 3- no streams;
> > > 4- far from happening;
> > > 5- the counter has already expired;
> > > 6- (new, writing) no more reason to poll for writing.
> > > We are stuck again. I think this can easily be fixed with:
> > > - during worker_source_prepare (before the poll):
> > >     if (worker->was_stuck && !is_stuck(worker))
> > >         return TRUE;
> > >   so the event is triggered and the poll resumes;
> > > - during worker_source_dispatch (after the poll):
> > >     worker->was_stuck = is_stuck(worker);
> > >     if (!worker->was_stuck)
> > >         red_process_XXX(...);
> > >
> > > I hope I didn't forget anything (it took me quite a while to write
> > > this mail), and sorry for the possible thousand mistakes.
> > >
> > > Oh... what about the iterations before and after all of this?
> > > Before the glib loop: about 500 iterations;
> > > After the glib loop: about 700 iterations;
> > > After all these fixes: about 350 iterations (probably some are due
> > > to slowdown and iteration collapsing caused by the debugging stuff).
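For what it's worth, here is my reading of the was_stuck idea above,
spelled out in GSource terms. WorkerSource, is_stuck() and the worker
fields are my guesses at the surrounding code, not actual spice-server
definitions:

  #include <glib.h>

  typedef struct {
      gboolean was_stuck;   /* did the last dispatch end fully blocked? */
      /* ... rest of the worker state ... */
  } RedWorker;

  typedef struct {
      GSource base;
      RedWorker *worker;
  } WorkerSource;

  /* Assumed helper: TRUE when the command ring is full, the channel
   * pipes are full and the network buffers cannot accept more data. */
  static gboolean is_stuck(RedWorker *worker);
  static void red_process_commands(RedWorker *worker); /* red_process_XXX() */

  static gboolean worker_source_prepare(GSource *source, gint *timeout)
  {
      RedWorker *worker = ((WorkerSource *) source)->worker;
      *timeout = -1;
      /* If we previously stopped because we were stuck, and the client
       * has since drained its buffers, report the source as ready so
       * the dispatch below runs even though no fd or timer fired. */
      return worker->was_stuck && !is_stuck(worker);
  }

  static gboolean worker_source_dispatch(GSource *source, GSourceFunc cb,
                                         gpointer data)
  {
      RedWorker *worker = ((WorkerSource *) source)->worker;
      (void) cb; (void) data;

      worker->was_stuck = is_stuck(worker);
      if (!worker->was_stuck)
          red_process_commands(worker);
      return G_SOURCE_CONTINUE;
  }

Note that when any source's prepare returns TRUE, GLib uses a zero poll
timeout for that iteration, so the unstuck case is picked up immediately
instead of blocking in poll().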
> > >
> > > Frediano
> > >
> > > [1] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=6131c1942a80c4fb909697f5f4c44fe010fc6f92
> > > [2] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=aef334e6e8abc629a4603c890a2982d542b627f4
> > > [3] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=ad51aeec0913e6d5ecf7e403e96612658f3b13f2

_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel