On Wed, 2016-01-27 at 06:49 -0500, Frediano Ziglio wrote:
> > 
> > On Tue, 2016-01-26 at 11:14 -0500, Frediano Ziglio wrote:
> > > > 
> > > > Hi,
> > > > I spent quite some time trying to understand the main loop of
> > > > spice-server.
> > > > (Perhaps I'm tired of rebasing the glib loop patches after 250 patches!)
> > > > 
> > > > There were quite a few open questions on the loop and the patches:
> > > > - why is the new code using glib doing many more iterations?
> > > > - where do the client blocking problems come from? See [1], [2] and [3].
> > > > - why is all that polling code in red-worker needed?
> > > > 
> > > > Glib and iterations
> > > > -------------------
> > > > Using strace (poll and other debugging) I tried to understand where all
> > > > these iterations came from.
> > > > The reason is multi-threading (and some problems in Glib).
> > > > In our implementation multithreading is simple: we only have one thread!
> > > > But Glib supports multithreading! So assume one thread is doing the poll
> > > > and another wants to add a file descriptor or change a timeout: how do
> > > > you stop the first thread and make it consider the modified information?
> > > > The solution (which I have even coded for similar stuff!) is adding a
> > > > file descriptor to wake up the other thread. You can see quite clearly
> > > > from strace that spice-server-replay was woken up by a file descriptor
> > > > 10, which turned out to be an eventfd.
> > > > These are the events which should make the poll stop:
> > > > - timeout changes;
> > > > - file descriptor added or removed;
> > > > - file descriptor event types (like read and/or write) changed;
> > > > - other special events (loop exit, explicit wakeups, signals).
> > > > But which of these kinds of events actually happen in spice-server?
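[Editor's note: the cross-thread wakeup described above can be sketched with plain eventfd/poll calls. This is a minimal Linux-only illustration of mine, not spice-server or GLib code; all names are invented.]

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Second thread: after a short delay, signal the eventfd the same way
 * g_wakeup_signal() does, by writing to it. */
static void *waker_thread(void *arg)
{
    int fd = *(int *)arg;
    uint64_t one = 1;
    usleep(50 * 1000);            /* let the other thread enter poll() */
    write(fd, &one, sizeof one);  /* wakes the sleeping poll() */
    return NULL;
}

/* Sleep in poll() on an eventfd until another thread signals it.
 * Returns 1 if the poll was woken by the eventfd, 0 on timeout. */
static int wake_via_eventfd(void)
{
    int fd = eventfd(0, 0);
    pthread_t t;
    pthread_create(&t, NULL, waker_thread, &fd);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int woken = poll(&pfd, 1, 5000) == 1 && (pfd.revents & POLLIN);

    uint64_t val;
    read(fd, &val, sizeof val);   /* acknowledge: clear the counter */
    pthread_join(t, NULL);
    close(fd);
    return woken;
}
```

Without the eventfd the poll would sleep for its full timeout; the write from the other thread is the only thing that interrupts it, which is exactly why GLib must keep such a descriptor in every poll set.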
> > > > Our current implementation of timers/watches adds and removes source
> > > > events for many changes (a timer restarted, an event mask changed, or
> > > > even set with the same value). It turns out, however, that these events
> > > > are not happening that frequently.
> > > > It turns out that to handle loop recursion Glib does a lot of file
> > > > descriptor addition/removal during the loop. This happens every time,
> > > > so every poll call is done twice.
> > > > I then added some debugging on when the loop happened
> > > > (worker_source_prepare is called), when the poll happened and when the
> > > > eventfd is cleared.
> > > > What usually happened was:
> > > > - poll, one of our events was triggered
> > > > - prepare called
> > > > - poll, eventfd triggered
> > > > - prepare called
> > > > - poll, eventfd triggered
> > > > - eventfd cleared
> > > > So each event needed two prepare calls and 3 polls (so 3 Glib
> > > > iterations).
> > > > One thing to consider: we run everything in one thread, so why do
> > > > changes to file descriptors behave like they are waking the poll even
> > > > if the poll was surely not running? The reason is that despite this
> > > > comment in Glib:
> > > > 
> > > > /* Now wake up the main loop if it is waiting in the poll() */
> > > > 
> > > > the wakeup line:
> > > > 
> > > > g_wakeup_signal (context->wakeup);
> > > > 
> > > > is executed most of the time!
> > > > Second consideration: why is the eventfd cleared the second time the
> > > > event is triggered and not the first time? The reason is that in these
> > > > lines
> > > > 
> > > > if (context->wake_up_rec.revents)
> > > >   g_wakeup_acknowledge (context->wakeup);
> > > > 
> > > > revents is not set and you have to wait another poll round!
> > > > 
> > > > So how to fix these Glib issues? I wrote this patch (still to send):
> > > > http://pastebin.com/XT0W7xnQ.
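[Editor's note: the cost of the late acknowledge described above is easy to demonstrate: an eventfd is level-triggered, so until its counter is read back every poll() on it returns immediately. A standalone sketch of mine (not GLib code), Linux-only:]

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Count how many of `rounds` zero-timeout polls report the fd readable. */
static int ready_polls(int fd, int rounds)
{
    int ready = 0;
    for (int i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            ready++;
    }
    return ready;
}

/* Signal an eventfd, then count the polls it keeps interrupting before
 * the acknowledge, minus the polls it interrupts after.  With a late
 * acknowledge every intermediate poll is "wasted". */
static int wasted_polls_before_ack(void)
{
    int fd = eventfd(0, 0);
    uint64_t one = 1, val;

    write(fd, &one, sizeof one);      /* wakeup signalled */
    int before = ready_polls(fd, 3);  /* stays readable on every poll */

    read(fd, &val, sizeof val);       /* acknowledge (clear the counter) */
    int after = ready_polls(fd, 3);   /* quiet again */

    close(fd);
    return before - after;            /* 3 - 0 with the late ack */
}
```

This mirrors the observed trace: once signalled, the eventfd keeps the loop spinning until the acknowledge finally clears it.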
> > > > It works quite well on my Fedora system (I replaced the system
> > > > library!).
> > > > In my tests (using spice-server with the replay utility) the
> > > > iterations went from 700 to 350!
> > > 
> > > Some additional information.
> > > 
> > > reds streams and reds channels.
> > > 
> > > reds streams are the base. They mostly resemble sockets. You can write
> > > bytes, read bytes, check if you can write or if they have data to read,
> > > and they have a file descriptor to use for select/poll.
> > > A reds stream sends data at the lower level through a file descriptor
> > > (either an internet socket or a unix one), so it uses the kernel's
> > > network buffers.
> > > 
> > > reds channels use reds streams, but they send and receive messages.
> > > Messages are defined by the protocol. When you want to send a message
> > > you append a pipe item to the channel. When the reds channel can send
> > > data it takes one pipe item from the queue; the item marshals one or
> > > more messages into a buffer and the reds channel then starts sending
> > > the buffer.
> > > Once the buffer is entirely sent it can dequeue another pipe item, use
> > > it to marshal more messages into another buffer, and so on.
> > > All writes to the reds stream are non-blocking. If the reds channel
> > > detects that it cannot write because there is no space, it stops
> > > sending data and becomes blocked.
> > > 
> > > command processing (in red worker, implemented in red-worker.c)
> > > 
> > > On every red worker loop iteration the worker tries to get and handle
> > > as many commands as possible. There are however some limitations:
> > > - if there are too items queued on the reds channel it stops. The number
> > >   of maximum items is MAX_PIPE_SIZE;
> > > - if some processing is taking too much (currently defined as 10ms)
> > >   it stops;
> > 
> > From reading the code, it looks like it just processes as many commands
> > as it can within 10ms (assuming the other conditions are met).
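[Editor's note: the pipe-item send path described above, with its non-blocking writes and "channel becomes blocked" state, can be sketched roughly as follows. The types and names are mine, not spice-server's; real code would also track partial writes.]

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct channel {
    int fd;        /* the stream's file descriptor */
    int n_items;   /* pipe items still queued */
    int blocked;   /* set when the kernel's socket buffer is full */
};

/* Flush queued pipe items; each one "marshals" a 4 KiB message here.
 * Returns the number of items actually sent. */
static int channel_push(struct channel *c)
{
    char buf[4096];
    memset(buf, 0xab, sizeof buf);   /* stand-in for a marshalled message */
    int sent = 0;
    while (c->n_items > 0) {
        if (write(c->fd, buf, sizeof buf) < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                c->blocked = 1;      /* no space: stop, wait for POLLOUT */
            break;
        }
        c->n_items--;
        sent++;
    }
    return sent;
}

/* Demo: push into a non-blocking unix socket that nobody reads; the
 * kernel buffer fills up and the channel ends up blocked.
 * Returns the final `blocked` flag (or -1 on setup failure). */
static int demo_channel_blocks(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    fcntl(sv[0], F_SETFL, O_NONBLOCK);

    struct channel c = { .fd = sv[0], .n_items = 1000, .blocked = 0 };
    channel_push(&c);

    close(sv[0]);
    close(sv[1]);
    return c.blocked;
}
```

The key point matches the description: the writer never sleeps in write(); it parks the remaining items and relies on a later writability event to resume.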
> > Maybe this is what you meant above, but your statement above made it
> > sound like it was checking if one particular operation is taking too
> > much time, which is not really the case.
> > 
> 
> Would this be better?
> There are however some limitations on the loop:
> - if there are too items queued on the reds channel it stops. The number
>   of maximum items is MAX_PIPE_SIZE;
> - if loop processing is taking too much (currently defined as 10ms)
>   it stops;

Yeah, that's fine (although there's a little typo: "too items" should be
"too many items")

> 
> > > - if all clients are blocked it stops.
> > > From this explanation, "channel pipes are full" means that more than
> > > MAX_PIPE_SIZE items are queued in the channel.
> > > 
> > > > 
> > > > Spice-server loop
> > > > -----------------
> > > > So why is the client sometimes blocking?
> > > > First let's see the events the worker loop is handling:
> > > > 1- data from the dispatcher (this includes wakeups from the guest);
> > > > 2- data from channels (so clients);
> > > > 3- timeouts on streams;
> > > > 4- other timeouts (like pings);
> > > > 5- polling after the ring was last found empty (200 times with a
> > > >    10 millisecond poll).
> > > > About this last point: why is it needed?
> > > > Also note that loops are counted, so those 200 times can expire quite
> > > > fast if the guest (for instance) keeps sending wakeups (see [2])!
> > > > 
> > > > What happens if:
> > > > - the client does not read data for a while and channel pipes are full
> > > >   (network buffer full and items maxed out)
> > > > - the guest fills the command ring, as command processing is not
> > > >   running due to the first condition
> > > > - the guest sends a lot of wakeups (see [2])
> > > > ?
> > > > Simple: the loop stops entirely!
> > > > Let's look at the above conditions:
> > > > 1- if nothing special (like the user trying to resize client windows)
> > > >    happens, no data is sent to the dispatcher, as the guest is not
> > > >    appending to the ring and so wakeups are not sent;
> > > > 2- channels have no reason to send data;
> > > > 3- there are no streams so no timeouts;
> > > > 4- these take quite long to happen, we can't wait minutes;
> > > > 5- the counter ran out so the polling stopped.
> > > > What if after a while the client reads data and the network buffer
> > > > becomes available again? Nothing! There is no check on this (1 and 2
> > > > are just checking for data received).
> > > > 
> > > > Happily I managed to reproduce this event with this command
> > > > 
> > > > xzcat win7-boot-record2.xz | SPICE_DEBUG_LEVEL=3 strace -o trace.txt
> > > > -f -e poll,close,execve,clone,write,writev,read,recv -s 100 ./libtool
> > > > --mode=execute ./server/tests/spice-server-replay -p 1235 -c
> > > > 'remote-viewer spice://localhost:1235' -w - -C 5
> > > > 
> > > > This command could be helpful for somebody who wants to try: the
> > > > server hangs when large portions of the screen are updated (this was
> > > > a recording of a machine booting, as the name suggests).
> > > > 
> > > > So why is the polling helpful? Well... to try to keep these
> > > > conditions from happening! However, as I and others demonstrated,
> > > > they are still possible!
> > > > 
> > > > How to fix this? My current fix sets event_timeout when we exit from
> > > > red_process_display, both from the last return inside the loop (see
> > > > also [1]) and from the final return.
> > > > 
> > > > What would in my opinion be a better solution?
> > > > I think that instead of using polling we should change the code in
> > > > order to wait until we can send data to the client (see my "implement
> > > > no pushing on RedWorker" patch on the ML for a possible solution, see
> > > > [3]).
> > > > This will solve most of the problem, but still consider this:
> > > > - we are in the previous condition (no wakeups, ring full, channel
> > > >   pipes full, full sending network buffers);
> > > > - the client receives data and buffers become available;
> > > > - poll is woken up (now we detect the above condition):
> > > > - we try to check commands but channel pipes are still full;
> > > > - we send all data to the network so now we stop polling for network
> > > >   buffers;
> > > > - poll is run again.
> > > > Let's see the above poll conditions:
> > > > 1- ring is still full so no wakeups;
> > > > 2- no data from channels;
> > > > 3- no streams;
> > > > 4- far from happening;
> > > > 5- counters have already expired;
> > > > 6- (new, writing) no more reason to poll for writing.
> > > > We are stuck again. I think this can easily be fixed with:
> > > > - during worker_source_prepare (before poll):
> > > >     if (worker->was_stuck && !is_stuck(worker))
> > > >         return TRUE;
> > > >   so the event is triggered and poll resumes
> > > > - during worker_source_dispatch (after poll):
> > > >     worker->was_stuck = is_stuck(worker);
> > > >     if (!worker->was_stuck)
> > > >         red_process_XXX(...);
> > > > 
> > > > Hope I didn't forget anything (it took me quite a while to write this
> > > > mail) and sorry for the possible thousand mistakes.
> > > > 
> > > > Oh... what about iterations before and after all of this?
> > > > Before the glib loop: about 500 iterations;
> > > > After the glib loop: about 700 iterations;
> > > > After all these fixes: about 350 iterations (probably some are due to
> > > > slowdown and iteration collapsing caused by the debugging stuff).
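[Editor's note: the was_stuck fix sketched above can be fleshed out as a toy model. This is not GLib or spice-server code: the struct and the stuck conditions are simplified stand-ins, and `processed` just counts in place of red_process_XXX(...).]

```c
#include <stdbool.h>

struct worker {
    bool pipes_full;     /* channel pipes at MAX_PIPE_SIZE */
    bool client_blocked; /* client's network buffers full */
    bool was_stuck;      /* state remembered from the last dispatch */
    int processed;       /* commands handled (for the demo) */
};

static bool is_stuck(const struct worker *w)
{
    return w->pipes_full || w->client_blocked;
}

/* Before poll: report true to force an iteration without waiting when
 * the worker has just become unstuck. */
static bool worker_source_prepare(struct worker *w)
{
    return w->was_stuck && !is_stuck(w);
}

/* After poll: record the new stuck state, then process commands only
 * when not stuck. */
static void worker_source_dispatch(struct worker *w)
{
    w->was_stuck = is_stuck(w);
    if (!w->was_stuck)
        w->processed++;   /* stands in for red_process_XXX(...) */
}
```

The point of the prepare() check is that no file descriptor event exists for "the client drained its buffers and the pipes emptied"; comparing the remembered state against the current one manufactures exactly that missing wakeup.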
> > > > 
> > > > Frediano
> > > > 
> > > > [1] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=6131c1942a80c4fb909697f5f4c44fe010fc6f92
> > > > [2] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=aef334e6e8abc629a4603c890a2982d542b627f4
> > > > [3] http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=ad51aeec0913e6d5ecf7e403e96612658f3b13f2
> > > 
> > > _______________________________________________
> > > Spice-devel mailing list
> > > Spice-devel@xxxxxxxxxxxxxxxxxxxxx
> > > http://lists.freedesktop.org/mailman/listinfo/spice-devel