Unexpected (to me) loop behaviour

Frediano Ziglio <fziglio@xxxxxxxxxx> · Wed, 7 Dec 2016 07:12:26 -0500 (EST)

Hi,
  yesterday I sent the series about improving RHEL7.
Doing the testing was really hard to have the same performance results using
the modified replay utility.
Initially I though was a delay introduced by the recording or the compression
so to reduce it I moved to record to tmpfs without compression.
But even with this change results were not perfect.
I ended up increasing the "join" timeout introduced by my patches.
But curious about what's was going on I tried to understand why with the
replay utility I was getting the timeout while on real machine I NEVER did.

So I write this "small" command trying to understand where the timeout came

strings spice-record | grep ^event -A32 | perl -pe 'BEGIN { $prev = 0 }; s!^(event \d+ \d+ \d+ )(\d+)!$n = $2-$prev; $prev = $2; $a = ""; $a = " <--- ALARM" if $n > 5000000 && $n < 20000000; sprintf("%s%d%s", $1, $n, $a); !e' | tee record.txt

With my surprise there were some timeout between the joined commands of
about (really close!) 10ms. Taking into account that usually the time
between every joinable commands was kind of 15-20us the 10ms was too weird
to be a coincidence. The reason about this delay is due to these lines

    if (!red_qxl_get_command(worker->qxl, &ext_cmd)) {
        *ring_is_empty = TRUE;
        if (worker->display_poll_tries < CMD_RING_POLL_RETRIES) {
            worker->event_timeout = MIN(worker->event_timeout, CMD_RING_POLL_TIMEOUT);
        } else if (worker->display_poll_tries == CMD_RING_POLL_RETRIES &&
                   !red_qxl_req_cmd_notification(worker->qxl)) {
            continue;
        }
        worker->display_poll_tries++;

Note that CMD_RING_POLL_RETRIES is 1 while CMD_RING_POLL_TIMEOUT is 10 (our 10ms).
Basically if no commands are found as a first step we try again after 10ms, if
no commands are found on next iteration (which can happen before the 10ms) we
call red_qxl_req_cmd_notification (basically we change from polling mode to
"interrupt") than we wait. So assuming an idle situation if guest start queueing
commands it will wake the server. If the server is fast enough to handle all commands
faster than how the guest queue them it reachs red_qxl_get_command == TRUE and trigger the
polling after 10ms, not asking for notification from guest. So we'll process any
next command queued after 10ms even if there are commands in the queue much earlier.
As commands are recorded when read from the queue this cause some joinable commands
to be recorded with 10ms delay so using the replay utility this caused the commands
to be not joined. This don't happen on real VM as after 10ms the loop find the next
command and can join them.

However this beside the delay introduced in command handling (I don't think 10ms
is an issue by the way beside replay utility) can introduce a limit on the time
spent executing commands.
The queue contains up to 32 commands so if the loop hit the polling there is a limit
of 10ms / 32 = 312us. Could this be an issue? Potentially the 10ms polling can
limit the number of loops done by the server but this should be automatic if
the host machine is quite loaded while increase the delay if host is unloaded.

Note that this polling was also (in the past) necessary as the client sockets
where not poll(2)ed for write so blocking condition on the channel client were
handled with this polling code (the CMD_RING_POLL_RETRIES was much higher) and
was basically retained for safety reasons.

Frediano
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/spice-devel