Re: Remote 3d support - streaming and lag

Francois did a hard job finding good parameters.
For instance, for H264 he disabled B frames, to avoid waiting for a future
frame, and added slices to make bandwidth and CPU usage more regular.
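
For reference, with GStreamer's x264enc that kind of configuration looks
roughly like the following (an illustrative sketch, not the exact parameters
Francois used; the values are just an example):

  /* Low-latency oriented x264enc setup (illustrative values only).
   * gst_init() must have been called before this. */
  #include <gst/gst.h>

  static GstElement *make_low_latency_x264enc(void)
  {
      GstElement *enc = gst_element_factory_make("x264enc", "encoder");

      g_object_set(enc,
                   "bframes", 0,            /* no B frames: never wait for a future frame */
                   "key-int-max", 30,       /* reasonably frequent I frames */
                   "sliced-threads", TRUE,  /* slices spread CPU and bandwidth usage */
                   "tune", 4,               /* zerolatency */
                   "speed-preset", 1,       /* ultrafast: trade quality for CPU */
                   "bitrate", 4000,         /* kbit/s, e.g. a 4 Mbps budget */
                   NULL);
      return enc;
  }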

Finding the right combination of flags to get good quality without
spending too much CPU or adding a lot of latency is indeed tricky.

I did some experiments in the past: with a bit of h/w support and
a limit of 4Mbps I got a latency of about 10ms using H264, and 25ms with
software encoding. You can clearly see some artifacts due to the
compression.

Frediano

> From: "Christophe de Dinechin" <cdupontd@xxxxxxxxxx>
> Sent: Tuesday, February 14, 2017 10:25:07 AM
> 
> One thing I started to discuss last Wed is how video encoding may also
> require / introduce a lag on its own. The reason is that some video
> encodings use predictive (P) or bidirectional (B) frames that are relative
> to surrounding frames. Only a small fraction of the frames, the intracoded
> frames (or I-frames, sometimes also called key frames), contain data for the
> whole picture. The P and B frames need neighboring frames to be correctly
> rendered. You can encode practically in any way you want, e.g. IIII would
> encode every full frame, but IP…PI with 29 P frames would emit only one I
> frame per second at 30FPS. For an overview of predictive encoding, see
> https://en.wikipedia.org/wiki/Video_compression_picture_types. For details
> about predictive encoding in H264, see section 6.4 of
> http://lib.mdp.ac.id/ebook/Karya%20Umum/Video-Compression-Video-Coding-for-Next-generation-Multimedia.pdf
> (but probably way too many details).
> 
> Why does it matter in terms of latency? Because while I-frames can be encoded
> on the spot, P frames can only be encoded based on past frames, so you need
> history. B frames are worse, because you need “the future”, so you need to
> wait for future frames to have been emitted before being able to generate a
> B frame. In short, to get a better encoding that uses lower bandwidth, the
> encoder needs more frames, so it ends up introducing latency itself.
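> 
> As a back-of-the-envelope illustration (my own numbers, not tied to any
> particular encoder): each consecutive B frame forces the encoder to hold a
> picture back for at least one extra frame interval, e.g.
> 
>   /* Minimum encoder-side delay added by B frames (rough estimate only):
>    * a B frame cannot be emitted before the following P or I frame exists. */
>   static double min_bframe_delay_ms(int consecutive_bframes, double fps)
>   {
>       return consecutive_bframes * (1000.0 / fps);
>   }
>   /* 2 consecutive B frames at 30 fps => at least ~66 ms of added latency */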
> 
> In addition, as I understand it, H264 is very flexible with respect to how
> you actually encode the frames. You don’t really encode pictures, you
> practically encode a program that reconstructs the pictures. So there’s a
> lot of on-going effort with respect to how to optimize encoding for this or
> that scenario. As seen from the point of view of a photographer, images that
> are generated from a video game, for example, are “low texture” (i.e. there
> isn’t much detail over large areas) with sharp edges (i.e. you expect sharp
> lines separating areas, and it’s important to preserve the shape of these
> boundaries). It was hard to judge over Blue Jeans, but the perceived quality
> issues on my side in the demo we had last Wed were most likely related to
> bad preservation of these two properties (i.e. we introduced color /
> luminosity noise in flat areas, whereas the shape of edges was visibly
> affected by compression). These are things that would not be half as
> annoying if you were compressing, say, a picture of waves crashing on a
> beach, or of a landscape with grass and foliage, because we don’t have the
> same expectations with respect to the precise shape of foliage that we have
> for the shape of buildings, streets or cars. But this means we may also need
> heuristics to dynamically adjust encoding based on the kind of input we have.
> Streaming a movie and streaming a 3D game may require different parameters
> if we want the best perceived quality.
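> 
> (For what it's worth, x264 already exposes content-oriented tunings; in
> GStreamer's x264enc that is the "psy-tune" property. A rough sketch of such a
> heuristic, with made-up content classification, could be:
> 
>   #include <gst/gst.h>
> 
>   /* Pick an x264 psychovisual tuning based on the kind of content.
>    * Enum values follow x264enc's "psy-tune" property; the content
>    * classification itself is hypothetical. */
>   typedef enum { CONTENT_DESKTOP_OR_GAME, CONTENT_NATURAL_VIDEO } ContentKind;
> 
>   static void tune_for_content(GstElement *x264enc, ContentKind kind)
>   {
>       /* 2 = animation (flat areas, sharp edges), 1 = film (natural detail) */
>       g_object_set(x264enc, "psy-tune",
>                    kind == CONTENT_DESKTOP_OR_GAME ? 2 : 1, NULL);
>   }
> 
> but detecting the content kind reliably is of course the hard part.)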
> 
> Back to latency. I have not re-done the experiment this week, but from
> memory, last time I tried (that was about two years ago), encoding a full-HD
> screen containing “artificial” moving pictures (e.g. a typical Gnome
> desktop) with h264 with an 8-CPU i7 with an acceptable level of quality and
> an output bandwidth compatible with WAN streaming introduced, by itself, at
> least 100ms of latency with relatively low quality, and 500ms if you really
> wanted an output that had no visible artefacts. I may not have perfectly
> optimized things at the time, since I’m not a guru in ffmpeg encoder
> settings, but that’s how I remember it. If you stream to a local desktop,
> then you can use more bandwidth, and getting sub-100ms latency is relatively
> easy. If you want to try this by yourself, instructions are here:
> http://fomori.org/blog/?p=1213. If you run these experiments, please share
> your findings with various kinds of input (including, obviously, latency and
> bandwidth).
> 
> In short, I don’t doubt that the latency we have been observing is largely a
> result of the simple buffering Frediano mentioned. But I suspect that even
> if we reduced this buffering to practically zero, we would still have at
> least about 100ms of latency just to get good-enough encoding. That’s just a
> suspicion at this point, and further testing and encoder-tuning is required
> to hopefully prove me wrong.
> 
> 
> Christophe
> 
> 
> > On 9 Feb 2017, at 15:30, Frediano Ziglio <fziglio@xxxxxxxxxx> wrote:
> > 
> > It seems weird to reply to my mail after more than 6 months, but apparently
> > the content is still relevant.
> > 
> > There is a concept about streaming that people seem to ignore, forget
> > or not understand: why the streaming was added and why it lags!
> > 
> > If you are at home watching your Netflix/Amazon Prime or whatever
> > service you have, you want good quality and no network issues like the
> > movie lagging or showing buffering problems.
> > The solution is simple: you increase and monitor the buffer. Basically
> > you try to keep, say, 10 seconds in the buffer, so even if the network
> > completely stops working for 8/9 seconds it is not an issue, and family and
> > children are happy!
> > Good and simple! However... they know the future! Yes, basically they
> > are sending you 10 seconds of movie before knowing whether you are going
> > to watch those 10 seconds. Easy to predict: usually you keep
> > watching the movie!
> > 
> > We introduced the streaming code, besides compression, for streaming
> > purposes like the above. But... how to send the future? Simple! We can't!
> > So how do we "create" this future, this buffering? We lie! We just tell
> > the client to wait a bit before displaying, creating a sort of buffer but
> > basically showing the recent past!
> > 
> > Back to code. How is it implemented in spice-server/client?
> > Basically spice-server sends a multimedia time which is a bit less than
> > the frames' one. Say we start with mm_time (multimedia time) == 0: we
> > send the client a time of -100 and mark the current frame as having time 0.
> > The client will think that the frame is in the future (as -100 < 0; by
> > the way, the multimedia time is expressed in milliseconds), wait for the
> > difference (in this case 100 ms) and then display it.
> > The 100 is, in the code, mm_time_latency, and its minimal value
> > is MM_TIME_DELTA, currently defined as 400. In practice every stream
> > will have a minimum of 400ms delay. Compared to the 10 seconds of buffering
> > in the case above this is really small, but if I'm drawing something
> > with the mouse and streaming is detected, the lag will make my drawing
> > attempt really bad (OT: by the way, I'm really bad at drawing, so the result
> > won't be much worse).
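> > 
> > In rough C the client side of this boils down to something like (a
> > simplified sketch with made-up names, not the actual spice-gtk code):
> > 
> >   #include <stdint.h>
> > 
> >   /* hypothetical helpers standing in for the real display queue */
> >   void schedule_display(int32_t delay_ms);
> >   void display_now(void);
> > 
> >   static void queue_frame(uint32_t frame_mm_time, uint32_t client_mm_time)
> >   {
> >       /* the server kept the clock mm_time_latency ms behind the frame
> >        * timestamps, so the frame looks like it belongs to the future */
> >       int32_t wait_ms = (int32_t)(frame_mm_time - client_mm_time);
> > 
> >       if (wait_ms > 0)
> >           schedule_display(wait_ms);  /* this wait is the "buffer" */
> >       else
> >           display_now();              /* already late, show it now */
> >   }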
> > 
> > Is it a bad solution? Besides some implementation problems it's just a
> > matter of use cases. Probably we are going to use streaming for
> > really different use cases.
> > 
> > How is this mm_time_latency updated? That is another complicated
> > subject!
> > 
> > Frediano
> > 
> >> 
> >> Some updates.
> >> 
> >> Yesterday I found a big cause of part of the lag: the client and
> >> multimedia synchronization. After some video playing/game running, pressing
> >> Ctrl-Z to suspend Qemu, you can see the client still playing for a while.
> >> I checked my bandwidth-limiting software and it was working correctly,
> >> not sending any more data after the set latency. But the client continued
> >> to play for a couple of seconds! This could be fine if we are just watching
> >> a movie, but as soon as we get more interactive and want some feedback,
> >> 2 seconds make working impossible.
> >> So I changed the code of the client to remove any delay when trying to sync
> >> and I got this: https://www.youtube.com/watch?v=D_DCs2sriu0. Quite good
> >> (unfortunately there is no audio, it was quite out of sync).
> >> It seems that the latency/bandwidth computation is not able to handle the
> >> currently queued data well, causing the detected bandwidth to be reduced a
> >> lot (so video quality decreases a lot) while the computed latency is so high
> >> that the client uses this big delay (in some experiments the lag was much
> >> more than 2 seconds!).
> >> To make the video this good I had to force the bitrate in our gstreamer
> >> code.
> >> Also, the compressed frame sizes of this game are quite low.
> >> 
> >> About VAAPI, gstreamer and our code: it looks like our code is not able to
> >> reduce the bitrate used by the encoder (I'm actually using H264 and the
> >> Intel implementation of vaapi). The result is that in some cases the frame
> >> rate drops to 3/4 fps.
> >> I tried lots of parameters (like cabac and dct8x8) but had no luck.
> >> Sometimes our code seems to deadlock (I had a chat with Francois some days
> >> ago; it could be due to the way buffers are produced by the encoder).
> >> Setting a different rate-control for vaapih264enc seems to cause our code
> >> to fail (other rate control settings should behave much better for limiting
> >> the bit rate).
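> >> 
> >> What I'm trying amounts roughly to this (an untested sketch; property names
> >> are from gstreamer-vaapi's vaapih264enc, and how much the Intel driver
> >> honours them is exactly the open question):
> >> 
> >>   #include <gst/gst.h>
> >> 
> >>   /* Ask vaapih264enc for constant-bitrate mode with an explicit budget. */
> >>   static void limit_vaapi_bitrate(GstElement *vaapih264enc, guint kbps)
> >>   {
> >>       /* "cbr" is the nick of the CBR value of the rate-control enum */
> >>       gst_util_set_object_arg(G_OBJECT(vaapih264enc), "rate-control", "cbr");
> >>       g_object_set(vaapih264enc, "bitrate", kbps, NULL);  /* kbit/s */
> >>   }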
> >> 
> >> Frediano
> >> 
> >>> 
> >>> Hi,
> >>>  some news on the patch and tests.
> >>> 
> >>> The patch is still more or less as I sent it last time
> >>> (https://lists.freedesktop.org/archives/spice-devel/2016-July/030662.html).
> >>> 
> >>> So, a bit of history.
> >>> Some time ago I started a branch with the idea of feeding frames from Virgl
> >>> to the old drawing path, to see what would happen. There were many reasons
> >>> to do this; one is to exercise the streaming path, and also to see whether,
> >>> with the refactoring work, this could be done more easily.
> >>> The intention wasn't a final patch (extracting the texture is
> >>> surely not a good idea if it can be avoided, and it is not clear if doing
> >>> this long trip is the right way or if there are shorter paths, for instance
> >>> injecting directly into the streaming code).
> >>> The branch got stuck for a while (a month or two) as just
> >>> extracting the raw frame was not as easy as expected (and I got lost in
> >>> other stuff). When I got back to it some time later I found a way using DRM
> >>> directly and it was easy to insert the frames. Besides some memory issues
> >>> (fixed) and some frame flipping (worked around) it was working!
> >>> Locally it works very well; surprisingly everything is smooth and fast
> >>> (I run everything on a laptop with an Intel card).
> >>> Obviously, once it is more or less working, you try a harder
> >>> and more real-world setup, so... playing games with some network
> >>> restriction too (after some thinking I believe this is one of the worst
> >>> cases you can imagine, that is, if this works fine you are not far from
> >>> a release!).
> >>> 
> >>> Here of course problems started.
> >>> 
> >>> Simulation
> >>> To simulate a more realistic network I used a program which
> >>> "slows down sockets" while forwarding data (I had tried Linux traffic
> >>> shaping, but that caused some problems). I knew this was not optimal (for
> >>> instance, queues and RTT detection are quite impossible to get right from
> >>> such a program), so I decided to use tun/tap (trying to avoid needing root
> >>> to run such tests), and the final version
> >>> (https://cgit.freedesktop.org/~fziglio/latency)
> >>> is working really well (I did some more tuning on CPU scheduling
> >>> and the program uses just 2-3% of CPU, so it should not affect the tests
> >>> much).
> >>> 
> >>> Latency
> >>> One of the first issues introduced by putting a real network in the path was
> >>> latency. Especially when playing you can feel a very long lag (on the order
> >>> of seconds, even if the stream is quite fast). In the end I'm using xterm
> >>> and wireshark to measure the delay. The reason is that the xterm cursor does
> >>> not blink and does very few screen operations, so in wireshark you
> >>> can see a single DRAW_COPY operation, and as this change is quite small
> >>> you can also feel the delay without using wireshark. This test is quite
> >>> reliable and the simulator behaves very well (so does a real network).
> >>> I usually use h264 for encoding. Using the normal stream configuration
> >>> the lag is much lower (so is the video quality), but even if the video
> >>> is fluid the delay is higher than xterm's. I put some debugging on the
> >>> frames, trying to trace the delays of encoding and extraction, and
> >>> usually a frame is processed in 5 ms (from the Qemu call), so I don't
> >>> understand where the lag comes from. It could be some encoder option,
> >>> the encoding buffer being too large (the network one isn't), or some problem
> >>> with the gstreamer interaction (server/gstreamer-encoder.c file).
> >>> Trying to use vaapi the lag gets much worse, even combined with very
> >>> large bandwidth; however the behaviour of gstreamer vaapi is quite
> >>> different and the options are also much different. Maybe there are options
> >>> to improve compression/delay, maybe some detail in the plugin introduces
> >>> other delays. For sure the vaapi h264 bitrate cannot be changed
> >>> dynamically, so this could be an issue. The result is that quality is
> >>> much better but frame rate and delay are terrible. Also, while using the
> >>> x264 encoder (the software one) the network queue (you can see it using
> >>> netstat) stays quite low (around 20-80kb) with low bandwidth, while with
> >>> vaapi it is always too high (around 1-3mb), which obviously does not help
> >>> with latency.
> >>> 
> >>> Bandwidth
> >>> Obviously a high bandwidth helps. But I can say that the x264 encoder
> >>> does quite a good job when the bandwidth is not enough. On the other hand,
> >>> it takes quite some time (around 10-20 minutes) to realise that the
> >>> bandwidth got better. vaapi was mostly not working.
> >>> Sometimes, using a real wifi connection (with a cheap and old router),
> >>> you can see the bandwidth drop for a while, probably when some packet
> >>> loss and retransmissions kick in.
> >>> 
> >>> CPU usage
> >>> Running everything on a single machine, with no hardware help for encoding
> >>> or decoding, makes this problem quite difficult: you end up using all the
> >>> CPU power and more, bringing the kernel scheduler into the equation.
> >>> Sometimes I use another machine as the client, so I can see more clearly
> >>> where the CPU is used to support a virtual machine.
> >>> 
> >>> Qemu
> >>> There is still a hack to support listening on tcp instead of unix sockets;
> >>> it will be replaced along with the spice-server changes.
> >>> It turns out that for every frame a monitor_config is sent. Due to the
> >>> implementation of spice-server this does not help the latency.
> >>> I merged my cork branch and did some changes in spice-server, and you can
> >>> get some good improvement.
> >>> I got a patch from Marc-Andre to remove a timer which causes a lot
> >>> of cpu usage in RedWorker, still to be tried.
> >>> The VM with Virgl is not powering off; I didn't investigate.
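> >>> 
> >>> (The corking is basically the standard TCP_CORK dance, roughly like the
> >>> generic sketch below, not the actual branch: hold small writes back so a
> >>> frame and its related messages leave as few, full packets as possible.)
> >>> 
> >>>   #include <netinet/in.h>
> >>>   #include <netinet/tcp.h>
> >>>   #include <sys/socket.h>
> >>> 
> >>>   static void send_burst_corked(int fd, void (*send_all_messages)(int))
> >>>   {
> >>>       int on = 1, off = 0;
> >>> 
> >>>       setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
> >>>       send_all_messages(fd);  /* queue everything for this frame */
> >>>       setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* flush */
> >>>   }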
> >>> 
> >>> 
> >>> In the end there are lots of small issues and things to investigate, and I
> >>> don't have a clear idea of how to progress. My latest thought is to avoid
> >>> vaapi for a while and fix some small issues (like monitor_config, and trying
> >>> to understand the additional lag when streaming is used). The vaapi state,
> >>> and getting gstreamer to fully offload the encoding, have too many
> >>> variables (our gstreamer code, options, pipeline to use, code
> >>> stability, card support).
> >>> gstreamer and texture data extraction (a fallback we should have)
> >>> seem to work better with GL stuff, so possibly having Qemu communicate
> >>> some EGL setup will be required (that is an ABI change between Qemu and
> >>> spice-server).
> >>> Maybe EGL extraction and lazy data extraction (to avoid an expensive
> >>> data copy if frames are dropped) could be a step
> >>> stable enough to have some code merged.
> >>> 
> 
> 
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/spice-devel



