Re: Migration with "--p2p --tunnelled" hanging in v6.9.0

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 25, 2020 at 2:47 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>
> On Wed, Nov 25, 2020 at 02:33:44PM +0100, Christian Ehrhardt wrote:
> > On Wed, Nov 25, 2020 at 1:38 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Nov 25, 2020 at 01:28:09PM +0100, Christian Ehrhardt wrote:
> > > > On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt
> > > > <christian.ehrhardt@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
> > > > > > > Hi,
> > > > > >
> > > > > > [...]
> > > > >
> > > > > BTW to reduce the scope what to think about - I have rebuilt 6.8 as
> > > > > well it works.
> > > > > Thereby I can confirm that the offending change should be in between
> > > > > 6.8.0 -> 6.9.0.
> > > >
> > > > I was able to get this working in git bisect builds from git between
> > > > v6.8 / v6.9.
> > > > I identified the following offending commit:
> > > >   7d959c30  rpc: Fix virt-ssh-helper detection
> > > >
> > > > Ok that makes a bit of sense, first we had in 6.8
> > > >   f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
> > > > That makes it related to tunneling which matches our broken use-case.
> > > >
> > > > The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might
> > > > finally really enable the new helper and that is then broken?
> > > >
> > > > With that knowledge I was able to confirm that it really is the native mode
> > > >
> > > > $ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test
> > > > qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
> > > > <works>
> > > > $ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test
> > > > qemu+ssh://testkvm-hirsute-to/system?proxy=native
> > > > <hangs>
> > > >
> > > > I recently discussed with Andrea if we'd need apparmor rules for
> > > > virt-ssh-helper,
> > > > but there are no denials nor libvirt log entries related to virt-ssh-helper.
> > > > But we don't need such rules since it is spawned on the ssh login and
> > > > not under libvirtd itself.
> > > >
> > > > PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):
> > > > Source:
> > > > 4     0   41305       1  20   0 1627796 23360 poll_s Ssl ?
> > > > 0:05 /usr/sbin/libvirtd
> > > > 0     0   41523   41305  20   0   9272  4984 poll_s S    ?
> > > > 0:02  \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper
> > > > 'qemu:///system''
> > > > Target
> > > > 4     0     213       1  20   0  13276  4132 poll_s Ss   ?
> > > > 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
> > > > 4     0   35148     213  20   0  19048 11320 poll_s Ss   ?
> > > > 0:02  \_ sshd: root@notty
> > > > 4     0   35206   35148  20   0   2584   544 do_wai Ss   ?
> > > > 0:00      \_ sh -c virt-ssh-helper qemu:///system
> > > > 0     0   35207   35206  20   0  81348 26684 -      R    ?
> > > > 0:34          \_ virt-ssh-helper qemu:///system
> > > >
> > > > I've looked at it with strace [1] and gdb for backtraces [2] - it is
> > > > not dead or stuck and keeps working.
> > > > Could it be just so slow that it appears to hang until it times out?
> > > > Or is the event mechanism having issues and it wakes up too rarely?
> > >
> > > Lets take migration out of the picture. What if you simply do
> > >
> > >     virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native  list
> > >
> > > does that work ?
> >
> > Yes it does, no hang and proper results
>
> Ok, so that shows  virt-ssh-helper is not completely broken at least.
>
> Makes me think there is possibly something related to streams code
> that causes the issue.
>
> You might try the virsh "console" or "vol-upload" commands to test
> the streams stuff in isolation. If that also works, then the problem
> is specific to migration,

Thanks for the hint Daniel, it is indeed not migration specific - it
seems that virs-ssh-helper is just very slow.

rm testfile; virsh -c
qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool
uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s;
ll -laFh testfile; done
[1] 42285
-rw-r--r-- 1 root root 24M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 220M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 396M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 558M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 756M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 868M Nov 25 15:20 testfile
[1]+  Done                    virsh -c
qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool
uvtool h-migr-test.qcow testfile

rm testfile; virsh -c
qemu+ssh://testkvm-hirsute-to/system?proxy=native vol-download --pool
uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s;
ll -laFh testfile; done
[1] 42307
-rw-r--r-- 1 root root 1.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 6.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 9.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 13M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 15M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 16M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 18M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 19M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 21M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 22M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 23M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 25M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 26M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 27M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 28M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 29M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 30M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 31M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile

That is ~150-200 MB/s vs 1-5 MB/s and as seen it seems to start slow
AND degrades further.
I'm not at 90MB overall and down to ~150 KB/s

> and we'll probably wnt to colllect debug
> level logs from src+dst hosts.

I already had debug level logs of the migration [1] attached to the
launchpad bug I use to take my notes on this.
Taken with these configs:
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirtd.log"

You can fetch the logs (of a migration), but I'm happy to generate you
logs of any other command (or with other log settings) as you'd prefer
them.

The network used in this case is a bridge between two containers, but
we can cut out more components.
I found that the same vol-download vs 127.0.0.1 gives the same results.
That in turn makes it easier to gather results as we only need one system.

Here are logs of:
virsh -c qemu+ssh://127.0.0.1/system?proxy=netcat vol-download --pool
uvtool h-migr-test.qcow testfile
=> ~150-220MB/s [2]
vs
virsh -c qemu+ssh://127.0.0.1/system?proxy=native vol-download --pool
uvtool h-migr-test.qcow testfile
=> 5 MB/s degrading to ~200 KB/s and less [3]

[1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437541/+files/full-log.tgz
[2]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437959/+files/vol-download.netcat.log
[3]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437960/+files/vol-download.native.log

> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>


-- 
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd





[Index of Archives]     [Virt Tools]     [Libvirt Users]     [Lib OS Info]     [Fedora Users]     [Fedora Desktop]     [Fedora SELinux]     [Big List of Linux Books]     [Yosemite News]     [KDE Users]     [Fedora Tools]

  Powered by Linux