Intermittent loss of connectivity with KVM-Ceph-Network (solved)

Hello,

I just want to share my recent experience with KVM backed by RBD. Ceph does not appear to be at fault, but I am posting it here for others to read, since my configuration is one that other Ceph users are likely to run as well.

Over the last three weeks I was battling an elusive issue: a KVM guest backed by RBD intermittently lost network connectivity under a consistent (but relatively low) load. It was driving me nuts, as nothing appeared in the logs and everything seemed fine, with one exception: pings to a nearby host sometimes came back with a "No buffer space available" error, and a number of pings could be delayed by 20-30 seconds (such a delay obviously causes a lot of timeouts). The VM has two virtio network interfaces, with the vhost_net module loaded on the host; one interface is publicly reachable, the second is connected to a private network. I tried switching virtio to e1000 and increasing the network buffers, all in vain.

I also noticed that when the interface stalled, the ping replies came back with times differing by whole seconds: the packets piled up on the interface, were then suddenly flushed through all at once, and the remote host returned all of them pretty much simultaneously. Another observation: this behaviour was clearly more evident when network/disk activity was reasonable, i.e. during backups. I stopped the backup for one day, but that did not help (although the loss of connectivity happened less often).
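
(For anyone who wants to watch for the same symptom, below is a rough Python sketch of the kind of check I describe above; the target address and the stall threshold are placeholders, not values from my setup, and this is only an illustration rather than something I actually ran.)

    #!/usr/bin/env python3
    # Rough sketch: follow ping output and flag ENOBUFS errors or long
    # gaps between replies (the "pile up, then flush" pattern described above).
    # TARGET and GAP_THRESHOLD are placeholder values.
    import subprocess
    import time

    TARGET = "192.168.1.1"   # hypothetical nearby host
    GAP_THRESHOLD = 5.0      # seconds without a reply before we call it a stall

    proc = subprocess.Popen(["ping", TARGET], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_reply = time.monotonic()
    for line in proc.stdout:
        now = time.monotonic()
        if "No buffer space available" in line:
            print("ENOBUFS: %s" % line.strip())
        elif "bytes from" in line:
            gap = now - last_reply
            if gap > GAP_THRESHOLD:
                print("replies stalled for %.1fs, then resumed" % gap)
            last_reply = now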

Being unable to identify the cause, I started pulling things apart: I moved the image from RBD to qcow, and magically everything became normal. Moving it back to RBD, the issue manifested itself again. On the other hand, I had a number of freshly installed VMs that are also backed by RBD and do not have this issue. The VMs with the fault were different: they had been migrated from hardware hosts into the VM environment. Both the fresh and the migrated VMs are distro-synced FC17, so I did not expect any difference. The only difference left was that the migrated VMs were 32-bit and the freshly installed ones were 64-bit. So in the end I upgraded the kernel in one faulty VM to 64-bit (while leaving the rest of the system 32-bit) and the problem disappeared! The next day I upgraded another VM the same way, and it also became problem free. So I am now sure that the problem lies in a 32-bit kernel running on a 64-bit host.
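
(If you want a quick way to check whether a guest is in the same situation, a 32-bit kernel with a possibly different userspace, here is a small Python sketch; again it is only an illustration, not something I used during the actual debugging.)

    #!/usr/bin/env python3
    # Sketch: report the guest kernel architecture and the bitness of the
    # running userspace (using this Python interpreter as a stand-in).
    # A 32-bit kernel (i386/i586/i686) on a 64-bit host is the combination
    # that triggered the issue for me.
    import platform

    kernel_arch = platform.machine()             # e.g. 'i686' or 'x86_64'
    userspace_bits, _ = platform.architecture()  # bitness of this interpreter

    print("kernel: %s, userspace: %s" % (kernel_arch, userspace_bits))
    if kernel_arch in ("i386", "i586", "i686"):
        print("32-bit kernel detected")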

So I guess there is some race condition, likely in the virtio_net driver or in the TCP stack, which is apparently triggered by the switching between 32-bit and 64-bit contexts combined with the I/O delays introduced by the QEMU-RBD driver. The issue appears only when a 32-bit VM runs on a 64-bit host and is backed by an RBD image. Being unable to pinpoint the exact spot in the kernel where the problem lies, I am not even sure where I should report it, so I decided to post it here, as this is the place where people running VMs backed by RBD will most likely look for a solution.

Regards,
Vladimir