I migrated the virtual machine to my second node, which is running
qemu-kvm version 1:2.1+dfsg-12+deb8u6 (from Debian oldstable), and hit
the same situation - it froze after approx. 30-40 seconds, when
"libceph: osd6 down" appeared in syslog (not before).
My other virtual machine on the first node froze at the same time.
Both virtual machines run Debian stretch, one with the
4.9.0-3-amd64 kernel and the other with the 4.10.17-2-pve kernel.
I cannot test Windows virtual machines right now.
One of my virtual machines is on a pool where I forced the primary OSD to
nodes (OSDs) other than the ones I'm stopping, and the pool has min_size 1,
so I assume that (with the primary OSD still online and available) I
shouldn't have any issues with disk reads or writes. But that virtual
machine is also affected and does not survive the MON+OSD stop.
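For illustration, the pool was prepared with something along these lines
(the pool name is just an example, osd.6 being the one stopped in the test):

ceph osd pool set rbdpool min_size 1
ceph osd primary-affinity osd.6 0    # keep this OSD from acting as primary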
I tried to set
[global]
heartbeat interval = 5
[osd]
osd heartbeat interval = 3
osd heartbeat grace = 10
in my ceph.conf
and after my test I got no "heartbeat_check: no reply from" messages in
syslog, just "libceph: osd6 down", and the virtual machines survived it.
That can be a workaround for me, but it could also be a coincidence that
another part of the mon code marked the OSD down before my problem occurred.
I also assume that everybody else is using the default heartbeat settings.
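If anyone wants to try the same settings without restarting the daemons, I
believe the OSD values can also be injected at runtime instead of editing
ceph.conf, roughly:

ceph tell osd.* injectargs '--osd_heartbeat_interval 3 --osd_heartbeat_grace 10'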
My cluster was installed on luminous (not migrated from previous
versions) and the node OS is stretch (one node is lenny).
With regards
Jan Pekar
Imatic
On 7.11.2017 14:16, Jason Dillaman wrote:
If you are seeing this w/ librbd and krbd, I would suggest trying a
different version of QEMU and/or different host OS since loss of a disk
shouldn't hang it -- only potentially the guest OS.
On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:
I'm calling kill -STOP to simulate the behavior that occurred when one
ceph node ran out of memory. The processes were not killed, but were
somehow suspended/unresponsive (they couldn't create new threads etc.),
and that caused all virtual machines (on other nodes) to hang.
I decided to simulate it with kill -STOP MONPID OSDPID and I succeeded.
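Roughly what I ran on the second node (osd.6 in this case; the exact pgrep
pattern depends on how the daemons were started):

MONPID=$(pidof ceph-mon)
OSDPID=$(pgrep -f 'ceph-osd.*--id 6')
kill -STOP $MONPID $OSDPID
# ... observe the guests ...
kill -CONT $MONPID $OSDPID    # resume the daemons afterwards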
After I stopped the MON together with the OSD, it took a few seconds to get
the OSD-unresponsive messages, and exactly when I got the final
libceph: osd6 down
message, all my virtual machines stopped responding (stopped pinging, unable
to use VNC, etc.).
I tried it with both a librbd disk definition and an rbd map device attached
to the QEMU/KVM virtual machines.
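For completeness, the two attachment variants looked roughly like this
(pool and image names are placeholders):

# krbd: map the image on the host and hand the block device to the guest
rbd map rbdpool/vm-disk-1        # gives /dev/rbd0 or similar

# librbd: let qemu talk to the cluster directly
qemu-system-x86_64 ... -drive format=raw,file=rbd:rbdpool/vm-disk-1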
JP
On 7.11.2017 10:57, Piotr Dałek wrote:
On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:
Hi,
I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
1:2.8+dfsg-6+deb9u3
I'm running 3 nodes with 3 monitors and 8 OSDs,
all on IPv6.
When I tested the cluster, I ran into a strange and severe
problem.
On the first node I'm running qemu guests with librados disk
connections to the cluster, with all 3 monitors listed in the
connection.
On the second node I stopped a mon and an osd with the command
kill -STOP MONPID OSDPID
Within one minute all my qemu guests on the first node froze, to the
point that they don't even respond to ping. [..]
Why would you want to *stop* (as in, freeze) a process instead
of killing it?
Anyway, with the processes still there, it may take a few minutes
before the cluster realizes that the daemons are stopped and kicks
them out of the cluster, restoring normal behavior (assuming
correctly set crush rules).
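(For reference, the timeouts involved are tunable; if I remember the
luminous defaults correctly, they are roughly:

[osd]
osd heartbeat grace = 20          ; peers report an OSD dead after this many seconds
[mon]
mon osd report timeout = 900      ; mon marks a silent OSD down after this many seconds
)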
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--
Jason
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com