Re: Libvirt hosts freeze after ceph osd+mon problem


 



I migrated a virtual machine to my second node, which runs
qemu-kvm version 1:2.1+dfsg-12+deb8u6 (from Debian oldstable).
The result was the same: it froze approximately 30-40 seconds later, when
"libceph: osd6 down" appeared in syslog (not before).
My other virtual machine on the first node froze at the same time.
Both virtual machines run Debian stretch, one with the
4.9.0-3-amd64 kernel and the other with the
4.10.17-2-pve kernel.

I cannot test Windows virtual machines right now.

One of my virtual machines is on a pool where I forced the primary OSD onto nodes (OSDs) other than the one I am stopping, and the pool has min_size 1. So I assumed that, as long as the primary OSD is still online and available, disk writes and reads should not be affected. But that virtual machine is also affected and does not survive stopping the MON and OSD.
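
The placement assumption above can be checked from the command line. This is only a minimal sketch: the function name show_pool_placement and the pool and object names in the usage line are hypothetical placeholders, while `ceph osd pool get` and `ceph osd map` are regular ceph CLI subcommands. `ceph osd map` reports the up and acting OSD sets for the given object, which shows whether the stopped OSD is involved.

```shell
# Hypothetical helper: print min_size for a pool and the OSD mapping of one object.
# $1 = pool name, $2 = object name (both are placeholders, not taken from the thread).
show_pool_placement() {
    # Minimum number of replicas that must be available for I/O to proceed.
    ceph osd pool get "$1" min_size
    # PG and the up/acting OSD sets that the object maps to.
    ceph osd map "$1" "$2"
}

# Usage (on a node with a working ceph client configuration):
#   show_pool_placement mypool myobject
```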

I tried to set

[global]
heartbeat interval = 5
[osd]
osd heartbeat interval = 3
osd heartbeat grace = 10

in my ceph.conf

and after my test I got no "heartbeat_check: no reply from" messages in syslog, just "libceph: osd6 down", and the virtual machines survived. That can be a workaround for me, but it may also be only a coincidence: another part of the mon code may have marked the OSD down before my problem occurred. I also assume that everybody else is using the default heartbeat settings. My cluster was installed on Luminous (not migrated from previous versions) and the node OS is stretch (one node is lenny).
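
To rule out that the running daemons are still using the defaults, the values in effect can be read back over the OSD admin socket. This is a sketch under assumptions: the helper name show_heartbeat_settings is hypothetical, and the command has to run on the node that hosts the given OSD, since `ceph daemon` talks to the local admin socket.

```shell
# Hypothetical helper: show the heartbeat values a running OSD actually uses.
# $1 = numeric OSD id, e.g. 6 for osd.6
show_heartbeat_settings() {
    ceph daemon "osd.$1" config get osd_heartbeat_interval
    ceph daemon "osd.$1" config get osd_heartbeat_grace
}

# Usage (on the node hosting the OSD):
#   show_heartbeat_settings 6
```

If the reported values differ from ceph.conf, the daemon was probably not restarted after the file was changed.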

With regards
Jan Pekar
Imatic


On 7.11.2017 14:16, Jason Dillaman wrote:
If you are seeing this w/ librbd and krbd, I would suggest trying a different version of QEMU and/or different host OS since loss of a disk shouldn't hang it -- only potentially the guest OS.

On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:

    I'm calling kill -STOP to simulate what happened when one Ceph
    node ran out of memory. The processes were not killed, but
    were somehow suspended/unresponsive (they couldn't create new
    threads etc.), and that caused all virtual machines (on other nodes) to hang.
    I decided to simulate it with kill -STOP MONPID OSDPID, and I succeeded.
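
The freeze and resume cycle can be tried safely on a stand-in process first. The sketch below is an illustration only: it uses a background `sleep` instead of the real ceph-mon and ceph-osd PIDs, and it reads the process state from /proc, so it assumes Linux. SIGSTOP leaves the process in the process table in state T (stopped), and SIGCONT resumes it.

```shell
#!/bin/sh
# Stand-in for a daemon; the real test would use the MON and OSD PIDs instead.
sleep 60 &
pid=$!

# Print the one-letter state from /proc (T = stopped, S = sleeping, R = running).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

kill -STOP "$pid"   # freeze: the process still exists but is no longer scheduled
sleep 1             # give the kernel a moment to apply the signal
state_stopped=$(proc_state "$pid")
echo "after STOP: $state_stopped"

kill -CONT "$pid"   # resume the frozen process
sleep 1
state_resumed=$(proc_state "$pid")
echo "after CONT: $state_resumed"

# Clean up the stand-in process.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

For the real daemons, `kill -CONT MONPID OSDPID` undoes the freeze after the test.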

    After I stop the MON and OSD, it takes a few seconds to get the OSD
    unresponsive messages, and exactly when I get the final
    libceph: osd6 down
    all my virtual machines stop responding (they stop answering ping, VNC is unusable, etc.).
    I tried both a librbd disk definition and an rbd map device attached inside
    QEMU/KVM virtual machines.

    JP


    On 7.11.2017 10:57, Piotr Dałek wrote:

        On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:

            Hi,

            I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
            1:2.8+dfsg-6+deb9u3
            I'm running 3 nodes with 3 monitors and 8 OSDs on my nodes,
            all on IPv6.

            When I tested the cluster, I detected a strange and severe
            problem.
            On the first node I'm running qemu hosts with a librados disk
            connection to the cluster and all 3 monitors mentioned in the
            connection.
            On the second node I stopped the mon and osd with the command

            kill -STOP MONPID OSDPID

            Within one minute all my qemu hosts on the first node freeze, so
            they don't even respond to ping. [..]


        Why would you want to *stop* (as in, freeze) a process instead
        of killing it?
        Anyway, with the processes still there, it may take a few minutes
        before the cluster realizes that the daemons are stopped and kicks
        them out of the cluster, restoring normal behavior (assuming
        correctly set crush rules).


-- ============
    Ing. Jan Pekař
    jan.pekar@xxxxxxxxx | +420603811737
    ----
    Imatic | Jagellonská 14 | Praha 3 | 130 00
    http://www.imatic.cz
    ============
    --
    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Jason




