'defect PG' caused heartbeat_map is_healthy timeout and recurring OSD breakdowns

Hi all,

yesterday we encountered a problem in our ceph cluster. After a long day we were able to fix it, but we are very unsatisfied with the fix and assume it is only temporary. Any hint or help is very much appreciated.

We have a production ceph cluster (ceph version 10.2.5 on Ubuntu 16.04, deployed with ceph-ansible) serving our private OpenStack cloud with object and block storage. We went live a few days ago with a few friendly users / projects on it. Yesterday we woke up to a few alarms about failed VM creations and a few down OSDs. The troubleshooting took us the whole day, and this is what we found:

Starting situation:
Some OSDs were marked down, and we had 100 blocked requests on one OSD (plus a few on other OSDs); ceph was recovering and backfilling. We also saw OSDs rejoining the cluster from time to time, while others left it and stayed 'down', causing recovery traffic.
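
For reference, this picture came from nothing fancier than the standard status commands, roughly:

    ceph -s                      # overall health plus recovery/backfill progress
    ceph health detail           # lists which OSDs carry the blocked requests
    ceph osd tree | grep down    # which OSDs are currently marked down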

Analysis:
- The OSD process with the 100 blocked requests was using significantly more CPU than the other OSDs on the node. Let's call this OSD osd.113 (although this changed later on). Most of the time it was above 120%, and it looked like one single thread was pinning a core at 100%.
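
  (We spotted the single busy thread with a per-thread CPU view of the OSD process; something along these lines, where the pgrep pattern is only an illustration of how osd.113's PID could be found and depends on how the OSD processes are started on your nodes:)

      top -H -p "$(pgrep -f 'ceph-osd.*--id 113')"    # per-thread CPU view of the osd.113 process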

- In the logs of the OSDs that recently left the cluster as down we found several of the following entries:

    2017-03-01 10:22:29.258982 7f840d258700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:29.258988 7f840d157700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:31.080447 7f8447ea6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:32.759999 7f840d258700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:32.760010 7f840d157700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:36.080545 7f8447ea6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
    2017-03-01 10:22:36.080568 7f8447ea6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f842a0bf700' had suicide timed out after 150

    ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55c490e24520]
    2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x259) [0x55c490d61209]
    3: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x55c490d61b36]
    4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55c490d6238c]
    5: (CephContextServiceThread::entry()+0x167) [0x55c490e3cda7]
    6: (()+0x76ba) [0x7f4f028096ba]
    7: (clone()+0x6d) [0x7f4f0088282d]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  On the monitor we also saw several entries like the following for different OSDs:

    2017-03-01 08:34:23.321262 osd.77 10.97.162.16:6843/61369 62043 : cluster [WRN] map e357 wrongly marked me down

  Something was preventing OSDs from properly sending or answering heartbeats...
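
  As far as we understand, the '15' and '150' in the log lines above are the op thread timeout and suicide timeout of the OSD::osd_op_tp thread pool; they (and the heartbeat grace) can be checked per OSD on its admin socket, e.g.:

      ceph daemon osd.77 config get osd_op_thread_timeout            # the 'timed out after 15'
      ceph daemon osd.77 config get osd_op_thread_suicide_timeout    # the 'suicide timed out after 150'
      ceph daemon osd.77 config get osd_heartbeat_grace              # grace before peers report an OSD down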

- We then tried to increase the osd_heartbeat_grace parameter with 'ceph tell osd --injectargs' (roughly sketched after the output below) to break the chain of leaving and rejoining OSDs. We were able to set it on all OSDs except osd.113 (the one using >120% CPU and having the 100 blocked requests), because it did not answer. This bought us a bit more stability, and the cluster recovered to an almost stable state:

    # ceph pg dump_stuck | column -t
    ok
    pg_stat  state                                   up             up_primary  acting        acting_primary
    4.33     active+recovering+degraded+remapped     [113,104,115]  113         [113,77,87]   113
    8.7      active+recovery_wait+degraded           [113,75,94]    113         [113,75,94]   113
    3.40     active+recovery_wait+degraded+remapped  [113,92,78]    113         [113,116,84]  113

  Although it was saying '4.33  active+recovering', there was no actual recovery traffic visible in 'ceph -s' for quite some time. We tried 'ceph pg 4.33 query', but the command did not return.
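
  For completeness, the grace bump and our attempts to look at osd.113 boiled down to roughly the following; the value 60 is purely illustrative, and the 'ceph daemon' commands have to be run locally on the OSD host:

      ceph tell osd.* injectargs '--osd_heartbeat_grace 60'    # runtime only; value here is illustrative
      ceph daemon osd.113 dump_ops_in_flight                   # what the op threads are stuck on
      ceph daemon osd.113 dump_historic_ops                    # recent slow ops, if the socket still answers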

- As we now had a clear indication that something was wrong with osd.113, we decided to take it down and let ceph recover. As soon as we took osd.113 down, its PGs got a new primary OSD and ... the >120% CPU usage and the 100 blocked requests disappeared from osd.113, only to reappear on the new primary OSD for PG 4.33. We also now had a lot of additional recovery traffic because osd.113 had been taken down, and the new primary OSD also went down with a suicide timeout (as seen above), switching the primary to yet another OSD and *leaving us exactly with our starting situation*, just with a different primary for the 'defect PG' 4.33. In this vicious circle the new primary OSDs also died from heartbeat_map timeouts from time to time, causing PG 4.33 to 'wander' through our OSDs and leaving behind a mess of down, rejoining and recovering OSDs/PGs in its trail.
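
  (Taking osd.113 down was nothing special, something like the usual sequence below; the systemd unit name is an assumption based on a ceph-ansible deployment on Ubuntu 16.04, and 'noout' is only useful while you are still deciding whether to let recovery run:)

      ceph osd set noout              # optional: hold off rebalancing while investigating
      systemctl stop ceph-osd@113     # on the OSD host; unit name assumed
      ceph osd down 113               # mark it down immediately if the daemon hangs on shutdown
      ceph osd unset noout            # clear the flag once recovery should proceed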

(Temporary) Solution:
The pool with index number 4, to which PG 4.33 belonged, was an openstack-gnocchi pool containing only telemetry data. As this was not essential for our operations, we decided to delete the whole pool to get rid of PG 4.33 and to recreate it afterwards. This is clearly 'cracking a nut with a sledgehammer'. However, after PG 4.33 was gone the cluster was able to fully recover and has remained stable since. If the pool had contained volume or object storage data, this would not have been an option...
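
In commands, the sledgehammer was essentially the following; only the pool index 4 is known from above, so the pool name is a placeholder (check 'ceph osd lspools'), the PG count is illustrative, and newer Ceph releases additionally require mon_allow_pool_delete on the monitors:

    ceph osd lspools                                                                  # find the name of pool 4
    ceph osd pool delete <gnocchi-pool> <gnocchi-pool> --yes-i-really-really-mean-it  # drop the defect pool
    ceph osd pool create <gnocchi-pool> 64 64                                         # pg_num/pgp_num illustrative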

Any hint for a better solution (in case the problem reappears) is highly appreciated.

Daniel Marks | Cloud Engineer

codecentric AG | Hochstraße 11 | 42697 Solingen | Deutschland
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de  

Registered office: Solingen | HRB 25917 | Amtsgericht Wuppertal
Executive board: Michael Hochgürtel . Rainer Vehns
Supervisory board: Patric Fedlmeier (chairman) . Klaus Jäger . Jürgen Schütz

