Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

Michel Jouvin <jouvin@xxxxxxxxxxxx> · Mon, 17 Oct 2022 11:38:57 +0200

Hi,

In fact, it was a very stupid mistake. This is a CentOS 8 system where smartd
was not installed. After installing and starting it, we found that the OSD
device is indeed in bad shape, with many reported errors, which explains the
behaviour observed.
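
For anyone hitting the same symptoms, the check that exposed the failing disk can be sketched roughly as below. This is only a sketch: it assumes smartmontools is available from the standard CentOS 8 repositories, /dev/sdX is a placeholder (the actual device is not named in this thread), and `run` is a hypothetical dry-run helper that only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints each command unless APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Install smartmontools, start smartd, then query the suspect device.
check_disk_health() {
    dev="$1"                              # placeholder device, e.g. /dev/sdX
    run dnf install -y smartmontools      # package providing smartd and smartctl
    run systemctl enable --now smartd     # start the daemon now and at boot
    run smartctl -H "$dev"                # overall health self-assessment
    run smartctl -a "$dev"                # full attribute and error-log dump
}

check_disk_health /dev/sdX
```

Running it as-is only lists the four commands; with APPLY=1 (as root) it would execute them.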

We managed to drain the sick OSD gracefully using the approach proposed
by Dan: set primary affinity to 0 for the OSD being drained before doing the
`ceph osd out`. We were able to check that after setting primary
affinity to 0, essentially no I/O was handled by the faulty OSD. The
backfilling completed in the expected time.
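
The order of operations described above can be sketched as follows. It is a minimal sketch rather than the exact commands we typed: osd.10 is the OSD named further down in this thread, the final `safe-to-destroy` check is an addition not mentioned above, and `run` is a hypothetical dry-run helper that prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints each command unless APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Drain one OSD: stop it acting as primary first, then mark it out.
drain_osd() {
    id="$1"
    run ceph osd primary-affinity "osd.$id" 0   # ask CRUSH to avoid this OSD as primary
    run ceph osd out "$id"                      # start backfilling its PGs elsewhere
    run ceph osd safe-to-destroy "osd.$id"      # addition: reports when no data depends on it
}

drain_osd 10
```

The `safe-to-destroy` query only succeeds once backfilling has finished, so in practice it would be repeated until it does.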

Thanks for the answers received. Cheers,

Michel

On 16/10/2022 at 22:49, Michel Jouvin wrote:
Hi,

In fact, a very stupid mistake. This is a CentOS 8 system where smartd 
was not installed. After installing and starting it, the OSD device is 
indeed in bad shape... Sorry for the noise.

Cheers,

Michel

On 16/10/2022 at 22:26, Michel Jouvin wrote:
Hi Dan,

Thanks for your quick answer. No, I checked: there is really nothing in dmesg or
/var/log/messages. We'll try to remove it either gracefully or abruptly.

Cheers,

Michel

On 16/10/2022 at 22:16, Dan van der Ster wrote:
Hi Michel,

Are you sure there isn't a hardware problem with the disk? E.g. 
maybe you have SCSI timeouts in dmesg or high ioutil with iostat?
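
A quick way to run that check could look like the sketch below. The grep patterns are illustrative assumptions about what disk trouble tends to look like in the kernel log, not an exhaustive or authoritative list, and `count_disk_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Count kernel-log lines that hint at disk trouble. Reads the log on stdin,
# so it can be fed from `dmesg -T` or from a saved log file.
# The patterns are illustrative guesses, not a complete list.
count_disk_errors() {
    grep -ciE 'i/o error|timed out|timeout|medium error|hard resetting link' || true
}

# Typical use on the OSD host (may need root):
#   dmesg -T | count_disk_errors
#   iostat -x 5 3     # watch the %util and await columns of the OSD device
```

A count of zero does not prove the disk is healthy; it only means none of these patterns appeared.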

Anyway I don't think there's a big risk related to draining and 
stopping the osd. Just consider this a disk failure, which can 
happen at any time anyway.

Start by marking it out. If there are still too many slow requests 
or laggy PGs, try setting primary affinity to zero.
And if that still doesn't work, I wouldn't hesitate to stop that 
sick OSD so that objects backfill from the replicas.
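
That last resort could be sketched as below. It assumes a package-based (non-containerized) deployment where the daemon runs as the systemd unit `ceph-osd@<id>`; the `ok-to-stop` check is an extra precaution not mentioned above, the OSD id is a placeholder, and `run` is a hypothetical dry-run helper that prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints each command unless APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Stop a misbehaving OSD so its PGs are served and recovered from replicas.
stop_sick_osd() {
    id="$1"
    run ceph osd ok-to-stop "osd.$id"    # extra check: would PGs stay available?
    run systemctl stop "ceph-osd@$id"    # assumption: systemd unit of a non-cephadm install
}

stop_sick_osd 10   # placeholder id
```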

(We had a somewhat similar issue today, btw .. some brand of SSDs 
occasionally hangs IO across a whole SCSI bus when failing. Stopping 
the osd revives the rest of the disks on the box).

Cheers, Dan

On Sun, Oct 16, 2022, 22:08 Michel Jouvin <jouvin@xxxxxxxxxxxx> wrote:

    Hi,

    We have a production cluster made of 12 OSD servers with 16 OSDs each
    (all the same HW), which has been running fine for 5 years (initially
    installed with Luminous). It has been running Octopus (15.2.16) for
    1 year and was recently upgraded to 15.2.17 (1 week before the problem
    started, but the problem doesn't seem to be linked to this upgrade).
    Since the beginning of October, we have been seeing PGs in state
    "active+laggy" and slow requests, always related to the same OSD.
    Looking at its log, we saw "log_latency_fn slow" messages. There was
    no disk error logged in any system log file. Restarting the OSD didn't
    really help, but no functional problems were seen.

    Looking again at the problem in the last few days, we saw that the
    cluster was in HEALTH_WARN state because several PGs were not
    deep-scrubbed in time. In the logs we also saw (but maybe we just
    missed them initially) "heartbeat_map is_healthy 'OSD::osd_op_tp
    thread...' had timed out after 15" messages. The number of PGs not
    deep-scrubbed in time has increased day after day and is now almost 3
    times the number of PGs hosted by the laggy OSD (despite hundreds of
    deep scrubs running successfully; the cluster has 4297 PGs). It seems
    that the list contains all PGs that have a replica on the laggy OSD
    (all the pools use 3 replicas, no EC). We confirmed that there is no
    detected disk error in the system.
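
To put numbers on the two message types quoted above, a small helper along these lines could be used. It is only a sketch: `count_osd_warnings` is a hypothetical name, and the log path in the usage comment is the usual default for a package-based install, which is an assumption about this cluster.

```shell
#!/bin/sh
# Count the two warning types discussed in this thread in one OSD log file.
count_osd_warnings() {
    log="$1"
    printf 'log_latency_fn slow: %s\n' \
        "$(grep -c 'log_latency_fn slow' "$log")"
    printf 'heartbeat_map timed out: %s\n' \
        "$(grep -c 'heartbeat_map is_healthy.*had timed out' "$log")"
}

# Usage (assumed default log location for osd.10):
#   count_osd_warnings /var/log/ceph/ceph-osd.10.log
```

Comparing the counts with those of a healthy OSD on the same host gives a rough idea of how abnormal the laggy one is.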

    Today we restarted the server hosting this OSD, without much hope. It
    didn't help, and the same OSD (and only this one) continues to have
    the same problem. In addition to the messages mentioned, the admin
    socket for this OSD became unresponsive: although commands were being
    executed (see below), they did not return in a reasonable amount of
    time (several minutes).

    As the OSD's RocksDB has probably never been compacted, we decided to
    compact the laggy OSD. Although the "ceph tell osd.10 compact" command
    never returned (it was killed after a few hours, as the OSD had been
    marked down for a few seconds), the compaction started and lasted ~5
    hours... but completed successfully. The only improvement seen after
    the compaction was that the admin socket is now responsive (although
    a bit slow). The messages about log_latency_fn and heartbeat_map are
    still present (and frequent) and the deep scrubs are still blocked.

    We are looking for advice on what to do to fix this issue. We had in
    mind to stop this OSD, zap it and reinstall it, but we are worried it
    may be risky to do this with an OSD that has not been deep-scrubbed
    for a long time. And we are sure there is a better solution!
    Understanding the cause would be a much better approach!

    Thanks in advance for any help. Best regards,

    Michel

    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx
