Slow Ops start piling up, Mon Corruption?

Daniel Poelzleithner <poelzi@xxxxxxxxxx> · Tue, 16 Jun 2020 19:05:48 +0200

Hi,

We had bad blocks on one OSD and, at around the same time, a network
switch outage, which seems to have caused some corruption in the mon
service.

> # ceph -s

  cluster:
    id:     d7c5c9c7-a227-4e33-ab43-3f4aa1eb0630
    health: HEALTH_WARN
            1 daemons have recently crashed
            14097 slow ops, oldest one blocked for 56417 sec, mon.server6 has slow ops
            mon server6 is low on available space

  services:
    mon: 3 daemons, quorum server6,server3,server5 (age 15h)
    mgr: server4(active, since 3w), standbys: server6, server5
    mds: xpool:1 {0=server6=up:active} 1 up:standby
    osd: 21 osds: 21 up (since 15h), 20 in (since 16h)

  data:
    pools:   17 pools, 941 pgs
    objects: 6.80M objects, 18 TiB
    usage:   34 TiB used, 20 TiB / 54 TiB avail
    pgs:     940 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   23 MiB/s rd, 980 KiB/s wr, 30 op/s rd, 141 op/s wr

The warning that worries me is the slow ops one: 14097 slow ops, oldest
one blocked for 56417 sec, mon.server6 has slow ops.
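
To put the blocked time in context, here is a minimal sketch that converts
the 56417 seconds from the health warning into hours and minutes. The
helper name format_duration is hypothetical and exists only for this
illustration.

```python
def format_duration(seconds: int) -> str:
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Blocked time of the oldest slow op, taken from the health warning.
print(format_duration(56417))  # prints "15h 40m 17s"
```

That is roughly 15.7 hours, which lines up with the "age 15h" of the mon
quorum and the "up (since 15h)" of the OSDs in the status output, so the
oldest op has probably been stuck since about the time the quorum
re-formed.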

The mon ops log looks like:
https://gist.github.com/poelzi/45f31f26f6a83f6406bb43553e0c237a

It seems that the mds transactions never finish because they are waiting
for the mdsmap. On the mds server there are no ops in flight, and there
are no errors in its log file.
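
One way to check where the ops are stuck is to group them by the last
event each one recorded. The sketch below assumes the dump is JSON with a
top-level "ops" list whose entries carry "type_data" -> "events" -> "event";
that layout is an assumption and may differ between releases. The function
name last_events and the sample data are made up for illustration and are
not output from this cluster.

```python
import json
from collections import Counter


def last_events(dump: dict) -> Counter:
    """Count ops by the name of the last event each one recorded."""
    counts = Counter()
    for op in dump.get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        name = events[-1].get("event", "unknown") if events else "no events"
        counts[name] += 1
    return counts


# Made-up sample in the assumed layout; not real cluster output.
SAMPLE = json.loads("""
{"ops": [
  {"description": "example op 1",
   "type_data": {"events": [{"event": "initiated"},
                            {"event": "example:waiting"}]}},
  {"description": "example op 2",
   "type_data": {"events": [{"event": "initiated"},
                            {"event": "example:waiting"}]}},
  {"description": "example op 3",
   "type_data": {"events": [{"event": "initiated"}]}}
]}
""")

# Print the most common final events first.
for name, count in last_events(SAMPLE).most_common():
    print(f"{count:6d}  {name}")
```

On a real cluster the input would be a saved dump (for example from
"ceph daemon mon.server6 ops", if that admin socket command is available
on the installed release) loaded with json.load instead of the sample.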

What is the proper way to repair this?

kind regards
 poelzi