Dear All,

We have a 38-node HP Apollo cluster with 24 x 3.7T spinning disks and 2 NVMe devices for journals per node. This is one of our 13 clusters and was upgraded from Luminous to Nautilus (14.2.11). One of our OpenStack customers, who uses Elasticsearch to offer Logging as a Service to their end users, reported IO latency issues. Our SME rebooted two nodes that he suspected of leaking memory. The reboot did not help; if anything it made things worse, and he went ahead and recycled the entire cluster one node at a time, trying to clear the slow ops reported by the OSDs. This caused a huge issue: the MONs could not withstand the resulting spam and started crashing.

1) We audited the network (inspecting the TORs, iperf, MTR) and found nothing suspicious, but the OSD logs kept complaining about BADAUTHORIZER:

2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1c1e20 0x7fea3c1bcdf0 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >> v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

2) We made sure there is no clock skew; we use timesyncd. Taking out a couple of OSDs that were showing up as slow in the ceph health output did not improve the situation. After three days of troubleshooting we upgraded the MONs to 14.2.15, which seemed to improve things a little, but the cluster still reports 61308 slow ops, and we have really struggled to trace them to bad OSDs, since moving a couple of them out made no difference. One of the MONs (mon 2) fails to join the cluster: it is constantly compacting and never manages to join (see the store sizes below). I suspect this is because the key-value store on mon 2 is not up to date with mons 1 and 3. At times we had to stop and restart the compaction to get a better response from the Ceph MONs (keeping the cluster running on a single MON).

root@pistoremon-as-c01:~# du -sh /var/lib/ceph/mon
391G    /var/lib/ceph/mon
root@pistoremon-as-c03:~# du -sh /var/lib/ceph/mon
337G    /var/lib/ceph/mon
root@pistoremon-as-c02:~# du -sh /var/lib/ceph/mon
13G     /var/lib/ceph/mon

Current cluster status:

  cluster:
    id:     bac20301-d458-4828-9dd9-a8406acf5d0f
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            1 pools have many more objects per pg than average
            10969 pgs not deep-scrubbed in time
            46 daemons have recently crashed
            61308 slow ops, oldest one blocked for 2572 sec, daemons [mon.pistoremon-as-c01,mon.pistoremon-as-c03] have slow ops.
            mons pistoremon-as-c01,pistoremon-as-c03 are using a lot of disk space
            1/3 mons down, quorum pistoremon-as-c01,pistoremon-as-c03

  services:
    mon: 3 daemons, quorum pistoremon-as-c01,pistoremon-as-c03 (age 52m), out of quorum: pistoremon-as-c02
    mgr: pistoremon-as-c01(active, since 2h), standbys: pistoremon-as-c03, pistoremon-as-c02
    osd: 911 osds: 888 up (since 68m), 888 in
         flags noout,noscrub,nodeep-scrub
    rgw: 2 daemons active (pistorergw-as-c01, pistorergw-as-c02)

  task status:

  data:
    pools:   17 pools, 32968 pgs
    objects: 62.98M objects, 243 TiB
    usage:   748 TiB used, 2.4 PiB / 3.2 PiB avail
    pgs:     32968 active+clean

  io:
    client:   56 MiB/s rd, 95 MiB/s wr, 1.78k op/s rd, 4.27k op/s wr

3) When looking through ceph.log with tail -f on MON 1, which is the leader, I see a lot of different timestamps reported. I am confused about why the live log reports such varied timestamps:

stat,write 2166784~4096] snapc 0=[] ondisk+write+known_if_redirected e951384) initiated 2020-12-13 06:16:58.873964 currently delayed
2020-12-13 06:39:37.169504 osd.1224 (osd.1224) 325855 : cluster [WRN] slow request osd_op(client.461445583.0:8881223 1.16aa 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write 3547136~4096] snapc 0=[] ondisk+write+known_if_redirected e951384) initiated 2020-12-13 06:16:59.082012 currently delayed
2020-12-13 06:39:37.169510 osd.1224 (osd.1224) 325856 : cluster [WRN] slow request osd_op(client.461445583.0:8881191 1.16aa 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write 2314240~4096] snapc 0=[] ondisk+write+known_if_redirected e951384) initiated 2020-12-13 06:16:58.874031 currently delayed
2020-12-13 06:39:37.169513 osd.1224 (osd.1224) 325857 : cluster [WRN] slow request osd_op(client.461445583.0:8881224 1.16aa 1:55684db0:::rbd_data.9ede65fc7af15.0000000000000000:head [stat,write 3571712~8192] snapc 0=[] ondisk+write+known_if_redirected e951384) initiated 2020-12-13 06:16:59.082094 currently delayed
^C

root@pistoremon-as-c03:~# date
Tue Dec 15 20:12:02 UTC 2020
root@pistoremon-as-c03:~# timedatectl
      Local time: Tue 2020-12-15 20:12:04 UTC
  Universal time: Tue 2020-12-15 20:12:04 UTC
        RTC time: Tue 2020-12-15 20:12:04
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

4) We keep seeing the errors below and assume that, because of the size of the MON DB under this high workload, clients are not able to fetch the mon map?

2020-12-13 13:16:56.197 7f946cddc700 -1 monclient: get_monmap_and_config failed to get config
2020-12-13 13:17:00.613 7f946cddc700 -1 monclient: get_monmap_and_config failed to get config

We are getting all sorts of errors, and our primary concern is how to get the slow ops down and improve the IO latency. The MONs use SSDs for their store, with the noop I/O scheduler and nr_requests set to 128. Please provide some guidance on all four issues reported above; the commands we are planning to try next are sketched in the P.S. below. Thank you for your valuable time.

--
Regards,
Suresh
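
P.S. For issues 1) and 2), these are the checks and knobs we are planning to try next. This is only a sketch based on the standard Nautilus CLI; the mon_sync_max_payload_size value is our own guess, not a documented recommendation.

# Confirm from the mon side that no clock skew is being detected
ceph time-sync-status

# Check the cephx key the mons hold for a suspect OSD
ceph auth get osd.1224

# Trigger an online compaction of a mon's store
ceph tell mon.pistoremon-as-c01 compact

# Or compact at restart by setting, in the [mon] section of ceph.conf:
#   mon compact on start = true

# For mon 2, which never finishes syncing, have it pull smaller sync
# chunks (4096 is just a guess on our part)
ceph config set mon mon_sync_max_payload_size 4096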
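
For issues 3) and 4), this is roughly how we intend to look at the slow ops while they are happening; osd.1224 and the mon name are simply the examples from the output above, and the commands assume the admin sockets are reachable on the respective hosts.

# Dump the ops currently stuck on one of the mons reporting slow ops
ceph daemon mon.pistoremon-as-c01 ops

# On the OSD host, dump in-flight and recent slow ops for a suspect OSD
ceph daemon osd.1224 dump_ops_in_flight
ceph daemon osd.1224 dump_historic_slow_ops

# Per-OSD commit/apply latency, to spot outlier spinning disks
ceph osd perf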