Hi, I have been experiencing brief outage events in my Ceph cluster. During these events I get slow-ops warnings and OSDs are marked down, even though the daemons are in fact still running, and then everything suddenly returns to normal on its own. Each event lasts about 2 minutes. I haven't found anything that points to the cause. Can you help me?

I'm running Mimic (13.2.6) on CentOS 7 on all nodes. Below are the output of "ceph -s" and the cluster log from one of these events.

# ceph -s
  cluster:
    id:     4ea72929-6f9e-453a-8cd5-bb0712f6b874
    health: HEALTH_OK

  services:
    mon:         2 daemons, quorum cmonitor,cmonitor2
    mgr:         cmonitor(active), standbys: cmonitor2
    osd:         74 osds: 74 up, 74 in
    tcmu-runner: 10 daemons active

  data:
    pools:   7 pools, 3072 pgs
    objects: 22.17 M objects, 83 TiB
    usage:   225 TiB used, 203 TiB / 428 TiB avail
    pgs:     3063 active+clean
             9    active+clean+scrubbing+deep
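In case it is useful, this is roughly what I plan to run while the next event is in progress, to capture more detail (osd.16 below is just one of the OSDs named in the slow-ops warning, and the "ceph daemon" commands have to be run on the host where that daemon lives):

# which health checks are firing, in detail
ceph health detail

# per-OSD commit/apply latency, to spot a single slow disk
ceph osd perf

# in-flight and recently completed ops on one affected OSD
ceph daemon osd.16 ops
ceph daemon osd.16 dump_historic_ops

# recent cluster log entries
ceph log last 50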
======================================
Log of event:

2020-08-05 18:00:00.000179 [INF] overall HEALTH_OK
2020-08-05 17:55:28.905024 [INF] Cluster is now healthy
2020-08-05 17:55:28.904975 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1/60350974 objects degraded (0.000%), 1 pg degraded)
2020-08-05 17:55:27.746606 [WRN] Health check update: Degraded data redundancy: 1/60350974 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
2020-08-05 17:55:22.745820 [WRN] Health check update: Degraded data redundancy: 55/60350897 objects degraded (0.000%), 26 pgs degraded, 1 pg undersized (PG_DEGRADED)
2020-08-05 17:55:17.744218 [WRN] Health check update: Degraded data redundancy: 123/60350666 objects degraded (0.000%), 63 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:12.743568 [WRN] Health check update: Degraded data redundancy: 192/60350660 objects degraded (0.000%), 88 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:07.741759 [WRN] Health check update: Degraded data redundancy: 290/60350737 objects degraded (0.000%), 117 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:02.737913 [WRN] Health check update: Degraded data redundancy: 299/60350764 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:57.736694 [WRN] Health check update: Degraded data redundancy: 299/60350746 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:52.736132 [WRN] Health check update: Degraded data redundancy: 299/60350731 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:47.735612 [WRN] Health check update: Degraded data redundancy: 299/60350689 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:42.734877 [WRN] Health check update: Degraded data redundancy: 301/60350677 objects degraded (0.000%), 120 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:38.210906 [INF] Health check cleared: SLOW_OPS (was: 35 slow ops, oldest one blocked for 1954017 sec, daemons [mon.cmonitor,mon.cmonitor2] have slow ops.)
2020-08-05 17:54:37.734218 [WRN] Health check update: 35 slow ops, oldest one blocked for 1954017 sec, daemons [mon.cmonitor,mon.cmonitor2] have slow ops. (SLOW_OPS)
2020-08-05 17:54:37.734132 [WRN] Health check update: Degraded data redundancy: 380/60350611 objects degraded (0.001%), 154 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:34.171483 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 6 pgs peering)
2020-08-05 17:54:32.733499 [WRN] Health check update: Degraded data redundancy: 52121/60350544 objects degraded (0.086%), 211 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:27.080529 [WRN] Monitor daemon marked osd.72 down, but it is still running
2020-08-05 17:54:32.102889 [WRN] Health check failed: 60 slow ops, oldest one blocked for 1954017 sec, daemons [osd.16,osd.22,osd.23,osd.27,osd.28,osd.29,osd.30,osd.35,osd.48,osd.5]... have slow ops. (SLOW_OPS)
2020-08-05 17:54:32.102699 [WRN] Health check update: Reduced data availability: 3 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2020-08-05 17:54:27.951343 [INF] osd.72 192.168.200.25:6844/64565 boot
2020-08-05 17:54:27.935996 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-05 17:54:27.732679 [WRN] Health check update: Degraded data redundancy: 1781748/60350443 objects degraded (2.952%), 381 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:26.916269 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 10 pgs peering, 18 pgs incomplete (PG_AVAILABILITY)
2020-08-05 17:54:19.712714 [WRN] Monitor daemon marked osd.71 down, but it is still running
2020-08-05 17:54:22.716043 [WRN] Health check update: Degraded data redundancy: 4057630/60350485 objects degraded (6.723%), 566 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:22.715939 [WRN] Health check update: 1 osds down (OSD_DOWN)
2020-08-05 17:54:14.042363 [WRN] Monitor daemon marked osd.62 down, but it is still running
2020-08-05 17:54:20.858582 [WRN] Health check update: Reduced data availability: 8 pgs inactive, 15 pgs peering, 37 pgs incomplete (PG_AVAILABILITY)
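Since the OSDs were marked down while still running, I also intend to look for missed heartbeats in the OSD logs on the affected hosts and to rule out the network. A rough sketch of what I'll run (the log path is the CentOS 7 default, and eth0 is just a placeholder for our cluster/public interface):

# heartbeat failures reported by one of the flapping OSDs around the event window
grep heartbeat_check /var/log/ceph/ceph-osd.72.log

# interface error/drop counters on the OSD hosts
ip -s link show
ethtool -S eth0 | grep -iE 'drop|err'

Any pointers on what else to capture would be appreciated.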