Hi,
I’m pretty sure that the deep-scrubs are causing the slow requests.
There have been several threads about this on the list [3]. There are
two main things you can do:
1. Change the default deep-scrub settings [1] so deep-scrubs run outside
business hours and avoid additional load during peak times (see the
example right after this list).
2. Change the OSD op queue [2] (how to apply it is shown below):
osd op queue = wpq
osd op queue cut off = high
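As an example for point 1 (just a sketch, assuming your business hours
are roughly 08:00 to 18:00 -- adjust the hours to your environment), you
can limit scrubbing to a nightly window in the [osd] section of ceph.conf:

[osd]
osd scrub begin hour = 19
osd scrub end hour = 7
# optional: add a small sleep between scrub chunks to throttle scrub I/O
osd scrub sleep = 0.1

The begin/end hours can also be injected at runtime without a restart, e.g.:

ceph tell osd.* injectargs '--osd_scrub_begin_hour 19 --osd_scrub_end_hour 7'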
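To apply the queue settings from point 2 (again just a sketch, I can't
verify it on your cluster): put them into the [osd] section of ceph.conf
on all OSD hosts and restart the OSDs one by one, because as far as I
know these two options only take effect at OSD start:

[osd]
osd op queue = wpq
osd op queue cut off = high

Afterwards you can check the running values on an OSD host via the admin
socket, e.g.:

ceph daemon osd.0 config show | grep osd_op_queue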
We were able to reduce the slow requests drastically in our production
cluster with these actions.
Regards
Eugen
[1] https://docs.ceph.com/docs/mimic/rados/configuration/osd-config-ref/#scrubbing
[2] https://docs.ceph.com/docs/mimic/rados/configuration/osd-config-ref/#operations
[3] https://www.spinics.net/lists/ceph-users/msg60589.html
Quoting Gesiel Galvão Bernardes <gesiel.bernardes@xxxxxxxxx>:
Hi,
I have been experiencing brief outage events in the Ceph cluster. During
these events I get slow ops messages and OSDs are marked down, but at the
same time the cluster keeps operating, and then everything magically goes
back to normal. These events usually last about 2 minutes.
I couldn't find anything that points to what causes these events. Can you
help me?
I'm using Mimic (13.2.6) and CentOS7 on all nodes.
Below are the output of "ceph -s" and the log from when an event occurs:
# ceph -s
  cluster:
    id:     4ea72929-6f9e-453a-8cd5-bb0712f6b874
    health: HEALTH_OK

  services:
    mon:         2 daemons, quorum cmonitor,cmonitor2
    mgr:         cmonitor(active), standbys: cmonitor2
    osd:         74 osds: 74 up, 74 in
    tcmu-runner: 10 daemons active

  data:
    pools:   7 pools, 3072 pgs
    objects: 22.17 M objects, 83 TiB
    usage:   225 TiB used, 203 TiB / 428 TiB avail
    pgs:     3063 active+clean
             9    active+clean+scrubbing+deep
======================================
Log of event:
2020-08-05 18:00:00.000179 [INF] overall HEALTH_OK
2020-08-05 17:55:28.905024 [INF] Cluster is now healthy
2020-08-05 17:55:28.904975 [INF] Health check cleared: PG_DEGRADED (was:
Degraded data redundancy: 1/60350974 objects degraded (0.000%), 1 pg
degraded)
2020-08-05 17:55:27.746606 [WRN] Health check update: Degraded data
redundancy: 1/60350974 objects degraded (0.000%), 1 pg degraded
(PG_DEGRADED)
2020-08-05 17:55:22.745820 [WRN] Health check update: Degraded data
redundancy: 55/60350897 objects degraded (0.000%), 26 pgs degraded, 1 pg
undersized (PG_DEGRADED)
2020-08-05 17:55:17.744218 [WRN] Health check update: Degraded data
redundancy: 123/60350666 objects degraded (0.000%), 63 pgs degraded
(PG_DEGRADED)
2020-08-05 17:55:12.743568 [WRN] Health check update: Degraded data
redundancy: 192/60350660 objects degraded (0.000%), 88 pgs degraded
(PG_DEGRADED)
2020-08-05 17:55:07.741759 [WRN] Health check update: Degraded data
redundancy: 290/60350737 objects degraded (0.000%), 117 pgs degraded
(PG_DEGRADED)
2020-08-05 17:55:02.737913 [WRN] Health check update: Degraded data
redundancy: 299/60350764 objects degraded (0.000%), 119 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:57.736694 [WRN] Health check update: Degraded data
redundancy: 299/60350746 objects degraded (0.000%), 119 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:52.736132 [WRN] Health check update: Degraded data
redundancy: 299/60350731 objects degraded (0.000%), 119 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:47.735612 [WRN] Health check update: Degraded data
redundancy: 299/60350689 objects degraded (0.000%), 119 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:42.734877 [WRN] Health check update: Degraded data
redundancy: 301/60350677 objects degraded (0.000%), 120 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:38.210906 [INF] Health check cleared: SLOW_OPS (was: 35
slow ops, oldest one blocked for 1954017 sec, daemons
[mon.cmonitor,mon.cmonitor2] have slow ops.)
2020-08-05 17:54:37.734218 [WRN] Health check update: 35 slow ops, oldest
one blocked for 1954017 sec, daemons [mon.cmonitor,mon.cmonitor2] have slow
ops. (SLOW_OPS)
2020-08-05 17:54:37.734132 [WRN] Health check update: Degraded data
redundancy: 380/60350611 objects degraded (0.001%), 154 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:34.171483 [INF] Health check cleared: PG_AVAILABILITY
(was: Reduced data availability: 3 pgs inactive, 6 pgs peering)
2020-08-05 17:54:32.733499 [WRN] Health check update: Degraded data
redundancy: 52121/60350544 objects degraded (0.086%), 211 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:27.080529 [WRN] Monitor daemon marked osd.72 down, but it
is still running
2020-08-05 17:54:32.102889 [WRN] Health check failed: 60 slow ops, oldest
one blocked for 1954017 sec, daemons
[osd.16,osd.22,osd.23,osd.27,osd.28,osd.29,osd.30,osd.35,osd.48,osd.5]...
have slow ops. (SLOW_OPS)
2020-08-05 17:54:32.102699 [WRN] Health check update: Reduced data
availability: 3 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2020-08-05 17:54:27.951343 [INF] osd.72 192.168.200.25:6844/64565 boot
2020-08-05 17:54:27.935996 [INF] Health check cleared: OSD_DOWN (was: 1
osds down)
2020-08-05 17:54:27.732679 [WRN] Health check update: Degraded data
redundancy: 1781748/60350443 objects degraded (2.952%), 381 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:26.916269 [WRN] Health check update: Reduced data
availability: 5 pgs inactive, 10 pgs peering, 18 pgs incomplete
(PG_AVAILABILITY)
2020-08-05 17:54:19.712714 [WRN] Monitor daemon marked osd.71 down, but it
is still running
2020-08-05 17:54:22.716043 [WRN] Health check update: Degraded data
redundancy: 4057630/60350485 objects degraded (6.723%), 566 pgs degraded
(PG_DEGRADED)
2020-08-05 17:54:22.715939 [WRN] Health check update: 1 osds down
(OSD_DOWN)
2020-08-05 17:54:14.042363 [WRN] Monitor daemon marked osd.62 down, but it
is still running
2020-08-05 17:54:20.858582 [WRN] Health check update: Reduced data
availability: 8 pgs inactive, 15 pgs peering, 37 pgs incomplete
(PG_AVAILABILITY)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx