Hi,
maybe you could try setting 'ceph osd set nodeep-scrub' and wait for
the cluster to settle. Then you should also reduce osd_max_scrubs to 1
or 2 so the OSDs don't get overloaded; that should (hopefully) resolve
the slow requests. The warning "Too many repaired reads on 1 OSDs" can
be dealt with later, it's probably not critical at the moment.
If the slow requests resolve, you can repair one PG at a time after
inspecting the output of 'rados -p <POOL> list-inconsistent-obj
<PG_ID>'.
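
Roughly something like this (just a sketch, not tested against your
cluster; <POOL> and <PG_ID> stand for the inconsistent PGs that
'ceph health detail' reports):

# stop scheduling new deep-scrubs so the backlog can drain
ceph osd set nodeep-scrub

# bring the scrub concurrency back down on all OSDs
ceph tell osd.* injectargs '--osd_max_scrubs 1'

# find and inspect the inconsistent PGs, one at a time
ceph health detail | grep inconsistent
rados -p <POOL> list-inconsistent-obj <PG_ID> --format=json-pretty

# if the output looks sane, repair that single PG and wait for it to finish
ceph pg repair <PG_ID>

# once the cluster is healthy again
ceph osd unset nodeep-scrub

Newer releases can also reset the "Too many repaired reads" counter with
'ceph tell osd.<ID> clear_shards_repaired', but as said, that can wait.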
Quoting Frank Lee <by.yecao@xxxxxxxxx>:
Hi again,
A while ago my Ceph cluster started warning: 3 pgs not deep-scrubbed in
time.
I googled it and tried increasing osd_scrub_begin_hour and
osd_scrub_end_hour, but that did not seem to work.
There was a discussion on the Proxmox forum about a similar situation
where someone ran "ceph osd repair all" and got it fixed, but a day
after I executed it nothing had improved. While searching further I
came across a blog and ran:
ceph tell osd.* injectargs --osd_max_scrubs=100
ceph tell mon.* injectargs --osd_max_scrubs=100
That was the wrong move, and the madness began: PGs went into
active+clean+scrubbing+deep+repair. I lowered the setting again
immediately, but it was too late. Now:
  cluster:
    id:     48ff8b6e-1203-4dc8-b16e-d1e89f66e28f
    health: HEALTH_ERR
            110 scrub errors
            Too many repaired reads on 1 OSDs
            Possible data damage: 12 pgs inconsistent
            16 pgs not deep-scrubbed in time
            23 slow ops, oldest one blocked for 183 sec, daemons
            [osd.1,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.22]... have slow ops.

  services:
    mon: 3 daemons, quorum ceph-node-1,ceph-node-2,ceph-node-3 (age 5M)
    mgr: ceph-node-2(active, since 7M), standbys: ceph-node-1, ceph-node-3
    osd: 32 osds: 32 up (since 21h), 32 in (since 4M)

  data:
    pools:   2 pools, 1025 pgs
    objects: 6.78M objects, 25 TiB
    usage:   76 TiB used, 41 TiB / 118 TiB avail
    pgs:     624 active+clean
             389 active+clean+scrubbing+deep+repair
             12  active+clean+scrubbing+deep+inconsistent

  io:
    client: 6.9 MiB/s rd, 18 MiB/s wr, 648 op/s rd, 1.21k op/s wr
ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
 31                  11                 11
 28                  17                 17
 25                   1                  1
 24                   5                  5
 21                   1                  1
 17                   6                  6
  7                   0                  0
 30                  16                 16
 29                  13                 13
 26                  37                 37
 19                   6                  6
  3                  12                 12
  2                   4                  4
  1                   2                  2
  0                  15                 15
 13                  27                 27
 15                  33                 33
 12                  21                 21
 14                  36                 36
 18                  15                 15
  9                  26                 26
  8                   5                  5
  6                   1                  1
  5                   1                  1
  4                   6                  6
 27                   1                  1
 23                   5                  5
 10                  11                 11
 11                  17                 17
 20                   6                  6
 16                   6                  6
 22                   0                  0
And over the past >30 hours, apart from the increase in inconsistent
PGs, the number of active+clean+scrubbing+deep+repair PGs has not
changed.
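For reference, I count them with something like the following (it just
greps the state string above out of the PG dump):

ceph pg dump pgs_brief 2>/dev/null | grep -c 'scrubbing+deep+repair'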
My current Ceph scrub configuration (all set at runtime via injectargs):
ceph tell osd.* injectargs '--osd_scrub_begin_hour 0'
ceph tell osd.* injectargs '--osd_scrub_end_hour 0'
ceph tell mon.* injectargs '--osd_scrub_begin_hour 0'
ceph tell mon.* injectargs '--osd_scrub_end_hour 0'
ceph tell osd.* injectargs '--osd_max_scrubs 10'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 5'
ceph tell osd.* injectargs '--osd_scrub_chunk_max 25'
ceph tell osd.* injectargs '--osd_deep_scrub_stride 196608'
ceph tell osd.* injectargs '--osd_scrub_priority 5'
ceph tell osd.* injectargs '--osd_scrub_load_threshold 10'
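(These were all injected at runtime; the values actually active on an
OSD can be double-checked from its host with something like
'ceph daemon osd.0 config show | grep scrub'.)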
What should I do? Do I just have to wait for Ceph to finish? Or is
there any way to stop the repairs? I have heard that restarting the
OSDs can help, but I am afraid to do that now because it might make
the errors worse.
Thanks for any suggestions!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx