Re: A lot of pg repair, IO performance drops seriously


 



Hi,

maybe you could try setting 'ceph osd set nodeep-scrub' and wait for the cluster to settle. Then you should also reduce osd_max_scrubs to 1 or 2 so the OSDs are not overloaded; that should (hopefully) resolve the slow requests. The warning "Too many repaired reads on 1 OSDs" can be dealt with later; it is probably not critical at the moment. Once the slow requests have cleared, you can repair one PG at a time after inspecting the output of 'rados -p <POOL> list-inconsistent-obj <PG_ID>', for example along these lines:
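(<POOL> and <PG_ID> are placeholders here, and the exact flag syntax may differ slightly between releases:)

ceph osd set nodeep-scrub
ceph tell osd.* injectargs '--osd_max_scrubs 1'
rados -p <POOL> list-inconsistent-obj <PG_ID>
ceph pg repair <PG_ID>
ceph osd unset nodeep-scrub

If I remember correctly, newer releases also have 'ceph tell osd.<ID> clear_shards_repaired' to reset the repaired-reads counter once everything is healthy again, but please check that against your version first.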


Quoting Frank Lee <by.yecao@xxxxxxxxx>:

Hi again,

My Ceph cluster started showing this warning a while ago: 3 pgs not deep-scrubbed in time.

I googled and tried increasing osd_scrub_begin_hour and osd_scrub_end_hour, but that did not seem to help.

There was a discussion on the Proxmox forum about a similar situation; that user ran "ceph osd
repair all" and got it fixed, but a day after I executed it nothing had changed. While searching further I came across a blog post and ran:

ceph tell osd.* injectargs --osd_max_scrubs=100
ceph tell mon.* injectargs --osd_max_scrubs=100

That was the wrong move, and the madness began: PGs started piling up in
active+clean+scrubbing+deep+repair

I lowered the setting again immediately, but it was too late. Now:

  cluster:
    id: 48ff8b6e-1203-4dc8-b16e-d1e89f66e28f
    health: HEALTH_ERR
            110 scrub errors
            Too many repaired reads on 1 OSDs
            Possible data damage: 12 pgs inconsistent
            16 pgs not deep-scrubbed in time
            23 slow ops, oldest one blocked for 183 sec, daemons
            [osd.1,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.22]... have slow ops.

  services:
    mon: 3 daemons, quorum ceph-node-1,ceph-node-2,ceph-node-3 (age 5M)
    mgr: ceph-node-2(active, since 7M), standbys: ceph-node-1, ceph-node-3
    osd: 32 osds: 32 up (since 21h), 32 in (since 4M)

  data:
    pools: 2 pools, 1025 pgs
    objects: 6.78M objects, 25 TiB
    usage: 76 TiB used, 41 TiB / 118 TiB avail
    pgs:     624 active+clean
             389 active+clean+scrubbing+deep+repair
             12  active+clean+scrubbing+deep+inconsistent

  io:
    client: 6.9 MiB/s rd, 18 MiB/s wr, 648 op/s rd, 1.21k op/s wr

ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
 31                  11                 11
 28                  17                 17
 25                   1                  1
 24                   5                  5
 21                   1                  1
 17                   6                  6
  7                   0                  0
 30                  16                 16
 29                  13                 13
 26                  37                 37
 19                   6                  6
  3                  12                 12
  2                   4                  4
  1                   2                  2
  0                  15                 15
 13                  27                 27
 15                  33                 33
 12                  21                 21
 14                  36                 36
 18                  15                 15
  9                  26                 26
  8                   5                  5
  6                   1                  1
  5                   1                  1
  4                   6                  6
 27                   1                  1
 23                   5                  5
 10                  11                 11
 11                  17                 17
 20                   6                  6
 16                   6                  6
 22                   0                  0

And in the past >30 hours, apart from the number of inconsistent PGs increasing, the
count of active+clean+scrubbing+deep+repair PGs has not changed.

The ceph configuration is now:

ceph tell osd.* injectargs '--osd_scrub_begin_hour 0'
ceph tell osd.* injectargs '--osd_scrub_end_hour 0'
ceph tell mon.* injectargs '--osd_scrub_begin_hour 0'
ceph tell mon.* injectargs '--osd_scrub_end_hour 0'

ceph tell osd.* injectargs '--osd_max_scrubs 10'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 5'
ceph tell osd.* injectargs '--osd_scrub_chunk_max 25'
ceph tell osd.* injectargs '--osd_deep_scrub_stride 196608'
ceph tell osd.* injectargs '--osd_scrub_priority 5'
ceph tell osd.* injectargs '--osd_scrub_load_threshold 10'
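(For reference, I believe the values actually in effect can be checked with something like the following; the second form has to be run on the node hosting that OSD:)

ceph config show osd.0 osd_max_scrubs
ceph daemon osd.0 config get osd_max_scrubs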

What should I do? Do I just have to wait for Ceph to finish? Or is there
any way to stop the repair? I have heard that restarting the OSDs can help,
but I am afraid to do that now because it may make the errors worse.

Thanks for any suggestions!


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


