On 15-08-2022 08:24, Satoru Takeuchi wrote:
On Sat, Aug 13, 2022 at 1:35 Robert W. Eckert <rob@xxxxxxxxxxxxxxx> wrote:
Interesting, a few weeks ago I added a new disk to each node of my
3-node cluster and saw the same ~2 MiB/s recovery. What I noticed was
that one OSD was using very high CPU and seemed to be the primary for
the affected PGs. I couldn't find anything overly wrong with the OSD,
network, etc.
You may want to look at the output of
ceph pg ls
to see if the recovery is sourced from one specific OSD or host, then
check that host/OSD for high CPU or memory use.
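Something like this, for example (a rough sketch: the field number for
ACTING_PRIMARY varies between releases, so check the header line of
`ceph pg ls` on your version and adjust the awk field accordingly):

```shell
# Count backfilling PGs per acting-primary OSD.  Field 15 is assumed to
# be ACTING_PRIMARY -- verify against the header on your release.
ceph pg ls backfilling \
  | awk 'NR > 1 && $1 ~ /^[0-9a-f]+\./ { count[$15]++ }
         END { for (osd in count) print count[osd], "backfilling pgs with primary osd." osd }' \
  | sort -rn
```

If one OSD dominates the list, that's the one to inspect.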
You probably hit this bug:
https://tracker.ceph.com/issues/56530
It can be bypassed by setting "osd_op_queue=wpq" configuration.
Thanks both of you. Doing "ceph config set osd osd_op_queue wpq" and
restarting the OSDs seems to have fixed it.
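For the record, roughly what that amounts to (this is a cephadm-managed
cluster; the orchestrator service name may differ on other setups, and
with plain systemd units you would restart ceph-osd@<id>.service on
each host instead):

```shell
# osd_op_queue is only read at daemon startup, so the OSDs must be
# restarted after changing it.
ceph config set osd osd_op_queue wpq

# Restart via the orchestrator (cephadm); adjust the service name if
# your OSD service is named differently.
ceph orch restart osd
```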
Best regards,
Torkil
Best,
Satoru
-----Original Message-----
From: Torkil Svensgaard <torkil@xxxxxxxx>
Sent: Friday, August 12, 2022 7:50 AM
To: ceph-users@xxxxxxx
Cc: Ruben Vestergaard <rkv@xxxxxxxx>
Subject: Recovery very slow after upgrade to quincy
6 hosts with 2 x 10G NICs, data in a 2+2 EC pool. Running 17.2.0,
upgraded from Pacific.
  cluster:
    id:
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            2071 pgs not deep-scrubbed in time
            837 pgs not scrubbed in time

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
    mgr:        dcn-ceph-01.dzercj(active, since 6h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 246.02M objects, 337 TiB
    usage:   665 TiB used, 688 TiB / 1.3 PiB avail
    pgs:     42128281/978408875 objects misplaced (4.306%)
             2332 active+clean
             281  active+clean+snaptrim_wait
             66   active+remapped+backfilling
             36   active+clean+snaptrim
             11   active+clean+scrubbing+deep
             8    active+clean+scrubbing
             1    active+clean+scrubbing+deep+snaptrim_wait
             1    active+clean+scrubbing+deep+snaptrim
             1    active+clean+scrubbing+snaptrim

  io:
    client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
    recovery: 2.0 MiB/s, 3 objects/s
Low load, low latency, low network traffic. Tried
osd_mclock_profile=high_recovery_ops; no difference. Disabled scrubs
and snaptrim; no difference either.
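For reference, the knobs I looked at (osd.0 below is just an example
daemon to query):

```shell
# mclock became the default op queue scheduler in Quincy.  `config get`
# shows the stored value, `config show` what a running daemon reports.
ceph config get osd osd_op_queue
ceph config show osd.0 osd_op_queue
ceph config set osd osd_mclock_profile high_recovery_ops
```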
Am I missing something obvious I should have done after the upgrade?
Best regards,
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
email to ceph-users-leave@xxxxxxx
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark