On 24-03-2024 13:41, Tyler Stachecki wrote:
On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
Hi
... Using mclock with high_recovery_ops profile.
What is the bottleneck here? I would have expected a huge number of
simultaneous backfills. Backfill reservation logjam?
mClock is very buggy in my experience and frequently leads to issues like
this. Try using regular backfill and see if the problem goes away.
Hi Tyler
Just tried switching to wpq, same thing.
I'm inclined to think it must be a read reservation logjam of some sort,
given that increasing osd_max_backfills had an immediate effect and we
have 4 empty hosts as main write targets. Here's the output for one such
OSD from the script Alexander linked:
"
osd.539: gimpy =>0 0.0B <=42 1.54T (Δ1.54T) drive=9.0% 358.72G/3.90T crush=9.0% 358.72G/3.90T
<-11.54f waiting 44.8G 539<-421 1070 of 230320, 0.5%
<-37.538 waiting 22.3G 539<-121 2819 of 556087, 0.5%
<-11.507 waiting 45.1G 539<-61 912 of 139227, 0.7%
<-37.450 waiting 22.3G 539<-220 1235 of 632776, 0.2%
<-11.458 waiting 45.3G 539<-121 178 of 279150, 0.1%
<-37.47c waiting 22.2G 539<-83 2281 of 634472, 0.4%
<-37.434 waiting 22.1G 539<-78 9496 of 316052, 3.0%
<-11.3f3 waiting 44.9G 539<-109 2375 of 231055, 1.0%
<-37.3d3 waiting 22.0G 539<-73 2144 of 316508, 0.7%
<-37.3c5 waiting 22.2G 539<-83 313880 of 313880, 100.0%
<-11.3c1 waiting 44.8G 539<-223 93878 of 230270, 40.8%
<-37.3a4 waiting 21.9G 539<-85 4604 of 315504, 1.5%
<-11.344 waiting 44.5G 539<-63 728 of 182876, 0.4%
<-6.1ca waiting 100.9G 539<-443 36076 of 56270, 64.1%
<-4.1a2 waiting 157.6G 539<-218 508 of 91456, 0.6%
<-37.1ba waiting 22.2G 539<-64 316848 of 316848, 100.0%
<-37.84 waiting 22.0G 539<-33 4380 of 237633, 1.8%
<-37.ad waiting 22.2G 539<-77 6730 of 396635, 1.7%
<-37.36 waiting 22.1G 539<-47 2170 of 395955, 0.5%
<-11.b9 waiting 45.1G 539<-223 0 of 231940, 0.0%
<-37.11c waiting 22.1G 539<-33 9952 of 316448, 3.1%
<-11.144 waiting 45.1G 539<-207 528 of 278094, 0.2%
<-37.2ae waiting 22.1G 539<-224 2565 of 712539, 0.4%
<-37.285 waiting 22.0G 539<-65 441 of 315336, 0.1%
<-37.2ef waiting 22.0G 539<-414 2124 of 475410, 0.4%
<-37.674 waiting 22.0G 539<-56 60 of 236511, 0.0%
<-37.655 waiting 22.3G 539<-143 237316 of 237381, 100.0%
<-11.6b0 waiting 44.9G 539<-282 1131 of 277122, 0.4%
<-37.71a waiting 22.2G 539<-49 82865 of 315684, 26.2%
<-11.789 waiting 45.0G 539<-196 736 of 277584, 0.3%
<-11.7cf waiting 44.8G 539<-127 143 of 276582, 0.1%
<-11.7f2 waiting 45.2G 539<-272 145857 of 185680, 78.6%
<-37.7dd waiting 22.0G 539<-72 0 of 393475, 0.0%
<-37.7d9 waiting 22.2G 539<-37 930 of 237831, 0.4%
<-11.7fb waiting 45.2G 539<-78 1062 of 279042, 0.4%
<-37.7d2 waiting 22.0G 539<-71 2409 of 631368, 0.4%
<-11.8db waiting 44.9G 539<-84 108 of 277182, 0.0%
<-11.9b6 waiting 44.8G 539<-74 772 of 184432, 0.4%
<-11.b0b waiting 45.0G 539<-166 2569 of 231430, 1.1%
<-11.d42 waiting 45.2G 539<-118 15428 of 46429, 33.2%
<-11.d5f waiting 44.8G 539<-64 4 of 184356, 0.0%
<-11.d98 waiting 45.1G 539<-418 0 of 278568, 0.0%
"
All of them are waiting for something.
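To tally a dump like the one above (incoming PGs per pool, how many are still "waiting", which ones already report all work done), a small Python sketch may help. It assumes the per-PG line layout is "<-PGID STATE SIZE TARGET<-SOURCE DONE of TOTAL, PCT%", which is my reading of the script's output rather than anything documented, and it treats "DONE of TOTAL" as an opaque progress counter. The names IncomingPG, parse and summarize are hypothetical helpers, not part of any Ceph tool.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Assumed (inferred, undocumented) per-PG line layout:
#   <-PGID STATE SIZE TARGET<-SOURCE DONE of TOTAL, PCT%
LINE_RE = re.compile(
    r"^<-(?P<pgid>\d+\.[0-9a-f]+)\s+(?P<state>\w+)\s+"
    r"(?P<size>[\d.]+)(?P<unit>[KMGT])\s+"
    r"(?P<target>\d+)<-(?P<source>\d+)\s+"
    r"(?P<done>\d+) of (?P<total>\d+),\s+(?P<pct>[\d.]+)%$"
)

# Binary multipliers for the size suffix (assumption: G means GiB, etc.)
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


@dataclass
class IncomingPG:  # hypothetical record for one incoming PG
    pgid: str
    state: str
    size_bytes: float
    source_osd: int
    done: int
    total: int

    @property
    def pool(self) -> int:
        # The pool id is the part of the PG id before the dot, e.g. 37 in 37.3c5
        return int(self.pgid.split(".")[0])


def parse(text: str) -> list[IncomingPG]:
    """Return one IncomingPG per matching line; other lines are skipped."""
    pgs = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. the "osd.539: ..." header line
        pgs.append(IncomingPG(
            pgid=m["pgid"],
            state=m["state"],
            size_bytes=float(m["size"]) * UNITS[m["unit"]],
            source_osd=int(m["source"]),
            done=int(m["done"]),
            total=int(m["total"]),
        ))
    return pgs


def summarize(pgs: list[IncomingPG]) -> dict:
    """Count PGs by state and pool, and list PGs whose counter reads DONE == TOTAL."""
    return {
        "count": len(pgs),
        "states": Counter(p.state for p in pgs),
        "per_pool": Counter(p.pool for p in pgs),
        "finished_but_waiting": [p.pgid for p in pgs if p.done == p.total],
        "incoming_tib": sum(p.size_bytes for p in pgs) / 2**40,
    }


# Usage on three lines copied from the osd.539 output above
sample = """osd.539: gimpy =>0 0.0B <=42 1.54T (Δ1.54T) drive=9.0% 358.72G/3.90T crush=9.0% 358.72G/3.90T
<-11.54f waiting 44.8G 539<-421 1070 of 230320, 0.5%
<-37.3c5 waiting 22.2G 539<-83 313880 of 313880, 100.0%
<-6.1ca waiting 100.9G 539<-443 36076 of 56270, 64.1%
"""
s = summarize(parse(sample))
print(s["count"], s["finished_but_waiting"])  # 3 ['37.3c5']
```

Run over the full dump, this would make PGs such as 37.3c5 and 37.1ba stand out: their counters read DONE equal to TOTAL while they are still listed as waiting.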
Mvh.
Torkil
Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark