On 23-03-2024 21:19, Alexander E. Patrakov wrote:
Hi Torkil,
Hi Alexander
I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.
Thank you for taking the time =)
What happens if you increase the osd_max_backfills setting
temporarily?
We already had the mclock override option in place, and I re-enabled our
babysitter script, which sets osd_max_backfills per OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53, which is
probably because the default osd_max_backfills for mclock is 1.
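The babysitter script itself was not posted. As a rough illustration of the idea only, a minimal sketch could look like the following. The function names, the 70%/85% thresholds, and the direction of the mapping (fuller OSDs get fewer backfills) are all assumptions, not details from this thread.

```python
def backfills_for_utilization(utilization, low=0.70, high=0.85):
    """Map an OSD fill fraction (0.0-1.0) to an osd_max_backfills value of 1-3.

    The thresholds are hypothetical; fuller OSDs are assumed to get fewer
    concurrent backfills.
    """
    if utilization < low:
        return 3
    if utilization < high:
        return 2
    return 1


def build_commands(osd_utilization):
    """Return one 'ceph config set' argument list per OSD id."""
    commands = []
    for osd_id, utilization in sorted(osd_utilization.items()):
        value = backfills_for_utilization(utilization)
        commands.append(
            ["ceph", "config", "set", f"osd.{osd_id}", "osd_max_backfills", str(value)]
        )
    return commands


if __name__ == "__main__":
    # Made-up utilization figures, for illustration only.
    sample = {0: 0.55, 1: 0.78, 2: 0.91}
    for cmd in build_commands(sample):
        print(" ".join(cmd))
```

The sketch only prints the commands; a real script would read per-OSD utilization from the cluster and run them.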
I think 53 is still a low number of active backfills given the large
percentage misplaced.
It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:
ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.
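For more than a handful of PGs, the same dump can be scripted. This is a minimal sketch, assuming the list of stalled PG IDs has already been collected by hand; the helper names are made up for illustration, and only the "ceph pg <pgid> query" command itself comes from the message above.

```python
import subprocess


def query_command(pgid):
    """Argument list for 'ceph pg <pgid> query'."""
    return ["ceph", "pg", pgid, "query"]


def query_filename(pgid):
    """Output file name in the same style as the commands above."""
    return f"query.{pgid}.txt"


def dump_pg_queries(pgids):
    """Run the query for each PG and write its output to one file per PG."""
    for pgid in pgids:
        with open(query_filename(pgid), "w") as out:
            subprocess.run(query_command(pgid), stdout=out, check=True)
```

Calling dump_pg_queries(["37.0", "37.1"]) on a node with a working ceph CLI would produce query.37.0.txt and query.37.1.txt.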
A few samples attached.
Still, I must say that some of your rules are actually unsafe.
The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.
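The arithmetic behind this can be checked with a small sketch. It assumes three datacenters and an even spread of shards per datacenter, as the rules place them; the function names are hypothetical.

```python
def shards_left(datacenters, shards_per_datacenter, datacenters_down=1):
    """Shards of one PG that stay reachable when some datacenters are offline."""
    return (datacenters - datacenters_down) * shards_per_datacenter


def after_datacenter_loss(k, min_size, datacenters=3, shards_per_datacenter=2):
    """Summarize the state of a PG after losing one datacenter."""
    left = shards_left(datacenters, shards_per_datacenter)
    return {
        "shards_left": left,
        # Data can be reconstructed only from at least k shards.
        "data_recoverable": left >= k,
        # The PG serves I/O only with at least min_size shards.
        "pg_active": left >= min_size,
        # How many more shards may fail before data is lost.
        "further_failures_tolerated": max(left - k, 0),
    }


if __name__ == "__main__":
    # 4+2 pool, 2 shards per datacenter, min_size 5: intact but inactive.
    print(after_datacenter_loss(k=4, min_size=5, shards_per_datacenter=2))
    # Same pool with min_size 4: active, but with no margin left.
    print(after_datacenter_loss(k=4, min_size=4, shards_per_datacenter=2))
    # 4+5 pool, 3 shards per datacenter, min_size 5: active with margin.
    print(after_datacenter_loss(k=4, min_size=5, shards_per_datacenter=3))
```

With 4+2, one datacenter down leaves exactly k = 4 shards, so the PG is either inactive (min_size 5) or active with zero tolerance for a further failure (min_size 4).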
Thanks for the explanation. This is an old pool predating the 3 DC
setup, and we'll migrate the data to a 4+5 pool when we can.
The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
number of data chunks. See above for why this is bad. Please set min_size
to 5.
Thanks, that was a leftover from getting the PGs to peer (stuck at
creating+incomplete) when we created the pool. It's back to 5 now.
The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.
There is very little data in this pool, which is probably the main
reason.
Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs; the observed maximum is 347. Please set it to 400.
Copy that. Didn't seem to make a difference though, and we have
osd_max_pg_per_osd_hard_ratio set to 5.000000.
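As far as the Ceph documentation describes it, an OSD stops accepting new PGs once its PG count exceeds mon_max_pg_per_osd multiplied by osd_max_pg_per_osd_hard_ratio. A minimal sketch of that check, assuming this reading of the two options is correct (the function names are made up):

```python
def hard_pg_limit(mon_max_pg_per_osd, hard_ratio):
    """PG count above which an OSD is assumed to refuse new PGs."""
    return int(mon_max_pg_per_osd * hard_ratio)


def check_osd(pg_count, mon_max_pg_per_osd, hard_ratio):
    """Compare one OSD's PG count against both thresholds."""
    return {
        "over_mon_max": pg_count > mon_max_pg_per_osd,
        "over_hard_limit": pg_count > hard_pg_limit(mon_max_pg_per_osd, hard_ratio),
    }


if __name__ == "__main__":
    # Values from this thread: 347 PGs observed, limit 400, ratio 5.0.
    print(hard_pg_limit(400, 5.0))
    print(check_osd(347, 400, 5.0))
```

Under that assumption, the observed maximum of 347 PGs is far below the hard limit with a ratio of 5, which would be consistent with the setting change making no visible difference.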
Best regards,
Torkil
On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard
<torkil@xxxxxxxx> wrote:
On 23-03-2024 19:05, Alexander E. Patrakov wrote:
Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:
ceph osd erasure-code-profile get prf-for-ec-data
...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.
[root@lazy ~]# ceph osd pool ls detail | grep erasure
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
crush-device-class=ssd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8
But as I understand it, those profiles are only used to create the
initial CRUSH rule for the pool, and we have manually edited those along
the way. Here are the 3 rules in use for the 3 EC pools:
rule rbd_ec_data {
    id 0
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 0 type datacenter
    step chooseleaf indep 2 type host
    step emit
}
rule cephfs.hdd.data {
    id 7
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 0 type datacenter
    step chooseleaf indep 3 type host
    step emit
}
rule rbd.ssd.data {
    id 8
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class ssd
    step choose indep 0 type datacenter
    step chooseleaf indep 3 type host
    step emit
}
These rules should first pick all 3 datacenters in the choose step and
then either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
respectively.
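As a quick sanity check of that claim, the number of OSDs each rule emits can be compared with k+m of the matching profile. This is a minimal sketch that models only the counting, not CRUSH itself; the table and function names are made up, and it assumes the choose step really does return all 3 datacenters.

```python
# (k, m, hosts picked per datacenter by the chooseleaf step), per pool.
RULES = {
    "rbd_ec_data": (4, 2, 2),
    "cephfs.hdd.data": (4, 5, 3),
    "rbd.ssd.data": (4, 5, 3),
}


def shards_placed(datacenters, hosts_per_datacenter):
    """OSDs emitted: one per chosen host, in every chosen datacenter."""
    return datacenters * hosts_per_datacenter


def rule_matches_profile(k, m, hosts_per_datacenter, datacenters=3):
    """True if the rule emits exactly k+m OSDs."""
    return shards_placed(datacenters, hosts_per_datacenter) == k + m


if __name__ == "__main__":
    for name, (k, m, hosts) in RULES.items():
        print(name, shards_placed(3, hosts), rule_matches_profile(k, m, hosts))
```

With 3 datacenters, 2 hosts each give 6 OSDs for the 4+2 pool and 3 hosts each give 9 OSDs for the 4+5 pools.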
Best regards,
Torkil
On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:
Hi Torkil,
I take my previous response back.
You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.
It would help if you provided the output of "ceph osd pool ls detail".
On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:
Hi Torkil,
Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.
What's your "mon max pg per osd" setting?
On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
<torkil@xxxxxxxx> wrote:
On 2024-03-23 17:54, Kai Stian Olstad wrote:
On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
The other output is too big for pastebin and I'm not familiar with
paste services. Any suggestion for a preferred way to share such
output?
You can attach files to the mail here on the list.
Doh, for some reason I was sure attachments would be stripped. Thanks,
attached.
Best regards,
Torkil
--
Alexander E. Patrakov
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark