Re: Large number of misplaced PGs but little backfill going on

Torkil Svensgaard <torkil@xxxxxxxx> · Mon, 25 Mar 2024 21:28:01 +0100

Neither downing nor restarting the OSD cleared the bogus blocked_by. I
guess it makes no sense to look further at blocked_by as the cause when 
the data can't be trusted and there is no obvious smoking gun like a few 
OSDs blocking everything.

My tally came to 412 out of 539 OSDs showing up in a blocked_by list, which
is about every OSD that held data before we added ~100 empty OSDs. How
400 read targets and 100 write targets can only equal ~60 backfills with
osd_max_backfills set at 3 just makes no sense to me, but alas.

It seems I can just increase osd_max_backfills even further to get the
numbers I want so that will do. Thank you all for taking the time to 
look at this.

Mvh.

Torkil

On 25-03-2024 20:44, Anthony D'Atri wrote:
First try "ceph osd down 89"

On Mar 25, 2024, at 15:37, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:

On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote:

On 24/03/2024 01:14, Torkil Svensgaard wrote:
On 24-03-2024 00:31, Alexander E. Patrakov wrote:
Hi Torkil,

Hi Alexander

Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?

I'll take a look at that tomorrow, perhaps we can script something
meaningful.

Hi Alexander

While working on a script querying all PGs and making a list of all OSDs
found in a blocked_by list, and how many times for each, I discovered
something odd about pool 38:

"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:
<snip>

All PGs in the pool are active+clean so why are there any blocked_by at
all? One example attached.
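
A minimal sketch of the kind of tally described above. This is not the blocked_by.sh script from the thread, whose contents were not posted. It assumes each "ceph pg <pgid> query" output has already been parsed from JSON (for example with json.load), and it makes no assumption about where "blocked_by" lives in that structure: it simply collects every list stored under a key with that name. The sample documents are hypothetical and heavily trimmed.

```python
from collections import Counter


def collect_blocked_by(node, found=None):
    """Recursively gather every OSD id stored under a key named "blocked_by"."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "blocked_by" and isinstance(value, list):
                found.extend(value)
            else:
                collect_blocked_by(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_blocked_by(item, found)
    return found


def tally_blocked_by(query_documents):
    """Count how often each OSD id appears across many parsed query outputs."""
    counts = Counter()
    for doc in query_documents:
        counts.update(collect_blocked_by(doc))
    return counts


# Hypothetical stand-ins for parsed "ceph pg <pgid> query" output.
sample_docs = [
    {
        "info": {"stats": {"blocked_by": [89]}},
        "peer_info": [{"stats": {"blocked_by": [89, 12]}}],
    },
    {"info": {"stats": {"blocked_by": []}}},
]

for osd_id, hits in tally_blocked_by(sample_docs).most_common():
    print(f"osd.{osd_id}: {hits}")
```

An OSD that dominates such a tally would match the "one OSD blocks them all" pattern; a flat distribution over most OSDs, as reported here, would not.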

I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e., the one being blocked).

Mvh.

Torkil

I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt

Output attached.

Thanks again.

Mvh.

Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx>
wrote:

Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.

Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:
Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic; it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx>
wrote:

On 23-03-2024 21:19, Alexander E. Patrakov wrote:
Hi Torkil,

Hi Alexander

I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.

Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting
temporarily?

We already had the mclock override option in place and I re-enabled our
babysitter script, which sets osd_max_backfills per OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53, which is
probably because the default osd_max_backfills for mclock is 1.
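
A rough sketch of what such a babysitter script could look like. The actual script was not posted, so the 70% and 85% thresholds and the utilization figures below are purely illustrative assumptions. The sketch only builds "ceph config set" command lines and does not run them; as noted above, the mclock override option has to be in place for the setting to have any effect.

```python
def backfills_for_utilization(percent_full):
    """Map OSD utilization (0-100) to an osd_max_backfills value from 1 to 3.

    The 70/85 thresholds are illustrative assumptions, not values from the thread.
    """
    if percent_full >= 85:
        return 1
    if percent_full >= 70:
        return 2
    return 3


def config_commands(utilization_by_osd):
    """Build one "ceph config set" command line per OSD, without running it."""
    return [
        f"ceph config set osd.{osd_id} osd_max_backfills "
        f"{backfills_for_utilization(pct)}"
        for osd_id, pct in sorted(utilization_by_osd.items())
    ]


# Hypothetical utilization figures; real ones could be taken from "ceph osd df".
for line in config_commands({12: 91.5, 89: 74.0, 237: 40.2}):
    print(line)
```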

I think 53 is still a low number of active backfills given the large
percentage misplaced.

It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.
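
The commands above can be generated rather than typed by hand. A small sketch, assuming PG ids are written as the pool id followed by a hexadecimal index, which is consistent with the file name query.11.fff.txt used elsewhere in this thread for a pool with 4096 PGs. The helper names are made up for illustration, and the sketch only prints command lines.

```python
def pg_ids(pool_id, pg_num):
    """Return PG ids for a pool as "<pool>.<hex index>", e.g. 37.0 ... 37.7ff."""
    return [f"{pool_id}.{index:x}" for index in range(pg_num)]


def query_commands(pool_id, pg_num):
    """Build shell command lines that dump each PG query into its own file."""
    return [
        f"ceph pg {pgid} query > query.{pgid}.txt"
        for pgid in pg_ids(pool_id, pg_num)
    ]


# Pool 37 has pg_num 2048 according to the "ceph osd pool ls detail"
# output quoted in this thread.
commands = query_commands(37, 2048)
print(commands[0])
print(commands[-1])
print(len(commands))
```

For a first look, a handful of these is enough; slicing the list (for example commands[:10]) keeps the sample small.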

A few samples attached.

Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. Still, please don't
set min_size to 4: then any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter that
went offline would be useless because they do not correspond to the
updated shards written by the clients.
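
The arithmetic behind this can be written down as a short sketch. It assumes the shards of every PG are spread evenly over the datacenters and that exactly one datacenter goes offline; the function names are made up for illustration.

```python
def shards_left_after_dc_loss(datacenters, shards_per_dc):
    """Shards of one PG still reachable when exactly one datacenter is offline."""
    return (datacenters - 1) * shards_per_dc


def stays_active(datacenters, shards_per_dc, min_size):
    """A PG needs at least min_size shards available to remain active."""
    return shards_left_after_dc_loss(datacenters, shards_per_dc) >= min_size


# rbd_ec_data: 4+2, two shards in each of 3 datacenters, min_size 5.
print(shards_left_after_dc_loss(3, 2), stays_active(3, 2, 5))

# A 4+5 layout: three shards in each of 3 datacenters, min_size 5.
print(shards_left_after_dc_loss(3, 3), stays_active(3, 3, 5))
```

With 4+2 only 4 shards remain, below min_size 5, so the PGs go inactive. With 4+5 there are 6 shards left, which keeps the PGs active at min_size 5 with one shard to spare.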

Thanks for the explanation. This is an old pool predating the 3 DC
setup, and we'll migrate the data to a 4+5 pool when we can.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
number of data chunks. See above for why that is bad. Please set
min_size to 5.

Thanks, that was a leftover for getting the PGs to peer (stuck at
creating+incomplete) when we created the pool. It's back to 5 now.

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.

There is very little data in this pool; that is probably the main
reason.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs; the observed maximum is 347. Please set it to 400.

Copy that. Didn't seem to make a difference though, and we have
osd_max_pg_per_osd_hard_ratio set to 5.000000.

Mvh.

Torkil

On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard
<torkil@xxxxxxxx> wrote:

On 23-03-2024 19:05, Alexander E. Patrakov wrote:
Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.

[root@lazy ~]# ceph osd pool ls detail | grep erasure
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd

[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
crush-device-class=ssd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8

But as I understand it those profiles are only used to create the
initial crush rule for the pool, and we have manually edited those along
the way. Here are the 3 rules in use for the 3 EC pools:

rule rbd_ec_data {
            id 0
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default class hdd
            step choose indep 0 type datacenter
            step chooseleaf indep 2 type host
            step emit
}
rule cephfs.hdd.data {
            id 7
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default class hdd
            step choose indep 0 type datacenter
            step chooseleaf indep 3 type host
            step emit
}
rule rbd.ssd.data {
            id 8
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default class ssd
            step choose indep 0 type datacenter
            step chooseleaf indep 3 type host
            step emit
}

These should first pick all 3 datacenters in the choose step and then
either 2 or 3 hosts per datacenter in the chooseleaf step, matching
EC 4+2 and 4+5 respectively.
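
As a quick sanity check of that reasoning, the number of OSDs a rule selects can be compared with k+m. This is only a sketch of the counting argument, assuming the CRUSH root contains exactly 3 datacenters and that every datacenter can supply the requested number of hosts; it does not emulate CRUSH itself, and the function names are made up for illustration.

```python
def osds_selected(datacenters, hosts_per_datacenter):
    """OSDs picked when every datacenter contributes the same number of hosts."""
    return datacenters * hosts_per_datacenter


def rule_matches_profile(datacenters, hosts_per_datacenter, k, m):
    """True when the rule yields exactly one OSD per erasure-coded shard."""
    return osds_selected(datacenters, hosts_per_datacenter) == k + m


# rbd_ec_data: chooseleaf indep 2 type host, profile k=4 m=2.
print(osds_selected(3, 2), rule_matches_profile(3, 2, k=4, m=2))

# cephfs.hdd.data and rbd.ssd.data: chooseleaf indep 3 type host, k=4 m=5.
print(osds_selected(3, 3), rule_matches_profile(3, 3, k=4, m=5))
```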

Mvh.

Torkil

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:

Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.

It would help if you provided the output of "ceph osd pool ls detail".

On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:

Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
<torkil@xxxxxxxx> wrote:

On 2024-03-23 17:54, Kai Stian Olstad wrote:
On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard
wrote:

The other output is too big for pastebin and I'm not familiar with
paste services. Any suggestion for a preferred way to share such
output?

You can attach files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks,
attached.

Mvh.

Torkil

--
Alexander E. Patrakov

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx