Re: Large number of misplaced PGs but little backfill going on

"Alexander E. Patrakov" <patrakov@xxxxxxxxx> · Sun, 24 Mar 2024 04:19:27 +0800

Hi Torkil,

I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.

What happens if you increase the osd_max_backfills setting temporarily?
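
A minimal sketch of what I mean, assuming the setting is managed through
the central config database. The function names are hypothetical and the
value 3 is only an illustrative choice. On releases that use the mClock
scheduler, the change may be ignored unless
osd_mclock_override_recovery_settings is enabled; please check this
against your version.

```shell
# Temporarily raise the backfill limit on all OSDs.
# The value is passed in; 3 in the example below is only illustrative.
raise_backfills() {
    ceph config set osd osd_max_backfills "$1"
}

# Remove the override again so the OSDs fall back to their default.
reset_backfills() {
    ceph config rm osd osd_max_backfills
}

# Example: raise_backfills 3, watch "ceph -s" for a while, then reset_backfills
```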

It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.
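
If there are many PGs to look at, a small loop saves typing. This is
only a sketch: the function name is hypothetical, and the PG IDs in the
example are placeholders for whichever PGs are actually stalled.

```shell
# Save "ceph pg <pgid> query" output to query.<pgid>.txt
# for every PG ID given as an argument.
dump_pg_queries() {
    for pgid in "$@"; do
        ceph pg "$pgid" query > "query.$pgid.txt"
    done
}

# Example (placeholder PG IDs): dump_pg_queries 37.0 37.1 11.0 11.1
```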

Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. For each PG, it chooses OSDs from
two hosts in each of the three datacenters, so 6 OSDs total. When a
datacenter is offline, you will therefore have only 4 OSDs up, which is
exactly the number of data chunks. The pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. Please don't set
min_size to 4, though - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter that
went offline would be useless because they would no longer match the
updated shards written by the clients in the meantime.
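
The arithmetic can be sketched as follows. The helper name is
hypothetical, and it assumes that every datacenter holds the same number
of shards for each PG, as your rules do.

```shell
# Hypothetical helper: does a pool stay active with one datacenter offline?
# Arguments: number of datacenters, shards per datacenter, pool min_size.
active_after_dc_loss() {
    dcs="$1"; per_dc="$2"; min_size="$3"
    remaining=$(( (dcs - 1) * per_dc ))
    if [ "$remaining" -ge "$min_size" ]; then
        echo "active: $remaining shards up, min_size $min_size"
    else
        echo "inactive: $remaining shards up, min_size $min_size"
    fi
}

active_after_dc_loss 3 2 5   # rbd_ec_data: 4+2, two hosts per datacenter
```

For rbd_ec_data this reports 4 shards up against min_size 5, so the PGs
go inactive. With three shards per datacenter and min_size 5, the same
call reports 6 shards up and the PGs stay active.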

The 4+5 rule as used by cephfs.hdd.data has min_size 4, which is equal
to the number of data chunks. See above for why this is bad. Please set
min_size to 5.
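
As a quick sanity check, here is a sketch that compares min_size with k
for the pools in your listing. The helper name is hypothetical; k+1 is
the min_size that Ceph assigns to new erasure-coded pools by default.

```shell
# Hypothetical helper: flag an EC pool whose min_size leaves no margin over k.
# Arguments: k (number of data chunks), current min_size of the pool.
check_min_size() {
    k="$1"; min_size="$2"
    if [ "$min_size" -le "$k" ]; then
        echo "unsafe: min_size=$min_size, recommended $((k + 1))"
    else
        echo "ok: min_size=$min_size"
    fi
}

check_min_size 4 4   # cephfs.hdd.data as shown by "ceph osd pool ls detail"
check_min_size 4 5   # rbd_ec_data and rbd.ssd.data

# To apply the change to the pool:
#   ceph osd pool set cephfs.hdd.data min_size 5
```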

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs; the observed maximum is 347. Please set it to 400.
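
Something like the following should do it. This is a sketch: the
function name is hypothetical, and setting the option at the global
level is my assumption about how your configuration is managed.

```shell
# Raise the per-OSD PG limit, then read the value back to confirm it.
set_pg_limit() {
    ceph config set global mon_max_pg_per_osd "$1"
    ceph config get mon mon_max_pg_per_osd
}

# Example: set_pg_limit 400
```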

On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>
>
>
> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> > Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> > is insufficient. For every erasure code profile mentioned in the
> > output, please also run something like this:
> >
> > ceph osd erasure-code-profile get prf-for-ec-data
> >
> > ...where "prf-for-ec-data" is the name that appears after the words
> > "erasure profile" in the "ceph osd pool ls detail" output.
>
> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> fast_read 1 compression_algorithm snappy compression_mode aggressive
> application rbd
> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> compression_algorithm zstd compression_mode aggressive application cephfs
> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
> compression_algorithm zstd compression_mode aggressive application rbd
>
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> crush-device-class=hdd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> crush-device-class=ssd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> But as I understand it those profiles are only used to create the
> initial crush rule for the pool, and we have manually edited those along
> the way. Here are the 3 rules in use for the 3 EC pools:
>
> rule rbd_ec_data {
>          id 0
>          type erasure
>          step set_chooseleaf_tries 5
>          step set_choose_tries 100
>          step take default class hdd
>          step choose indep 0 type datacenter
>          step chooseleaf indep 2 type host
>          step emit
> }
> rule cephfs.hdd.data {
>          id 7
>          type erasure
>          step set_chooseleaf_tries 5
>          step set_choose_tries 100
>          step take default class hdd
>          step choose indep 0 type datacenter
>          step chooseleaf indep 3 type host
>          step emit
> }
> rule rbd.ssd.data {
>          id 8
>          type erasure
>          step set_chooseleaf_tries 5
>          step set_choose_tries 100
>          step take default class ssd
>          step choose indep 0 type datacenter
>          step chooseleaf indep 3 type host
>          step emit
> }
>
> Which should first pick all 3 datacenters in the choose step and then
> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
> respectively.
>
> Mvh.
>
> Torkil
>
> > On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
> > <patrakov@xxxxxxxxx> wrote:
> >>
> >> Hi Torkil,
> >>
> >> I take my previous response back.
> >>
> >> You have an erasure-coded pool with nine shards but only three
> >> datacenters. This, in general, cannot work. You need either nine
> >> datacenters or a very custom CRUSH rule. The second option may not be
> >> available if the current EC setup is already incompatible, as there is
> >> no way to change the EC parameters.
> >>
> >> It would help if you provided the output of "ceph osd pool ls detail".
> >>
> >> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
> >> <patrakov@xxxxxxxxx> wrote:
> >>>
> >>> Hi Torkil,
> >>>
> >>> Unfortunately, your files contain nothing obviously bad or suspicious,
> >>> except for two things: more PGs than usual and bad balance.
> >>>
> >>> What's your "mon max pg per osd" setting?
> >>>
> >>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>
> >>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> >>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>>>>>
> >>>>>> The other output is too big for pastebin and I'm not familiar with
> >>>>>> paste services, any suggestion for a preferred way to share such
> >>>>>> output?
> >>>>>
> >>>>> You can attach files to the mail here on the list.
> >>>>
> >>>> Doh, for some reason I was sure attachments would be stripped. Thanks,
> >>>> attached.
> >>>>
> >>>> Mvh.
> >>>>
> >>>> Torkil
> >>>
> >>>
> >>>
> >>> --
> >>> Alexander E. Patrakov
> >>
> >>
> >>
> >> --
> >> Alexander E. Patrakov
> >
> >
> >
>
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>

-- 
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx