Hi Torkil,

I have looked at the files that you attached. They were helpful: pool 11 is problematic, it complains about degraded objects for no obvious reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that they are not completely resolved. It may sound like a somewhat irrational move, but to confirm this theory, you can restart osd.237 (it is mentioned at the end of query.11.fff.txt, although I don't understand why it is there) and then osd.298 (it is the primary for that PG) and see whether any additional backfills get unblocked after that. Also, please query that PG again after the OSD restarts.
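Assuming the cluster is managed by cephadm, something like this should do it (with plain systemd-managed OSDs it would be "systemctl restart ceph-osd@237" on the host that carries the OSD instead); the output file name is just an example following your earlier query.*.txt naming:

  ceph orch daemon restart osd.237
  # wait for the OSD to come back up and for peering to settle, then:
  ceph orch daemon restart osd.298
  # re-query the PG so we can compare it with the previous dump:
  ceph pg 11.fff query > query.11.fff.after-restart.txt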
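For reference, the settings discussed further down in this thread can be applied with commands roughly like these (the values are the ones already mentioned in the thread; your babysitter script can of course keep adjusting osd_max_backfills per OSD on top of this):

  # let osd_max_backfills take effect while the mClock scheduler is in use
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 3
  # min_size should be k+1, i.e. 5, for the 4+5 cephfs.hdd.data pool
  ceph osd pool set cephfs.hdd.data min_size 5
  ceph config set global mon_max_pg_per_osd 400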

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>
> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> > Hi Torkil,
>
> Hi Alexander
>
> > I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.
>
> Thank you for taking the time =)
>
> > What happens if you increase the osd_max_backfills setting temporarily?
>
> We already had the mclock override option in place and I re-enabled our babysitter script which sets osd_max_backfills per OSD to 1-3 depending on how full they are. Active backfills went from 16 to 53, which is probably because the default osd_max_backfills for mclock is 1.
>
> I think 53 is still a low number of active backfills given the large percentage misplaced.
>
> > It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one:
> >
> > ceph pg 37.0 query > query.37.0.txt
> > ceph pg 37.1 query > query.37.1.txt
> > ...
> > and the same for the other affected pools.
>
> A few samples attached.
>
> > Still, I must say that some of your rules are actually unsafe.
> >
> > The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you will, therefore, have only 4 OSDs up, which is exactly the number of data chunks. However, the pool requires min_size 5, so all PGs will be inactive (to prevent data corruption) and will stay inactive until the datacenter comes up again. However, please don't set min_size to 4 - then, any additional incident (like a defective disk) will lead to data loss, and the shards in the datacenter which went offline would be useless because they do not correspond to the updated shards written by the clients.
>
> Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.
>
> > The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5.
>
> Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.
>
> > The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.
>
> There is very little data in this pool, that is probably the main reason.
>
> > Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400.
>
> Copy that. Didn't seem to make a difference though, and we have osd_max_pg_per_osd_hard_ratio set to 5.000000.
>
> Mvh.
>
> Torkil
>
> > On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>
> >> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself is insufficient. For every erasure code profile mentioned in the output, please also run something like this:
> >>>
> >>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>
> >>> ...where "prf-for-ec-data" is the name that appears after the words "erasure profile" in the "ceph osd pool ls detail" output.
> >>
> >> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
> >> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
> >> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd
> >>
> >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> >> crush-device-class=hdd
> >> crush-failure-domain=host
> >> crush-root=default
> >> jerasure-per-chunk-alignment=false
> >> k=4
> >> m=2
> >> plugin=jerasure
> >> technique=reed_sol_van
> >> w=8
> >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> >> crush-device-class=hdd
> >> crush-failure-domain=datacenter
> >> crush-root=default
> >> jerasure-per-chunk-alignment=false
> >> k=4
> >> m=5
> >> plugin=jerasure
> >> technique=reed_sol_van
> >> w=8
> >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> >> crush-device-class=ssd
> >> crush-failure-domain=datacenter
> >> crush-root=default
> >> jerasure-per-chunk-alignment=false
> >> k=4
> >> m=5
> >> plugin=jerasure
> >> technique=reed_sol_van
> >> w=8
> >>
> >> But as I understand it those profiles are only used to create the initial crush rule for the pool, and we have manually edited those along the way. Here are the 3 rules in use for the 3 EC pools:
> >>
> >> rule rbd_ec_data {
> >>         id 0
> >>         type erasure
> >>         step set_chooseleaf_tries 5
> >>         step set_choose_tries 100
> >>         step take default class hdd
> >>         step choose indep 0 type datacenter
> >>         step chooseleaf indep 2 type host
> >>         step emit
> >> }
> >> rule cephfs.hdd.data {
> >>         id 7
> >>         type erasure
> >>         step set_chooseleaf_tries 5
> >>         step set_choose_tries 100
> >>         step take default class hdd
> >>         step choose indep 0 type datacenter
> >>         step chooseleaf indep 3 type host
> >>         step emit
> >> }
> >> rule rbd.ssd.data {
> >>         id 8
> >>         type erasure
> >>         step set_chooseleaf_tries 5
> >>         step set_choose_tries 100
> >>         step take default class ssd
> >>         step choose indep 0 type datacenter
> >>         step chooseleaf indep 3 type host
> >>         step emit
> >> }
> >>
> >> These should first pick all 3 datacenters in the choose step and then either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 respectively.
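
A side note on the rules above: if you want to double-check what they actually map to, you can confirm which CRUSH rule each pool really uses and then run crushtool in test mode against the compiled CRUSH map. Something along these lines should work (the rule IDs are the ones shown above, --num-rep must match the pool's k+m, and the file name is just an example):

  ceph osd pool get rbd_ec_data crush_rule
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-mappings --min-x 0 --max-x 9
  crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings --min-x 0 --max-x 9

Using --show-bad-mappings instead of --show-mappings prints only the inputs for which CRUSH could not find enough OSDs, which is a quick way to spot a rule that cannot be satisfied.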

> >> Mvh.
> >>
> >> Torkil
> >>
> >>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi Torkil,
> >>>>
> >>>> I take my previous response back.
> >>>>
> >>>> You have an erasure-coded pool with nine shards but only three datacenters. This, in general, cannot work. You need either nine datacenters or a very custom CRUSH rule. The second option may not be available if the current EC setup is already incompatible, as there is no way to change the EC parameters.
> >>>>
> >>>> It would help if you provided the output of "ceph osd pool ls detail".
> >>>>
> >>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi Torkil,
> >>>>>
> >>>>> Unfortunately, your files contain nothing obviously bad or suspicious, except for two things: more PGs than usual and bad balance.
> >>>>>
> >>>>> What's your "mon max pg per osd" setting?
> >>>>>
> >>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>>
> >>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> >>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>>>>>>>
> >>>>>>>> The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?
> >>>>>>>
> >>>>>>> You can attach files to the mail here on the list.
> >>>>>>
> >>>>>> Doh, for some reason I was sure attachments would be stripped. Thanks, attached.
> >>>>>>
> >>>>>> Mvh.
> >>>>>>
> >>>>>> Torkil
> >>>>>
> >>>>> --
> >>>>> Alexander E. Patrakov
> >>>>
> >>>> --
> >>>> Alexander E. Patrakov
> >>>
> >>
> >> --
> >> Torkil Svensgaard
> >> Systems Administrator
> >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >> Copenhagen University Hospital Amager and Hvidovre
> >> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>
> >
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark

--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx