On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>
> On 24/03/2024 01:14, Torkil Svensgaard wrote:
> > On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> Thanks for the update. Even though the improvement is small, it is
> >> still an improvement, consistent with the osd_max_backfills value, and
> >> it proves that there are still unsolved peering issues.
> >>
> >> I have looked at both the old and the new state of the PG, but could
> >> not find anything else interesting.
> >>
> >> I also looked again at the state of PG 37.1. It is known what blocks
> >> the backfill of this PG; please search for "blocked_by." However, this
> >> is just one data point, which is insufficient for any conclusions. Try
> >> looking at other PGs. Is there anything too common in the non-empty
> >> "blocked_by" blocks?
> >
> > I'll take a look at that tomorrow, perhaps we can script something
> > meaningful.
>
> Hi Alexander
>
> While working on a script querying all PGs and making a list of all OSDs
> found in a blocked_by list, and how many times for each, I discovered
> something odd about pool 38:
>
> "
> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
> OSDs blocking other OSDs:
<snip>
> All PGs in the pool are active+clean so why are there any blocked_by at
> all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them all"
scenario that I was looking for. I think this is something bogus that can
probably be cleared in your example by restarting osd.89 (i.e., the one
being blocked).
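For reference, a rough sketch of such a per-pool blocked_by tally
(untested, and probably not identical to your blocked_by.sh; it assumes
jq is installed and that the JSON layout below matches what your Ceph
version emits) could look like this:

#!/bin/sh
# Tally how often each OSD appears in a non-empty blocked_by list for
# one pool; pass the pool id as the first argument, e.g. "38".
pool="$1"
ceph pg ls "$pool" -f json \
  | jq -r '(.pg_stats? // .)[].pgid' \
  | while read -r pgid; do
      # Pull every blocked_by entry found anywhere in the query output;
      # the path is deliberately loose to cope with layout differences.
      ceph pg "$pgid" query | jq -r '.. | .blocked_by? // empty | .[]'
    done \
  | sort -n | uniq -c | sort -rn

An OSD that keeps showing up at the top of that list across many PGs
would be the kind of common blocker worth restarting first. Note that
querying every PG one by one is slow on pools with thousands of PGs.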
>
> Mvh.
>
> Torkil
>
> >> I think we have to look for patterns in other ways, too. One tool that
> >> produces good visualizations is TheJJ balancer. Although it is called
> >> a "balancer," it can also visualize the ongoing backfills.
> >>
> >> The tool is available at
> >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> >>
> >> Run it as follows:
> >>
> >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
> >
> > Output attached.
> >
> > Thanks again.
> >
> > Mvh.
> >
> > Torkil
> >
> >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>
> >>> Hi Alex
> >>>
> >>> New query output attached after restarting both OSDs. OSD 237 is no
> >>> longer mentioned but it unfortunately made no difference for the number
> >>> of backfills which went 59->62->62.
> >>>
> >>> Mvh.
> >>>
> >>> Torkil
> >>>
> >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> >>>> Hi Torkil,
> >>>>
> >>>> I have looked at the files that you attached. They were helpful: pool
> >>>> 11 is problematic, it complains about degraded objects for no obvious
> >>>> reason. I think that is the blocker.
> >>>>
> >>>> I also noted that you mentioned peering problems, and I suspect that
> >>>> they are not completely resolved. As a somewhat-irrational move, to
> >>>> confirm this theory, you can restart osd.237 (it is mentioned at the
> >>>> end of query.11.fff.txt, although I don't understand why it is there)
> >>>> and then osd.298 (it is the primary for that pg) and see if any
> >>>> additional backfills are unblocked after that. Also, please re-query
> >>>> that PG again after the OSD restart.
> >>>>
> >>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>
> >>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>>>>> Hi Torkil,
> >>>>>
> >>>>> Hi Alexander
> >>>>>
> >>>>>> I have looked at the CRUSH rules, and the equivalent rules work on my
> >>>>>> test cluster. So this cannot be the cause of the blockage.
> >>>>>
> >>>>> Thank you for taking the time =)
> >>>>>
> >>>>>> What happens if you increase the osd_max_backfills setting
> >>>>>> temporarily?
> >>>>>
> >>>>> We already had the mclock override option in place and I re-enabled our
> >>>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> >>>>> on how full they are. Active backfills went from 16 to 53 which is
> >>>>> probably because default osd_max_backfills for mclock is 1.
> >>>>>
> >>>>> I think 53 is still a low number of active backfills given the large
> >>>>> percentage misplaced.
> >>>>>
> >>>>>> It may be a good idea to investigate a few of the stalled PGs. Please
> >>>>>> run commands similar to this one:
> >>>>>>
> >>>>>> ceph pg 37.0 query > query.37.0.txt
> >>>>>> ceph pg 37.1 query > query.37.1.txt
> >>>>>> ...
> >>>>>> and the same for the other affected pools.
> >>>>>
> >>>>> A few samples attached.
> >>>>>
> >>>>>> Still, I must say that some of your rules are actually unsafe.
> >>>>>>
> >>>>>> The 4+2 rule as used by rbd_ec_data will not survive a
> >>>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> >>>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >>>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >>>>>> the number of data chunks. However, the pool requires min_size 5, so
> >>>>>> all PGs will be inactive (to prevent data corruption) and will stay
> >>>>>> inactive until the datacenter comes up again. However, please don't
> >>>>>> set min_size to 4 - then, any additional incident (like a defective
> >>>>>> disk) will lead to data loss, and the shards in the datacenter which
> >>>>>> went offline would be useless because they do not correspond to the
> >>>>>> updated shards written by the clients.
> >>>>>
> >>>>> Thanks for the explanation. This is an old pool predating the 3 DC
> >>>>> setup and we'll migrate the data to a 4+5 pool when we can.
> >>>>>
> >>>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> >>>>>> number of data chunks. See above why it is bad. Please set min_size
> >>>>>> to 5.
> >>>>>
> >>>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
> >>>>> creating+incomplete) when we created the pool. It's back to 5 now.
> >>>>>
> >>>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
> >>>>>> 100% active+clean.
> >>>>>
> >>>>> There is very little data in this pool, that is probably the main
> >>>>> reason.
> >>>>>
> >>>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
> >>>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
> >>>>>
> >>>>> Copy that. Didn't seem to make a difference though, and we have
> >>>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
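For completeness: assuming the centralized config database is in use
(adjust if these are set in ceph.conf instead), that change can be
applied and verified like this, a sketch using the values discussed
above:

ceph config set global mon_max_pg_per_osd 400
ceph config dump | grep -E 'mon_max_pg_per_osd|osd_max_pg_per_osd_hard_ratio'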
> >>>>>
> >>>>> Mvh.
> >>>>>
> >>>>> Torkil
> >>>>>
> >>>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> >>>>>>>> is insufficient. For every erasure code profile mentioned in the
> >>>>>>>> output, please also run something like this:
> >>>>>>>>
> >>>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>>>>>>
> >>>>>>>> ...where "prf-for-ec-data" is the name that appears after the words
> >>>>>>>> "erasure profile" in the "ceph osd pool ls detail" output.
> >>>>>>>
> >>>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >>>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> >>>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> >>>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> >>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> >>>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
> >>>>>>> application rbd
> >>>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
> >>>>>>> size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048
> >>>>>>> pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486
> >>>>>>> flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> >>>>>>> compression_algorithm zstd compression_mode aggressive
> >>>>>>> application cephfs
> >>>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd
> >>>>>>> size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32
> >>>>>>> pgp_num 32 autoscale_mode warn last_change 2198930 lfor
> >>>>>>> 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps
> >>>>>>> stripe_width 16384 compression_algorithm zstd compression_mode
> >>>>>>> aggressive application rbd
> >>>>>>>
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> >>>>>>> crush-device-class=hdd
> >>>>>>> crush-failure-domain=host
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=2
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> >>>>>>> crush-device-class=hdd
> >>>>>>> crush-failure-domain=datacenter
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=5
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> >>>>>>> crush-device-class=ssd
> >>>>>>> crush-failure-domain=datacenter
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=5
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
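As an aside: if you want to see exactly what rule one of these profiles
would generate on its own, something like this should do it (a sketch;
the test-k4m5-hdd rule name is arbitrary, and the command creates a real
but unused rule, so remove it again afterwards):

ceph osd crush rule create-erasure test-k4m5-hdd DRCMR_k4m5_datacenter_hdd
ceph osd crush rule dump test-k4m5-hdd
ceph osd crush rule rm test-k4m5-hdd

With k=4, m=5 and crush-failure-domain=datacenter, that auto-generated
rule would want nine independent datacenters, which is why the
hand-edited rules below split placement across 3 datacenters x 3 hosts
instead.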
> >>>>>>>
> >>>>>>> But as I understand it those profiles are only used to create the
> >>>>>>> initial crush rule for the pool, and we have manually edited those
> >>>>>>> along the way. Here are the 3 rules in use for the 3 EC pools:
> >>>>>>>
> >>>>>>> rule rbd_ec_data {
> >>>>>>>     id 0
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class hdd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 2 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>> rule cephfs.hdd.data {
> >>>>>>>     id 7
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class hdd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 3 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>> rule rbd.ssd.data {
> >>>>>>>     id 8
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class ssd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 3 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>>
> >>>>>>> Which should first pick all 3 datacenters in the choose step and then
> >>>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
> >>>>>>> respectively.
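A quick way to sanity-check that, if you want, is to let crushtool
enumerate the mappings (a sketch; the rule ids are the ones from your
dump, and --show-bad-mappings should print nothing if every input really
gets the full set of OSDs):

ceph osd getcrushmap -o crushmap.bin
# rule 7 = cephfs.hdd.data, rule 8 = rbd.ssd.data, both EC 4+5 -> 9 shards
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings | head
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings
# rule 0 = rbd_ec_data, EC 4+2 -> 6 shards
crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-bad-mappings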
> >>>>>>>
> >>>>>>> Mvh.
> >>>>>>>
> >>>>>>> Torkil
> >>>>>>>
> >>>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
> >>>>>>>> <patrakov@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Torkil,
> >>>>>>>>>
> >>>>>>>>> I take my previous response back.
> >>>>>>>>>
> >>>>>>>>> You have an erasure-coded pool with nine shards but only three
> >>>>>>>>> datacenters. This, in general, cannot work. You need either nine
> >>>>>>>>> datacenters or a very custom CRUSH rule. The second option may not
> >>>>>>>>> be available if the current EC setup is already incompatible, as
> >>>>>>>>> there is no way to change the EC parameters.
> >>>>>>>>>
> >>>>>>>>> It would help if you provided the output of "ceph osd pool ls
> >>>>>>>>> detail".
> >>>>>>>>>
> >>>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
> >>>>>>>>> <patrakov@xxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Torkil,
> >>>>>>>>>>
> >>>>>>>>>> Unfortunately, your files contain nothing obviously bad or
> >>>>>>>>>> suspicious, except for two things: more PGs than usual and bad
> >>>>>>>>>> balance.
> >>>>>>>>>>
> >>>>>>>>>> What's your "mon max pg per osd" setting?
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
> >>>>>>>>>> <torkil@xxxxxxxx> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> >>>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The other output is too big for pastebin and I'm not familiar
> >>>>>>>>>>>>> with paste services, any suggestion for a preferred way to
> >>>>>>>>>>>>> share such output?
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can attached files to the mail here on the list.
> >>>>>>>>>>>
> >>>>>>>>>>> Doh, for some reason I was sure attachments would be stripped.
> >>>>>>>>>>> Thanks, attached.
> >>>>>>>>>>>
> >>>>>>>>>>> Mvh.
> >>>>>>>>>>>
> >>>>>>>>>>> Torkil
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Alexander E. Patrakov
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Alexander E. Patrakov
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Torkil Svensgaard
> >>>>>>> Systems Administrator
> >>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>>>>>> Copenhagen University Hospital Amager and Hvidovre
> >>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>>>>>
> >>>>> --
> >>>>> Torkil Svensgaard
> >>>>> Systems Administrator
> >>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>>>> Copenhagen University Hospital Amager and Hvidovre
> >>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>>>
> >>> --
> >>> Torkil Svensgaard
> >>> Systems Administrator
> >>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>> Copenhagen University Hospital Amager and Hvidovre
> >>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>
> >
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: torkil@xxxxxxxx

--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx