First try "ceph osd down 89" > On Mar 25, 2024, at 15:37, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote: > > On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote: >> >> >> >> On 24/03/2024 01:14, Torkil Svensgaard wrote: >>> On 24-03-2024 00:31, Alexander E. Patrakov wrote: >>>> Hi Torkil, >>> >>> Hi Alexander >>> >>>> Thanks for the update. Even though the improvement is small, it is >>>> still an improvement, consistent with the osd_max_backfills value, and >>>> it proves that there are still unsolved peering issues. >>>> >>>> I have looked at both the old and the new state of the PG, but could >>>> not find anything else interesting. >>>> >>>> I also looked again at the state of PG 37.1. It is known what blocks >>>> the backfill of this PG; please search for "blocked_by." However, this >>>> is just one data point, which is insufficient for any conclusions. Try >>>> looking at other PGs. Is there anything too common in the non-empty >>>> "blocked_by" blocks? >>> >>> I'll take a look at that tomorrow, perhaps we can script something >>> meaningful. >> >> Hi Alexander >> >> While working on a script querying all PGs and making a list of all OSDs >> found in a blocked_by list, and how many times for each, I discovered >> something odd about pool 38: >> >> " >> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38 >> OSDs blocking other OSDs: > <snip> > >> All PGs in the pool are active+clean so why are there any blocked_by at >> all? One example attached. > > I don't know. In any case, it doesn't match the "one OSD blocks them > all" scenario that I was looking for. I think this is something bogus > that can probably be cleared in your example by restarting osd.89 > (i.e, the one being blocked). > >> >> Mvh. >> >> Torkil >> >>>> I think we have to look for patterns in other ways, too. One tool that >>>> produces good visualizations is TheJJ balancer. Although it is called >>>> a "balancer," it can also visualize the ongoing backfills. >>>> >>>> The tool is available at >>>> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py >>>> >>>> Run it as follows: >>>> >>>> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt >>> >>> Output attached. >>> >>> Thanks again. >>> >>> Mvh. >>> >>> Torkil >>> >>>> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx> >>>> wrote: >>>>> >>>>> Hi Alex >>>>> >>>>> New query output attached after restarting both OSDs. OSD 237 is no >>>>> longer mentioned but it unfortunately made no difference for the number >>>>> of backfills which went 59->62->62. >>>>> >>>>> Mvh. >>>>> >>>>> Torkil >>>>> >>>>> On 23-03-2024 22:26, Alexander E. Patrakov wrote: >>>>>> Hi Torkil, >>>>>> >>>>>> I have looked at the files that you attached. They were helpful: pool >>>>>> 11 is problematic, it complains about degraded objects for no obvious >>>>>> reason. I think that is the blocker. >>>>>> >>>>>> I also noted that you mentioned peering problems, and I suspect that >>>>>> they are not completely resolved. As a somewhat-irrational move, to >>>>>> confirm this theory, you can restart osd.237 (it is mentioned at the >>>>>> end of query.11.fff.txt, although I don't understand why it is there) >>>>>> and then osd.298 (it is the primary for that pg) and see if any >>>>>> additional backfills are unblocked after that. Also, please re-query >>>>>> that PG again after the OSD restart. 
>>>>>>
>>>>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
>>>>>>>> Hi Torkil,
>>>>>>>
>>>>>>> Hi Alexander
>>>>>>>
>>>>>>>> I have looked at the CRUSH rules, and the equivalent rules work
>>>>>>>> on my test cluster. So this cannot be the cause of the blockage.
>>>>>>>
>>>>>>> Thank you for taking the time =)
>>>>>>>
>>>>>>>> What happens if you increase the osd_max_backfills setting
>>>>>>>> temporarily?
>>>>>>>
>>>>>>> We already had the mclock override option in place and I
>>>>>>> re-enabled our babysitter script, which sets osd_max_backfills
>>>>>>> per OSD to 1-3 depending on how full they are. Active backfills
>>>>>>> went from 16 to 53, which is probably because the default
>>>>>>> osd_max_backfills for mclock is 1.
>>>>>>>
>>>>>>> I think 53 is still a low number of active backfills given the
>>>>>>> large percentage misplaced.
>>>>>>>
>>>>>>>> It may be a good idea to investigate a few of the stalled PGs.
>>>>>>>> Please run commands similar to this one:
>>>>>>>>
>>>>>>>> ceph pg 37.0 query > query.37.0.txt
>>>>>>>> ceph pg 37.1 query > query.37.1.txt
>>>>>>>> ...
>>>>>>>>
>>>>>>>> and the same for the other affected pools.
>>>>>>>
>>>>>>> A few samples attached.
>>>>>>>
>>>>>>>> Still, I must say that some of your rules are actually unsafe.
>>>>>>>>
>>>>>>>> The 4+2 rule as used by rbd_ec_data will not survive a
>>>>>>>> datacenter-offline incident. Namely, for each PG, it chooses
>>>>>>>> OSDs from two hosts in each datacenter, so 6 OSDs total. When a
>>>>>>>> datacenter is offline, you will, therefore, have only 4 OSDs up,
>>>>>>>> which is exactly the number of data chunks. However, the pool
>>>>>>>> requires min_size 5, so all PGs will be inactive (to prevent
>>>>>>>> data corruption) and will stay inactive until the datacenter
>>>>>>>> comes up again. However, please don't set min_size to 4 - then,
>>>>>>>> any additional incident (like a defective disk) will lead to
>>>>>>>> data loss, and the shards in the datacenter which went offline
>>>>>>>> would be useless because they do not correspond to the updated
>>>>>>>> shards written by the clients.
>>>>>>>
>>>>>>> Thanks for the explanation. This is an old pool predating the
>>>>>>> 3 DC setup, and we'll migrate the data to a 4+5 pool when we can.
>>>>>>>
>>>>>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to
>>>>>>>> the number of data chunks. See above for why that is bad. Please
>>>>>>>> set min_size to 5.
>>>>>>>
>>>>>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
>>>>>>> creating+incomplete) when we created the pool. It's back to 5 now.
>>>>>>>
>>>>>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs
>>>>>>>> are 100% active+clean.
>>>>>>>
>>>>>>> There is very little data in this pool; that is probably the main
>>>>>>> reason.
>>>>>>>
>>>>>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs
>>>>>>>> that have 300+ PGs; the observed maximum is 347. Please set it
>>>>>>>> to 400.
>>>>>>>
>>>>>>> Copy that. Didn't seem to make a difference though, and we have
>>>>>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
>>>>>>>
>>>>>>> Mvh.
>>>>>>>
>>>>>>> Torkil
>>>>>>>
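For reference, the settings discussed above map onto commands along
these lines (a sketch, assuming a Quincy/Reef cluster with the mclock
scheduler; the pool and option names are the ones from the thread):

  # Let a manual osd_max_backfills value take effect despite mclock
  # (the "mclock override" option mentioned above):
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 3   # per-OSD cap; the script uses 1-3
  # Put the 4+5 EC pool back to min_size k+1:
  ceph osd pool set cephfs.hdd.data min_size 5
  # Headroom for the OSDs already holding 300+ PGs:
  ceph config set global mon_max_pg_per_osd 400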
>>>>>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard
>>>>>>>> <torkil@xxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
>>>>>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by
>>>>>>>>>> itself is insufficient. For every erasure code profile
>>>>>>>>>> mentioned in the output, please also run something like this:
>>>>>>>>>>
>>>>>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
>>>>>>>>>>
>>>>>>>>>> ...where "prf-for-ec-data" is the name that appears after the
>>>>>>>>>> words "erasure profile" in the "ceph osd pool ls detail" output.
>>>>>>>>>
>>>>>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
>>>>>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
>>>>>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
>>>>>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
>>>>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>>>>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>>>>>>>> application rbd
>>>>>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
>>>>>>>>> size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048
>>>>>>>>> pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486
>>>>>>>>> flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>>>>>>>> compression_algorithm zstd compression_mode aggressive
>>>>>>>>> application cephfs
>>>>>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd
>>>>>>>>> size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32
>>>>>>>>> pgp_num 32 autoscale_mode warn last_change 2198930 lfor
>>>>>>>>> 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps
>>>>>>>>> stripe_width 16384 compression_algorithm zstd compression_mode
>>>>>>>>> aggressive application rbd
>>>>>>>>>
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
>>>>>>>>> crush-device-class=hdd
>>>>>>>>> crush-failure-domain=host
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=2
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
>>>>>>>>> crush-device-class=hdd
>>>>>>>>> crush-failure-domain=datacenter
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=5
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
>>>>>>>>> crush-device-class=ssd
>>>>>>>>> crush-failure-domain=datacenter
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=5
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
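For context, a profile like DRCMR_k4m5_datacenter_hdd above would
originally have been created with something like the following
(illustrative only; a pool's EC parameters cannot be changed after
creation, so this is not something to re-run against an existing pool):

  ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd \
      k=4 m=5 plugin=jerasure technique=reed_sol_van \
      crush-device-class=hdd crush-failure-domain=datacenter \
      crush-root=default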
>>>>>>>>>
>>>>>>>>> But as I understand it those profiles are only used to create
>>>>>>>>> the initial crush rule for the pool, and we have manually
>>>>>>>>> edited those along the way. Here are the 3 rules in use for the
>>>>>>>>> 3 EC pools:
>>>>>>>>>
>>>>>>>>> rule rbd_ec_data {
>>>>>>>>>     id 0
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class hdd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 2 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>> rule cephfs.hdd.data {
>>>>>>>>>     id 7
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class hdd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 3 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>> rule rbd.ssd.data {
>>>>>>>>>     id 8
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class ssd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 3 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> These should first pick all 3 datacenters in the choose step
>>>>>>>>> and then either 2 or 3 hosts in the chooseleaf step, matching
>>>>>>>>> EC 4+2 and 4+5 respectively.
>>>>>>>>>
>>>>>>>>> Mvh.
>>>>>>>>>
>>>>>>>>> Torkil
>>>>>>>>>
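Rules like these can be sanity-checked offline with standard crushtool
usage (this is not from the thread; rule id 7 and --num-rep 9
correspond to the 4+5 cephfs.hdd.data pool above):

  ceph osd getcrushmap -o crushmap.bin   # export the compiled CRUSH map
  crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings | head
  # --show-bad-mappings prints nothing if every PG gets a full set of 9 OSDs:
  crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings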
>>>>>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
>>>>>>>>>> <patrakov@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Torkil,
>>>>>>>>>>>
>>>>>>>>>>> I take my previous response back.
>>>>>>>>>>>
>>>>>>>>>>> You have an erasure-coded pool with nine shards but only
>>>>>>>>>>> three datacenters. This, in general, cannot work. You need
>>>>>>>>>>> either nine datacenters or a very custom CRUSH rule. The
>>>>>>>>>>> second option may not be available if the current EC setup is
>>>>>>>>>>> already incompatible, as there is no way to change the EC
>>>>>>>>>>> parameters.
>>>>>>>>>>>
>>>>>>>>>>> It would help if you provided the output of "ceph osd pool ls
>>>>>>>>>>> detail".
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>>>>>>>>>>> <patrakov@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Torkil,
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, your files contain nothing obviously bad or
>>>>>>>>>>>> suspicious, except for two things: more PGs than usual and
>>>>>>>>>>>> bad balance.
>>>>>>>>>>>>
>>>>>>>>>>>> What's your "mon max pg per osd" setting?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
>>>>>>>>>>>> <torkil@xxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
>>>>>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil
>>>>>>>>>>>>>> Svensgaard wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other output is too big for pastebin and I'm not
>>>>>>>>>>>>>>> familiar with paste services, any suggestion for a
>>>>>>>>>>>>>>> preferred way to share such output?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can attach files to the mail here on the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Doh, for some reason I was sure attachments would be
>>>>>>>>>>>>> stripped. Thanks, attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mvh.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Torkil
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Alexander E. Patrakov
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Alexander E. Patrakov
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Torkil Svensgaard
>>>>>>>>> Systems Administrator
>>>>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Torkil Svensgaard
>>>>>>> Systems Administrator
>>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>>>
>>>>>
>>>>> --
>>>>> Torkil Svensgaard
>>>>> Systems Administrator
>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>
>>>
>>
>> --
>> Torkil Svensgaard
>> Sysadmin
>> MR-Forskningssektionen, afs. 714
>> DRCMR, Danish Research Centre for Magnetic Resonance
>> Hvidovre Hospital
>> Kettegård Allé 30
>> DK-2650 Hvidovre
>> Denmark
>> Tel: +45 386 22828
>> E-mail: torkil@xxxxxxxx
>
> --
> Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx