Hi Torkil,

Thanks for the update. Even though the improvement is small, it is still an improvement, it is consistent with the osd_max_backfills value, and it shows that there are still unresolved peering issues. I have looked at both the old and the new state of the PG but could not find anything else interesting.

I also looked again at the state of PG 37.1. What blocks the backfill of this PG is known: search for "blocked_by" in the query output. However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs: do the non-empty "blocked_by" sections have anything in common?

I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills. The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt
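
To check whether the stuck PGs share a common blocker, something like the sketch below might also help. It is untested, assumes jq is installed, and assumes you have saved the per-PG query output as query.<pgid>.txt as before; it extracts every OSD id that appears in a non-empty "blocked_by" list and counts how many queries mention it:

for f in query.*.txt; do
    # print each OSD id found in any "blocked_by" list anywhere in this query output
    jq -r '[.. | objects | select(has("blocked_by")) | .blocked_by[]] | unique | .[]' "$f"
done | sort -n | uniq -c | sort -rn

OSDs that show up near the top of that count for many PGs would be the first candidates for a closer look (or a restart, as with osd.237).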
On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>
> Hi Alex
>
> New query output attached after restarting both OSDs. OSD 237 is no longer mentioned, but it unfortunately made no difference for the number of backfills, which went 59->62->62.
>
> Mvh.
>
> Torkil
>
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> > Hi Torkil,
> >
> > I have looked at the files that you attached. They were helpful: pool 11 is problematic; it complains about degraded objects for no obvious reason. I think that is the blocker.
> >
> > I also noted that you mentioned peering problems, and I suspect that they are not completely resolved. As a somewhat-irrational move, to confirm this theory, you can restart osd.237 (it is mentioned at the end of query.11.fff.txt, although I don't understand why it is there) and then osd.298 (it is the primary for that PG) and see if any additional backfills are unblocked after that. Also, please re-query that PG after the OSD restarts.
> >
> > On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>
> >> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>> Hi Torkil,
> >>
> >> Hi Alexander
> >>
> >>> I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.
> >>
> >> Thank you for taking the time =)
> >>
> >>> What happens if you increase the osd_max_backfills setting temporarily?
> >>
> >> We already had the mclock override option in place and I re-enabled our babysitter script, which sets osd_max_backfills per OSD to 1-3 depending on how full they are. Active backfills went from 16 to 53, which is probably because the default osd_max_backfills for mclock is 1.
> >>
> >> I think 53 is still a low number of active backfills given the large percentage misplaced.
> >>
> >>> It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one:
> >>>
> >>> ceph pg 37.0 query > query.37.0.txt
> >>> ceph pg 37.1 query > query.37.1.txt
> >>> ...
> >>> and the same for the other affected pools.
> >>
> >> A few samples attached.
> >>
> >>> Still, I must say that some of your rules are actually unsafe.
> >>>
> >>> The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you will, therefore, have only 4 OSDs up, which is exactly the number of data chunks. However, the pool requires min_size 5, so all PGs will be inactive (to prevent data corruption) and will stay inactive until the datacenter comes up again. However, please don't set min_size to 4 - then, any additional incident (like a defective disk) will lead to data loss, and the shards in the datacenter which went offline would be useless because they do not correspond to the updated shards written by the clients.
> >>
> >> Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.
> >>
> >>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above for why that is bad. Please set min_size to 5.
> >>
> >> Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.
> >>
> >>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.
> >>
> >> There is very little data in this pool, that is probably the main reason.
> >>
> >>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs; the observed maximum is 347. Please set it to 400.
> >>
> >> Copy that. Didn't seem to make a difference though, and we have osd_max_pg_per_osd_hard_ratio set to 5.000000.
> >>
> >> Mvh.
> >>
> >> Torkil
> >>
> >>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>
> >>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself is insufficient. For every erasure code profile mentioned in the output, please also run something like this:
> >>>>>
> >>>>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>>>
> >>>>> ...where "prf-for-ec-data" is the name that appears after the words "erasure profile" in the "ceph osd pool ls detail" output.
> >>>>
> >>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
> >>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
> >>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd
> >>>>
> >>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> >>>> crush-device-class=hdd
> >>>> crush-failure-domain=host
> >>>> crush-root=default
> >>>> jerasure-per-chunk-alignment=false
> >>>> k=4
> >>>> m=2
> >>>> plugin=jerasure
> >>>> technique=reed_sol_van
> >>>> w=8
> >>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> >>>> crush-device-class=hdd
> >>>> crush-failure-domain=datacenter
> >>>> crush-root=default
> >>>> jerasure-per-chunk-alignment=false
> >>>> k=4
> >>>> m=5
> >>>> plugin=jerasure
> >>>> technique=reed_sol_van
> >>>> w=8
> >>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> >>>> crush-device-class=ssd
> >>>> crush-failure-domain=datacenter
> >>>> crush-root=default
> >>>> jerasure-per-chunk-alignment=false
> >>>> k=4
> >>>> m=5
> >>>> plugin=jerasure
> >>>> technique=reed_sol_van
> >>>> w=8
> >>>>
> >>>> But as I understand it those profiles are only used to create the initial crush rule for the pool, and we have manually edited those along the way. Here are the 3 rules in use for the 3 EC pools:
> >>>>
> >>>> rule rbd_ec_data {
> >>>>         id 0
> >>>>         type erasure
> >>>>         step set_chooseleaf_tries 5
> >>>>         step set_choose_tries 100
> >>>>         step take default class hdd
> >>>>         step choose indep 0 type datacenter
> >>>>         step chooseleaf indep 2 type host
> >>>>         step emit
> >>>> }
> >>>> rule cephfs.hdd.data {
> >>>>         id 7
> >>>>         type erasure
> >>>>         step set_chooseleaf_tries 5
> >>>>         step set_choose_tries 100
> >>>>         step take default class hdd
> >>>>         step choose indep 0 type datacenter
> >>>>         step chooseleaf indep 3 type host
> >>>>         step emit
> >>>> }
> >>>> rule rbd.ssd.data {
> >>>>         id 8
> >>>>         type erasure
> >>>>         step set_chooseleaf_tries 5
> >>>>         step set_choose_tries 100
> >>>>         step take default class ssd
> >>>>         step choose indep 0 type datacenter
> >>>>         step chooseleaf indep 3 type host
> >>>>         step emit
> >>>> }
> >>>>
> >>>> Which should first pick all 3 datacenters in the choose step and then either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 respectively.
> >>>>
> >>>> Mvh.
> >>>>
> >>>> Torkil
> >>>>
> >>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Hi Torkil,
> >>>>>>
> >>>>>> I take my previous response back.
> >>>>>>
> >>>>>> You have an erasure-coded pool with nine shards but only three datacenters. This, in general, cannot work. You need either nine datacenters or a very custom CRUSH rule. The second option may not be available if the current EC setup is already incompatible, as there is no way to change the EC parameters.
> >>>>>>
> >>>>>> It would help if you provided the output of "ceph osd pool ls detail".
> >>>>>>
> >>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hi Torkil,
> >>>>>>>
> >>>>>>> Unfortunately, your files contain nothing obviously bad or suspicious, except for two things: more PGs than usual and bad balance.
> >>>>>>>
> >>>>>>> What's your "mon max pg per osd" setting?
> >>>>>>>
> >>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> >>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>>>>>>>>>
> >>>>>>>>>> The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?
> >>>>>>>>>
> >>>>>>>>> You can attach files to the mail here on the list.
> >>>>>>>>
> >>>>>>>> Doh, for some reason I was sure attachments would be stripped. Thanks, attached.
> >>>>>>>>
> >>>>>>>> Mvh.
> >>>>>>>>
> >>>>>>>> Torkil
> >>>>>>>
> >>>>>>> --
> >>>>>>> Alexander E. Patrakov
> >>>>>>
> >>>>>> --
> >>>>>> Alexander E. Patrakov
> >>>>>
> >>>>
> >>>> --
> >>>> Torkil Svensgaard
> >>>> Systems Administrator
> >>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>>> Copenhagen University Hospital Amager and Hvidovre
> >>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>>
> >>
> >> --
> >> Torkil Svensgaard
> >> Systems Administrator
> >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >> Copenhagen University Hospital Amager and Hvidovre
> >> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >
>
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark

--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx