On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
>
> On 24/03/2024 01:14, Torkil Svensgaard wrote:
> > On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> Thanks for the update. Even though the improvement is small, it is
> >> still an improvement, consistent with the osd_max_backfills value, and
> >> it proves that there are still unsolved peering issues.
> >>
> >> I have looked at both the old and the new state of the PG, but could
> >> not find anything else interesting.
> >>
> >> I also looked again at the state of PG 37.1. It is known what blocks
> >> the backfill of this PG; please search for "blocked_by." However, this
> >> is just one data point, which is insufficient for any conclusions. Try
> >> looking at other PGs. Is there anything too common in the non-empty
> >> "blocked_by" blocks?
> >
> > I'll take a look at that tomorrow, perhaps we can script something
> > meaningful.
>
> Hi Alexander
>
> While working on a script querying all PGs and making a list of all OSDs
> found in a blocked_by list, and how many times for each, I discovered
> something odd about pool 38:
>
> "
> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
> OSDs blocking other OSDs:
<snip>
> All PGs in the pool are active+clean so why are there any blocked_by at
> all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them all"
scenario that I was looking for. I think this is something bogus that can
probably be cleared in your example by restarting osd.89 (i.e., the one
being blocked).
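For reference, a rough sketch of such a per-pool blocked_by tally
(untested, and probably not identical to your blocked_by.sh; it assumes
jq is installed and that the JSON layout below matches what your Ceph
version emits) could look like this:

#!/bin/sh
# Tally how often each OSD appears in a non-empty blocked_by list for
# one pool; pass the pool id as the first argument, e.g. "38".
pool="$1"
ceph pg ls "$pool" -f json \
  | jq -r '(.pg_stats? // .)[].pgid' \
  | while read -r pgid; do
      # Pull every blocked_by entry found anywhere in the query output;
      # the path is deliberately loose to cope with layout differences.
      ceph pg "$pgid" query | jq -r '.. | .blocked_by? // empty | .[]'
    done \
  | sort -n | uniq -c | sort -rn

An OSD that keeps showing up at the top of that list across many PGs
would be the kind of common blocker worth restarting first. Note that
querying every PG one by one is slow on pools with thousands of PGs.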
>
> Mvh.
>
> Torkil
>
> >> I think we have to look for patterns in other ways, too. One tool that
> >> produces good visualizations is TheJJ balancer. Although it is called
> >> a "balancer," it can also visualize the ongoing backfills.
> >>
> >> The tool is available at
> >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> >>
> >> Run it as follows:
> >>
> >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
> >
> > Output attached.
> >
> > Thanks again.
> >
> > Mvh.
> >
> > Torkil
> >
> >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>
> >>> Hi Alex
> >>>
> >>> New query output attached after restarting both OSDs. OSD 237 is no
> >>> longer mentioned but it unfortunately made no difference for the number
> >>> of backfills which went 59->62->62.
> >>>
> >>> Mvh.
> >>>
> >>> Torkil
> >>>
> >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> >>>> Hi Torkil,
> >>>>
> >>>> I have looked at the files that you attached. They were helpful: pool
> >>>> 11 is problematic, it complains about degraded objects for no obvious
> >>>> reason. I think that is the blocker.
> >>>>
> >>>> I also noted that you mentioned peering problems, and I suspect that
> >>>> they are not completely resolved. As a somewhat-irrational move, to
> >>>> confirm this theory, you can restart osd.237 (it is mentioned at the
> >>>> end of query.11.fff.txt, although I don't understand why it is there)
> >>>> and then osd.298 (it is the primary for that pg) and see if any
> >>>> additional backfills are unblocked after that. Also, please re-query
> >>>> that PG again after the OSD restart.
> >>>>
> >>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>
> >>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>>>>> Hi Torkil,
> >>>>>
> >>>>> Hi Alexander
> >>>>>
> >>>>>> I have looked at the CRUSH rules, and the equivalent rules work on my
> >>>>>> test cluster. So this cannot be the cause of the blockage.
> >>>>>
> >>>>> Thank you for taking the time =)
> >>>>>
> >>>>>> What happens if you increase the osd_max_backfills setting
> >>>>>> temporarily?
> >>>>>
> >>>>> We already had the mclock override option in place and I re-enabled our
> >>>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> >>>>> on how full they are. Active backfills went from 16 to 53 which is
> >>>>> probably because default osd_max_backfills for mclock is 1.
> >>>>>
> >>>>> I think 53 is still a low number of active backfills given the large
> >>>>> percentage misplaced.
> >>>>>
> >>>>>> It may be a good idea to investigate a few of the stalled PGs. Please
> >>>>>> run commands similar to this one:
> >>>>>>
> >>>>>> ceph pg 37.0 query > query.37.0.txt
> >>>>>> ceph pg 37.1 query > query.37.1.txt
> >>>>>> ...
> >>>>>> and the same for the other affected pools.
> >>>>>
> >>>>> A few samples attached.
> >>>>>
> >>>>>> Still, I must say that some of your rules are actually unsafe.
> >>>>>>
> >>>>>> The 4+2 rule as used by rbd_ec_data will not survive a
> >>>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> >>>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >>>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >>>>>> the number of data chunks. However, the pool requires min_size 5, so
> >>>>>> all PGs will be inactive (to prevent data corruption) and will stay
> >>>>>> inactive until the datacenter comes up again. However, please don't
> >>>>>> set min_size to 4 - then, any additional incident (like a defective
> >>>>>> disk) will lead to data loss, and the shards in the datacenter which
> >>>>>> went offline would be useless because they do not correspond to the
> >>>>>> updated shards written by the clients.
> >>>>>
> >>>>> Thanks for the explanation. This is an old pool predating the 3 DC
> >>>>> setup and we'll migrate the data to a 4+5 pool when we can.
> >>>>>
> >>>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> >>>>>> number of data chunks. See above why it is bad. Please set min_size
> >>>>>> to 5.
> >>>>>
> >>>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
> >>>>> creating+incomplete) when we created the pool. It's back to 5 now.
> >>>>>
> >>>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
> >>>>>> 100% active+clean.
> >>>>>
> >>>>> There is very little data in this pool, that is probably the main
> >>>>> reason.
> >>>>>
> >>>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
> >>>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
> >>>>>
> >>>>> Copy that. Didn't seem to make a difference though, and we have
> >>>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
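For completeness: assuming the centralized config database is in use
(adjust if these are set in ceph.conf instead), that change can be
applied and verified like this, a sketch using the values discussed
above:

ceph config set global mon_max_pg_per_osd 400
ceph config dump | grep -E 'mon_max_pg_per_osd|osd_max_pg_per_osd_hard_ratio'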
> >>>>>
> >>>>> Mvh.
> >>>>>
> >>>>> Torkil
> >>>>>
> >>>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> >>>>>>>> is insufficient. For every erasure code profile mentioned in the
> >>>>>>>> output, please also run something like this:
> >>>>>>>>
> >>>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>>>>>>
> >>>>>>>> ...where "prf-for-ec-data" is the name that appears after the words
> >>>>>>>> "erasure profile" in the "ceph osd pool ls detail" output.
> >>>>>>>
> >>>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >>>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> >>>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> >>>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> >>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> >>>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
> >>>>>>> application rbd
> >>>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
> >>>>>>> size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048
> >>>>>>> pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486
> >>>>>>> flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> >>>>>>> compression_algorithm zstd compression_mode aggressive
> >>>>>>> application cephfs
> >>>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd
> >>>>>>> size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32
> >>>>>>> pgp_num 32 autoscale_mode warn last_change 2198930 lfor
> >>>>>>> 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps
> >>>>>>> stripe_width 16384 compression_algorithm zstd compression_mode
> >>>>>>> aggressive application rbd
> >>>>>>>
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> >>>>>>> crush-device-class=hdd
> >>>>>>> crush-failure-domain=host
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=2
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> >>>>>>> crush-device-class=hdd
> >>>>>>> crush-failure-domain=datacenter
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=5
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
> >>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> >>>>>>> crush-device-class=ssd
> >>>>>>> crush-failure-domain=datacenter
> >>>>>>> crush-root=default
> >>>>>>> jerasure-per-chunk-alignment=false
> >>>>>>> k=4
> >>>>>>> m=5
> >>>>>>> plugin=jerasure
> >>>>>>> technique=reed_sol_van
> >>>>>>> w=8
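As an aside: if you want to see exactly what rule one of these profiles
would generate on its own, something like this should do it (a sketch;
the test-k4m5-hdd rule name is arbitrary, and the command creates a real
but unused rule, so remove it again afterwards):

ceph osd crush rule create-erasure test-k4m5-hdd DRCMR_k4m5_datacenter_hdd
ceph osd crush rule dump test-k4m5-hdd
ceph osd crush rule rm test-k4m5-hdd

With k=4, m=5 and crush-failure-domain=datacenter, that auto-generated
rule would want nine independent datacenters, which is why the
hand-edited rules below split placement across 3 datacenters x 3 hosts
instead.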
> >>>>>>>
> >>>>>>> But as I understand it those profiles are only used to create the
> >>>>>>> initial crush rule for the pool, and we have manually edited those
> >>>>>>> along the way. Here are the 3 rules in use for the 3 EC pools:
> >>>>>>>
> >>>>>>> rule rbd_ec_data {
> >>>>>>>     id 0
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class hdd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 2 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>> rule cephfs.hdd.data {
> >>>>>>>     id 7
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class hdd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 3 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>> rule rbd.ssd.data {
> >>>>>>>     id 8
> >>>>>>>     type erasure
> >>>>>>>     step set_chooseleaf_tries 5
> >>>>>>>     step set_choose_tries 100
> >>>>>>>     step take default class ssd
> >>>>>>>     step choose indep 0 type datacenter
> >>>>>>>     step chooseleaf indep 3 type host
> >>>>>>>     step emit
> >>>>>>> }
> >>>>>>>
> >>>>>>> Which should first pick all 3 datacenters in the choose step and then
> >>>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
> >>>>>>> respectively.
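A quick way to sanity-check that, if you want, is to let crushtool
enumerate the mappings (a sketch; the rule ids are the ones from your
dump, and --show-bad-mappings should print nothing if every input really
gets the full set of OSDs):

ceph osd getcrushmap -o crushmap.bin
# rule 7 = cephfs.hdd.data, rule 8 = rbd.ssd.data, both EC 4+5 -> 9 shards
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings | head
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings
# rule 0 = rbd_ec_data, EC 4+2 -> 6 shards
crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-bad-mappings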
> >>>>>>>
> >>>>>>> Mvh.
> >>>>>>>
> >>>>>>> Torkil
> >>>>>>>
> >>>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
> >>>>>>>> <patrakov@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Torkil,
> >>>>>>>>>
> >>>>>>>>> I take my previous response back.
> >>>>>>>>>
> >>>>>>>>> You have an erasure-coded pool with nine shards but only three
> >>>>>>>>> datacenters. This, in general, cannot work. You need either nine
> >>>>>>>>> datacenters or a very custom CRUSH rule. The second option may not
> >>>>>>>>> be available if the current EC setup is already incompatible, as
> >>>>>>>>> there is no way to change the EC parameters.
> >>>>>>>>>
> >>>>>>>>> It would help if you provided the output of "ceph osd pool ls
> >>>>>>>>> detail".
> >>>>>>>>>
> >>>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
> >>>>>>>>> <patrakov@xxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Torkil,
> >>>>>>>>>>
> >>>>>>>>>> Unfortunately, your files contain nothing obviously bad or
> >>>>>>>>>> suspicious, except for two things: more PGs than usual and bad
> >>>>>>>>>> balance.
> >>>>>>>>>>
> >>>>>>>>>> What's your "mon max pg per osd" setting?
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
> >>>>>>>>>> <torkil@xxxxxxxx> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> >>>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The other output is too big for pastebin and I'm not familiar
> >>>>>>>>>>>>> with paste services, any suggestion for a preferred way to
> >>>>>>>>>>>>> share such output?
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can attached files to the mail here on the list.
> >>>>>>>>>>>
> >>>>>>>>>>> Doh, for some reason I was sure attachments would be stripped.
> >>>>>>>>>>> Thanks, attached.
> >>>>>>>>>>>
> >>>>>>>>>>> Mvh.
> >>>>>>>>>>>
> >>>>>>>>>>> Torkil
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Alexander E. Patrakov
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Alexander E. Patrakov
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Torkil Svensgaard
> >>>>>>> Systems Administrator
> >>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>>>>>> Copenhagen University Hospital Amager and Hvidovre
> >>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>>>>>
> >>>>> --
> >>>>> Torkil Svensgaard
> >>>>> Systems Administrator
> >>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>>>> Copenhagen University Hospital Amager and Hvidovre
> >>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>>>
> >>> --
> >>> Torkil Svensgaard
> >>> Systems Administrator
> >>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> >>> Copenhagen University Hospital Amager and Hvidovre
> >>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> >>
> >
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: torkil@xxxxxxxx

--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx