First try "ceph osd down 89" > On Mar 25, 2024, at 15:37, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote: > > On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote: >> >> >> >> On 24/03/2024 01:14, Torkil Svensgaard wrote: >>> On 24-03-2024 00:31, Alexander E. Patrakov wrote: >>>> Hi Torkil, >>> >>> Hi Alexander >>> >>>> Thanks for the update. Even though the improvement is small, it is >>>> still an improvement, consistent with the osd_max_backfills value, and >>>> it proves that there are still unsolved peering issues. >>>> >>>> I have looked at both the old and the new state of the PG, but could >>>> not find anything else interesting. >>>> >>>> I also looked again at the state of PG 37.1. It is known what blocks >>>> the backfill of this PG; please search for "blocked_by." However, this >>>> is just one data point, which is insufficient for any conclusions. Try >>>> looking at other PGs. Is there anything too common in the non-empty >>>> "blocked_by" blocks? >>> >>> I'll take a look at that tomorrow, perhaps we can script something >>> meaningful. >> >> Hi Alexander >> >> While working on a script querying all PGs and making a list of all OSDs >> found in a blocked_by list, and how many times for each, I discovered >> something odd about pool 38: >> >> " >> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38 >> OSDs blocking other OSDs: > <snip> > >> All PGs in the pool are active+clean so why are there any blocked_by at >> all? One example attached. > > I don't know. In any case, it doesn't match the "one OSD blocks them > all" scenario that I was looking for. I think this is something bogus > that can probably be cleared in your example by restarting osd.89 > (i.e, the one being blocked). > >> >> Mvh. >> >> Torkil >> >>>> I think we have to look for patterns in other ways, too. One tool that >>>> produces good visualizations is TheJJ balancer. Although it is called >>>> a "balancer," it can also visualize the ongoing backfills. >>>> >>>> The tool is available at >>>> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py >>>> >>>> Run it as follows: >>>> >>>> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt >>> >>> Output attached. >>> >>> Thanks again. >>> >>> Mvh. >>> >>> Torkil >>> >>>> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil@xxxxxxxx> >>>> wrote: >>>>> >>>>> Hi Alex >>>>> >>>>> New query output attached after restarting both OSDs. OSD 237 is no >>>>> longer mentioned but it unfortunately made no difference for the number >>>>> of backfills which went 59->62->62. >>>>> >>>>> Mvh. >>>>> >>>>> Torkil >>>>> >>>>> On 23-03-2024 22:26, Alexander E. Patrakov wrote: >>>>>> Hi Torkil, >>>>>> >>>>>> I have looked at the files that you attached. They were helpful: pool >>>>>> 11 is problematic, it complains about degraded objects for no obvious >>>>>> reason. I think that is the blocker. >>>>>> >>>>>> I also noted that you mentioned peering problems, and I suspect that >>>>>> they are not completely resolved. As a somewhat-irrational move, to >>>>>> confirm this theory, you can restart osd.237 (it is mentioned at the >>>>>> end of query.11.fff.txt, although I don't understand why it is there) >>>>>> and then osd.298 (it is the primary for that pg) and see if any >>>>>> additional backfills are unblocked after that. Also, please re-query >>>>>> that PG again after the OSD restart. 
>>>>>>
>>>>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil@xxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
>>>>>>>> Hi Torkil,
>>>>>>>
>>>>>>> Hi Alexander
>>>>>>>
>>>>>>>> I have looked at the CRUSH rules, and the equivalent rules work
>>>>>>>> on my test cluster. So this cannot be the cause of the blockage.
>>>>>>>
>>>>>>> Thank you for taking the time =)
>>>>>>>
>>>>>>>> What happens if you increase the osd_max_backfills setting
>>>>>>>> temporarily?
>>>>>>>
>>>>>>> We already had the mclock override option in place and I
>>>>>>> re-enabled our babysitter script, which sets osd_max_backfills
>>>>>>> per OSD to 1-3 depending on how full they are. Active backfills
>>>>>>> went from 16 to 53, which is probably because the default
>>>>>>> osd_max_backfills for mclock is 1.
>>>>>>>
>>>>>>> I think 53 is still a low number of active backfills given the
>>>>>>> large percentage misplaced.
>>>>>>>
>>>>>>>> It may be a good idea to investigate a few of the stalled PGs.
>>>>>>>> Please run commands similar to this one:
>>>>>>>>
>>>>>>>> ceph pg 37.0 query > query.37.0.txt
>>>>>>>> ceph pg 37.1 query > query.37.1.txt
>>>>>>>> ...
>>>>>>>>
>>>>>>>> and the same for the other affected pools.
>>>>>>>
>>>>>>> A few samples attached.
>>>>>>>
>>>>>>>> Still, I must say that some of your rules are actually unsafe.
>>>>>>>>
>>>>>>>> The 4+2 rule as used by rbd_ec_data will not survive a
>>>>>>>> datacenter-offline incident. Namely, for each PG, it chooses
>>>>>>>> OSDs from two hosts in each datacenter, so 6 OSDs total. When a
>>>>>>>> datacenter is offline, you will, therefore, have only 4 OSDs up,
>>>>>>>> which is exactly the number of data chunks. However, the pool
>>>>>>>> requires min_size 5, so all PGs will be inactive (to prevent
>>>>>>>> data corruption) and will stay inactive until the datacenter
>>>>>>>> comes up again. However, please don't set min_size to 4 - then,
>>>>>>>> any additional incident (like a defective disk) will lead to
>>>>>>>> data loss, and the shards in the datacenter which went offline
>>>>>>>> would be useless because they do not correspond to the updated
>>>>>>>> shards written by the clients.
>>>>>>>
>>>>>>> Thanks for the explanation. This is an old pool predating the
>>>>>>> 3 DC setup, and we'll migrate the data to a 4+5 pool when we can.
>>>>>>>
>>>>>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to
>>>>>>>> the number of data chunks. See above for why that is bad. Please
>>>>>>>> set min_size to 5.
>>>>>>>
>>>>>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
>>>>>>> creating+incomplete) when we created the pool. It's back to 5 now.
>>>>>>>
>>>>>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs
>>>>>>>> are 100% active+clean.
>>>>>>>
>>>>>>> There is very little data in this pool; that is probably the main
>>>>>>> reason.
>>>>>>>
>>>>>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs
>>>>>>>> that have 300+ PGs; the observed maximum is 347. Please set it
>>>>>>>> to 400.
>>>>>>>
>>>>>>> Copy that. Didn't seem to make a difference though, and we have
>>>>>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
>>>>>>>
>>>>>>> Mvh.
>>>>>>>
>>>>>>> Torkil
>>>>>>>
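For reference, the settings discussed above map onto commands along
these lines (a sketch, assuming a Quincy/Reef cluster with the mclock
scheduler; the pool and option names are the ones from the thread):

  # Let a manual osd_max_backfills value take effect despite mclock
  # (the "mclock override" option mentioned above):
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 3   # per-OSD cap; the script uses 1-3
  # Put the 4+5 EC pool back to min_size k+1:
  ceph osd pool set cephfs.hdd.data min_size 5
  # Headroom for the OSDs already holding 300+ PGs:
  ceph config set global mon_max_pg_per_osd 400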
>>>>>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard
>>>>>>>> <torkil@xxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
>>>>>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by
>>>>>>>>>> itself is insufficient. For every erasure code profile
>>>>>>>>>> mentioned in the output, please also run something like this:
>>>>>>>>>>
>>>>>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
>>>>>>>>>>
>>>>>>>>>> ...where "prf-for-ec-data" is the name that appears after the
>>>>>>>>>> words "erasure profile" in the "ceph osd pool ls detail" output.
>>>>>>>>>
>>>>>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
>>>>>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
>>>>>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
>>>>>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
>>>>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>>>>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>>>>>>>> application rbd
>>>>>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
>>>>>>>>> size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048
>>>>>>>>> pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486
>>>>>>>>> flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>>>>>>>> compression_algorithm zstd compression_mode aggressive
>>>>>>>>> application cephfs
>>>>>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd
>>>>>>>>> size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32
>>>>>>>>> pgp_num 32 autoscale_mode warn last_change 2198930 lfor
>>>>>>>>> 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps
>>>>>>>>> stripe_width 16384 compression_algorithm zstd compression_mode
>>>>>>>>> aggressive application rbd
>>>>>>>>>
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
>>>>>>>>> crush-device-class=hdd
>>>>>>>>> crush-failure-domain=host
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=2
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
>>>>>>>>> crush-device-class=hdd
>>>>>>>>> crush-failure-domain=datacenter
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=5
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
>>>>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
>>>>>>>>> crush-device-class=ssd
>>>>>>>>> crush-failure-domain=datacenter
>>>>>>>>> crush-root=default
>>>>>>>>> jerasure-per-chunk-alignment=false
>>>>>>>>> k=4
>>>>>>>>> m=5
>>>>>>>>> plugin=jerasure
>>>>>>>>> technique=reed_sol_van
>>>>>>>>> w=8
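For context, a profile like DRCMR_k4m5_datacenter_hdd above would
originally have been created with something like the following
(illustrative only; a pool's EC parameters cannot be changed after
creation, so this is not something to re-run against an existing pool):

  ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd \
      k=4 m=5 plugin=jerasure technique=reed_sol_van \
      crush-device-class=hdd crush-failure-domain=datacenter \
      crush-root=default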
>>>>>>>>>
>>>>>>>>> But as I understand it those profiles are only used to create
>>>>>>>>> the initial crush rule for the pool, and we have manually
>>>>>>>>> edited those along the way. Here are the 3 rules in use for the
>>>>>>>>> 3 EC pools:
>>>>>>>>>
>>>>>>>>> rule rbd_ec_data {
>>>>>>>>>     id 0
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class hdd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 2 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>> rule cephfs.hdd.data {
>>>>>>>>>     id 7
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class hdd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 3 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>> rule rbd.ssd.data {
>>>>>>>>>     id 8
>>>>>>>>>     type erasure
>>>>>>>>>     step set_chooseleaf_tries 5
>>>>>>>>>     step set_choose_tries 100
>>>>>>>>>     step take default class ssd
>>>>>>>>>     step choose indep 0 type datacenter
>>>>>>>>>     step chooseleaf indep 3 type host
>>>>>>>>>     step emit
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> These should first pick all 3 datacenters in the choose step
>>>>>>>>> and then either 2 or 3 hosts in the chooseleaf step, matching
>>>>>>>>> EC 4+2 and 4+5 respectively.
>>>>>>>>>
>>>>>>>>> Mvh.
>>>>>>>>>
>>>>>>>>> Torkil
>>>>>>>>>
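Rules like these can be sanity-checked offline with standard crushtool
usage (this is not from the thread; rule id 7 and --num-rep 9
correspond to the 4+5 cephfs.hdd.data pool above):

  ceph osd getcrushmap -o crushmap.bin   # export the compiled CRUSH map
  crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings | head
  # --show-bad-mappings prints nothing if every PG gets a full set of 9 OSDs:
  crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings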
>>>>>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
>>>>>>>>>> <patrakov@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Torkil,
>>>>>>>>>>>
>>>>>>>>>>> I take my previous response back.
>>>>>>>>>>>
>>>>>>>>>>> You have an erasure-coded pool with nine shards but only
>>>>>>>>>>> three datacenters. This, in general, cannot work. You need
>>>>>>>>>>> either nine datacenters or a very custom CRUSH rule. The
>>>>>>>>>>> second option may not be available if the current EC setup is
>>>>>>>>>>> already incompatible, as there is no way to change the EC
>>>>>>>>>>> parameters.
>>>>>>>>>>>
>>>>>>>>>>> It would help if you provided the output of "ceph osd pool ls
>>>>>>>>>>> detail".
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>>>>>>>>>>> <patrakov@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Torkil,
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, your files contain nothing obviously bad or
>>>>>>>>>>>> suspicious, except for two things: more PGs than usual and
>>>>>>>>>>>> bad balance.
>>>>>>>>>>>>
>>>>>>>>>>>> What's your "mon max pg per osd" setting?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
>>>>>>>>>>>> <torkil@xxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
>>>>>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil
>>>>>>>>>>>>>> Svensgaard wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other output is too big for pastebin and I'm not
>>>>>>>>>>>>>>> familiar with paste services, any suggestion for a
>>>>>>>>>>>>>>> preferred way to share such output?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can attach files to the mail here on the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Doh, for some reason I was sure attachments would be
>>>>>>>>>>>>> stripped. Thanks, attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mvh.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Torkil
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Alexander E. Patrakov
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Alexander E. Patrakov
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Torkil Svensgaard
>>>>>>>>> Systems Administrator
>>>>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Torkil Svensgaard
>>>>>>> Systems Administrator
>>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>>>
>>>>>
>>>>> --
>>>>> Torkil Svensgaard
>>>>> Systems Administrator
>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>
>>>
>>
>> --
>> Torkil Svensgaard
>> Sysadmin
>> MR-Forskningssektionen, afs. 714
>> DRCMR, Danish Research Centre for Magnetic Resonance
>> Hvidovre Hospital
>> Kettegård Allé 30
>> DK-2650 Hvidovre
>> Denmark
>> Tel: +45 386 22828
>> E-mail: torkil@xxxxxxxx
>
> --
> Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx