On Wed, Dec 13, 2017 at 11:39 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Boom!! Fixed it. Not sure if the behavior I stumbled across is correct, but
> this has the potential to break a few things for people moving from Jewel to
> Luminous if they already had a few too many PGs.
>
> Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68
> and saw this mentioned in the logs:
>
> osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400
>
> This made me search through the code for this warning string:
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221
>
> Which jogged my memory about the changes in Luminous regarding the max-PGs
> warning, and in particular these two config options:
>
> mon_max_pg_per_osd
> osd_max_pg_per_osd_hard_ratio
>
> In my cluster I have just over 200 PGs per OSD, but the node with OSD.68 in it
> has 8TB disks instead of the 3TB disks in the rest of the cluster, so its OSDs
> were taking a lot more PGs than the average would suggest. In Luminous,
> 200 x 2 gives a hard limit of 400, which is what that error message in the log
> suggests is the limit. I set the osd_max_pg_per_osd_hard_ratio option to 3,
> restarted the OSD, and hey presto, everything fell into line.
>
> Now a question. I get the idea behind these settings: to stop people creating
> too many pools, or pools with too many PGs. But is it correct that they can
> break an existing pool which is merely creating a new PG on an OSD because the
> CRUSH layout has been modified?

It would be good to capture this in a tracker Nick so it can be explored in more depth.

> Nick
>
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> Sent: 13 December 2017 11:14
> To: 'Gregory Farnum' <gfarnum@xxxxxxxxxx>
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Odd object blocking IO on PG
>
> On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> That doesn't look like an RBD object -- any idea who is
>> "client.34720596.1:212637720"?
>
> So I think these might be proxy ops from the cache tier, as there are also
> blocked ops on one of the cache-tier OSDs, but this time it actually lists
> the object name. Blocked op on the cache tier:
>
>     "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
>         17:f3831e75:::rbd_data.15a5e20238e1f29.00000000000388ad:head [set-alloc-hint
>         object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
>         RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
>     "initiated_at": "2017-12-12 16:25:32.435718",
>     "age": 13996.681147,
>     "duration": 13996.681203,
>     "type_data": {
>         "flag_point": "reached pg",
>         "client_info": {
>             "client": "client.34720596",
>             "client_addr": "10.3.31.41:0/2600619462",
>             "tid": 212637720
>
> I'm a bit baffled at the moment about what's going on. The pg query (attached)
> is not showing in the main status that it has been blocked from peering or
> that there are any missing objects. I've tried restarting all the OSDs I can
> see relating to the PG in case they needed a bit of a nudge.
>
> Did that fix anything? I don't see anything immediately obvious, but I'm not
> practiced in quickly reading that pg state output.
>
> What's the output of "ceph -s"?
>
> Hi Greg,
>
> No, restarting OSDs didn't seem to help. But I did make some progress late
> last night. By stopping OSD.68 the cluster unlocks itself and IO can progress.
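
That fits the maybe_wait_for_max_pg cap Nick describes at the top of the thread.
For anyone hitting the same symptom, a rough sketch of how to check an OSD
against the cap and apply the workaround he found (Luminous-era commands from
memory, so treat this as a sketch; the osd id and the value 3 are taken from
this thread and will need adjusting for your own cluster):

    # PGS column = PGs currently mapped to each OSD; compare against
    # mon_max_pg_per_osd (default 200) x osd_max_pg_per_osd_hard_ratio
    # (default 2 at the time), i.e. the 400 hard cap seen in the log line
    ceph osd df

    # on the node hosting osd.68: confirm what the running OSD thinks
    # those two options are set to
    ceph daemon osd.68 config get mon_max_pg_per_osd
    ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio

    # the workaround from the top of the thread: raise the hard ratio in
    # ceph.conf on the affected node, then restart the OSD
    [osd]
    osd max pg per osd hard ratio = 3
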
> However as soon as it starts back up, 0.1cf and a couple of other PGs again
> get stuck in an activating state. If I out the OSD, either with it up or down,
> then some other PGs seem to get hit by the same problem as CRUSH moves PG
> mappings around to other OSDs.
>
> So there definitely seems to be some sort of weird peering issue somewhere.
> I have seen a very similar issue before on this cluster where, after running
> the crush reweight script to balance OSD utilization, the weight got set too
> low and PGs were unable to peer. I'm not convinced that is what's happening
> here, as none of the weights have changed, but I'm intending to explore it
> further just in case.
>
> With 68 down:
>
>     pgs: 1071783/48650631 objects degraded (2.203%)
>          5923 active+clean
>          399  active+undersized+degraded
>          7    active+clean+scrubbing+deep
>          7    active+clean+remapped
>
> With it up:
>
>     pgs: 0.047% pgs not active
>          67271/48651279 objects degraded (0.138%)
>          15602/48651279 objects misplaced (0.032%)
>          6051 active+clean
>          273  active+recovery_wait+degraded
>          4    active+clean+scrubbing+deep
>          4    active+remapped+backfill_wait
>          3    activating+remapped
>               active+recovering+degraded
>
> PG dump:
>
> ceph pg dump | grep activatin
> dumped all
>
> 2.389   0     0 0 0 0  0            1500 1500  activating+remapped  2017-12-13 11:08:50.990526
>         76271'34230     106239:160310   [68,60,58,59,29,23] 68  [62,60,58,59,29,23] 62
>         76271'34230     2017-12-13 09:00:08.359690  76271'34230     2017-12-10 10:05:10.931366
>
> 0.1cf   3947  0 0 0 0  16472186880  1577 1577  activating+remapped  2017-12-13 11:08:50.641034
>         106236'7512915  106239:6176548  [34,68,8] 34            [34,8,53] 34
>         106138'7512682  2017-12-13 10:27:37.400613  106138'7512682  2017-12-13 10:27:37.400613
>
> 2.210   0     0 0 0 0  0            1500 1500  activating+remapped  2017-12-13 11:08:50.686193
>         76271'33304     106239:96797    [68,67,34,36,16,15] 68  [62,67,34,36,16,15] 62
>         76271'33304     2017-12-12 00:49:21.038437  76271'33304     2017-12-10 16:05:12.751425
>
>>
>> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Does anyone know what this object (0.ae78c1cf) might be? It's not your
>> > normal run-of-the-mill RBD object and I can't seem to find it in the
>> > pool using rados --all ls. It seems to be leaving the 0.1cf PG stuck in an
>> > activating+remapped state and blocking IO. Pool 0 is just a pure RBD pool
>> > with a cache tier above it. There is no current mention of unfound
>> > objects or any other obvious issues.
>> >
>> > There is some backfilling going on on another OSD which was upgraded
>> > to bluestore, which was when the issue started. But I can't see any
>> > link in the PG dump with the upgraded OSD. My only thought so far is to
>> > wait for this backfilling to finish and then deep-scrub this PG and
>> > see if that reveals anything?
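
One thing that stands out in the pg dump further up: in all three stuck PGs,
osd.68 is in the UP set but not the ACTING set, which is consistent with its PG
creations being withheld. A quick way to pull just that mapping and the peering
detail for the worst offender (pg id taken from this thread; treat this as a
sketch, the output layout varies a little between releases):

    # shows the up set vs the acting set for the stuck PG
    ceph pg map 0.1cf

    # full peering detail, presumably how the attached query was produced;
    # the recovery_state section (and any blocked_by entries) is usually
    # the interesting part
    ceph pg 0.1cf query
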
>> >
>> > Thanks,
>> > Nick
>> >
>> >     "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
>> >         (undecoded)
>> >         ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
>> >         e105014)",
>> >     "initiated_at": "2017-12-12 17:10:50.030660",
>> >     "age": 335.948290,
>> >     "duration": 335.948383,
>> >     "type_data": {
>> >         "flag_point": "delayed",
>> >         "events": [
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030660",
>> >                 "event": "initiated"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030692",
>> >                 "event": "queued_for_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030719",
>> >                 "event": "reached_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030727",
>> >                 "event": "waiting for peered"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.197353",
>> >                 "event": "reached_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.197355",
>> >                 "event": "waiting for peered"
>>
>> --
>> Jason
>

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com