On Wed, Dec 13, 2017 at 11:39 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Boom!! Fixed it. Not sure if the behavior I stumbled across is correct, but
> this has the potential to break a few things for people moving from Jewel to
> Luminous if they already had a few too many PGs.
>
> Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68
> and saw this mentioned in the logs:
>
> osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400
>
> This made me search through the code for this warning string:
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221
>
> Which jogged my memory about the changes in Luminous regarding the max-PGs
> warning, and in particular these two config options:
>
> mon_max_pg_per_osd
> osd_max_pg_per_osd_hard_ratio
>
> In my cluster I have just over 200 PGs per OSD, but the node with OSD.68 in it
> has 8TB disks instead of the 3TB disks in the rest of the cluster, so its OSDs
> were taking a lot more PGs than the average would suggest. In Luminous,
> 200 x 2 gives a hard limit of 400, which is what that error message in the log
> suggests is the limit. I set the osd_max_pg_per_osd_hard_ratio option to 3,
> restarted the OSD, and hey presto, everything fell into line.
>
> Now a question. I get the idea behind these settings: to stop people creating
> too many pools, or pools with too many PGs. But is it correct that they can
> break an existing pool which is merely creating a new PG on an OSD because the
> CRUSH layout has been modified?

It would be good to capture this in a tracker Nick so it can be explored in more depth.

> Nick
>
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> Sent: 13 December 2017 11:14
> To: 'Gregory Farnum' <gfarnum@xxxxxxxxxx>
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Odd object blocking IO on PG
>
> On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> That doesn't look like an RBD object -- any idea who is
>> "client.34720596.1:212637720"?
>
> So I think these might be proxy ops from the cache tier, as there are also
> blocked ops on one of the cache-tier OSDs, but this time it actually lists
> the object name. Blocked op on the cache tier:
>
>     "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
>         17:f3831e75:::rbd_data.15a5e20238e1f29.00000000000388ad:head [set-alloc-hint
>         object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
>         RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
>     "initiated_at": "2017-12-12 16:25:32.435718",
>     "age": 13996.681147,
>     "duration": 13996.681203,
>     "type_data": {
>         "flag_point": "reached pg",
>         "client_info": {
>             "client": "client.34720596",
>             "client_addr": "10.3.31.41:0/2600619462",
>             "tid": 212637720
>
> I'm a bit baffled at the moment about what's going on. The pg query (attached)
> is not showing in the main status that it has been blocked from peering or
> that there are any missing objects. I've tried restarting all the OSDs I can
> see relating to the PG in case they needed a bit of a nudge.
>
> Did that fix anything? I don't see anything immediately obvious, but I'm not
> practiced in quickly reading that pg state output.
>
> What's the output of "ceph -s"?
>
> Hi Greg,
>
> No, restarting OSDs didn't seem to help. But I did make some progress late
> last night. By stopping OSD.68 the cluster unlocks itself and IO can progress.
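
That fits the maybe_wait_for_max_pg cap Nick describes at the top of the thread.
For anyone hitting the same symptom, a rough sketch of how to check an OSD
against the cap and apply the workaround he found (Luminous-era commands from
memory, so treat this as a sketch; the osd id and the value 3 are taken from
this thread and will need adjusting for your own cluster):

    # PGS column = PGs currently mapped to each OSD; compare against
    # mon_max_pg_per_osd (default 200) x osd_max_pg_per_osd_hard_ratio
    # (default 2 at the time), i.e. the 400 hard cap seen in the log line
    ceph osd df

    # on the node hosting osd.68: confirm what the running OSD thinks
    # those two options are set to
    ceph daemon osd.68 config get mon_max_pg_per_osd
    ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio

    # the workaround from the top of the thread: raise the hard ratio in
    # ceph.conf on the affected node, then restart the OSD
    [osd]
    osd max pg per osd hard ratio = 3
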
> However as soon as it starts back up, 0.1cf and a couple of other PGs again
> get stuck in an activating state. If I out the OSD, either with it up or down,
> then some other PGs seem to get hit by the same problem as CRUSH moves PG
> mappings around to other OSDs.
>
> So there definitely seems to be some sort of weird peering issue somewhere.
> I have seen a very similar issue before on this cluster where, after running
> the crush reweight script to balance OSD utilization, the weight got set too
> low and PGs were unable to peer. I'm not convinced that is what's happening
> here, as none of the weights have changed, but I'm intending to explore it
> further just in case.
>
> With 68 down:
>
>     pgs: 1071783/48650631 objects degraded (2.203%)
>          5923 active+clean
>          399  active+undersized+degraded
>          7    active+clean+scrubbing+deep
>          7    active+clean+remapped
>
> With it up:
>
>     pgs: 0.047% pgs not active
>          67271/48651279 objects degraded (0.138%)
>          15602/48651279 objects misplaced (0.032%)
>          6051 active+clean
>          273  active+recovery_wait+degraded
>          4    active+clean+scrubbing+deep
>          4    active+remapped+backfill_wait
>          3    activating+remapped
>               active+recovering+degraded
>
> PG dump:
>
> ceph pg dump | grep activatin
> dumped all
>
> 2.389   0     0 0 0 0  0            1500 1500  activating+remapped  2017-12-13 11:08:50.990526
>         76271'34230     106239:160310   [68,60,58,59,29,23] 68  [62,60,58,59,29,23] 62
>         76271'34230     2017-12-13 09:00:08.359690  76271'34230     2017-12-10 10:05:10.931366
>
> 0.1cf   3947  0 0 0 0  16472186880  1577 1577  activating+remapped  2017-12-13 11:08:50.641034
>         106236'7512915  106239:6176548  [34,68,8] 34            [34,8,53] 34
>         106138'7512682  2017-12-13 10:27:37.400613  106138'7512682  2017-12-13 10:27:37.400613
>
> 2.210   0     0 0 0 0  0            1500 1500  activating+remapped  2017-12-13 11:08:50.686193
>         76271'33304     106239:96797    [68,67,34,36,16,15] 68  [62,67,34,36,16,15] 62
>         76271'33304     2017-12-12 00:49:21.038437  76271'33304     2017-12-10 16:05:12.751425
>
>>
>> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Does anyone know what this object (0.ae78c1cf) might be? It's not your
>> > normal run-of-the-mill RBD object and I can't seem to find it in the
>> > pool using rados --all ls. It seems to be leaving the 0.1cf PG stuck in an
>> > activating+remapped state and blocking IO. Pool 0 is just a pure RBD pool
>> > with a cache tier above it. There is no current mention of unfound
>> > objects or any other obvious issues.
>> >
>> > There is some backfilling going on on another OSD which was upgraded
>> > to bluestore, which was when the issue started. But I can't see any
>> > link in the PG dump with the upgraded OSD. My only thought so far is to
>> > wait for this backfilling to finish and then deep-scrub this PG and
>> > see if that reveals anything?
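
One thing that stands out in the pg dump further up: in all three stuck PGs,
osd.68 is in the UP set but not the ACTING set, which is consistent with its PG
creations being withheld. A quick way to pull just that mapping and the peering
detail for the worst offender (pg id taken from this thread; treat this as a
sketch, the output layout varies a little between releases):

    # shows the up set vs the acting set for the stuck PG
    ceph pg map 0.1cf

    # full peering detail, presumably how the attached query was produced;
    # the recovery_state section (and any blocked_by entries) is usually
    # the interesting part
    ceph pg 0.1cf query
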
>> >
>> > Thanks,
>> > Nick
>> >
>> >     "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
>> >         (undecoded)
>> >         ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
>> >         e105014)",
>> >     "initiated_at": "2017-12-12 17:10:50.030660",
>> >     "age": 335.948290,
>> >     "duration": 335.948383,
>> >     "type_data": {
>> >         "flag_point": "delayed",
>> >         "events": [
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030660",
>> >                 "event": "initiated"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030692",
>> >                 "event": "queued_for_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030719",
>> >                 "event": "reached_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.030727",
>> >                 "event": "waiting for peered"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.197353",
>> >                 "event": "reached_pg"
>> >             },
>> >             {
>> >                 "time": "2017-12-12 17:10:50.197355",
>> >                 "event": "waiting for peered"
>>
>> --
>> Jason
>

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com