Boom!! Fixed it. Not sure if the behaviour I stumbled across is correct, but this has the potential to break a few things for people moving from Jewel to Luminous if they happened to have a few too many PGs.

Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68 and saw this mentioned in the logs:

  osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400

That made me search through the code for the warning string:

  https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221

which jogged my memory about the changes in Luminous regarding the max-PGs warning, and in particular these two config options:

  mon_max_pg_per_osd
  osd_max_pg_per_osd_hard_ratio

In my cluster I have just over 200 PGs per OSD on average, but the node containing OSD.68 has 8TB disks instead of the 3TB disks in the rest of the cluster, which means its OSDs take a lot more PGs than the average would suggest. In Luminous, 200 x 2 gives a hard limit of 400 PGs per OSD, which is exactly the limit that log message complains about. I set the osd_max_pg_per_osd_hard_ratio option to 3, restarted the OSD, and hey presto, everything fell into line (rough commands below).

Now a question. I get the idea behind these settings: to stop people creating too many pools, or pools with too many PGs. But is it correct that they can break an existing pool which just happens to be creating a new PG instance on an OSD because the CRUSH layout has been modified?
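For reference, this is roughly what the fix looked like at my end. Treat it as a sketch from memory rather than gospel: it assumes a systemd-based install, that you are on the host running osd.68, and that you make the change in ceph.conf before restarting; double-check option and unit names against your own setup.

  # How many PGs each OSD is actually carrying (PGS column)
  ceph osd df tree

  # What the OSD currently thinks the hard ratio is (via its admin socket)
  ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio

  # Raise the hard ratio in /etc/ceph/ceph.conf, e.g.
  #   [osd]
  #   osd max pg per osd hard ratio = 3
  # then restart the affected OSD so it peers again
  systemctl restart ceph-osd@68

With the defaults of mon_max_pg_per_osd = 200 and osd_max_pg_per_osd_hard_ratio = 2, the OSD withholds new PG creation once it goes past 400, which is exactly the "403 >= 400" in the log line above.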
Nick

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk

On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <nick@xxxxxxxxxx> wrote:

Did that fix anything? I don't see anything immediately obvious, but I'm not practiced in quickly reading that pg state output. What's the output of "ceph -s"?

Hi Greg,

No, restarting OSDs didn't seem to help. But I did make some progress late last night. By stopping OSD.68 the cluster unlocks itself and IO can progress. However, as soon as it starts back up, 0.1cf and a couple of other PGs again get stuck in an activating state. If I out the OSD, either with it up or down, then some other PGs get hit by the same problem as CRUSH moves PG mappings around to other OSDs. So there definitely seems to be some sort of weird peering issue somewhere.

I have seen a very similar issue on this cluster before, where after running the crush reweight script to balance OSD utilization the weight got set too low and PGs were unable to peer. I'm not convinced that's what's happening here, as none of the weights have changed, but I'm intending to explore it further just in case.

With 68 down:

  pgs: 1071783/48650631 objects degraded (2.203%)
       5923 active+clean
       399  active+undersized+degraded
       7    active+clean+scrubbing+deep
       7    active+clean+remapped

With it up:

  pgs: 0.047% pgs not active
       67271/48651279 objects degraded (0.138%)
       15602/48651279 objects misplaced (0.032%)
       6051 active+clean
       273  active+recovery_wait+degraded
       4    active+clean+scrubbing+deep
       4    active+remapped+backfill_wait
       3    activating+remapped
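For anyone following along, these are the sort of commands I've been using to see which PGs are stuck and what they're waiting on; nothing exotic, just the usual suspects:

  # Name the PGs that are stuck and not yet active
  ceph health detail
  ceph pg dump_stuck inactive

  # Peering/activation detail for a single PG, including which
  # OSDs it is waiting on
  ceph pg 0.1cf query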
PG Dump:

  ceph pg dump | grep activatin
  dumped all
  2.389     0  0 0 0 0           0 1500 1500 activating+remapped 2017-12-13 11:08:50.990526    76271'34230  106239:160310 [68,60,58,59,29,23] 68 [62,60,58,59,29,23] 62    76271'34230 2017-12-13 09:00:08.359690    76271'34230 2017-12-10 10:05:10.931366
  0.1cf  3947  0 0 0 0 16472186880 1577 1577 activating+remapped 2017-12-13 11:08:50.641034 106236'7512915 106239:6176548 [34,68,8]           34 [34,8,53]           34 106138'7512682 2017-12-13 10:27:37.400613 106138'7512682 2017-12-13 10:27:37.400613
  2.210     0  0 0 0 0           0 1500 1500 activating+remapped 2017-12-13 11:08:50.686193    76271'33304   106239:96797 [68,67,34,36,16,15] 68 [62,67,34,36,16,15] 62    76271'33304 2017-12-12 00:49:21.038437    76271'33304 2017-12-10 16:05:12.751425
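To narrow it down to the PGs that involve OSD 68 specifically, something like the following should do it, though I'm quoting the syntax from memory and it may differ slightly between releases:

  # PGs mapped to osd.68 that are currently activating
  ceph pg ls-by-osd 68 activating

  # PGs where osd.68 is the primary
  ceph pg ls-by-primary 68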