Re: Help rebalancing OSD usage, Luminus 1.2.2

Bryan Banister <bbanister@xxxxxxxxxxxxxxx> · Tue, 20 Feb 2018 21:53:12 +0000

Hi David [resending with smaller message size],

I tried setting the OSDs down, and that does clear the blocked requests momentarily, but they just return to the same state.  I'm not sure how to proceed here, but one thought was to do a full cold restart of the entire cluster.  We have disabled our backups, so the cluster is effectively down.  Any recommendations on next steps?

This also seems like a pretty serious issue, given that making this change has effectively broken the cluster.  Perhaps Ceph should not allow you to increase the number of PGs so drastically, or at least make you put in a ‘--yes-i-really-mean-it’ flag?

Or perhaps just some warnings on the docs.ceph.com placement groups page (http://docs.ceph.com/docs/master/rados/operations/placement-groups/) and the ceph command man page?

It would be good to help others avoid this pitfall.

Thanks again,
-Bryan

From: David Turner [mailto:drakonstein@xxxxxxxxx]
Sent: Friday, February 16, 2018 3:21 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Cc: Bryan Stillwell <bstillwell@xxxxxxxxxxx>; Janne Johansson <icepic.dz@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Note: External Email

That sounds like a good next step.  Start with OSDs involved in the longest blocked requests.  Wait a couple of minutes after the OSD marks itself back up and continue through them.  Hopefully things will start clearing up so that you don't need to mark all of them down.  There are usually only a couple of OSDs holding everything up.
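
One way to follow this advice is a small shell loop. The sketch below is illustrative only: `kick_osds`, `CEPH`, and `WAIT_SECS` are names invented here (they are not Ceph features), and the two-minute default simply mirrors the "couple of minutes" suggested above. The only Ceph command it relies on is `ceph osd down <id>`, which is the one used elsewhere in this thread.

```shell
# Sketch: mark OSDs down one at a time, pausing after each so the daemon
# has time to notice and mark itself back up.
# CEPH and WAIT_SECS are hypothetical knobs added for dry runs.
kick_osds() {
    local ceph_cmd="${CEPH:-ceph}"
    local wait_secs="${WAIT_SECS:-120}"
    local id
    for id in "$@"; do
        # This only marks the OSD down in the OSD map; it does not stop
        # or restart the daemon.
        "$ceph_cmd" osd down "$id" || return 1
        sleep "$wait_secs"
    done
}
```

Running `CEPH=echo WAIT_SECS=0 kick_osds 7 39 60` prints the three `osd down` commands without contacting the cluster, which makes for a cheap dry run. Checking `ceph health detail` between OSDs is still a manual step in this sketch.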

On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister <bbanister@xxxxxxxxxxxxxxx> wrote:

Thanks David,

The list of all OSDs with stuck requests shows that a little over 50% of all OSDs are in this condition.  There isn’t any discernible pattern that I can find, and they are spread across the three servers.  All of the OSDs are online as far as the service is concerned.

I have also taken all PGs that were reported in the health detail output and looked for any that report “peering_blocked_by”, but none do, so I can’t tell if any OSD is actually blocking the peering operation.

As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort -k13
    pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
    pg 14.fe0 is stuck unclean since forever, current state peering, last acting [104,94,108]
    pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
    pg 14.fd1 is stuck peering since forever, current state peering, last acting [130,62,111]
    pg 14.fd1 is stuck unclean since forever, current state peering, last acting [130,62,111]
    pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
    pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
    pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
    pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
    pg 14.fe8 is stuck peering since forever, current state peering, last acting [45,31,107]
    pg 14.fe8 is stuck unclean since forever, current state peering, last acting [45,31,107]
    pg 14.fc1 is stuck peering since forever, current state peering, last acting [59,124,39]
    pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
    pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
    pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
    pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
    pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
    pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
    pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]

Some PGs have OSDs in common, but other OSDs are listed only once.
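
Spotting shared OSDs by eye is error-prone with this many PGs, so a small filter can count them. The sketch below assumes the `ceph health detail` line format shown above ("pg <pgid> ... last acting [a,b,c]"); `osd_frequency` is a name made up for this example. Each PG is counted once even when it is listed under both "stuck peering" and "stuck unclean".

```shell
# Sketch: count how often each OSD appears in the acting sets of the PGs
# read from stdin. Output is "count osd-id", most frequent first.
osd_frequency() {
    # Keep one line per PG (field 2 is the pgid), extract the acting set,
    # split it into one OSD id per line, drop empty sets, then count.
    awk '$1 == "pg" && !seen[$2]++' |
        sed -n 's/.*last acting \[\([0-9,]*\)].*/\1/p' |
        tr ',' '\n' |
        sed '/^$/d' |
        sort -n | uniq -c | sort -k1,1nr -k2,2n
}
```

A possible invocation is `ceph health detail | grep peering | osd_frequency | head`. OSDs near the top of the list appear in the most peering PGs and may be reasonable first candidates to mark down.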

Should I try just marking OSDs with stuck requests down to see if that will re-assert them?

Thanks!!
-Bryan

From: David Turner [mailto:drakonstein@xxxxxxxxx]
Sent: Friday, February 16, 2018 2:51 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Cc: Bryan Stillwell <bstillwell@xxxxxxxxxxx>; Janne Johansson <icepic.dz@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

I'll answer the questions I definitely know the answer to first, and then we'll continue from there.  If an OSD is blocking peering but is online, when you mark it as down in the cluster it receives a message in its log saying it was wrongly marked down and tells the mons it is online.  That gets it to stop what it was doing and start talking again.  I referred to that as re-asserting.  If the OSD that you marked down doesn't mark itself back up within a couple of minutes, restarting the OSD might be a good idea.  Then again, actually restarting the daemon could be bad because the daemon is doing something.  With so many other places to start from to get things going, actually restarting the daemons is probably something I would hold off on for now.

The reason the cluster doesn't know anything about the PG is that it is still being created and hasn't actually been created yet.  Starting with some of the OSDs that you see with blocked requests would be a good idea.  Eventually you'll down an OSD and, when it comes back up, things will start looking much better as PGs start peering.  Below is the list of OSDs you had from a previous email; if they still have stuck requests, they'll be good ones to start with.  On closer review, it's almost all of them... but you have to start somewhere.  Another possible place to start is to look at a list of all of the peering PGs and see if there are any common OSDs when you look at all of them at once.  Some patterns may emerge and would be good options to try.

    osds 7,39,60,103,133 have stuck requests > 67108.9 sec
    osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
    osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
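
To work through a list like this starting with the longest-blocked requests, the lines can be flattened into one OSD id per line, ordered by threshold. This is only a sketch: `osds_by_block_time` is an invented name, and it assumes each "osds ... have stuck requests > N sec" entry arrives on a single line, as `ceph health detail` prints it.

```shell
# Sketch: read "osds a,b,c have stuck requests > N sec" lines on stdin and
# print one OSD id per line, longest-blocked first.
osds_by_block_time() {
    awk '$1 == "osds" && /have stuck requests >/ {
             # $2 is the comma-separated id list; the threshold in seconds
             # is the second-to-last field.
             n = split($2, ids, ",")
             for (i = 1; i <= n; i++) print $(NF - 1), ids[i]
         }' |
        LC_ALL=C sort -k1,1nr -k2,2n |
        awk '{print $2}'
}
```

For example, `ceph health detail | osds_by_block_time | head -n 5` would list five OSDs from the longest-blocked group.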

On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister <bbanister@xxxxxxxxxxxxxxx> wrote:

Thanks David,

I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at this point and the backfills have stopped.  I’ll also stop the backups from pushing into ceph for now.

I don’t want to make things worse, so I’m asking for some more guidance now.

1) Looking at a PG that is still peering or one that is “unknown”, Ceph complains that it doesn’t have that pgid:

    pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]

[root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
Error ENOENT: i don't have pgid 14.fb0
[root@carf-ceph-osd03 ~]#

2) One that is activating shows this for the recovery_state:
[root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
[snip]
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2018-02-13 14:33:21.406919",
            "might_have_unfound": [
                {
                    "osd": "84(0)",
                    "status": "not queried"
                }
            ],
            "recovery_progress": {
                "backfill_targets": [
                    "56(0)",
                    "87(1)",
                    "88(2)"
                ],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.seed": 0,
                "scrubber.waiting_on": 0,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2018-02-13 14:33:17.491148"
        }
    ],

Sorry for all the hand-holding, but how do I determine if I need to set an OSD as ‘down’ to fix the issues, and how does it go about re-asserting itself?

I again tried looking at the ceph docs on troubleshooting OSDs but didn’t find any details.  The man page also has no details.

Thanks again,
-Bryan

From: David Turner [mailto:drakonstein@xxxxxxxxx]
Sent: Friday, February 16, 2018 1:21 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Cc: Bryan Stillwell <bstillwell@xxxxxxxxxxx>; Janne Johansson <icepic.dz@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Your problem might have been creating too many PGs at once.  I generally increase pg_num and pgp_num by no more than 256 at a time, making sure that all PGs are created, peered, and healthy (other than backfilling) before continuing.
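
As a rough illustration of that stepwise approach, the sketch below only prints a plan; it does not talk to the cluster. `pg_steps` and `pg_plan` are names invented for this example, the pool name and starting pg_num are whatever you pass in (the thread does not state them), and the 256 default is the step size mentioned above. The printed commands use the standard `ceph osd pool set <pool> pg_num|pgp_num <n>` form.

```shell
# Sketch: print the intermediate pg_num values from CURRENT up to TARGET.
# Usage: pg_steps CURRENT TARGET [STEP]
pg_steps() {
    local current="$1" target="$2" step="${3:-256}"
    while [ "$current" -lt "$target" ]; do
        current=$((current + step))
        # Never overshoot the requested target.
        if [ "$current" -gt "$target" ]; then
            current="$target"
        fi
        echo "$current"
    done
}

# Sketch: print the commands for each step, to be run by hand one step at
# a time. Usage: pg_plan POOL CURRENT TARGET [STEP]
pg_plan() {
    local pool="$1" n
    for n in $(pg_steps "$2" "$3" "${4:-256}"); do
        echo "ceph osd pool set $pool pg_num $n"
        echo "ceph osd pool set $pool pgp_num $n"
        echo "# wait until all PGs are created and peered before continuing"
    done
}
```

For instance, `pg_plan mypool 2048 4096` (with a hypothetical pool name and starting value) prints eight steps of 256. Printing rather than executing keeps the "wait until healthy" check in the operator's hands.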

To help you get back to a healthy state, let's start off by getting all of your PGs peered.  Go ahead and put a stop to backfilling, recovery, scrubbing, etc.  Those are all hindering the peering effort right now.  The more clients you can disable, the better.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub

After that, look at your peering PGs and find out what is blocking their peering.  This is where you might need to use `ceph osd down 23` (assuming you needed to kick osd.23) to mark OSDs down in the cluster and let them re-assert themselves.  Once all PGs are done peering, go ahead and unset nobackfill and norecover and let the cluster start moving data around.  Leaving noscrub and nodeep-scrub set is optional and up to you.  I'll never say it's better to leave scrubbing off, but scrubbing does use a fair bit of spindle time while you're trying to backfill.
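
The "wait for peering, then unset" step could be guarded with a small check. This is a sketch under assumptions: `peering_count`, `resume_data_movement`, and the `CEPH` override are invented for this example, and the check relies on the "N pgs peering" phrase that appears in the health output quoted later in this thread. It looks only at the peering count, not at other inactive states such as activating or unknown.

```shell
# Sketch: print the number from the first "N pgs peering" phrase on stdin,
# or 0 when no such phrase is present.
peering_count() {
    local n
    n=$(grep -o '[0-9][0-9]* pgs peering' | head -n 1 | cut -d' ' -f1)
    echo "${n:-0}"
}

# Sketch: unset nobackfill and norecover only when nothing is peering.
# CEPH is a hypothetical override so the function can be dry-run.
resume_data_movement() {
    local ceph_cmd="${CEPH:-ceph}"
    local peering
    peering=$("$ceph_cmd" health detail | peering_count)
    if [ "$peering" -gt 0 ]; then
        echo "still $peering pgs peering; leaving flags set" >&2
        return 1
    fi
    "$ceph_cmd" osd unset nobackfill
    "$ceph_cmd" osd unset norecover
}
```

The function leaves noscrub and nodeep-scrub alone, since whether to keep scrubbing off during backfill is a separate choice.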

On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister <bbanister@xxxxxxxxxxxxxxx> wrote:

Well I decided to try the increase in PGs to 4096 and that seems to have caused some issues:

2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec

The cluster is actively backfilling misplaced objects, but not all PGs are active at this point and many are stuck peering, stuck unclean, or have a state of unknown:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
    pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
    pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
    pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
    pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
[snip]

The health also shows a large number of degraded data redundancy PGs:
PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
    pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
    pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
    pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
    pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
[snip]

We also now have a number of stuck requests:
REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
    69 ops are blocked > 268435 sec
    66 ops are blocked > 134218 sec
    28 ops are blocked > 67108.9 sec
    osds 7,39,60,103,133 have stuck requests > 67108.9 sec
    osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
    osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec

I tried looking through the mailing list archive on how to solve the stuck requests, and it seems that restarting the OSDs is the right way?

At this point we have just been watching the backfills running and see a steady but slow decrease of misplaced objects.  When the cluster is idle, the overall OSD disk utilization is not too bad at roughly 40% on the physical disks running these backfills.

However, we still have our backups trying to push new images to the cluster.  This worked OK for the first few days, but yesterday we were getting failure alerts.  I checked the status of the RGW service and noticed that 2 of the 3 RGW civetweb servers were not responsive.  I restarted the RGWs on the ones that appeared hung and that got them working for a while, but then the same condition happened.  The RGWs seem to have recovered on their own now, but again the cluster is idle and only backfills are currently doing anything (that I can tell).  I did see these log entries:
2018-02-15 16:46:07.541542 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:12.541613 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
2018-02-15 16:46:12.541629 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:17.541701 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600

At this point we do not know how to proceed with recovery efforts.  I tried looking at the ceph docs and mailing list archives but wasn’t able to determine the right path forward here.

Any help is appreciated,
-Bryan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com