Re: Health Error : Request Stuck

Ok, great, glad you got your issue sorted. I’m still battling along with mine.

 

From: Karun Josy [mailto:karunjosy1@xxxxxxxxx]
Sent: 13 December 2017 12:22
To: nick@xxxxxxxxxx
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Health Error : Request Stuck

 

Hi Nick,

 

Finally, I was able to correct the issue!

 

We found many slow requests in ceph health detail, and saw that a few OSDs were slowing the cluster down.
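
(For anyone hitting the same thing: the OSD ids will of course differ per cluster, but something along these lines should show where the slow requests are coming from.)

=============
ceph health detail     # lists the slow/stuck requests and the OSD ids implicated
ceph osd perf          # per-OSD commit/apply latency; the slow ones tend to stand out
=============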

 

Initially, the cluster was unusable: there were 10 PGs in "activating+remapped" status along with the slow requests.

The slow requests were mainly on 2 OSDs, so we restarted those OSD daemons one by one, which cleared the blocked requests.
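
(On a systemd-based install that is roughly the following, one OSD at a time:)

=============
systemctl restart ceph-osd@<id>    # wait for the OSD to come back up before restarting the next one
=============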

 

That made the cluster usable again. However, 4 PGs were still in an inactive state.

So I took down one of the OSDs with slow requests for a while and allowed the cluster to rebalance.
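
(Roughly like this; marking the OSD out is one way to do it, and it can be marked back in once things settle:)

=============
ceph osd out <id>      # move data away from the slow OSD
# wait for the stuck PGs to peer and go active+clean
ceph osd in <id>       # optionally bring it back in afterwards
=============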

And it worked!

 

To be honest, I'm not exactly sure it's the correct way to do it.

 

P.S.: I had upgraded to Luminous 12.2.2 yesterday.

 


Karun Josy

 

On Wed, Dec 13, 2017 at 4:31 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi Karun,

 

I too am experiencing something very similar, with a PG stuck in the activating+remapped state after re-introducing an OSD back into the cluster as Bluestore, although this new OSD is not the one listed against the PG stuck activating. I also see the same thing as you, where the up set is different from the acting set.

 

Can I just ask what Ceph version you are running, and could you also share the output of ceph osd tree?
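
I.e. something like:

=============
ceph versions      # or ceph -v on each node if the cluster is pre-Luminous
ceph osd tree
=============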

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Karun Josy
Sent: 13 December 2017 07:06
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Health Error : Request Stuck

 

The cluster is unusable because of the inactive PGs. How can we correct it?

 

=============

ceph pg dump_stuck inactive

ok

PG_STAT STATE               UP           UP_PRIMARY ACTING       ACTING_PRIMARY

1.4b    activating+remapped [5,2,0,13,1]          5 [5,2,13,1,4]              5

1.35    activating+remapped [2,7,0,1,12]          2 [2,7,1,12,9]              2

1.12    activating+remapped  [1,3,5,0,7]          1  [1,3,5,7,2]              1

1.4e    activating+remapped  [1,3,0,9,2]          1  [1,3,0,9,5]              1

2.3b    activating+remapped     [13,1,0]         13     [13,1,2]             13

1.19    activating+remapped [2,13,8,9,0]          2 [2,13,8,9,1]              2

1.1e    activating+remapped [2,3,1,10,0]          2 [2,3,1,10,5]              2

2.29    activating+remapped     [1,0,13]          1     [1,8,11]              1

1.6f    activating+remapped [8,2,0,4,13]          8 [8,2,4,13,1]              8

1.74    activating+remapped [7,13,2,0,4]          7 [7,13,2,4,1]              7

====


Karun Josy

 

On Wed, Dec 13, 2017 at 8:27 AM, Karun Josy <karunjosy1@xxxxxxxxx> wrote:

Hello,

 

We added a new disk to the cluster and while rebalancing we are getting error warnings.

 

=============

Overall status: HEALTH_ERR

REQUEST_SLOW: 1824 slow requests are blocked > 32 sec

REQUEST_STUCK: 1022 stuck requests are blocked > 4096 sec

==============

 

The load on the servers seems to be very low.

 

How can I correct it?

 

 

Karun 

 

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
