[Ceph-community] Pgs are in stale+down+peering state

Hi Craig,

Sorry for late response. Somehow missed this mail.
All OSDs are up and running. There were no specific log entries related to this activity, and there are no I/Os running right now. A few OSDs were marked in and out, fully removed, and recreated before these PGs reached this state.
I have tried restarting the OSDs; it didn't work.

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
Sahana.Lokeshappa at SanDisk.com

From: Craig Lewis [mailto:clewis@xxxxxxxxxxxxxxxxxx]
Sent: Wednesday, September 24, 2014 5:44 AM
To: Sahana Lokeshappa
Cc: ceph-users at ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state

Is osd.12  doing anything strange?  Is it consuming lots of CPU or IO?  Is it flapping?   Writing any interesting logs?  Have you tried restarting it?

If that doesn't help, try the other involved osds: 56, 27, 6, 25, 23.  I doubt that it will help, but it won't hurt.
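
For what it's worth, a rough checklist for the above (assuming a default sysvinit-style install with standard log paths; adjust the service name and paths to your setup):

$ iostat -x 2                            # disk load on the osd.12 host
$ tail -f /var/log/ceph/ceph-osd.12.log  # default log path; watch for heartbeat/peering noise
$ sudo service ceph restart osd.12       # or: sudo /etc/init.d/ceph restart osd.12, depending on distro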



On Mon, Sep 22, 2014 at 11:21 AM, Varada Kari <Varada.Kari at sandisk.com> wrote:
Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

osd.12 is also the primary for some other PGs, but the problem happens only for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17842: 64 osds: 64 up, 64 in
      pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12504 GB used, 10971 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

Snippet from pg dump:

2.a9    518     0       0       0       0       2172649472      3001    3001    active+clean    2014-09-22 17:49:35.357586      6826'35762      17842:72706     [12,7,28]       12      [12,7,28]   12       6826'35762      2014-09-22 11:33:55.985449      0'0     2014-09-16 20:11:32.693864
0.59    0       0       0       0       0       0       0       0       active+clean    2014-09-22 17:50:00.751218      0'0     17842:4472      [12,41,2]       12      [12,41,2]       12      0'0 2014-09-22 16:47:09.315499       0'0     2014-09-16 12:20:48.618726
0.4d    0       0       0       0       0       0       4       4       stale+down+peering      2014-09-18 17:51:10.038247      186'4   11134:498       [12,56,27]      12      [12,56,27]      12  186'4    2014-09-18 17:30:32.393188      0'0     2014-09-16 12:20:48.615322
0.49    0       0       0       0       0       0       0       0       stale+down+peering      2014-09-18 17:44:52.681513      0'0     11134:498       [12,6,25]       12      [12,6,25]       12  0'0      2014-09-18 17:16:12.986658      0'0     2014-09-16 12:20:48.614192
0.1c    0       0       0       0       0       0       12      12      stale+down+peering      2014-09-18 17:51:16.735549      186'12  11134:522       [12,25,23]      12      [12,25,23]      12  186'12   2014-09-18 17:16:04.457863      186'10  2014-09-16 14:23:58.731465
2.17    510     0       0       0       0       2139095040      3001    3001    active+clean    2014-09-22 17:52:20.364754      6784'30742      17842:72033     [12,27,23]      12      [12,27,23]  12       6784'30742      2014-09-22 00:19:39.905291      0'0     2014-09-16 20:11:17.016299
2.7e8   508     0       0       0       0       2130706432      3433    3433    active+clean    2014-09-22 17:52:20.365083      6702'21132      17842:64769     [12,25,23]      12      [12,25,23]  12       6702'21132      2014-09-22 17:01:20.546126      0'0     2014-09-16 14:42:32.079187
2.6a5   528     0       0       0       0       2214592512      2840    2840    active+clean    2014-09-22 22:50:38.092084      6775'34416      17842:83221     [12,58,0]       12      [12,58,0]   12       6775'34416      2014-09-22 22:50:38.091989      0'0     2014-09-16 20:11:32.703368

And we couldn't observe any peering events happening on the primary OSD.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

We are not able to explain why peering is stuck. BTW, the rbd pool doesn't contain any data.
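
For completeness: even though the OSD disowns these PGs, the monitors' view can still be pulled with the generic CLI (nothing cluster-specific assumed here):

$ sudo ceph pg map 0.49            # the up/acting set the monitors expect for this PG
$ sudo ceph pg dump_stuck stale    # all stuck-stale PGs with their last-known acting sets

These only report the monitors' last-known state, but they're a quick cross-check against the pg dump above.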

Varada

From: Ceph-community [mailto:ceph-community-bounces at lists.ceph.com] On Behalf Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; ceph-users at lists.ceph.com; ceph-users at ceph.com; ceph-community at lists.ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state


Stale means that the primary OSD for the PG went down and the status is stale.  They all seem to be from OSD.12... Seems like something is preventing that OSD from reporting to the mon?
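
A rough way to check that (assuming default log paths and that the admin socket is reachable on the osd.12 host) is to compare the osdmap epoch the monitors are at with the newest one osd.12 has seen:

$ sudo ceph osd dump | head -1      # current osdmap epoch according to the monitors
$ sudo ceph daemon osd.12 status    # on the osd.12 host; reports state and newest_map
$ tail /var/log/ceph/ceph-osd.12.log

If newest_map lags well behind the monitors' epoch, the OSD isn't catching up or isn't reporting.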

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa <Sahana.Lokeshappa at sandisk.com> wrote:
Hi all,


I used the 'ceph osd thrash' command, and after all OSDs came back up and in, 3 PGs are in stale+down+peering state.
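
(For context, the thrash subcommand takes a number of osdmap epochs to churn through, e.g.:

$ sudo ceph osd thrash 50

where 50 is only an illustrative count, not necessarily the one used here.)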


sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17031: 64 osds: 64 up, 64 in
      pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12501 GB used, 10975 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering


sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, last acting [12,25,23]




Please, can anyone explain why these PGs are in this state?
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
Sahana.Lokeshappa at SanDisk.com







