Hi Sage,

To give more context on this problem: this cluster has two pools, rbd and one user-created. osd.12 is the primary for some other PGs as well, but the problem occurs only for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17842: 64 osds: 64 up, 64 in
      pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12504 GB used, 10971 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

Snippet from pg dump (each row: pgid, object/byte/log counts, state, state timestamp, version, reported, up set and primary, acting set and primary, last scrub and scrub stamps):

2.a9   518 0 0 0 0 2172649472 3001 3001 active+clean        2014-09-22 17:49:35.357586 6826'35762 17842:72706 [12,7,28]  12 [12,7,28]  12 6826'35762 2014-09-22 11:33:55.985449 0'0    2014-09-16 20:11:32.693864
0.59   0   0 0 0 0 0          0    0    active+clean        2014-09-22 17:50:00.751218 0'0        17842:4472  [12,41,2]  12 [12,41,2]  12 0'0        2014-09-22 16:47:09.315499 0'0    2014-09-16 12:20:48.618726
0.4d   0   0 0 0 0 0          4    4    stale+down+peering  2014-09-18 17:51:10.038247 186'4      11134:498   [12,56,27] 12 [12,56,27] 12 186'4      2014-09-18 17:30:32.393188 0'0    2014-09-16 12:20:48.615322
0.49   0   0 0 0 0 0          0    0    stale+down+peering  2014-09-18 17:44:52.681513 0'0        11134:498   [12,6,25]  12 [12,6,25]  12 0'0        2014-09-18 17:16:12.986658 0'0    2014-09-16 12:20:48.614192
0.1c   0   0 0 0 0 0          12   12   stale+down+peering  2014-09-18 17:51:16.735549 186'12     11134:522   [12,25,23] 12 [12,25,23] 12 186'12     2014-09-18 17:16:04.457863 186'10 2014-09-16 14:23:58.731465
2.17   510 0 0 0 0 2139095040 3001 3001 active+clean        2014-09-22 17:52:20.364754 6784'30742 17842:72033 [12,27,23] 12 [12,27,23] 12 6784'30742 2014-09-22 00:19:39.905291 0'0    2014-09-16 20:11:17.016299
2.7e8  508 0 0 0 0 2130706432 3433 3433 active+clean        2014-09-22 17:52:20.365083 6702'21132 17842:64769 [12,25,23] 12 [12,25,23] 12 6702'21132 2014-09-22 17:01:20.546126 0'0    2014-09-16 14:42:32.079187
2.6a5  528 0 0 0 0 2214592512 2840 2840 active+clean        2014-09-22 22:50:38.092084 6775'34416 17842:83221 [12,58,0]  12 [12,58,0]  12 6775'34416 2014-09-22 22:50:38.091989 0'0    2014-09-16 20:11:32.703368

And we could not observe any peering events happening on the primary OSD. Querying the stuck PGs returns ENOENT:

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

We are not able to explain why peering is stuck. BTW, the rbd pool doesn't contain any data.

Varada
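For reference, a few standard checks that can help narrow down where the monitors' view and the OSD disagree. This is a minimal sketch using the PG and OSD IDs from the outputs above; `ceph pg map` asks the monitors directly, so it answers even when the primary rejects the query, and `ceph daemon` must be run on the host that carries osd.12 (default admin-socket path assumed):

$ sudo ceph pg map 0.49                       # where do the monitors currently map this PG?
$ sudo ceph pg dump_stuck stale               # list every PG the monitors consider stale
$ sudo ceph daemon osd.12 dump_ops_in_flight  # on osd.12's host: inspect the blocked request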
From: Ceph-community [mailto:ceph-community-bounces@xxxxxxxxxxxxxx] On Behalf Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-users@ceph.com; ceph-community@lists.ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state

Stale means that the primary OSD for the PG went down, so the status the monitors have for it is stale. They all seem to be from osd.12... Seems like something is preventing that OSD from reporting to the mon?

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa <Sahana.Lokeshappa@sandisk.com> wrote:

Hi all,

I used the command 'ceph osd thrash ' and after all OSDs were up and in, 3 PGs ended up in stale+down+peering state:

sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17031: 64 osds: 64 up, 64 in
      pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12501 GB used, 10975 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, last acting [12,25,23]

Please, can anyone explain why these PGs are in this state?

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman Nagar, Bangalore 560093
T: +918042422283
Sahana.Lokeshappa@SanDisk.com
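A generic next step when a primary looks wedged like this (standard ceph CLI, not advice given in this thread; osd.12 and pg 0.49 are taken from the outputs above) is to mark the primary down so a fresh peering round is triggered when the daemon re-asserts itself:

$ sudo ceph osd down 12      # marks osd.12 down in the osdmap only; the running daemon will notice and rejoin
$ sudo ceph pg 0.49 query    # once peering restarts, this should return a state instead of ENOENT
$ sudo ceph health detail    # check whether the stale+down+peering PGs have cleared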