On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
> 2.a9    518    0    0    0    0    2172649472    3001    3001    active+clean    2014-09-22 17:49:35.357586    6826'35762    17842:72706    [12,7,28]    12    [12,7,28]    12    6826'35762    2014-09-22 11:33:55.985449    0'0    2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28?
If so, the next step would be to enable logging (debug osd = 20, debug ms = 1)
and see why peering is stuck...

sage

> 0.59    0    0    0    0    0    0    0    0    active+clean    2014-09-22 17:50:00.751218    0'0    17842:4472    [12,41,2]    12    [12,41,2]    12    0'0    2014-09-22 16:47:09.315499    0'0    2014-09-16 12:20:48.618726
>
> 0.4d    0    0    0    0    0    0    4    4    stale+down+peering    2014-09-18 17:51:10.038247    186'4    11134:498    [12,56,27]    12    [12,56,27]    12    186'4    2014-09-18 17:30:32.393188    0'0    2014-09-16 12:20:48.615322
>
> 0.49    0    0    0    0    0    0    0    0    stale+down+peering    2014-09-18 17:44:52.681513    0'0    11134:498    [12,6,25]    12    [12,6,25]    12    0'0    2014-09-18 17:16:12.986658    0'0    2014-09-16 12:20:48.614192
>
> 0.1c    0    0    0    0    0    0    12    12    stale+down+peering    2014-09-18 17:51:16.735549    186'12    11134:522    [12,25,23]    12    [12,25,23]    12    186'12    2014-09-18 17:16:04.457863    186'10    2014-09-16 14:23:58.731465
>
> 2.17    510    0    0    0    0    2139095040    3001    3001    active+clean    2014-09-22 17:52:20.364754    6784'30742    17842:72033    [12,27,23]    12    [12,27,23]    12    6784'30742    2014-09-22 00:19:39.905291    0'0    2014-09-16 20:11:17.016299
>
> 2.7e8    508    0    0    0    0    2130706432    3433    3433    active+clean    2014-09-22 17:52:20.365083    6702'21132    17842:64769    [12,25,23]    12    [12,25,23]    12    6702'21132    2014-09-22 17:01:20.546126    0'0    2014-09-16 14:42:32.079187
>
> 2.6a5    528    0    0    0    0    2214592512    2840    2840    active+clean    2014-09-22 22:50:38.092084    6775'34416    17842:83221    [12,58,0]    12    [12,58,0]    12    6775'34416    2014-09-22 22:50:38.091989    0'0    2014-09-16 20:11:32.703368
>
> And we couldn't observe any peering events happening on the primary OSD.
>
> $ sudo ceph pg 0.49 query
> Error ENOENT: i don't have pgid 0.49
>
> $ sudo ceph pg 0.4d query
> Error ENOENT: i don't have pgid 0.4d
>
> $ sudo ceph pg 0.1c query
> Error ENOENT: i don't have pgid 0.1c
>
> Not able to explain why the peering was stuck. BTW, the rbd pool doesn't contain any data.
>
> Varada
>
> From: Ceph-community [mailto:ceph-community-bounces at lists.ceph.com] On Behalf Of Sage Weil
> Sent: Monday, September 22, 2014 10:44 PM
> To: Sahana Lokeshappa; ceph-users at lists.ceph.com; ceph-users at ceph.com; ceph-community at lists.ceph.com
> Subject: Re: [Ceph-community] Pgs are in stale+down+peering state
>
> Stale means that the primary OSD for the PG went down and the status is stale.
> They all seem to be from OSD.12... Seems like something is preventing that
> OSD from reporting to the mon?
>
> sage
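
Concretely, the checks Sage suggests above (verify the PG directory exists, then
raise debug logging on the primary) come down to a couple of commands. A rough
sketch, assuming the default data path /var/lib/ceph/osd/ceph-<id> and a
FileStore backend; adjust the paths if this cluster is laid out differently:

    # on the hosts carrying osds 12, 7 and 28: does the PG's directory exist?
    ls -d /var/lib/ceph/osd/ceph-12/current/2.a9_head

    # raise logging on the primary without restarting it
    sudo ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'

    # then watch the OSD log for peering activity
    tail -f /var/log/ceph/ceph-osd.12.log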
>
> On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa
> <Sahana.Lokeshappa at sandisk.com> wrote:
>
> Hi all,
>
> I used the 'ceph osd thrash' command, and after all OSDs are back up and in,
> 3 pgs are in stale+down+peering state.
>
> sudo ceph -s
>     cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
>      health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale;
>             3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
>      monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
>             election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
>      osdmap e17031: 64 osds: 64 up, 64 in
>       pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
>             12501 GB used, 10975 GB / 23476 GB avail
>                 2145 active+clean
>                    3 stale+down+peering
>
> sudo ceph health detail
> HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
> pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, last acting [12,56,27]
> pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, last acting [12,6,25]
> pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, last acting [12,25,23]
> pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, last acting [12,56,27]
> pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, last acting [12,6,25]
> pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, last acting [12,25,23]
> pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, last acting [12,56,27]
> pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, last acting [12,6,25]
> pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, last acting [12,25,23]
>
> Please, can anyone explain why the pgs are in this state?
>
> Sahana Lokeshappa
> Test Development Engineer I
> SanDisk Corporation
> 3rd Floor, Bagmane Laurel, Bagmane Tech Park
> C V Raman nagar, Bangalore 560093
> T: +918042422283
> Sahana.Lokeshappa at SanDisk.com
>
> _______________________________________________
> Ceph-community mailing list
> Ceph-community at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
> --
> Sent from Kaiten Mail. Please excuse my brevity.
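
For reference, the stuck set and its acting primaries can also be pulled
straight from the cluster, which makes it easier to confirm whether osd.12 is
up and answering, as suspected earlier in the thread. A rough sketch using
standard ceph CLI commands, with the PG and OSD ids taken from the health
output above:

    # list stuck PGs together with their acting sets
    sudo ceph pg dump_stuck stale
    sudo ceph pg dump_stuck inactive

    # where does one of the stuck PGs currently map?
    sudo ceph pg map 0.4d

    # is the suspect primary up/in, and does the daemon respond at all?
    sudo ceph osd tree | grep -w osd.12
    sudo ceph tell osd.12 version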