Re: [EXTERNAL] Re: pg stuck with unfound objects on non exsisting osd's

"Will.Boege" <Will.Boege@xxxxxxxxxx> · Wed, 2 Nov 2016 00:20:48 +0000

Start with a rolling restart of just the OSDs one system at a time, checking the status after each restart. 
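
A minimal sketch of that rolling restart, assuming SSH access to each OSD host. The host names are placeholders, and the restart command (`service ceph restart osd`) is an assumption about a sysvinit-style install, so adjust it to your init system. By default the script only prints what it would run.

```shell
#!/bin/sh
# Rolling OSD restart sketch. Host names and RESTART_CMD are assumptions,
# not taken from this thread. DRY_RUN=1 (the default) only prints commands.

DRY_RUN="${DRY_RUN:-1}"
# Assumed command that restarts all OSD daemons on one host.
RESTART_CMD="${RESTART_CMD:-service ceph restart osd}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart the OSDs on each given host in turn, showing cluster health
# after every host so problems are noticed before moving on.
rolling_restart() {
    # Keep OSDs from being marked out while they restart.
    run ceph osd set noout
    for host in "$@"; do
        run ssh "$host" "$RESTART_CMD"
        run ceph health detail
        # In a real run, stop here and inspect the output before the
        # next host; this sketch does not wait for recovery itself.
    done
}

# Example with hypothetical host names:
# rolling_restart osd-node1 osd-node2
```

Run it once with the default `DRY_RUN=1` to review the command sequence, then with `DRY_RUN=0` when the sequence looks right.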

On Nov 1, 2016, at 6:20 PM, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:

Thanks for the suggestion.

Is a rolling reboot sufficient, or must all OSDs be down at the same time?

The first is no problem; the second takes some scheduling.

Ronny Aasen

On 01.11.2016 21:52, ceph@xxxxxxxxxx wrote:

Hello Ronny,

If it is possible for you, try rebooting all OSD nodes.

I had this issue on my test cluster and it became healthy after rebooting.

Hth

- Mehmet

On 1 November 2016 19:55:07 CET, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:

Hello.

I have a cluster with 2 PGs stuck undersized and degraded, with 25
unfound objects.

# ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 294599/149522370 objects degraded (0.197%); recovery 640073/149522370 objects misplaced (0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set
pg 6.d4 is stuck unclean for 8893374.380079, current state active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck unclean for 8896787.249470, current state active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck undersized for 438122.427341, current state active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck undersized for 416947.461950, current state active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck degraded for 438122.427402, current state active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck degraded for 416947.462010, current state active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 unfound
pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12]
recovery 294599/149522370 objects degraded (0.197%)
recovery 640073/149522370 objects misplaced (0.428%)
recovery 25/46579241 unfound (0.000%)
noout flag(s) set

I have been following the troubleshooting guide at
http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
but got stuck without a resolution.

Luckily it is not critical data, so I wanted to mark the pg lost so it
could become health-ok.

# ceph pg 6.d4 mark_unfound_lost delete
Error EINVAL: pg has 25 unfound objects but we haven't probed all 
sources, not marking lost

Querying the pg, I see that it still wants osd.80 and osd.36:

      {
                     "osd": "80",
                     "status": "osd is down"
                 },

Trying to mark the OSDs lost does not work either, since the OSDs were
removed from the cluster a long time ago.

# ceph osd lost 80 --yes-i-really-mean-it
osd.80 is not down or doesn't exist

# ceph osd lost 36 --yes-i-really-mean-it
osd.36 is not down or doesn't exist
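
The error above does not say which of its two cases applies ("not down" or "doesn't exist"). A small sketch for telling them apart is to check whether the id is still listed by `ceph osd ls`, which prints one OSD id per line. The helper name `osd_in_list` is hypothetical; it only filters a list of ids read from stdin and changes nothing in the cluster.

```shell
#!/bin/sh
# Succeed if OSD id $1 appears as a whole line in the id list on stdin.
# -x matches the entire line, so id 6 does not match 62 or 36.
osd_in_list() {
    grep -qxF -- "$1"
}

# Usage against a live cluster (requires the ceph CLI):
# for id in 80 36; do
#     if ceph osd ls | osd_in_list "$id"; then
#         echo "osd.$id is still in the OSD map"
#     else
#         echo "osd.$id is not in the OSD map"
#     fi
# done
```

If an id is absent from the list, the OSD no longer exists in the OSD map, which matches the "doesn't exist" half of the message.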

And this is where I am stuck.

I have tried stopping and starting the 3 OSDs, but that did not have any
effect.

Does anyone have advice on how to proceed?

Full output at: http://paste.debian.net/hidden/be03a185/

This is Hammer 0.94.9 on Debian 8.

kind regards

Ronny Aasen

ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
