Re: pg is stuck stale (osd.21 still removed)

Hi ceph-users,

any idea how to fix my cluster? OSD.21 has been removed, but some stale PGs are still pointing to it...
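
For reference, a quick way to list the affected PGs (untested sketch - it should match the "ceph health detail" output quoted below):

        ceph pg dump_stuck stale
        ceph health detail | grep 'last acting \[21\]'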

I don't know how to proceed... Help is very welcome!

Best regards
Daniel


> -----Original Message-----
> From: Daniel Schwager
> Sent: Friday, January 08, 2016 3:10 PM
> To: 'ceph-users@xxxxxxxx'
> Subject: pg is stuck stale (osd.21 still removed)
> 
> Hi,
> 
> we had a hardware problem with OSD.21 today. The OSD daemon was down, and "smartctl" reported some
> hardware errors.
> 
> I decided to remove the HDD:
> 
>           ceph osd out 21
>           ceph osd crush remove osd.21
>           ceph auth del osd.21
>           ceph osd rm osd.21
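> 
> (After these steps osd.21 should be gone from both the CRUSH tree and the OSD map - quick checks,
> untested here, which I'd expect to come back empty:)
> 
>           ceph osd tree | grep osd.21
>           ceph osd dump | grep osd.21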
> 
> But afterwards I saw that I still have some stuck PGs referencing osd.21:
> 
> 	root@ceph-admin:~# ceph -w
> 	    cluster c7b12656-15a6-41b0-963f-4f47c62497dc
> 	     health HEALTH_WARN
> 	            50 pgs stale
> 	            50 pgs stuck stale
> 	     monmap e4: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}
> 	            election epoch 404, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
> 	     mdsmap e136: 1/1/1 up {0=ceph-mon1=up:active}
> 	     osdmap e18259: 23 osds: 23 up, 23 in
> 	      pgmap v47879105: 6656 pgs, 10 pools, 23481 GB data, 6072 kobjects
> 	            54974 GB used, 30596 GB / 85571 GB avail
> 	                6605 active+clean
> 	                  50 stale+active+clean
> 	                   1 active+clean+scrubbing+deep
> 
> 	root@ceph-admin:~# ceph health
> 	HEALTH_WARN 50 pgs stale; 50 pgs stuck stale
> 
> 	root@ceph-admin:~# ceph health detail
> 	HEALTH_WARN 50 pgs stale; 50 pgs stuck stale; noout flag(s) set
> 	pg 34.225 is stuck stale for 98780.399254, current state stale+active+clean, last acting [21]
> 	pg 34.186 is stuck stale for 98780.399195, current state stale+active+clean, last acting [21]
> 	...
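> 
> Side note: "last acting [21]" with no other OSDs in the set makes me think these PGs had their
> only copy on osd.21, i.e. pool 34 is probably running with size 1. A quick (untested) way to
> check the pool's replication size:
> 
> 	ceph osd dump | grep "pool 34 "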
> 
> 	root@ceph-admin:~# ceph pg 34.225   query
> 	Error ENOENT: i don't have pgid 34.225
> 
> 	root@ceph-admin:~# ceph pg 34.225  list_missing
> 	Error ENOENT: i don't have pgid 34.225
> 
> 	root@ceph-admin:~# ceph osd lost 21  --yes-i-really-mean-it
> 	osd.21 is not down or doesn't exist
> 
> 	# checking the crushmap
> 	ceph osd getcrushmap -o crush.map
> 	crushtool -d crush.map -o crush.txt
> 	root@ceph-admin:~# grep 21 crush.txt
> 		-> nothing here....
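> 
> So osd.21 is gone from the CRUSH map as well. I guess "ceph pg map" would show where CRUSH wants
> to place such a PG now (untested sketch) - I'd expect an up/acting set without 21:
> 
> 	ceph pg map 34.225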
> 
> 
> Of course, I cannot start OSD.21, because it's not available anymore - I removed it.
> 
> Is there a way to remap the stuck PGs to OSDs other than osd.21?

....

> 
> One more thing - I tried to recreate the PG, but now this PG is "stuck inactive":
> 
> 	root@ceph-admin:~# ceph pg force_create_pg 34.225
> 	pg 34.225 now creating, ok
> 
> 	root@ceph-admin:~# ceph health detail
> 	HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
> 	pg 34.225 is stuck inactive since forever, current state creating, last acting []
> 	pg 34.225 is stuck unclean since forever, current state creating, last acting []
> 	pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
> 	...
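> 
> My guess is that the new PG hangs in "creating" because the monitors cannot map it to any OSD
> (acting is empty), so no OSD ever gets told to create it. Untested sketch of what I would check
> next - whether the PG answers at all, and whether the crush rule used by pool 34 can still pick
> an OSD:
> 
> 	ceph pg 34.225 query
> 	ceph osd crush rule dump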
> 
> Maybe somebody has an idea how to fix this situation?


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
