Re: Ceph PGs stuck creating after running force_create_pg

Hi Goncalo,

I removed the OSDs using ceph auth del, ceph osd crush rm, and ceph osd rm, the same way I have removed other failed OSDs in the past. I noticed in the crushmap I linked that the removed OSDs were still present, showing as “device 2 device2”. I tried deleting them from the map, recompiling it, and applying it to the cluster, but that did not seem to do anything. I also restarted all of the OSDs as well as the monitors, and eventually rebooted every server in the cluster in a last-ditch effort to get it back up and running, with no luck. In the end we wiped the cluster and started from scratch, since we have backups of all of the data.
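
For anyone who hits this later, the removal sequence I used was roughly the following (osd.2 here is a stand-in for each dead OSD), followed by the crushmap round trip:

    ceph auth del osd.2
    ceph osd crush rm osd.2
    ceph osd rm osd.2

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt to drop the leftover "device 2 device2" entries, then:
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new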

Thank you,

James Green

(401)-825-4401

 

From: Goncalo Borges [mailto:goncalo@xxxxxxxxxxxxxxxxxxx]
Sent: Thursday, October 15, 2015 3:32 AM
To: James Green <jgreen@xxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph PGs stuck creating after running force_create_pg

Hi James...

I am assuming that you have properly removed the dead OSDs from the crush map.

I've tested a scenario like the one you described and found that the PGs will never leave the creating state until you restart all OSDs.

Have you done that?
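
For example, something along these lines on each OSD host should do it (assuming systemd-managed daemons; on sysvinit installs it would be "/etc/init.d/ceph restart osd" instead):

    systemctl restart ceph-osd.target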

Cheers
Goncalo


On 10/15/2015 01:54 AM, James Green wrote:

Hello,

We recently had two nodes go down in our Ceph cluster; one was repaired, and the other had all 12 of its OSDs destroyed when it went down. We brought everything back online, and several PGs were showing as down+peering or down. After marking the failed OSDs as lost and removing them from the cluster, we now have around 90 PGs showing as incomplete. At this point we just want to get the cluster back to a healthy state. I tried recreating the PGs using force_create_pg, and now they are all stuck in creating.
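
Concretely, the commands were along these lines, run once per dead OSD and once per incomplete PG (the OSD id 12 is a placeholder here; 2.182 is one of the affected PGs):

    ceph osd lost 12 --yes-i-really-mean-it
    ceph pg force_create_pg 2.182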

PG dump shows all 90 PGs with the same output:

2.182   0       0       0       0       0       0       0       0       creating        2015-10-14 10:31:28.832527      0'0     0:0     []      -1      []      -1      0'0     0.000000        0'0     0.000000
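
The empty up and acting sets ([] with primary -1) suggest these PGs are not being mapped to any OSDs at all. The mapping for an individual PG can be double-checked with:

    ceph pg map 2.182

which prints the osdmap epoch along with the up and acting sets for that PG.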

When I ran a pg query on one of the stuck PGs, I noticed one of the failed OSDs listed under "down_osds_we_would_probe". I have already removed that OSD from the cluster, and trying to mark it lost says the OSD does not exist.
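
For reference, this is roughly how I pulled that out of the query output (2.182 again as the example PG id):

    ceph pg 2.182 query > pg-2.182.json
    grep -A 3 '"down_osds_we_would_probe"' pg-2.182.json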

Here is my crushmap http://pastebin.com/raw.php?i=vyk9vMT1

Why are the PGs still trying to probe OSDs that have been lost and removed from the cluster?

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
