On Mon, Dec 29, 2014 at 12:56 PM, Christian Eichelmann
<christian.eichelmann@xxxxxxxx> wrote:
> Hi all,
>
> we have a Ceph cluster with currently 360 OSDs in 11 systems. Last week
> we were replacing one OSD system with a new one. During that, we had a
> lot of problems with OSDs crashing on all of our systems. But that is
> not our current problem.
>
> After we got everything up and running again, we still have 3 PGs in
> the incomplete state. I checked one of them directly on the systems
> (replication factor is 3). On two machines the directory was there but
> empty; on the third one I found some content. Using
> ceph_objectstore_tool I exported this PG and imported it on the other
> nodes. Nothing changed.
>
> We only use Ceph for providing rbd images. Right now, two of them are
> unusable, because Ceph hangs when someone tries to access content in
> these PGs. As if that weren't bad enough, if I create a new rbd image,
> Ceph still uses the incomplete PGs, so it is pure gambling whether a
> new volume will be usable or not. That, for now, makes our 900TB Ceph
> cluster unusable because of 3 bad PGs.
>
> And at this point it seems like I can't do anything. Instructing the
> cluster to scrub, deep-scrub or repair the PGs does nothing, even after
> several days. Checking which rbd images are affected is also not
> possible, because "rados -p poolname ls" hangs forever when it reaches
> one of the incomplete PGs. "ceph osd lost" also does nothing.
>
> So right now I am OK with losing the content of these three PGs. How
> can I get the cluster back to life without deleting the whole pool,
> which is not up for discussion?

Christian, would you mind providing an exact backtrace for those crashes
from a core file? This clearly represents one of my worst nightmares, a
domino crash of a healthy cluster, and even for an unstable release such
as Giant the issue should at least be properly pinned down. I also
suspect that you have an almost empty cluster or a very low number of
volumes, as only two volumes are affected in your case.

If you don't care about your data, after obtaining the core dump you may
want to try marking those PGs as lost, as the operational guide suggests.
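
For the backtrace, something along the following lines usually works.
The binary and core file paths are just assumptions; they depend on your
distribution and core_pattern settings, and you need the matching ceph
debug symbols installed:

    # allow the daemons to dump core at all, then reproduce the crash
    ulimit -c unlimited

    # open the core with the matching ceph-osd binary
    gdb /usr/bin/ceph-osd /path/to/core.<pid>

    # inside gdb, grab a full backtrace of every thread
    (gdb) set pagination off
    (gdb) thread apply all bt full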
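
To narrow down which PGs are stuck and why peering does not complete,
something like this should do (the pg id 2.7f is only a placeholder for
one of your three incomplete PGs):

    # list the incomplete PGs and the OSDs in their up/acting sets
    ceph health detail | grep incomplete
    ceph pg dump_stuck inactive

    # ask one of them directly why it will not peer
    ceph pg 2.7f query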
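
And if you really are prepared to throw away whatever lived in those
three PGs, the usual last-resort sequence from the troubleshooting guide
is, roughly, to declare the OSDs that last held the data lost and, if
the PGs still will not peer, recreate them empty. This is destructive,
and the OSD id 42 and pg id 2.7f below are again placeholders:

    # tell the cluster the missing copies are gone for good
    ceph osd lost 42 --yes-i-really-mean-it

    # last resort: recreate the PG as an empty PG
    ceph pg force_create_pg 2.7f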