Hello,

I work for BELNET, the Belgian National Research Network. We currently manage a Luminous Ceph cluster on Ubuntu 16.04 with 144 HDD OSDs spread across two data centers, with 6 OSD nodes in each. The OSDs are 4 TB SATA disks.

Last week we had a network incident: the link between our two DCs began to flap due to STP flapping. This left our Ceph cluster in a very bad state, with many PGs stuck in various states. I gave the cluster time to recover, but some OSDs would not restart. I read and tried several suggestions found on this mailing list, but that only made things worse: all my OSDs began going down because of some bad PGs.

I then tried the approach described by our Greek colleagues: https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

So I set the noout, noscrub, and nodeep-scrub flags, which seems to have frozen the situation.

The cluster is only used to provide RBD disks to our cloud-compute and cloud-storage solutions and to our internal KVM VMs. It seems that only some pools are affected by unclean/unknown/unfound objects, and everything is working well on the other pools (perhaps with some speed issues). I can confirm that the data on the affected pools is completely corrupted.

You can find here https://filesender.belnet.be/?s=download&token=1fac6b04-dd35-46f7-b4a8-c851cfa06379 a tgz file with as much information as I could dump to give an overview of the current state of the cluster.

So I have two questions:

1. Will removing the affected pools, together with their stuck PGs, also remove the defective PGs?

2. If not, I am completely lost and would like to know whether some experts could assist us, even on a paid basis. If so, you can contact me by mail at philippe@xxxxxxxxx.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
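
P.S. For completeness, the flags mentioned above were set with the standard Ceph CLI, roughly as follows (a sketch only; run against the admin node of your own cluster):

```shell
# Freeze recovery churn: prevent OSDs from being marked "out"
# and pause scrubbing / deep-scrubbing while investigating.
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# Verify the flags are in place and inspect the current PG/OSD state.
ceph status
ceph health detail

# Once the cluster is stable again, the flags should be cleared:
# ceph osd unset noout
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
```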