On Sun, 3 Feb 2019, Philippe Van Hecke wrote:
> Hello,
>
> I am working for BELNET, the Belgian National Research Network.
>
> We currently manage a Luminous Ceph cluster on Ubuntu 16.04 with 144 HDD
> OSDs spread across two data centers, with 6 OSD nodes in each data
> center. The OSDs are 4 TB SATA disks.
>
> Last week we had a network incident and the link between our 2 DCs began
> to flap due to STP flapping. This left our Ceph cluster in a very bad
> state, with many PGs stuck in various states. I gave the cluster time to
> recover, but some OSDs would not restart. I read and tried different
> things found on this mailing list, but that only made the situation
> worse, because all my OSDs began falling down one by one due to some bad
> PGs.
>
> I then tried the solution described by our Greek colleagues:
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>
> So I set noout, noscrub and nodeep-scrub, which seems to have frozen the
> situation.
>
> The cluster is only used to provide RBD disks to our cloud-compute and
> cloud-storage solutions and to our internal KVM VMs.
>
> It seems that only some pools are affected by unclean/unknown/unfound
> objects, and everything is working well for the other pools (maybe some
> speed issues).
>
> I can confirm that the data on the affected pools is completely
> corrupted.
>
> You can find here
> https://filesender.belnet.be/?s=download&token=1fac6b04-dd35-46f7-b4a8-c851cfa06379
> a tgz file with as much information as I could dump, to give an overview
> of the current state of the cluster.
>
> So I have 2 questions.
>
> Will removing the affected pools, with their associated stuck PGs, also
> remove the defective PGs?

Yes, but don't do that yet!  From a quick look this looks like it can be
worked around.

First question is why you're hitting the assert on e.g. osd.49:

     0> 2019-02-01 09:23:36.963503 7fb548859e00 -1 /build/ceph-12.2.5/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7fb548859e00 time 2019-02-01 09:23:36.961237
    /build/ceph-12.2.5/src/osd/PGLog.h: 1354: FAILED assert(last_e.version.version < e.version.version)

If you can set debug osd = 20 on that osd, start it, and ceph-post-file
the log, that would be helpful.  12.2.5 is a pretty old luminous release,
but I don't see this in the tracker, so a log would be great.

Your priority is probably to get the pools active, though.  For osd.49,
the problematic pg is 11.182, which your pg ls output shows as online and
undersized but usable.  You can use ceph-objectstore-tool
--op export-remove to make a backup and remove it from osd.49, and then
that osd will likely start up.

If you look at 11.ac, your only incomplete pg in pool 11, the query says

    "down_osds_we_would_probe": [
        49,
        63
    ],

..so if you get that OSD up, that PG should peer.

In pool 12, you have 12.14d:

    "down_osds_we_would_probe": [
        9,
        51
    ],

osd.51 won't start due to the same assert, but on pg 15.246, and the pg ls
shows that pg is undersized but active, so doing the same --op
export-remove on that osd will hopefully let it start.  I'm guessing the
same will work on the other 12.* pg, but see if it works on 11.182 first
so that pool will be completely up and available.
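Concretely, the sequence would look something like the sketch below.  This
is only a sketch: the data-path and log locations assume the defaults, the
export file paths are just examples, you'll want --journal-path as well if
these are filestore OSDs, and if your ceph-objectstore-tool doesn't know
export-remove, an --op export followed by --op remove does the same thing.

    # 1) Capture a debug log of the osd.49 crash and upload it.
    #    First add to /etc/ceph/ceph.conf on that host:
    #        [osd.49]
    #        debug osd = 20
    systemctl start ceph-osd@49      # it will hit the assert and stop again
    ceph-post-file /var/log/ceph/ceph-osd.49.log

    # 2) With osd.49 stopped, back up and remove pg 11.182, then restart it.
    systemctl stop ceph-osd@49
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49 \
        --pgid 11.182 --op export-remove --file /root/osd49-pg-11.182.export
    systemctl start ceph-osd@49

    # 3) Same on osd.51 for pg 15.246.
    systemctl stop ceph-osd@51
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 \
        --pgid 15.246 --op export-remove --file /root/osd51-pg-15.246.export
    systemctl start ceph-osd@51

    # 4) Check whether the incomplete PGs have peered.
    ceph pg 11.ac query
    ceph pg 12.14d query

Keep the export files somewhere safe until the cluster is healthy again.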
Let us know how it goes!

sage

> If not, I am completely lost and would like to know if some experts can
> assist us, even if not for free.
>
> If yes, you can contact me by mail at philippe@xxxxxxxxx.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com