Re: Power outages!!! help!

After running ceph osd lost osd.0, the cluster started backfilling... I figured that was supposed to happen earlier, when I added those missing PGs.  Earlier, after hitting "too few PGs per OSD" and the cluster stopping working when I added OSDs, I removed those OSDs again, but I guess I still needed them.  Currently I see several incomplete PGs and am trying to import those PGs back. =P
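
(A quick way to keep an eye on the backfill and list the PGs that are still incomplete, using only the standard status commands, nothing cluster-specific assumed:)
ceph -s
ceph health detail | grep incomplete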

As far as 1.28 goes, it didn't look like it was limited to osd.0: the logs showed no sign of osd.0, and its data is only available on osd.4, which wouldn't export... So I still need to deal with that one.  It is still showing up as incomplete.. =P  Any recommendations on how to get that back?
pg 1.28 is stuck inactive since forever, current state down+incomplete, last acting [11,6]
pg 1.28 is stuck unclean since forever, current state down+incomplete, last acting [11,6]
pg 1.28 is down+incomplete, acting [11,6] (reducing pool metadata min_size from 2 may help; search ceph.com/docs for 'incomplete')
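
(As that health message itself hints, one thing that may be worth trying, and reverting afterwards, is temporarily lowering min_size on the metadata pool and then querying the PG to see what it is still waiting for. A minimal sketch, using only the pool name and value from the message above:)
ceph osd pool set metadata min_size 1
ceph pg 1.28 query
ceph osd pool set metadata min_size 2   # put it back once the PG recovers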

Regards,
Hong



On Friday, September 15, 2017 4:51 AM, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:



You write that you had all PGs exported except one, so I assume you have
injected those PGs into the cluster again using the method linked a few
times in this thread. How did that go? Were you successful in recovering
those PGs?

Kind regards,
Ronny Aasen



On 15. sep. 2017 07:52, hjcho616 wrote:
> I just did this and backfilling started.  Let's see where this takes me.
> ceph osd lost 0 --yes-i-really-mean-it
>
> Regards,
> Hong
>
>
> On Friday, September 15, 2017 12:44 AM, hjcho616 <hjcho616@xxxxxxxxx> wrote:
>
>
> Ronny,
>
> Working from the list of PGs shown in "ceph health detail", I ran the
> command below for each PG to export it:
> ceph-objectstore-tool --op export --pgid 0.1c  --data-path
> /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal
> --skip-journal-replay --file 0.1c.export
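>
> (A loop over those PG ids would look roughly like the sketch below; the PG
> list and the osd.0 paths are placeholders to be substituted from
> "ceph health detail" and from whichever OSD actually holds each PG:)
> for pgid in 0.1c 0.29 1.2b; do    # placeholder list
>     ceph-objectstore-tool --op export --pgid $pgid \
>         --data-path /var/lib/ceph/osd/ceph-0 \
>         --journal-path /var/lib/ceph/osd/ceph-0/journal \
>         --skip-journal-replay --file $pgid.export
> done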
>
> I have all PGs exported except one... PG 1.28.  It is on ceph-4, and the
> error below doesn't make much sense to me.  Looking at the source code at
> https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that
> message is telling me struct_v is 1... but I'm not sure how it ended up in
> the default branch of the switch statement when a case for 1 is defined...
> I also tried with --skip-journal-replay; it fails with the same error message.
> ceph-objectstore-tool --op export --pgid 1.28  --data-path
> /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal
> --file 1.28.export
> terminate called after throwing an instance of 'std::domain_error'
>    what():  coll_t::decode(): don't know how to decode version 1
> *** Caught signal (Aborted) **
>  in thread 7fabc5ecc940 thread_name:ceph-objectstor
>  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
>  1: (()+0x996a57) [0x55b2d3323a57]
>  2: (()+0x110c0) [0x7fabc46d50c0]
>  3: (gsignal()+0xcf) [0x7fabc2b08fcf]
>  4: (abort()+0x16a) [0x7fabc2b0a3fa]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d]
>  6: (()+0x5ebb6) [0x7fabc33edbb6]
>  7: (()+0x5ec01) [0x7fabc33edc01]
>  8: (()+0x5ee19) [0x7fabc33ede19]
>  9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e]
>  10:
> (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125)
> [0x55b2d31315f5]
>  11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9]
>  12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8]
>  13: (FileStore::mount()+0x2525) [0x55b2d305ceb5]
>  14: (main()+0x28c0) [0x55b2d2c8d400]
>  15: (__libc_start_main()+0xf1) [0x7fabc2af62b1]
>  16: (()+0x34f747) [0x55b2d2cdc747]
> Aborted
>
> Then I wrote a simple script to run the import process... it just creates
> an OSD per PG.  Basically, I ran the commands below for each PG (a
> parameterized sketch of the same steps follows after them):
> mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> ceph osd crush reweight osd.$(cat
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0
> systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
> ceph-objectstore-tool --op import --pgid 0.1c  --data-path
> /var/lib/ceph/osd/ceph-$(cat
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path
> /var/lib/ceph/osd/ceph-$(cat
> /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file
> ./export/0.1c.export
> chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
> systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
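>
> (A parameterized sketch of those same steps, taking the PG id as an
> argument; it assumes the same /var/lib/ceph/osd/ceph-5/tmposd_* staging
> directories and an ./export/ directory holding the exports, exactly as above:)
> #!/bin/bash
> pg=$1
> dir=/var/lib/ceph/osd/ceph-5/tmposd_${pg}
> mkdir $dir
> ceph-disk prepare $dir
> chown -R ceph.ceph $dir
> ceph-disk activate $dir
> id=$(cat $dir/whoami)
> ceph osd crush reweight osd.$id 0
> systemctl stop ceph-osd@$id
> ceph-objectstore-tool --op import --pgid $pg \
>     --data-path /var/lib/ceph/osd/ceph-$id \
>     --journal-path /var/lib/ceph/osd/ceph-$id/journal \
>     --file ./export/${pg}.export
> chown -R ceph.ceph $dir
> systemctl start ceph-osd@$id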
>
> Sometimes the import didn't work right away, but stopping the OSD and
> rerunning ceph-objectstore-tool seemed to help when a PG didn't really
> want to import.
>
> Unfound messages are gone!  But I still have down+peering, or
> down+remapped+peering.
> # ceph health detail
> HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs
> down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs
> stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow
> requests; 2 scrub errors; mds cluster is degraded; noout flag(s) set; no
> legacy OSD present but 'sortbitwise' flag is not set
> pg 1.d is stuck inactive since forever, current state down+peering, last
> acting [11,2]
> pg 0.a is stuck inactive since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 2.8 is stuck inactive since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 2.b is stuck inactive since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 1.9 is stuck inactive since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 0.e is stuck inactive since forever, current state down+peering, last
> acting [11,2]
> pg 1.3d is stuck inactive since forever, current state
> down+remapped+peering, last acting [10,6]
> pg 0.2c is stuck inactive since forever, current state down+peering,
> last acting [1,11]
> pg 0.0 is stuck inactive since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 1.2b is stuck inactive since forever, current state down+peering,
> last acting [1,11]
> pg 0.29 is stuck inactive since forever, current state down+peering,
> last acting [11,6]
> pg 1.28 is stuck inactive since forever, current state down+peering,
> last acting [11,6]
> pg 2.3 is stuck inactive since forever, current state down+peering, last
> acting [11,7]
> pg 1.1b is stuck inactive since forever, current state
> down+remapped+peering, last acting [11,6]
> pg 0.d is stuck inactive since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 1.c is stuck inactive since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 0.3b is stuck inactive since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 2.39 is stuck inactive since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 1.3a is stuck inactive since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 0.5 is stuck inactive since forever, current state down+peering, last
> acting [11,7]
> pg 1.4 is stuck inactive since forever, current state down+peering, last
> acting [11,7]
> pg 0.1c is stuck inactive since forever, current state down+peering,
> last acting [11,6]
> pg 1.d is stuck unclean since forever, current state down+peering, last
> acting [11,2]
> pg 0.a is stuck unclean since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 2.8 is stuck unclean since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 2.b is stuck unclean since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 1.9 is stuck unclean since forever, current state
> down+remapped+peering, last acting [11,7]
> pg 0.e is stuck unclean since forever, current state down+peering, last
> acting [11,2]
> pg 1.3d is stuck unclean since forever, current state
> down+remapped+peering, last acting [10,6]
> pg 0.d is stuck unclean since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 1.c is stuck unclean since forever, current state
> down+remapped+peering, last acting [7,11]
> pg 0.3b is stuck unclean since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 1.3a is stuck unclean since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 2.39 is stuck unclean since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 0.5 is stuck unclean since forever, current state down+peering, last
> acting [11,7]
> pg 1.4 is stuck unclean since forever, current state down+peering, last
> acting [11,7]
> pg 0.1c is stuck unclean since forever, current state down+peering, last
> acting [11,6]
> pg 1.1b is stuck unclean since forever, current state
> down+remapped+peering, last acting [11,6]
> pg 2.3 is stuck unclean since forever, current state down+peering, last
> acting [11,7]
> pg 0.0 is stuck unclean since forever, current state
> down+remapped+peering, last acting [10,7]
> pg 1.28 is stuck unclean since forever, current state down+peering, last
> acting [11,6]
> pg 0.29 is stuck unclean since forever, current state down+peering, last
> acting [11,6]
> pg 1.2b is stuck unclean since forever, current state down+peering, last
> acting [1,11]
> pg 0.2c is stuck unclean since forever, current state down+peering, last
> acting [1,11]
> pg 0.2c is down+peering, acting [1,11]
> pg 1.2b is down+peering, acting [1,11]
> pg 0.29 is down+peering, acting [11,6]
> pg 1.28 is down+peering, acting [11,6]
> pg 0.0 is down+remapped+peering, acting [10,7]
> pg 2.3 is down+peering, acting [11,7]
> pg 1.1b is down+remapped+peering, acting [11,6]
> pg 0.1c is down+peering, acting [11,6]
> pg 2.39 is down+remapped+peering, acting [10,7]
> pg 1.3a is down+remapped+peering, acting [10,7]
> pg 0.3b is down+remapped+peering, acting [10,7]
> pg 1.3d is down+remapped+peering, acting [10,6]
> pg 2.7 is active+clean+inconsistent, acting [2,11]
> pg 1.4 is down+peering, acting [11,7]
> pg 0.5 is down+peering, acting [11,7]
> pg 1.9 is down+remapped+peering, acting [11,7]
> pg 2.b is down+remapped+peering, acting [7,11]
> pg 2.8 is down+remapped+peering, acting [11,7]
> pg 0.a is down+remapped+peering, acting [11,7]
> pg 1.d is down+peering, acting [11,2]
> pg 1.c is down+remapped+peering, acting [7,11]
> pg 0.d is down+remapped+peering, acting [7,11]
> pg 0.e is down+peering, acting [11,2]
> 1 ops are blocked > 8388.61 sec on osd.10
> 1 osds have slow requests
> 2 scrub errors
> mds cluster is degraded
> mds.MDS1.2 at 192.168.1.20:6801/3142084617 rank 0 is replaying journal
> noout flag(s) set
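>
> (For reference, the reason one of these PGs is stuck peering can usually
> be seen in its query output, e.g. for the first PG in the list above:)
> ceph pg 1.d query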
>
> What would be the next step?
>
> Regards,
> Hong




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
