Re: Power outages!!! help!

You wrote that you had all PGs exported except one, so I assume you have injected those PGs into the cluster again using the method linked a few times in this thread. How did that go? Were you successful in recovering those PGs?

Kind regards,
Ronny Aasen



On 15. sep. 2017 07:52, hjcho616 wrote:
I just ran the command below and backfilling started. Let's see where this takes me.
ceph osd lost 0 --yes-i-really-mean-it
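
To see where this takes me, the usual status commands give the backfill progress (nothing specific to this recovery):

# watch overall cluster state and recovery/backfill progress
ceph -s
ceph pg stat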

Regards,
Hong


On Friday, September 15, 2017 12:44 AM, hjcho616 <hjcho616@xxxxxxxxx> wrote:


Ronny,

Working with all of the PGs shown in "ceph health detail", I ran the command below for each PG to export it:

ceph-objectstore-tool --op export --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --skip-journal-replay --file 0.1c.export
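
Rather than running that by hand for every PG, a loop over the PG list does the same thing (rough sketch only; pg_list.txt here stands for a file with the PG IDs taken from "ceph health detail", one per line, and the OSD paths would need to match whichever OSD holds each PG):

# export every PG named in pg_list.txt from osd.0
while read -r pgid; do
    ceph-objectstore-tool --op export --pgid "$pgid" \
        --data-path /var/lib/ceph/osd/ceph-0 \
        --journal-path /var/lib/ceph/osd/ceph-0/journal \
        --skip-journal-replay --file "$pgid".export
done < pg_list.txt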

I have all PGs exported except one: PG 1.28, on ceph-4. This error doesn't make much sense to me. Looking at the source code at https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that message tells me struct_v is 1... but I'm not sure how it ended up in the default branch of the switch statement when a case for 1 is defined. I tried with --skip-journal-replay; it fails with the same error message.

ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file 1.28.export
terminate called after throwing an instance of 'std::domain_error'
   what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
  in thread 7fabc5ecc940 thread_name:ceph-objectstor
  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
  1: (()+0x996a57) [0x55b2d3323a57]
  2: (()+0x110c0) [0x7fabc46d50c0]
  3: (gsignal()+0xcf) [0x7fabc2b08fcf]
  4: (abort()+0x16a) [0x7fabc2b0a3fa]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d]
  6: (()+0x5ebb6) [0x7fabc33edbb6]
  7: (()+0x5ec01) [0x7fabc33edc01]
  8: (()+0x5ee19) [0x7fabc33ede19]
  9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e]
  10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55b2d31315f5]
  11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9]
  12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8]
  13: (FileStore::mount()+0x2525) [0x55b2d305ceb5]
  14: (main()+0x28c0) [0x55b2d2c8d400]
  15: (__libc_start_main()+0xf1) [0x7fabc2af62b1]
  16: (()+0x34f747) [0x55b2d2cdc747]
Aborted
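
Since the abort happens inside FileStore::mount() while it is checking the object map (DBObjectMap::check), one idea for PG 1.28 is to retry the export while skipping the omap mount, if this ceph-objectstore-tool build supports that option (check --help first; this is only a sketch, not something verified here):

# retry the export without mounting the omap; option availability depends on the build
ceph-objectstore-tool --op export --pgid 1.28 \
    --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --skip-journal-replay --skip-mount-omap --file 1.28.export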

Then I wrote a simple script to run the import process... I just created an OSD per PG. Basically I ran the following for each PG (a rough loop version is sketched after the commands):
mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
ceph osd crush reweight osd.$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0
systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
ceph-objectstore-tool --op import --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file ./export/0.1c.export
chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/
systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)
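
A loop version of that sequence, under the same assumptions (export files in ./export/, temporary directories under /var/lib/ceph/osd/ceph-5/, one per PG), looks roughly like this:

# sketch of the per-PG import loop
for pgid in 0.1c; do                     # add the other exported PG IDs here
    dir=/var/lib/ceph/osd/ceph-5/tmposd_"$pgid"
    mkdir "$dir"
    ceph-disk prepare "$dir"
    chown -R ceph.ceph "$dir"
    ceph-disk activate "$dir"
    id=$(cat "$dir"/whoami)              # OSD id assigned to the temporary OSD
    ceph osd crush reweight osd."$id" 0
    systemctl stop ceph-osd@"$id"
    ceph-objectstore-tool --op import --pgid "$pgid" \
        --data-path /var/lib/ceph/osd/ceph-"$id" \
        --journal-path /var/lib/ceph/osd/ceph-"$id"/journal \
        --file ./export/"$pgid".export
    chown -R ceph.ceph "$dir"
    systemctl start ceph-osd@"$id"
done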

Sometimes the import didn't work, but stopping the OSD and rerunning ceph-objectstore-tool seemed to help when a PG didn't really want to import.

The unfound messages are gone! But I still have PGs in down+peering or down+remapped+peering.
# ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests; 2 scrub errors; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set
pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2]
pg 0.a is stuck inactive since forever, current state down+remapped+peering, last acting [11,7]
pg 2.8 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7]
pg 2.b is stuck inactive since forever, current state down+remapped+peering, last acting [7,11]
pg 1.9 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7]
pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2]
pg 1.3d is stuck inactive since forever, current state down+remapped+peering, last acting [10,6]
pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11]
pg 0.0 is stuck inactive since forever, current state down+remapped+peering, last acting [10,7]
pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11]
pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6]
pg 1.28 is stuck inactive since forever, current state down+peering, last acting [11,6]
pg 2.3 is stuck inactive since forever, current state down+peering, last acting [11,7]
pg 1.1b is stuck inactive since forever, current state down+remapped+peering, last acting [11,6]
pg 0.d is stuck inactive since forever, current state down+remapped+peering, last acting [7,11]
pg 1.c is stuck inactive since forever, current state down+remapped+peering, last acting [7,11]
pg 0.3b is stuck inactive since forever, current state down+remapped+peering, last acting [10,7]
pg 2.39 is stuck inactive since forever, current state down+remapped+peering, last acting [10,7]
pg 1.3a is stuck inactive since forever, current state down+remapped+peering, last acting [10,7]
pg 0.5 is stuck inactive since forever, current state down+peering, last acting [11,7]
pg 1.4 is stuck inactive since forever, current state down+peering, last acting [11,7]
pg 0.1c is stuck inactive since forever, current state down+peering, last acting [11,6]
pg 1.d is stuck unclean since forever, current state down+peering, last acting [11,2]
pg 0.a is stuck unclean since forever, current state down+remapped+peering, last acting [11,7]
pg 2.8 is stuck unclean since forever, current state down+remapped+peering, last acting [11,7]
pg 2.b is stuck unclean since forever, current state down+remapped+peering, last acting [7,11]
pg 1.9 is stuck unclean since forever, current state down+remapped+peering, last acting [11,7]
pg 0.e is stuck unclean since forever, current state down+peering, last acting [11,2]
pg 1.3d is stuck unclean since forever, current state down+remapped+peering, last acting [10,6]
pg 0.d is stuck unclean since forever, current state down+remapped+peering, last acting [7,11]
pg 1.c is stuck unclean since forever, current state down+remapped+peering, last acting [7,11]
pg 0.3b is stuck unclean since forever, current state down+remapped+peering, last acting [10,7]
pg 1.3a is stuck unclean since forever, current state down+remapped+peering, last acting [10,7]
pg 2.39 is stuck unclean since forever, current state down+remapped+peering, last acting [10,7]
pg 0.5 is stuck unclean since forever, current state down+peering, last acting [11,7]
pg 1.4 is stuck unclean since forever, current state down+peering, last acting [11,7]
pg 0.1c is stuck unclean since forever, current state down+peering, last acting [11,6]
pg 1.1b is stuck unclean since forever, current state down+remapped+peering, last acting [11,6]
pg 2.3 is stuck unclean since forever, current state down+peering, last acting [11,7]
pg 0.0 is stuck unclean since forever, current state down+remapped+peering, last acting [10,7]
pg 1.28 is stuck unclean since forever, current state down+peering, last acting [11,6]
pg 0.29 is stuck unclean since forever, current state down+peering, last acting [11,6]
pg 1.2b is stuck unclean since forever, current state down+peering, last acting [1,11]
pg 0.2c is stuck unclean since forever, current state down+peering, last acting [1,11]
pg 0.2c is down+peering, acting [1,11]
pg 1.2b is down+peering, acting [1,11]
pg 0.29 is down+peering, acting [11,6]
pg 1.28 is down+peering, acting [11,6]
pg 0.0 is down+remapped+peering, acting [10,7]
pg 2.3 is down+peering, acting [11,7]
pg 1.1b is down+remapped+peering, acting [11,6]
pg 0.1c is down+peering, acting [11,6]
pg 2.39 is down+remapped+peering, acting [10,7]
pg 1.3a is down+remapped+peering, acting [10,7]
pg 0.3b is down+remapped+peering, acting [10,7]
pg 1.3d is down+remapped+peering, acting [10,6]
pg 2.7 is active+clean+inconsistent, acting [2,11]
pg 1.4 is down+peering, acting [11,7]
pg 0.5 is down+peering, acting [11,7]
pg 1.9 is down+remapped+peering, acting [11,7]
pg 2.b is down+remapped+peering, acting [7,11]
pg 2.8 is down+remapped+peering, acting [11,7]
pg 0.a is down+remapped+peering, acting [11,7]
pg 1.d is down+peering, acting [11,2]
pg 1.c is down+remapped+peering, acting [7,11]
pg 0.d is down+remapped+peering, acting [7,11]
pg 0.e is down+peering, acting [11,2]
1 ops are blocked > 8388.61 sec on osd.10
1 osds have slow requests
2 scrub errors
mds cluster is degraded
mds.MDS1.2 at 192.168.1.20:6801/3142084617 rank 0 is replaying journal
noout flag(s) set
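
To dig further into why those PGs stay in peering, per-PG queries are probably the next data point to collect:

# detailed peering state for one of the down PGs, plus a summary of stuck PGs
ceph pg 1.28 query
ceph pg dump_stuck inactive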

What would be the next step?

Regards,
Hong



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


