Re: ceph recovery incomplete PGs on Luminous RC


I was able to export the PGs using ceph-objectstore-tool and import them into the new OSDs.
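
For the record, the export/import was along these lines (just a sketch; the PG id, OSD numbers, and file path below are placeholders, and the OSDs were stopped while the tool ran):

# on the host that still has the old OSD's data
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 1.28 --op export --file /tmp/pg-1.28.export

# on the host with the new OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 --op import --file /tmp/pg-1.28.export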

I moved some other OSDs from bare metal on a node into a virtual machine on the same node and was surprised at how easy it was: install Ceph in the VM (using ceph-deploy), stop the OSD, unmount the OSD drive from the physical machine, and attach it to the VM. The OSD was auto-detected, the ceph-osd process started automatically, and it was up within a few seconds.
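
Roughly, the drive move looked like this (again a sketch; the VM name and device names are made up, and I'm assuming a libvirt-managed VM):

systemctl stop ceph-osd@12        # on the physical host
umount /var/lib/ceph/osd/ceph-12
virsh attach-disk stor-vm3-osd /dev/sdg vdb --live --persistent
# inside the VM, the ceph-disk/udev rules detected the OSD, mounted it,
# and started the ceph-osd process on their own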

I'm having a different problem now, which I'll raise in a separate message.

Thanks!


On Mon, Jul 24, 2017 at 12:52 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Fri, Jul 21, 2017 at 10:23 PM Daniel K <sathackr@xxxxxxxxx> wrote:
Luminous 12.1.0 (RC)

I replaced two OSD drives (the old ones were still good, just too small), using:

ceph osd out osd.12
ceph osd crush remove osd.12
ceph auth del osd.12
systemctl stop ceph-osd@osd.12
ceph osd rm osd.12

I later found that I also should have unmounted it from /var/lib/ceph/osd-12
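
In other words, the removal probably should also have included something like this (assuming the default mount point ceph-deploy uses):

umount /var/lib/ceph/osd/ceph-12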

(remove old disk, insert new disk)

I added the new disk/OSD with ceph-deploy osd prepare stor-vm3:sdg --bluestore

This automatically activated the OSD (I'm not sure why; I thought it needed a ceph-deploy osd activate as well).
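
As I understand it, the one-step equivalent would have been something like this (same host and device as above; a sketch, not something I've re-verified on 12.1.x):

ceph-deploy osd create stor-vm3:sdg --bluestore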


Then, working on an unrelated issue, I upgraded one of the four nodes to 12.1.1 using apt and rebooted.

The mon daemon would not form a quorum with the others on 12.1.0, so, instead of troubleshooting that, I just went ahead and upgraded the other 3 nodes and rebooted.

Lots of recovery I/O went on afterwards, but now things have stopped at:

    pools:   10 pools, 6804 pgs
    objects: 1784k objects, 7132 GB
    usage:   11915 GB used, 19754 GB / 31669 GB avail
    pgs:     0.353% pgs not active
             70894/2988573 objects degraded (2.372%)
             422090/2988573 objects misplaced (14.123%)
             6626 active+clean
             129  active+remapped+backfill_wait
             23   incomplete
             14   active+undersized+degraded+remapped+backfill_wait
             4    active+undersized+degraded+remapped+backfilling
             4    active+remapped+backfilling
             2    active+clean+scrubbing+deep
             1    peering
             1    active+recovery_wait+degraded+remapped


When I run ceph pg query on the incomplete PGs, they all list at least one of the two removed OSDs (12, 17) in "down_osds_we_would_probe".
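
For reference, I'm checking them with something along these lines (the PG id here is just an example):

ceph pg 2.1a7 query | grep -A 3 down_osds_we_would_probe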

Most pools are size 2, min_size 1 (trusting BlueStore to tell me which copy is valid). One pool is size 1, min_size 1, and I'm okay with losing it, except that I had it mounted in a directory on CephFS; I rm'd the directory, but I can't delete the pool because it's "in use by CephFS".


I still have the old drives; can I stick them into another host and re-add them somehow?

Yes, that'll probably be your easiest solution. You may have some trouble because you already deleted them, but I'm not sure.

Alternatively, you ought to be able to remove the pool from CephFS using some of the monitor commands and then delete it.
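
I haven't double-checked these against 12.1.x, but I'd expect that to look roughly like the following, with placeholder fs/pool names (pool deletion may also require mon_allow_pool_delete=true on the mons):

ceph fs rm_data_pool cephfs lost-pool
ceph osd pool delete lost-pool lost-pool --yes-i-really-really-mean-it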


This data isn't super important, but I'd like to learn a bit about how to recover when bad things happen, as we are planning a production deployment in a couple of weeks.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

