OSDs won't start - thread abort

So several events unfolded that may have led to this situation. In hindsight, some of them were probably not the smartest decisions, particularly adjusting the EC pool and restarting the OSDs several times during these migrations.

  1. Added a new 6th OSD with ceph-ansible
    1. The run hung during the restart of the OSDs because they were set to noup, and one of the original OSDs wouldn't come back online because of it. I manually unset noup and all 6 OSDs went up/in (flag commands are sketched after this list).
  2. Objects were showing as degraded/misplaced.
  3. Saw strange behavior restarting one OSD at a time and waiting for it to stabilize: depending on which OSD was restarted last, different backfill or move operations would take place.
  4. Adjusted the recovery/backfill sleep and concurrent-move settings to speed up relocation (an example invocation is sketched after this list).
  5. Decided that if all the data was going to move anyway, I should adjust my jerasure EC profile from k=4, m=1 to k=5, m=1 with --force (is this even recommended, versus just creating new pools???). The profile commands are sketched after this list.
    1. Initially this reset crush-device-class=hdd to blank.
    2. Re-set crush-device-class to hdd.
    3. Couldn't determine if this had any effect on the move operations.
    4. Changed back to k=4
  6. Let some of the backfill work through, but ran into toofull situations even though the OSDs had plenty of space.
    1. Decided to add PGs to the EC pool, going from 64 to 150 (pg_num commands sketched after this list).
  7. Restarted one OSD at a time again, waiting for each to be healthy before moving on (I probably should have been setting noout; see the sketch after this list).
  8. Eventually one of the old OSDs refused to start due to a thread abort relating to stripe size (see the gist linked below).
  9. Tried restarting the other OSDs; they all came back online fine.
  10. Some time passed, and then the new OSD crashed and won't start back up, hitting the same stripe-size abort.
    1. Now 2 OSDs are down and won't start back up due to that same condition, and the data is no longer available.
    2. 149 PGs are showing as incomplete due to min_size 5 (shouldn't it be 1, based on the original/new EC profile settings?). Diagnostic commands are sketched after this list.
    3. 1 PG is down.
    4. 21 PGs are unknown.
    5. Some of the PGs were still "new PGs" from increasing the pg_num of the pool.
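
For reference, here are rough sketches of the commands involved at each step; names in angle brackets are placeholders rather than the exact values from my cluster. Checking and clearing the flags from step 1 looked roughly like this:

    # see which cluster-wide flags are currently set
    ceph osd dump | grep flags
    # clear noup so the OSDs can be marked up again
    ceph osd unset noup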
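
The recovery/backfill tuning in step 4 was done with injectargs; these particular values are only an example, not necessarily what I used:

    # trade client I/O for faster recovery (example values only)
    ceph tell osd.* injectargs '--osd-recovery-sleep 0 --osd-max-backfills 4 --osd-recovery-max-active 8'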
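
The profile change in step 5 was along these lines; overwriting an existing profile requires --force, which is the part I'm questioning:

    # inspect the current profile
    ceph osd erasure-code-profile get <profile>
    # overwrite it in place (the questionable step)
    ceph osd erasure-code-profile set <profile> k=5 m=1 crush-device-class=hdd --force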
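
The PG increase in step 6 was the usual pair of pool settings; my understanding is that pgp_num also has to be raised before data actually starts moving:

    ceph osd pool set <pool> pg_num 150
    ceph osd pool set <pool> pgp_num 150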
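
For the rolling restarts in step 7, this is what I understand the usual procedure to be, and what I should have done:

    # stop CRUSH from rebalancing while an OSD is briefly down
    ceph osd set noout
    systemctl restart ceph-osd@<id>
    # ...wait for the cluster to settle, then...
    ceph osd unset noout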
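
The PG states in step 10 came from the usual diagnostics, roughly:

    # summary of incomplete/down/unknown PGs
    ceph health detail
    # list PGs stuck inactive
    ceph pg dump_stuck inactive
    # detailed state and peering history of a single PG
    ceph pg <pgid> query
    # confirm what min_size the pool is actually running with
    ceph osd pool get <pool> min_size
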
So yeah, somewhat of a cluster of changing too many things at once here, but I didn't realize that the things I was doing could potentially have this result.

The two OSDs that won't start should still have all of the data on them. It seems like they are having issues with at least one PG in particular from the EC pool that was adjusted, but presumably the rest of the data should be fine, and hopefully there is a way to get them to start up again. I saw a similar issue posted to the list a few years ago, but there was never any follow-up from the user having the issue.

https://gist.github.com/arodd/c95355a7b55f3e4a94f21bc5e801943d
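
In case it helps with suggestions: my understanding is that ceph-objectstore-tool can list and export PGs from a stopped OSD, roughly as below (the data path and pgid are placeholders, I haven't attempted this yet, and a filestore OSD would also need --journal-path):

    # with the OSD stopped, list the PGs it holds
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op list-pgs
    # export one PG to a file as a backup before attempting anything else
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --op export --pgid <pgid> --file /root/<pgid>.export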
