Hi Mark,

I wonder if the following will help you:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/

There are instructions there on how to mark unfound PGs lost and delete
them. You will regain a healthy cluster that way, and then you can adjust
replica counts etc. to best practice, and restore your objects.

Best regards,
--
Alex Gorbachev
ISS/Storcium

On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <markj@xxxxxxxxx> wrote:
> I ran an fsck on the problem OSD and found and repaired a couple of
> errors. I remounted and started the OSD, but it crashed again shortly
> after, just as before. So (possibly on bad advice) I figured I'd mark
> the OSD lost and let it write out the PGs to other OSDs, which it's in
> the process of backfilling. However, I'm seeing 1 down+incomplete and
> 3 incomplete, and I expect these won't recover.
>
> So, I would love to know what my options are here once all the
> backfilling has finished (or stalled). Losing data or even entire PGs
> isn't a big problem, as this cluster is really just a replica of our
> main cluster, so we can restore lost objects manually from there. Is
> there a way I can clear out/repair/whatever these PGs so I can get a
> healthy cluster again?
>
> Yes, I know this would probably have been easier with an additional
> storage server and a pool size of 3, but that's not going to help me
> right now.
>
>
> -----Original Message-----
> From: Mark Johnson <markj@xxxxxxxxx>
> To: ceph-users@xxxxxxx
> Subject: Can't get one OSD (out of 14) to start
> Date: Fri, 16 Apr 2021 12:43:33 +0000
>
> Really not sure where to go with this one. Firstly, a description of
> my cluster. Yes, I know there are a lot of "not ideals" here, but this
> is what I inherited.
>
> The cluster is running Jewel and has two storage/mon nodes and an
> additional mon-only node, with a pool size of 2. Today we had some
> power issues in the data centre and we very ungracefully lost both
> storage servers at the same time. Node 1 came back online before node
> 2, but I could see there were a few OSDs that were down. When node 2
> came back, I started trying to get OSDs up. Each node has 14 OSDs and
> I managed to get all OSDs up and in on node 2, but one of the OSDs on
> node 1 keeps starting and crashing and just won't stay up. I'm not
> finding the OSD log output to be much use. Current health status looks
> like this:
>
> # ceph health
> HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs
> down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5
> requests are blocked > 32 sec
>
> # ceph status
>     cluster e2391bbf-15e0-405f-af12-943610cb4909
>      health HEALTH_ERR
>             26 pgs are stuck inactive for more than 300 seconds
>             26 pgs down
>             26 pgs peering
>             26 pgs stuck inactive
>             26 pgs stuck unclean
>             5 requests are blocked > 32 sec
>
> Any clues as to what I should be looking for or what sort of action I
> should be taking to troubleshoot this? Unfortunately, I'm a complete
> novice with Ceph.
>
> Here's a snippet from the OSD log that means little to me...
>
> --- begin dump of recent events ---
>      0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
>  in thread 7f2e23921ac0 thread_name:ceph-osd
>
>  ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
>  1: (()+0x9f1c2a) [0x7f2e24330c2a]
>  2: (()+0xf5d0) [0x7f2e21ee95d0]
>  3: (gsignal()+0x37) [0x7f2e2049f207]
>  4: (abort()+0x148) [0x7f2e204a08f8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
>  6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
>  7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
>  8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
>  9: (OSD::init()+0x27d) [0x7f2e23d5828d]
>  10: (main()+0x2c18) [0x7f2e23c71088]
>  11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
>  12: (()+0x3c8847) [0x7f2e23d07847]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>  needed to interpret this.
>
> Thanks in advance,
> Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
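
A note on the backtrace above: the assert fires in FileJournal::read_entry
during JournalingObjectStore::journal_replay, i.e. the OSD is dying while
replaying its FileStore journal at mount time, which suggests the power
loss left that OSD's journal corrupt rather than the object store itself.
A minimal sketch of what one could try on a Jewel FileStore OSD before
giving up on it, assuming N is the ID of the crashing OSD and accepting
that recreating the journal discards any writes that were never flushed
to the store:

    # try to flush whatever is still readable from the journal
    # (likely to hit the same replay error if the journal is corrupt)
    ceph-osd -i N --flush-journal

    # last resort: discard the journal and create a fresh, empty one;
    # any unflushed writes on this OSD are lost for good
    ceph-osd -i N --mkjournal

Since the OSD has already been marked lost and its PGs are backfilling
elsewhere, wiping it and re-adding it as a fresh OSD is probably the
simpler path at this point.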
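
To pin down which PGs are behind the "1 down+incomplete and 3 incomplete"
and what they are waiting on, the usual starting point is something like
the following (the PG ID 1.2f3 is only a placeholder):

    # list the unhealthy PGs and the OSDs they map to
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # ask one problem PG why it is stuck; fields such as "blocked_by"
    # and "down_osds_we_would_probe" in the output name the OSDs it
    # still wants to hear from
    ceph pg 1.2f3 query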
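
The "mark unfound lost" procedure from the page Alex linked then looks
roughly like this, run once per affected PG (again with a placeholder PG
ID). It applies to objects the cluster knows about but cannot find;
whether it is enough to clear an "incomplete" PG depends on whether the
surviving replicas can still peer at all:

    # list the objects this PG cannot find
    ceph pg 1.2f3 list_missing

    # give up on them: 'delete' forgets them entirely, 'revert' rolls
    # each one back to an older copy where one exists
    ceph pg 1.2f3 mark_unfound_lost delete

Given that lost objects can be restored manually from the main cluster
afterwards, 'delete' is probably the variant that fits this situation.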