Re: Can't get one OSD (out of 14) to start

That's the exact same page I used to mark the OSD as lost.  Nothing in there seems to reference the incomplete and down+incomplete PGs that I have, however, so I don't know if it helps me.  I don't really understand what my problem is here.
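For reference, a rough sketch of how the affected PGs can be identified and inspected (the pg id 1.2f below is only a placeholder for one taken from the health output):

# ceph health detail
# ceph pg dump_stuck inactive
# ceph pg 1.2f query

The recovery_state section of the query output normally says what the PG is waiting on (e.g. which OSDs it would still need to probe).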



-----Original Message-----
From: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
To: Mark Johnson <markj@xxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 14:16:28 -0400

Hi Mark,

I wonder if the following will help you: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/

There are instructions there on how to mark unfound PGs lost and delete them.  You will regain a healthy cluster that way; then you can adjust replica counts etc. to best practice and restore your objects.
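Roughly along these lines (a sketch only; <pgid> is a placeholder, and mark_unfound_lost only applies to PGs that actually report unfound objects):

# ceph health detail
# ceph pg <pgid> mark_unfound_lost delete

Using delete discards the unfound objects outright; revert instead rolls back to an older copy where one still exists.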

Best regards,
--
Alex Gorbachev
ISS/Storcium



On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <markj@xxxxxxxxx> wrote:
I ran an fsck on the problem OSD and found and repaired a couple of errors.  Remounted and started the OSD, but it crashed again shortly after, as before.  So (possibly on bad advice) I figured I'd mark the OSD lost and let it write out the PGs to other OSDs, which it's in the process of backfilling.  However, I'm seeing 1 down+incomplete and 3 incomplete, and I'm expecting that these won't recover.
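(The mark-lost step was presumably something like the following; osd id 7 is just a placeholder, and the last two commands are only for watching the backfill afterwards:)

# ceph osd lost 7 --yes-i-really-mean-it
# ceph -w
# ceph pg dump_stuck unclean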

So, would love to know what my options are here when all the backfilling has finished (or stalled).  Losing data or even entire PGs isn't a big problem as this cluster is really just a replica of our main cluster so we can restore lost objects manually from there.  Is there a way I can clear out/repair/whatever these pgs so I can get a healthy cluster again?
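One approach that sometimes works for incomplete PGs whose only good copy was on the failed OSD is to export the PG from the stopped OSD with ceph-objectstore-tool and import it into a surviving OSD.  This is only a sketch, under the assumption that the FileStore data on the failed OSD is still readable (it may trip over the same journal problem the OSD itself does); the osd ids, pg id and file path are placeholders:

# (on the failed OSD's host, with the OSD stopped)
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --journal-path /var/lib/ceph/osd/ceph-7/journal --pgid 1.2f --op export --file /tmp/pg.1.2f.export
# (on a surviving OSD that should hold the PG, with that OSD stopped)
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --op import --file /tmp/pg.1.2f.export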

Yes, I know this would have probably been easier with an additional storage server and a pool size of 3.  But that's not going to help me right now.



-----Original Message-----
From: Mark Johnson <markj@xxxxxxxxx>
To: ceph-users@xxxxxxx
Subject: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 12:43:33 +0000


Really not sure where to go with this one.  Firstly, a description of my cluster.  Yes, I know there are a lot of "not ideals" here but this is what I inherited.


The cluster is running Jewel and has two storage/mon nodes and an additional mon-only node, with a pool size of 2.  Today, we had some power issues in the data centre and very ungracefully lost both storage servers at the same time.  Node 1 came back online before node 2, but I could see there were a few OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  I'm not finding the OSD log output to be much use.  Current health status looks like this:


# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec

# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec


Any clues as to what I should be looking for or what sort of action I should be taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.
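For context, a usual first look would be something along these lines (osd id 7 is a placeholder for the failing OSD; the debug settings just make the journal/filestore messages more verbose while the daemon runs in the foreground):

# ceph osd tree
# ceph health detail
# ceph-osd -f -i 7 --debug-journal=20 --debug-filestore=20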


Here's a snippet from the OSD log that means little to me...


--- begin dump of recent events ---
     0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
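For what it's worth, the trace above fails in FileJournal::read_entry during journal replay as the FileStore mounts, which usually points at a damaged OSD journal rather than the object data itself.  One heavily hedged, last-resort option (it skips the corrupt journal entries, so any writes that were only in the journal are lost for good) is the journal ignore corruption setting, scoped to just this OSD in ceph.conf; osd.7 is a placeholder for the failing OSD's id:

[osd.7]
        journal ignore corruption = true

Then try starting the OSD again and let recovery and scrubbing sort out what remains.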


Thanks in advance,

Mark


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


