Re: Can't get one OSD (out of 14) to start

Hi Mark,

I wonder if the following will help you:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/

There are instructions there on how to mark unfound PGs as lost and delete
them.  You should regain a healthy cluster that way, and then you can
adjust the replica counts etc. to match best practice and restore your
objects.
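
From memory (so please double-check the exact syntax against that page for
your release), the sequence is something like the following, where <pgid>
is a placeholder for one of the problem PGs reported by "ceph health detail":

    ceph health detail                        # list the problem PGs
    ceph pg <pgid> query                      # inspect the PG's state and any unfound objects
    ceph pg <pgid> mark_unfound_lost delete   # give up on the unfound objects in that PG

mark_unfound_lost also accepts "revert" instead of "delete" if you'd rather
roll an object back to an earlier version where one exists.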

Best regards,
--
Alex Gorbachev
ISS/Storcium



On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <markj@xxxxxxxxx> wrote:

> I ran an fsck on the problem OSD's filesystem and found and repaired a
> couple of errors.  I remounted and started the OSD, but it crashed again
> shortly after, as before.  So (possibly on bad advice) I figured I'd mark
> the OSD lost and let the cluster backfill its PGs to other OSDs, which it
> is in the process of doing now.  However, I'm seeing 1 down+incomplete
> and 3 incomplete PGs, and I expect those won't recover.
>
> So, I'd love to know what my options are here once all the backfilling
> has finished (or stalled).  Losing data or even entire PGs isn't a big
> problem, as this cluster is really just a replica of our main cluster, so
> we can restore lost objects manually from there.  Is there a way I can
> clear out/repair/whatever these PGs so I can get a healthy cluster again?
>
> Yes, I know this would have probably been easier with an additional
> storage server and a pool size of 3.  But that's not going to help me right
> now.
>
>
>
> -----Original Message-----
> From: Mark Johnson <markj@xxxxxxxxx>
> To: ceph-users@xxxxxxx
> Subject:  Can't get one OSD (out of 14) to start
> Date: Fri, 16 Apr 2021 12:43:33 +0000
>
>
> Really not sure where to go with this one.  Firstly, a description of my
> cluster.  Yes, I know there are a lot of "not ideals" here but this is what
> I inherited.
>
>
> The cluster is running Jewel and has two storage/mon nodes and an
> additional mon-only node, with a pool size of 2.  Today, we had some
> power issues in the data centre and we very ungracefully lost both storage
> servers at the same time.  Node 1 came back online before node 2 but I
> could see there were a few OSDs that were down.  When node 2 came back, I
> started trying to get OSDs up.  Each node has 14 OSDs and I managed to get
> all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting
> and crashing and just won't stay up.  I'm not finding the OSD log output
> to be of much use.  Current health status looks like this:
>
> # ceph health
> HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs
> down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5
> requests are blocked > 32 sec
>
> # ceph status
>     cluster e2391bbf-15e0-405f-af12-943610cb4909
>      health HEALTH_ERR
>             26 pgs are stuck inactive for more than 300 seconds
>             26 pgs down
>             26 pgs peering
>             26 pgs stuck inactive
>             26 pgs stuck unclean
>             5 requests are blocked > 32 sec
>
> Any clues as to what I should be looking for or what sort of action I
> should be taking to troubleshoot this?  Unfortunately, I'm a complete
> novice with Ceph.
>
>
> Here's a snippet from the OSD log that means little to me...
>
> --- begin dump of recent events ---
>      0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal
> (Aborted) **
>  in thread 7f2e23921ac0 thread_name:ceph-osd
>
>  ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
>  1: (()+0x9f1c2a) [0x7f2e24330c2a]
>  2: (()+0xf5d0) [0x7f2e21ee95d0]
>  3: (gsignal()+0x37) [0x7f2e2049f207]
>  4: (abort()+0x148) [0x7f2e204a08f8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x267) [0x7f2e2442fd47]
>  6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
> bool*)+0x90c) [0x7f2e2417bc7c]
>  7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee)
> [0x7f2e240c8dce]
>  8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
>  9: (OSD::init()+0x27d) [0x7f2e23d5828d]
>  10: (main()+0x2c18) [0x7f2e23c71088]
>  11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
>  12: (()+0x3c8847) [0x7f2e23d07847]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Thanks in advance,
>
> Mark
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


