Can't get one OSD (out of 14) to start

I'm really not sure where to go with this one.  Firstly, a description of my cluster.  Yes, I know there are a lot of "not ideals" here, but this is what I inherited.

The cluster is running Jewel and has two storage/mon nodes plus an additional mon-only node, with a pool size of 2.  Today we had some power issues in the data centre and very ungracefully lost both storage servers at the same time.  Node 1 came back online before node 2, but I could see there were a few OSDs that were down.  When node 2 came back, I started trying to get the OSDs up.  Each node has 14 OSDs.  I managed to get all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  I'm not finding the OSD log output to be much use.  The current health status looks like this:

# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec
# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec

Any clues as to what I should be looking for or what sort of action I should be taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.

Here's a snippet from the OSD log that means little to me...

--- begin dump of recent events ---
     0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
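
To make the trace easier to read, the named frames can be pulled out with a short shell filter. This is only a minimal sketch: the `trace` variable is a hypothetical stand-in for the backtrace lines pasted from the OSD log above, and the `sed` pattern assumes the frame layout shown there (`N: (symbol(args)+offset) [address]`).

```shell
# Hypothetical variable holding frames copied from the backtrace above.
trace=$(cat <<'EOF'
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
EOF
)

# Keep only frames that carry a symbol name; drop the argument list,
# offset and address. Anonymous frames such as "(()+0x9f1c2a)" are skipped.
frames=$(printf '%s\n' "$trace" |
  sed -n 's/^ *[0-9]*: (\([A-Za-z_][A-Za-z0-9_:]*\)(.*$/\1/p')

printf '%s\n' "$frames"
```

Read from the bottom up, the named frames show the abort coming from an assertion inside `FileJournal::read_entry`, reached through `JournalingObjectStore::journal_replay` while `FileStore::mount` runs during `OSD::init`.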

Thanks in advance,
Mark
