Really not sure where to go with this one.

Firstly, a description of my cluster. Yes, I know there are a lot of "not ideals" here, but this is what I inherited. The cluster is running Jewel and has two storage/mon nodes and an additional mon-only node, with a pool size of 2.

Today we had some power issues in the data centre and we very ungracefully lost both storage servers at the same time. Node 1 came back online before node 2, but I could see there were a few OSDs that were down. When node 2 came back, I started trying to get OSDs up. Each node has 14 OSDs, and I managed to get all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting and then crashing, and just won't stay up. I'm not finding the OSD log output to be much use.

Current health status looks like this:

# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec

# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec

Any clues as to what I should be looking for, or what sort of action I should be taking to troubleshoot this? Unfortunately, I'm a complete novice with Ceph.

Here's a snippet from the OSD log that means little to me...

--- begin dump of recent events ---
     0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks in advance,
Mark