Single OSD crash/restarting during scrub operation on specific PG

We've recently recovered from a bit of a disaster where we had some power outages (a combination of data centre power maintenance and us not having our redundant power supplies connected to the correct redundant power circuits - lesson learnt).  We ended up with one OSD that wouldn't start - it seems to have been filesystem corruption, as an fsck found and fixed a couple of errors, but the OSD still wouldn't start, so we ended up marking it lost and letting the data backfill to other OSDs.  That left us with a handful of 'incomplete' or 'incomplete/down' pgs, which was causing radosgw to stop accepting connections.  We found a useful blog post that got us to the point of using ceph-objectstore-tool to determine the correct remaining copy and mark those pgs as complete.  Backfill then wrote the pgs out to different OSDs, the cluster returned to HEALTH_OK, and radosgw started working normally.  At least, that's what I thought.
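
(For the record, the procedure looked roughly like this, from memory - the OSD id, pgid and paths below are placeholders rather than our actual values, and the ceph-objectstore-tool steps were run with the relevant OSD stopped:)

   # find the incomplete pgs and which OSDs still hold copies
   ceph health detail
   ceph pg <pgid> query

   # give up on the OSD that wouldn't start
   ceph osd lost <osd-id> --yes-i-really-mean-it

   # on the OSD holding the best remaining copy, take a backup of the pg,
   # then mark it complete so peering could finish
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-id> \
       --journal-path /var/lib/ceph/osd/ceph-<osd-id>/journal \
       --pgid <pgid> --op export --file /root/<pgid>.export
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-id> \
       --journal-path /var/lib/ceph/osd/ceph-<osd-id>/journal \
       --pgid <pgid> --op mark-complete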

Now, I'm occasionally seeing one OSD crash every now and then - sometimes after a few hours, sometimes after only 10 minutes.  It always starts itself back up again, the queued-up backfills cancel, and the cluster returns to OK until the next time.  It's always the same OSD, and going through the logs just now, it seems to always happen while a scrub operation is running on the same pg (although I haven't checked every single instance to be completely sure).
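
(For reference, this is roughly the sort of check I ran - the log path and OSD id are placeholders, not our actual values:)

   # list the aborts (with timestamps)
   grep 'Caught signal' /var/log/ceph/ceph-osd.<id>.log
   # and see which scrub preceded each one
   grep -B 100 'Caught signal' /var/log/ceph/ceph-osd.<id>.log | grep 'replica scrub(pg: 30.65'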

We're running Jewel (yes, I know it's old but we can't upgrade).

Here are the last couple of lines from the OSD log when it crashed on two different occasions.  I've used the hex thread id from the "Caught signal" line to pull out the matching events in each instance.  It looks roughly the same both times in that it's always the same pg; however, the last object shown in the log prior to the crash always seems to be different.

   -91> 2021-04-21 02:37:32.118290 7fed046e6700  5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 3110'3174946, trimmed:
   -90> 2021-04-21 02:37:32.118380 7fed046e6700  5 -- op tracker -- seq: 2219, time: 2021-04-21 02:37:32.118379, event: commit_queued_for_journal_write, op: osd_repop(client.3095191.0:19172420 30.65 30:a78d321d:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.18432929_1149756827.ogg:head v 3110'3174946)
    -1> 2021-04-21 02:37:32.748831 7fed046e6700  5 -- op tracker -- seq: 2221, time: 2021-04-21 02:37:32.748830, event: reached_pg, op: replica scrub(pg: 30.65,from:0'0,to:2923'3171906,epoch:3110,start:30:a63a08df:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.22382834__shadow_.cqXthfu1litKEyNZ53I_voGLwuhonVX_1:0,end:30:a63a13fd:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.26065941_1234799607.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 02:37:32.797826 7fed046e6700 -1 os/filestore/FileStore.cc: In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, IndexedPath*)' thread 7fed046e6700 time 2021-04-21 02:37:32.790356
2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) **
 in thread 7fed046e6700 thread_name:tp_osd_tp
     0> 2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) **
 in thread 7fed046e6700 thread_name:tp_osd_tp


    -17> 2021-04-21 03:55:09.090430 7f43382c7700  5 -- op tracker -- seq: 1596, time: 2021-04-21 03:55:09.090430, event: done, op: replica scrub(pg: 30.65,from:0'0,to:2979'3174652,epoch:3122,start:30:a639eb4a:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.17157485_1132337117.ogg:0,end:30:a639f7f1:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.21768594_1188151257.ogg:0,chunky:1,deep:1,seed:4294967295,version:6)
    -5> 2021-04-21 03:55:09.777503 7f43382c7700  5 -- op tracker -- seq: 1598, time: 2021-04-21 03:55:09.777476, event: reached_pg, op: replica scrub(pg: 30.65,from:0'0,to:2929'3172006,epoch:3122,start:30:a63a047c:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.18649542__shadow_.tKOWzKIibnLhX3Bu32FiiuG0FH1lIl4_1:0,end:30:a63a0ea6:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.14411965_1101425097.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 03:55:10.089217 7f43382c7700 -1 os/filestore/FileStore.cc: In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, IndexedPath*)' thread 7f43382c7700 time 2021-04-21 03:55:10.081373
2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) **
 in thread 7f43382c7700 thread_name:tp_osd_tp
     0> 2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) **
 in thread 7f43382c7700 thread_name:tp_osd_tp
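
(For reference, this is roughly how I pulled the lines above out of the log - grepping for the thread id reported on the "Caught signal" line; the log path and OSD id are placeholders:)

   # get the thread id from the abort
   grep 'Caught signal' /var/log/ceph/ceph-osd.<id>.log | tail -n 2
   # then pull the preceding events from that same thread
   grep 7fed046e6700 /var/log/ceph/ceph-osd.<id>.log | tail -n 100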

Any ideas what to do next?

Regards,
Mark Johnson

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


