Hi list,
Between crashes we were able to let the cluster backfill as much as
possible (all monitors on Infernalis, all OSDs being Hammer again).
Leftover PGs wouldn't backfill until we removed files such as:
8.0M -rw-r--r-- 1 root root 8.0M Aug 24 23:56
temp\u3.bd\u0\u16175417\u2718__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 28 05:51
temp\u3.bd\u0\u16175417\u3992__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 30 03:40
temp\u3.bd\u0\u16175417\u4521__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 31 03:46
temp\u3.bd\u0\u16175417\u4817__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep 5 19:44
temp\u3.bd\u0\u16175417\u6252__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep 6 14:44
temp\u3.bd\u0\u16175417\u6593__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep 7 10:21
temp\u3.bd\u0\u16175417\u6870__head_000000BD__fffffffffffffffb
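For anyone running into the same thing: a rough way to spot these
leftovers on all OSDs at once. This is just a sketch, assuming the
default FileStore layout under /var/lib/ceph/osd/ceph-*; do verify the
matches before deleting anything:

  # list leftover temp objects; '\\u' matches the literal \u in the names
  find /var/lib/ceph/osd/ceph-*/current -name 'temp\\u*' -ls
  # only after checking the output, remove them:
  find /var/lib/ceph/osd/ceph-*/current -name 'temp\\u*' -delete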
Restarting the OSDs in question didn't seem necessary; once those files
were gone, backfilling started to work and at some point enough
replicas were available for each PG.
Finally, deep scrubbing repaired the inconsistent PGs automagically and
we arrived at HEALTH_OK again!
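We let the scheduled scrubs do the work, but they can be nudged along
as well; e.g. for the PG from the listing above (3.bd):

  ceph pg deep-scrub 3.bd
  ceph pg repair 3.bd

Note that repair in these older releases tends to trust the primary's
copy, so only force it when you're reasonably sure the primary is good.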
Case closed: on to Jewel!
To everyone involved: a big, big and even bigger thank you for all the
pointers and support!
Regards,
Kees
On 10-09-18 16:43, Kees Meijs wrote:
A little update: meanwhile we added a new node consisting of Hammer OSDs
to ensure sufficient cluster capacity.
The upgraded node with Infernalis OSDs has been removed from the CRUSH
map completely and its OSDs deleted (obviously we haven't wiped the
disks yet).
At the moment we're still running with the flags noout, nobackfill,
noscrub and nodeep-scrub set. Although only Hammer OSDs remain now, we
still experience OSD crashes on backfilling, so we're unable to reach
HEALTH_OK.
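For reference, these are the usual monitor flags; they were set with
the commands below and will be cleared the same way once we're
confident again:

  ceph osd set noout
  ceph osd set nobackfill
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # and later, for each flag:
  ceph osd unset noout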
At debug level 20 we're (well, mostly my coworker Willem Jan is)
trying to figure out exactly why the crashes happen. Hopefully we'll
get to the bottom of it.
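For the archives: raising the log level can be done on a live OSD
without restarting it. A sketch, with osd.12 standing in for whichever
OSD is affected:

  ceph tell osd.12 injectargs '--debug-osd 20 --debug-filestore 20'

or persistently via "debug osd = 20" in the [osd] section of
ceph.conf, which also survives the crash/restart cycle.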
To be continued...