Re: [ceph-users] Upgrade to Infernalis: OSDs crash all the time

Hi list,

Between crashes we were able to let the cluster backfill as much as possible (all monitors on Infernalis, OSDs back on Hammer).

Leftover PGs wouldn't backfill until we removed files such as:

8.0M -rw-r--r-- 1 root root 8.0M Aug 24 23:56 temp\u3.bd\u0\u16175417\u2718__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 28 05:51 temp\u3.bd\u0\u16175417\u3992__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 30 03:40 temp\u3.bd\u0\u16175417\u4521__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 31 03:46 temp\u3.bd\u0\u16175417\u4817__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  5 19:44 temp\u3.bd\u0\u16175417\u6252__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  6 14:44 temp\u3.bd\u0\u16175417\u6593__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  7 10:21 temp\u3.bd\u0\u16175417\u6870__head_000000BD__fffffffffffffffb

Restarting the given OSD didn't seem necessary; backfilling started to work and at some point enough replicas were available for each PG.
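
For anyone hitting the same thing, here is a minimal sketch of how such leftovers could be listed before removing anything by hand. It assumes that the stale objects are recognisable by the "temp\u" name prefix seen in the listing above and that the caller passes in the PG directory path. The function names are made up for illustration. It only lists files and deletes nothing.

```python
from pathlib import Path
from typing import List

# Assumption: leftover temp objects can be recognised by this name prefix,
# as in the listing above. Adjust it if your file names differ.
TEMP_PREFIX = "temp\\u"


def is_leftover_temp(name: str, prefix: str = TEMP_PREFIX) -> bool:
    """Return True if a file name looks like one of the leftover temp objects."""
    return name.startswith(prefix)


def find_leftover_temps(pg_dir: str) -> List[Path]:
    """Walk pg_dir recursively and return matching regular files, sorted.

    Nothing is removed; the result is meant for manual review.
    """
    return sorted(
        p for p in Path(pg_dir).rglob("*")
        if p.is_file() and is_leftover_temp(p.name)
    )
```

Calling find_leftover_temps() on the directory of an affected PG returns the candidate paths. Whether removing them is safe depends on the state of the cluster, so review the list first.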

Finally, deep scrubbing repaired the inconsistent PGs automagically and we arrived at HEALTH_OK again!

Case closed: up to Jewel.

For everyone involved: a big, big and even bigger thank you for all pointers and support!

Regards,
Kees

On 10-09-18 16:43, Kees Meijs wrote:
A little update: meanwhile we added a new node consisting of Hammer OSDs
to ensure sufficient cluster capacity.

The upgraded node with Infernalis OSDs has been completely removed from
the CRUSH map and its OSDs removed (obviously we haven't wiped the disks yet).

At the moment we're still running with the flags noout, nobackfill,
noscrub and nodeep-scrub set. Although only Hammer OSDs remain now, we
still experience OSD crashes during backfilling, so we're unable to
reach HEALTH_OK.
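
As an illustration of how the flags mentioned above are toggled, here is a small sketch that builds the corresponding "ceph osd set" and "ceph osd unset" commands. The helper names and the dry-run default are assumptions made for this example, not something taken from the thread. By default it only prints the commands and runs nothing.

```python
import subprocess
from typing import List, Sequence

# The four flags named in the message above.
FLAGS = ("noout", "nobackfill", "noscrub", "nodeep-scrub")


def flag_commands(action: str, flags: Sequence[str] = FLAGS) -> List[List[str]]:
    """Build one `ceph osd set|unset <flag>` argv list per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_flags(action: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in flag_commands(action):
        print(" ".join(argv))
        if not dry_run:
            # Requires the ceph CLI and admin credentials on this host.
            subprocess.run(argv, check=True)
```

For example, apply_flags("unset") prints the four unset commands, which could then be run one at a time once backfilling is considered safe again.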

Using debug level 20, we (mostly my coworker Willem Jan) are trying to
work out exactly why the crashes happen. Hopefully we'll find the cause.

To be continued...



