Unexpected OSD down during deep-scrub

Hello everyone,

I have a cluster with 5 hosts and 18 OSDs. Today I ran into an unexpected issue where multiple OSDs went down.

The first OSD to go down was osd.8. A few minutes later, another OSD on the same host, osd.1, went down as well. I tried restarting both OSDs (osd.8 and osd.1), but that didn't work, so I decided to mark them out of the cluster and wait for recovery to complete.

During the recovery, two more OSDs went down: osd.6 on another host and, seconds later, osd.0 on the same host as the first OSD that went down.

Looking at the “ceph -w” output, I noticed some slow/stuck ops, so I decided to stop writes to the cluster. After that I restarted OSDs 0 and 6; both came back up, and I was able to wait for the recovery to finish, which it did successfully.

I noticed that the cluster was performing a deep-scrub when the first OSD went down, and I found the trace below in the osd.8 logs. Can anyone help me understand why osd.8, and the other OSDs, went down unexpectedly?

The osd.8 trace follows:

    -2> 2015-03-03 16:31:48.191796 7f91a388b700  5 -- op tracker -- seq: 2633606, time: 2015-03-03 16:31:48.191796, event: done, op: osd_op(client.3880912.0:2368430 notify.6 [watch ping cookie 140352686583296] 40.97c520d4 ack+write+known_if_redirected e4231)
    -1> 2015-03-03 16:31:48.192174 7f91af8a3700  1 -- 10.32.30.11:6804/3991 <== client.3880912 10.32.30.10:0/1001424 282597 ==== ping magic: 0 v1 ==== 0+0+0 (0 0 0) 0x3333f500 con 0x1535c580
     0> 2015-03-03 16:31:48.251131 7f91a0084700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)' thread 7f91a0084700 time 2015-03-03 16:31:48.169895
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())

 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2]
 2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
 3: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x9698ba]
 4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
 5: (PG::scrub_compare_maps()+0x511) [0x90f0d1]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x204) [0x910bb4]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x912c53]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7ebdd3]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcbade9]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xcbbfe0]
 11: (()+0x6b50) [0x7f91bfe46b50]
 12: (clone()+0x6d) [0x7f91be8627bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
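
In case it helps anyone triaging a similar crash, here is a minimal Python sketch that pulls the failed assertion and the backtrace symbols out of an OSD log excerpt shaped like the one above. The function name `summarize_crash` and both regular expressions are assumptions based only on the format of this trace; they are not part of any Ceph tooling and may need adjusting for other versions.

```python
import re

# Matches e.g. "osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())"
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)\s*$"
)
# Matches e.g. " 4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]"
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+): \((?P<sym>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$"
)


def summarize_crash(log_text):
    """Return (assertion, frames) found in a log excerpt.

    assertion is (file, line, condition) or None; frames is a list of
    symbol strings in the order they appear in the backtrace.
    """
    assertion = None
    frames = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw)
        if m and assertion is None:
            assertion = (m.group("file"), int(m.group("line")), m.group("cond"))
            continue
        m = FRAME_RE.match(raw)
        if m:
            frames.append(m.group("sym"))
    return assertion, frames


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
osd/ReplicatedPG.cc: 7494: FAILED assert(!i->mod_desc.empty())
 2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc]
 4: (ReplicatedPG::_scrub(ScrubMap&)+0x2e62) [0x99b072]
 11: (()+0x6b50) [0x7f91bfe46b50]
"""

if __name__ == "__main__":
    assertion, frames = summarize_crash(SAMPLE)
    print(assertion)
    for sym in frames:
        print(sym)
```

Run against the trace above, this reports the assertion at osd/ReplicatedPG.cc line 7494 and lists `ReplicatedPG::_scrub` among the frames, which is what ties the crash to the scrub path.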

Regards,

Italo Santos
http://italosantos.com.br/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
