I found a hardware error on the osd server the day before:

Dec 10 05:40:20 zstore kernel: EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

Could it affect the replication process?

2012-12-11 00:15:17.705096 7f22b27f4700 0 log [ERR] : 4.6 osd.0: soid fe0ab176/seodo1.rbd/head//4 size 0 != known size 112
2012-12-11 00:15:17.705100 7f22b27f4700 0 log [ERR] : 4.6 scrub 0 missing, 1 inconsistent objects
2012-12-11 00:15:17.706169 7f22b27f4700 0 log [ERR] : scrub 4.6 fe0ab176/seodo1.rbd/head//4 on disk size (112) does not match object info size (0)
2012-12-11 00:15:17.706452 7f22b27f4700 0 log [ERR] : 4.6 scrub 1 errors
2012-12-11 00:21:58.214974 7f23a5ffb700 0 log [ERR] : 3.5 scrub stat mismatch, got 21841/21839 objects, 199/199 clones, 90932097984/90932097760 bytes.
2012-12-11 00:21:58.214993 7f23a5ffb700 0 log [ERR] : 3.5 scrub 1 errors

2012/12/11 Vladislav Gorbunov <vadikgo@xxxxxxxxx>:
> Looks like the header object on the broken images is empty.
>
> root@bender:~# rados -p iscsi stat seodo1.rbd
> iscsi/seodo1.rbd mtime 1354795057, size 0
>
> root@bender:~# rados -p iscsi stat siri.rbd
> iscsi/siri.rbd mtime 1355151093, size 0
>
> On an accessible image the header size is not empty:
> root@bender:~# rados -p iscsi stat siri1.rbd
> iscsi/siri1.rbd mtime 1355174156, size 112
>
> and the header can't be saved:
> root@bender:~# rados -p iscsi get seodo1.rbd seodo1.header
> 2012-12-11 11:34:06.044164 7fe732f52780 0 wrote 0 byte payload to seodo1.header
>
> Before this header became unreadable, a new osd server was added and
> the cluster was rebalanced. One of the mon servers (mon.0) crashed,
> and I restarted it.
>
> 2012/12/11 Josh Durgin <josh.durgin@xxxxxxxxxxx>:
>> On 12/10/2012 01:54 PM, Vladislav Gorbunov wrote:
>>>
>>> but access to iscsi/seodo1 and iscsi/siri1 fails on every rbd client
>>> host. The data is completely inaccessible.
>>>
>>> root@bender:~# rbd info iscsi/seodo1
>>> *** Caught signal (Segmentation fault) **
>>>  in thread 7fb8c93f5780
>>>  ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
>>>  1: rbd() [0x41dfea]
>>>  2: (()+0xfcb0) [0x7fb8c796fcb0]
>>>  3: (()+0x16244d) [0x7fb8c6ae444d]
>>>  4: (librbd::read_header_bl(librados::IoCtx&, std::string const&, ceph::buffer::list&, unsigned long*)+0xf9) [0x7fb8c8fadb99]
>>>  5: (librbd::read_header(librados::IoCtx&, std::string const&, rbd_obj_header_ondisk*, unsigned long*)+0x82) [0x7fb8c8fadda2]
>>>  6: (librbd::ictx_refresh(librbd::ImageCtx*)+0x90b) [0x7fb8c8fb05eb]
>>>  7: (librbd::open_image(librbd::ImageCtx*)+0x1b5) [0x7fb8c8fb1165]
>>>  8: (librbd::RBD::open(librados::IoCtx&, librbd::Image&, char const*, char const*)+0x5f) [0x7fb8c8fb16af]
>>>  9: (main()+0x73c) [0x41721c]
>>>  10: (__libc_start_main()+0xed) [0x7fb8c69a376d]
>>>  11: rbd() [0x41a0c9]
>>> 2012-12-11 09:33:14.264755 7fb8c93f5780 -1 *** Caught signal (Segmentation fault) **
>>>  in thread 7fb8c93f5780
>>
>> It sounds like the header object (which rbd uses to determine the
>> prefix for data object names) is corrupted or otherwise inaccessible.
>>
>> Could you save the header object to a file ('rados -p iscsi get seodo1.rbd')
>> and put that file somewhere accessible?
>>
>> Did anything happen to your cluster before this header became
>> unreadable? Any disk problems, or osds crashing?
>>
>> Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
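[Editor's note] Once a header object has been dumped with 'rados get' as Josh asks, it can be sanity-checked offline. The sketch below follows the fixed part of the format-1 on-disk header (struct rbd_obj_header_ondisk in ceph's rbd_types.h: 40-byte magic text, 24-byte data-object name prefix, 4-byte "RBD" signature, 8-byte version, then options/size/snapshot fields, 112 bytes total, which matches the "known size 112" in the scrub log above). It is an illustrative checker, not a ceph tool, and it shows why the 0-byte seodo1.rbd object breaks librbd: an empty header yields no data-object prefix at all.

```python
import struct

# Constants from the rbd format-1 on-disk header (rbd_types.h).
RBD_HEADER_TEXT = b"<<< Rados Block Device Image >>>\n"
RBD_HEADER_SIGNATURE = b"RBD"
RBD_HEADER_SIZE = 112  # fixed part of struct rbd_obj_header_ondisk

def check_rbd_v1_header(data: bytes) -> str:
    """Classify a dumped format-1 rbd header object
    (e.g. the output file of 'rados -p iscsi get seodo1.rbd seodo1.header')."""
    if len(data) == 0:
        # The seodo1.rbd / siri.rbd case in this thread: size 0.
        return "empty"
    if len(data) < RBD_HEADER_SIZE:
        return "truncated"
    # First three fields: magic text, data-object name prefix, signature.
    text, block_name, signature = struct.unpack_from("<40s24s4s", data, 0)
    if not text.startswith(RBD_HEADER_TEXT) or not signature.startswith(RBD_HEADER_SIGNATURE):
        return "corrupt"
    return "ok, data object prefix: %s" % block_name.rstrip(b"\0").decode()

if __name__ == "__main__":
    # Synthetic well-formed header (the prefix "rb.0.1234.5678" is made up).
    hdr = (RBD_HEADER_TEXT.ljust(40, b"\0")
           + b"rb.0.1234.5678".ljust(24, b"\0")
           + b"RBD\0" + b"001\0\0\0\0\0"
           + b"\0" * (RBD_HEADER_SIZE - 76))
    print(check_rbd_v1_header(b""))   # the broken images in this thread
    print(check_rbd_v1_header(hdr))
```

An "empty" result explains the crash path in the trace above (read_header_bl is handed a zero-length buffer); the data objects themselves may still exist under the lost prefix, which is why recovering or reconstructing the 112-byte header is the interesting repair step.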