Simultaneous CEPH OSD crashes

Lionel Bouton <lionel+ceph@xxxxxxxxxxx> · Sun, 27 Sep 2015 09:15:03 +0200

Hi,

We just had a near-simultaneous crash of two different OSDs, which
blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.

The first OSD to go down logged this error:

2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27
06:30:33.145251
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)
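
For readers less familiar with this assert, here is a minimal Python
sketch of the boolean condition exactly as it appears in the log line
above. The function and parameter names are hypothetical; only the
expression itself comes from the log. On Linux -5 is -EIO, so the
assert can only fail when a read returns an I/O error, the caller did
not allow EIO, and m_filestore_fail_eio is set.

```python
import errno


def read_assert_holds(got: int, allow_eio: bool, filestore_fail_eio: bool) -> bool:
    """Mirror of the logged condition:
    allow_eio || !m_filestore_fail_eio || got != -5

    Names are illustrative; this is not the FileStore code itself.
    """
    return allow_eio or (not filestore_fail_eio) or got != -5


# -5 in the assert corresponds to -EIO.
assert errno.EIO == 5

# The only combination that trips the assert:
print(read_assert_holds(-errno.EIO, allow_eio=False, filestore_fail_eio=True))
# Any other read result passes, e.g. a successful read of 4096 bytes:
print(read_assert_holds(4096, allow_eio=False, filestore_fail_eio=True))
```

In other words, both crashes point to the underlying filesystem
returning EIO on a read, whether or not the kernel logged a reason.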

The second OSD crash was similar:

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

I'm familiar with this error: it already happened with a BTRFS read
error (invalid csum), and I could correct it by flushing the journal,
deleting the corrupted file, starting the OSD and running a pg repair.
This time, however, there is no kernel log entry indicating an invalid
csum. The kernel is different, though: we use 3.18.9 on these two
servers and the others had 4.0.5, so maybe BTRFS doesn't log invalid
checksum errors with this version. I've launched btrfs scrub on the 2
filesystems just in case (still waiting for completion).
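
The recovery sequence described above can be sketched as a dry run
that only prints the commands in order and executes nothing. The OSD
id, PG id and file path are hypothetical placeholders, and how the OSD
is started depends on the init system, so that step is left as a note.
This is an outline of the steps under those assumptions, not a
tested procedure.

```python
import shlex
from typing import List


def recovery_plan(osd_id: int, pg_id: str, corrupted_file: str) -> List[str]:
    """Return the recovery steps as printable shell lines (dry run only)."""
    commands = [
        # 1. Flush the journal of the stopped (crashed) OSD.
        ["ceph-osd", "-i", str(osd_id), "--flush-journal"],
        # 2. Delete the object file that returns the read error.
        ["rm", "--", corrupted_file],
    ]
    lines = [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
    # 3. Starting the OSD is init-system specific, so only a note is emitted.
    lines.append("# start osd.%d with your init system" % osd_id)
    # 4. Ask Ceph to repair the PG that held the deleted object.
    lines.append(" ".join(shlex.quote(a) for a in ["ceph", "pg", "repair", pg_id]))
    return lines


# Hypothetical values, for illustration only.
for line in recovery_plan(12, "3.1f", "/var/lib/ceph/osd/ceph-12/current/EXAMPLE"):
    print(line)
```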

The first attempt to restart these OSDs failed: one OSD died 19 seconds
after start, the other after 21 seconds. Seeing that, I temporarily
lowered min_size to 1, which allowed the 9 incomplete PGs to recover. I
verified this by setting min_size back to 2, and then restarted the 2
OSDs. They haven't crashed since.
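
To make the min_size reasoning explicit, here is a deliberately
simplified model: a PG serves I/O only while the number of available
replicas is at least min_size. The function name is hypothetical, and
real Ceph peering involves more state than this single comparison.

```python
def pg_accepts_io(size: int, min_size: int, replicas_down: int) -> bool:
    """Simplified rule: I/O is served while available replicas >= min_size."""
    available = size - replicas_down
    return available >= min_size


# size = 3, min_size = 2, both crashed OSDs hold a replica of the PG:
print(pg_accepts_io(size=3, min_size=2, replicas_down=2))  # blocked
# Same PG after temporarily lowering min_size to 1:
print(pg_accepts_io(size=3, min_size=1, replicas_down=2))  # served again
# A PG that lost only one replica was never blocked:
print(pg_accepts_io(size=3, min_size=2, replicas_down=1))
```

Under this model, only the PGs with replicas on both crashed OSDs were
affected, which is consistent with a small number of PGs blocking the
VMs.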

For reference, the assert failures were still the same when the OSDs
died shortly after start:
2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27
08:20:19.325126
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27
08:20:50.605234
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving
one (and only one) of these 2 OSDs. As we space deep-scrubs evenly
(currently at 10-minute intervals), this might be relevant (or just a
coincidence).
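
As a quick check of the timing, the sketch below computes how long
after the deep-scrub start each of the first two crashes was logged.
The timestamps are copied from this message; the deep-scrub time is
only given to the second, so a fractional part of .000000 is assumed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_after(start: str, event: str) -> float:
    """Seconds elapsed between two timestamps in the OSD log format."""
    delta = datetime.strptime(event, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Fractional seconds of the deep-scrub start are an assumption.
deep_scrub_start = "2015-09-27 06:30:11.000000"
first_crash = "2015-09-27 06:30:33.257133"
second_crash = "2015-09-27 06:30:57.373841"

print("first crash:  +%.3f s" % seconds_after(deep_scrub_start, first_crash))
print("second crash: +%.3f s" % seconds_after(deep_scrub_start, second_crash))
print("between crashes: %.3f s" % seconds_after(first_crash, second_crash))
```

Both crashes fall within the first minute after the deep-scrub began,
which does not prove a link but makes it worth checking.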

I made copies of the Ceph OSD logs (including the stack trace and the
recent events) in case they are needed.

Can anyone shed some light on why these OSDs died?

Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com