Hi,

we just had a quasi-simultaneous crash on two different OSDs which blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.

The first OSD to go down had this error:

2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27 06:30:33.145251
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

The second OSD crash was similar:

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27 06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

I'm familiar with this error: it happened before after a BTRFS read error (invalid csum), and I was able to correct it by flushing the journal, deleting the corrupted file, restarting the OSD and repairing the PG (see the commands in the P.S. below). This time, though, there is no kernel log indicating an invalid csum. The kernel is different, however: these two servers run 3.18.9 while the others run 4.0.5, so maybe BTRFS doesn't log invalid checksum errors with this version. I've launched a btrfs scrub on the two filesystems just in case (still waiting for completion).

The first attempt to restart these OSDs failed: one OSD died 19 seconds after start, the other after 21 seconds. Seeing that, I temporarily lowered min_size to 1, which allowed the 9 incomplete PGs to recover. I verified this by setting min_size back to 2, then restarted the two OSDs. They haven't crashed yet. For reference, the assert failures were still the same when the OSDs died shortly after start:

2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27 08:20:19.325126
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27 08:20:50.605234
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving one (and only one) of these two OSDs. As we evenly space deep-scrubs (currently with a 10-minute interval), this might be relevant (or just a coincidence).

I made copies of the OSD logs (including the stack traces and the recent events) if needed.

Can anyone shed some light on why these OSDs died?

Best regards,

Lionel Bouton
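P.S. In case it helps anyone hitting the same assert, the recovery steps described above map roughly to the following commands. Treat this as a sketch: the pool, PG and OSD ids are placeholders, the OSD data path assumes the default mount point, and how the OSD daemon itself is stopped/started depends on your init system.

  # Let I/O continue while only one replica is available (pools use size = 3)
  ceph osd pool set <pool> min_size 1

  # Earlier BTRFS invalid-csum incident: with the OSD stopped, flush its
  # journal, remove the corrupted object file reported by the kernel,
  # start the OSD again via your init system, then repair the affected PG
  ceph-osd -i <osd-id> --flush-journal
  ceph pg repair <pg-id>

  # Restore the normal setting once the incomplete PGs have recovered
  ceph osd pool set <pool> min_size 2

  # Check the OSD filesystems for silent corruption
  btrfs scrub start /var/lib/ceph/osd/ceph-<osd-id>
  btrfs scrub status /var/lib/ceph/osd/ceph-<osd-id>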