Re: Simultaneous CEPH OSD crashes

On 27/09/2015 09:15, Lionel Bouton wrote:
> Hi,
>
> we just had quasi-simultaneous crashes on two different OSDs, which
> blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.
>
> The first OSD to go down had this error:
>
> 2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27
> 06:30:33.145251
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> The second OSD crash was similar:
>
> 2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
> 06:30:57.260978
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> I'm familiar with this error: it has happened before with a BTRFS read
> error (invalid csum) and I could correct it by flushing the journal,
> deleting the corrupted file, restarting the OSD and running a pg
> repair. This time, however, there isn't any kernel log entry
> indicating an invalid csum. The kernel is different, though: we use
> 3.18.9 on these two servers while the others have 4.0.5, so maybe
> BTRFS doesn't log invalid checksum errors with this version. I've
> launched a btrfs scrub on the 2 filesystems just in case (still
> waiting for completion).
>
> The first attempt to restart these OSDs failed: one OSD died 19
> seconds after start, the other after 21 seconds. Seeing that, I
> temporarily lowered min_size to 1, which allowed the 9 incomplete PGs
> to recover. I verified this by bringing min_size back to 2, then
> restarted the 2 OSDs. They haven't crashed yet.
>
> For reference, the assert failures were still the same when the OSDs
> died shortly after start:
> 2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27
> 08:20:19.325126
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> 2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27
> 08:20:50.605234
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG
> involving one (and only one) of these 2 OSDs. As we evenly space
> deep-scrubs (currently with a 10-minute interval), this might be
> relevant (or just a coincidence).
>
> I made copies of the ceph-osd logs (including the stack traces and the
> recent events) if needed.
>
> Can anyone shed some light on why these OSDs died?
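
For reference, the recovery sequence mentioned above for the earlier
csum incident was roughly the following. This is only a sketch: the OSD
id, object path and PG id are placeholders, the OSD must be stopped
first and restarted afterwards through whatever init system you use
(left out here).

#!/usr/bin/env python
# Rough sketch of the earlier csum-error recovery (placeholders only).
import subprocess

osd_id = "12"                                         # placeholder OSD id
bad_object = "/var/lib/ceph/osd/ceph-12/current/..."  # file flagged by the kernel csum error
pg_id = "3.7f"                                        # placeholder PG id

# With the OSD stopped, flush its journal so the filestore is
# consistent before touching it.
subprocess.check_call(["ceph-osd", "-i", osd_id, "--flush-journal"])
# Remove the corrupted object file reported by the kernel.
subprocess.check_call(["rm", "-f", bad_object])
# (restart the OSD here, init-system specific)
# Ask Ceph to repair the PG from the remaining replicas.
subprocess.check_call(["ceph", "pg", "repair", pg_id])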

I just had a thought. Could launching a defragmentation on a file in a
BTRFS OSD filestore trigger this problem?
We have a process doing just that. It waits until there has been no
recent access before queueing files for defragmentation, but there's no
guarantee that it won't defragment a file an OSD is about to use.
This might explain the nearly simultaneous crashes, as the
defragmentation is triggered by write access patterns which should be
roughly the same on all 3 OSDs hosting a copy of the file. The
defragmentations don't run at exactly the same time because they are
queued, which could explain why we got 2 crashes instead of 3.
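
To make that concrete, our scheduler conceptually does something like
the sketch below. The real one is more elaborate (it watches write
access patterns rather than plain mtime), the idle threshold here is
made up, and it ends up shelling out to btrfs filesystem defragment:

#!/usr/bin/env python
# Conceptual sketch of our defragmentation queue (not the real scheduler).
import os, time, subprocess

IDLE_SECONDS = 300   # made-up "no recent access" threshold
queue = []           # files waiting to be defragmented

def maybe_queue(path):
    # Queue a file once it has seen no write for IDLE_SECONDS.
    try:
        age = time.time() - os.stat(path).st_mtime
    except OSError:
        return
    if age > IDLE_SECONDS and path not in queue:
        queue.append(path)

def defragment_next():
    # Defragment the oldest queued file. Nothing serializes this with
    # OSD reads of the same file, which is the suspected race.
    if queue:
        subprocess.call(["btrfs", "filesystem", "defragment", queue.pop(0)])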

I'll probably ask on linux-btrfs, but knowing the possible conditions
leading to this assert failure would help pinpoint the problem. If I
read the assert right, got == -5 is -EIO, so the FileStore read
apparently got an I/O error back from the filesystem and
filestore_fail_eio made the OSD abort rather than pass it on. So if
someone knows this code well enough without knowing how BTRFS behaves
while defragmenting, I'll bridge the gap.

I just activated autodefrag for all BTRFS filesystems on one of the two
affected servers and disabled our own defragmentation process. With
recent tunings we might not need our own defragmentation scheduler
anymore, and we can afford to lose some performance while investigating
this.
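
In case it's useful, this is roughly how I checked that the switch took
effect; the mount point is a placeholder and I'm assuming autodefrag
shows up in the options listed in /proc/mounts once it is active:

#!/usr/bin/env python
# Quick check that a BTRFS mount has autodefrag active (placeholder path).
mountpoint = "/var/lib/ceph/osd/ceph-12"   # placeholder

with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mnt, fstype, options = line.split()[:4]
        if mnt == mountpoint and fstype == "btrfs":
            has_it = "autodefrag" in options.split(",")
            print("%s: %s" % (mnt, "autodefrag" if has_it else "no autodefrag"))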

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



