Re: osd dies with m_filestore_fail_eio without dmesg error

Ronny Aasen <ronny+ceph-users@xxxxxxxx> · Tue, 6 Sep 2016 14:45:05 +0200

On 06. sep. 2016 00:58, Brad Hubbard wrote:
On Mon, Sep 05, 2016 at 12:54:40PM +0200, Ronny Aasen wrote:
> Hello
>
> I have an OSD that regularly dies on I/O, especially during scrubbing.
> Normally I would assume a bad disk and replace it, but then I normally see
> messages in dmesg about the device and its errors. For this OSD
> there are no errors in dmesg at all after a crash like this.
>
> This OSD is a 5-disk software RAID5 array, and it has had broken disks in
> the past that have been replaced and the parity recalculated. It runs XFS with
> a journal on an SSD partition.
>
>
> I can start the OSD again and it works for a while (several days) before it
> crashes again.
> Could one of you look at the log for this OSD and see if there is any way to
> salvage it?
>
> And is there any information I should gather before I scratch the filesystem
> and recreate it? Perhaps there is some valuable insight into what's going
> on.
>
> kind regards
> Ronny Aasen
>
>
>     -1> 2016-09-05 12:09:28.185977 7eff0dbb9700  1 -- 10.24.12.22:6806/7970
> --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05
> 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
>      0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time
> 2016-09-05 12:09:27.988279
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio ||
> got != -5)

Error 5 is EIO ("I/O error"), so the OSD is receiving an I/O error (got == -5
in the assert) when it attempts to read the file. According to this code [1],
if you reproduce the error with "debug_filestore = 10" you should be able to
retrieve the object ID, find the object on disk for inspection, and compare it
to the other replicas.

[1] https://github.com/ceph/ceph/blob/hammer/src/os/FileStore.cc#L2852
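
As a small illustration of the point above, the following Python sketch checks
that errno 5 is EIO and restates the condition from the failed assert. It is
not taken from the Ceph source: the function name read_assert_holds and its
argument order are hypothetical, and only the three terms allow_eio,
m_filestore_fail_eio and got come from the assert message in the log.

```python
import errno
import os


def read_assert_holds(got, allow_eio, m_filestore_fail_eio):
    """Restate the logged condition:
    allow_eio || !m_filestore_fail_eio || got != -5

    Returns True when the assert would pass and False when it would fire.
    (Hypothetical helper, written only to illustrate the condition.)
    """
    return allow_eio or (not m_filestore_fail_eio) or got != -errno.EIO


# errno 5 is EIO; the description string comes from the local C library.
print(errno.EIO, os.strerror(errno.EIO))

# The crash case: the read returned -EIO, the caller did not allow EIO,
# and the fail-on-EIO setting was enabled.
print(read_assert_holds(-errno.EIO, allow_eio=False, m_filestore_fail_eio=True))
```

Under these assumptions the assert can only fire when all three things are
true at once: the read returned -5, the caller did not pass allow_eio, and
m_filestore_fail_eio is enabled. Any other error code, or a successful read,
passes the check.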

-- Cheers, Brad

Thanks.
I have added debug_filestore=10 to this OSD and can see a lot more in the
logs. I am going to leave it running until it crashes the next time;
hopefully it will then have some more details.

kind regards
Ronny Aasen