Re: osd dies with m_filestore_fail_eio without dmesg error

Ronny Aasen <ronny+ceph-users@xxxxxxxx> · Tue, 6 Sep 2016 18:37:59 +0200

On 06.09.2016 14:45, Ronny Aasen wrote:
On 06. sep. 2016 00:58, Brad Hubbard wrote:
On Mon, Sep 05, 2016 at 12:54:40PM +0200, Ronny Aasen wrote:
> Hello
>
> I have an OSD that regularly dies on I/O, especially during scrubbing.
> Normally I would assume a bad disk and replace it, but then I normally see
> messages in dmesg about the device and its errors. For this OSD
> there are no errors in dmesg at all after a crash like this.
>
> This OSD is a 5-disk software RAID5 array, and it has had broken disks in
> the past that were replaced and had parity recalculated. It runs XFS with
> the journal on an SSD partition.
>
>
> I can start the OSD again and it works for a while (several days) before it
> crashes again.
> Could one of you look at the log for this OSD and see if there is any way to
> salvage it?
>
> And is there any information I should gather before I scratch the filesystem
> and recreate it? Perhaps there is some valuable insight into what's going
> on.
>
> kind regards
> Ronny Aasen
>
>
>     -1> 2016-09-05 12:09:28.185977 7eff0dbb9700  1 -- 10.24.12.22:6806/7970
> --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05
> 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
>      0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time
> 2016-09-05 12:09:27.988279
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio ||
> got != -5)

Error 5 is EIO ("I/O error"), so the OSD is receiving an I/O error when it
attempts to read the file. According to this code [1], if you reproduce the
error with "debug_filestore = 10" you should be able to retrieve the object
ID and find it on disk for inspection and comparison to the other replicas.

[1] https://github.com/ceph/ceph/blob/hammer/src/os/FileStore.cc#L2852
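
As a reading aid, the assert condition quoted in the crash log can be restated as a small boolean function. This is only a sketch in Python, not the Ceph source: the variable names are taken from the assert text, while `read_assert_holds` is a hypothetical name and the meaning of `got` (the read's return value) is inferred from the surrounding discussion.

```python
import errno
import os


def read_assert_holds(got: int, allow_eio: bool, m_filestore_fail_eio: bool) -> bool:
    """Restate: assert(allow_eio || !m_filestore_fail_eio || got != -5)."""
    # The assert only fails when all three sub-conditions are false, i.e. the
    # caller does not tolerate EIO, fail-on-EIO is enabled, and the read
    # returned -EIO.
    return allow_eio or (not m_filestore_fail_eio) or got != -errno.EIO


# Error number 5 is EIO; print the platform's message for it.
print(errno.EIO, os.strerror(errno.EIO))

# The one combination that trips the assert:
print(read_assert_holds(got=-5, allow_eio=False, m_filestore_fail_eio=True))
```

Every other combination of inputs leaves the condition true, which is consistent with the OSD running for days until a read happens to hit the bad object.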

-- Cheers, Brad

Thanks.
I have added debug_filestore=10 to this OSD and can see a lot more in
the logs. I am going to leave it running until it crashes the next time;
hopefully it will then have some more details.

kind regards
Ronny Aasen

After a day's run the OSD crashed again.

  -37> 2016-09-06 18:12:07.690091 7f201ddf0700 10 
filestore(/var/lib/ceph/osd/ceph-106) 
FileStore::read(1.30b_head/1/38a7e30b/rbd_data.545f06238e1f29.0000000000016f21/head) 
pread error: (5) Input/output error
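
For a larger log, lines like the one above could be picked out with a short script. The following is a minimal sketch that assumes the "pread error" line always has the shape shown here (joined onto one line, as it appears in the actual log file); the regular expression and the name `find_pread_errors` are illustrative, not part of Ceph.

```python
import re

# Pattern modeled on the single log line shown above; other Ceph versions
# may format this message differently.
PREAD_ERROR = re.compile(
    r"filestore\((?P<osd_path>[^)]+)\)\s+"
    r"FileStore::read\((?P<object>[^)]+)\)\s+"
    r"pread error: \((?P<errno>\d+)\)"
)


def find_pread_errors(lines):
    """Yield (osd_path, object_spec, errno) for every matching log line."""
    for line in lines:
        match = PREAD_ERROR.search(line)
        if match:
            yield (
                match.group("osd_path"),
                match.group("object"),
                int(match.group("errno")),
            )


sample = (
    "  -37> 2016-09-06 18:12:07.690091 7f201ddf0700 10 "
    "filestore(/var/lib/ceph/osd/ceph-106) "
    "FileStore::read(1.30b_head/1/38a7e30b/"
    "rbd_data.545f06238e1f29.0000000000016f21/head) "
    "pread error: (5) Input/output error"
)

for hit in find_pread_errors([sample]):
    print(hit)
```

The object string is returned as logged; mapping it to a file name under the OSD's data directory is left to the reader, since the on-disk name is not shown in this thread.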

Trying to read the object manually also gave an I/O error, so I rm'd the
object and let Ceph recreate it. Deep scrubbing should eventually
locate all such issues on this OSD.
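
The manual read check described above could be scripted roughly as follows. This is a sketch under assumptions: it takes the path of a file to test (the real on-disk path of the object is not given in this thread), reads it chunk by chunk with `os.pread`, and records the offsets where the kernel returns EIO. `scan_for_eio` is a hypothetical helper name.

```python
import errno
import os


def scan_for_eio(path, chunk_size=4096):
    """Read `path` in chunks; return (bytes_read_ok, offsets_failing_with_EIO)."""
    bad_offsets = []
    ok_bytes = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, min(chunk_size, size - offset), offset)
                ok_bytes += len(data)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # only EIO is of interest here
                bad_offsets.append(offset)
            offset += chunk_size  # move on even after a failed chunk
    finally:
        os.close(fd)
    return ok_bytes, bad_offsets
```

A non-empty second element would point at unreadable regions of that file even when dmesg stays silent, which matches what was seen on this OSD.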

Thanks for the support. :)

I do feel, though, that it is a bit drastic to crash the OSD on a single
corrupt file. It could have mv'd the file to a
"pg_head/corrupted/../../../.." directory for safekeeping and copied a
working object from one of the replicas. And if there were objects in an
OSD's corrupted directory, it could show a warning in Ceph's status so the
admin can inspect potentially failing drives.
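
To make the suggestion concrete, here is a rough sketch of the quarantine idea only. It is not how Ceph behaves, it leaves out the copy from a replica, and the flat "corrupted" directory, the function names, and the warning text are all assumptions made for illustration.

```python
import os


def quarantine(object_path, pg_dir):
    """Move an unreadable object file into <pg_dir>/corrupted; return the new path."""
    corrupted_dir = os.path.join(pg_dir, "corrupted")
    os.makedirs(corrupted_dir, exist_ok=True)
    target = os.path.join(corrupted_dir, os.path.basename(object_path))
    # A rename within one filesystem does not read the file's data, so it
    # should not hit the same I/O error as a read would.
    os.rename(object_path, target)
    return target


def health_warning(pg_dirs):
    """Return a warning string if any quarantine directory has entries, else None."""
    count = 0
    for pg_dir in pg_dirs:
        corrupted_dir = os.path.join(pg_dir, "corrupted")
        if os.path.isdir(corrupted_dir):
            count += len(os.listdir(corrupted_dir))
    if count:
        return f"WARN: {count} quarantined object(s); inspect the drives"
    return None
```

The point of the design is that one bad object becomes a visible warning rather than a daemon crash, while the damaged file is kept for later inspection.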

kind regards
Ronny Aasen
