Re: corrupted rbd filesystems since jewel

On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> 3.) it still happens on pre-jewel images even when they get restarted /
> killed and reinitialized. In that case they have the asok socket available
> for now. Should I issue any command to the socket to get a log out of the
> hanging VM? QEMU is still responding; just the Ceph / disk I/O gets stalled.

The best option would be to run "gcore" against the running VM whose
IO is stuck, compress the dump, and use "ceph-post-file" to upload it.
I could then look at all the Ceph data structures to hopefully find
the issue.
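
For example, something along these lines (a sketch only; the PID and
file names are placeholders you would substitute for your system):

  # dump the QEMU process core; gcore writes /tmp/qemu-vm.core.<qemu-pid>
  gcore -o /tmp/qemu-vm.core <qemu-pid>
  # compress it, then upload it to the Ceph developers
  gzip /tmp/qemu-vm.core.<qemu-pid>
  ceph-post-file /tmp/qemu-vm.core.<qemu-pid>.gz

ceph-post-file prints an identifier on success; include that identifier
in your reply so the dump can be located.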

Enabling debug logs after the IO has become stuck will most likely be
of little value, since the logs won't include the details of which IOs
are outstanding. You could attempt to use "ceph --admin-daemon
/path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
stuck waiting on an OSD to respond.
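
For example, assuming the admin socket was configured somewhere under
/var/run/ceph (the actual asok path depends on the "admin socket"
setting in your ceph.conf):

  # locate the socket for the stuck VM, then query its in-flight requests
  ls /var/run/ceph/*.asok
  ceph --admin-daemon /var/run/ceph/<client-name>.<pid>.asok objecter_requests

Requests that persist in the output across repeated invocations would
suggest IOs waiting on an OSD that has not responded.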

-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


