How did you retrieve which osd number to restart? Just for future reference, in case I run into a similar situation. If a client hangs on an osd node, can this be resolved by restarting the osd that it is reading from?

-----Original Message-----
From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
Sent: Thursday, 2 May 2019 8:51
To: Yan, Zheng
Cc: ceph-users; pablo.llopis@xxxxxxx
Subject: Re: co-located cephfs client deadlock

On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > clients + osd on same machines) for a while. Over the weekend (for
> > the first time) we had one cephfs mount deadlock while some clients
> > were running ior.
> >
> > All the ior processes are stuck in D state with this stack:
> >
> > [<ffffffffafdb53a3>] wait_on_page_bit+0x83/0xa0
> > [<ffffffffafdb54d1>] __filemap_fdatawait_range+0x111/0x190
> > [<ffffffffafdb5564>] filemap_fdatawait_range+0x14/0x30
> > [<ffffffffafdb79e6>] filemap_write_and_wait_range+0x56/0x90
> > [<ffffffffc0f11575>] ceph_fsync+0x55/0x420 [ceph]
> > [<ffffffffafe76247>] do_fsync+0x67/0xb0
> > [<ffffffffafe76530>] SyS_fsync+0x10/0x20
> > [<ffffffffb0372d5b>] system_call_fastpath+0x22/0x27
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
>
> are there hung osd requests in /sys/kernel/debug/ceph/xxx/osdc?

We never managed to reproduce this on that cluster. But on a separate (not co-located) cluster we had a similar issue. A client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02 failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec

Indeed there was a hung write on hpc070.cern.ch:

245540  osd100  1.9443e2a5  1.2a5  [100,1,75]/100  [100,1,75]/100  e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001  0x400024  1  write

I restarted osd.100 and the deadlocked request went away.

Does this sound like a known issue?

Thanks, Dan
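
For future reference, regarding the question at the top of the thread: the osd number can be read from the hung entry in the kernel client's osdc debug file, "osd100" in the output above. A rough sketch of the procedure, assuming debugfs is mounted at /sys/kernel/debug and the standard ceph-osd@ systemd units; exact paths and unit names may differ on your deployment:

    # on the client with the hung mount, list in-flight OSD requests
    cat /sys/kernel/debug/ceph/*/osdc

    # each line looks roughly like:
    #   <tid>  osd<N>  <pgid>  ...  <object name>  ...  <op>
    # an entry that stays there across repeated reads is the stuck request,
    # and osd<N> is the OSD to look at (osd100 in the output above)

    # then, on the host carrying that OSD, restart it, e.g. for osd.100:
    systemctl restart ceph-osd@100

Restarting the OSD presumably clears the hang because the client resends its outstanding requests once the OSD comes back and the PG re-peers, instead of waiting indefinitely on the wedged op.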