Re: co-located cephfs client deadlock

On the stuck client:

  cat /sys/kernel/debug/ceph/*/osdc

REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
REQUESTS 1 homeless 0
245540 osd100 1.9443e2a5 1.2a5 [100,1,75]/100 [100,1,75]/100 e74658 fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001 0x400024 1 write
LINGER REQUESTS
BACKOFFS

osd.100 is clearly there ^^
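
For future reference, a quick way to see which OSDs a stuck mount is waiting
on is to walk those osdc files programmatically. Below is a minimal Python
sketch (not an official Ceph tool); it assumes root access, debugfs mounted
at /sys/kernel/debug, and the osdc layout shown above (tid, osdN, pg, acting
sets, epoch, object, flags) -- the parsing is illustrative only.

#!/usr/bin/env python3
# Minimal sketch: summarize in-flight OSD requests per kernel-client mount
# by parsing /sys/kernel/debug/ceph/*/osdc. Assumes root access, debugfs
# mounted at /sys/kernel/debug, and the request layout shown above.
import glob
import re

def osdc_requests(path):
    """Yield (tid, osd_id, rest-of-line) for each entry in the REQUESTS section."""
    in_requests = False
    with open(path) as f:
        for line in f:
            if line.startswith("REQUESTS"):
                in_requests = True
                continue
            if line.startswith(("LINGER", "BACKOFFS")):
                in_requests = False
                continue
            fields = line.split()
            # Request lines look like: "<tid> osd<N> <pg> ... <object> <flags> ..."
            if in_requests and len(fields) >= 2:
                m = re.match(r"osd(\d+)$", fields[1])
                if m:
                    yield fields[0], int(m.group(1)), " ".join(fields[2:])

for path in glob.glob("/sys/kernel/debug/ceph/*/osdc"):
    reqs = list(osdc_requests(path))
    print(f"{path}: {len(reqs)} in-flight request(s)")
    for tid, osd, rest in reqs:
        print(f"  tid {tid} -> osd.{osd}  {rest}")

Any mount that keeps showing a non-zero count pinned to a single OSD is the
one to look at (osd.100 in this case).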

-- dan

On Thu, May 2, 2019 at 9:25 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
>
>
> How did you retrieve which OSD number to restart?
>
> Just for future reference, in case I run into a similar situation: if a
> client hangs on an OSD node, can this be resolved by restarting the OSD
> it is reading from?
>
>
>
>
> -----Original Message-----
> From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
> Sent: Thursday, May 2, 2019 8:51
> To: Yan, Zheng
> Cc: ceph-users; pablo.llopis@xxxxxxx
> Subject: Re:  co-located cephfs client deadlock
>
> On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
> > >
> > > Hi all,
> > >
> > > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > > clients + osd on same machines) for a while. Over the weekend (for
> > > the first time) we had one cephfs mount deadlock while some clients
> > > were running ior.
> > >
> > > All the ior processes are stuck in D state with this stack:
> > >
> > > [<ffffffffafdb53a3>] wait_on_page_bit+0x83/0xa0
> > > [<ffffffffafdb54d1>] __filemap_fdatawait_range+0x111/0x190
> > > [<ffffffffafdb5564>] filemap_fdatawait_range+0x14/0x30
> > > [<ffffffffafdb79e6>] filemap_write_and_wait_range+0x56/0x90
> > > [<ffffffffc0f11575>] ceph_fsync+0x55/0x420 [ceph]
> > > [<ffffffffafe76247>] do_fsync+0x67/0xb0
> > > [<ffffffffafe76530>] SyS_fsync+0x10/0x20
> > > [<ffffffffb0372d5b>] system_call_fastpath+0x22/0x27
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> >
> > are there hung osd requests in /sys/kernel/debug/ceph/xxx/osdc?
>
> We never managed to reproduce on this cluster.
>
> But on a separate (not co-located) cluster we had a similar issue. A
> client was stuck like this for several hours:
>
> HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
> MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
>     mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02 failing to respond to capability release client_id: 69092525
> MDS_SLOW_REQUEST 1 MDSs report slow requests
>     mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec
>
>
> Indeed there was a hung write on hpc070.cern.ch:
>
> 245540  osd100  1.9443e2a5 1.2a5  [100,1,75]/100  [100,1,75]/100  e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001  0x400024  1  write
>
> I restarted osd.100 and the deadlocked request went away.
> Does this sound like a known issue?
>
> Thanks, Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


