On the stuck client:

cat /sys/kernel/debug/ceph/*/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
REQUESTS 1 homeless 0
245540  osd100  1.9443e2a5  1.2a5  [100,1,75]/100  [100,1,75]/100  e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001  0x400024  1  write
LINGER REQUESTS
BACKOFFS

osd.100 is clearly there ^^

-- dan

On Thu, May 2, 2019 at 9:25 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
>
>
> How did you retrieve which osd nr to restart?
>
> Just for future reference, for when I run into a similar situation: if
> you have a client hanging on an osd node, can this be resolved by
> restarting the osd that it is reading from?
>
>
>
>
> -----Original Message-----
> From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
> Sent: donderdag 2 mei 2019 8:51
> To: Yan, Zheng
> Cc: ceph-users; pablo.llopis@xxxxxxx
> Subject: Re: co-located cephfs client deadlock
>
> On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > > clients + osd on same machines) for a while. Over the weekend (for
> > > the first time) we had one cephfs mount deadlock while some clients
> > > were running ior.
> > >
> > > All the ior processes are stuck in D state with this stack:
> > >
> > > [<ffffffffafdb53a3>] wait_on_page_bit+0x83/0xa0
> > > [<ffffffffafdb54d1>] __filemap_fdatawait_range+0x111/0x190
> > > [<ffffffffafdb5564>] filemap_fdatawait_range+0x14/0x30
> > > [<ffffffffafdb79e6>] filemap_write_and_wait_range+0x56/0x90
> > > [<ffffffffc0f11575>] ceph_fsync+0x55/0x420 [ceph]
> > > [<ffffffffafe76247>] do_fsync+0x67/0xb0
> > > [<ffffffffafe76530>] SyS_fsync+0x10/0x20
> > > [<ffffffffb0372d5b>] system_call_fastpath+0x22/0x27
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> >
> > are there hung osd requests in /sys/kernel/debug/ceph/xxx/osdc?
>
> We never managed to reproduce on this cluster.
>
> But on a separate (not co-located) cluster we had a similar issue. A
> client was stuck like this for several hours:
>
> HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> report slow requests
> MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
>     mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
> failing to respond to capability release client_id: 69092525
> MDS_SLOW_REQUEST 1 MDSs report slow requests
>     mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec
>
> Indeed there was a hung write on hpc070.cern.ch:
>
> 245540  osd100  1.9443e2a5  1.2a5  [100,1,75]/100  [100,1,75]/100  e74658
> fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001
> 0x400024  1  write
>
> I restarted osd.100 and the deadlocked request went away.
> Does this sound like a known issue?
>
> Thanks, Dan
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
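
For future reference, re the question above about finding which osd nr to
restart: it is the second column of each in-flight request line in the
client's /sys/kernel/debug/ceph/*/osdc (osd100 -> osd.100 in the output
above). A rough sketch for pulling that out on a stuck client -- the field
layout is assumed from the output quoted in this thread and may differ
between kernel versions:

  # sketch: list in-flight osd requests for every ceph mount on this client.
  # field layout assumed from the osdc output quoted above (tid, osdN, ...,
  # op); may vary by kernel version. debugfs is only readable as root.
  for f in /sys/kernel/debug/ceph/*/osdc; do
      echo "== $f =="
      awk '/^[0-9]+[ \t]+osd[0-9]+/ {print "osd." substr($2, 4), "tid", $1, $NF}' "$f"
  done

Any request that stays in that list across repeated runs is hung, and the
osd.N it names is the one to look at (or restart, as above).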