How did you retrieve which osd number to restart? Just for future reference, in case I run into a similar situation. If a client hangs on an osd node, can this be resolved by restarting the osd that it is reading from?

-----Original Message-----
From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
Sent: Thursday, 2 May 2019 8:51
To: Yan, Zheng
Cc: ceph-users; pablo.llopis@xxxxxxx
Subject: Re: co-located cephfs client deadlock

On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > clients + osd on same machines) for a while. Over the weekend (for
> > the first time) we had one cephfs mount deadlock while some clients
> > were running ior.
> >
> > All the ior processes are stuck in D state with this stack:
> >
> > [<ffffffffafdb53a3>] wait_on_page_bit+0x83/0xa0
> > [<ffffffffafdb54d1>] __filemap_fdatawait_range+0x111/0x190
> > [<ffffffffafdb5564>] filemap_fdatawait_range+0x14/0x30
> > [<ffffffffafdb79e6>] filemap_write_and_wait_range+0x56/0x90
> > [<ffffffffc0f11575>] ceph_fsync+0x55/0x420 [ceph]
> > [<ffffffffafe76247>] do_fsync+0x67/0xb0
> > [<ffffffffafe76530>] SyS_fsync+0x10/0x20
> > [<ffffffffb0372d5b>] system_call_fastpath+0x22/0x27
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
>
> are there hung osd requests in /sys/kernel/debug/ceph/xxx/osdc?

We never managed to reproduce this on that cluster. But on a separate (not co-located) cluster we had a similar issue. A client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02 failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec

Indeed there was a hung write on hpc070.cern.ch:

245540  osd100  1.9443e2a5  1.2a5  [100,1,75]/100  [100,1,75]/100  e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001  0x400024  1  write

I restarted osd.100 and the deadlocked request went away.

Does this sound like a known issue?

Thanks, Dan
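
For future reference, regarding the question at the top of the thread: the osd number can be read from the hung entry in the kernel client's osdc debug file, "osd100" in the output above. A rough sketch of the procedure, assuming debugfs is mounted at /sys/kernel/debug and the standard ceph-osd@ systemd units; exact paths and unit names may differ on your deployment:

    # on the client with the hung mount, list in-flight OSD requests
    cat /sys/kernel/debug/ceph/*/osdc

    # each line looks roughly like:
    #   <tid>  osd<N>  <pgid>  ...  <object name>  ...  <op>
    # an entry that stays there across repeated reads is the stuck request,
    # and osd<N> is the OSD to look at (osd100 in the output above)

    # then, on the host carrying that OSD, restart it, e.g. for osd.100:
    systemctl restart ceph-osd@100

Restarting the OSD presumably clears the hang because the client resends its outstanding requests once the OSD comes back and the PG re-peers, instead of waiting indefinitely on the wedged op.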