I think I am hitting this issue as well (at least I did with Luminous). I had to remove the hidden temp files rsync had left behind when the cephfs mount 'stalled'; otherwise I was never able to complete the rsync.
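For what it's worth, this is roughly how I clean them up (the mount point below is just an example; rsync's leftover partial files show up as hidden .<name>.XXXXXX files in the destination directories, so double-check the matches before deleting anything):

    # list rsync's hidden temp files left behind by an interrupted transfer
    find /mnt/cephfs/backup -type f -name '.*.??????' -print
    # once the list looks right, remove them so the next rsync run can finish
    find /mnt/cephfs/backup -type f -name '.*.??????' -delete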
-----Original Message-----
Cc: ceph-users
Subject: Re: hanging slow requests: failed to authpin, subtree is being exported

I've made a ticket for this issue: https://tracker.ceph.com/issues/42338

Thanks again!

K

On 15/10/2019 18:00, Kenneth Waegeman wrote:
> Hi Robert, all,
>
> On 23/09/2019 17:37, Robert LeBlanc wrote:
>> On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> wrote:
>>> Hi all,
>>>
>>> When syncing data with rsync, I'm often getting blocked slow requests, which also block access to this path.
>>>
>>>> 2019-09-23 11:25:49.477 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 31.895478 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:26:19.477 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 61.896079 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:27:19.478 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 121.897268 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:29:19.488 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 241.899467 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:33:19.680 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 482.087927 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:36:09.881 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 32.677511 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:36:39.881 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 62.678132 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:37:39.891 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 122.679273 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:39:39.892 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 242.684667 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:41:19.893 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 962.305681 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:43:39.923 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 482.712888 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:51:40.236 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 963.037049 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 11:57:20.308 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 1922.719287 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 12:07:40.621 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 1923.409501 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 12:29:20.639 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 3843.057602 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>>> 2019-09-23 12:39:40.872 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 3843.664914 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>>> If I try to ls these paths, the client hangs.
>>>
>>> I tried this with the ceph kernel client of CentOS 7.6 and now also with the ceph-fuse of 14.2.3; I see the issue with both. I tried remounting, but that did not solve the issue. If I restart the MDS, the issue goes away - for some time.
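>>> (By "restart the MDS" I just mean bouncing the active daemon so a standby takes over; on my setup that is roughly:)
>>>
>>>     # restart the MDS daemon on its host ...
>>>     systemctl restart ceph-mds@mds02
>>>     # ... or fail the active MDS over to a standby from any admin node
>>>     ceph mds fail mds02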
>>>
>>>> [root@mds02 ~]# ceph -s
>>>>   cluster:
>>>>     id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
>>>>     health: HEALTH_WARN
>>>>             1 MDSs report slow requests
>>>>             2 MDSs behind on trimming
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
>>>>     mgr: mds02(active, since 4d), standbys: mds01, mds03
>>>>     mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 up:standby
>>>>     osd: 535 osds: 535 up, 535 in
>>>>
>>>>   data:
>>>>     pools:   3 pools, 3328 pgs
>>>>     objects: 375.14M objects, 672 TiB
>>>>     usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
>>>>     pgs:     3319 active+clean
>>>>              9    active+clean+scrubbing+deep
>>>>
>>>>   io:
>>>>     client: 141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr
>>>>
>>>> [root@mds02 ~]# ceph health detail
>>>> HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
>>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>>>     mdsmds02(mds.1): 2 slow requests are blocked > 30 secs
>>>> MDS_TRIM 2 MDSs behind on trimming
>>>>     mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 200, num_segments: 3407
>>>>     mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 200, num_segments: 4240
>>> Can someone help me debug this further?
>> What is the makeup of your cluster? It sounds like it may be all HDD.
>
> Metadata is on SSDs; the data pool is erasure coded on HDDs.
>
>> If so, try adding this to /etc/ceph/ceph.conf on your OSDs and restart the processes.
>>
>> osd op queue cut off = high
>>
>> Depending on your version (default in newer versions), adding
>>
>> osd op queue = wpq
>>
>> can also help.
>
> I had a cluster outage to debug, but now the cluster is healthy again, so I tried this.
>
> I've set these parameters for all OSDs and restarted them.
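> Concretely, this is roughly the snippet I added on the OSD nodes (just a sketch of my ceph.conf; the layout may differ in your setup):
>
>     [osd]
>     osd op queue = wpq
>     osd op queue cut off = high
>
> After restarting an OSD I checked that the value was picked up via its admin socket (osd.0 is just an example id, run on that OSD's host):
>
>     ceph daemon osd.0 config get osd_op_queue_cut_off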
> Sadly I still have the same issue:
>
>> 2019-10-15 17:49:45.752 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
>> 2019-10-15 17:49:46.192 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
>> 2019-10-15 17:49:48.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567880 from mon.0
>> 2019-10-15 17:49:51.332 7f317a875700 1 mds.mds01 Updating MDS map to version 567881 from mon.0
>> 2019-10-15 17:49:59.812 7f317a875700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
>> 2019-10-15 17:50:03.762 7f317a875700 1 mds.mds01 Updating MDS map to version 567882 from mon.0
>> 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs
>> 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>> 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>> 2019-10-15 17:50:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567883 from mon.0
>> 2019-10-15 17:50:24.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs
>> 2019-10-15 17:50:29.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs
>> 2019-10-15 17:50:34.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs
>> 2019-10-15 17:50:39.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs
>> 2019-10-15 17:50:44.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs
>> 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs
>> 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>> 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
>> 2019-10-15 17:50:54.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs
>> 2019-10-15 17:50:59.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs
>> ...
>> 2019-10-15 17:53:02.422 7f317a875700 1 mds.mds01 Updating MDS map to version 567906 from mon.0
>> 2019-10-15 17:53:04.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs
>> 2019-10-15 17:53:05.732 7f317a875700 1 mds.mds01 Updating MDS map to version 567907 from mon.0
>> 2019-10-15 17:53:09.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs
>> 2019-10-15 17:53:14.412 7f317a875700 1 mds.mds01 Updating MDS map to version 567908 from mon.0
>> 2019-10-15 17:53:14.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs
>> 2019-10-15 17:53:17.662 7f317a875700 1 mds.mds01 Updating MDS map to version 567909 from mon.0
>> 2019-10-15 17:53:19.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs
>> 2019-10-15 17:53:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567910 from mon.0
>> 2019-10-15 17:53:24.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs
>> 2019-10-15 17:53:26.362 7f317a875700 1 mds.mds01 Updating MDS map to version 567911 from mon.0
>> ...
>> 2019-10-15 17:58:14.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs
>> 2019-10-15 17:58:18.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567970 from mon.0
>> 2019-10-15 17:58:19.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs
>> 2019-10-15 17:58:21.672 7f317a875700 1 mds.mds01 Updating MDS map to version 567971 from mon.0
>> 2019-10-15 17:58:24.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs
>> 2019-10-15 17:58:26.162 7f317a875700 1 mds.mds01 Updating MDS map to version 567972 from mon.0
>> 2019-10-15 17:58:29.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs
>
> So this blocks the cluster again.
>
> I'm running 14.2.3, except for the MDSs, which are running 14.2.2 due to a bug in 14.2.3.
>
> Thanks again!
>
> Kenneth
>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com