Hi Robert, all,

On 23/09/2019 17:37, Robert LeBlanc wrote:
On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> wrote:

Hi all,

When syncing data with rsync, I'm often getting blocked slow requests, which also block access to this path.

2019-09-23 11:25:49.477 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 31.895478 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:26:19.477 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 61.896079 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:27:19.478 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 121.897268 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:29:19.488 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 241.899467 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:33:19.680 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 482.087927 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:36:09.881 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 32.677511 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:36:39.881 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 62.678132 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:37:39.891 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 122.679273 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:39:39.892 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 242.684667 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:41:19.893 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 962.305681 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:43:39.923 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 482.712888 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:51:40.236 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 963.037049 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:57:20.308 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 1922.719287 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:07:40.621 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 1923.409501 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:29:20.639 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 3843.057602 seconds old, received at 2019-09-23 11:25:17.598152: client_request(client.38352684:92684 lookup #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:39:40.872 7f4f401e8700 0 log_channel(cluster) log [WRN] : slow request 3843.664914 seconds old, received at 2019-09-23 11:35:37.217113: client_request(client.38347357:111963 lookup #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported

If I try to ls these paths, the client will hang. I tried this using the ceph kernel client of CentOS 7.6 and now also with ceph-fuse 14.2.3; I see the issue with both.
I tried remounting, but that did not solve the issue. If I restart the mds, the issue goes away - for some time.

[root@mds02 ~]# ceph -s
  cluster:
    id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
    health: HEALTH_WARN
            1 MDSs report slow requests
            2 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
    mgr: mds02(active, since 4d), standbys: mds01, mds03
    mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 up:standby
    osd: 535 osds: 535 up, 535 in

  data:
    pools:   3 pools, 3328 pgs
    objects: 375.14M objects, 672 TiB
    usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
    pgs:     3319 active+clean
             9    active+clean+scrubbing+deep

  io:
    client:   141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr

[root@mds02 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsmds02(mds.1): 2 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
    mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 200, num_segments: 3407
    mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 200, num_segments: 4240

Can someone help me to debug this further?

What is the make-up of your cluster? It sounds like it may be all HDD.
Metadata is on SSDs; the data pool is erasure-coded on HDDs.
If so, try adding this to /etc/ceph/ceph.conf on your OSDs and restart the processes:

    osd op queue cut off = high

Depending on your version (default in newer versions), adding

    osd op queue = wpq
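Spelled out as a ceph.conf fragment (a sketch; this assumes the settings belong in the [osd] section, as they are OSD options, and that the daemons are restarted afterwards):

```ini
# /etc/ceph/ceph.conf -- settings suggested above; restart the OSDs
# for them to take effect.
[osd]
# Keep client ops ahead of recovery/backfill in the op queue.
osd op queue cut off = high
# Weighted priority queue scheduler (already the default in newer releases).
osd op queue = wpq
```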
I had some cluster outages to debug, but the cluster is healthy again now, so I tried this.
I've set these parameters on all OSDs and restarted them. Sadly, I still have the same issue:
2019-10-15 17:49:45.752 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:46.192 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:48.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567880 from mon.0
2019-10-15 17:49:51.332 7f317a875700 1 mds.mds01 Updating MDS map to version 567881 from mon.0
2019-10-15 17:49:59.812 7f317a875700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
2019-10-15 17:50:03.762 7f317a875700 1 mds.mds01 Updating MDS map to version 567882 from mon.0
2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs
2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567883 from mon.0
2019-10-15 17:50:24.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs
2019-10-15 17:50:29.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs
2019-10-15 17:50:34.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs
2019-10-15 17:50:39.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs
2019-10-15 17:50:44.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs
2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs
2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:54.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs
2019-10-15 17:50:59.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs
...
2019-10-15 17:53:02.422 7f317a875700 1 mds.mds01 Updating MDS map to version 567906 from mon.0
2019-10-15 17:53:04.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs
2019-10-15 17:53:05.732 7f317a875700 1 mds.mds01 Updating MDS map to version 567907 from mon.0
2019-10-15 17:53:09.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs
2019-10-15 17:53:14.412 7f317a875700 1 mds.mds01 Updating MDS map to version 567908 from mon.0
2019-10-15 17:53:14.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs
2019-10-15 17:53:17.662 7f317a875700 1 mds.mds01 Updating MDS map to version 567909 from mon.0
2019-10-15 17:53:19.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs
2019-10-15 17:53:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567910 from mon.0
2019-10-15 17:53:24.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs
2019-10-15 17:53:26.362 7f317a875700 1 mds.mds01 Updating MDS map to version 567911 from mon.0
...
2019-10-15 17:58:14.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs
2019-10-15 17:58:18.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567970 from mon.0
2019-10-15 17:58:19.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs
2019-10-15 17:58:21.672 7f317a875700 1 mds.mds01 Updating MDS map to version 567971 from mon.0
2019-10-15 17:58:24.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs
2019-10-15 17:58:26.162 7f317a875700 1 mds.mds01 Updating MDS map to version 567972 from mon.0
2019-10-15 17:58:29.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs
So this blocks the cluster again. I'm running 14.2.3, except for the MDSs, which are running 14.2.2 due to a bug in 14.2.3.
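(Side note: since these warnings repeat the same request with a growing age, a short script can condense a log like the above into one line per blocked request. This is just a sketch; it only assumes the "slow request N seconds old, received at ..." wording seen in the logs here.)

```python
import re
from collections import defaultdict

# Matches the MDS slow-request warnings quoted above, capturing the age
# and the client request id (e.g. "client.41588185:115259").
SLOW_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>\d{4}-\d{2}-\d{2} [\d:.]+): "
    r"client_request\((?P<req>client\.\d+:\d+)"
)

def oldest_ages(log_text):
    """Return the largest reported blocked age (seconds) per request id."""
    ages = defaultdict(float)
    for m in SLOW_RE.finditer(log_text):
        ages[m.group("req")] = max(ages[m.group("req")], float(m.group("age")))
    return dict(ages)

# Two of the warnings from the log above, both for the same request id.
sample = (
    "2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : "
    "slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: "
    "client_request(client.41588185:115259 lookup ...) currently failed to authpin\n"
    "2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : "
    "slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: "
    "client_request(client.41588185:115259 lookup ...) currently failed to authpin\n"
)

for req, age in sorted(oldest_ages(sample).items()):
    print(f"{req}: blocked for at least {age} s")
# -> client.41588185:115259: blocked for at least 63.414258 s
```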
Thanks again! Kenneth
can also help.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com