Re: hanging slow requests: failed to authpin, subtree is being exported

I think I am having this issue as well (at least I did with Luminous). 
When the cephfs mount 'stalled', I had to remove the hidden temporary 
files rsync had left behind; otherwise I would never be able to 
complete the rsync.
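
For reference, a rough sketch of how those leftovers can be tracked 
down (the path below is a made-up placeholder, and the '.*.??????' 
pattern simply matches rsync's default temporary-file naming; check 
the list before deleting anything):

  # list rsync's hidden temp/partial files under the affected path
  find /mnt/cephfs/backups -type f -name '.*.??????' -print
  # once verified, the same command with -delete removes them
  # find /mnt/cephfs/backups -type f -name '.*.??????' -delete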


-----Original Message-----
Cc: ceph-users
Subject: Re: hanging slow requests: failed to authpin, subtree is being exported


I've made a ticket for this issue: https://tracker.ceph.com/issues/42338

Thanks again!

K

On 15/10/2019 18:00, Kenneth Waegeman wrote:
> Hi Robert, all,
>
>
> On 23/09/2019 17:37, Robert LeBlanc wrote:
>> On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman 
>> <kenneth.waegeman@xxxxxxxx> wrote:
>>> Hi all,
>>>
>>> When syncing data with rsync, I'm often getting blocked slow 
>>> requests, which also block access to this path.
>>>
>>>> 2019-09-23 11:25:49.477 7f4f401e8700 0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 31.895478 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:26:19.477 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 61.896079 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:27:19.478 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 121.897268 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:29:19.488 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 241.899467 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:33:19.680 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 482.087927 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:36:09.881 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 32.677511 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:36:39.881 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 62.678132 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:37:39.891 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 122.679273 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:39:39.892 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 242.684667 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:41:19.893 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 962.305681 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:43:39.923 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 482.712888 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:51:40.236 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 963.037049 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 11:57:20.308 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 1922.719287 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 12:07:40.621 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 1923.409501 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 12:29:20.639 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 3843.057602 seconds old, received at 2019-09-23
>>>> 11:25:17.598152: client_request(client.38352684:92684 lookup
>>>> #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>>> 2019-09-23 12:39:40.872 7f4f401e8700  0 log_channel(cluster) log 
>>>> [WRN]
>>>> : slow request 3843.664914 seconds old, received at 2019-09-23
>>>> 11:35:37.217113: client_request(client.38347357:111963 lookup 
>>>> #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
>>>> caller_gid=0{0,}) currently failed to authpin, subtree is being 
>>>> exported
>>> If I try to ls these paths, the client will hang.
>>>
>>> I tried this using the ceph kernel client of CentOS 7.6 and now also 
>>> with the ceph-fuse of 14.2.3; I see the issue with both. I tried 
>>> remounting, but that did not solve the issue. If I restart the MDS, 
>>> the issue goes away - for some time.
>>>
>>>
>>>> [root@mds02 ~]# ceph -s
>>>>    cluster:
>>>>      id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
>>>>      health: HEALTH_WARN
>>>>              1 MDSs report slow requests
>>>>              2 MDSs behind on trimming
>>>>
>>>>    services:
>>>>      mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
>>>>      mgr: mds02(active, since 4d), standbys: mds01, mds03
>>>>      mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 
>>>> up:standby
>>>>      osd: 535 osds: 535 up, 535 in
>>>>
>>>>    data:
>>>>      pools:   3 pools, 3328 pgs
>>>>      objects: 375.14M objects, 672 TiB
>>>>      usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
>>>>      pgs:     3319 active+clean
>>>>               9    active+clean+scrubbing+deep
>>>>
>>>>    io:
>>>>      client:   141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr
>>>>
>>>> [root@mds02 ~]# ceph health detail
>>>> HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming 
>>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>>>      mdsmds02(mds.1): 2 slow requests are blocked > 30 secs 
>>>> MDS_TRIM 2 MDSs behind on trimming
>>>>      mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 
>>>> 200,
>>>> num_segments: 3407
>>>>      mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 
>>>> 200,
>>>> num_segments: 4240
>>> Can someone help me to debug this further?
>> What is the makeup of your cluster? It sounds like it may be all HDD.
>
> metadata is on SSDs, the data pool is on HDDs with erasure coding
>
>
>>
>> If so, try adding this to /etc/ceph/ceph.conf on your OSDs and 
>> restart the processes.
>>
>> osd op queue cut off = high
>>
>> Depending on your version (default in newer versions), adding
>>
>> osd op queue = wpq
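
For anyone trying the same, a minimal sketch of how that advice might 
look in /etc/ceph/ceph.conf on each OSD host (putting it in the [osd] 
section is an assumption on my side, and the OSDs need a restart 
afterwards, as noted above):

  [osd]
  # only the very highest-priority ops bypass the weighted queue
  osd op queue cut off = high
  # weighted priority queue; already the default in newer releases
  osd op queue = wpq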
>
> I had a cluster outage to debug, but now the cluster is healthy 
> again, so I tried this.
>
> I've set these parameters for all OSDs and restarted them. Sadly I 
> still have the same issue:
>
>> 2019-10-15 17:49:45.752 7f3174869700  0 mds.0.migrator nicely 
>> exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] 
>> auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 
>> state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9)
>> n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346
>> 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000
>> b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 |
>> child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
>> 2019-10-15 17:49:46.192 7f3174869700  0 mds.0.migrator nicely 
>> exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] 
>> auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 
>> state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9)
>> n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346
>> 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000
>> b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 |
>> child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
>> 2019-10-15 17:49:48.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567880 from mon.0
>> 2019-10-15 17:49:51.332 7f317a875700  1 mds.mds01 Updating MDS map to version 567881 from mon.0
>> 2019-10-15 17:49:59.812 7f317a875700  0 mds.0.migrator nicely 
>> exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] 
>> auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 
>> state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9)
>> n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510
>> 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1
>> replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
>> 2019-10-15 17:50:03.762 7f317a875700  1 mds.mds01 Updating MDS map to version 567882 from mon.0
>> 2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 2 included below; oldest blocked for >
>> 33.413517 secs
>> 2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : slow request 33.413517 seconds old, received at 2019-10-15
>> 17:49:46.402110: client_request(client.41588185:115259 lookup 
>> #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat
>> 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently 
>> failed to authpin, subtree is being exported
>> 2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : slow request 32.804341 seconds old, received at 2019-10-15
>> 17:49:47.011286: client_request(client.41588185:115282 lookup 
>> #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat
>> 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently 
>> failed to authpin, subtree is being exported
>> 2019-10-15 17:50:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567883 from mon.0
>> 2019-10-15 17:50:24.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 38.413619 secs
>> 2019-10-15 17:50:29.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 43.413714 secs
>> 2019-10-15 17:50:34.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 48.413976 secs
>> 2019-10-15 17:50:39.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 53.414073 secs
>> 2019-10-15 17:50:44.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 58.414164 secs
>> 2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 2 included below; oldest blocked for >
>> 63.414259 secs
>> 2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : slow request 63.414258 seconds old, received at 2019-10-15
>> 17:49:46.402110: client_request(client.41588185:115259 lookup 
>> #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat
>> 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently 
>> failed to authpin, subtree is being exported
>> 2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : slow request 62.805083 seconds old, received at 2019-10-15
>> 17:49:47.011286: client_request(client.41588185:115282 lookup 
>> #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat
>> 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently 
>> failed to authpin, subtree is being exported
>> 2019-10-15 17:50:54.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 68.414353 secs
>> 2019-10-15 17:50:59.802 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for > 
>> 73.414450 secs
>> ...
>> 2019-10-15 17:53:02.422 7f317a875700  1 mds.mds01 Updating MDS map to
>> version 567906 from mon.0
>> 2019-10-15 17:53:04.812 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 198.417149 secs
>> 2019-10-15 17:53:05.732 7f317a875700  1 mds.mds01 Updating MDS map to version 567907 from mon.0
>> 2019-10-15 17:53:09.812 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 203.417241 secs
>> 2019-10-15 17:53:14.412 7f317a875700  1 mds.mds01 Updating MDS map to version 567908 from mon.0
>> 2019-10-15 17:53:14.812 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 208.417346 secs
>> 2019-10-15 17:53:17.662 7f317a875700  1 mds.mds01 Updating MDS map to version 567909 from mon.0
>> 2019-10-15 17:53:19.812 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 213.417443 secs
>> 2019-10-15 17:53:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567910 from mon.0
>> 2019-10-15 17:53:24.812 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 218.417538 secs
>> 2019-10-15 17:53:26.362 7f317a875700  1 mds.mds01 Updating MDS map to version 567911 from mon.0 ...
>> 2019-10-15 17:58:14.832 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 508.440553 secs
>> 2019-10-15 17:58:18.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567970 from mon.0
>> 2019-10-15 17:58:19.832 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 513.440638 secs
>> 2019-10-15 17:58:21.672 7f317a875700  1 mds.mds01 Updating MDS map to version 567971 from mon.0
>> 2019-10-15 17:58:24.832 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 518.440702 secs
>> 2019-10-15 17:58:26.162 7f317a875700  1 mds.mds01 Updating MDS map to version 567972 from mon.0
>> 2019-10-15 17:58:29.832 7f3178070700  0 log_channel(cluster) log 
>> [WRN] : 2 slow requests, 0 included below; oldest blocked for >
>> 523.440775 secs
> So this blocks the cluster again.
>
> I'm running 14.2.3, except for the MDS daemons, which are running 
> 14.2.2 due to a bug in 14.2.3.
>
> Thanks again!
>
> Kenneth
>
>>
>> can also help.
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



