> On Mar 12, 2020, at 3:17 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: > > On Tue, Mar 10, 2020 at 7:56 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: >> >> >> >>> On Mar 10, 2020, at 5:07 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: >>> >>> Hi Chuck, >>> >>> On Tue, Mar 10, 2020 at 3:57 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: >>>> >>>> Hi Olga- >>>> >>>>> On Mar 10, 2020, at 2:58 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: >>>>> >>>>> Ever since commit 2c94b8eca1a26 "SUNRPC: Use au_rslack when computing >>>>> reply buffer size". It changed how "req->rq_rcvsize" is calculated. It >>>>> used to use au_cslack value which was nice and large and changed it to >>>>> au_rslack value which turns out to be too small. >>>>> >>>>> Since 5.1, v3 mount with sec=krb5p fails against an Ontap server >>>>> because client's receive buffer it too small. >>>> >>>> Can you be more specific? For instance, why is 100 bytes adequate for >>>> Linux servers, but not OnTAP? >>> >>> I don't know why Ontap sends more data than Linux server. >> >> Let's be sure we are fixing the right problem. Yes, au_rslack is >> smaller in v5.1, and that results in a behavioral regression. But >> exactly which part of the new calculation is incorrect is not yet >> clear. Simply bumping GSS_VERF_SLACK could very well plaster over >> the real problem. >> >> >>> The opaque_len is just a lot larger. For the first message Linux >>> opaque_len is 120bytes and Ontap it's 206. So it could be for instance >>> for FSINFO that sends the file handle, for Netapp the file handle is >>> 44bytes and for Linux it's only 28bytes. >> >> The maximum filehandle size should already be accounted for in the >> maxsize macro for FSINFO. >> >> Is this problem evident only with NFSv3 plus krb5p? > > So far that seems to be the case. Every other version and security flavor works. > >>>> Is this explanation for the current value not correct? >>>> >>>> 51 /* length of a krb5 verifier (48), plus data added before arguments when >>>> 52 * using integrity (two 4-byte integers): */ >>> >>> I'm not sure what it is suppose to be. Isn't "data before arguments" >>> can vary in length and thus explain why linux and onto sizes are >>> different? >>> Looking at the network trace the krb5 verifier I see is 36bytes. >> >> GSS_VERF_SLACK is only for the extra length added by GSS data. The >> length of the RPC message itself is handled separately, see above. >> >> Can you post a Wireshark dissection of the problematic FSINFO reply? >> (Having a working reply from Linux and a failing reply from OnTAP >> would be even better). > > I'm attaching two files. I mount against linux and mount against ontap. > > > > >>>>> For GSS, au_rslack is calculated from GSS_VERF_SLACK value which is >>>>> currently 100. And it's not enough. Changing it to 104 works and then >>>>> au_rslack is recalculated based on actual received mic.len and not >>>>> just the default buffer size. >> >> What are the computed au_ralign and au_rslack values after the first >> successful operation? > > With GSS_VERF_SLACK 100 > Linux run: > > Mar 12 13:14:29 localhost kernel: AGLO: gss_create_new setting for > auth=00000000e14fdc39 cslack=200 and rslack=25 > Mar 12 13:14:29 localhost kernel: AGLO: gss_create_new setting for > auth=00000000e14fdc39 ralign=25 > Mar 12 13:14:29 localhost kernel: NFS call fsinfo > ... <gssd upcall> > Mar 12 13:14:29 localhost kernel: AGLO: call_allocate > auth=00000000e14fdc39 au_cslack=200 au_rslack=25 rq_rcvsize=256 > p_replen=35 > Mar 12 13:14:29 localhost kernel: AGLO: gss_unwrap_resp_priv rcv_buf > len=176 is ok offset=56 opaque=120 > Mar 12 13:14:29 localhost kernel: AGLO: gss_unwrap_resp_priv **** > auth=00000000e14fdc39 resetting au_rslack=26 > Mar 12 13:14:29 localhost kernel: AGLO: gss_unwrap_resp_priv **** > auth=00000000e14fdc39 resetting au_ralign=26 > Mar 12 13:14:29 localhost kernel: NFS reply fsinfo: 0 > > Ontap run: > Mar 12 13:16:46 localhost kernel: AGLO: gss_create_new setting for > auth=00000000e02d9e6e cslack=200 and rslack=25 > Mar 12 13:16:46 localhost kernel: AGLO: gss_create_new setting for > auth=00000000e02d9e6e ralign=25 > Mar 12 13:16:46 localhost kernel: NFS call fsinfo > ... <gssd upcall> > Mar 12 13:16:46 localhost kernel: AGLO: call_allocate > auth=00000000e02d9e6e au_cslack=200 au_rslack=25 rq_rcvsize=256 > p_replen=35 > Mar 12 13:16:46 localhost kernel: AGLO: gss_unwrap_resp_priv rcv_buf > len=256 too small offset=56 opaque=204 > Mar 12 13:16:46 localhost kernel: NFS reply fsinfo: -5 > > With GSS_VERF_SLACK 104 > Mar 12 13:33:23 localhost kernel: AGLO: gss_create_new setting for > auth=000000004a545ea2 cslack=200 and rslack=26 > Mar 12 13:33:23 localhost kernel: AGLO: gss_create_new setting for > auth=000000004a545ea2 ralign=26 > Mar 12 13:33:23 localhost kernel: NFS call fsinfo > ... <gssd upcall> > Mar 12 13:33:23 localhost kernel: AGLO: call_allocate > auth=000000004a545ea2 au_cslack=200 au_rslack=26 rq_rcvsize=260 > p_replen=35 > Mar 12 13:33:23 localhost kernel: AGLO: gss_unwrap_resp_priv rcv_buf > len=260 is ok offset=56 opaque=204 > Mar 12 13:33:23 localhost kernel: AGLO: gss_unwrap_resp_priv **** > auth=000000004a545ea2 resetting au_rslack=26 > Mar 12 13:33:23 localhost kernel: AGLO: gss_unwrap_resp_priv **** > auth=000000004a545ea2 resetting au_ralign=26 > Mar 12 13:33:23 localhost kernel: NFS reply fsinfo: 0 > > difference in actual packets in fsinfo is that ontap sends postattrs > so that's 84bytes. > > req->rq_rcvsize = RPC_REPHDRSIZE + auth->au_rslack + \ > max_t(size_t, proc->p_replen, 2); > > RPC_REPHDRSIZE is defined to be 4 (*4) (it says it doesn't include > the verifier ???) > rslack needs to cover kerberos blob 25 (*4) (but that's the kerberos > part a part of the wrap and not the verifier) > p_replen to cover fs_info args 35 (*4) (seems like the right number) > > So we are missing the GSS to include the verifier and the kerberos > blob of the wrapper (and lengths!!). Basically we need GSS_VERF_SLACK > to cover 2 kerberos blobs (or more specifically KRB_TOKEN_CFX_GetMic > 9*4 and KRB_TOKEN_CFS_WRAP 15*4 + 2 lengths before the kerberos blobs > = 104 and we are only giving 100). GSS_VERF_SLACK is also used for setting au_verfsize, so please don't change its value. Define a new constant for initializing au_rslack. Let's construct that constant using the KRB5_TOKEN constants you mention here... include/linux/sunrpc/gss_krb5.h has 221 /* 222 * This compile-time check verifies that we will not exceed the 223 * slack space allotted by the client and server auth_gss code 224 * before they call gss_wrap(). 225 */ 226 #define GSS_KRB5_MAX_SLACK_NEEDED \ 227 (GSS_KRB5_TOK_HDR_LEN /* gss token header */ \ 228 + GSS_KRB5_MAX_CKSUM_LEN /* gss token checksum */ \ 229 + GSS_KRB5_MAX_BLOCKSIZE /* confounder */ \ 230 + GSS_KRB5_MAX_BLOCKSIZE /* possible padding */ \ 231 + GSS_KRB5_TOK_HDR_LEN /* encrypted hdr in v2 token */\ 232 + GSS_KRB5_MAX_CKSUM_LEN /* encryption hmac */ \ 233 + 4 + 4 /* RPC verifier */ \ 234 + GSS_KRB5_TOK_HDR_LEN \ 235 + GSS_KRB5_MAX_CKSUM_LEN) So this, or something like this, plus the comment below. > The reason things work against linux is because it has a nice buffer > of 84bytes of post attributes that it doesn't send. Missing post-attributes makes sense. Thank you for the analysis. > To address your later point that kerberos blob is encryption type > depended and that once some other encryption is added to gss-kerberos > that's larger than existing checksum then this value would need to be > adjusted again. > If you agree with my reasoning for the number then I'd like to send > out a patch now. The current numbers are based on the kernel GSS implementation supporting only Kerberos with a narrow set of enctypes. That needs to be made clear in a documenting comment. The reason this has been bothersome is because the existing setting is a magic number (100), and its documenting comment has been stale since 2006. Any proposed fix has to address the missing documentation. >>>>> I would like to propose to change it to something a little larger than >>>>> 104, like 120 to give room if some other server might reply with >>>>> something even larger. >>>> >>>> Why does it need to be larger than 104? >>> >>> I don't know why 100 was chosen and given that I think arguments are >>> taken into the account and arguments can change. I think NetApp has >>> changed their file handle sizes (at some point, not in the near past >>> but i think so?). Perhaps they might want to do that again so the size >>> will change again. >>> >>> Honestly, I would have like for 100 to be 200 to be safe. >> >> To be safe, I would like to have a good understanding of the details, >> rather than guessing at an arbitrary maximum value. Let's choose a >> rational maximum and include a descriptive comment about why that value >> is the best choice. >> >> >>>>> Thoughts? Will send an actual patch if no objections to this one. >>>>> >>>>> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c >>>>> index 24ca861..44ae6bc 100644 >>>>> --- a/net/sunrpc/auth_gss/auth_gss.c >>>>> +++ b/net/sunrpc/auth_gss/auth_gss.c >>>>> @@ -50,7 +50,7 @@ >>>>> #define GSS_CRED_SLACK (RPC_MAX_AUTH_SIZE * 2) >>>>> /* length of a krb5 verifier (48), plus data added before arguments when >>>>> * using integrity (two 4-byte integers): */ >>>>> -#define GSS_VERF_SLACK 100 >>>>> +#define GSS_VERF_SLACK 120 >>>>> >>>>> static DEFINE_HASHTABLE(gss_auth_hash_table, 4); >>>>> static DEFINE_SPINLOCK(gss_auth_hash_lock); >>>> >>>> -- >>>> Chuck Lever >> >> -- >> Chuck Lever >> >> >> > <linux-v3-krb5p-mount.pcap.gz><ontap-v3-krb5-mount.pcap.gz> -- Chuck Lever