On Mon, Apr 16, 2018 at 5:29 PM, J. Bruce Fields <bfields@xxxxxxxxxx> wrote: > On Mon, Apr 16, 2018 at 03:48:49PM -0400, Olga Kornievskaia wrote: >> I have a failure that I’m investigating from the Bakeathon (this was >> going against redhat-75 server. Not sure who was running that server. >> But I believe that was RHEL7.5 server). I have a network trace and I >> was wondering if you could help with what the server is doing. >> >> I’m attaching a network trace. The parts I’m interested in explaining >> have to do with the kerberized backchannel for NFS4.0. >> >> A setup is client doing v3 and v4 mount and opening file with one >> version and appending to it with a different version. Its opened with >> 4.0 and got a delegation and it’s trying to write with v3 and server >> is recalling a delegation >> >> Server is issuing CB_NULL gss_init trying to establish a gss context. >> But it’s doing it twice in frame 259 and frame 261. It’s weird that >> it’s doing it twice. But Ok. > > I also wonder why the client sent two sets of > SETCLIENTID/SETCLIENTID_CONFIRM calls. The test does two different mounts and there is an unmount between them. There is 4.0 mount, unmount. Then v3 mount, then v4 mount again. I think that’s why there are 2 SETCLIENTIDs. > The second gets back the same > clientid as the first, so I think the only thing the server might do is > update the callback information--but the callback information is the > same in both cases. Maybe some server bug is causing it not to handle > that update correctly? > > I also expect the server to start a CB_NULL as soon as it gets the > setclientid_confirm, so I would have expected to see that sooner. That’s true but I have seen when CB_NULL doesn’t happen right away. So this happens once in a while, don’t know why... >> Now in frame, 283 it sends CB_COMPOUND CB_RECALL And in frame 285 it >> sends CB_NULL with gss_data with the CB_NULL as the payload. I think >> this is to establish the callback. >> >> In frame 287, client responds with RPC accept state of 6000 (which I >> believe is "drop reply"). > > That value shouldn't ever appear on the wire. > > Looks like RHEL7 may need 0533b13072f4 "svc: Avoid garbage replies when > pc_func() returns rpc_drop_reply". > > From nfs4_callback_compound, there are a couple reasons it might return > rpc_drop_reply--might be a failure to recognize the callback principal? So you think the callback server doesn’t need an initial NULL call before able to handle CB_RECALL? > >> I believe what’s happening is that because the client hasn’t received >> CB_NULL that establishes a callback channel but got a CB_RECALL it’s >> just ignoring it. > > I see two succesful CB_NULL calls and replies, so I think the context > establishment worked. I don't know why there's a third CB_NULL in frame > 285. The two CB_NULL calls are both gss_init calls. They are not RPC NULL call. Server for some reason establishes 2 different gss context (again not necessarily a problem). The 3rd CB_NULL is gss_data and it sends an actual NULL call. I believe that’s the one that establishes a callback channel. > >> What happens later is that server re-transmits the CB_COMPOUND but >> client replies out of the cache. What’s interesting is that by this >> time since CB_NULL that came after the CB_COMPOUND should have >> established the callback and if the re-trasmission was instead a new >> CB_RECALL, then it would have succeeded I would think. Server tries >> twice and then finally, the sets the CB_PATH_DOWN on the RENEW that >> client sends. >> >> Questions: 1. Do you see how CB_RECALL can travel before the callback >> is established? > > I assume it's using the context established by one of the previous > CB_NULLs. I guess you could check the context handle to confirm this. Both the CB_RECALL and CB_NULL carrying the NULL call that establishes context use the gss_context established by the 2nd of the CB_NULL calls. I still believe that CB_RECALL was sent incorrectly as a callback channel has not been established. >> 2. Should the server do something else beside >> re-transmitting the CB_RECALL because it got this “drop reply” error >> code back? > > Since the reply isn't really valid, I think retrying is the server's > only option. > > --b. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html