Vladimir reported a problem running a certain set of testcases. These tests spin up a number of QEMU/KVM guests that use nfsroot. After running for a while, they eventually reboot, and that boot hangs while mounting from the server because the server continually returns NFS4ERR_DELAY on a CREATE_SESSION operation.

With the addition of some new tracepoints that I'll eventually post, I think I have a handle on what's happening.

First, the server decides to issue a CB_RECALL_ANY to the client, but there is a fault on the callback channel:

    nfsd-42728 [000] ....... 1678.271093: nfsd_fh_verify: xid=0x9109236c fh_hash=0x4394a635 type=0x0 access=BYPASS_GSS
    nfsd-42728 [000] ....... 1678.271093: nfsd_compound_status: op=2/3 OP_PUTFH status=0
    nfsd-42728 [000] ....... 1678.271093: nfsd_fh_verify: xid=0x9109236c fh_hash=0x4394a635 type=LNK access=
    nfsd-42728 [000] ....... 1678.271095: nfsd_compound_status: op=3/3 OP_READLINK status=0
    kworker/u8:5-459 [002] ....... 1690.685190: nfsd_cb_recall_any: addr=192.168.50.101:833 client 660ffa56:69849ebb keep=0 bmval0=RDATA_DLG
    kworker/u8:5-459 [002] ....... 1690.685191: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (first try)
    kworker/u8:5-459 [002] ....... 1690.685194: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=UP
    kworker/u8:0-42764 [002] ....... 1699.821526: nfsd_cb_seq_status: task:00000001@00000003 sessionid=660ffa56:69849ebb:0000001e:00000000 tk_status=-5 seq_status=1
    kworker/u8:0-42764 [002] ....... 1699.821527: nfsd_cb_new_state: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT
    kworker/u8:0-42764 [002] ....... 1699.821528: nfsd_cb_restart: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (first try)
    kworker/u8:0-42764 [002] ....... 1699.821533: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)
    kworker/u8:0-42764 [002] ....... 1699.821538: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT
    kworker/u8:0-42764 [003] ....... 1709.037319: nfsd_cb_seq_status: task:00000002@00000003 sessionid=660ffa56:69849ebb:0000001e:00000000 tk_status=-5 seq_status=1
    kworker/u8:0-42764 [003] ....... 1709.037320: nfsd_cb_restart: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (first try)
    kworker/u8:0-42764 [003] ....... 1709.037325: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)
    kworker/u8:0-42764 [003] ....... 1709.037327: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT

This repeats for a while over several seconds. Eventually the server tries to probe the callback channel, but it's DOWN. Cue more nfsd_cb_queue/nfsd_cb_start calls repeating over time:

    kworker/u8:3-42 [003] ....... 1773.550044: nfsd_cb_restart: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (first try)
    kworker/u8:3-42 [003] ....... 1773.550049: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)
    kworker/u8:3-42 [003] ....... 1773.550053: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT
    nfsd-42728 [000] ....1.. 1781.401822: nfsd_cb_lost: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT
    nfsd-42728 [000] ....2.. 1781.401823: nfsd_cb_probe: addr=192.168.50.101:0 client 660ffa56:69849ebb state=FAULT
    nfsd-42728 [000] ....2.. 1781.401823: nfsd_cb_new_state: addr=192.168.50.101:0 client 660ffa56:69849ebb state=UNKNOWN
    nfsd-42728 [000] ....2.. 1781.401824: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=00000000c4627b33 (first try)
    kworker/u8:3-42 [000] ....... 1781.401832: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=UNKNOWN
    kworker/u8:5-459 [003] ....... 1781.401833: nfsd_cb_seq_status: task:0000000a@00000003 sessionid=660ffa56:69849ebb:0000001e:00000000 tk_status=-107 seq_status=1
    kworker/u8:3-42 [000] ....... 1781.401833: nfsd_cb_bc_update: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=00000000c4627b33 (first try)
    kworker/u8:3-42 [000] ....... 1781.401833: nfsd_cb_bc_shutdown: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=00000000c4627b33 (first try)
    kworker/u8:5-459 [003] ....... 1781.401833: nfsd_cb_restart: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (first try)
    kworker/u8:5-459 [003] ....... 1781.401836: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)
    kworker/u8:3-42 [000] ....... 1781.401858: nfsd_cb_new_state: addr=192.168.50.101:0 client 660ffa56:69849ebb state=DOWN
    kworker/u8:3-42 [000] ....... 1781.401858: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=00000000c4627b33 (first try)
    kworker/u8:3-42 [000] ....... 1781.401859: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=DOWN

Eventually, the client calls back (after rebooting), but it can't get through:

    kworker/u8:3-42 [000] ....... 1782.391951: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=DOWN
    kworker/u8:3-42 [000] ....... 1782.391951: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=00000000c4627b33 (first try)
    kworker/u8:3-42 [000] ....... 1782.391952: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=DOWN
    kworker/u8:3-42 [000] ....... 1782.391952: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)
    nfsd-42728 [000] ....... 1782.409010: nfsd_compound: xid=0xa299edc2 opcnt=1 tag=
    nfsd-42728 [000] ....1.. 1782.409039: nfsd_clid_verf_mismatch: client 660ffa56:69849ebb verf=0x17c364ee8596f72a, updated=0x17c3650afe9e0a7f from addr=192.168.50.101:926
    nfsd-42728 [000] ....... 1782.409041: nfsd_compound_status: op=1/1 OP_EXCHANGE_ID status=0
    nfsd-42728 [000] ....... 1782.409241: nfsd_compound: xid=0xa399edc2 opcnt=1 tag=
    nfsd-42728 [000] ....1.. 1782.409245: nfsd_clid_verf_mismatch: client 660ffa56:69849ebb verf=0x17c364ee8596f72a, updated=0x17c3650afe9e0a7f from addr=192.168.50.101:926
    nfsd-42728 [000] ....... 1782.409245: nfsd_compound_status: op=1/1 OP_EXCHANGE_ID status=0
    nfsd-42728 [000] ....... 1782.409309: nfsd_compound: xid=0xa499edc2 opcnt=1 tag=
    nfsd-42728 [000] ....1.. 1782.409325: check_slot_seqid: seqid=1 slot_seqid=0 inuse=0 conf=0 ret=0
    nfsd-42728 [000] ....1.. 1782.409326: mark_client_expired_locked: addr=192.168.50.101:0 client 660ffa56:69849ebb error=10008
    nfsd-42728 [000] ....... 1782.409329: nfsd_compound_status: op=1/1 OP_CREATE_SESSION status=10008
    kworker/u8:3-42 [000] ....... 1782.417901: nfsd_cb_start: addr=192.168.50.101:0 client 660ffa56:69849ebb state=DOWN
    kworker/u8:3-42 [000] ....... 1782.417901: nfsd_cb_queue: addr=192.168.50.101:0 client 660ffa56:69849ebb cb=000000006e42ee91 (need restart)

The EXCHANGE_ID works, but CREATE_SESSION returns NFS4ERR_DELAY because the server couldn't unhash the client due to the refcount (cl_rpc_users) being high. I didn't capture the value of the refcount at this point, but the CB_RECALL_ANY job holds a reference to the old client, so that alone would be enough to prevent unhashing it.

I think we need to rework the refcounting on the CB_RECALL_ANY job, at a minimum. It probably shouldn't hold a reference to the client. Maybe make it store the clientid and just redo the lookup if the client is still around? I might try to spin up a patch along those lines.

Longer term, I think we need to take a hard look at why the cl_rpc_users refcount ensures that the client remains hashed. It seems like we ought to be able to unhash it, even if there are outstanding references, no?

Thoughts on other ways to fix this, or other structural fixes?

-- 
Jeff Layton <jlayton@xxxxxxxxxx>