NFS4ERR_MOVED causing crash/stuck operations at source

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I am working on a  prototype server implementation of
Referral/Migration.  i have the following questions based on my issues
listed below:

1. How the Linux client will handle more than one NFS4ERR_MOVED error
code. During the first error exchange between the client and server
the connection to the destination is established. but on subsequent
exchanges  these are the following issues

- Client crash after migration to destination
- Error with regards to state migration in DMESG logs
- Stuck operations at the source

2. Is it mandatory to have a state shared between source and
destination in a migration scenario using NFS4ERR_MOVED. Because i
don't see code in linux client which does reclaim opens/locks after
migrating to destination.


Issue 1:
Crash if more than one NFS4ERR_MOVED is reported

Step 1: On going traffic in source server

step 2: Detect migration in server and return NFS4ERR_MOVED to client

step 3: The client requests either a GETATTR or LOOKUP to query FS_LOCATION

step 4: The server returns the FS_LOCATION

step 5: The client connects to the server mentioned in FS_LOCATION

step 6: Traffic is established with destination


I am seeing there is a crash in between step 4 and 5 if there is more
than one NFS4ERR_MOVED error returned from the server because the
traffic from the application does not stop because there are more than
one mounts.  I am seeing the following error message in DMESG logs and
there is also an observable client crash with the stack below


[ 2357.266551] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
[ 2357.266558] <-- nfs4_try_migration: migration succeeded
[ 2357.266561] --> nfs4_alloc_slot used_slots=0000
highest_used=4294967295 max_slots=16
[ 2357.266562] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
[ 2357.266828] nfs4_handle_migration: migration reported on "xxx.test.net"
[ 2357.266836] RPC:       xs_tcp_send_request(208) = 0
[ 2357.266836] --> nfs41_call_sync_prepare data->seq_server 000000000004335e
[ 2357.266839] --> nfs4_alloc_slot used_slots=1ffff highest_used=16 max_slots=64
[ 2357.266937] BUG: kernel NULL pointer dereference, address: 00000000000004c8

[ 2357.266938] encode_sequence: sessionid=5:0:532:824967168 seqid=1
slotid=24 max_slotid=24 cache_this=0
[ 2357.266942] --> nfs41_call_sync_prepare data->seq_server 000000000004335e
[ 2357.266944] RPC:       xs_tcp_send_request(208) = 0
[ 2357.266950] --> nfs41_call_sync_prepare data->seq_server 000000000004335e
[ 2357.271491] RPC:       xs_tcp_state_change client 000000002c9924cf...
[ 2357.271492] RPC:       state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
[ 2357.271494] RPC:       xs_data_ready...
[ 2357.619199] nfs4_close_prepare: begin!
[ 2357.619203] BUG: kernel NULL pointer dereference, address: 00000000000004c8
[ 2357.619204] #PF: supervisor read access in kernel mode
[ 2357.619205] #PF: error_code(0x0000) - not-present page
[ 2357.619207] PGD 11b74f067 P4D 12876b067 PUD 128771067 PMD 0
[ 2357.619210] Oops: 0000 [#2] SMP NOPTI
[ 2357.619212] CPU: 3 PID: 101 Comm: kworker/u8:1 Kdump: loaded
Tainted: G      D           5.15.0-1052-azure #60~20.04.1-Ubuntu
[ 2357.619215] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS 090008  12/07/2018
[ 2357.619216] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 2357.619254] RIP: 0010:xprt_release+0x115/0x140 [sunrpc]
[ 2357.619276] Code: 89 f7 48 8b 40 20 e8 7a c5 9e d1 41 5c 41 5d 41
5e 41 5f 5d c3 cc cc cc cc 48 83 bf a8 00 00 00 00 74 e8 48 8b bf b0
00 00 00 <4c> 3b af c8 04 00 00 75 d8 4c 89 ee e8 1a e4 ff ff eb ce 4c
89 e7
[ 2357.619278] RSP: 0018:ff5867500034bdb0 EFLAGS: 00010286
[ 2357.619280] RAX: 0000000000000005 RBX: ff20d0a84de34630 RCX: 0000000000000000
[ 2357.619281] RDX: 0000000000000000 RSI: ff20d0a8585ef070 RDI: 0000000000000000
[ 2357.619282] RBP: ff5867500034bdd0 R08: 0000000001000000 R09: 0000000000000012
[ 2357.619283] R10: ffffffff93e0a0ec R11: 0000000093e0a0cf R12: 0000000000000000
[ 2357.619284] R13: ff20d0a84de34600 R14: 0000000000000001 R15: ff20d0a84003d000
[ 2357.619285] FS:  0000000000000000(0000) GS:ff20d0abefd80000(0000)
knlGS:0000000000000000
[ 2357.619286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2357.619287] CR2: 00000000000004c8 CR3: 00000001186de005 CR4: 0000000000371ee0
[ 2357.619289] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2357.619290] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 2357.619291] Call Trace:
[ 2357.619293]  <TASK>
[ 2357.619295]  ? show_regs.cold+0x1a/0x1f
[ 2357.619299]  ? __die_body+0x20/0x70
[ 2357.619302]  ? __die+0x2b/0x37
[ 2357.619304]  ? page_fault_oops+0x136/0x2b0
[ 2357.619309]  ? do_user_addr_fault+0x2f2/0x640
[ 2357.619311]  ? _raw_spin_unlock_irqrestore+0xe/0x20
[ 2357.619314]  ? down_trylock+0x2e/0x40
[ 2357.619316]  ? irq_work_queue+0xe/0x30
[ 2357.619320]  ? exc_page_fault+0x71/0x160
[ 2357.619322]  ? asm_exc_page_fault+0x27/0x30
[ 2357.619324]  ? xprt_release+0x115/0x140 [sunrpc]
[ 2357.619346]  rpc_release_resources_task+0x13/0x60 [sunrpc]
[ 2357.619368]  __rpc_execute+0x22d/0x420 [sunrpc]
[ 2357.619390]  rpc_async_schedule+0x30/0x50 [sunrpc]
[ 2357.619409]  process_one_work+0x225/0x3d0
[ 2357.619412]  worker_thread+0x4d/0x3e0
[ 2357.619414]  ? process_one_work+0x3d0/0x3d0
[ 2357.619415]  kthread+0x12a/0x150
[ 2357.619417]  ? set_kthread_struct+0x50/0x50
[ 2357.619418]  ret_from_fork+0x22/0x30
[ 2357.619422]  </TASK>
[ 2357.619422] Modules linked in: nfnetlink rpcsec_gss_krb5
auth_rpcgss nfsv4 nfs lockd grace fscache netfs nls_iso8859_1
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua mlx5_ib ib_uverbs
ib_core kvm_intel mlx5_core kvm binfmt_misc tls crct10dif_pclmul
joydev mlxfw crc32_pclmul psample hid_generic ghash_clmulni_intel
xt_conntrack aesni_intel crypto_simd nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 hyperv_drm cryptd drm_kms_helper libcrc32c xt_owner
iptable_security xt_tcpudp syscopyarea bpfilter sysfillrect sysimgblt
fb_sys_fops cec hyperv_keyboard hid_hyperv rc_core serio_raw hv_netvsc
pata_acpi hid sch_fq_codel msr drm efi_pstore i2c_core sunrpc
ip_tables x_tables autofs4
[ 2357.619458] CR2: 00000000000004c8
[ 2357.619459] ---[ end trace 8bb4378aaa5cb3ed ]---
[ 2357.619460] RIP: 0010:xprt_release+0x115/0x140 [sunrpc]
[ 2357.619483] Code: 89 f7 48 8b 40 20 e8 7a c5 9e d1 41 5c 41 5d 41
5e 41 5f 5d c3 cc cc cc cc 48 83 bf a8 00 00 00 00 74 e8 48 8b bf b0
00 00 00 <4c> 3b af c8 04 00 00 75 d8 4c 89 ee e8 1a e4 ff ff eb ce 4c
89 e7
[ 2357.619484] RSP: 0018:ff5867500836fdb0 EFLAGS: 00010286
[ 2357.619485] RAX: 0000000000000005 RBX: ff20d0a85b0a2030 RCX: 0000000000000000
[ 2357.619486] RDX: 0000000000000000 RSI: ff20d0a8585cdc70 RDI: 0000000000000000
[ 2357.619487] RBP: ff5867500836fdd0 R08: 0000000001000000 R09: 0000000000000001
[ 2357.619488] R10: ffffffff93e06d5c R11: 0000000093e06d3f R12: 0000000000000000
[ 2357.619488] R13: ff20d0a85b0a2000 R14: 0000000000000001 R15: ff20d0a84003d000
[ 2357.619489] FS:  0000000000000000(0000) GS:ff20d0abefd80000(0000)
knlGS:0000000000000000
[ 2357.619490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2357.619491] CR2: 00000000000004c8 CR3: 00000001186de005 CR4: 0000000000371ee0
[ 2357.619492] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2357.619492] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 2357.619494] Kernel panic - not syncing: Fatal exception
[ 2357.640366] Kernel Offset: 0xfc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)


Issue 2:

In my implementation we don't share the state between the source and
destination. the destination returns a different scope. The client
does not seem to reopen the files or regain the locks. you can see
that XID (0x9658e553) that a write which had the lock id from source
was retried in destination, since we did not share the state we
returned BAD_STATEID.



No. Time                                  XID
         SessionId Info
3 2023-03-16 11:04:27.833041 0x62ddeac9 (1658710729)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 4) GETATTR FH:
0x1b6911b1
4 2023-03-16 11:04:27.924840 0x62ddeac9 (1658710729)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 3) GETATTR
6 2023-03-16 11:04:27.925211 0x63ddeac9 (1675487945)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 7) OPEN DH:
0xa27a003d/
7 2023-03-16 11:04:28.015350 0x63ddeac9 (1675487945)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 6) OPEN StateID:
0xb4d2
9 2023-03-16 11:04:28.015722 0x64ddeac9 (1692265161)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 10) READ StateID:
0xb17f Offset: 0 Len: 13336
10 2023-03-16 11:04:28.113724 0x64ddeac9 (1692265161)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 9) READ
12 2023-03-16 11:04:28.114343 0x65ddeac9 (1709042377)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 13) ACCESS FH:
0xa27a003d, [Check: RD MD XT XE]
13 2023-03-16 11:04:28.198397 0x65ddeac9 (1709042377)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 12) ACCESS,
[Allowed: RD MD XT XE]
14 2023-03-16 11:04:28.199844 0x66ddeac9 (1725819593)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 15) OPEN DH:
0x1b6911b1/test
15 2023-03-16 11:04:28.292948 0x66ddeac9 (1725819593)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 14) OPEN StateID:
0x0560
16 2023-03-16 11:04:28.294161 0x67ddeac9 (1742596809)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 17) LOCK FH:
0x256514d0 Offset: 0 Length: 512
17 2023-03-16 11:04:28.384516 0x67ddeac9 (1742596809)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 16) LOCK
18 2023-03-16 11:04:28.385153 0x68ddeac9 (1759374025)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 19) WRITE StateID:
0x0427 Offset: 0 Len: 512
19 2023-03-16 11:04:28.483614 0x68ddeac9 (1759374025)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 18) WRITE
25 2023-03-16 11:05:50.021966 0x69ddeac9 (1776151241)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 26) SEQUENCE
26 2023-03-16 11:05:50.103301 0x69ddeac9 (1776151241)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 25) SEQUENCE
152 2023-03-16 11:30:24.662815 0x7bddeac9 (2078141129)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 151) SEQUENCE
158 2023-03-16 11:31:44.082416 0x7cddeac9 (2094918345)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 160) WRITE StateID:
0x0427 Offset: 0 Len: 512
160 2023-03-16 11:31:46.247001 0x7cddeac9 (2094918345)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 158) WRITE Status:
NFS4ERR_MOVED
162 2023-03-16 11:31:46.247767 0x7dddeac9 (2111695561)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 164) GETATTR FH:
0x1b6911b1
164 2023-03-16 11:31:48.380479 0x7dddeac9 (2111695561)
0200000000000000fd6c00000000d0b2 V4 Reply (Call In 162) GETATTR
169 2023-03-16 11:31:48.492833 0x888e97f8 (2291046392) V4 NULL Call
(Reply In 170)
170 2023-03-16 11:31:48.573895 0x888e97f8 (2291046392) V4 NULL Reply
(Call In 169)
172 2023-03-16 11:31:48.574337 0x898e97f8 (2307823608) V4 Call (Reply
In 174) EXCHANGE_ID
174 2023-03-16 11:31:48.728966 0x898e97f8 (2307823608) V4 Reply (Call
In 172) EXCHANGE_ID
176 2023-03-16 11:31:48.729335 0x8a8e97f8 (2324600824) V4 Call (Reply
In 177) EXCHANGE_ID
177 2023-03-16 11:31:48.825899 0x8a8e97f8 (2324600824) V4 Reply (Call
In 176) EXCHANGE_ID
179 2023-03-16 11:31:48.826418 0x8b8e97f8 (2341378040) V4 Call (Reply
In 180) CREATE_SESSION
180 2023-03-16 11:31:48.915687 0x8b8e97f8 (2341378040)
02000000000000007c0800000000d0b2 V4 Reply (Call In 179) CREATE_SESSION
181 2023-03-16 11:31:48.916179 0x8c8e97f8 (2358155256)
02000000000000007c0800000000d0b2 V4 Call (Reply In 182)
RECLAIM_COMPLETE
182 2023-03-16 11:31:49.004744 0x8c8e97f8 (2358155256)
02000000000000007c0800000000d0b2 V4 Reply (Call In 181)
RECLAIM_COMPLETE
187 2023-03-16 11:31:49.088286 0x9458e553 (2488853843)
02000000000000007c0800000000d0b2 V4 Call (Reply In 189) GETATTR FH:
0x1b6911b1
189 2023-03-16 11:31:50.184343 0x9458e553 (2488853843)
02000000000000007c0800000000d0b2 V4 Reply (Call In 187) GETATTR
191 2023-03-16 11:31:50.184800 0x9558e553 (2505631059)
02000000000000007c0800000000d0b2 V4 Call (Reply In 192) GETATTR FH:
0x1b6911b1
192 2023-03-16 11:31:50.277471 0x9558e553 (2505631059)
02000000000000007c0800000000d0b2 V4 Reply (Call In 191) GETATTR
194 2023-03-16 11:31:50.277744 0x7eddeac9 (2128472777)
0200000000000000fd6c00000000d0b2 V4 Call (Reply In 196)
DESTROY_SESSION
195 2023-03-16 11:31:50.278013 0x9658e553 (2522408275)
02000000000000007c0800000000d0b2 V4 Call (Reply In 199) WRITE StateID:
0x0427 Offset: 0 Len: 512
196 2023-03-16 11:31:50.360818 0x7eddeac9 (2128472777) V4 Reply (Call
In 194) DESTROY_SESSION
198 2023-03-16 11:31:50.361088 0x7fddeac9 (2145249993) V4 Call (Reply
In 204) DESTROY_CLIENTID
199 2023-03-16 11:31:50.362113 0x9658e553 (2522408275)
02000000000000007c0800000000d0b2 V4 Reply (Call In 195) WRITE Status:
NFS4ERR_BAD_STATEID
201 2023-03-16 11:31:50.362740 0x9758e553 (2539185491)
02000000000000007c0800000000d0b2 V4 Call (Reply In 202) TEST_STATEID
202 2023-03-16 11:31:50.445104 0x9758e553 (2539185491)
02000000000000007c0800000000d0b2 V4 Reply (Call In 201) TEST_STATEID
203 2023-03-16 11:31:50.445737 0x9858e553 (2555962707)
02000000000000007c0800000000d0b2 V4 Call (Reply In 207) TEST_STATEID
204 2023-03-16 11:31:50.450554 0x7fddeac9 (2145249993) V4 Reply (Call
In 198) DESTROY_CLIENTID
207 2023-03-16 11:31:50.528249 0x9858e553 (2555962707)
02000000000000007c0800000000d0b2 V4 Reply (Call In 203) TEST_STATEID
208 2023-03-16 11:31:50.528565 0x9958e553 (2572739923)
02000000000000007c0800000000d0b2 V4 Call (Reply In 212) OPEN DH:
0x256514d0/
212 2023-03-16 11:31:50.702049 0x9958e553 (2572739923)
02000000000000007c0800000000d0b2 V4 Reply (Call In 208) OPEN StateID:
0x47d1
213 2023-03-16 11:31:50.702551 0x9a58e553 (2589517139)
02000000000000007c0800000000d0b2 V4 Call (Reply In 214) GETATTR FH:
0x256514d0
214 2023-03-16 11:31:50.791465 0x9a58e553 (2589517139)
02000000000000007c0800000000d0b2 V4 Reply (Call In 213) GETATTR
218 2023-03-16 11:32:10.792955 0x9b58e553 (2606294355)
02000000000000007c0800000000d0b2 V4 Call (Reply In 219) GETATTR FH:
0x256514d0
219 2023-03-16 11:32:10.878344 0x9b58e553 (2606294355)
02000000000000007c0800000000d0b2 V4 Reply (Call In 218) GETATTR

Any kind of pointers would be greatly appreciated. Thanks for your help.

--
-> hari
Success and defeat add beauty to a warrior




[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux