Dear (p)NFS-ors,

we are observing a VERY unpleasant situation with pNFS in production. Our hosts run multiple DSes on different ports, usually 24001-24009. With CentOS7 (3.10.0-514.6.2.el7.x86_64) we see that the client uses the wrong port number when it talks to a data server: if the client uses different DSes on the same host, then at some point it starts to send data to the wrong port.

Client <=> MDS:

  1 0.000000000 131.169.251.53 → 131.169.51.35 NFS V4 Call OPEN DH: 0x7cbc716b/MIL-68-onebatch-80C-30s-00057.tif.metadata
  2 0.001469799 131.169.51.35 → 131.169.251.53 NFS V4 Reply (Call In 1) OPEN StateID: 0xec18
  3 0.001578128 131.169.251.53 → 131.169.51.35 NFS V4 Call SETATTR FH: 0x6ccf3dfa
  4 0.002657187 131.169.51.35 → 131.169.251.53 NFS V4 Reply (Call In 3) SETATTR
  5 0.003243819 131.169.251.53 → 131.169.51.35 NFS V4 Call LAYOUTGET
  6 0.014603386 131.169.51.35 → 131.169.251.53 NFS V4 Reply (Call In 5) LAYOUTGET
  7 0.014899121 131.169.251.53 → 131.169.51.35 NFS V4 Call GETDEVINFO
  8 0.015014216 131.169.51.35 → 131.169.251.53 NFS V4 Reply (Call In 7) GETDEVINFO

      Opcode: GETDEVINFO (47)
          Status: NFS4_OK (0)
          layout type: LAYOUT4_NFSV4_1_FILES (1)
          device index: 0
          r_netid: tcp
              length: 3
              contents: tcp
              fill bytes: opaque data
          r_addr: 131.169.51.50.93.197
              length: 20
              contents: 131.169.51.50.93.197
          r_netid: tcp
              length: 3
              contents: tcp
              fill bytes: opaque data
          r_addr: 131.169.51.50.93.197
              length: 20
              contents: 131.169.51.50.93.197
          notification bitmap: 6
          notification bitmap: 0
      [Main Opcode: GETDEVINFO (47)]

  9 0.105442455 131.169.251.53 → 131.169.51.35 NFS V4 Call TEST_STATEID
 10 0.105521354 131.169.51.35 → 131.169.251.53 NFS V4 Reply (Call In 9) TEST_STATEID

NOTICE that the r_addr 131.169.51.50.93.197 corresponds to port 24005 (93 * 256 + 197 = 24005).
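For anyone who wants to double-check the port arithmetic: the r_addr is a universal address, where the last two dot-separated fields are the high and low bytes of the port. Below is a minimal Python sketch of that decoding for the IPv4 case only. The helper name parse_uaddr is made up for illustration and is not taken from any NFS tool or from the kernel.

```python
from typing import Tuple


def parse_uaddr(uaddr: str) -> Tuple[str, int]:
    """Split an IPv4 universal address 'h1.h2.h3.h4.p1.p2' into (host, port).

    Hypothetical helper for illustration; handles the IPv4 form only.
    """
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("expected 6 dot-separated fields, got %d" % len(parts))

    fields = [int(p) for p in parts]
    if any(f < 0 or f > 255 for f in fields):
        raise ValueError("every field must fit in one byte: %r" % uaddr)

    host = ".".join(parts[:4])
    # The port is encoded as two bytes: high byte first, then low byte.
    port = fields[4] * 256 + fields[5]
    return host, port


# The r_addr returned by the MDS in the GETDEVINFO reply above.
print(parse_uaddr("131.169.51.50.93.197"))
```

The same function can be applied to any other r_addr from a GETDEVINFO reply to see which DS port the MDS actually advertised.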
Client <=> DS:

  $ tshark -r ds-write.pcap -n -z conv,tcp
  1 0.000000 131.169.251.53 → 131.169.51.50 NFS V4 Call WRITE StateID: 0xff01 Offset: 0 Len: 3968
  2 0.000090 131.169.51.50 → 131.169.251.53 NFS V4 Reply (Call In 1) WRITE Status: NFS4ERR_BAD_STATEID
  ================================================================================
  TCP Conversations
  Filter:<No Filter>
                                             |      <-      | |      ->      | |    Total     |  Relative   | Duration |
                                             | Frames Bytes | | Frames Bytes | | Frames Bytes |    Start    |          |
  131.169.51.50:24006 <-> 131.169.251.53:847       1   4240        1    168        2   4408    0.000000000     0.0001
  ================================================================================

NOTICE that the client talks to the DS on port 24006, not on the advertised port 24005!

Is there a known fix which is missing in CentOS7? I can't reproduce it with a 4.9 kernel (or it is harder to reproduce there).

The packet captures are attached.

Tigran.
Attachment:
ds-write.pcapng
Description: application/pcapng
Attachment:
mds.pcapng
Description: application/pcapng