Dear nfs developers,
We are running Gitea in Kubernetes in two pods, each on a different VM.
The Gitea data directory is mounted as an NFS share, and we periodically
hit a caching issue where one of the pods is out of sync with the
directory entries:
# kubectl -n ci-cd exec -it gitea-546489b89b-c8j5j -- sh -c 'cd git/repositories/.../objects/pack && ls -la'
total 32627
drwxr-xr-x 2 git git 5 Nov 16 08:59 .
drwxr-xr-x 4 git git 4 Nov 16 08:59 ..
-r--r--r-- 1 git git 28634 Nov 16 08:59 pack-f3648f5e8e42d671dee4868d08fe24fa50b47fac.bitmap
-r--r--r-- 1 git git 134744 Nov 16 08:59 pack-f3648f5e8e42d671dee4868d08fe24fa50b47fac.idx
-r--r--r-- 1 git git 33191104 Nov 16 08:59 pack-f3648f5e8e42d671dee4868d08fe24fa50b47fac.pack
# kubectl -n ci-cd exec -it gitea-546489b89b-d7lcn -- sh -c 'cd git/repositories/.../objects/pack && ls -la'
ls: ./pack-249aee1788eeaca050d1c083e6598d675ba1017e.pack: No such file or directory
ls: ./pack-249aee1788eeaca050d1c083e6598d675ba1017e.bitmap: No such file or directory
ls: ./pack-249aee1788eeaca050d1c083e6598d675ba1017e.idx: No such file or directory
total 25
drwxr-xr-x 2 git git 5 Nov 16 08:59 .
drwxr-xr-x 4 git git 4 Nov 16 08:59 ..
command terminated with exit code 1
The second pod still has directory entries cached for files that no
longer exist. However, if I stat the files that do exist, it succeeds on
the failing pod:
# kubectl -n ci-cd exec -it gitea-546489b89b-d7lcn -- sh -c 'cd git/repositories/.../objects/pack && stat pack-f3648f5e8e42d671dee4868d08fe24fa50b47fac.bitmap'
File: pack-f3648f5e8e42d671dee4868d08fe24fa50b47fac.bitmap
Size: 28634 Blocks: 41 IO Block: 131072 regular file
Device: c0h/192d Inode: 98069 Links: 1
Access: (0444/-r--r--r--) Uid: ( 1000/ git) Gid: ( 1000/ git)
Access: 2023-11-16 08:59:38.176448632 +0000
Modify: 2023-11-16 08:59:38.177839293 +0000
Change: 2023-11-16 08:59:38.203249724 +0000
The directory metadata appears to be the same on both nodes:
# kubectl -n ci-cd exec -it gitea-546489b89b-c8j5j -- sh -c 'cd git/repositories/.../objects/pack && stat .'
File: .
Size: 5 Blocks: 49 IO Block: 131072 directory
Device: 87h/135d Inode: 15013 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 1000/ git) Gid: ( 1000/ git)
Access: 2023-11-16 09:06:02.594066258 +0000
Modify: 2023-11-16 08:59:38.232154965 +0000
Change: 2023-11-16 08:59:38.232154965 +0000
# kubectl -n ci-cd exec -it gitea-546489b89b-d7lcn -- sh -c 'cd git/repositories/.../objects/pack && stat .'
File: .
Size: 5 Blocks: 49 IO Block: 131072 directory
Device: c0h/192d Inode: 15013 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 1000/ git) Gid: ( 1000/ git)
Access: 2023-11-16 09:06:02.594066258 +0000
Modify: 2023-11-16 08:59:38.232154965 +0000
Change: 2023-11-16 08:59:38.232154965 +0000
Issuing ls on the failing pod generates a getattr for the directory. I
assume the client receives the same metadata it has already cached,
concludes that nothing has changed, and then tries to stat() the three
cached entries, which fails.
Touching the affected directory, even from the other node, is enough to
make the failing node recover and get back in sync.
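The check that ls effectively performs here can be sketched as a small
shell function: read the names from the directory listing, then try to
stat each one. This is only a minimal sketch. The function names are
hypothetical, it assumes a POSIX shell whose glob expansion reads the
directory without stat-ing each name, and it ignores names that start
with two dots.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a path can be neither stat-ed nor
# identified as a symlink, i.e. the name has no backing file.
entry_missing() {
    [ ! -e "$1" ] && [ ! -L "$1" ]
}

# Hypothetical helper: print every name returned by the directory
# listing for which stat fails. Returns 1 if any such name was found.
list_stale_entries() {
    dir=$1
    stale=0
    # Glob expansion reads the directory; entry_missing stats each name.
    for path in "$dir"/* "$dir"/.[!.]*; do
        if entry_missing "$path"; then
            # An unmatched glob is left as the literal pattern: skip it.
            case ${path##*/} in
                '*' | '.[!.]*') continue ;;
            esac
            printf 'stale entry: %s\n' "$path"
            stale=1
        fi
    done
    return "$stale"
}
```

Run inside the failing pod as `list_stale_entries
git/repositories/.../objects/pack`, this should report the three
pack-249aee... names for as long as the pod is out of sync, and nothing
once the directory has been touched.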
Both nodes are running the Debian stable kernel:
# uname -a
Linux node 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
The NFS server is a TrueNAS server (FreeBSD).
We use the default mount options:
# mount | grep nfs4
x.x.x.x:/mnt/main/e-sz-k8s/csi/rgbcpj4nw9tywroxrqw8bpw1zmzyu5mg on /var/lib/kubelet/pods/69d3e536-d0a9-4c02-9461-8f14d058ea60/volumes/kubernetes.io~csi/pvc-c76ad5da-9c1b-4692-b768-5469a9517d57/mount type nfs4 (rw,relatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=c.c.c.c,local_lock=none,addr=x.x.x.x)
Is this a cache configuration issue, or a bug in the Linux NFS client or
the FreeBSD NFS server code?
Thanks in advance,
Richard