On 26/02/2021 15:03, Timo Rothenpieler wrote:
I think I can reproduce this, or something that at least looks very
similar to this, on 5.10. Namely on 5.10.17 (On both Client and Server).
I think this is a different issue - see below.
We are running slurm, and for a while now (this coincides with updating
from 5.4 to 5.10, but a whole bunch of other stuff was updated at the
same time, so it took me a while to correlate this) the logs it writes
have appeared truncated, but only while they're being observed on the
client, using tail -f or something like that.
Looks like this then:
On Server:
store01 /srv/export/home/users/timo/TestRun # ls -l slurm-41101.out
-rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
store01 /srv/export/home/users/timo/TestRun # wc -l slurm-41101.out
61 slurm-41101.out
On Client:
timo@login01 ~/TestRun $ ls -l slurm-41101.out
-rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
timo@login01 ~/TestRun $ wc -l slurm-41101.out
24 slurm-41101.out
See https://gist.github.com/BtbN/b9eb4fc08ccc53bb20087bce0bf9f826 for
the respective file-contents.
If I run the same test job, wait until it's done, and then look at its
slurm.out file, it matches between NFS Client and Server.
If I tail -f the slurm.out on an NFS client, the file stops getting
updated on the client, but keeps getting more logs written to it on
the NFS server.
The slurm.out file is being written to by another NFS client, which is
running on one of the compute nodes of the system. It's being read
from a login node.
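A minimal sketch for comparing the two views of the file in one step, rather than running ls and wc separately. It reports the size from stat next to the number of bytes and newlines actually read, since in the listing above the sizes agree while the line counts do not. The function name snapshot is only an illustration, not something from slurm or the NFS tooling.

```python
import os
import sys


def snapshot(path):
    """Return (size reported by stat, bytes actually read, newline count)."""
    st_size = os.stat(path).st_size
    with open(path, "rb") as f:
        data = f.read()
    return st_size, len(data), data.count(b"\n")


if __name__ == "__main__":
    # Run with the same path on the server and on each client, then compare.
    for p in sys.argv[1:]:
        size, nread, lines = snapshot(p)
        print(f"{p}: stat size={size} bytes read={nread} lines={lines}")
```

Running it on the server and on the login node at the same moment should show whether the client disagrees only on content or on attributes as well.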
If these are two different clients, then what you see is possible on NFS
with client-side caching. If you have multiple clients reading/writing
to the same files you usually need to tune the caching options and/or
use locking. I suspect that if you leave it for a while (until the cache
expires) it will sort itself out.
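As a rough illustration of the locking route: the sketch below reads a file while holding a shared advisory lock. It assumes that locking actually reaches the server on this mount (i.e. it is not mounted with nolock), and that the Linux NFS client treats taking a lock as a point to revalidate its cached data, which is my understanding but not something verified on this setup. The helper name read_under_lock is made up for the example. The caching side would be mount options such as actimeo= or noac, described in nfs(5).

```python
import fcntl


def read_under_lock(path):
    """Read the whole file while holding a shared advisory lock."""
    with open(path, "rb") as f:
        # Blocks until the shared lock is granted.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A reader that polls with this instead of a plain tail -f would at least show whether the stale view survives a lock/unlock cycle.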
In my test case there is just one client. It missed a file deletion, and
nothing short of an unmount and remount fixes that. I have waited for 30
mins+. It does not seem to refresh or expire. I also see the opposite
behavior - the bug shows up on 4.x up to at least 5.4. I do not see it
on 5.10.
Brgds,
Timo
On 21.02.2021 16:53, Anton Ivanov wrote:
Client side. This seems to be an entirely client side issue.
A variety of kernels on the clients starting from 4.9 and up to 5.10
using 4.19 servers. I have observed it on a 4.9 client versus 4.9
server earlier.
4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.
At present the server is at 4.19.67 in all tests.
Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2
(2019-11-11) x86_64 GNU/Linux
I can set up a couple of alternative servers during the week, but so
far everything is pointing towards a client fs cache issue, not a
server one.
Brgds,
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/