Re: nfs cluster, problem with delete file in the failover case

On Tue, May 12, 2015 at 12:37:10AM +0200, gianpietro.sella@xxxxxxxx wrote:
> > On Sun, May 10, 2015 at 11:28:25AM +0200, gianpietro.sella@xxxxxxxx wrote:
> >> Hi, sorry for my bad English.
> >> I am testing an active/passive NFS cluster (2 nodes).
> >> I followed these instructions for NFS:
> >>
> >> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Administration/s1-resourcegroupcreatenfs-HAAA.html
> >>
> >> I use CentOS 7.1 on the nodes.
> >> The 2 nodes of the cluster share the same iSCSI volume.
> >> The NFS cluster works very well.
> >> I have only one problem.
> >> I mount the folder exported by the NFS cluster on my client node (NFSv3
> >> protocol).
> >> I write a big data file (70GB) into the NFS folder:
> >> dd if=/dev/zero bs=1M count=70000 > /Instances/output.dat
> >> Before the write finishes I put the active node into standby status;
> >> then the resources migrate to the other node.
> >> When the dd write finishes, the file is OK.
> >> I delete the file output.dat.
> >
> > So, the dd and the later rm are both run on the client, and the rm after
> > the dd has completed and exited?  And the rm doesn't happen till after
> > the first migration is completely finished?  What version of NFS are you
> > using?
> >
> > It sounds like a sillyrename problem, but I don't see the explanation.
> >
> > --b.
> 
> 
> Hi Bruce, thanks for your answer.
> Yes, the dd command and the rm command (both on the client node) finish
> without error.
> I use NFSv3, but it is the same with the NFSv4 protocol.
> The OS is CentOS 7.1; the NFS package is nfs-utils-1.3.0-0.8.el7.x86_64.
> The Pacemaker configuration is:
> 
> pcs resource create nfsclusterlv LVM volgrpname=nfsclustervg
> exclusive=true --group nfsclusterha
> 
> pcs resource create nfsclusterdata Filesystem
> device="/dev/nfsclustervg/nfsclusterlv" directory="/nfscluster"
> fstype="ext4" --group nfsclusterha
> 
> pcs resource create nfsclusterserver nfsserver
> nfs_shared_infodir=/nfscluster/nfsinfo nfs_no_notify=true --group
> nfsclusterha
> 
> pcs resource create nfsclusterroot exportfs
> clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash
> directory=/nfscluster/exports fsid=0 --group nfsclusterha
> 
> pcs resource create nfsclusternova exportfs
> clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash
> directory=/nfscluster/exports/nova fsid=1 --group nfsclusterha
> 
> pcs resource create nfsclusterglance exportfs
> clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash
> directory=/nfscluster/exports/glance fsid=2 --group nfsclusterha
> 
> pcs resource create nfsclustervip IPaddr2 ip=192.168.61.180 cidr_netmask=24
> --group nfsclusterha
> 
> pcs resource create nfsclusternotify nfsnotify source_host=192.168.61.180
> --group nfsclusterha
> 
> Now I have done the following test.
> NFS cluster with 2 nodes:
> the first node in standby state,
> the second node in active state.
> I mount the empty (no used space) exported volume on the client with the
> NFSv3 protocol (it is the same with the NFSv4 protocol).
> On the client I write a big file (70GB) into the mount directory with dd
> (it is the same with the cp command).
> While the command is writing the file, I disable the nfsnotify, IPaddr2,
> exportfs and nfsserver resources in this order (pcs resource disable ...)
> and then enable the resources (pcs resource enable ...) in the reverse
> order.
> When I disable the resources the write freezes; when I enable them the
> write resumes without error.
> When the write command has finished I delete the file.
> The mount directory is empty and the used space of the exported volume is
> 0: this is OK.
> Now I repeat the test,
> but this time I also disable/enable the Filesystem resource:
> I disable the nfsnotify, IPaddr2, exportfs, nfsserver and Filesystem
> resources (the write freezes), then enable them in the reverse order (the
> write resumes without error).
> When the write command has finished I delete the file.
> Now the mounted directory is empty (no file) but the used space is not 0,
> it is 70GB.
> This is not OK.
> Now I execute the following command on the active node of the cluster,
> where the volume is exported with NFS:
> mount -o remount /dev/nfsclustervg/nfsclusterlv
> where /dev/nfsclustervg/nfsclusterlv is the exported volume (an iSCSI
> volume configured with LVM).
> After this command the used space in the mounted directory on the client
> is 0: this is OK.
> I think the problem is with the Filesystem resource on the active node of
> the cluster.
> But it is very strange.

So, the only difference between the "good" and "bad" cases was the
addition of the stop/start of the filesystem resource?  I assume that's
equivalent to an umount/mount.
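
Just to make sure I'm reading the sequence right, I take the "bad" case to
be roughly the following (resource names taken from your pcs configuration
above, so treat this as a sketch rather than exact commands):

	# disable, in the order you described:
	pcs resource disable nfsclusternotify    # nfsnotify
	pcs resource disable nfsclustervip       # IPaddr2
	pcs resource disable nfsclusterroot      # exportfs (likewise for
	pcs resource disable nfsclusternova      #  nova and glance)
	pcs resource disable nfsclusterglance
	pcs resource disable nfsclusterserver    # nfsserver
	pcs resource disable nfsclusterdata      # Filesystem -- the extra step
	# ...then "pcs resource enable" each of those in the reverse order...

and stopping/starting nfsclusterdata should boil down to something like:

	umount /nfscluster
	mount -t ext4 /dev/nfsclustervg/nfsclusterlv /nfscluster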

I guess the server's dentry for that file is hanging around for a little
while for some reason.  We've run across at least one problem of that
sort before (see d891eedbc3b1 "fs/dcache: allow d_obtain_alias() to
return unhashed dentries").

In both cases after the restart the first operation the server will get
for that file is a write with a filehandle, and it will have to look up
that filehandle to find the file.  (Whereas without the restart the
initial discovery of the file will be a lookup by name.)

In the "good" case the server already has a dentry cached for that file,
in the "bad" case the umount/mount means that we'll be doing a
cold-cache lookup of that filehandle.
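
If you want to confirm that, turning on knfsd's filehandle debugging on the
server while you reproduce might show the cold lookup happening (this
assumes rpcdebug from nfs-utils is available; the messages end up in the
kernel log), something like:

	rpcdebug -m nfsd -s fh     # enable filehandle debug messages
	dmesg -w                   # watch the log while the write/rm runs
	rpcdebug -m nfsd -c fh     # turn it back off afterwards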

I wonder if the test case can be narrowed down any further....  Is the
large file necessary?  If it's needed only to ensure the writes are
actually sent to the server promptly then it might be enough to do the
nfs mount with -osync.
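
In other words, something along these lines on the client might be enough
to reproduce it with a much smaller file (paths and the VIP are taken from
your earlier messages, so this is only a sketch):

	mount -t nfs -o vers=3,sync 192.168.61.180:/nfscluster/exports/nova /Instances
	dd if=/dev/zero of=/Instances/output.dat bs=1M count=100
	# ...restart the server-side resources (or drop caches, below) mid-write...
	rm /Instances/output.dat
	df /Instances    # does the space actually get freed on the server?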

Instead of the cluster migration or restart, it might be possible to
reproduce the bug just with a

	echo 2 >/proc/sys/vm/drop_caches

run on the server side while the dd is in progress--I don't know if that
will reliably drop the one dentry, though.  Maybe do a few of those in a
row.
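
For example, something like this on the server side while the client dd is
running (again, no guarantee it evicts the one dentry we care about):

	for i in 1 2 3 4 5; do
		echo 2 >/proc/sys/vm/drop_caches    # free dentries and inodes
		sleep 2
	done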

--b.

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster



