Re: Gluster 9.6 changes to fix gluster NFS bug


 



Dear team. I made a new PR (sorry, my inexperience with github.com is showing: I created a new PR instead of updating the old one. It seemed easier to close the old one and use the new one than to fix the old one).

 

In the new PR, I integrated the feedback. Thank you so much.

https://github.com/gluster/glusterfs/pull/4322

I am attaching to this email my notes on reproducing this environment. I used virtual machines and a constrained test environment to duplicate the problem and test the fix. I hope these notes resolve all the outstanding questions.

 

If not, please let me know! Thanks again to all.

 

Erik

 

 

 

From: Jacobson, Erik <erik.jacobson@xxxxxxx>
Date: Monday, March 18, 2024 at 10:22 AM
To: Aravinda <aravinda@xxxxxxxxxxx>
Cc: Gluster Devel <gluster-devel@xxxxxxxxxxx>
Subject: Re: [Gluster-devel] Gluster 9.6 changes to fix gluster NFS bug

I will need to set up an isolated test case.

 

In the meantime, I created a fork and a PR. I marked it as a draft while I try to find an easier test case.

 

https://github.com/gluster/glusterfs/pull/4319

 

From: Aravinda <aravinda@xxxxxxxxxxx>
Date: Saturday, March 16, 2024 at 9:37 AM
To: Jacobson, Erik <erik.jacobson@xxxxxxx>
Cc: Gluster Devel <gluster-devel@xxxxxxxxxxx>
Subject: Re: [Gluster-devel] Gluster 9.6 changes to fix gluster NFS bug

> We ran into some trouble in Gluster 9.3 with the Gluster NFS server. We updated to a supported Gluster 9.6 and reproduced the problem.

 

Please share the reproducer steps. We can include them in our tests if possible.

 

> We understand the Gluster team recommends the use of Ganesha for NFS, but in our specific environment and use case, Ganesha isn’t fast enough. No disrespect intended; we never got the chance to work with the Ganesha team on it.

 

That is totally fine. I think gnfs is disabled in the later versions; you have to build from source to enable it. The only issue I see is that gnfs doesn't support NFSv4, and the NFS+Gluster team has shifted its focus to NFS Ganesha.

 

> We tried to avoid Ganesha and Gluster NFS altogether, using kernel NFS with fuse mounts exported, and that was faster, but failover didn’t work. We could make the mount point highly available but not open files (so when the IP failover happened, the mount point would still function but the open file – a squashfs in this example – would not fail over).

 

Was the Gluster backup volfile server option used, or some other method, for high availability?

 

> So we embarked on a mission to figure out what was going on with the NFS server. I am not an expert in network code or distributed filesystems, so someone with a careful eye would need to check these changes out. However, what I generally found was that the Gluster NFS server requires the layers of gluster to report back ‘errno’ so it can determine whether EINVAL is set (to determine is_eof). In some instances, errno was not being passed down the chain or was being reset to 0. This resulted in NFS traces showing multiple READs for a 1-byte file and the NFS client reporting an “I/O” error. Files above roughly 170M seemed to work fine, likely because the stack of gluster layers involved changes with certain file sizes; however, we did not track that part down.

 

> We found that in one case disabling the NFS performance IO cache fixed the problem for a non-sharded volume, but the problem persisted on a sharded volume. Testing also showed that our environment takes the disabling of the NFS performance IO cache quite hard anyway, so it wasn’t an option for us.

 

> We were curious why the fuse client wasn’t impacted, but our quick look found that fuse doesn’t really use or need errno in the same way Gluster NFS does.

 

> So, the attached patch fixed the issue. Accessing small files in either case above now works properly. We tried running md5sum against large files over NFS and fuse mounts and everything seemed fine.

 

> In our environment, the NFS-exported directories tend to contain squashfs files representing read-only root filesystems for compute nodes, and those worked fine over NFS after the change as well.

 

> If you do not wish to include this patch because Gluster NFS is deprecated, I would greatly appreciate it if someone could validate my work, as our solution will need Gluster NFS enabled for the time being. I am concerned I could have missed a nuance and caused a hard-to-detect problem.

 

We can surely include this patch in the Gluster repo, since many tests still use this feature and it is available for interested users. Thanks for the PR. Please submit the PR to the GitHub repo; I will follow up with the maintainers and update. Let me know if you need any help submitting the PR.

 

--

Thanks and Regards

Aravinda

Kadalu Technologies

 

 

 

---- On Thu, 14 Mar 2024 01:32:50 +0530 Jacobson, Erik <erik.jacobson@xxxxxxx> wrote ---

 

Hello team.

 

We ran into some trouble in Gluster 9.3 with the Gluster NFS server. We updated to a supported Gluster 9.6 and reproduced the problem.

 

We understand the Gluster team recommends the use of Ganesha for NFS, but in our specific environment and use case, Ganesha isn’t fast enough. No disrespect intended; we never got the chance to work with the Ganesha team on it.

 

We tried to avoid Ganesha and Gluster NFS altogether, using kernel NFS with fuse mounts exported, and that was faster, but failover didn’t work. We could make the mount point highly available but not open files (so when the IP failover happened, the mount point would still function but the open file – a squashfs in this example – would not fail over).
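For reference, the kernel-NFS-over-FUSE attempt looks roughly like the sketch below. The volume name, export path, and subnet are placeholders, not our production values; note that FUSE filesystems generally need an explicit fsid= in /etc/exports before the kernel NFS server will export them.

# mount the gluster volume via FUSE on the NFS server node
mount -t glusterfs gluster1:/sharded /srv/gluster/sharded

# /etc/exports entry for the FUSE mount (fsid= is required for FUSE filesystems)
#   /srv/gluster/sharded 192.168.128.0/24(ro,fsid=1001,no_subtree_check)

exportfs -ra    # reload the export table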

 

So we embarked on a mission to figure out what was going on with the NFS server. I am not an expert in network code or distributed filesystems, so someone with a careful eye would need to check these changes out. However, what I generally found was that the Gluster NFS server requires the layers of gluster to report back ‘errno’ so it can determine whether EINVAL is set (to determine is_eof). In some instances, errno was not being passed down the chain or was being reset to 0. This resulted in NFS traces showing multiple READs for a 1-byte file and the NFS client reporting an “I/O” error. Files above roughly 170M seemed to work fine, likely because the stack of gluster layers involved changes with certain file sizes; however, we did not track that part down.

 

We found that in one case disabling the NFS performance IO cache fixed the problem for a non-sharded volume, but the problem persisted on a sharded volume. Testing also showed that our environment takes the disabling of the NFS performance IO cache quite hard anyway, so it wasn’t an option for us.

 

We were curious why the fuse client wasn’t impacted, but our quick look found that fuse doesn’t really use or need errno in the same way Gluster NFS does.

 

So, the attached patch fixed the issue. Accessing small files in either case above now works properly. We tried running md5sum against large files over NFS and fuse mounts and everything seemed fine.

 

In our environment, the NFS-exported directories tend to contain squashfs files representing read-only root filesystems for compute nodes, and those worked fine over NFS after the change as well.

 

If you do not wish to include this patch because Gluster NFS is deprecated, I would greatly appreciate it if someone could validate my work, as our solution will need Gluster NFS enabled for the time being. I am concerned I could have missed a nuance and caused a hard-to-detect problem.

 

Thank you all!

 

patch.txt attached.


 

 

1. Set up 3 sles15sp5 virtual machines
   80G virtual disk
   "house" network (NAT)
   "private" network (shared just among the 3 servers)
2. Install sles15sp5 pretty normally
   - SUSE Linux Enterprise Server 15 SP5
   - no special software added
   - Text mode
   - Partitioning:
       - I took the defaults but turned /home into /data (XFS)

3. Set the hostnames to 'gluster1', 'gluster2', 'gluster3'
    (hostnamectl set-hostname)
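For example (hostnames as above; the /etc/hosts entries for the private network are my assumption about how the addresses used later map to the nodes):

# on the first VM; repeat with gluster2 / gluster3 on the others
hostnamectl set-hostname gluster1

# optional name resolution for the private network, run on all three nodes
cat >> /etc/hosts <<'EOF'
192.168.128.2 gluster1
192.168.128.3 gluster2
192.168.128.4 gluster3
EOF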


Now we have 3 sles15sp5 servers with a default setup, except that all have an
XFS filesystem mounted at /data (to be used with gluster).


All have a sles15sp5 (virtual) DVD; I enabled its repos in zypper and added the
HPC repo for pdsh.

I installed pdsh and pdsh-dshgroup to make this task easier and defined a
pdsh group named 'gluster' that holds the 3 nodes.
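For reference, the pdsh dshgroup module reads one hostname per line from a group file, so the 'gluster' group looks something like this (either location works):

# /etc/dsh/group/gluster   (or ~/.dsh/group/gluster)
gluster1
gluster2
gluster3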



Some dependencies
------------------------------------------------------------------------------
pdsh -g gluster zypper install --no-confirm liburing1



Install Unpatched gluster:
------------------------------------------------------------------------------
Build glusterfs 9.6 without the patch

Install the rpms on the 3 test servers.


I set up pdsh with the dsh group 'gluster' covering all 3 nodes (as above).

Unpatched gluster 9.6 packages were copied to all three servers in
/root/gluster-nopatch

Install (didn't bother to make a repo):

pdsh -g gluster rpm -Uvh \
/root/gluster-nopatch/glusterfs-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm  \
/root/gluster-nopatch/libglusterfs0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm  \
/root/gluster-nopatch/libgfchangelog0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm \
/root/gluster-nopatch/libglusterd0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm  \
/root/gluster-nopatch/libgfapi0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm \
/root/gluster-nopatch/libgfrpc0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm  \
/root/gluster-nopatch/libgfxdr0-9.6-150400.100.7730.1550.240320T1310.a.sles15sp5hpeerikjno_errno_patch.x86_64.rpm
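A quick sanity check that all three nodes ended up on the same build (package names as installed above):

pdsh -g gluster 'rpm -q glusterfs libglusterfs0' | sort
pdsh -g gluster 'gluster --version | head -1'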




Configure gluster - Base setup
------------------------------------------------------------------------------
Note: the kernel NFS server is not installed (on purpose, as this test is about
gluster NFS)

pdsh -g gluster systemctl enable glusterd
pdsh -g gluster systemctl start glusterd

# Simple test case let us not worry about firewall
pdsh -g gluster systemctl stop firewalld
pdsh -g gluster systemctl disable firewalld

gluster peer probe 192.168.128.2 # not needed since localhost
gluster peer probe 192.168.128.3
gluster peer probe 192.168.128.4


Verified each host shows two peers
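For example, either of these should list the other two nodes as connected peers on every host:

pdsh -g gluster 'gluster peer status | grep -c "Peer in Cluster (Connected)"'   # expect 2 on each node
pdsh -g gluster gluster pool list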





Configure volume - sharded example
------------------------------------------------------------------------------
pdsh -g gluster mkdir /data/sharded

gluster volume create sharded replica 3 transport tcp 192.168.128.2:/data/sharded 192.168.128.3:/data/sharded 192.168.128.4:/data/sharded

gluster volume set sharded performance.cache-size 512MB
gluster volume set sharded performance.client-io-threads on
gluster volume set sharded performance.nfs.io-cache on
gluster volume set sharded nfs.nlm off
gluster volume set sharded nfs.ports-insecure off
gluster volume set sharded nfs.export-volumes on
gluster volume set sharded features.shard on
gluster volume set sharded nfs.disable off
   # answer yes
gluster volume start sharded
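Before mounting, it is worth confirming that the gluster NFS server came up for the volume; with nfs.disable off, the volume status should include an NFS Server entry per node:

gluster volume status sharded
# look for "NFS Server on localhost ... Online Y" style rows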



Sharded Problem Duplication:
------------------------------------------------------------------------------
I just ran this locally on one of the servers. I used 192.168.128.2, but testing
has shown that when the problem happens, it happens from any NFS client anywhere.

mkdir -p /mnt/sharded/fuse
mkdir -p /mnt/sharded/nfs


#FUSE mount:
mount -t glusterfs localhost:/sharded /mnt/sharded/fuse

systemctl restart glusterd
# Not sure why I had to restart glusterd again here

# NFS Mount:
mount -t nfs -o vers=3 localhost:/sharded /mnt/sharded/nfs


# Make a big and a small test file - from the fuse mount
cd /mnt/sharded/fuse
dd if=/dev/random of=testfile bs=1024k count=1024


# Confirm md5sum the same on both nfs mount and fuse mount
# It should be. This always works (big files always work)
md5sum /mnt/sharded/fuse/testfile /mnt/sharded/nfs/testfile

# Now make a 1-byte file on the fuse mount
echo -n 1 > /mnt/sharded/fuse/small-testfile

# Now do an md5sum of fuse vs nfs - We reproduce the problem. Output:


gluster1:/mnt/sharded/fuse # md5sum /mnt/sharded/fuse/small-testfile  /mnt/sharded/nfs/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/sharded/fuse/small-testfile
md5sum: /mnt/sharded/nfs/small-testfile: Input/output error
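If you want to see the repeated READ calls mentioned earlier in the thread, a packet capture along these lines should work in this loopback setup (interface, filter, and file name are just examples; gluster NFS serves on the standard port 2049):

tcpdump -i lo -s 0 -w /tmp/gnfs-small-read.pcap port 2049 &
TCPDUMP_PID=$!
md5sum /mnt/sharded/nfs/small-testfile    # triggers the failing READs
kill $TCPDUMP_PID
tshark -r /tmp/gnfs-small-read.pcap -Y nfs    # or open the capture in wireshark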





Configure volume - NON-sharded example
------------------------------------------------------------------------------
pdsh -g gluster mkdir /data/NON-sharded

gluster volume create NON-sharded replica 3 transport tcp 192.168.128.2:/data/NON-sharded 192.168.128.3:/data/NON-sharded 192.168.128.4:/data/NON-sharded

gluster volume set NON-sharded performance.cache-size 512MB
gluster volume set NON-sharded performance.client-io-threads on
gluster volume set NON-sharded performance.nfs.io-cache on
gluster volume set NON-sharded nfs.nlm off
gluster volume set NON-sharded nfs.ports-insecure off
gluster volume set NON-sharded nfs.export-volumes on
gluster volume set NON-sharded nfs.disable off
   # answer yes
gluster volume start NON-sharded
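With both volumes started, both should show up as gNFS exports. A quick check (showmount comes from the normal NFS client utilities, which are already present for the NFS mounts below):

showmount -e localhost
# expect /sharded and /NON-sharded in the export list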





NON-Sharded Problem Duplication:
------------------------------------------------------------------------------
Like the sharded case, I just ran this locally on one of the servers, but
testing has shown it happens from an NFS client in any location.

mkdir -p /mnt/NON-sharded/fuse
mkdir -p /mnt/NON-sharded/nfs


#FUSE mount:
mount -t glusterfs localhost:/NON-sharded /mnt/NON-sharded/fuse

systemctl restart glusterd
# Not sure why I had to restart glusterd again here

# NFS Mount:
mount -t nfs -o vers=3 localhost:/NON-sharded /mnt/NON-sharded/nfs


# Make a big and a small test file - from the fuse mount
cd /mnt/NON-sharded/fuse
dd if=/dev/random of=testfile bs=1024k count=1024


# Confirm md5sum the same on both nfs mount and fuse mount
# It should be. (this works)
md5sum /mnt/NON-sharded/fuse/testfile /mnt/NON-sharded/nfs/testfile

# Now make a 1-byte file on the fuse mount
echo -n 1 > /mnt/NON-sharded/fuse/small-testfile


# Check the md5sum. NFS gives an IO error in the fault condition.
md5sum /mnt/NON-sharded/fuse/small-testfile  /mnt/NON-sharded/nfs/small-testfile

PROBLEM reproduced here too:

gluster1:/mnt/NON-sharded/fuse # md5sum /mnt/NON-sharded/fuse/small-testfile  /mnt/NON-sharded/nfs/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/NON-sharded/fuse/small-testfile
md5sum: /mnt/NON-sharded/nfs/small-testfile: Input/output error
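The gluster NFS server log is also worth a look when the error fires; with a default install it should be at the usual location (adjust if your build logs elsewhere):

tail -n 50 /var/log/glusterfs/nfs.log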






Problem Resolved with Patch
------------------------------------------------------------------------------
Created a new set of packages but with the errno patch (corrected per the
PR) included.



pdsh -g gluster mkdir -p /root/gluster-with-patch

# This copies the rpms to the first gluster server
scp *.rpm root@192.168.1.61:gluster-with-patch/

# copies to the other two
pdcp -g gluster /root/gluster-with-patch/* /root/gluster-with-patch/

# Perform update
pdsh -g gluster rpm -Fvh /root/gluster-with-patch/*.rpm


# It is good to check if the glusterfs process serving NFS restarted or not.
# It must have restarted for the fix to hold.
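One way to do that check, and to force the restart plus a remount if needed (the volfile-id match is the usual gNFS process signature; adjust the pattern if your ps output differs):

# see when the NFS glusterfs process started on each node
pdsh -g gluster 'ps -o lstart=,cmd= -C glusterfs | grep "volfile-id gluster/nfs"'

# if it predates the package update, bounce glusterd (which respawns gNFS) and remount
pdsh -g gluster systemctl restart glusterd
umount /mnt/sharded/nfs /mnt/NON-sharded/nfs
mount -t nfs -o vers=3 localhost:/sharded /mnt/sharded/nfs
mount -t nfs -o vers=3 localhost:/NON-sharded /mnt/NON-sharded/nfs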


Now we repeat the working and previously failing md5sums

SUCCESS: all md5sums report values now, no IO errors

gluster1:~ # md5sum /mnt/NON-sharded/fuse/testfile /mnt/NON-sharded/nfs/testfile
b4b85d33d083374ea2b6cf1cb2e3039a  /mnt/NON-sharded/fuse/testfile
b4b85d33d083374ea2b6cf1cb2e3039a  /mnt/NON-sharded/nfs/testfile
gluster1:~ # md5sum /mnt/NON-sharded/fuse/small-testfile  /mnt/NON-sharded/nfs/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/NON-sharded/fuse/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/NON-sharded/nfs/small-testfile
gluster1:~ # md5sum /mnt/sharded/fuse/testfile /mnt/sharded/nfs/testfile
50397a73f68f272c28dd212671e22722  /mnt/sharded/fuse/testfile
50397a73f68f272c28dd212671e22722  /mnt/sharded/nfs/testfile
gluster1:~ # md5sum /mnt/sharded/fuse/small-testfile  /mnt/sharded/nfs/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/sharded/fuse/small-testfile
c4ca4238a0b923820dcc509a6f75849b  /mnt/sharded/nfs/small-testfile
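The same comparison can be scripted into a single pass/fail check over both volumes (file names as created above):

for vol in sharded NON-sharded; do
  for f in testfile small-testfile; do
    fuse=$(md5sum < /mnt/$vol/fuse/$f)
    nfs=$(md5sum < /mnt/$vol/nfs/$f)
    [ "$fuse" = "$nfs" ] && echo "OK   $vol/$f" || echo "FAIL $vol/$f"
  done
done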



-------

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel

