We see something very similar on our Ceph cluster, starting as of today.

We use a 16-node, 102-OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.). On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting six 2 TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server and a single 100 TB volume, but that had become unwieldy). All of the servers (hypervisors, Ceph storage nodes and VMs) run Ubuntu 14.04.

Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours) and the cluster has been clean since then:

    cluster b1f3f4c8-xxxxx
     health HEALTH_OK
     monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
     osdmap e43476: 125 osds: 125 up, 125 in
      pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
            266 TB used, 187 TB / 454 TB avail
                3319 active+clean
                  17 active+clean+scrubbing+deep
  client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s

At midnight, we run a script that creates an RBD snapshot of every RBD volume attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (the load on the web servers was > 200 and they had lost some of the NFS mounts). Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long, after running for anywhere between 10 and 90 minutes.

We initially suspected a corrupt RBD volume, as it seemed we could trigger the kernel crash simply by running 'ls -l' on one of the volumes, but subsequent 'xfs_repair -n' checks on those RBD volumes showed no problems.

Suspecting a problem with the RBD kernel module, we migrated the NFS server off of its hypervisor and rebooted the hypervisor, but the problem persisted (both on the new hypervisor and on the old one when we migrated it back).

We also changed /etc/default/nfs-kernel-server to start 256 nfsd server threads (even though the defaults had been working fine for over a year) - the exact setting is shown below.
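For reference, the only thing we touched in that file was the thread count. The relevant part of /etc/default/nfs-kernel-server now looks like this (everything else is still at the distribution defaults; 14.04 ships a much smaller value, 8 if I remember correctly):

    # /etc/default/nfs-kernel-server (excerpt)
    # number of nfsd kernel threads to start
    RPCNFSDCOUNT=256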
Only one of our 3 NFS servers crashes (see below for the syslog information) - the other 2 have been fine.

May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd            D ffff88013fd93180     0  1696      2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [<ffffffff8108b6b2>] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [<ffffffff81733868>] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0

Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern to which of the RBD disks starts to act up (see the iostat note further down for how we watch this). We have scoured the log files of the Ceph cluster for any sign of problems but came up empty. The NFS server has almost no load (compared to regular usage), as most sync clients are either turned off (weekend) or have given up connecting to the server.

There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of the 23 OSDs.

We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3).

Our team is completely out of ideas. We have removed the 100 TB volume from the NFS server (we used the downtime to migrate the last of its data to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load), but we don't really expect it to make it until tomorrow.
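A note on what "disk utilization" means above: our graphs come from our monitoring, but the same picture shows up when watching the mapped RBD devices on the NFS server by hand, e.g. with iostat (this is just an illustration of what we look at, not a special tool):

    # extended per-device statistics, refreshed every 5 seconds; the affected volume
    # shows up as an rbdN device with %util pinned at 100 while its read/write
    # throughput drops to nearly zero
    iostat -x 5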
send help

Jens-Christian

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer at switch.ch
http://www.switch.ch
http://www.switch.ch/stories

On 23.05.2015, at 20:38, John-Paul Robinson (Campus) <jpr at uab.edu> wrote:

> We've had an NFS gateway serving up RBD images successfully for over a year. Ubuntu 12.04 and ceph .73 iirc.
>
> In the past couple of weeks we have developed a problem where the NFS clients hang while accessing exported RBD containers.
>
> We see errors on the server about nfsd hanging for 120 sec etc.
>
> The NFS server is still able to successfully interact with the images it is serving. We can export non-RBD shares from the local file system and NFS clients can use them just fine.
>
> There seems to be something weird going on with the rbd and nfs kernel modules.
>
> Our Ceph pool is in a warn state due to an OSD rebalance that is continuing slowly. But the fact that we continue to have good RBD image access directly on the server makes me think this is not related. Also, the NFS server is only a client of the pool; it doesn't participate in it.
>
> Has anyone experienced similar issues?
>
> We do have a lot of images attached to the server, but the issue is there even when we map only a few.
>
> Thanks for any pointers.
>
> ~jpr