NFS interaction with RBD

I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he put load on all of the disks, we started to see the same 120 second timeouts. We realized that the QEMU process on the hypervisor opens a TCP connection to every OSD for every mounted volume - with 9 volumes and 125 OSDs that is potentially more than 1,100 sockets, blowing past the 1024 FD limit.

So it's not deep scrubbing etc., but simply too many connections?
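
For anyone wanting to check for the same thing, a rough sanity check looks something like this (the pgrep pattern and the libvirt max_files knob are assumptions; adjust to your environment):

    # count open FDs of the qemu process backing the VM (instance name is a placeholder)
    pid=$(pgrep -f 'qemu.*instance-000000ab' | head -n1)
    ls /proc/$pid/fd | wc -l

    # compare against the limit the process is actually running with
    grep 'Max open files' /proc/$pid/limits

    # if needed, raise the limit for libvirt-managed guests, e.g. via the
    # max_files setting in /etc/libvirt/qemu.conf, then restart libvirtd
    # and the guest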

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer at switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer <chibi at gol.com> wrote:

> 
> Hello,
> 
> Let's compare your case with John-Paul's.
> 
> Different OS and Ceph versions (thus we can assume different NFS versions
> as well).
> The only common thing is that both of you added OSDs and are likely
> suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
> 
> Ceph logs will only pipe up when things have been blocked for more than 30
> seconds; NFS might take offense at lower values (or at the accumulation of
> several distributed delays).
> 
> You added 23 OSDs; tell us more about your cluster, HW and network.
> Were these added to the existing 16 nodes, or are they on new storage nodes
> (so could there be something different about those nodes?), and how busy
> are your network and CPUs?
> Running something like collectd to gather all the Ceph perf data and other
> data from the storage nodes and then feeding it to graphite (or similar)
> can be VERY helpful for identifying whether something is going wrong and
> what it is in particular.
> Otherwise run atop on your storage nodes to identify whether CPU, network,
> or specific HDDs/OSDs are the bottleneck.
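> A minimal way to pull those counters by hand, as a starting point (the OSD
> id and the graphite host below are placeholders):
>
>     # dump the perf counters of one OSD via its admin socket
>     ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
>
>     # graphite's plaintext protocol is just "path value timestamp" on port 2003
>     echo "ceph.osd0.test 42 $(date +%s)" | nc graphite.example.com 2003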
> 
> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject
> an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that
> until it hurts again), or if you turn off deep scrubs altogether for the
> moment?
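> For reference, that is:
>
>     # slow scrubbing down on all OSDs without restarting anything
>     ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'
>
>     # or switch deep scrubs off entirely while debugging (and back on later)
>     ceph osd set nodeep-scrub
>     ceph osd unset nodeep-scrub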
> 
> Christian
> 
> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
> 
>> We see something very similar on our Ceph cluster, starting as of today.
>> 
>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
>> OpenStack cluster (we applied the RBD patches for live migration etc)
>> 
>> On this cluster we have a big ownCloud installation (Sync & Share) that
>> stores its files on three NFS servers, each mounting six 2 TB RBD volumes
>> and exposing them to around 10 web server VMs (we originally started
>> with one NFS server and a single 100 TB volume, but that had become
>> unwieldy). All of the servers (hypervisors, Ceph storage nodes and VMs)
>> run Ubuntu 14.04.
>> 
>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
>> OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
>> rebalancing process ended this morning (after around 12 hours). The
>> cluster has been clean since then:
>> 
>>     cluster b1f3f4c8-xxxxx
>>      health HEALTH_OK
>>      monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>      osdmap e43476: 125 osds: 125 up, 125 in
>>       pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>             266 TB used, 187 TB / 454 TB avail
>>                 3319 active+clean
>>                   17 active+clean+scrubbing+deep
>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>> 
>> At midnight, we run a script that creates an RBD snapshot of all RBD
>> volumes that are attached to the NFS servers (for backup purposes).
>> Looking at our monitoring, around that time one of the NFS servers
>> became unresponsive and took down the complete ownCloud installation
>> (the load on the web servers was > 200 and they had lost some of the NFS
>> mounts).
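>> (The script boils down to a loop of "rbd snap create" calls, roughly like
>> the sketch below; the pool and volume names here are placeholders, not our
>> real ones:)
>>
>>     for vol in nfs1-vol1 nfs1-vol2 nfs1-vol3; do
>>         rbd snap create volumes/$vol@backup-$(date +%Y%m%d)
>>     done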
>> 
>> Rebooting the NFS server solved that problem, but the NFS kernel server
>> kept crashing all day long after running for between 10 and 90 minutes.
>> 
>> We initially suspected a corrupt RBD volume (it seemed we could trigger
>> the kernel crash simply by running 'ls -l' on one of the volumes), but
>> subsequent 'xfs_repair -n' checks on those RBD volumes showed no
>> problems.
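>> (Per volume the check was essentially the following; the mount point and
>> device name are examples only:)
>>
>>     umount /srv/nfs/vol1        # xfs_repair wants the filesystem unmounted
>>     xfs_repair -n /dev/vdb      # -n = no-modify, report only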
>> 
>> We migrated the NFS server off its hypervisor, suspecting a problem with
>> the RBD kernel modules, and rebooted the hypervisor, but the problem
>> persisted (both on the new hypervisor and on the old one when we
>> migrated it back).
>> 
>> We changed /etc/default/nfs-kernel-server to start 256 server threads
>> (even though the defaults had been working fine for over a year).
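>> (i.e. the RPCNFSDCOUNT setting in /etc/default/nfs-kernel-server on
>> Ubuntu:)
>>
>>     RPCNFSDCOUNT=256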
>> 
>> Only one of our three NFS servers crashes (see below for the syslog
>> information); the other two have been fine.
>> 
>> May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>> May 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>> May 23 21:44:28 drive-nfs1 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>> May 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>> May 23 21:44:28 drive-nfs1 kernel: [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>> May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init: plymouth-upstart-bridge main process ended, respawning
>> May 23 21:51:26 drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>> May 23 21:51:26 drive-nfs1 kernel: [  600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
>> May 23 21:51:26 drive-nfs1 kernel: [  600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781504] nfsd            D ffff88013fd93180     0  1696      2 0x00000000
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call Trace:
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781523]  [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781526]  [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781528]  [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781557]  [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781568]  [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781613]  [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781638]  [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781662]  [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781678]  [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781687]  [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781696]  [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781702]  [<ffffffff8108b6b2>] kthread+0xd2/0xf0
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781707]  [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781712]  [<ffffffff81733868>] ret_from_fork+0x58/0x90
>> May 23 21:51:26 drive-nfs1 kernel: [  600.781717]  [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
>> 
>> Before each crash, we see the disk utilization of one or two random
>> mounted RBD volumes go to 100%; there is no pattern as to which of the
>> RBD disks starts to act up.
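>> (For anyone watching for the same symptom: it shows up as the %util
>> column pegging at 100 in something like
>>
>>     iostat -xk 5
>>
>> from the sysstat package.)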
>> 
>> We have scoured the log files of the Ceph cluster for any signs of
>> problems but came up empty.
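>> (In case someone wants to repeat that exercise, the quick checks are
>> along these lines:
>>
>>     ceph health detail
>>     grep -i 'slow request' /var/log/ceph/ceph.log
>>
>> Neither turned up anything for us.)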
>> 
>> The NFS server has almost no load (compared to regular usage) as most
>> sync clients are either turned off (weekend) or have given up connecting
>> to the server. 
>> 
>> There haven't been any configuration changes on the NFS servers prior to
>> the problems. The only change was the addition of the 23 OSDs.
>> 
>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>> 
>> Our team is completely out of ideas. We have removed the 100 TB volume
>> from the NFS server (we used the downtime to migrate the last data off
>> it to one of the smaller volumes). The NFS server has been running for
>> 30 minutes now (with close to no load), but we don't really expect it
>> to make it until tomorrow.
>> 
>> send help
>> Jens-Christian
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/
