Jens-Christian, how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it? (A quick FD-count check is sketched further down.)

In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs. No one has complained so far, but the load/usage is very minimal. If this problem really exists, then we will have millions of complaints as soon as the trial phase is over :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm

Best regards,
George

> I think we (i.e. Christian) found the problem:
>
> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>
> So no deep scrubbing etc., but simply too many connections?
>
> cheers
> jc
>
> --
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fischer at switch.ch [3]
> http://www.switch.ch
>
> http://www.switch.ch/stories
>
> On 25.05.2015, at 06:02, Christian Balzer wrote:
>
>> Hello,
>>
>> let's compare your case with John-Paul's.
>>
>> Different OS and Ceph versions (thus we can assume different NFS versions as well).
>> The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>
>> Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or the accumulation of several distributed delays).
>>
>> You added 23 OSDs, so tell us more about your cluster, HW and network.
>> Were these added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes?), and how busy are your network and CPUs?
>> Running something like collectd to gather all Ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular.
>> Otherwise run atop on your storage nodes to identify whether CPU, network or specific HDDs/OSDs are bottlenecks.
>>
>> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that until it hurts again), or if you turn off deep scrubs altogether for the moment?
>>
>> Christian
>>
>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>
>>> We see something very similar on our Ceph cluster, starting as of today.
>>>
>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.)
>>>
>>> On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting six 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy).
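A quick way to verify the FD theory from the top of this mail is to count the descriptors each QEMU process actually holds and compare that against its soft limit. A minimal sketch, assuming standard /proc paths (adjust the pgrep pattern to however QEMU is named on your hypervisors):

    # show open FD count and soft limit for every QEMU process
    for pid in $(pgrep -f qemu); do
        fds=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
        limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
        echo "qemu pid $pid: $fds open FDs (soft limit $limit)"
    done

If the count sits near 1024 while all volumes are busy, raising the limit (for libvirt-managed guests usually via the max_files setting in /etc/libvirt/qemu.conf, otherwise via the ulimit of whatever launches QEMU - check your own setup) should make the hangs go away and confirm the theory.

For Christian's scrub-throttling suggestion above, the usual way to apply it to a running Firefly cluster is something like:

    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # throttle scrub I/O
    ceph osd set nodeep-scrub                            # stop scheduling new deep scrubs
    ceph osd unset nodeep-scrub                          # re-enable them later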
>>> All of the servers (hypervisors, Ceph storage nodes and VMs) are using Ubuntu 14.04.
>>>
>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:
>>>
>>>   cluster b1f3f4c8-xxxxx
>>>    health HEALTH_OK
>>>    monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>    osdmap e43476: 125 osds: 125 up, 125 in
>>>    pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>          266 TB used, 187 TB / 454 TB avail
>>>          3319 active+clean
>>>            17 active+clean+scrubbing+deep
>>>    client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>
>>> At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers, for backup purposes (a sketch of such a loop appears further down, after the kernel trace). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts).
>>>
>>> Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run for between 10 and 90 minutes.
>>>
>>> We initially suspected a corrupt RBD volume, as it seemed that we could trigger the kernel crash simply by running "ls -l" on one of the volumes, but subsequent "xfs_repair -n" checks on those RBD volumes showed no problems.
>>>
>>> We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back).
>>>
>>> We changed /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year).
>>>
>>> Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine.
>>>
>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd    D ffff88013fd93180    0  1696    2 0x00000000
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>
>>> Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern as to which of the RBD disks starts to act up.
>>>
>>> We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.
>>>
>>> The NFS server has almost no load (compared to regular usage) as most sync clients are either turned off (weekend) or have given up connecting to the server.
>>>
>>> There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of 23 OSDs.
>>>
>>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>
>>> Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load) but we don't really expect it to make it until tomorrow.
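Regarding the midnight snapshot job mentioned above: such a script typically boils down to a loop like the one below. This is only a sketch - the pool name "volumes" is made up, and quiescing the filesystems first (e.g. with fsfreeze on the NFS server) helps to get consistent snapshots:

    pool=volumes                                   # hypothetical pool holding the NFS server images
    for img in $(rbd -p $pool ls); do
        rbd -p $pool snap create $img@backup-$(date +%Y%m%d)
    done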
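And on the "disk utilization goes to 100%" observation: while a volume is stuck it can be useful to watch the device from inside the NFS VM with something like "iostat -x 1" (sysstat package) and at the same time check the cluster for blocked requests, e.g.:

    ceph health detail    # health summary, including any slow/blocked request warnings
    ceph -w               # watch the cluster log in real time

If the Ceph side stays quiet while a device in the VM sits at 100% with no I/O completing, that points at the client side (QEMU/librbd) rather than the OSDs - which would be consistent with the FD-limit finding at the top of the thread.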
>>>
>>> send help
>>> Jens-Christian
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com [1]    Global OnLine Japan/Fusion Communications
>> http://www.gol.com/ [2]
>
>
> Links:
> ------
> [1] mailto:chibi at gol.com
> [2] http://www.gol.com/
> [3] mailto:jens-christian.fischer at switch.ch
> [4] mailto:chibi at gol.com
--