NFS interaction with RBD

George,

I will let Christian provide you with the details. As far as I know, it was enough to just do an 'ls' on all of the attached drives.

We are using QEMU 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu                           1.0.0+git-20131111.c3d1e78-2ubuntu1   all          PXE boot firmware - ROM images for qemu
ii  qemu-keymaps                        2.0.0+dfsg-2ubuntu1.11                all          QEMU keyboard maps
ii  qemu-system                         2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries
ii  qemu-system-arm                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (arm)
ii  qemu-system-common                  2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-mips                    2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (mips)
ii  qemu-system-misc                    2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc                   2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (sparc)
ii  qemu-system-x86                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (x86)
ii  qemu-utils                          2.0.0+dfsg-2ubuntu1.11                amd64        QEMU utilities

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer at switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis <giorgis at acmac.uoc.gr> wrote:

> Jens-Christian,
> 
> How did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?
> 
> In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs.
> No one has complained so far, but the load/usage is very minimal.
> If this problem really exists, then we will have millions of complaints as soon as the trial phase is over :-(
> 
> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
> 
> Best regards,
> 
> George
> 
>> I think we (i.e. Christian) found the problem:
>> 
>> We created a test VM with 9 mounted RBD volumes (no NFS server). As
>> soon as he hit all disks, we started to experience these 120 second
>> timeouts. We realized that the QEMU process on the hypervisor is
>> opening a TCP connection to every OSD for every mounted volume -
>> exceeding the 1024 FD limit.
>> 
>> So no deep scrubbing etc., but simply too many connections?
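>> 
>> (A rough way to check this on the hypervisor; the pgrep pattern is just an
>> example, adjust it to your guest naming:
>> 
>>   pid=$(pgrep -f qemu-system-x86_64 | head -n1)   # one guest's qemu process
>>   grep 'Max open files' /proc/$pid/limits         # its soft/hard FD limits
>>   ls /proc/$pid/fd | wc -l                        # FDs currently open, incl. OSD sockets
>> 
>> If libvirt starts the guests, raising max_files in /etc/libvirt/qemu.conf is
>> one way to lift the limit.)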
>> 
>> cheers
>> jc
>> 
>> --
>> SWITCH
>> Jens-Christian Fischer, Peta Solutions
>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>> phone +41 44 268 15 15, direct +41 44 268 15 71
>> jens-christian.fischer at switch.ch [3]
>> http://www.switch.ch
>> 
>> http://www.switch.ch/stories
>> 
>> On 25.05.2015, at 06:02, Christian Balzer  wrote:
>> 
>>> Hello,
>>> 
>>> Let's compare your case with John-Paul's.
>>> 
>>> Different OS and Ceph versions (and thus we can assume different NFS
>>> versions as well).
>>> The only common thing is that both of you added OSDs and are likely
>>> suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>> 
>>> Ceph logs will only pipe up when things have been blocked for more than
>>> 30 seconds; NFS might take offense at lower values (or the accumulation
>>> of several distributed delays).
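>>> 
>>> A quick way to see whether requests are currently being blocked (the
>>> exact wording differs between versions):
>>> 
>>>   ceph health detail | grep -i blocked
>>>   # or, on the storage nodes:
>>>   grep 'slow request' /var/log/ceph/ceph-osd.*.log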
>>> 
>>> You added 23 OSDs; tell us more about your cluster, HW, and network.
>>> Were these added to the existing 16 nodes, or are they on new storage
>>> nodes (so could there be something different about those nodes?), and
>>> how busy are your network and CPUs?
>>> Running something like collectd to gather all the Ceph perf data and
>>> other data from the storage nodes and then feeding it to graphite (or
>>> similar) can be VERY helpful to identify whether something is going
>>> wrong and what it is in particular.
>>> Otherwise run atop on your storage nodes to identify whether CPU,
>>> network, or specific HDDs/OSDs are the bottleneck.
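>>> 
>>> For a quick manual look, you can also read the perf counters straight
>>> from an OSD's admin socket on the node hosting it (default socket path
>>> shown, adjust the OSD id) and watch atop alongside:
>>> 
>>>   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
>>>   atop 2    # CPU/network/per-disk view, 2 second interval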
>>> 
>>> Deep scrubbing can be _very_ taxing. Do your problems persist if you
>>> inject an "osd_scrub_sleep" value of "0.5" into your running cluster
>>> (lower that until it hurts again), or if you turn off deep scrubs
>>> altogether for the moment?
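>>> 
>>> Roughly, from a node with an admin keyring (the nodeep-scrub flag should
>>> be available on Firefly, but check your version):
>>> 
>>>   # slow down scrubbing on all running OSDs; tune the value as needed
>>>   ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'
>>>   # or disable deep scrubs entirely for now; 'ceph osd unset nodeep-scrub' re-enables them
>>>   ceph osd set nodeep-scrub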
>>> 
>>> Christian
>>> 
>>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>> 
>>>> We see something very similar on our Ceph cluster, starting as of
>>>> today.
>>>> 
>>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
>>>> OpenStack cluster (we applied the RBD patches for live migration etc.).
>>>> 
>>>> On this cluster we have a big ownCloud installation (Sync & Share) that
>>>> stores its files on three NFS servers, each mounting six 2 TB RBD volumes
>>>> and exposing them to around 10 web server VMs (we originally started with
>>>> one NFS server with a 100 TB volume, but that has become unwieldy).
>>>> All of the servers (hypervisors, Ceph storage nodes and VMs) are running
>>>> Ubuntu 14.04.
>>>> 
>>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
>>>> OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
>>>> rebalancing process ended this morning (after around 12 hours). The
>>>> cluster has been clean since then:
>>>> 
>>>>     cluster b1f3f4c8-xxxxx
>>>>      health HEALTH_OK
>>>>      monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0},
>>>>             election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>>      osdmap e43476: 125 osds: 125 up, 125 in
>>>>       pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>>             266 TB used, 187 TB / 454 TB avail
>>>>                 3319 active+clean
>>>>                   17 active+clean+scrubbing+deep
>>>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>> 
>>>> At midnight, we run a script that creates an RBD snapshot of all RBD
>>>> volumes that are attached to the NFS servers (for backup purposes).
>>>> Looking at our monitoring, around that time one of the NFS servers became
>>>> unresponsive and took down the complete ownCloud installation (the load
>>>> on the web servers was > 200 and they had lost some of the NFS mounts).
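>>>> 
>>>> Roughly speaking, the script just loops over 'rbd snap create' for each
>>>> of those volumes, something like (pool and snapshot names here are
>>>> placeholders):
>>>> 
>>>>   for vol in $(rbd ls volumes | grep nfs); do
>>>>       rbd snap create volumes/"$vol"@nightly-$(date +%Y%m%d)
>>>>   done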
>>>> 
>>>> Rebooting the NFS server solved that problem, but the NFS kernel server
>>>> kept crashing all day long after running for between 10 and 90 minutes.
>>>> 
>>>> We initially suspected a corrupt RBD volume (as it seemed that we could
>>>> trigger the kernel crash by just running 'ls -l' on one of the volumes),
>>>> but subsequent 'xfs_repair -n' checks on those RBD volumes showed no
>>>> problems.
>>>> 
>>>> We migrated the NFS server off of its hypervisor, suspecting a problem
>>>> with the RBD kernel modules, and rebooted the hypervisor, but the problem
>>>> persisted (both on the new hypervisor, and on the old one when we
>>>> migrated it back).
>>>> 
>>>> We changed /etc/default/nfs-kernel-server to start 256 nfsd threads
>>>> (even though the defaults had been working fine for over a year).
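>>>> 
>>>> i.e. in /etc/default/nfs-kernel-server:
>>>> 
>>>>   # number of nfsd threads to start (the Ubuntu default is 8)
>>>>   RPCNFSDCOUNT=256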
>>>> 
>>>> Only one of our 3 NFS servers crashes (see below for syslog information);
>>>> the other 2 have been fine.
>>>> 
>>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>> 
>>>> Before each crash, we see the disk utilization of one or two of the
>>>> mounted RBD volumes go to 100% - there is no pattern as to which of the
>>>> RBD disks starts to act up.
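>>>> 
>>>> On the NFS server itself this shows up as the %util column of the
>>>> affected disk sitting at 100% (the RBD volumes appear as vd* devices
>>>> when attached via virtio), e.g. when watching:
>>>> 
>>>>   iostat -x 2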
>>>> 
>>>> We have scoured the log files of the Ceph cluster for any signs of
>>>> problems but came up empty.
>>>> 
>>>> The NFS server has almost no load (compared to regular usage) as
>>>> most
>>>> sync clients are either turned off (weekend) or have given up
>>>> connecting
>>>> to the server.
>>>> 
>>>> There hadn't been any configuration changes on the NFS servers prior to
>>>> the problems. The only change was the addition of the 23 OSDs.
>>>> 
>>>> We use ceph version 0.80.7
>>>> (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>> 
>>>> Our team is completely out of ideas. We have removed the 100 TB volume
>>>> from the NFS server (we used the downtime to migrate the last data off
>>>> of it to one of the smaller volumes). The NFS server has been running
>>>> for 30 minutes now (with close to no load), but we don't really expect
>>>> it to make it until tomorrow.
>>>> 
>>>> send help
>>>> Jens-Christian
>>> 
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> chibi at gol.com [1] Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/ [2]
>> 
>> 
>> 
>> Links:
>> ------
>> [1] mailto:chibi at gol.com
>> [2] http://www.gol.com/
>> [3] mailto:jens-christian.fischer at switch.ch
>> [4] mailto:chibi at gol.com
> 
> -- 
