Re: NFS interaction with RBD

Hi George

Well that’s strange. I wonder why our systems behave so differently.

We’ve got:

Hypervisors running on Ubuntu 14.04. 
VMs with 9 ceph volumes: 2TB each.
XFS instead of your ext4

Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our ceph cluster;
I’m about to leave on vacation and no longer have time to look that up.
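
(If it helps while I’m away: the per-pool PG count can be read straight from the cluster, e.g. with

    ceph osd pool get <pool-name> pg_num

where <pool-name> is a placeholder for whichever pool backs the RBD volumes.)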

Best regards
Christian


On 29 May 2015, at 14:42, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:

> All,
> 
> I've tried to recreate the issue, without success!
> 
> My configuration is the following:
> 
> OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
> QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
> Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB OSDs equally distributed on two disk nodes, 3xMonitors
> 
> 
> OpenStack Cinder has been configured to provide RBD Volumes from Ceph.
> 
> I have created 10x 500GB Volumes, which were then all attached to a single Virtual Machine.
> 
> All volumes were formatted twice for comparison, once using "mkfs.xfs" and once using "mkfs.ext4".
> I tried to issue the commands all at the same time (or as close to that as possible).
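> For what it's worth, the parallel run boiled down to something like this (the device names are placeholders for the 10 attached Cinder volumes, not necessarily my actual ones):
> 
>   for dev in /dev/vd{b..k}; do
>       mkfs.xfs -f "$dev" &    # second pass: mkfs.ext4 instead
>   done
>   wait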
> 
> In both tests I didn't notice any interruption. It may have taken longer than doing one volume at a time, but the system was continuously up and everything kept responding without any problems.
> 
> While these processes were running, there were 100 open connections to one of the OSD nodes and 111 to the other.
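> (Counted from the hypervisor with something along the lines of
> 
>   ss -tn state established | grep <osd-node-ip> | wc -l
> 
> where <osd-node-ip> is a placeholder for each OSD node's address.)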
> 
> So I guess I am not experiencing the issue because of the low number of OSDs I have. Is my assumption correct?
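> (Back-of-the-envelope: librbd opens roughly one connection per OSD for every attached volume, so 10 volumes x 20 OSDs tops out around 200 connections, in line with the ~211 I observed and well below the 1024 open-files limit discussed further down in this thread. With 100+ OSDs the same product approaches 1000 with just 9-10 volumes, which would explain why only the bigger cluster hits it.)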
> 
> 
> Best regards,
> 
> George
> 
> 
> 
>> Thanks a million for the feedback Christian!
>> 
>> I've tried to recreate the issue with 10 RBD Volumes mounted on a
>> single server, without success!
>> 
>> I've issued the "mkfs.xfs" commands simultaneously (or at least as
>> fast as I could in different terminals) without noticing any
>> problems. Can you please tell me what the size of each of the RBD
>> Volumes was, because I have a feeling that mine were too small; if
>> so, I will have to test it on our bigger cluster.
>> 
>> I've also been thinking that, besides the QEMU version, the
>> underlying OS might matter as well, so what was your testbed?
>> 
>> 
>> All the best,
>> 
>> George
>> 
>>> Hi George
>>> 
>>> In order to experience the error it was enough to simply run mkfs.xfs
>>> on all the volumes.
>>> 
>>> 
>>> In the meantime it became clear what the problem was:
>>> 
>>> ~ ; cat /proc/183016/limits
>>> ...
>>> Max open files            1024                 4096                 files
>>> ..
>>> 
>>> This can be changed by setting a sensible value for max_files in
>>> /etc/libvirt/qemu.conf.
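>>> For the archives, the relevant knob looks roughly like this (the value
>>> is just an example; pick whatever fits your OSD and volume count, and
>>> note that already-running guests only pick it up after a restart or
>>> migration):
>>> 
>>>   # /etc/libvirt/qemu.conf
>>>   max_files = 32768
>>> 
>>> followed by restarting libvirtd.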
>>> 
>>> Regards
>>> Christian
>>> 
>>> 
>>> 
>>> On 27 May 2015, at 16:23, Jens-Christian Fischer
>>> <jens-christian.fischer@xxxxxxxxx> wrote:
>>> 
>>>> George,
>>>> 
>>>> I will let Christian provide you with the details. As far as I know, it was enough to just run an ‘ls’ on all of the attached drives.
>>>> 
>>>> we are using Qemu 2.0:
>>>> 
>>>> $ dpkg -l | grep qemu
>>>> ii  ipxe-qemu                           1.0.0+git-20131111.c3d1e78-2ubuntu1   all          PXE boot firmware - ROM images for qemu
>>>> ii  qemu-keymaps                        2.0.0+dfsg-2ubuntu1.11      all          QEMU keyboard maps
>>>> ii  qemu-system                         2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries
>>>> ii  qemu-system-arm                     2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (arm)
>>>> ii  qemu-system-common                  2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (common files)
>>>> ii  qemu-system-mips                    2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (mips)
>>>> ii  qemu-system-misc                    2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (miscelaneous)
>>>> ii  qemu-system-ppc                     2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (ppc)
>>>> ii  qemu-system-sparc                   2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (sparc)
>>>> ii  qemu-system-x86                     2.0.0+dfsg-2ubuntu1.11      amd64        QEMU full system emulation binaries (x86)
>>>> ii  qemu-utils                          2.0.0+dfsg-2ubuntu1.11      amd64        QEMU utilities
>>>> 
>>>> cheers
>>>> jc
>>>> 
>>>> --
>>>> SWITCH
>>>> Jens-Christian Fischer, Peta Solutions
>>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>>> jens-christian.fischer@xxxxxxxxx
>>>> http://www.switch.ch
>>>> 
>>>> http://www.switch.ch/stories
>>>> 
>>>> On 26.05.2015, at 19:12, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
>>>> 
>>>>> Jens-Christian,
>>>>> 
>>>>> how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify this?
>>>>> 
>>>>> In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs.
>>>>> No one has complained so far, but the load/usage is very minimal.
>>>>> If this problem really exists, then as soon as the trial phase is over we will have millions of complaints :-(
>>>>> 
>>>>> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> George
>>>>> 
>>>>>> I think we (i.e. Christian) found the problem:
>>>>>> 
>>>>>> We created a test VM with 9 mounted RBD volumes (no NFS server). As
>>>>>> soon as he hit all the disks, we started to experience these 120
>>>>>> second timeouts. We realized that the QEMU process on the hypervisor
>>>>>> opens a TCP connection to every OSD for every mounted volume -
>>>>>> exceeding the 1024 FD limit.
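>>>>>> (Rough numbers to make that concrete: 9 volumes x 102 OSDs is up to
>>>>>> 9 x 102 = 918 OSD connections from a single QEMU process; together
>>>>>> with the monitor sessions and the other descriptors the process holds
>>>>>> open, that easily exhausts a soft limit of 1024.)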
>>>>>> 
>>>>>> So no deep scrubbing etc., but simply too many connections…
>>>>>> 
>>>>>> cheers
>>>>>> jc
>>>>>> 
>>>>>> --
>>>>>> SWITCH
>>>>>> Jens-Christian Fischer, Peta Solutions
>>>>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>>>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>>>>> jens-christian.fischer@xxxxxxxxx [3]
>>>>>> http://www.switch.ch
>>>>>> 
>>>>>> http://www.switch.ch/stories
>>>>>> 
>>>>>> On 25.05.2015, at 06:02, Christian Balzer  wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> lets compare your case with John-Paul's.
>>>>>>> 
>>>>>>> Different OS and Ceph versions (thus we can assume different NFS
>>>>>>> versions
>>>>>>> as well).
>>>>>>> The only common thing is that both of you added OSDs and are likely
>>>>>>> suffering from delays stemming from Ceph re-balancing or
>>>>>>> deep-scrubbing.
>>>>>>> 
>>>>>>> Ceph logs will only pipe up when things have been blocked for more
>>>>>>> than 30 seconds; NFS might take offense at lower values (or at the
>>>>>>> accumulation of several distributed delays).
>>>>>>> 
>>>>>>> You added 23 OSDs; tell us more about your cluster, HW and network.
>>>>>>> Were they added to the existing 16 nodes, or are they on new storage
>>>>>>> nodes (and if so, could something be different about those nodes)?
>>>>>>> How busy are your network and CPUs?
>>>>>>> Running something like collectd to gather all ceph perf data and
>>>>>>> other
>>>>>>> data from the storage nodes and then feeding it to graphite (or
>>>>>>> similar)
>>>>>>> can be VERY helpful to identify if something is going wrong and what
>>>>>>> it is
>>>>>>> in particular.
>>>>>>> Otherwise run atop on your storage nodes to identify whether CPU,
>>>>>>> network, or specific HDDs/OSDs are the bottleneck.
>>>>>>> 
>>>>>>> Deep scrubbing can be _very_ taxing. Do your problems persist if you
>>>>>>> inject an "osd_scrub_sleep" value of "0.5" into your running cluster
>>>>>>> (lower that until it hurts again), or if you turn off deep scrubs
>>>>>>> altogether for the moment?
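>>>>>>> From memory, something along these lines should do it:
>>>>>>> 
>>>>>>>   ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'
>>>>>>>   ceph osd set nodeep-scrub    # "ceph osd unset nodeep-scrub" to re-enable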
>>>>>>> 
>>>>>>> Christian
>>>>>>> 
>>>>>>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>>>>>> 
>>>>>>>> We see something very similar on our Ceph cluster, starting as of
>>>>>>>> today.
>>>>>>>> 
>>>>>>>> We use a 16 node, 102 OSD Ceph installation as the basis for an
>>>>>>>> Icehouse
>>>>>>>> OpenStack cluster (we applied the RBD patches for live migration
>>>>>>>> etc)
>>>>>>>> 
>>>>>>>> On this cluster we have a big ownCloud installation (Sync & Share)
>>>>>>>> that stores its files on three NFS servers, each mounting six 2TB
>>>>>>>> RBD volumes and exposing them to around 10 web server VMs (we
>>>>>>>> originally started with one NFS server with a single 100TB volume,
>>>>>>>> but that has become unwieldy). All of the servers (hypervisors,
>>>>>>>> ceph storage nodes and VMs) are running Ubuntu 14.04.
>>>>>>>> 
>>>>>>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up
>>>>>>>> to 125 OSDs (because we had 4 OSDs that were nearing the 90% full
>>>>>>>> mark). The rebalancing process ended this morning (after around 12
>>>>>>>> hours). The cluster has been clean since then:
>>>>>>>> 
>>>>>>>>   cluster b1f3f4c8-xxxxx
>>>>>>>>    health HEALTH_OK
>>>>>>>>    monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>>>>>>    osdmap e43476: 125 osds: 125 up, 125 in
>>>>>>>>    pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>>>>>>           266 TB used, 187 TB / 454 TB avail
>>>>>>>>               3319 active+clean
>>>>>>>>                 17 active+clean+scrubbing+deep
>>>>>>>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>>>>>> 
>>>>>>>> At midnight, we run a script that creates an RBD snapshot of all
>>>>>>>> RBD volumes that are attached to the NFS servers (for backup
>>>>>>>> purposes). Looking at our monitoring, around that time one of the
>>>>>>>> NFS servers became unresponsive and took down the complete ownCloud
>>>>>>>> installation (load on the web servers was > 200 and they had lost
>>>>>>>> some of the NFS mounts).
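>>>>>>>> (The snapshot job itself is nothing fancy; it boils down to roughly
>>>>>>>> the following, with the pool name, image selection and snapshot
>>>>>>>> naming here being placeholders:
>>>>>>>> 
>>>>>>>>   for img in $(rbd ls volumes); do
>>>>>>>>       rbd snap create volumes/$img@nightly-$(date +%Y%m%d)
>>>>>>>>   done
>>>>>>>> )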
>>>>>>>> 
>>>>>>>> Rebooting the NFS server solved that problem, but the NFS kernel
>>>>>>>> server kept crashing all day long, each time after having run for
>>>>>>>> between 10 and 90 minutes.
>>>>>>>> 
>>>>>>>> We initially suspected a corrupt RBD volume (as it seemed that we
>>>>>>>> could trigger the kernel crash just by running “ls -l” on one of
>>>>>>>> the volumes), but subsequent “xfs_repair -n” checks on those RBD
>>>>>>>> volumes showed no problems.
>>>>>>>> 
>>>>>>>> Suspecting a problem with the RBD kernel modules, we migrated the
>>>>>>>> NFS server off of its hypervisor and rebooted the hypervisor, but
>>>>>>>> the problem persisted (both on the new hypervisor, and on the old
>>>>>>>> one when we migrated it back).
>>>>>>>> 
>>>>>>>> We changed /etc/default/nfs-kernel-server to start 256 server
>>>>>>>> threads (even though the defaults had been working fine for over a
>>>>>>>> year).
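>>>>>>>> (Concretely, if memory serves, that is the RPCNFSDCOUNT setting in
>>>>>>>> /etc/default/nfs-kernel-server:
>>>>>>>> 
>>>>>>>>   RPCNFSDCOUNT=256
>>>>>>>> 
>>>>>>>> followed by a restart of the nfs-kernel-server service.)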
>>>>>>>> 
>>>>>>>> Only one of our 3 NFS servers crashes (see below for syslog
>>>>>>>> information)
>>>>>>>> - the other 2 have been fine
>>>>>>>> 
>>>>>>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>>>>>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>>>>>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>>>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>>>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>>>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>>>>>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>>>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>>>>>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>>>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>>>>> 
>>>>>>>> Before each crash, we see the disk utilization of one or two random
>>>>>>>> mounted RBD volumes go to 100% - there is no pattern as to which of
>>>>>>>> the RBD disks starts to act up.
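>>>>>>>> (For anyone wanting to check the same thing by hand: a plain
>>>>>>>> "iostat -x 5" on the NFS server shows it too, with the %util column
>>>>>>>> of the affected devices sitting at 100%.)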
>>>>>>>> 
>>>>>>>> We have scoured the log files of the Ceph cluster for any signs of
>>>>>>>> problems but came up empty.
>>>>>>>> 
>>>>>>>> The NFS server has almost no load (compared to regular usage) as
>>>>>>>> most
>>>>>>>> sync clients are either turned off (weekend) or have given up
>>>>>>>> connecting
>>>>>>>> to the server.
>>>>>>>> 
>>>>>>>> There hadn't been any configuration changes on the NFS servers
>>>>>>>> prior to the problems. The only change was the addition of the 23
>>>>>>>> OSDs.
>>>>>>>> 
>>>>>>>> We use ceph version 0.80.7
>>>>>>>> (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>>>>>> 
>>>>>>>> Our team is completely out of ideas. We have removed the 100TB
>>>>>>>> volume from the NFS server (we used the downtime to migrate the
>>>>>>>> last data off of it to one of the smaller volumes). The NFS server
>>>>>>>> has been running for 30 minutes now (with close to no load), but we
>>>>>>>> don’t really expect it to make it until tomorrow.
>>>>>>>> 
>>>>>>>> send help
>>>>>>>> Jens-Christian
>>>>>>> 
>>>>>>> --
>>>>>>> Christian Balzer Network/Systems Engineer
>>>>>>> chibi@xxxxxxx [1] Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/ [2]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Links:
>>>>>> ------
>>>>>> [1] mailto:chibi@xxxxxxx
>>>>>> [2] http://www.gol.com/
>>>>>> [3] mailto:jens-christian.fischer@xxxxxxxxx
>>>>>> [4] mailto:chibi@xxxxxxx
>>>>> 
>>>>> --
>>>> 


