Re: NFS interaction with RBD

Hi George

To trigger the error it was enough to simply run mkfs.xfs on all the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
..

This limit can be raised by setting a sufficiently large max_files value in /etc/libvirt/qemu.conf.
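
For anyone running into the same thing, roughly what is needed (the limit value is only an example - size it to at least volumes x OSDs per VM plus some headroom; the service name is the Ubuntu 14.04 one):

  # verify the current limit of a running qemu process
  grep 'Max open files' /proc/$(pgrep -f qemu-system | head -n1)/limits

  # /etc/libvirt/qemu.conf
  max_files = 32768

  # restart libvirt; already-running guests keep their old limit until
  # they are power-cycled or migrated to a freshly started qemu process
  service libvirt-bin restart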

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer <jens-christian.fischer@xxxxxxxxx> wrote:

> George,
> 
> I will let Christian provide you with the details. As far as I know, it was enough to just do an ‘ls’ on all of the attached drives.
> 
> we are using Qemu 2.0:
> 
> $ dpkg -l | grep qemu
> ii  ipxe-qemu                           1.0.0+git-20131111.c3d1e78-2ubuntu1   all          PXE boot firmware - ROM images for qemu
> ii  qemu-keymaps                        2.0.0+dfsg-2ubuntu1.11                all          QEMU keyboard maps
> ii  qemu-system                         2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries
> ii  qemu-system-arm                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (arm)
> ii  qemu-system-common                  2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (common files)
> ii  qemu-system-mips                    2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (mips)
> ii  qemu-system-misc                    2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (miscelaneous)
> ii  qemu-system-ppc                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (ppc)
> ii  qemu-system-sparc                   2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (sparc)
> ii  qemu-system-x86                     2.0.0+dfsg-2ubuntu1.11                amd64        QEMU full system emulation binaries (x86)
> ii  qemu-utils                          2.0.0+dfsg-2ubuntu1.11                amd64        QEMU utilities
> 
> cheers
> jc
> 
> -- 
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fischer@xxxxxxxxx
> http://www.switch.ch
> 
> http://www.switch.ch/stories
> 
> On 26.05.2015, at 19:12, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
> 
>> Jens-Christian,
>> 
>> how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?
>> 
>> In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs.
>> No one has complained so far, but the load/usage is very minimal.
>> If this problem really exists, then as soon as the trial phase is over we will have millions of complaints :-(
>> 
>> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>> 
>> Best regards,
>> 
>> George
>> 
>>> I think we (i.e. Christian) found the problem:
>>> 
>>> We created a test VM with 9 mounted RBD volumes (no NFS server). As
>>> soon as he hit all the disks, we started to experience these 120-second
>>> timeouts. We realized that the QEMU process on the hypervisor opens a
>>> TCP connection to every OSD for every mounted volume - exceeding the
>>> 1024 FD limit.
>>> 
>>> So no deep scrubbing etc., but simply too many connections…
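>>>
>>> For illustration: with the ~100 OSDs in this cluster, 9 volumes mean on
>>> the order of 9 x 100 = 900 OSD sockets held by a single qemu process,
>>> plus monitor connections and the usual file descriptors, so the default
>>> soft limit of 1024 is exhausted quickly. A rough way to check on the
>>> hypervisor (the PID is a placeholder):
>>> 
>>>   ls /proc/<qemu-pid>/fd | wc -l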
>>> 
>>> cheers
>>> jc
>>> 
>>> --
>>> SWITCH
>>> Jens-Christian Fischer, Peta Solutions
>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>> jens-christian.fischer@xxxxxxxxx [3]
>>> http://www.switch.ch
>>> 
>>> http://www.switch.ch/stories
>>> 
>>> On 25.05.2015, at 06:02, Christian Balzer  wrote:
>>> 
>>>> Hello,
>>>> 
>>>> let's compare your case with John-Paul's.
>>>> 
>>>> Different OS and Ceph versions (thus we can assume different NFS
>>>> versions
>>>> as well).
>>>> The only common thing is that both of you added OSDs and are likely
>>>> suffering from delays stemming from Ceph re-balancing or
>>>> deep-scrubbing.
>>>> 
>>>> Ceph logs will only pipe up when things have been blocked for more
>>>> than 30 seconds; NFS might take offense at lower values (or at the
>>>> accumulation of several distributed delays).
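>>>>
>>>> That 30 seconds is, if I am not mistaken, the "osd op complaint time"
>>>> default; it can be lowered temporarily so the OSDs report slow
>>>> requests earlier, e.g.:
>>>> 
>>>>   ceph tell osd.* injectargs '--osd_op_complaint_time 10'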
>>>> 
>>>> You added 23 OSDs; tell us more about your cluster, HW and network.
>>>> Were these added to the existing 16 nodes, or are they on new storage
>>>> nodes (so could there be something different with those nodes)? How
>>>> busy are your network and CPUs?
>>>> Running something like collectd to gather all ceph perf data and
>>>> other
>>>> data from the storage nodes and then feeding it to graphite (or
>>>> similar)
>>>> can be VERY helpful to identify if something is going wrong and what
>>>> it is
>>>> in particular.
>>>> Otherwise run atop on your storage nodes to identify if CPU,
>>>> network,
>>>> specific HDDs/OSDs are bottlenecks.
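>>>>
>>>> For a quick look without setting all that up, the per-OSD perf counters
>>>> that such a setup would graph can also be pulled by hand from the admin
>>>> socket on a storage node (osd.0 is a placeholder for a local OSD), and
>>>> atop can be run with a short interval:
>>>> 
>>>>   ceph daemon osd.0 perf dump
>>>>   atop 2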
>>>> 
>>>> Deep scrubbing can be _very_ taxing. Do your problems persist if you
>>>> inject an "osd_scrub_sleep" value of "0.5" into your running cluster
>>>> (lower that until it hurts again), or if you turn off deep scrubs
>>>> altogether for the moment?
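>>>>
>>>> Something along these lines, from a node with an admin keyring:
>>>> 
>>>>   ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'
>>>>   # or park deep scrubs entirely for the moment:
>>>>   ceph osd set nodeep-scrub    # undo with: ceph osd unset nodeep-scrub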
>>>> 
>>>> Christian
>>>> 
>>>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>>> 
>>>>> We see something very similar on our Ceph cluster, starting as of
>>>>> today.
>>>>> 
>>>>> We use a 16-node, 102-OSD Ceph installation as the basis for an
>>>>> Icehouse OpenStack cluster (we applied the RBD patches for live
>>>>> migration etc.).
>>>>> 
>>>>> On this cluster we have a big ownCloud installation (Sync & Share)
>>>>> that stores its files on three NFS servers, each mounting six 2 TB RBD
>>>>> volumes and exposing them to around 10 web server VMs (we originally
>>>>> started with one NFS server with a 100 TB volume, but that has become
>>>>> unwieldy). All of the servers (hypervisors, Ceph storage nodes and
>>>>> VMs) are running Ubuntu 14.04.
>>>>> 
>>>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to
>>>>> 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark).
>>>>> The rebalancing process ended this morning (after around 12 hours).
>>>>> The cluster has been clean since then:
>>>>> 
>>>>>     cluster b1f3f4c8-xxxxx
>>>>>      health HEALTH_OK
>>>>>      monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0},
>>>>>             election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>>>      osdmap e43476: 125 osds: 125 up, 125 in
>>>>>       pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>>>             266 TB used, 187 TB / 454 TB avail
>>>>>                 3319 active+clean
>>>>>                   17 active+clean+scrubbing+deep
>>>>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>>> 
>>>>> At midnight, we run a script that creates an RBD snapshot of all RBD
>>>>> volumes that are attached to the NFS servers (for backup purposes).
>>>>> Looking at our monitoring, around that time one of the NFS servers
>>>>> became unresponsive and took down the complete ownCloud installation
>>>>> (load on the web servers was > 200 and they had lost some of the NFS
>>>>> mounts).
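>>>>>
>>>>> For context, the snapshot script is essentially a loop of this shape
>>>>> (the pool name and snapshot prefix here are made up):
>>>>> 
>>>>>   for vol in $(rbd ls nfs-volumes); do
>>>>>     rbd snap create nfs-volumes/${vol}@backup-$(date +%Y%m%d)
>>>>>   done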
>>>>> 
>>>>> Rebooting the NFS server solved that problem, but the NFS kernel
>>>>> server kept crashing all day long after running for between 10 and 90
>>>>> minutes.
>>>>> 
>>>>> We initially suspected a corrupt RBD volume (as it seemed that we
>>>>> could trigger the kernel crash just by running “ls -l” on one of the
>>>>> volumes), but subsequent “xfs_repair -n” checks on those RBD volumes
>>>>> showed no problems.
>>>>> 
>>>>> We migrated the NFS server off its hypervisor, suspecting a problem
>>>>> with the RBD kernel modules, and rebooted the hypervisor, but the
>>>>> problem persisted (both on the new hypervisor and on the old one when
>>>>> we migrated it back).
>>>>> 
>>>>> We changed /etc/default/nfs-kernel-server to start 256 nfsd threads
>>>>> (even though the defaults had been working fine for over a year).
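>>>>>
>>>>> That is, roughly this in /etc/default/nfs-kernel-server, followed by a
>>>>> restart of the nfs-kernel-server service:
>>>>> 
>>>>>   # number of nfsd kernel threads to start
>>>>>   RPCNFSDCOUNT=256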
>>>>> 
>>>>> Only one of our 3 NFS servers crashes (see below for syslog
>>>>> information)
>>>>> - the other 2 have been fine
>>>>> 
>>>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>> 
>>>>> Before each crash, we see the disk utilization of one or two random
>>>>> mounted RBD volumes go to 100% - there is no pattern as to which of
>>>>> the RBD disks starts to act up.
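>>>>>
>>>>> On the server itself this is visible with plain iostat (the affected
>>>>> devices are whatever the RBD-backed disks show up as in the VM):
>>>>> 
>>>>>   iostat -x 5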
>>>>> 
>>>>> We have scoured the log files of the Ceph cluster for any signs of
>>>>> problems but came up empty.
>>>>> 
>>>>> The NFS server has almost no load (compared to regular usage) as
>>>>> most
>>>>> sync clients are either turned off (weekend) or have given up
>>>>> connecting
>>>>> to the server.
>>>>> 
>>>>> There haven't been any configuration changes on the NFS servers prior
>>>>> to the problems; the only change was the addition of the 23 OSDs.
>>>>> 
>>>>> We use ceph version 0.80.7
>>>>> (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>>> 
>>>>> Our team is completely out of ideas. We have removed the 100 TB
>>>>> volume from the NFS server (we used the downtime to migrate the last
>>>>> data off it to one of the smaller volumes). The NFS server has been
>>>>> running for 30 minutes now (with close to no load) but we don’t really
>>>>> expect it to make it until tomorrow.
>>>>> 
>>>>> send help
>>>>> Jens-Christian
>>>> 
>>>> --
>>>> Christian Balzer Network/Systems Engineer
>>>> chibi@xxxxxxx [1] Global OnLine Japan/Fusion Communications
>>>> http://www.gol.com/ [2]
>>> 
>>> 
>>> 
>>> Links:
>>> ------
>>> [1] mailto:chibi@xxxxxxx
>>> [2] http://www.gol.com/
>>> [3] mailto:jens-christian.fischer@xxxxxxxxx
>>> [4] mailto:chibi@xxxxxxx
>> 
>> -- 
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
