Hi George,

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:

~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards
Christian
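For reference, a minimal sketch of that change (the value below is only an example; size it to roughly the number of attached RBD volumes times the number of OSDs, plus some headroom; the service name is the one used on Ubuntu 14.04):

# /etc/libvirt/qemu.conf -- raise the per-guest open file limit
max_files = 32768

# restart libvirtd afterwards; guests that are already running only pick up
# the new limit once they have been power-cycled or migrated
service libvirt-bin restart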
On 27 May 2015, at 16:23, Jens-Christian Fischer <jens-christian.fischer@xxxxxxxxx> wrote:

> George,
>
> I will let Christian provide you the details. As far as I know, it was enough to just do an ‘ls’ on all of the attached drives.
>
> We are using Qemu 2.0:
>
> $ dpkg -l | grep qemu
> ii ipxe-qemu 1.0.0+git-20131111.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu
> ii qemu-keymaps 2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps
> ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries
> ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (arm)
> ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (common files)
> ii qemu-system-mips 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (mips)
> ii qemu-system-misc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (miscellaneous)
> ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (ppc)
> ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (sparc)
> ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (x86)
> ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU utilities
>
> cheers
> jc
>
> --
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fischer@xxxxxxxxx
> http://www.switch.ch
> http://www.switch.ch/stories
>
> On 26.05.2015, at 19:12, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
>
>> Jens-Christian,
>>
>> How did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?
>>
>> In our installation we have a VM with 30 RBD volumes mounted, which are all exported via NFS to other VMs.
>> No one has complained so far, but the load/usage is very minimal.
>> If this problem really exists, then as soon as the trial phase is over we will have millions of complaints :-(
>>
>> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>>
>> Best regards,
>>
>> George
>>
>>> I think we (i.e. Christian) found the problem:
>>>
>>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all the disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>>
>>> So no deep scrubbing etc., simply too many connections…
>>>
>>> cheers
>>> jc
>>>
>>> --
>>> SWITCH
>>> Jens-Christian Fischer, Peta Solutions
>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>> jens-christian.fischer@xxxxxxxxx
>>> http://www.switch.ch
>>> http://www.switch.ch/stories
>>>
>>> On 25.05.2015, at 06:02, Christian Balzer wrote:
>>>
>>>> Hello,
>>>>
>>>> let's compare your case with John-Paul's.
>>>>
>>>> Different OS and Ceph versions (thus we can assume different NFS versions as well).
>>>> The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>>>
>>>> Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or at the accumulation of several distributed delays).
>>>>
>>>> You added 23 OSDs - tell us more about your cluster, HW and network. Were these added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes?), and how busy are your network and CPUs?
>>>> Running something like collectd to gather all Ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify whether something is going wrong and what it is in particular.
>>>> Otherwise run atop on your storage nodes to identify whether CPU, network, or specific HDDs/OSDs are bottlenecks.
>>>>
>>>> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that until it hurts again), or if you turn off deep scrubs altogether for the moment?
>>>>
>>>> Christian
>>>>
>>>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>>>
>>>>> We see something very similar on our Ceph cluster, starting as of today.
>>>>>
>>>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.).
>>>>>
>>>>> On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting six 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy). All of the servers (hypervisors, Ceph storage nodes and VMs) are using Ubuntu 14.04.
>>>>>
>>>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:
>>>>>
>>>>> cluster b1f3f4c8-xxxxx
>>>>>  health HEALTH_OK
>>>>>  monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>>>  osdmap e43476: 125 osds: 125 up, 125 in
>>>>>  pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>>>  266 TB used, 187 TB / 454 TB avail
>>>>>  3319 active+clean
>>>>>  17 active+clean+scrubbing+deep
>>>>>  client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>>>
>>>>> At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts).
>>>>>
>>>>> Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run between 10 and 90 minutes.
>>>>>
>>>>> We initially suspected a corrupt RBD volume (as it seemed that we could trigger the kernel crash by just doing an “ls -l” on one of the volumes), but subsequent “xfs_repair -n” checks on those RBD volumes showed no problems.
>>>>>
>>>>> We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back).
>>>>>
>>>>> We changed /etc/default/nfs-kernel-server to start 256 server threads (even though the defaults had been working fine for over a year).
>>>>>
>>>>> Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine.
>>>>>
>>>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>>
>>>>> Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern as to which of the RBD disks starts to act up.
>>>>>
>>>>> We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.
>>>>>
>>>>> The NFS server has almost no load (compared to regular usage) as most sync clients are either turned off (weekend) or have given up connecting to the server.
>>>>>
>>>>> There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of 23 OSDs.
>>>>>
>>>>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3).
>>>>>
>>>>> Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load) but we don’t really expect it to make it until tomorrow.
>>>>>
>>>>> send help
>>>>> Jens-Christian
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
>>>> http://www.gol.com/
>>>
>>
>> --
>
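To see how close a given guest gets to the limit described above, something along these lines on the hypervisor is usually enough (the pgrep pattern is only an illustration; as noted in the thread, the QEMU process holds TCP connections to the OSDs for every attached volume, so the descriptor count grows roughly with volumes x OSDs):

# pick the qemu process of the guest in question (pattern is illustrative)
pid=$(pgrep -f qemu-system-x86_64 | head -1)

grep 'Max open files' /proc/$pid/limits   # effective limit for that process
ls /proc/$pid/fd | wc -l                  # descriptors currently in use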
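And for completeness, the scrub tuning Christian Balzer suggests in the quoted thread can be tried on a running cluster roughly like this (a sketch; the 0.5 value is the one he mentions, and the flag should be reverted once debugging is done):

# throttle scrubbing on all OSDs at runtime
ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

# or pause deep scrubs entirely while debugging, and re-enable them later
ceph osd set nodeep-scrub
ceph osd unset nodeep-scrub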