George, I will let Christian provide you the details. As far as I know, it was enough to just do an "ls" on all of the attached drives.

We are using QEMU 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu           1.0.0+git-20131111.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11               all    QEMU keyboard maps
ii  qemu-system         2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries
ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (common files)
ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (mips)
ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (sparc)
ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11               amd64  QEMU full system emulation binaries (x86)
ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11               amd64  QEMU utilities

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer at switch.ch
http://www.switch.ch
http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis <giorgis at acmac.uoc.gr> wrote:

> Jens-Christian,
>
> how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?
>
> In our installation we have a VM with 30 RBD volumes mounted, which are all exported via NFS to other VMs.
> No one has complained so far, but the load/usage is very minimal.
> If this problem really exists, then very soon, once the trial phase is over, we will have millions of complaints :-(
>
> What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
>
> Best regards,
>
> George
>
>> I think we (i.e. Christian) found the problem:
>>
>> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>>
>> So no deep scrubbing etc., but simply too many connections?
>>
>> cheers
>> jc
>>
>> --
>> SWITCH
>> Jens-Christian Fischer, Peta Solutions
>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>> phone +41 44 268 15 15, direct +41 44 268 15 71
>> jens-christian.fischer at switch.ch [3]
>> http://www.switch.ch
>>
>> http://www.switch.ch/stories
>>
>> On 25.05.2015, at 06:02, Christian Balzer wrote:
>>
>>> Hello,
>>>
>>> let's compare your case with John-Paul's.
>>>
>>> Different OS and Ceph versions (thus we can assume different NFS versions as well).
>>> The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>>
>>> Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or the accumulation of several distributed delays).
>>>
>>> You added 23 OSDs, tell us more about your cluster, HW, network.
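For readers who want to gather that kind of first-look information, the usual commands are sketched below; nothing here is specific to this cluster, and "ceph osd perf" may not be available in every release:

$ ceph -s            # overall health, recovery/backfill and scrub activity
$ ceph osd tree      # which hosts the newly added OSDs ended up on, and their weights
$ ceph osd perf      # per-OSD commit/apply latency; a struggling disk stands out
$ iostat -x 5        # run on a storage node: per-device utilisation and await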
>>> Were these added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes?), and how busy are your network and CPUs?
>>> Running something like collectd to gather all Ceph perf data and other data from the storage nodes, and then feeding it to graphite (or similar), can be VERY helpful to identify whether something is going wrong and what it is in particular.
>>> Otherwise run atop on your storage nodes to identify whether CPU, network, or specific HDDs/OSDs are bottlenecks.
>>>
>>> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that until it hurts again), or if you turn off deep scrubs altogether for the moment?
>>>
>>> Christian
>>>
>>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>>
>>>> We see something very similar on our Ceph cluster, starting as of today.
>>>>
>>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.)
>>>>
>>>> On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that had become unwieldy). All of the servers (hypervisors, Ceph storage nodes and VMs) are using Ubuntu 14.04.
>>>>
>>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:
>>>>
>>>>   cluster b1f3f4c8-xxxxx
>>>>    health HEALTH_OK
>>>>    monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>>    osdmap e43476: 125 osds: 125 up, 125 in
>>>>    pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>>          266 TB used, 187 TB / 454 TB avail
>>>>              3319 active+clean
>>>>                17 active+clean+scrubbing+deep
>>>>   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>>
>>>> At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers (for backup purposes). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts).
>>>>
>>>> Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run for between 10 and 90 minutes.
>>>>
>>>> We initially suspected a corrupt RBD volume (it seemed that we could trigger the kernel crash by just doing an "ls -l" on one of the volumes), but subsequent "xfs_repair -n" checks on those RBD volumes showed no problems.
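The midnight snapshot script itself is not shown in the thread; for context, a minimal sketch of what such a nightly loop typically looks like follows. The pool name "volumes", the volume names and the optional fsfreeze step are assumptions for illustration, not details from this setup:

  #!/bin/bash
  # nightly RBD snapshots of the volumes attached to one NFS server
  DATE=$(date +%Y%m%d)
  for vol in nfs1-disk1 nfs1-disk2 nfs1-disk3; do   # hypothetical volume names
      # optionally freeze the XFS filesystem so the snapshot is crash-consistent
      # fsfreeze -f /srv/${vol}
      rbd snap create volumes/${vol}@nightly-${DATE}
      # fsfreeze -u /srv/${vol}
  done

Freezing (or at least syncing) before snapshotting only matters if the snapshots need to be mountable without a log replay; nothing in the thread suggests it is related to the crashes described here.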
>>>>
>>>> We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor and on the old one when we migrated it back).
>>>>
>>>> We changed /etc/default/nfs-kernel-server to start 256 nfsd threads (even though the defaults had been working fine for over a year).
>>>>
>>>> Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine.
>>>>
>>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>>
>>>> Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern as to which of the RBD disks starts to act up.
>>>>
>>>> We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.
>>>>
>>>> The NFS server has almost no load (compared to regular usage), as most sync clients are either turned off (weekend) or have given up connecting to the server.
>>>>
>>>> There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of the 23 OSDs.
>>>>
>>>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3).
>>>>
>>>> Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load), but we don't really expect it to make it until tomorrow.
>>>>
>>>> send help
>>>> Jens-Christian
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> chibi at gol.com [1]   Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/ [2]
>>
>>
>>
>> Links:
>> ------
>> [1] mailto:chibi at gol.com
>> [2] http://www.gol.com/
>> [3] mailto:jens-christian.fischer at switch.ch
>> [4] mailto:chibi at gol.com
>
> --
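For anyone hitting the file-descriptor ceiling described at the top of this thread (QEMU opens a TCP connection to every OSD for every attached RBD volume), a quick way to see how close a given QEMU process is to its limit is sketched below; the instance name in the pgrep pattern is hypothetical:

$ pid=$(pgrep -f 'qemu-system-x86_64.*instance-00000042' | head -n 1)
$ grep 'Max open files' /proc/${pid}/limits    # effective soft/hard limit
$ ls /proc/${pid}/fd | wc -l                   # descriptors currently open, OSD sockets included

If the count is close to the soft limit, guests started through libvirt can be given more headroom via the max_files setting in /etc/libvirt/qemu.conf (followed by a libvirtd restart); guests that are already running keep their old limit until they are restarted or migrated.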