Jens-Christian, how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it? (A quick FD-count check is sketched further down.)

In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs. No one has complained so far, but the load/usage is very minimal. If this problem really exists, then we will have millions of complaints as soon as the trial phase is over :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm

Best regards,
George

> I think we (i.e. Christian) found the problem:
>
> We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit.
>
> So no deep scrubbing etc., but simply too many connections?
>
> cheers
> jc
>
> --
> SWITCH
> Jens-Christian Fischer, Peta Solutions
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> phone +41 44 268 15 15, direct +41 44 268 15 71
> jens-christian.fischer at switch.ch [3]
> http://www.switch.ch
>
> http://www.switch.ch/stories
>
> On 25.05.2015, at 06:02, Christian Balzer wrote:
>
>> Hello,
>>
>> let's compare your case with John-Paul's.
>>
>> Different OS and Ceph versions (thus we can assume different NFS versions as well).
>> The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
>>
>> Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or the accumulation of several distributed delays).
>>
>> You added 23 OSDs, so tell us more about your cluster, HW and network.
>> Were these added to the existing 16 nodes, or are they on new storage nodes (so could there be something different with those nodes?), and how busy are your network and CPUs?
>> Running something like collectd to gather all Ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular.
>> Otherwise run atop on your storage nodes to identify whether CPU, network or specific HDDs/OSDs are bottlenecks.
>>
>> Deep scrubbing can be _very_ taxing. Do your problems persist if you inject an "osd_scrub_sleep" value of "0.5" into your running cluster (lower that until it hurts again), or if you turn off deep scrubs altogether for the moment?
>>
>> Christian
>>
>> On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
>>
>>> We see something very similar on our Ceph cluster, starting as of today.
>>>
>>> We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse OpenStack cluster (we applied the RBD patches for live migration etc.)
>>>
>>> On this cluster we have a big ownCloud installation (Sync & Share) that stores its files on three NFS servers, each mounting six 2TB RBD volumes and exposing them to around 10 web server VMs (we originally started with one NFS server with a 100TB volume, but that has become unwieldy).
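A quick way to verify the FD theory from the top of this mail is to count the descriptors each QEMU process actually holds and compare that against its soft limit. A minimal sketch, assuming standard /proc paths (adjust the pgrep pattern to however QEMU is named on your hypervisors):

    # show open FD count and soft limit for every QEMU process
    for pid in $(pgrep -f qemu); do
        fds=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
        limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
        echo "qemu pid $pid: $fds open FDs (soft limit $limit)"
    done

If the count sits near 1024 while all volumes are busy, raising the limit (for libvirt-managed guests usually via the max_files setting in /etc/libvirt/qemu.conf, otherwise via the ulimit of whatever launches QEMU - check your own setup) should make the hangs go away and confirm the theory.

For Christian's scrub-throttling suggestion above, the usual way to apply it to a running Firefly cluster is something like:

    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # throttle scrub I/O
    ceph osd set nodeep-scrub                            # stop scheduling new deep scrubs
    ceph osd unset nodeep-scrub                          # re-enable them later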
>>> All of the servers (hypervisors, Ceph storage nodes and VMs) are using Ubuntu 14.04.
>>>
>>> Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The rebalancing process ended this morning (after around 12 hours). The cluster has been clean since then:
>>>
>>>   cluster b1f3f4c8-xxxxx
>>>    health HEALTH_OK
>>>    monmap e2: 3 mons at {zhdk0009=[yyyy:xxxx::1009]:6789/0,zhdk0013=[yyyy:xxxx::1013]:6789/0,zhdk0025=[yyyy:xxxx::1025]:6789/0}, election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
>>>    osdmap e43476: 125 osds: 125 up, 125 in
>>>    pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
>>>          266 TB used, 187 TB / 454 TB avail
>>>          3319 active+clean
>>>            17 active+clean+scrubbing+deep
>>>    client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
>>>
>>> At midnight, we run a script that creates an RBD snapshot of all RBD volumes that are attached to the NFS servers, for backup purposes (a sketch of such a loop appears further down, after the kernel trace). Looking at our monitoring, around that time one of the NFS servers became unresponsive and took down the complete ownCloud installation (load on the web servers was > 200 and they had lost some of the NFS mounts).
>>>
>>> Rebooting the NFS server solved that problem, but the NFS kernel server kept crashing all day long after having run for between 10 and 90 minutes.
>>>
>>> We initially suspected a corrupt RBD volume, as it seemed that we could trigger the kernel crash simply by running "ls -l" on one of the volumes, but subsequent "xfs_repair -n" checks on those RBD volumes showed no problems.
>>>
>>> We migrated the NFS server off of its hypervisor, suspecting a problem with the RBD kernel modules, and rebooted the hypervisor, but the problem persisted (both on the new hypervisor, and on the old one when we migrated it back).
>>>
>>> We changed /etc/default/nfs-kernel-server to start up 256 servers (even though the defaults had been working fine for over a year).
>>>
>>> Only one of our 3 NFS servers crashes (see below for syslog information) - the other 2 have been fine.
>>>
>>> May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>>> May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
>>> May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:44:28 drive-nfs1 kernel: [ 182.958465] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
>>> May 23 21:44:28 drive-nfs1 kernel: [ 183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team
>>> May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not configured.
>>> May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering read-loop.
>>> May 23 21:47:11 drive-nfs1 kernel: [ 346.392283] init: plymouth-upstart-bridge main process ended, respawning
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.778090]       Not tainted 3.13.0-53-generic #89-Ubuntu
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd    D ffff88013fd93180    0  1696    2 0x00000000
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781508]  ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781511]  0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781513]  ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [] schedule_preempt_disabled+0x29/0x70
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [] __mutex_lock_slowpath+0x135/0x1b0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [] mutex_lock+0x1f/0x2f
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [] ? fh_verify+0x14b/0x5e0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [] nfsd_lookup+0x69/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [] nfsd4_lookup+0x1a/0x20 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [] nfsd_dispatch+0xbb/0x200 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [] svc_process_common+0x46d/0x6d0 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [] svc_process+0x107/0x170 [sunrpc]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [] nfsd+0xbf/0x130 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [] ? nfsd_destroy+0x80/0x80 [nfsd]
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [] kthread+0xd2/0xf0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [] ? kthread_create_on_node+0x1c0/0x1c0
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [] ret_from_fork+0x58/0x90
>>> May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [] ? kthread_create_on_node+0x1c0/0x1c0
>>>
>>> Before each crash, we see the disk utilization of one or two random mounted RBD volumes go to 100% - there is no pattern as to which of the RBD disks starts to act up.
>>>
>>> We have scoured the log files of the Ceph cluster for any signs of problems but came up empty.
>>>
>>> The NFS server has almost no load (compared to regular usage) as most sync clients are either turned off (weekend) or have given up connecting to the server.
>>>
>>> There haven't been any configuration changes on the NFS servers prior to the problems. The only change was the addition of 23 OSDs.
>>>
>>> We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>
>>> Our team is completely out of ideas. We have removed the 100TB volume from the NFS server (we used the downtime to migrate the last data off of it to one of the smaller volumes). The NFS server has been running for 30 minutes now (with close to no load) but we don't really expect it to make it until tomorrow.
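Regarding the midnight snapshot job mentioned above: such a script typically boils down to a loop like the one below. This is only a sketch - the pool name "volumes" is made up, and quiescing the filesystems first (e.g. with fsfreeze on the NFS server) helps to get consistent snapshots:

    pool=volumes                                   # hypothetical pool holding the NFS server images
    for img in $(rbd -p $pool ls); do
        rbd -p $pool snap create $img@backup-$(date +%Y%m%d)
    done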
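And on the "disk utilization goes to 100%" observation: while a volume is stuck it can be useful to watch the device from inside the NFS VM with something like "iostat -x 1" (sysstat package) and at the same time check the cluster for blocked requests, e.g.:

    ceph health detail    # health summary, including any slow/blocked request warnings
    ceph -w               # watch the cluster log in real time

If the Ceph side stays quiet while a device in the VM sits at 100% with no I/O completing, that points at the client side (QEMU/librbd) rather than the OSDs - which would be consistent with the FD-limit finding at the top of the thread.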
>>>
>>> send help
>>> Jens-Christian
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com [1]    Global OnLine Japan/Fusion Communications
>> http://www.gol.com/ [2]
>
>
> Links:
> ------
> [1] mailto:chibi at gol.com
> [2] http://www.gol.com/
> [3] mailto:jens-christian.fischer at switch.ch
> [4] mailto:chibi at gol.com
--