Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
> Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 250000 small
> files in it) and giving you more details.

Yes, the tarball with the 250000 small files in it is definitely a
reproducer.

[snip]

> > What about iostat on the OSDs? Are your OSD disks busy reading or
> > writing during these incidents?
>
> Not sure. I don't think so, but I'll try to trigger an incident and
> report back on this one.

Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes and
200-300 kB/s reads on all three, but it fluctuates a lot (with
5-second intervals). Sample data at the end of the email.

> > What are you using for OSD journals?
>
> On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
>
> > Also check the CPU usage for the mons and osds...
>
> The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.

The mons are just ticking over, with <1% CPU usage.

> > Does your hardware provide enough IOPS for what your users need?
> > (e.g. what is the op/s from ceph -w)
>
> Not really an answer to your question, but: Before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.
>
> I'll look at how the op/s values change when we have the problem.
> At the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

With minimal users and one machine running the tar unpacking process,
I'm getting somewhere around 100-200 op/s on the ceph cluster, but
interactivity on the desktop machine I'm logged in on is horrible --
I'm frequently getting tens of seconds of latency. Compare that to the
(relatively) comfortable 350-400 op/s we had yesterday with what were
probably workloads of larger files.

> > If disabling deep scrub helps, then it might be that something else
> > is reading the disks heavily. One thing to check is updatedb -- we
> > had to disable it from indexing /var/lib/ceph on our OSDs.
>
> I haven't seen that running at all during the day, but I'll look
> into it.

No, it's not anything like that -- iotop reports that pretty much the
only things doing IO are ceph-osd and the occasional xfsaild. (There's
a rough sketch of that updatedb exclusion at the bottom of this mail,
for reference.)

Hugo.

> Hugo.
>
> > Best Regards,
> > Dan
> >
> > -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> >
> > On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mills at reading.ac.uk> wrote:
> >
> > > We have a ceph system here, and we're seeing performance regularly
> > > descend into unusability for periods of minutes at a time (or
> > > longer). This appears to be triggered by writing large numbers of
> > > small files.
> > >
> > > Specifications:
> > >
> > >   ceph 0.80.5
> > >   6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> > >   2 machines running primary and standby MDS
> > >   3 monitors on the same machines as the OSDs
> > >   Infiniband to about 8 CephFS clients (headless, in the machine room)
> > >   Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
> > >     machines, in the analysis lab)
> > >
> > > The cluster stores home directories of the users and a larger area
> > > of scientific data (approx 15 TB) which is being processed and
> > > analysed by the users of the cluster.
> > >
> > > We have a relatively small number of concurrent users (typically
> > > 4-6 at most), who use GUI tools to examine their data, and then
> > > complex sets of MATLAB scripts to process it, with processing often
> > > being distributed across all the machines using Condor.
> > >
> > > It's not unusual to see the analysis scripts write out large
> > > numbers (thousands, possibly tens or hundreds of thousands) of small
> > > files, often from many client machines at once in parallel. When this
> > > happens, the ceph cluster becomes almost completely unresponsive for
> > > tens of seconds (or even for minutes) at a time, until the writes are
> > > flushed through the system. Given the nature of modern GUI desktop
> > > environments (often reading and writing small state files in the
> > > user's home directory), this means that desktop interactiveness and
> > > responsiveness for all the other users of the cluster suffer.
> > >
> > > 1-minute load on the servers typically peaks at about 8 during
> > > these events (on 4-core machines). Load on the clients also peaks
> > > high, because of the number of processes waiting for a response from
> > > the FS. The MDS shows little sign of stress -- it seems to be entirely
> > > down to the OSDs. ceph -w shows requests blocked for more than 10
> > > seconds, and in bad cases, ceph -s shows up to many hundreds of
> > > requests blocked for more than 32s.
> > >
> > > We've had to turn off scrubbing and deep scrubbing completely --
> > > except between 01.00 and 04.00 every night -- because it triggers the
> > > exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> > > up to 7 PGs being scrubbed, as it did on Monday, it's completely
> > > unusable.
> > >
> > > Is this problem something that's often seen? If so, what are the
> > > best options for mitigation or elimination of the problem? I've found
> > > a few references to issue #6278 [1], but that seems to be referencing
> > > scrub specifically, not ordinary (if possibly pathological) writes.
> > >
> > > What are the sorts of things I should be looking at to work out
> > > where the bottleneck(s) are? I'm a bit lost about how to drill down
> > > into the ceph system for identifying performance issues. Is there a
> > > useful guide to tools somewhere?
> > >
> > > Is an upgrade to 0.84 likely to be helpful? How "development" are
> > > the development releases, from a stability / dangerous bugs point of
> > > view?
> > >
> > > Thanks,
> > > Hugo.
> > >
> > > [1] http://tracker.ceph.com/issues/6278

--
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building
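
P.S. Two quick sketches, in case they're useful to anyone following the
thread. Neither is copied from a live config, so treat the file names,
paths and flag names as indicative rather than exact.

The updatedb exclusion Dan mentions is normally a one-line change on
mlocate-based systems: add /var/lib/ceph to the PRUNEPATHS list in
/etc/updatedb.conf (merged into the distribution's existing line, not
replacing it), something like:

    # /etc/updatedb.conf (excerpt) -- illustrative values only; keep your
    # distribution's existing entries and just append /var/lib/ceph
    PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"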
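
For the scrub window, one way to confine scrubbing to a period like our
01.00-04.00 one is a pair of cron jobs toggling the cluster-wide
noscrub/nodeep-scrub flags (which I believe are available in 0.80) -- a
sketch only, not necessarily what anyone's production crontab looks like:

    # /etc/cron.d/ceph-scrub-window (sketch)
    # Allow scrubbing between 01.00 and 04.00, block it the rest of the day.
    0 1 * * * root ceph osd unset noscrub && ceph osd unset nodeep-scrub
    0 4 * * * root ceph osd set noscrub && ceph osd set nodeep-scrub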