Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
> Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 250000 small
> files in it) and giving you more details.

Yes, the tarball with the 250000 small files in it is definitely a
reproducer.

[snip]

> > What about iostat on the OSDs? Are your OSD disks busy reading or
> > writing during these incidents?
>
> Not sure. I don't think so, but I'll try to trigger an incident and
> report back on this one.

Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes and
200-300 kB/s reads on all three, but it fluctuates a lot (with
5-second intervals). Sample data at the end of the email.

> > What are you using for OSD journals?
>
> On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
>
> > Also check the CPU usage for the mons and osds...
>
> The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.

The mons are just ticking over, with <1% CPU usage.

> > Does your hardware provide enough IOPS for what your users need?
> > (e.g. what is the op/s from ceph -w)
>
> Not really an answer to your question, but: Before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.
>
> I'll look at how the op/s values change when we have the problem.
> At the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

With minimal users and one machine running the tar unpacking process,
I'm getting somewhere around 100-200 op/s on the ceph cluster, but
interactivity on the desktop machine I'm logged in on is horrible --
I'm frequently getting tens of seconds of latency. Compare that to the
(relatively) comfortable 350-400 op/s we had yesterday with what were
probably workloads of larger files.

> > If disabling deep scrub helps, then it might be that something else
> > is reading the disks heavily. One thing to check is updatedb -- we
> > had to disable it from indexing /var/lib/ceph on our OSDs.
>
> I haven't seen that running at all during the day, but I'll look
> into it.

No, it's not anything like that -- iotop reports that pretty much the
only things doing IO are ceph-osd and the occasional xfsaild. (There's
a rough sketch of that updatedb exclusion at the bottom of this mail,
for reference.)

Hugo.

> Hugo.
>
> > Best Regards,
> > Dan
> >
> > -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> >
> > On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mills at reading.ac.uk> wrote:
> >
> > > We have a ceph system here, and we're seeing performance regularly
> > > descend into unusability for periods of minutes at a time (or
> > > longer). This appears to be triggered by writing large numbers of
> > > small files.
> > >
> > > Specifications:
> > >
> > >   ceph 0.80.5
> > >   6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> > >   2 machines running primary and standby MDS
> > >   3 monitors on the same machines as the OSDs
> > >   Infiniband to about 8 CephFS clients (headless, in the machine room)
> > >   Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
> > >     machines, in the analysis lab)
> > >
> > > The cluster stores home directories of the users and a larger area
> > > of scientific data (approx 15 TB) which is being processed and
> > > analysed by the users of the cluster.
> > >
> > > We have a relatively small number of concurrent users (typically
> > > 4-6 at most), who use GUI tools to examine their data, and then
> > > complex sets of MATLAB scripts to process it, with processing often
> > > being distributed across all the machines using Condor.
> > >
> > > It's not unusual to see the analysis scripts write out large
> > > numbers (thousands, possibly tens or hundreds of thousands) of small
> > > files, often from many client machines at once in parallel. When this
> > > happens, the ceph cluster becomes almost completely unresponsive for
> > > tens of seconds (or even for minutes) at a time, until the writes are
> > > flushed through the system. Given the nature of modern GUI desktop
> > > environments (often reading and writing small state files in the
> > > user's home directory), this means that desktop interactiveness and
> > > responsiveness for all the other users of the cluster suffer.
> > >
> > > 1-minute load on the servers typically peaks at about 8 during
> > > these events (on 4-core machines). Load on the clients also peaks
> > > high, because of the number of processes waiting for a response from
> > > the FS. The MDS shows little sign of stress -- it seems to be entirely
> > > down to the OSDs. ceph -w shows requests blocked for more than 10
> > > seconds, and in bad cases, ceph -s shows up to many hundreds of
> > > requests blocked for more than 32s.
> > >
> > > We've had to turn off scrubbing and deep scrubbing completely --
> > > except between 01.00 and 04.00 every night -- because it triggers the
> > > exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> > > up to 7 PGs being scrubbed, as it did on Monday, it's completely
> > > unusable.
> > >
> > > Is this problem something that's often seen? If so, what are the
> > > best options for mitigation or elimination of the problem? I've found
> > > a few references to issue #6278 [1], but that seems to be referencing
> > > scrub specifically, not ordinary (if possibly pathological) writes.
> > >
> > > What are the sorts of things I should be looking at to work out
> > > where the bottleneck(s) are? I'm a bit lost about how to drill down
> > > into the ceph system for identifying performance issues. Is there a
> > > useful guide to tools somewhere?
> > >
> > > Is an upgrade to 0.84 likely to be helpful? How "development" are
> > > the development releases, from a stability / dangerous bugs point of
> > > view?
> > >
> > > Thanks,
> > > Hugo.
> > >
> > > [1] http://tracker.ceph.com/issues/6278

--
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building
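
P.S. Two quick sketches, in case they're useful to anyone following the
thread. Neither is copied from a live config, so treat the file names,
paths and flag names as indicative rather than exact.

The updatedb exclusion Dan mentions is normally a one-line change on
mlocate-based systems: add /var/lib/ceph to the PRUNEPATHS list in
/etc/updatedb.conf (merged into the distribution's existing line, not
replacing it), something like:

    # /etc/updatedb.conf (excerpt) -- illustrative values only; keep your
    # distribution's existing entries and just append /var/lib/ceph
    PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"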
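
For the scrub window, one way to confine scrubbing to a period like our
01.00-04.00 one is a pair of cron jobs toggling the cluster-wide
noscrub/nodeep-scrub flags (which I believe are available in 0.80) -- a
sketch only, not necessarily what anyone's production crontab looks like:

    # /etc/cron.d/ceph-scrub-window (sketch)
    # Allow scrubbing between 01.00 and 04.00, block it the rest of the day.
    0 1 * * * root ceph osd unset noscrub && ceph osd unset nodeep-scrub
    0 4 * * * root ceph osd set noscrub && ceph osd set nodeep-scrub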