Serious performance problems with small file writes

   We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

   Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)

   The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

   We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

   It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactivity and
responsiveness suffer for all the other users of the cluster.
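
   (To be concrete about what "large numbers of small files" means,
the effect on the cluster is roughly what you'd see from something
like the sketch below running on several clients at once against the
CephFS mount. This is purely illustrative, not our actual analysis
code, and /ceph/scratch/testdir is a made-up path.)

    # Illustrative only: write a few thousand ~4 KB files, standing in
    # for the output pattern of the MATLAB/Condor jobs.
    for i in $(seq 1 5000); do
        dd if=/dev/zero of=/ceph/scratch/testdir/file_$i bs=4k count=1 2>/dev/null
    done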

   1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases ceph -s reports many hundreds of requests
blocked for more than 32 seconds.
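
   Is inspecting the blocked requests on an individual OSD through its
admin socket the right way to see what those operations actually are?
For example (osd.3 picked arbitrarily; paths assume the default
locations):

    # Requests currently in flight on one OSD:
    ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_ops_in_flight

    # Recently completed slow requests on the same OSD:
    ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_historic_ops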

   We've had to restrict scrubbing and deep scrubbing to a window
between 01.00 and 04.00 every night, because scrubbing triggers
exactly the same symptoms, even with only 2-3 PGs being scrubbed at
once. When it reached 7 PGs being scrubbed, as it did on Monday, the
cluster was completely unusable.
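
   (For clarity, the window is just a workaround: scrubbing can be
confined to certain hours by toggling the noscrub / nodeep-scrub
flags from cron, along the lines of the sketch below. I believe those
flags are available in 0.80, but if there's a cleaner way to do this
in Firefly, I'd be glad to hear it.)

    # Illustrative /etc/cron.d entries: allow scrubbing only between
    # 01.00 and 04.00.
    0 1 * * * root ceph osd unset noscrub; ceph osd unset nodeep-scrub
    0 4 * * * root ceph osd set noscrub;   ceph osd set nodeep-scrub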

   Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to refer to scrub
specifically, not to ordinary (if possibly pathological) writes.

   What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost as to how to drill down
into the Ceph system to identify performance issues. Is there a
useful guide to the available tools somewhere?
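
   For example, are the per-OSD commit/apply latencies from "ceph osd
perf" and the counters from the OSD admin sockets the sort of thing I
should be watching during these events? Something like:

    # Per-OSD commit and apply latencies (I believe this exists in 0.80):
    ceph osd perf

    # Full performance counters from one OSD (osd.3 as an example):
    ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok perf dump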

   Is an upgrade to 0.84 likely to be helpful? How "development" are
the development releases, from a stability / dangerous bugs point of
view?

   Thanks,
   Hugo.

[1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers

