Serious performance problems with small file writes

Hi,

On 20 Aug 2014, at 16:55, German Anders <ganders at despegar.com> wrote:

Hi Dan,

      How are you? How did you disable the indexing on the /var/lib/ceph OSDs?


# grep ceph /etc/updatedb.conf
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/cache/ccache /var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph"



Did you disable deep scrub on your OSDs?


No, but this can be an issue. If you get many PGs scrubbing at once, performance will suffer.

There is a new feature in 0.67.10 to sleep between scrubbing "chunks". I set that sleep to 0.1 (and the chunk max to 5, and the scrub size to 1MB). In 0.67.10+1 there are some new options to set the io priority of the scrubbing threads. Set that to class = 3, priority = 0 to give the scrubbing thread the idle priority. You need to use the cfq disk scheduler for io priorities to work. (cfq will also help if updatedb is causing any problems, since it runs with ionice -c 3.)

I'm pretty sure those features will come in 0.80.6 as well.
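
Roughly, the knobs I mean (option names as I understand them, so double-check your version's docs; the "scrub size" one I take to be osd_deep_scrub_stride; the values are just the ones mentioned above, not a recommendation):

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_max 5'
ceph tell osd.* injectargs '--osd_deep_scrub_stride 1048576'    # the "scrub size", 1MB

# idle io priority for the disk/scrub thread (class 3 == idle in ionice terms)
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 0'

# io priorities only work with cfq; sdb is a placeholder for an OSD data disk
echo cfq > /sys/block/sdb/queue/scheduler

Put the same keys under [osd] in ceph.conf to make them persistent across restarts.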

Do you have the journals on SSDs or RAMDISK?


Never use RAMDISK.

We currently have the journals on the same spinning disk as the OSD, but the iops performance is low for the rbd and fs use-cases. (For an object store it should be OK.) For rbd or fs, you really need journals on SSDs or your cluster will suffer.

We now have SSDs on order to augment our cluster. (The way I justified this is that our cluster has X TB of storage capacity and Y iops capacity. With disk journals we will run out of iops capacity well before we run out of storage capacity. So you can either increase the iops capacity substantially by decreasing the volume of the cluster by 20% and replacing those disks with SSD journals, or you can just leave 50% of the disk capacity empty, since you can't use it anyway.)
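
To put numbers on the shape of that argument (entirely made-up figures, not our actual cluster):

OSDS=18; DISK_TB=4; IOPS_PER_DISK=100; REPLICAS=3
echo "usable capacity: $(( OSDS * DISK_TB / REPLICAS )) TB"        # 24 TB
echo "client write iops: $(( OSDS * IOPS_PER_DISK / REPLICAS ))"   # ~600 with disk journals
# If your users need a few thousand iops, you hit the iops wall long before 24 TB
# fills up; SSD journals (roughly 5x in my tests, see below) move that wall without
# giving up much capacity.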


What's the perf of your cluster? rados bench? fio? I've set up a new cluster and I want to know what the best configuration scheme would be.

It's not really meaningful to compare the performance of different clusters with different hardware. Some "constants" I can advise:
  - with few clients, large write throughput is limited by the client's bandwidth, as long as you have enough OSDs and the client is striping over many objects.
  - with disk journals, small write latency will be ~30-50ms even when the cluster is idle. With SSD journals, maybe ~10ms.
  - count your iops. Each disk OSD can do ~100, and you need to divide by the number of replicas. With SSDs you can do a bit better than this, since the synchronous writes go to the SSDs, not the disks. In my tests with our hardware, I estimate that going from disk to SSD journals will multiply the iops capacity by around 5x.
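
That said, if you want a baseline for your own cluster rather than a comparison with ours, rados bench is the easiest starting point. Something like the following, against a throwaway pool ("testpool" is just a placeholder):

rados bench -p testpool 60 write -b 4096 -t 16   # 60s of 4KB-object writes, 16 in flight (iops/latency)
rados bench -p testpool 60 write -t 16           # 60s of default 4MB-object writes (throughput)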

I also found that I needed to increase some of the journal max write and journal queue max limits, as well as the filestore limits, to squeeze the best performance out of the SSD journals. Try increasing filestore queue max ops/bytes, filestore queue committing max ops/bytes, and the filestore wbthrottle xfs * options. (I'm not going to publish exact configs here because I haven't finished tuning yet.)
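
Purely as a pointer to the knobs, not my values, the options in question live under [osd] and can be poked at runtime like this (placeholder numbers, measure before and after):

ceph tell osd.* injectargs '--journal_max_write_bytes 10485760 --journal_max_write_entries 1000'
ceph tell osd.* injectargs '--journal_queue_max_ops 3000 --journal_queue_max_bytes 33554432'
ceph tell osd.* injectargs '--filestore_queue_max_ops 500 --filestore_queue_max_bytes 104857600'
ceph tell osd.* injectargs '--filestore_queue_committing_max_ops 500 --filestore_queue_committing_max_bytes 104857600'
# the wbthrottle family has ios/bytes/inodes start_flusher and hard_limit variants, e.g.:
ceph tell osd.* injectargs '--filestore_wbthrottle_xfs_bytes_start_flusher 41943040'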

Cheers, Dan


Thanks a lot!!

Best regards,

German Anders

On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
Hi,

Do you get slow requests during the slowness incidents? What about monitor elections?
Are your MDSs using a lot of CPU? Did you try tuning anything in the MDS? (I think the default config is still conservative, and there are options to cache more entries, etc.)
What about iostat on the OSDs? Are your OSD disks busy reading or writing during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading the disks heavily. One thing to check is updatedb: we had to stop it from indexing /var/lib/ceph on our OSDs.
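
Concretely, the kind of commands I mean (device names and intervals are placeholders):

ceph -w                  # watch op/s and slow-request warnings as they happen
ceph health detail       # which requests are blocked, and on which OSDs
iostat -xm 2 /dev/sdb    # per-disk utilisation on an OSD host
top -c                   # CPU usage of ceph-osd / ceph-mon / ceph-mds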

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mills at reading.ac.uk> wrote:

    We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

    Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
    machines, in the analysis lab)

    The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

    We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

    It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactiveness and
responsiveness for all the other users of the cluster suffer.

    1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases, ceph -s shows up to many hundreds of
requests blocked for more than 32s.

    We've had to turn off scrubbing and deep scrubbing completely --
except between 01.00 and 04.00 every night -- because it triggers the
exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
up to 7 PGs being scrubbed, as it did on Monday, it's completely
unusable.
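
    (For concreteness, that kind of window can be run by toggling the
cluster-wide flags from cron; illustrative /etc/crontab entries matching
those times, not necessarily exactly what we use:

0 1 * * * root ceph osd unset noscrub; ceph osd unset nodeep-scrub
0 4 * * * root ceph osd set noscrub;   ceph osd set nodeep-scrub
)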

    Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to be referencing
scrub specifically, not ordinary (if possibly pathological) writes.

    What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost about how to drill down
into the ceph system for identifying performance issues. Is there a
useful guide to tools somewhere?

    Is an upgrade to 0.84 likely to be helpful? How "development" are
the development releases, from a stability / dangerous bugs point of
view?

    Thanks,
    Hugo.

[1] http://tracker.ceph.com/issues/6278

--
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



