Re: Slow ceph fs performance

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Wed, 26 Sep 2012 10:26:15 -0500

On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
Hi folks,

Hi Bryan!

	I'm seeing reasonable performance when I run rados
benchmarks, but really slow I/O when reading or writing
from a mounted ceph filesystem.  The rados benchmarks
show about 150 MB/s for both read and write, but when I
go to a client machine with a mounted ceph filesystem
and try to rsync a large (60 GB) directory tree onto
the ceph fs, I'm getting rates of only 2-5 MB/s.

Was the rados benchmark run from the same client machine that the 
filesystem is being mounted on?  Also, what object size did you use for 
rados bench?  Does the directory tree have a lot of small files or a few 
very large ones?

	The OSDs and MDSs are all running 64-bit CentOS 6.3
with the stock CentOS 2.6.32 kernel.  The client is also
64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
There are four OSDs, each with a hardware RAID 5 array
and an SSD for the OSD journal.  The primary network
is a gigabit network, and the OSD, MDS and MON
machines have a dedicated backend gigabit network on a
second network interface.

	Locally on the OSD, "hdparm -t -T" reports read rates
of ~350 MB/s, and bonnie++ shows:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
osd-local    23800M  1037  99 316048  92 131023  19  2272  98 312781  21 521.0  24
Latency             13103us     183ms     123ms   15316us     100ms   75899us
Version  1.96       ------Sequential Create------ --------Random Create--------
osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
Latency             21549us     105us     134us     902us      12us     104us

	While rsyncing the files, the ceph logs show lots
of warnings of the form:

[WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops

	Snooping on traffic with wireshark shows bursts of
activity separated by long periods (30-60 sec) of idle time.

My guess here is that if there is a lot of small IO happening, your SSD 
journal is handling it well and probably writing data really quickly, 
while your spinning-disk RAID 5 probably can't sustain anywhere near the 
IOPS required to keep up.  So you get a burst of network traffic and the 
journal writes it to the SSD quickly until it fills up, then the OSD 
stalls while it waits for the RAID 5 to write the data out.  Whenever the 
journal flushes, a new burst of traffic comes in and the process repeats.
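
To make that guess concrete, here is a toy model of the fill-and-stall 
cycle.  It is only a sketch under made-up assumptions, not how the OSD 
journal is actually implemented: the function name, the rates and the 
"stall until the journal is empty" rule are all hypothetical.

```python
def simulate_journal(ingest_mb_s, drain_mb_s, journal_mb, seconds):
    """Toy model: clients fill a fast journal that drains to slower disks.

    All parameters are hypothetical. Returns (average accepted MB/s,
    number of seconds during which client writes were stalled).
    """
    level = 0.0        # MB currently held in the journal
    accepted = 0.0     # total MB accepted from clients
    stalled = False
    stall_seconds = 0
    for _ in range(seconds):
        # Once the journal has fully drained, accept client writes again.
        if stalled and level == 0.0:
            stalled = False
        if stalled:
            stall_seconds += 1
        else:
            take = min(ingest_mb_s, journal_mb - level)
            level += take
            accepted += take
            if level >= journal_mb:
                stalled = True   # journal is full, clients must wait
        # The backing disks drain the journal at their own (slow) rate.
        level = max(0.0, level - drain_mb_s)
    return accepted / seconds, stall_seconds


# Made-up numbers: 100 MB/s from the network, 5 MB/s to the RAID array
# under small writes, and a 1000 MB journal.
avg, stalls = simulate_journal(100, 5, 1000, 2100)
print("average MB/s: %.1f, stalled seconds: %d" % (avg, stalls))
```

In this model the client sees short bursts at network speed followed by 
long idle periods, and the long-run average is pinned to whatever the 
backing disks can sustain.  A smaller journal shortens each cycle but 
does not raise the average.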

	My first thought was that I was seeing a kind of
"bufferbloat". The SSDs are 120 GB, so they could easily contain
enough data to take a long time to dump.  I changed to using a
journal file, limited to 1 GB, but I still see the same slow
behavior.

	Any advice about how to go about debugging this would
be appreciated.

It'd probably be useful to look at the write sizes going to disk. 
Increasing debugging levels in the Ceph logs will give you that, but it 
can be a lot to parse.  You can also use something like iostat or 
collectl to see what the per-second average write sizes are.
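
If neither tool is handy, the same number can be approximated from 
/proc/diskstats by sampling it twice and dividing the sectors written by 
the writes completed.  The sketch below assumes the standard Linux 
diskstats field layout with 512-byte sectors; the function names are 
just illustrative, and the device name is whatever backs your OSD data.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (writes_completed, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or unexpectedly short lines
        # Fields: major minor name, 4 read counters, then
        # writes completed, writes merged, sectors written, ...
        stats[fields[2]] = (int(fields[7]), int(fields[9]))
    return stats


def avg_write_kib(before, after, device):
    """Average size in KiB of the writes completed between two samples."""
    writes = after[device][0] - before[device][0]
    sectors = after[device][1] - before[device][1]
    if writes <= 0:
        return 0.0
    return sectors * SECTOR_BYTES / writes / 1024.0


def sample_avg_write_kib(device, interval=1.0, path="/proc/diskstats"):
    """Sample the stats file twice, `interval` seconds apart."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return avg_write_kib(before, after, device)
```

Calling sample_avg_write_kib() with the RAID device's name in a loop 
while the rsync runs should show whether the array is seeing writes 
close to the 4096-byte size that appears in the slow request warning 
above.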

					Thanks,
					Bryan

Mark