On 08/07/2012 10:19 PM, Mark Kirkwood wrote:
I've been looking at using Ceph RBD as a block store for database use. As part of this I'm looking at how (particularly random) IO of smallish (4K, 8K) block sizes performs. I've set up Ceph with a single osd and mon spread over two SSDs (Intel 520) - a 2G journal on one and the osd data on the other (xfs filesystem). The Intels are pretty fast, and (despite being shackled by a crappy Nvidia SATA controller) fly for random IO. However I am not seeing that reflected in the RBD case. I have the device mounted on the local machine where the osd and mon are running (so network performance should not be a factor here).

Here is what I did:

Create an rbd device of 10G and mount it on /mnt/vol0:

$ rbd create --size 10240 vol0
$ rbd map vol0
$ mkfs.xfs /dev/rbd0
$ mount /dev/rbd0 /mnt/vol0

Make a file:

$ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4k count=300000 conv=fsync
1228800000 bytes (1.2 GB) copied, 13.4361 s, 91.5 MB/s

Performance is ok if the file size < journal (2G):

$ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4096k count=200 conv=fsync
838860800 bytes (839 MB) copied, 9.47086 s, 88.6 MB/s

Not so good if the file size > journal:

$ dd if=/dev/zero of=/mnt/vol0/dump/file bs=4096k count=1000 conv=fsync
4194304000 bytes (4.2 GB) copied, 279.891 s, 15.0 MB/s

Random writes (see attached file; a simplified sketch appears below) sync'ed with sync_file_range are ok if the block size is big:

$ ./writetest /mnt/vol0/dump/file 4194304 0 1
random writes: 292 of: 4194304 bytes elapsed: 9.8397s io rate: 30/s (118.70 MB/s)

$ ./writetest /mnt/vol0/dump/file 1048576 0 1
random writes: 1171 of: 1048576 bytes elapsed: 10.6042s io rate: 110/s (110.43 MB/s)

$ ./writetest /mnt/vol0/dump/file 131072 0 1
random writes: 9375 of: 131072 bytes elapsed: 15.8075s io rate: 593/s (74.13 MB/s)

However a smallish block size is suicide (it triggers the suicide assert after a while), and I see 100 IOPS or less on the actual devices, all at 100% util:

$ ./writetest /mnt/vol0/dump/file 8192 0 1

I am running into http://tracker.newdream.net/issues/2784 here I think.
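For anyone who doesn't want to open the attachment, the gist of writetest is roughly the following - a simplified sketch rather than the exact attached program, with the extra command line arguments left out:

/*
 * Simplified sketch of the writetest idea (not the exact attached program):
 * pick random block-aligned offsets in an existing file, pwrite() a block at
 * each, and flush every write individually with sync_file_range().
 * Usage: ./writetest-sketch <file> <blocksize>
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static double now_secs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <blocksize>\n", argv[0]);
        return 1;
    }

    size_t bs = (size_t)atol(argv[2]);
    if (bs == 0) {
        fprintf(stderr, "bad block size\n");
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    fstat(fd, &st);
    long nblocks = st.st_size / (long)bs;   /* one file's worth of writes */
    if (nblocks <= 0) {
        fprintf(stderr, "file smaller than block size\n");
        return 1;
    }

    char *buf = malloc(bs);
    if (!buf) {
        perror("malloc");
        return 1;
    }
    memset(buf, 'x', bs);
    srandom(getpid());

    double start = now_secs();
    for (long i = 0; i < nblocks; i++) {
        off_t off = (off_t)(random() % nblocks) * (off_t)bs;
        if (pwrite(fd, buf, bs, off) != (ssize_t)bs) {
            perror("pwrite");
            return 1;
        }
        /* flush just this range so each write hits the device individually */
        sync_file_range(fd, off, bs,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
    }
    double elapsed = now_secs() - start;

    printf("random writes: %ld of: %zu bytes elapsed: %.4fs io rate: %.0f/s (%.2f MB/s)\n",
           nblocks, bs, elapsed, nblocks / elapsed,
           nblocks * (double)bs / elapsed / (1024 * 1024));

    free(buf);
    close(fd);
    return 0;
}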
This can be a sign of a bug in the underlying filesystem or hardware - maybe your controller? That assert is hit when a single operation to the filesystem beneath the osd takes longer than 180 seconds (by default).
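If you want more headroom while you dig into it, I believe the knob behind that threshold is the filestore op thread suicide timeout (which is where the 180 second default comes from), so something like this in ceph.conf should raise it - treat the option name as from memory and double-check it against your version:

[osd]
    ; assumed option name, default 180 seconds - verify before relying on it
    filestore op thread suicide timeout = 600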
Note that the actual SSDs are very fast for this when accessed directly:

$ ./writetest /data1/ceph/1/file 8192 0 1
random writes: 1000000 of: 8192 bytes elapsed: 125.7907s io rate: 7950/s (62.11 MB/s)

Thanks for your patience in reading so far - some actual questions now :-)

1/ Why is the appending write from dd so slow when the file size > journal, despite reasonably capable storage devices?
It's possible you need to use more threads to have more operations in flight into the filestore (the main storage for the osd). Try something like this in your ceph configuration for the osds:

    osd op threads = 24
    osd disk threads = 24
    filestore op threads = 6
    filestore queue max ops = 24

(from http://www.spinics.net/lists/ceph-devel/msg07128.html)
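To be explicit, those lines would go under the [osd] section of ceph.conf on the node running the osd (I'm assuming the usual single-file layout here), e.g.:

[osd]
    ; more threads and a deeper queue between the osd and its filestore
    osd op threads = 24
    osd disk threads = 24
    filestore op threads = 6
    filestore queue max ops = 24

Restart the osd afterwards so the new settings take effect.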
2/ Is the sudden dramatic drop in random write performance a manifestation of the "small requests are slow" issue? Or is this something else?
It's probably that. Sam's actively looking into it, and once he has something it will be interesting to see how well it works on your hardware.

Josh
Thanks Mark