On 15.05.2014 09:56, Josef Johansson wrote:
>
> On 15/05/14 09:11, Stefan Priebe - Profihost AG wrote:
>> On 15.05.2014 00:26, Josef Johansson wrote:
>>> Hi,
>>>
>>> So, apparently tmpfs does not support non-root xattrs because of a
>>> possible DoS vector. The configuration for enabling it is set, as far
>>> as I can see:
>>>
>>> CONFIG_TMPFS=y
>>> CONFIG_TMPFS_POSIX_ACL=y
>>> CONFIG_TMPFS_XATTR=y
>>>
>>> Does anyone know a way around it? I saw that there's a patch for
>>> enabling it, but recompiling my kernel is out of reach right now ;)
>> I would create an empty file in tmpfs and then format that file as a
>> block device.
> How do you mean exactly? Creating it with dd and mounting it with losetup?

mount -t tmpfs -o size=4G tmpfs /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount -o loop /mnt/blockdev_a /ceph/osd.X

Then use /mnt/blockdev_a as the OSD device.

>
> Cheers,
> Josef
>>> I created the OSD with the following:
>>>
>>> root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
>>> root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
>>> root@osd1:/# mkfs.xfs /dev/loop0
>>> root@osd1:/# ceph osd create
>>> 50
>>> root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
>>> root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
>>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
>>> 2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>>> 2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
>>> 2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>>> 2014-05-15 00:20:29.807237 7f40063bb780 -1 filestore(/var/lib/ceph/osd/ceph-50) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
>>> 2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store /var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid c51a2683-55dc-4634-9d9d-f0fec9a6f389
>>> 2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file: /var/lib/ceph/osd/ceph-50/keyring: can't open /var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
>>> 2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring /var/lib/ceph/osd/ceph-50/keyring
>>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
>>> 2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>>> 2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>>> 2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 != superblock's -1
>>> 2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
>>>
>>> Cheers,
>>> Josef
>>>
>>> On 2014-05-14 14:33, Christian Balzer wrote:
>>>> Hello!
>>>>
>>>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>>>
>>>>> Hi Christian,
>>>>>
>>>>> I missed this thread, haven't been following the list that closely
>>>>> the last few weeks.
>>>>>
>>>>> You already know my setup, since we discussed it in an earlier thread.
>>>>> I don't have a fast backing store, but I see the slow IOPS when doing
>>>>> randwrite inside the VM, with rbd cache. Still running dumpling here
>>>>> though.
>>>>>
>>>> Nods, I do recall that thread.
>>>>
>>>>> A thought struck me that I could test with a pool that consists of OSDs
>>>>> that have tmpfs-based disks; I think I have a bit more latency than your
>>>>> IPoIB, but I've pushed 100k IOPS with the same network devices before.
>>>>> This would verify whether the problem is with the journal disks. I'll
>>>>> also try to run the journal devices in tmpfs as well, as that would
>>>>> test Ceph itself in isolation.
>>>>>
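For a tmpfs-backed test OSD, the file-plus-loop-device route above should do it; the one thing worth checking up front is that user xattrs actually work on the XFS filesystem inside the file, since that is exactly what tmpfs itself refuses. A rough sketch, with paths and sizes as placeholders only:

mount -t tmpfs -o size=8G tmpfs /mnt/ram
truncate -s 6G /mnt/ram/osd.img              # sparse file, same idea as the dd seek=6G above
mkfs.xfs -f /mnt/ram/osd.img
mount -o loop /mnt/ram/osd.img /var/lib/ceph/osd/ceph-50
touch /var/lib/ceph/osd/ceph-50/xattr_test
setfattr -n user.test -v ok /var/lib/ceph/osd/ceph-50/xattr_test
getfattr -n user.test /var/lib/ceph/osd/ceph-50/xattr_test   # should print user.test="ok"

setfattr/getfattr come with the attr package; if they succeed here, the xattrs the OSD needs should work as well.
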
>>>> That would be interesting indeed.
>>>> Given what I've seen (with the journal at 20% utilization and the actual
>>>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>>>
>>>>> I'll get back to you with the results, hopefully I'll manage to get
>>>>> them done during the night.
>>>>>
>>>> Looking forward to that. ^^
>>>>
>>>>
>>>> Christian
>>>>> Cheers,
>>>>> Josef
>>>>>
>>>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>>> I'm clearly talking to myself, but whatever.
>>>>>>
>>>>>> For Greg, I've played with all the pertinent journal and filestore
>>>>>> options and TCP nodelay; no changes at all.
>>>>>>
>>>>>> Is there anybody on this ML who's running a Ceph cluster with a fast
>>>>>> network and FAST filestore, so, like me, with a big HW cache in front
>>>>>> of RAIDs/JBODs, or using SSDs for the final storage?
>>>>>>
>>>>>> If so, what results do you get out of the fio statement below per OSD?
>>>>>> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
>>>>>> which is of course vastly more than normal individual HDDs could do.
>>>>>>
>>>>>> So I'm wondering if I'm hitting some inherent limitation of how many
>>>>>> IOPS a single OSD (as in the software) can handle, given that
>>>>>> everything else has been ruled out from where I stand.
>>>>>>
>>>>>> This would also explain why none of the option changes or the use of
>>>>>> RBD caching has any measurable effect in the test case below.
>>>>>> As in, a slow OSD, aka a single HDD with the journal on the same disk,
>>>>>> would clearly benefit from even the small 32MB standard RBD cache,
>>>>>> while in my test case the only time the caching becomes noticeable is
>>>>>> if I increase the cache size to something larger than the test data
>>>>>> size. ^o^
>>>>>>
>>>>>> On the other hand, if people here regularly get thousands or tens of
>>>>>> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>>>
>>>>>> Christian
>>>>>>
>>>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>>>
>>>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>>>
>>>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>>>>>>> throughput on the RAID array with OSD access patterns, and that's
>>>>>>>> applying back pressure on the journal.
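If back pressure on the journal is the suspicion, the OSD admin socket is a more direct way to look at it than iostat; a sketch, assuming the default socket path (the exact counter and command names vary a bit between releases):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

The filestore/journal sections of perf dump include queue sizes and journal latencies, and dump_ops_in_flight shows whether ops are actually piling up inside the OSD rather than on the disks.
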
>>>>>>>>
>>>>>>> In the tradition of a picture being worth a thousand words, I give you
>>>>>>> this iostat -x output taken during a fio run:
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           50.82    0.00   19.43    0.17    0.00   29.58
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s    r/s      w/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> sda        0.00   51.50   0.00  1633.50    0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
>>>>>>> sdb        0.00    0.00   0.00  1240.50    0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
>>>>>>> sdc        0.00    5.00   0.00  2468.50    0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
>>>>>>> sdd        0.00    6.50   0.00  1913.00    0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
>>>>>>>
>>>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes;
>>>>>>> note the nearly complete absence of iowait.
>>>>>>>
>>>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>>>>>>> Look at these numbers: the lack of queues, the low wait and service
>>>>>>> times (in ms), plus the overall utilization.
>>>>>>>
>>>>>>> The only conclusion I can draw from these numbers and the network
>>>>>>> results below is that the latency happens within the OSD processes.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian
>>>>>>>> When I suggested other tests, I meant with and without Ceph. One
>>>>>>>> particular one is OSD bench. That should be interesting to try at a
>>>>>>>> variety of block sizes. You could also try running RADOS bench and
>>>>>>>> smalliobench at a few different sizes.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Christian,
>>>>>>>>>
>>>>>>>>> Have you tried without RAID6, to have more OSDs?
>>>>>>>>> (How many disks do you have behind the RAID6?)
>>>>>>>>>
>>>>>>>>> Also, I know that direct IOs can be quite slow with Ceph,
>>>>>>>>> so maybe you can try without --direct=1
>>>>>>>>>
>>>>>>>>> and also enable rbd_cache
>>>>>>>>>
>>>>>>>>> ceph.conf
>>>>>>>>> [client]
>>>>>>>>> rbd cache = true
>>>>>>>>>
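Beyond just switching the cache on, it has a few knobs; a sketch of what I would look at, with option names as of dumpling/emperor and the values as examples only:

[client]
rbd cache = true
rbd cache size = 67108864                     # 64 MB; the default is 32 MB
rbd cache max dirty = 50331648                # upper bound on dirty bytes held in the cache
rbd cache writethrough until flush = true     # stay in writethrough until the guest sends its first flush

With a 4 KB random-write test the cache mostly helps once it is larger than the working set, which matches what Christian describes further up.
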
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----- Original message -----
>>>>>>>>>
>>>>>>>>> From: "Christian Balzer" <chibi@gol.com>
>>>>>>>>> To: "Gregory Farnum" <greg@inktank.com>, ceph-users@lists.ceph.com
>>>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>>>
>>>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi@gol.com> wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
>>>>>>>>>>> journals are on (separate) DC 3700s, the actual OSDs are RAID6
>>>>>>>>>>> behind an Areca 1882 with 4GB of cache.
>>>>>>>>>>>
>>>>>>>>>>> Running this fio:
>>>>>>>>>>>
>>>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
>>>>>>>>>>> --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>>>>>>>>>>>
>>>>>>>>>>> results in:
>>>>>>>>>>>
>>>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
>>>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>>>> 2900 IOPS from a kernelspace-mounted RBD on the host
>>>>>>>>>>>
>>>>>>>>>>> When running the fio from the VM RBD, the utilization of the
>>>>>>>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
>>>>>>>>>>> (1500 IOPS after some obvious merging).
>>>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop,
>>>>>>>>>>> but the system is not CPU or otherwise resource starved at that
>>>>>>>>>>> moment.
>>>>>>>>>>>
>>>>>>>>>>> Running multiple instances of this test from several VMs on
>>>>>>>>>>> different hosts changes nothing, as in the aggregated IOPS for
>>>>>>>>>>> the whole cluster will still be around 3200 IOPS.
>>>>>>>>>>>
>>>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is
>>>>>>>>>>> IPoIB with the associated low latency, and the journal SSDs are
>>>>>>>>>>> the (consistently) fastest ones around.
>>>>>>>>>>>
>>>>>>>>>>> I guess what I am wondering about is if this is normal and to be
>>>>>>>>>>> expected, or if not, where all that potential performance got lost.
>>>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>>> Yes, but going down to 32 doesn't change things one iota.
>>>>>>>>> Also note the multiple instances I mention up there, so that would
>>>>>>>>> be 256 IOs at a time, coming from different hosts over different
>>>>>>>>> links, and nothing changes.
>>>>>>>>>
>>>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which
>>>>>>>>>> seems awfully long. You should check what your client-side objecter
>>>>>>>>>> settings are; it might be limiting you to fewer outstanding ops
>>>>>>>>>> than that.
>>>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and
>>>>>>>>> the bug tracker and nothing at all as far as configuration options
>>>>>>>>> are concerned. Care to enlighten me where one can find those?
>>>>>>>>>
>>>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very
>>>>>>>>> much in the same (junior league) ballpark.
>>>>>>>>>
>>>>>>>>>> If it's available to you, testing with Firefly or even master would
>>>>>>>>>> be interesting; there's some performance work that should reduce
>>>>>>>>>> latencies.
>>>>>>>>>>
>>>>>>>>> Not an option, this is going into production next week.
>>>>>>>>>
>>>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
>>>>>>>>>> certainly doesn't require 40ms/op, so you should probably run a
>>>>>>>>>> wider array of experiments to try and figure out where it's coming
>>>>>>>>>> from.
>>>>>>>>> I think we can rule out the network; NPtcp gives me:
>>>>>>>>> ---
>>>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> For comparison, at about 512KB it reaches maximum throughput and
>>>>>>>>> still isn't that laggy:
>>>>>>>>> ---
>>>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> So with the network performing as well as my lengthy experience with
>>>>>>>>> IPoIB led me to believe, what else is there to look at?
>>>>>>>>> The storage nodes perform just as expected, as indicated by the
>>>>>>>>> local fio tests.
>>>>>>>>>
>>>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really
>>>>>>>>> sure what experiments I should run on that. ^o^
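For what it's worth, the client-side settings Greg refers to are presumably the objecter throttles, settable in the [client] section (objecter inflight ops and objecter inflight op bytes, which cap how many outstanding ops librbd will keep in flight). And for Ceph-only experiments along the lines Greg suggested earlier, a sketch, with the OSD id, pool name and sizes as examples only:

ceph tell osd.0 bench 1073741824 4096      # write 1 GB in 4 KB chunks straight into one OSD, no RBD involved
rados -p rbd bench 30 write -b 4096 -t 32  # 4 KB object writes via librados, 32 concurrent ops, 30 seconds

If osd bench at 4 KB is also stuck in the low thousands of IOPS, the limit sits inside the OSD rather than in RBD or the network.
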
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> -Greg
>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>>> chibi@gol.com           Global OnLine Japan/Fusion Communications
>>>>>>>>> http://www.gol.com/
>