> On 18 Aug 2015, at 13:58, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>> Sent: 18 August 2015 12:41
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: How to improve single thread sequential reads?
>>
>> Reply in text
>>
>>> On 18 Aug 2015, at 12:59, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>>>> Sent: 18 August 2015 11:50
>>>> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
>>>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
>>>> Subject: Re: How to improve single thread sequential reads?
>>>>
>>>> I'm not sure if I missed it, but are you testing in a VM backed by an RBD device, or using the device directly?
>>>>
>>>> I don't see how blk-mq would help if it's not a VM; it just passes the request to the underlying block device, and in the case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...?
>>>
>>> Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why, for large block/sequential work, that testing kernel is essential. I think the fix should make it into 4.2, hopefully.
>>
>> blk-mq is supposed to remove the redundancy of having:
>>
>> IO scheduler in VM -> VM block device -> host IO scheduler -> block device
>>
>> It's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop).
>>
>> But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive were a krbd device with a queue.
>>
>> If there's no VM, there should be no blk-mq?
>
> I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be much more about enhancing the overall block-layer performance in Linux:
>
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
>> So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krbd device, but that's probably not your case, is it?
>>
>>>> Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) switching to the cfq scheduler and setting the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.
>>>
>>> I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object size, from what I can tell), and max_sectors_kb is already set at the hw max. It would be nice if max_hw_sectors_kb could be set higher, but I'm not sure if there is a reason for this limit.
>>>
>>>> If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads.
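As a point of reference for anyone trying the tuning suggested above: the knobs in question live under /sys/block/<device>/queue on the client that has the image mapped. A minimal sketch, assuming the image shows up as rbd0 and you are root - the device name and the 4MB readahead value are illustrative, not recommendations:

    # show the current settings for the mapped rbd device
    grep . /sys/block/rbd0/queue/{read_ahead_kb,max_sectors_kb,max_hw_sectors_kb,scheduler,rotational}

    # bump readahead to 4MB and raise max_sectors_kb to the hardware limit
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
    cat /sys/block/rbd0/queue/max_hw_sectors_kb > /sys/block/rbd0/queue/max_sectors_kb

    # on a non-blk-mq kernel you can also switch to cfq and mark the device rotational
    echo cfq > /sys/block/rbd0/queue/scheduler
    echo 1 > /sys/block/rbd0/queue/rotational

On a blk-mq kernel of this era the scheduler file will typically just report "none", which is exactly the limitation being discussed.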
>>>
>>> Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation, as a single-threaded op effectively becomes multithreaded.
>>
>> Thinking on this more - I don't know if this will help after all; it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than Ceph can read it...
>>
>> I think striping across multiple devices could be the answer after all. But have you tried creating the RBD volume as striped in Ceph?
>
> Yes, striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find workarounds to boost performance.
>
> Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSDs. This is much the same way sequential access performance varies with a RAID array: if your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDs.

Hmmm... RBD -> PG -> objects: stripe_unit (more commonly called "stride") bytes are put into stripe_count objects - not OSDs, but it's possible you'll hit all OSDs with a small enough stride and a large enough stripe_count... I have no idea how well that works in practice on current Ceph releases; my Dumpling experience is probably useless here.

So we're back at striping with mdraid, I guess ... :)

>
>>>> It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show in /sys/block? Can you dump "find /sys/block/$rbd" in here?
>>>>
>>>> Jan
>>>>
>>>>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Nick,
>>>>>
>>>>> did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s.
>>>>>
>>>>> radosgw for cold data is around 90MB/s, which is imho limited by the speed of a single disk.
>>>>>
>>>>> Data already present in the OSD OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit.
>>>>>
>>>>> (20 node cluster, 12x4TB 7.2k disks, 2 SSDs for journals for 6 OSDs each, LACP 2x10G bonds)
>>>>>
>>>>> rados bench single-threaded performs equally badly, but with its default multithreaded settings it generates wonderful numbers, usually only limited by line rate and/or interrupts/s.
>>>>>
>>>>> I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to "your wonderful" numbers, but it's staying below 30 MB/s.
>>>>>
>>>>> I was thinking about using a software raid0 like you did, but that's imho really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made readahead for dummies.
>>>>>
>>>>> Thx in advance
>>>>> Benedikt
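On Jan's question above about creating the RBD volume as striped in Ceph: "fancy" striping is chosen at image-creation time through librbd. A rough sketch follows, where the image names, size, stripe unit and stripe count are purely illustrative and the flag spellings should be checked against the rbd man page for your release (and note, as Nick says above, the kernel client of this era cannot map images with non-default striping):

    # spread each 1MB of logical address space across sixteen 64KB stripe units / objects
    rbd create staging-img --size 1024000 --image-format 2 \
        --stripe-unit 65536 --stripe-count 16

    # a default-layout image (4MB objects, stripe count 1) that krbd can map, for comparison
    rbd create staging-img-plain --size 1024000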
>>>>> 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
>>>>>> Thanks for the replies guys.
>>>>>>
>>>>>> The client is set to 4MB. I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>>>>>>
>>>>>> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this.
>>>>>>
>>>>>> If you could set the RBD readahead to much higher than the object size, then this would probably give the desired effect where the buffer could be populated by reading from several OSDs in advance, giving much higher performance. That, or wait for striping to appear in the kernel client.
>>>>>>
>>>>>> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>>>>>>> Sent: 17 August 2015 03:36
>>>>>>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>
>>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd?
>>>>>>> In case of librbd, try the different config options for rbd cache..
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
>>>>>>> Sent: Sunday, August 16, 2015 7:07 PM
>>>>>>> To: Nick Fisk
>>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Hi Nick,
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>>>>>>>>> Sent: 13 August 2015 18:04
>>>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> Subject: How to improve single thread sequential reads?
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to use an RBD as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>>>>>>>>
>>>>>>>> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice.
>>>>>>>>
>>>>>>>> Out of personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
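For completeness, the librbd-side knobs Somnath alludes to above ("try the different config options for rbd cache") live in the [client] section of ceph.conf. They only affect librbd consumers (qemu, rbd-fuse and friends), not a krbd mapping, and the option names below are the ones documented around the Giant/Hammer releases, so verify them against your version; the values are illustrative only:

    [client]
        rbd cache = true
        rbd cache size = 134217728               # 128MB cache instead of the 32MB default
        rbd readahead trigger requests = 10      # sequential reads needed before readahead kicks in
        rbd readahead max bytes = 4194304        # read ahead up to 4MB at a time
        rbd readahead disable after bytes = 0    # never hand readahead duties back to the guest OS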
>>>>>>> Some settings in our ceph.conf that may help:
>>>>>>>
>>>>>>> osd_op_threads = 20
>>>>>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
>>>>>>> filestore_queue_max_ops = 90000
>>>>>>> filestore_flusher = false
>>>>>>> filestore_max_sync_interval = 10
>>>>>>> filestore_sync_flush = false
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alex
>>>>>>>
>>>>>>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>>>>>>>>
>>>>>>>>> I'm thinking that mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy.
>>>>>>>>>
>>>>>>>>> Any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
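Finally, the "multiple RBDs under an mdadm RAID0" workaround Nick describes would look roughly like the sketch below. The pool and image names are hypothetical, the /dev/rbdX numbers depend on mapping order (check rbd showmapped), and the 4096KB chunk is simply chosen to match the default 4MB object size:

    # map a handful of images that were created beforehand
    for i in 1 2 3 4; do
        rbd map tape-staging/chunk$i
    done

    # stripe the mapped devices together
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

    # quick sequential-read sanity check before putting a filesystem on it
    dd if=/dev/md0 of=/dev/null bs=4M count=2500 iflag=direct

    # then format and mount as usual (mount point assumed to exist)
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/staging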