Reply in text

> On 18 Aug 2015, at 12:59, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>> Sent: 18 August 2015 11:50
>> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
>> Subject: Re: How to improve single thread sequential reads?
>>
>> I'm not sure if I missed that, but are you testing in a VM backed by an
>> RBD device, or using the device directly?
>>
>> I don't see how blk-mq would help if it's not a VM; it just passes the
>> request to the underlying block device, and in the case of RBD there is
>> no real block device from the host's perspective...? Enlighten me if I'm
>> wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk
>> devices, and it makes me cringe because I'm unable to tune the scheduler
>> - it just makes no sense at all.
>
> Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure,
> but there is a bug which limits the maximum IO size to 128KB, which is
> why the testing kernel is essential for large-block/sequential workloads.
> I think the fix should hopefully make it into 4.2.

blk-mq is supposed to remove the redundancy of having an IO scheduler at
every layer of VM -> VM block device -> host IO scheduler -> block device.
It's a paravirtualized driver that just moves requests from inside the VM
to the host queue (which is why inside the VM you have no IO scheduler
options - it effectively becomes noop). But that just doesn't make sense if
you're using qemu with librbd - there is no host queue. It would make sense
if the qemu drive were a krbd device, which does have a queue. If there's
no VM, there should be no blk-mq? So what was added to the kernel was
probably the host-side infrastructure to handle blk-mq requests from a
guest passed through to the krbd device - but that's probably not your
case, is it?

>> Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb
>> (so it doesn't cap the readahead); also try (if you're not using blk-mq)
>> switching to the cfq scheduler and setting rotational=1. I see you've
>> also tried this, but I think blk-mq is the limiting factor here now.
>
> I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the
> object size, from what I can tell), and max_sectors_kb is already set to
> the hardware maximum. It would be nice if max_hw_sectors_kb could be set
> higher, but I'm not sure whether there is a reason for this limit.
>
>> If you are running a single-threaded benchmark like rados bench then
>> what's limiting you is latency - it's not surprising it scales up with
>> more threads.
>
> Agreed, but with sequential workloads, if you can get readahead working
> properly then you can easily remove this limitation, as a single-threaded
> op effectively becomes multithreaded.

Thinking on this more - I don't know if this will help after all. It will
still be a single thread, just trying to get ahead of the client IO, and
it's not likely to get ahead unless you consume the data in userspace more
slowly than Ceph can read it... I think striping across multiple devices
could be the answer after all. But have you tried creating the RBD volume
as striped in Ceph?
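For reference, a minimal sketch of what a striped image would look like at
creation time. The pool/image names and sizes below are made up, striping
needs a format 2 image, and - as noted further down the thread - the kernel
client can't map fancy-striped images, so this only helps librbd or
radosstriper consumers:

  # Illustrative only: 1TB image (--size is in MB here), 4MB objects
  # (order 22), striped 1MB at a time across 16 objects so a sequential
  # read touches many objects (and therefore many OSDs) at once.
  rbd create tape-staging --pool rbd --image-format 2 --size 1048576 \
      --order 22 --stripe-unit 1048576 --stripe-count 16

  # Check that the stripe settings actually took effect.
  rbd info rbd/tape-staging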
Can you dump "find /sys/block/$rbd" in here? >> >> Jan >> >> >>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph- >> users.ceph.com.toasta.001@xxxxxxxxxx> wrote: >>> >>> Hi Nick, >>> >>> did you do anything fancy to get to ~90MB/s in the first place? >>> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are >>> quite speedy, around 600MB/s. >>> >>> radosgw for cold data is around the 90MB/s, which is imho limitted by >>> the speed of a single disk. >>> >>> Data already present on the osd-os-buffers arrive with around >>> 400-700MB/s so I don't think the network is the culprit. >>> >>> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds >>> each, lacp 2x10g bonds) >>> >>> rados bench single-threaded performs equally bad, but with its default >>> multithreaded settings it generates wonderful numbers, usually only >>> limiited by linerate and/or interrupts/s. >>> >>> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to >>> get to "your wonderful" numbers, but it's staying below 30 MB/s. >>> >>> I was thinking about using a software raid0 like you did but that's >>> imho really ugly. >>> When I know I needed something speedy, I usually just started dd-ing >>> the file to /dev/null and wait for about three minutes before >>> starting the actual job; some sort of hand-made read-ahead for >>> dummies. >>> >>> Thx in advance >>> Benedikt >>> >>> >>> 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>: >>>> Thanks for the replies guys. >>>> >>>> The client is set to 4MB, I haven't played with the OSD side yet as I >>>> wasn't sure if it would make much difference, but I will give it a >>>> go. If the client is already passing a 4MB request down through to >>>> the OSD, will it be able to readahead any further? The next 4MB >>>> object in theory will be on another OSD and so I'm not sure if >>>> reading ahead any further on the OSD side would help. >>>> >>>> How I see the problem is that the RBD client will only read 1 OSD at >>>> a time as the RBD readahead can't be set any higher than >>>> max_hw_sectors_kb, which is the object size of the RBD. Please correct >> me if I'm wrong on this. >>>> >>>> If you could set the RBD readahead to much higher than the object >>>> size, then this would probably give the desired effect where the >>>> buffer could be populated by reading from several OSD's in advance to >>>> give much higher performance. That or wait for striping to appear in > the >> Kernel client. >>>> >>>> I've also found that BareOS (fork of Bacula) seems to has a direct >>>> RADOS feature that supports radosstriper. I might try this and see >>>> how it performs as well. >>>> >>>> >>>>> -----Original Message----- >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On >>>>> Behalf Of Somnath Roy >>>>> Sent: 17 August 2015 03:36 >>>>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk >>>>> <nick@xxxxxxxxxx> >>>>> Cc: ceph-users@xxxxxxxxxxxxxx >>>>> Subject: Re: How to improve single thread sequential >> reads? >>>>> >>>>> Have you tried setting read_ahead_kb to bigger number for both >>>>> client/OSD side if you are using krbd ? >>>>> In case of librbd, try the different config options for rbd cache.. 
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
>>>>> Sent: Sunday, August 16, 2015 7:07 PM
>>>>> To: Nick Fisk
>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>
>>>>> Hi Nick,
>>>>>
>>>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>>>>>>> Sent: 13 August 2015 18:04
>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to use an RBD as a staging area for some data before
>>>>>>> pushing it down to some LTO6 tapes. As I cannot use striping with
>>>>>>> the kernel client, I tend to max out at around 80MB/s reads when
>>>>>>> testing with dd. Has anyone got any clever suggestions for giving
>>>>>>> this a bit of a boost? I think I need to get it up to around
>>>>>>> 200MB/s to make sure there is always a steady flow of data to the
>>>>>>> tape drive.
>>>>>>
>>>>>> I've just tried the testing kernel with the blk-mq fixes in it for
>>>>>> full-size IOs; this, combined with bumping readahead up to 4MB, is
>>>>>> now getting me 150MB/s to 200MB/s on average, so this might suffice.
>>>>>>
>>>>>> Out of personal interest, I would still like to know if anyone has
>>>>>> ideas on how to push much higher bandwidth through an RBD.
>>>>>
>>>>> Some settings in our ceph.conf that may help:
>>>>>
>>>>> osd_op_threads = 20
>>>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
>>>>> filestore_queue_max_ops = 90000
>>>>> filestore_flusher = false
>>>>> filestore_max_sync_interval = 10
>>>>> filestore_sync_flush = false
>>>>>
>>>>> Regards,
>>>>> Alex
>>>>>
>>>>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>>>>>>
>>>>>>> I'm thinking that mapping multiple RBDs and then combining them
>>>>>>> into an mdadm RAID0 stripe might work, but it seems a bit messy.
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
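For completeness, a rough sketch of that messy-but-workable idea - the
image names, device count and chunk size are made up, and the /dev/rbdN
numbering depends on the order in which the images are mapped:

  # Assume four images vol1..vol4 already exist in pool 'rbd'
  for i in 1 2 3 4; do rbd map rbd/vol$i; done

  # Stripe them together; a 4096KB chunk matches the default 4MB RBD object size
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
      /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

  mkfs.xfs /dev/md0
  mount /dev/md0 /mnt/staging

A sequential read against the md device then fans out across four RBDs
(and four sets of OSDs) at once, at the cost of having to manage the extra
mapping and array assembly at boot.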
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com