Re: How to improve single thread sequential reads?

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 12:41
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  How to improve single thread sequential reads?
> 
> Reply in text
> 
> > On 18 Aug 2015, at 12:59, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> >
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Jan Schermer
> >> Sent: 18 August 2015 11:50
> >> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
> >> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
> >> Subject: Re:  How to improve single thread sequential reads?
> >>
> >> I'm not sure if I missed that, but are you testing in a VM backed by
> >> an RBD device, or using the device directly?
> >>
> >> I don't see how blk-mq would help if it's not a VM; it just passes
> >> the request to the underlying block device, and in the case of RBD
> >> there is no real block device from the host perspective...? Enlighten
> >> me if I'm wrong, please. I have some Ubuntu VMs that use blk-mq for
> >> virtio-blk devices, and it makes me cringe because I'm unable to tune
> >> the scheduler and it just makes no sense at all...?
> >
> > Since 4.0 (I think), the kernel RBD client now uses the blk-mq
> > infrastructure, but there is a bug which limits max IO sizes to 128KB,
> > which is why that testing kernel is essential for large block/sequential
> > workloads. I think the fix should hopefully make it into 4.2.
> 
> blk-mq is supposed to remove the redundancy of having
> 
> IO scheduler in VM -> VM block device -> host IO scheduler -> block device
> 
> It's a paravirtualized driver that just moves requests from inside the VM
> to the host queue (and this is why inside the VM you have no IO scheduler
> options - it effectively becomes noop).
> 
> But this just doesn't make sense if you're using qemu with librbd - there's
> no host queue.
> It would make sense if the qemu drive was a krbd device with a queue.
> 
> If there's no VM, there should be no blk-mq?

I think you might be thinking of the virtio-blk driver for blk-mq. Blk-mq
itself seems to be a lot more about enhancing overall block layer
performance in Linux:

https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
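
Incidentally, a quick way to check whether a mapped RBD is actually using
blk-mq (assuming it shows up as /dev/rbd0):

ls /sys/block/rbd0/mq 2>/dev/null && echo "rbd0 is on blk-mq"
cat /sys/block/rbd0/queue/scheduler   # "none" on blk-mq, otherwise lists noop/deadline/cfq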



> 
> So what was added to the kernel was probably the host-side infrastructure
> to handle blk-mq in guest passthrough to the krbd device, but that's
> probably not your case, is it?
> 
> >
> >>
> >> Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb
> >> (to make sure it gets into readahead); also try switching to the cfq
> >> scheduler (if you're not using blk-mq) and setting rotational=1. I see
> >> you've also tried this, but I think blk-mq is the limiting factor here
> >> now.
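
For reference, those tunables live under /sys/block; something like this,
assuming the RBD is mapped as /dev/rbd0 (and noting that max_hw_sectors_kb
itself is read-only):

cat /sys/block/rbd0/queue/max_hw_sectors_kb       # driver ceiling - read-only
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # 4MB readahead
echo cfq > /sys/block/rbd0/queue/scheduler        # only possible if the device is not on blk-mq
echo 1 > /sys/block/rbd0/queue/rotational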
> >
> > I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the
> > object size, from what I can tell), and max_sectors_kb is already set
> > at the hw max. It would be nice if max_hw_sectors_kb could be set
> > higher, though I'm not sure if there is a reason for this limit.
> >
> >>
> >> If you are running a single-threaded benchmark like rados bench then
> >> what's limiting you is latency - it's not surprising it scales up with
> >> more threads.
> >
> > Agreed, but with sequential workloads, if you can get readahead
> > working properly then you can easily remove this limitation as a
> > single threaded op effectively becomes multithreaded.
> 
> Thinking on this more - I don't know if this will help after all; it will
> still be a single thread, just trying to get ahead of the client IO - and
> that's not likely to happen unless you read the data in userspace more
> slowly than Ceph can read it...
> 
> I think striping across multiple devices could be the answer after all. But
> have you tried creating the RBD volume as striped in Ceph?

Yes, striping would probably give amazing performance, but the kernel client
currently doesn't support it, which leaves us in the position of trying to
find workarounds to boost performance.

Although the client read is single-threaded, the RBD/RADOS layer would split
these larger readahead IOs into 4MB requests that would then be processed in
parallel by the OSDs. This is much the same way sequential access
performance varies with a RAID array: if your IO size matches the stripe
size of the array, you get nearly the bandwidth of all the disks involved. I
think in Ceph the effective stripe size is the object size * number of OSDs.
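
A quick way to see the difference is to compare a direct read (no readahead,
one object in flight at a time) with a buffered one on the same mapped
device (assuming it is /dev/rbd0):

dd if=/dev/rbd0 of=/dev/null bs=4M count=500 iflag=direct   # no readahead, one request at a time
echo 3 > /proc/sys/vm/drop_caches                           # drop the page cache again
dd if=/dev/rbd0 of=/dev/null bs=4M count=500                # buffered, readahead can run ahead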


> 
> >
> >> It should run nicely with a real workload once readahead kicks in and
> >> the queue fills up. But again - not sure how that works with blk-mq
> >> and I've never used the RBD device directly (the kernel client). Does
> >> it show up in /sys/block? Can you dump "find /sys/block/$rbd" in here?
> >>
> >> Jan
> >>
> >>
> >>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx> wrote:
> >>>
> >>> Hi Nick,
> >>>
> >>> did you do anything fancy to get to ~90MB/s in the first place?
> >>> I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
> >>> quite speedy, around 600MB/s.
> >>>
> >>> radosgw for cold data is around 90MB/s, which is imho limited
> >>> by the speed of a single disk.
> >>>
> >>> Data already present in the OSDs' OS buffers arrives at around
> >>> 400-700MB/s, so I don't think the network is the culprit.
> >>>
> >>> (20-node cluster, 12x4TB 7.2k disks, 2 SSDs journaling 6 OSDs each,
> >>> LACP 2x10G bonds)
> >>>
> >>> rados bench single-threaded performs equally badly, but with its
> >>> default multithreaded settings it generates wonderful numbers,
> >>> usually only limited by line rate and/or interrupts/s.
> >>>
> >>> I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to
> >>> get to your "wonderful" numbers, but it's staying below 30 MB/s.
> >>>
> >>> I was thinking about using a software RAID0 like you did, but that's
> >>> imho really ugly.
> >>> When I knew I needed something speedy, I usually just started dd-ing
> >>> the file to /dev/null and waited for about three minutes before
> >>> starting the actual job; some sort of hand-made read-ahead for
> >>> dummies.
> >>>
> >>> Thx in advance
> >>> Benedikt
> >>>
> >>>
> >>> 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> >>>> Thanks for the replies guys.
> >>>>
> >>>> The client is set to 4MB; I haven't played with the OSD side yet, as
> >>>> I wasn't sure if it would make much difference, but I will give it
> >>>> a go. If the client is already passing a 4MB request down through
> >>>> to the OSD, will it be able to read ahead any further? The next 4MB
> >>>> object will in theory be on another OSD, so I'm not sure if
> >>>> reading ahead any further on the OSD side would help.
> >>>>
> >>>> How I see the problem is that the RBD client will only read from 1
> >>>> OSD at a time, as the RBD readahead can't be set any higher than
> >>>> max_hw_sectors_kb, which is the object size of the RBD. Please
> >>>> correct me if I'm wrong on this.
> >>>>
> >>>> If you could set the RBD readahead much higher than the object
> >>>> size, then this would probably give the desired effect, where the
> >>>> buffer could be populated by reading from several OSDs in advance
> >>>> to give much higher performance. That, or wait for striping to
> >>>> appear in the kernel client.
> >>>>
> >>>> I've also found that BareOS (a fork of Bacula) seems to have a direct
> >>>> RADOS feature that supports radosstriper. I might try this and see
> >>>> how it performs as well.
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>> Behalf Of Somnath Roy
> >>>>> Sent: 17 August 2015 03:36
> >>>>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk
> >>>>> <nick@xxxxxxxxxx>
> >>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>>> Subject: Re:  How to improve single thread sequential reads?
> >>>>>
> >>>>> Have you tried setting read_ahead_kb to a bigger number on both the
> >>>>> client and OSD side if you are using krbd?
> >>>>> In the case of librbd, try the different config options for rbd cache.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
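
For the librbd case Somnath mentions, the relevant knobs would go in the
[client] section of ceph.conf; something along these lines (values are only
examples, and the rbd readahead options need Giant or later):

[client]
rbd cache = true
rbd cache size = 67108864                 # 64MB
rbd readahead trigger requests = 10       # sequential reads before readahead kicks in
rbd readahead max bytes = 4194304         # read ahead up to 4MB at a time
rbd readahead disable after bytes = 0     # keep readahead on for the whole image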
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>> Behalf Of Alex Gorbachev
> >>>>> Sent: Sunday, August 16, 2015 7:07 PM
> >>>>> To: Nick Fisk
> >>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>>> Subject: Re:  How to improve single thread sequential reads?
> >>>>>
> >>>>> Hi Nick,
> >>>>>
> >>>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >>>>>>> -----Original Message-----
> >>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>>>> Behalf Of Nick Fisk
> >>>>>>> Sent: 13 August 2015 18:04
> >>>>>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>>>>> Subject:  How to improve single thread sequential reads?
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm trying to use an RBD to act as a staging area for some data
> >>>>>>> before pushing it down to some LTO6 tapes. As I cannot use
> >>>>>>> striping with the kernel client, I tend to max out at around
> >>>>>>> 80MB/s reads when testing with dd. Has anyone got any clever
> >>>>>>> suggestions for giving this a bit of a boost? I think I need to
> >>>>>>> get it up to around 200MB/s to make sure there is always a steady
> >>>>>>> flow of data to the tape drive.
> >>>>>>
> >>>>>> I've just tried the testing kernel with the blk-mq fixes in it
> >>>>>> for full-size IOs; this, combined with bumping readahead up to
> >>>>>> 4MB, is now getting me on average 150MB/s to 200MB/s, so this
> >>>>>> might suffice.
> >>>>>>
> >>>>>> Out of personal interest, I would still like to know if anyone has
> >>>>>> ideas on how to really push much higher bandwidth through an RBD.
> >>>>>
> >>>>> Some settings in our ceph.conf that may help:
> >>>>>
> >>>>> osd_op_threads = 20
> >>>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> >>>>> filestore_queue_max_ops = 90000
> >>>>> filestore_flusher = false
> >>>>> filestore_max_sync_interval = 10
> >>>>> filestore_sync_flush = false
> >>>>>
> >>>>> Regards,
> >>>>> Alex
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
> >>>>>>>
> >>>>>>> I'm thinking mapping multiple RBDs and then combining them into
> >>>>>>> an mdadm RAID0 stripe might work, but it seems a bit messy.
> >>>>>>>
> >>>>>>> Any suggestions?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Nick
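
For reference, the mdadm workaround mentioned above would look roughly like
this (image names, image count and chunk size are only illustrative):

# map a handful of smaller images and stripe across them with md RAID0
for i in 0 1 2 3; do rbd map staging$i; done
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
    /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
mkfs.xfs /dev/md0 && mount /dev/md0 /mnt/staging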




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


