> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: 18 August 2015 11:50
> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re: How to improve single thread sequential reads?
>
> I'm not sure if I missed it, but are you testing in a VM backed by an RBD
> device, or using the device directly?
>
> I don't see how blk-mq would help if it's not a VM; it just passes the request
> to the underlying block device, and in the case of RBD there is no real block
> device from the host perspective...? Enlighten me if I'm wrong, please. I have
> some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me
> cringe, because I'm unable to tune the scheduler and it just makes no sense
> at all...?

Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but
there is a bug which limits the maximum IO size to 128kb, which is why the
testing kernel is essential for large block/sequential IO. Hopefully the fix
will make it into 4.2.

> Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
> make sure it gets into readahead); also try (if you're not using blk-mq)
> switching to the cfq scheduler and setting rotational=1. I see you've also
> tried this, but I think blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object
size, from what I can tell), and max_sectors_kb is already set at the hardware
max. It would be nice if max_hw_sectors_kb could be set higher, though I'm not
sure whether there is a reason for this limit.

> If you are running a single-threaded benchmark like rados bench then what's
> limiting you is latency - it's not surprising it scales up with more threads.

Agreed, but with sequential workloads, if you can get readahead working
properly then you can easily remove this limitation, as a single-threaded op
effectively becomes multi-threaded.

> It should run nicely with a real workload once readahead kicks in and the
> queue fills up. But again - not sure how that works with blk-mq, and I've
> never used the RBD device directly (the kernel client). Does it show in
> /sys/block? Can you dump "find /sys/block/$rbd" in here?
>
> Jan
>
>
> > On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx> wrote:
> >
> > Hi Nick,
> >
> > did you do anything fancy to get to ~90MB/s in the first place?
> > I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
> > quite speedy, around 600MB/s.
> >
> > radosgw for cold data is around 90MB/s, which is imho limited by
> > the speed of a single disk.
> >
> > Data already present in the osd-os-buffers arrives at around
> > 400-700MB/s, so I don't think the network is the culprit.
> >
> > (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> > each, lacp 2x10g bonds)
> >
> > rados bench single-threaded performs equally badly, but with its
> > default multithreaded settings it generates wonderful numbers,
> > usually only limited by line rate and/or interrupts/s.
> >
> > I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> > get to "your wonderful" numbers, but it's staying below 30 MB/s.
> >
> > I was thinking about using a software raid0 like you did, but that's
> > imho really ugly.
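
For what it's worth, the raid0 approach is nothing more exotic than mapping a
few images and striping them with mdadm - roughly along these lines (image
names, device numbers and chunk size below are just examples, not a tested
recipe):

    # map several images from the same pool (names are only examples)
    rbd map rbd/staging1
    rbd map rbd/staging2
    rbd map rbd/staging3
    rbd map rbd/staging4

    # stripe them together; --chunk is in KB, so 4096 = 4MB to match the object size
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

Reads then fan out across four images (and so four sets of OSDs) at once, at
the cost of having to manage the md device on top.
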
> > When I know I need something speedy, I usually just start dd-ing
> > the file to /dev/null and wait about three minutes before
> > starting the actual job; some sort of hand-made read-ahead for
> > dummies.
> >
> > Thx in advance
> >  Benedikt
> >
> >
> > 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> >> Thanks for the replies guys.
> >>
> >> The client is set to 4MB; I haven't played with the OSD side yet as I
> >> wasn't sure if it would make much difference, but I will give it a
> >> go. If the client is already passing a 4MB request down to the OSD,
> >> will it be able to read ahead any further? The next 4MB object will
> >> in theory be on another OSD, so I'm not sure whether reading ahead
> >> any further on the OSD side would help.
> >>
> >> As I see it, the problem is that the RBD client will only read from
> >> 1 OSD at a time, as the RBD readahead can't be set any higher than
> >> max_hw_sectors_kb, which is the object size of the RBD. Please
> >> correct me if I'm wrong on this.
> >>
> >> If you could set the RBD readahead much higher than the object size,
> >> then this would probably give the desired effect, where the buffer
> >> could be populated by reading from several OSDs in advance to give
> >> much higher performance. That, or wait for striping to appear in the
> >> kernel client.
> >>
> >> I've also found that BareOS (fork of Bacula) seems to have a direct
> >> RADOS feature that supports radosstriper. I might try this and see
> >> how it performs as well.
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >>> Sent: 17 August 2015 03:36
> >>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>
> >>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>> Subject: Re: How to improve single thread sequential reads?
> >>>
> >>> Have you tried setting read_ahead_kb to a bigger number on both the
> >>> client and OSD side if you are using krbd?
> >>> In the case of librbd, try the different config options for the rbd cache.
> >>>
> >>> Thanks & Regards
> >>> Somnath
> >>>
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
> >>> Sent: Sunday, August 16, 2015 7:07 PM
> >>> To: Nick Fisk
> >>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>> Subject: Re: How to improve single thread sequential reads?
> >>>
> >>> Hi Nick,
> >>>
> >>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> >>>>> Sent: 13 August 2015 18:04
> >>>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>>> Subject: How to improve single thread sequential reads?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I'm trying to use an RBD as a staging area for some data before
> >>>>> pushing it down to some LTO6 tapes. As I cannot use striping with
> >>>>> the kernel client, I tend to max out at around 80MB/s for reads
> >>>>> when testing with dd. Has anyone got any clever suggestions for
> >>>>> giving this a bit of a boost? I think I need to get it up to
> >>>>> around 200MB/s to make sure there is always a steady flow of data
> >>>>> to the tape drive.
> >>>>
> >>>> I've just tried the testing kernel with the blk-mq fixes in it for
> >>>> full-size IOs; combined with bumping readahead up to 4MB, this is
> >>>> now getting me on average 150MB/s to 200MB/s, so this might suffice.
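
For reference, "bumping readahead" above is just a sysfs tweak on the mapped
device; rbd0 and the 4MB value below are only examples, and max_sectors_kb is
already pinned at max_hw_sectors_kb (the object size), so there is nothing to
gain there:

    # per-device readahead in KB; 4096 = 4MB to match the object size
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb

    # the IO size limits mentioned earlier - max_sectors_kb already sits at the hardware max
    cat /sys/block/rbd0/queue/max_hw_sectors_kb
    cat /sys/block/rbd0/queue/max_sectors_kb
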
> >>>> Out of personal interest, I would still like to know if anyone has
> >>>> ideas on how to really push much higher bandwidth through an RBD.
> >>>
> >>> Some settings in our ceph.conf that may help:
> >>>
> >>> osd_op_threads = 20
> >>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> >>> filestore_queue_max_ops = 90000
> >>> filestore_flusher = false
> >>> filestore_max_sync_interval = 10
> >>> filestore_sync_flush = false
> >>>
> >>> Regards,
> >>> Alex
> >>>
> >>>>>
> >>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
> >>>>>
> >>>>> I'm thinking that mapping multiple RBDs and then combining them
> >>>>> into an mdadm RAID0 stripe might work, but it seems a bit messy.
> >>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> Thanks,
> >>>>> Nick
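
For completeness, the dd testing referred to above is just a large sequential
read from the mapped device, and Benedikt's "hand-made readahead" is the same
thing pointed at the file about to be used. A rough sketch - device names,
paths and sizes are purely illustrative:

    # cold-cache sequential read test from a mapped RBD
    sync
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/rbd0 of=/dev/null bs=4M count=2500

    # pre-warm a file into the page cache a few minutes before the real job reads it
    dd if=/mnt/staging/backup.img of=/dev/null bs=4M &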