> On 18 Aug 2015, at 13:58, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>> Sent: 18 August 2015 12:41
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: How to improve single thread sequential reads?
>>
>> Reply in text
>>
>>> On 18 Aug 2015, at 12:59, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>>>> Sent: 18 August 2015 11:50
>>>> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
>>>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
>>>> Subject: Re: How to improve single thread sequential reads?
>>>>
>>>> I'm not sure if I missed it, but are you testing in a VM backed by an RBD device, or using the device directly?
>>>>
>>>> I don't see how blk-mq would help if it's not a VM; it just passes the request to the underlying block device, and in the case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...?
>>>
>>> Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why, for large block/sequential work, that testing kernel is essential. I think the fix should make it into 4.2, hopefully.
>>
>> blk-mq is supposed to remove the redundancy of having:
>>
>> IO scheduler in VM -> VM block device -> host IO scheduler -> block device
>>
>> It's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop).
>>
>> But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive were a krbd device with a queue.
>>
>> If there's no VM, there should be no blk-mq?
>
> I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be much more about enhancing the overall block-layer performance in Linux:
>
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
>> So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krbd device, but that's probably not your case, is it?
>>
>>>> Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) switching to the cfq scheduler and setting the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.
>>>
>>> I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object size, from what I can tell), and max_sectors_kb is already set at the hw max. It would be nice if max_hw_sectors_kb could be set higher, but I'm not sure if there is a reason for this limit.
>>>
>>>> If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads.
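As a point of reference for anyone trying the tuning suggested above: the knobs in question live under /sys/block/<device>/queue on the client that has the image mapped. A minimal sketch, assuming the image shows up as rbd0 and you are root - the device name and the 4MB readahead value are illustrative, not recommendations:

    # show the current settings for the mapped rbd device
    grep . /sys/block/rbd0/queue/{read_ahead_kb,max_sectors_kb,max_hw_sectors_kb,scheduler,rotational}

    # bump readahead to 4MB and raise max_sectors_kb to the hardware limit
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
    cat /sys/block/rbd0/queue/max_hw_sectors_kb > /sys/block/rbd0/queue/max_sectors_kb

    # on a non-blk-mq kernel you can also switch to cfq and mark the device rotational
    echo cfq > /sys/block/rbd0/queue/scheduler
    echo 1 > /sys/block/rbd0/queue/rotational

On a blk-mq kernel of this era the scheduler file will typically just report "none", which is exactly the limitation being discussed.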
>>>
>>> Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation, as a single-threaded op effectively becomes multithreaded.
>>
>> Thinking on this more - I don't know if this will help after all; it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than Ceph can read it...
>>
>> I think striping across multiple devices could be the answer after all. But have you tried creating the RBD volume as striped in Ceph?
>
> Yes, striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find workarounds to boost performance.
>
> Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSDs. This is much the same way sequential access performance varies with a RAID array: if your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDs.

Hmmm... RBD -> PG -> objects: stripe_unit (more commonly called "stride") bytes are put into stripe_count objects - not OSDs, but it's possible you'll hit all OSDs with a small enough stride and a large enough stripe_count... I have no idea how well that works in practice on current Ceph releases; my Dumpling experience is probably useless here.

So we're back at striping with mdraid, I guess ... :)

>
>>>> It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show in /sys/block? Can you dump "find /sys/block/$rbd" in here?
>>>>
>>>> Jan
>>>>
>>>>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Nick,
>>>>>
>>>>> did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s.
>>>>>
>>>>> radosgw for cold data is around 90MB/s, which is imho limited by the speed of a single disk.
>>>>>
>>>>> Data already present in the OSD OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit.
>>>>>
>>>>> (20 node cluster, 12x4TB 7.2k disks, 2 SSDs for journals for 6 OSDs each, LACP 2x10G bonds)
>>>>>
>>>>> rados bench single-threaded performs equally badly, but with its default multithreaded settings it generates wonderful numbers, usually only limited by line rate and/or interrupts/s.
>>>>>
>>>>> I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to "your wonderful" numbers, but it's staying below 30 MB/s.
>>>>>
>>>>> I was thinking about using a software raid0 like you did, but that's imho really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made readahead for dummies.
>>>>>
>>>>> Thx in advance
>>>>> Benedikt
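On Jan's question above about creating the RBD volume as striped in Ceph: "fancy" striping is chosen at image-creation time through librbd. A rough sketch follows, where the image names, size, stripe unit and stripe count are purely illustrative and the flag spellings should be checked against the rbd man page for your release (and note, as Nick says above, the kernel client of this era cannot map images with non-default striping):

    # spread each 1MB of logical address space across sixteen 64KB stripe units / objects
    rbd create staging-img --size 1024000 --image-format 2 \
        --stripe-unit 65536 --stripe-count 16

    # a default-layout image (4MB objects, stripe count 1) that krbd can map, for comparison
    rbd create staging-img-plain --size 1024000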
>>>>> 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
>>>>>> Thanks for the replies guys.
>>>>>>
>>>>>> The client is set to 4MB. I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>>>>>>
>>>>>> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this.
>>>>>>
>>>>>> If you could set the RBD readahead to much higher than the object size, then this would probably give the desired effect where the buffer could be populated by reading from several OSDs in advance, giving much higher performance. That, or wait for striping to appear in the kernel client.
>>>>>>
>>>>>> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>>>>>>> Sent: 17 August 2015 03:36
>>>>>>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>
>>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd?
>>>>>>> In case of librbd, try the different config options for rbd cache..
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
>>>>>>> Sent: Sunday, August 16, 2015 7:07 PM
>>>>>>> To: Nick Fisk
>>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Hi Nick,
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>>>>>>>>> Sent: 13 August 2015 18:04
>>>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> Subject: How to improve single thread sequential reads?
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to use an RBD as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>>>>>>>>
>>>>>>>> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice.
>>>>>>>>
>>>>>>>> Out of personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
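For completeness, the librbd-side knobs Somnath alludes to above ("try the different config options for rbd cache") live in the [client] section of ceph.conf. They only affect librbd consumers (qemu, rbd-fuse and friends), not a krbd mapping, and the option names below are the ones documented around the Giant/Hammer releases, so verify them against your version; the values are illustrative only:

    [client]
        rbd cache = true
        rbd cache size = 134217728               # 128MB cache instead of the 32MB default
        rbd readahead trigger requests = 10      # sequential reads needed before readahead kicks in
        rbd readahead max bytes = 4194304        # read ahead up to 4MB at a time
        rbd readahead disable after bytes = 0    # never hand readahead duties back to the guest OS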
>>>>>>> Some settings in our ceph.conf that may help:
>>>>>>>
>>>>>>> osd_op_threads = 20
>>>>>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
>>>>>>> filestore_queue_max_ops = 90000
>>>>>>> filestore_flusher = false
>>>>>>> filestore_max_sync_interval = 10
>>>>>>> filestore_sync_flush = false
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alex
>>>>>>>
>>>>>>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>>>>>>>>
>>>>>>>>> I'm thinking that mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy.
>>>>>>>>>
>>>>>>>>> Any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
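Finally, the "multiple RBDs under an mdadm RAID0" workaround Nick describes would look roughly like the sketch below. The pool and image names are hypothetical, the /dev/rbdX numbers depend on mapping order (check rbd showmapped), and the 4096KB chunk is simply chosen to match the default 4MB object size:

    # map a handful of images that were created beforehand
    for i in 1 2 3 4; do
        rbd map tape-staging/chunk$i
    done

    # stripe the mapped devices together
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

    # quick sequential-read sanity check before putting a filesystem on it
    dd if=/dev/md0 of=/dev/null bs=4M count=2500 iflag=direct

    # then format and mount as usual (mount point assumed to exist)
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/staging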