> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: 18 August 2015 11:50
> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re: How to improve single thread sequential reads?
>
> I'm not sure if I missed it, but are you testing in a VM backed by an RBD
> device, or using the device directly?
>
> I don't see how blk-mq would help if it's not a VM; it just passes the request
> to the underlying block device, and in the case of RBD there is no real block
> device from the host perspective...? Enlighten me if I'm wrong, please. I have
> some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me
> cringe, because I'm unable to tune the scheduler and it just makes no sense
> at all...?

Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but
there is a bug which limits the maximum IO size to 128kb, which is why the
testing kernel is essential for large block/sequential IO. Hopefully the fix
will make it into 4.2.

> Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
> make sure it gets into readahead); also try (if you're not using blk-mq)
> switching to the cfq scheduler and setting rotational=1. I see you've also
> tried this, but I think blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object
size, from what I can tell), and max_sectors_kb is already set at the hardware
max. It would be nice if max_hw_sectors_kb could be set higher, though I'm not
sure whether there is a reason for this limit.

> If you are running a single-threaded benchmark like rados bench then what's
> limiting you is latency - it's not surprising it scales up with more threads.

Agreed, but with sequential workloads, if you can get readahead working
properly then you can easily remove this limitation, as a single-threaded op
effectively becomes multi-threaded.

> It should run nicely with a real workload once readahead kicks in and the
> queue fills up. But again - not sure how that works with blk-mq, and I've
> never used the RBD device directly (the kernel client). Does it show in
> /sys/block? Can you dump "find /sys/block/$rbd" in here?
>
> Jan
>
>
> > On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx> wrote:
> >
> > Hi Nick,
> >
> > did you do anything fancy to get to ~90MB/s in the first place?
> > I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
> > quite speedy, around 600MB/s.
> >
> > radosgw for cold data is around 90MB/s, which is imho limited by
> > the speed of a single disk.
> >
> > Data already present in the osd-os-buffers arrives at around
> > 400-700MB/s, so I don't think the network is the culprit.
> >
> > (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> > each, lacp 2x10g bonds)
> >
> > rados bench single-threaded performs equally badly, but with its
> > default multithreaded settings it generates wonderful numbers,
> > usually only limited by line rate and/or interrupts/s.
> >
> > I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> > get to "your wonderful" numbers, but it's staying below 30 MB/s.
> >
> > I was thinking about using a software raid0 like you did, but that's
> > imho really ugly.
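
For what it's worth, the raid0 approach is nothing more exotic than mapping a
few images and striping them with mdadm - roughly along these lines (image
names, device numbers and chunk size below are just examples, not a tested
recipe):

    # map several images from the same pool (names are only examples)
    rbd map rbd/staging1
    rbd map rbd/staging2
    rbd map rbd/staging3
    rbd map rbd/staging4

    # stripe them together; --chunk is in KB, so 4096 = 4MB to match the object size
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

Reads then fan out across four images (and so four sets of OSDs) at once, at
the cost of having to manage the md device on top.
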
> > When I know I need something speedy, I usually just start dd-ing
> > the file to /dev/null and wait about three minutes before
> > starting the actual job; some sort of hand-made read-ahead for
> > dummies.
> >
> > Thx in advance
> >  Benedikt
> >
> >
> > 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> >> Thanks for the replies guys.
> >>
> >> The client is set to 4MB; I haven't played with the OSD side yet as I
> >> wasn't sure if it would make much difference, but I will give it a
> >> go. If the client is already passing a 4MB request down to the OSD,
> >> will it be able to read ahead any further? The next 4MB object will
> >> in theory be on another OSD, so I'm not sure whether reading ahead
> >> any further on the OSD side would help.
> >>
> >> As I see it, the problem is that the RBD client will only read from
> >> 1 OSD at a time, as the RBD readahead can't be set any higher than
> >> max_hw_sectors_kb, which is the object size of the RBD. Please
> >> correct me if I'm wrong on this.
> >>
> >> If you could set the RBD readahead much higher than the object size,
> >> then this would probably give the desired effect, where the buffer
> >> could be populated by reading from several OSDs in advance to give
> >> much higher performance. That, or wait for striping to appear in the
> >> kernel client.
> >>
> >> I've also found that BareOS (fork of Bacula) seems to have a direct
> >> RADOS feature that supports radosstriper. I might try this and see
> >> how it performs as well.
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> >>> Sent: 17 August 2015 03:36
> >>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>
> >>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>> Subject: Re: How to improve single thread sequential reads?
> >>>
> >>> Have you tried setting read_ahead_kb to a bigger number on both the
> >>> client and OSD side if you are using krbd?
> >>> In the case of librbd, try the different config options for the rbd cache.
> >>>
> >>> Thanks & Regards
> >>> Somnath
> >>>
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
> >>> Sent: Sunday, August 16, 2015 7:07 PM
> >>> To: Nick Fisk
> >>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>> Subject: Re: How to improve single thread sequential reads?
> >>>
> >>> Hi Nick,
> >>>
> >>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> >>>>> Sent: 13 August 2015 18:04
> >>>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>>> Subject: How to improve single thread sequential reads?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I'm trying to use an RBD as a staging area for some data before
> >>>>> pushing it down to some LTO6 tapes. As I cannot use striping with
> >>>>> the kernel client, I tend to max out at around 80MB/s for reads
> >>>>> when testing with dd. Has anyone got any clever suggestions for
> >>>>> giving this a bit of a boost? I think I need to get it up to
> >>>>> around 200MB/s to make sure there is always a steady flow of data
> >>>>> to the tape drive.
> >>>>
> >>>> I've just tried the testing kernel with the blk-mq fixes in it for
> >>>> full-size IOs; combined with bumping readahead up to 4MB, this is
> >>>> now getting me on average 150MB/s to 200MB/s, so this might suffice.
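
For reference, "bumping readahead" above is just a sysfs tweak on the mapped
device; rbd0 and the 4MB value below are only examples, and max_sectors_kb is
already pinned at max_hw_sectors_kb (the object size), so there is nothing to
gain there:

    # per-device readahead in KB; 4096 = 4MB to match the object size
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb

    # the IO size limits mentioned earlier - max_sectors_kb already sits at the hardware max
    cat /sys/block/rbd0/queue/max_hw_sectors_kb
    cat /sys/block/rbd0/queue/max_sectors_kb
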
> >>>> Out of personal interest, I would still like to know if anyone has
> >>>> ideas on how to really push much higher bandwidth through an RBD.
> >>>
> >>> Some settings in our ceph.conf that may help:
> >>>
> >>> osd_op_threads = 20
> >>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> >>> filestore_queue_max_ops = 90000
> >>> filestore_flusher = false
> >>> filestore_max_sync_interval = 10
> >>> filestore_sync_flush = false
> >>>
> >>> Regards,
> >>> Alex
> >>>
> >>>>>
> >>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
> >>>>>
> >>>>> I'm thinking that mapping multiple RBDs and then combining them
> >>>>> into an mdadm RAID0 stripe might work, but it seems a bit messy.
> >>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> Thanks,
> >>>>> Nick
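
For completeness, the dd testing referred to above is just a large sequential
read from the mapped device, and Benedikt's "hand-made readahead" is the same
thing pointed at the file about to be used. A rough sketch - device names,
paths and sizes are purely illustrative:

    # cold-cache sequential read test from a mapped RBD
    sync
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/rbd0 of=/dev/null bs=4M count=2500

    # pre-warm a file into the page cache a few minutes before the real job reads it
    dd if=/mnt/staging/backup.img of=/dev/null bs=4M &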