On Wed Apr 25 02:24:19 PDT 2018 Christian Balzer wrote:

> Hello,
>
> On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:
>
> > The performance I really care about is over rbd for VMs in my
> > OpenStack, but 'rbd bench' seems to line up pretty well with 'fio' tests
> > inside VMs, so a more or less typical random write rbd bench (from a
> > monitor node with a 10G connection on the same net as the osds):
> >
> "rbd bench" does things differently than fio (lots of happy switches
> there) so to make absolutely sure you're not doing an apples and oranges
> thing I'd suggest you stick to fio in a VM.

There are some tradeoffs, yes, but I get very close results, and I figured
ceph tools for the ceph list rather than pulling in all the rest of my
working stack, since the ceph tools do show the problem. But I do see your
point.

> In comparison this fio:
> ---
> fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
>     --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> ---
> Will only result in this, due to the network latencies of having direct
> I/O and only having one OSD at a time being busy:
> ---
> write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
> ---

I may simply have underestimated the impact of write caching in libvirt;
that fio command gets me just about as crappy write performance as read
(which would point to me just needing more IOPS from more/faster disks,
which is definitely true to a greater or lesser extent). I'll put a quick
way to check the guests' actual cache mode a bit further down.

WRITE: io=1024.0MB, aggrb=5705KB/s, minb=5705KB/s, maxb=5705KB/s, mint=183789msec, maxt=183789msec
READ:  io=1024.0MB, aggrb=4322KB/s, minb=4322KB/s, maxb=4322KB/s, mint=242606msec, maxt=242606msec

> That being said, something is rather wrong here indeed, my crappy test
> cluster shouldn't be able to outperform yours.

Well, load... The asymmetry was my main puzzlement, but that may be
illusory.

> > rbd bench --io-total=4G --io-size 4096 --io-type write \
> >     --io-pattern rand --io-threads 16 mypool/myvol
> >
> > <snip />
> >
> > elapsed: 361  ops: 1048576  ops/sec: 2903.82  bytes/sec: 11894034.98
> >
> > The same for random read is an order of magnitude lower:
> >
> > rbd bench --io-total=4G --io-size 4096 --io-type read \
> >     --io-pattern rand --io-threads 16 mypool/myvol
> >
> > elapsed: 3354  ops: 1048576  ops/sec: 312.60  bytes/sec: 1280403.47
> >
> > (sequential reads and bigger io-size help, but not a lot)
> >
> > ceph -s from during the read bench, to get a sense of the competing traffic:
> >
> >   cluster:
> >     id:     <UUID>
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> >     mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> >     osd: 174 osds: 174 up, 174 in
> >     rgw: 3 daemons active
> >
> >   data:
> >     pools:   19 pools, 10240 pgs
> >     objects: 17342k objects, 80731 GB
> >     usage:   240 TB used, 264 TB / 505 TB avail
> >     pgs:     10240 active+clean
> >
> >   io:
> >     client: 4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> >
> > During deep-scrubs overnight I can see the disks doing >500MBps reads
> > and ~150 read iops (each at peak), while during the read bench (including
> > all traffic from ~1k VMs) individual osd data partitions peak around 25
> > read iops and 1.5MBps read bandwidth, so it seems like there should be
> > performance to spare.
> >
> OK, there are a couple of things here.
> 1k VMs?!?

Actually 1.7k VMs just now, which caught me a bit by surprise when I looked
at it. Many are idle because we don't charge per use internally, so people
are sloppy, but many aren't, and even the idle ones are writing logs and
such.
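Coming back to the caching point above: to double-check what the guests are
actually getting, I'd probably just dump a libvirt domain definition on one
of the hypervisors and look at the disk driver line. Rough sketch only; the
instance name below is a placeholder, and if I remember right the mode
ultimately comes from disk_cachemodes in nova.conf:

---
# on a hypervisor: what cache mode did nova/libvirt actually set on the guest's rbd disk?
# (instance-000000ab is just an example domain name)
virsh dumpxml instance-000000ab | grep -A3 '<disk'
# expecting a driver line along the lines of:
#   <driver name='qemu' type='raw' cache='writeback'/>
---

If that comes back writeback, then my in-VM write numbers were never really
comparable to 'rbd bench' writes in the first place, which would explain at
least part of the asymmetry I was chasing.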
> One assumes that they're not idle, looking at the output above.
> And writes will compete with reads on the same spindle of course.
> "performance to spare" you say, but have you verified this with iostat or
> atop?

This assertion is mostly based on collectd stats that show a spike in read
ops and bandwidth during our scrub window and no large change in write ops
or bandwidth, so I presume the disks *could* do that much (at least
ops-wise) for client traffic as well; an iostat check to actually verify
that is sketched in the P.S. below.

Here's a snapshot of a 24hr graph from one server (others are similar in
general shape):

https://snapshot.raintank.io/dashboard/snapshot/gB3FDPl7uRGWmL17NHNBCuWKGsXdiqlt
(link good for 7 days)

You can clearly see the low read line behind the higher writes jump up
during the scrub window (20:00->02:00 local time here), and a much smaller
bump around 6am from the cron.daily thundering herd.

The scrubs do impact performance, which does mean I'm over capacity, since
I should be able to scrub without impacting production, but there's still a
fair amount of capacity used during scrubbing that doesn't seem to be used
outside it. But looking harder, the only answer may be "buy hardware",
which is a valid answer.

Thanks,
-Jon
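P.S. On the iostat/atop question: a minimal check, assuming nothing beyond
plain sysstat on the OSD hosts, would be to leave something like this
running on one OSD box during the next read bench and again during the
scrub window, then compare:

---
# extended per-device stats, refreshed every 5 seconds
iostat -x 5
# watch r/s, w/s, await and %util on the OSD data disks; if %util is already
# close to 100% outside the scrub window, then the "performance to spare"
# reading of the collectd graphs is wrong and it really is "buy hardware".
---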