Hello,

On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:

> Hi,
> I changed to only use the infiniband network. For the 4KB write, the
> IOPS doesn’t improve much.

That's mostly going to be bound by latencies (as I just wrote in the
other thread), both network and internal Ceph ones.

The cluster I described in the other thread has 32 OSDs and does about
1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".
So about half of that with your 15 OSDs isn't all that unexpected.

Once again, to get something more realistic, use fio.

> I also logged into the OSD nodes and atop showed the disks are not
> always at 100% busy. Please check a snapshot of one node below:

When you do the 4KB bench (for 60 seconds or so), also watch the CPU
usage, rados bench is a killer there.

Christian

>
> DSK | sdc       | busy 72% | read 20/s  | write 86/s  | KiB/w 13 | MBr/s 0.16 | MBw/s 1.12 | avio 6.69 ms |
> DSK | sda       | busy 47% | read 0/s   | write 589/s | KiB/w 4  | MBr/s 0.00 | MBw/s 2.83 | avio 0.79 ms |
> DSK | sdb       | busy 31% | read 14/s  | write 77/s  | KiB/w 10 | MBr/s 0.11 | MBw/s 0.76 | avio 3.42 ms |
> DSK | sdd       | busy 19% | read 4/s   | write 50/s  | KiB/w 11 | MBr/s 0.03 | MBw/s 0.55 | avio 3.40 ms |
> NET | transport | tcpi 656/s | tcpo 655/s | udpi 0/s     | udpo 0/s     | tcpao 0/s | tcppo 0/s | tcprs 0/s |
> NET | network   | ipi 657/s  | ipo 655/s  | ipfrw 0/s    | deliv 657/s  |           | icmpi 0/s | icmpo 0/s |
> NET | p10p1  0% | pcki 0/s   | pcko 0/s   | si 0 Kbps    | so 1 Kbps    | erri 0/s  | erro 0/s  | drpo 0/s  |
> NET | ib0  ---- | pcki 637/s | pcko 636/s | si 8006 Kbps | so 5213 Kbps | erri 0/s  | erro 0/s  | drpo 0/s  |
> NET | lo   ---- | pcki 19/s  | pcko 19/s  | si 14 Kbps   | so 14 Kbps   | erri 0/s  | erro 0/s  | drpo 0/s  |
>
> /dev/sda is the OS and journaling SSD. The other three are OSDs.
>
> Am I missing anything?
>
> Thanks,
>
> Zhang, Di
> Postdoctoral Associate
> Baylor College of Medicine
>
> On Jul 13, 2016, at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> >
> >> I also tried 4K write bench. The IOPS is ~420.
> >
> > That's what people usually mean (4KB blocks) when talking about IOPS.
> > This number is pretty low, my guess would be network latency on your
> > 1Gb/s network for the most part.
> >
> > You should run atop on your storage nodes while running a test like
> > this and see if the OSDs (HDDs) are also very busy.
> >
> > Lastly the rados bench gives you some basic numbers but it is not the
> > same as real client I/O, for that you want to run fio inside a VM or
> > in your case on a mounted CephFS.
> >
> >> I used to have better bandwidth when I use the same network for both
> >> the cluster and clients. Now the bandwidth must be limited by the 1G
> >> ethernet.
> >
> > That's the bandwidth you also see in your 4MB block tests below.
> > For small I/Os the real killer is latency, though.
> >
> >> What would you suggest to me to do?
> >>
> > That depends on your budget mostly (switch ports, client NICs).
> >
> > A uniform, single 10Gb/s network would be better in all aspects than
> > the split network you have now.
> >
> > Christian
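For something closer to real client I/O than rados bench, a minimal fio run
against the mounted CephFS could look like the sketch below. The mount point,
test file size and runtime are placeholders rather than values from this
thread, so adjust them to your setup:

    # 4KB random writes, 32 in flight, direct I/O to keep the page cache
    # out of the picture
    fio --name=cephfs-4k-randwrite --filename=/mnt/cephfs/fio.test \
        --size=4G --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 \
        --runtime=60 --time_based --group_reporting

Watching atop (disks and CPU) on the OSD nodes while that runs will show the
same bottlenecks as with the rados bench.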
> >
> >> Thanks,
> >>
> >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> >>
> >>> Hello,
> >>> Sorry for the misunderstanding about IOPS. Here are some summary
> >>> stats of my benchmark (do the 20 - 30 IOPS seem normal to you?):
> >>>
> >>> ceph osd pool create test 512 512
> >>>
> >>> rados bench -p test 10 write --no-cleanup
> >>>
> >>> Total time run:         10.480383
> >>> Total writes made:      288
> >>> Write size:             4194304
> >>> Object size:            4194304
> >>> Bandwidth (MB/sec):     109.92
> >>> Stddev Bandwidth:       11.9926
> >>> Max bandwidth (MB/sec): 124
> >>> Min bandwidth (MB/sec): 80
> >>> Average IOPS:           27
> >>> Stddev IOPS:            3
> >>> Max IOPS:               31
> >>> Min IOPS:               20
> >>> Average Latency(s):     0.579105
> >>> Stddev Latency(s):      0.19902
> >>> Max latency(s):         1.32831
> >>> Min latency(s):         0.245505
> >>>
> >>> rados bench -p bench -p test 10 seq
> >>> Total time run:         10.340724
> >>> Total reads made:       288
> >>> Read size:              4194304
> >>> Object size:            4194304
> >>> Bandwidth (MB/sec):     111.404
> >>> Average IOPS:           27
> >>> Stddev IOPS:            2
> >>> Max IOPS:               31
> >>> Min IOPS:               22
> >>> Average Latency(s):     0.564858
> >>> Max latency(s):         1.65278
> >>> Min latency(s):         0.141504
> >>>
> >>> rados bench -p bench -p test 10 rand
> >>> Total time run:         10.546251
> >>> Total reads made:       293
> >>> Read size:              4194304
> >>> Object size:            4194304
> >>> Bandwidth (MB/sec):     111.13
> >>> Average IOPS:           27
> >>> Stddev IOPS:            2
> >>> Max IOPS:               32
> >>> Min IOPS:               24
> >>> Average Latency(s):     0.57092
> >>> Max latency(s):         1.8631
> >>> Min latency(s):         0.161936
> >>>
> >>>
> >>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>>
> >>>>
> >>>> Hello,
> >>>>
> >>>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> >>>>
> >>>>> I am using 10G infiniband for cluster network and 1G ethernet for
> >>>>> public.
> >>>>
> >>>> Hmm, very unbalanced, but I guess that's HW you already had.
> >>>>
> >>>>> Because I don't have enough slots on the node, I am using three
> >>>>> files on the OS drive (SSD) for journaling, which really improved
> >>>>> but not entirely solved the problem.
> >>>>>
> >>>> If you can, use partitions instead of files, less overhead.
> >>>> What model SSD is that?
> >>>>
> >>>> Also putting the meta-data pool on SSDs might help.
> >>>>
> >>>>> I am quite happy with the current IOPS, which range from 200 MB/s
> >>>>> to 400 MB/s sequential write, depending on the block size.
> >>>>
> >>>> That's not IOPS, that's bandwidth, throughput.
> >>>>
> >>>>> But the problem is, when I transfer data to the cephfs at a rate
> >>>>> below 100MB/s, I can observe the slow/blocked requests warnings
> >>>>> after a few minutes via "ceph -w".
> >>>>
> >>>> I doubt the speed has anything to do with this, but rather the
> >>>> actual block size and IOPS numbers.
> >>>>
> >>>> As always, watch your storage nodes with atop (or iostat) during
> >>>> such scenarios/tests and spot the bottlenecks.
> >>>>
> >>>>> It's not specific to any particular OSDs. So I started to doubt if
> >>>>> the configuration is correct or upgrading to Jewel can solve it.
> >>>>>
> >>>> Jewel is likely to help in general, but can't fix insufficient HW
> >>>> or broken configurations.
> >>>>
> >>>>> There are about 5,000,000 objects currently in the cluster.
> >>>>>
> >>>> You're probably not hitting this, but read the recent filestore
> >>>> merge and split threads, including the entirety of this thread:
> >>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
> >>>>
> >>>> Christian
> >>>>
> >>>>> Thanks for the hints.
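On the "partitions instead of files" point above: a rough sketch of moving
one filestore OSD's journal from a file to an SSD partition is below. The
OSD id (3) and the partition are made up for illustration, and under Jewel
the OSD runs as the ceph user, so it must be able to open the journal device:

    ceph osd set noout
    systemctl stop ceph-osd@3            # Jewel systemd unit
    ceph-osd -i 3 --flush-journal        # write out the old journal first
    rm /var/lib/ceph/osd/ceph-3/journal  # remove the old journal file
    ln -s /dev/disk/by-partuuid/<partition-uuid> /var/lib/ceph/osd/ceph-3/journal
    ceph-osd -i 3 --mkjournal            # initialize the journal on the partition
    systemctl start ceph-osd@3
    ceph osd unset noout

Setting noout first keeps the cluster from rebalancing while the OSD is
briefly down.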
> >>>>>
> >>>>> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>>>>
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >>>>>>
> >>>>>>> It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512
> >>>>>>> for both cephfs_data and cephfs_metadata. I experienced some
> >>>>>>> slow/blocked requests issues when I was using hammer 0.94.x and
> >>>>>>> prior. So I was thinking if the pg_num is too large for metadata.
> >>>>>>
> >>>>>> Very, VERY much doubt this.
> >>>>>>
> >>>>>> Your "ideal" values for a cluster of this size (are you planning
> >>>>>> to grow it?) would be about 1024 PGs for data and 128 or 256 PGs
> >>>>>> for meta-data.
> >>>>>>
> >>>>>> Not really that far off and more importantly not overloading the
> >>>>>> OSDs with too many PGs in total. Or do you have more pools?
> >>>>>>
> >>>>>>> I just upgraded the cluster to Jewel today. Will watch if the
> >>>>>>> problem remains.
> >>>>>>>
> >>>>>> Jewel improvements might mask things, but I'd venture that your
> >>>>>> problems were caused by your HW not being sufficient for the load.
> >>>>>>
> >>>>>> As in, do you use SSD journals, etc?
> >>>>>> How many IOPS do you need/expect from your CephFS?
> >>>>>> How many objects are in there?
> >>>>>>
> >>>>>> Christian
> >>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>> I'm not at all sure that rados cppool actually captures
> >>>>>>>> everything (it might). Doug has been working on some similar
> >>>>>>>> stuff for disaster recovery testing and can probably walk you
> >>>>>>>> through moving over.
> >>>>>>>>
> >>>>>>>> But just how large *is* your metadata pool in relation to
> >>>>>>>> others? Having a too-large pool doesn't cost much unless it's
> >>>>>>>> grossly-inflated, and having a nice distribution of your folders
> >>>>>>>> is definitely better than not.
> >>>>>>>> -Greg
> >>>>>>>>
> >>>>>>>> On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Is there any way to change the metadata pool for a cephfs
> >>>>>>>>> without losing any existing data? I know how to clone the
> >>>>>>>>> metadata pool using rados cppool. But the filesystem still
> >>>>>>>>> links to the original metadata pool no matter what you name it.
> >>>>>>>>>
> >>>>>>>>> The motivation here is to decrease the pg_num of the metadata
> >>>>>>>>> pool. I created this cephfs cluster sometime ago, while I
> >>>>>>>>> didn't realize that I shouldn't assign a large pg_num to such
> >>>>>>>>> a small pool.
> >>>>>>>>>
> >>>>>>>>> I'm not sure if I can delete the fs and re-create it using the
> >>>>>>>>> existing data pool and the cloned metadata pool.
> >>>>>>>>>
> >>>>>>>>> Thank you.
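To put some numbers on the pg_num discussion above (assuming size 3 on both
pools, which is not stated anywhere in this thread):

    current:    (512 + 512) PGs       x 3 replicas / 15 OSDs  = ~205 PG copies per OSD
    suggested:  (1024 + 128..256) PGs x 3 replicas / 15 OSDs  = ~230-256 PG copies per OSD

Both layouts land in the same few-hundred-per-OSD range, which is why the
512-PG metadata pool by itself is unlikely to be what is causing the blocked
requests; the split between data and metadata is simply better proportioned
in the suggested layout.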
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Zhang Di
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> ceph-users mailing list
> >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Christian Balzer           Network/Systems Engineer
> >>>>>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>>>>> http://www.gol.com/
> >>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Christian Balzer           Network/Systems Engineer
> >>>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>>> http://www.gol.com/
> >>>>
> >>>
> >>>
> >
> >
> > --
> > Christian Balzer           Network/Systems Engineer
> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/


--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com