I also tried a 4K write benchmark; the IOPS is ~420. I used to get better bandwidth when I used the same network for both the cluster and the clients. Now the bandwidth must be limited by the 1G ethernet. What would you suggest I do?
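For reference, the 4K run just uses rados bench's block size flag, roughly like this (pool name and duration as in the 4M tests quoted below, so treat it as a sketch):

    rados bench -p test 10 write -b 4096 --no-cleanup    # 4 KB writes instead of the default 4 MB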
Thanks,
On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hello,

Sorry for the misunderstanding about IOPS. Here are some summary stats from my benchmark (does 20 - 30 IOPS seem normal to you?):

ceph osd pool create test 512 512

rados bench -p test 10 write --no-cleanup
Total time run:         10.480383
Total writes made:      288
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.92
Stddev Bandwidth:       11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:           27
Stddev IOPS:            3
Max IOPS:               31
Min IOPS:               20
Average Latency(s):     0.579105
Stddev Latency(s):      0.19902
Max latency(s):         1.32831
Min latency(s):         0.245505

rados bench -p test 10 seq
Total time run:       10.340724
Total reads made:     288
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.404
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             31
Min IOPS:             22
Average Latency(s):   0.564858
Max latency(s):       1.65278
Min latency(s):       0.141504

rados bench -p test 10 rand
Total time run:       10.546251
Total reads made:     293
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.13
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             32
Min IOPS:             24
Average Latency(s):   0.57092
Max latency(s):       1.8631
Min latency(s):       0.161936

On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> I am using 10G infiniband for cluster network and 1G ethernet for public.
Hmm, very unbalanced, but I guess that's HW you already had.
> Because I don't have enough slots on the node, I am using three files on
> the OS drive (SSD) for journaling, which really improved but did not
> entirely solve the problem.
>
If you can, use partitions instead of files, less overhead.
What model SSD is that?
Also putting the meta-data pool on SSDs might help.
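If you go the partition route, the usual per-OSD sequence for moving a filestore
journal off a file and onto a raw SSD partition is roughly this (IDs, paths and
service names are illustrative, adjust to your setup):

    ceph osd set noout
    systemctl stop ceph-osd@<id>
    ceph-osd -i <id> --flush-journal
    # repoint the journal symlink at the new SSD partition
    ln -sf /dev/disk/by-partuuid/<journal-part-uuid> /var/lib/ceph/osd/ceph-<id>/journal
    ceph-osd -i <id> --mkjournal
    systemctl start ceph-osd@<id>
    ceph osd unset noout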
> I am quite happy with the current IOPS, which range from 200 MB/s to 400
> MB/s sequential write, depending on the block size.
That's not IOPS, that's bandwidth, throughput.
> But the problem is,
> when I transfer data to the cephfs at a rate below 100MB/s, I can observe
> the slow/blocked requests warnings after a few minutes via "ceph -w".
I doubt the transfer speed as such has anything to do with this; what matters
are the actual block sizes and IOPS numbers.
As always, watch your storage nodes with atop (or iostat) during such
scenarios/tests and spot the bottlenecks.
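For example, something like this on each OSD node while the load is running
will show per-disk utilization, queue sizes and latencies:

    iostat -xmt 2
    # or, interactively:
    atop 2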
> It's
> not specific to any particular OSDs. So I started to doubt if the
> configuration is correct or upgrading to Jewel can solve it.
>
Jewel is likely to help in general, but can't fix insufficient HW or
broken configurations.
> There are about 5,000,000 objects currently in the cluster.
>
You're probably not hitting this, but read the recent filestore merge and
split threads, including the entirety of this thread:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
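For reference, the knobs discussed in that thread are filestore options set in
ceph.conf on the OSD nodes, along these lines (the values below are only the
commonly quoted examples, not a recommendation for your cluster):

    [osd]
    filestore merge threshold = 40
    filestore split multiple = 8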
Christian
> Thanks for the hints.
>
> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> >
> > Hello,
> >
> > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >
> > > It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both
> > > cephfs_data and cephfs_metadata. I experienced some slow/blocked requests
> > > issues when I was using hammer 0.94.x and prior. So I was wondering whether
> > > the pg_num is too large for metadata.
> >
> > Very, VERY much doubt this.
> >
> > Your "ideal" values for a cluster of this size (are you planning to grow
> > it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
> >
> > Not really that far off and more importantly not overloading the OSDs with
> > too many PGs in total. Or do you have more pools?
> >
> >
> > > I just upgraded the cluster to Jewel
> > > today. I will watch whether the problem remains.
> > >
> > Jewel improvements might mask things, but I'd venture that your problems
> > were caused by your HW not being sufficient for the load.
> >
> > As in, do you use SSD journals, etc?
> > How many IOPS do you need/expect from your CephFS?
> > How many objects are in there?
> >
> > Christian
> >
> > > Thank you.
> > >
> > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > > I'm not at all sure that rados cppool actually captures everything (it
> > > > might). Doug has been working on some similar stuff for disaster
> > > > recovery testing and can probably walk you through moving over.
> > > >
> > > > But just how large *is* your metadata pool in relation to others?
> > > > Having a too-large pool doesn't cost much unless it's
> > > > grossly-inflated, and having a nice distribution of your folders is
> > > > definitely better than not.
> > > > -Greg
> > > >
> > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> > > > > Hi,
> > > > >
> > > > > Is there any way to change the metadata pool for a cephfs without
> > > > > losing any existing data? I know how to clone the metadata pool
> > > > > using rados cppool. But the filesystem still links to the original
> > > > > metadata pool no matter what you name it.
> > > > >
> > > > > The motivation here is to decrease the pg_num of the metadata pool.
> > > > > I created this cephfs cluster some time ago and didn't realize that
> > > > > I shouldn't assign a large pg_num to such a small pool.
> > > > >
> > > > > I'm not sure if I can delete the fs and re-create it using the
> > > > > existing data pool and the cloned metadata pool.
> > > > >
> > > > > Thank you.
> > > > >
> > > > >
> > > > > Zhang Di
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/