I also tried a 4K write benchmark; the IOPS is ~420. I used to get better bandwidth when I used the same network for both the cluster and the clients. Now the bandwidth must be limited by the 1G ethernet. What would you suggest I do?
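For reference, the 4K run just uses rados bench's block size flag, roughly like this (pool name and duration as in the 4M tests quoted below, so treat it as a sketch):

    rados bench -p test 10 write -b 4096 --no-cleanup    # 4 KB writes instead of the default 4 MB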
Thanks,
On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hello,

Sorry for the misunderstanding about IOPS. Here are some summary stats from my benchmark (does 20 - 30 IOPS seem normal to you?):

ceph osd pool create test 512 512

rados bench -p test 10 write --no-cleanup
Total time run:         10.480383
Total writes made:      288
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.92
Stddev Bandwidth:       11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:           27
Stddev IOPS:            3
Max IOPS:               31
Min IOPS:               20
Average Latency(s):     0.579105
Stddev Latency(s):      0.19902
Max latency(s):         1.32831
Min latency(s):         0.245505

rados bench -p test 10 seq
Total time run:       10.340724
Total reads made:     288
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.404
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             31
Min IOPS:             22
Average Latency(s):   0.564858
Max latency(s):       1.65278
Min latency(s):       0.141504

rados bench -p test 10 rand
Total time run:       10.546251
Total reads made:     293
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.13
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             32
Min IOPS:             24
Average Latency(s):   0.57092
Max latency(s):       1.8631
Min latency(s):       0.161936

On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> I am using 10G infiniband for cluster network and 1G ethernet for public.
Hmm, very unbalanced, but I guess that's HW you already had.
> Because I don't have enough slots on the node, I am using three files on
> the OS drive (SSD) for journaling, which really improved but did not
> entirely solve the problem.
>
If you can, use partitions instead of files, less overhead.
What model SSD is that?
Also putting the meta-data pool on SSDs might help.
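If you go the partition route, the usual per-OSD sequence for moving a filestore
journal off a file and onto a raw SSD partition is roughly this (IDs, paths and
service names are illustrative, adjust to your setup):

    ceph osd set noout
    systemctl stop ceph-osd@<id>
    ceph-osd -i <id> --flush-journal
    # repoint the journal symlink at the new SSD partition
    ln -sf /dev/disk/by-partuuid/<journal-part-uuid> /var/lib/ceph/osd/ceph-<id>/journal
    ceph-osd -i <id> --mkjournal
    systemctl start ceph-osd@<id>
    ceph osd unset noout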
> I am quite happy with the current IOPS, which range from 200 MB/s to 400
> MB/s sequential write, depending on the block size.
That's not IOPS, that's bandwidth, throughput.
> But the problem is,
> when I transfer data to the cephfs at a rate below 100MB/s, I can observe
> the slow/blocked requests warnings after a few minutes via "ceph -w".
I doubt the transfer speed as such has anything to do with this; what matters
are the actual block sizes and IOPS numbers.
As always, watch your storage nodes with atop (or iostat) during such
scenarios/tests and spot the bottlenecks.
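For example, something like this on each OSD node while the load is running
will show per-disk utilization, queue sizes and latencies:

    iostat -xmt 2
    # or, interactively:
    atop 2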
> It's
> not specific to any particular OSDs. So I started to doubt if the
> configuration is correct or upgrading to Jewel can solve it.
>
Jewel is likely to help in general, but can't fix insufficient HW or
broken configurations.
> There are about 5,000,000 objects currently in the cluster.
>
You're probably not hitting this, but read the recent filestore merge and
split threads, including the entirety of this thread:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
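For reference, the knobs discussed in that thread are filestore options set in
ceph.conf on the OSD nodes, along these lines (the values below are only the
commonly quoted examples, not a recommendation for your cluster):

    [osd]
    filestore merge threshold = 40
    filestore split multiple = 8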
Christian
> Thanks for the hints.
>
> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> >
> > Hello,
> >
> > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >
> > > It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both
> > > cephfs_data and cephfs_metadata. I experienced some slow/blocked requests
> > > issues when I was using hammer 0.94.x and prior. So I was wondering whether
> > > the pg_num is too large for metadata.
> >
> > Very, VERY much doubt this.
> >
> > Your "ideal" values for a cluster of this size (are you planning to grow
> > it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
> >
> > Not really that far off and more importantly not overloading the OSDs with
> > too many PGs in total. Or do you have more pools?
> >
> >
> > > I just upgraded the cluster to Jewel
> > > today. I will watch whether the problem remains.
> > >
> > Jewel improvements might mask things, but I'd venture that your problems
> > were caused by your HW not being sufficient for the load.
> >
> > As in, do you use SSD journals, etc?
> > How many IOPS do you need/expect from your CephFS?
> > How many objects are in there?
> >
> > Christian
> >
> > > Thank you.
> > >
> > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > > I'm not at all sure that rados cppool actually captures everything (it
> > > > might). Doug has been working on some similar stuff for disaster
> > > > recovery testing and can probably walk you through moving over.
> > > >
> > > > But just how large *is* your metadata pool in relation to others?
> > > > Having a too-large pool doesn't cost much unless it's
> > > > grossly-inflated, and having a nice distribution of your folders is
> > > > definitely better than not.
> > > > -Greg
> > > >
> > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> > > > > Hi,
> > > > >
> > > > > Is there any way to change the metadata pool for a cephfs without
> > > > > losing any existing data? I know how to clone the metadata pool
> > > > > using rados cppool. But the filesystem still links to the original
> > > > > metadata pool no matter what you name it.
> > > > >
> > > > > The motivation here is to decrease the pg_num of the metadata pool.
> > > > > I created this cephfs cluster some time ago and didn't realize that
> > > > > I shouldn't assign a large pg_num to such a small pool.
> > > > >
> > > > > I'm not sure if I can delete the fs and re-create it using the
> > > > > existing data pool and the cloned metadata pool.
> > > > >
> > > > > Thank you.
> > > > >
> > > > >
> > > > > Zhang Di
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/