Re: cephfs change metadata pool?

Di Zhang <zhangdibio@xxxxxxxxx> · Wed, 13 Jul 2016 11:37:55 -0500

Hello,    Sorry for the misunderstanding about IOPS. Here are some summary stats from my benchmark (do 20 - 30 IOPS seem normal to you?):

ceph osd pool create test 512 512

rados bench -p test 10 write --no-cleanup

Total time run:         10.480383
Total writes made:      288
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.92
Stddev Bandwidth:       11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:           27
Stddev IOPS:            3
Max IOPS:               31
Min IOPS:               20
Average Latency(s):     0.579105
Stddev Latency(s):      0.19902
Max latency(s):         1.32831
Min latency(s):         0.245505

rados bench -p test 10 seq
Total time run:       10.340724
Total reads made:     288
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.404
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             31
Min IOPS:             22
Average Latency(s):   0.564858
Max latency(s):       1.65278
Min latency(s):       0.141504

rados bench -p test 10 rand
Total time run:       10.546251
Total reads made:     293
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.13
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             32
Min IOPS:             24
Average Latency(s):   0.57092
Max latency(s):       1.8631
Min latency(s):       0.161936
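
A quick cross-check of the write summary, as a minimal sketch. IOPS here is just operations divided by run time, and with 4 MiB objects every operation carries 4 MiB, so "27 IOPS" and "about 110 MB/s" describe the same measurement. Two things in the sketch are assumptions: the concurrency of 16 (the rados bench default for -t, which the command above does not override), and the comparison against a 1 Gbit/s link, which supposes the benchmark client reached the OSDs over the 1G public network mentioned in the quoted message below. The thread does not confirm the latter.

```python
# Figures copied from the "rados bench ... write" summary above.
total_time_s = 10.480383
total_writes = 288
object_size_bytes = 4194304
avg_latency_s = 0.579105

# Assumption: default rados bench concurrency (-t 16).
concurrent_ops = 16

MIB = 1024 * 1024

# Operations per second over the whole run.
iops = total_writes / total_time_s

# Each operation moves one object, so bandwidth follows directly from IOPS.
bandwidth_mib_s = iops * object_size_bytes / MIB

# Little's law: throughput = operations in flight / time each one takes.
iops_from_latency = concurrent_ops / avg_latency_s

# Assumption: traffic is limited by a 1 Gbit/s link (raw line rate, no protocol overhead).
link_ceiling_mib_s = 1e9 / 8 / MIB

print(f"IOPS from op count:     {iops:.2f}")
print(f"Bandwidth (MiB/s):      {bandwidth_mib_s:.2f}")
print(f"IOPS from latency:      {iops_from_latency:.2f}")
print(f"Share of 1 Gbit/s link: {bandwidth_mib_s / link_ceiling_mib_s:.0%}")
```

Both routes give roughly 27.5 IOPS, and the computed bandwidth reproduces the reported 109.92 when MB is read as 2^20 bytes. Under the network assumption the run averages a little over 90% of the raw line rate, which would make 20 - 30 IOPS at this object size an expected result rather than a sign of slow disks.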

On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:

> I am using 10G infiniband for cluster network and 1G ethernet for public.

Hmm, very unbalanced, but I guess that's HW you already had.

> I don't have enough slots on the node, so I am using three files on

> the OS drive (SSD) for journaling, which really improved but not entirely

> solved the problem.

>

If you can, use partitions instead of files, less overhead.

What model SSD is that?

Also putting the meta-data pool on SSDs might help.

> I am quite happy with the current IOPS, which range from 200 MB/s to 400

> MB/s sequential write, depending on the block size.

That's not IOPS, that's bandwidth, throughput.
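
To illustrate the distinction with a small sketch: the same throughput corresponds to very different IOPS depending on the block size. The block sizes below are hypothetical (the quoted message does not say which ones were used), and MB is taken as 2^20 bytes for simplicity.

```python
def iops_for(throughput_mb_s, block_size_bytes):
    """Operations per second needed to sustain a throughput at a given block size."""
    return throughput_mb_s * 1024 * 1024 / block_size_bytes


# Hypothetical block sizes, chosen only for illustration.
block_sizes = {"4 KiB": 4 * 1024, "64 KiB": 64 * 1024, "4 MiB": 4 * 1024 * 1024}

for throughput in (200, 400):  # MB/s, the range quoted above
    for label, size in block_sizes.items():
        print(f"{throughput} MB/s at {label:>6}: {iops_for(throughput, size):>9.0f} IOPS")
```

A cluster that sustains 200 MB/s with 4 MiB writes is doing about 50 operations per second; sustaining the same rate with 4 KiB writes would take tens of thousands, which is a very different load on journals and disks.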

> But the problem is,

> when I transfer data to the cephfs at a rate below 100MB/s, I can observe

> the slow/blocked requests warnings after a few minutes via "ceph -w".

I doubt the speed has anything to do with this; the actual block size

and IOPS numbers are the more likely cause.

As always, watch your storage nodes with atop (or iostat) during such

scenarios/tests and spot the bottlenecks.
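
If atop or iostat is not at hand, the per-device busy percentage they report can be approximated from /proc/diskstats on Linux. The following is a rough sketch, not a replacement for those tools: the function names and the 80% threshold are made up for illustration, and it relies on the tenth statistics field ("time spent doing I/Os", in milliseconds) as described in the kernel's iostats documentation.

```python
import time


def parse_io_ms(text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the statistics fields;
        # index 12 is the "time spent doing I/Os (ms)" counter.
        if len(fields) >= 13:
            busy[fields[2]] = int(fields[12])
    return busy


def read_io_ms(path="/proc/diskstats"):
    with open(path) as f:
        return parse_io_ms(f.read())


def utilization(before, after, interval_s):
    """Percentage of the interval each device spent busy."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / (interval_s * 1000.0)
        for dev in after
        if dev in before
    }


def watch(interval_s=5.0, samples=3, busy_threshold=80.0):
    """Print devices that were busy for more than busy_threshold percent."""
    before = read_io_ms()
    for _ in range(samples):
        time.sleep(interval_s)
        after = read_io_ms()
        for dev, pct in sorted(utilization(before, after, interval_s).items()):
            if pct >= busy_threshold:
                print(f"{dev}: {pct:.1f}% busy")
        before = after
```

Calling watch() on a storage node while the transfer is running would list the devices that stay close to 100% busy, which is where a journal SSD or a slow OSD disk would show up.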

> It's

> not specific to any particular OSDs. So I started to wonder whether the

> configuration is correct or whether upgrading to Jewel can solve it.

>

Jewel is likely to help in general, but can't fix insufficient HW or

broken configurations.

> There are about 5,000,000 objects currently in the cluster.

>

You're probably not hitting this, but read the recent filestore merge and

split threads, including the entirety of this thread:

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html

Christian

> Thanks for the hints.

>

> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:

>

> >

> > Hello,

> >

> > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:

> >

> > > It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512 for both

> > > cephfs_data and cephfs_metadata. I experienced some slow/blocked requests

> > > issues when I was using hammer 0.94.x and prior. So I was wondering whether the

> > > pg_num is too large for metadata.

> >

> > Very, VERY much doubt this.

> >

> > Your "ideal" values for a cluster of this size (are you planning to grow

> > it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.

> >

> > Not really that far off and more importantly not overloading the OSDs with

> > too many PGs in total. Or do you have more pools?
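
A rough way to put numbers on the "too many PGs in total" concern is to count PG copies per OSD: each pool contributes its pg_num times its replica count, spread across all OSDs. The sketch below assumes a replica size of 3 for every pool (the thread does not state it) and uses the 15 OSDs described above; the function name is made up for illustration.

```python
def pgs_per_osd(pools, num_osds):
    """pools maps pool name -> (pg_num, replica size)."""
    total_pg_copies = sum(pg_num * size for pg_num, size in pools.values())
    return total_pg_copies / num_osds


NUM_OSDS = 5 * 3   # 5 nodes with 3 OSDs each, from the message above
SIZE = 3           # assumption: 3 replicas per pool

current = {"cephfs_data": (512, SIZE), "cephfs_metadata": (512, SIZE)}
suggested = {"cephfs_data": (1024, SIZE), "cephfs_metadata": (128, SIZE)}
# Any additional pool counts too, for example a 512-PG benchmark pool.
with_extra_pool = dict(current, test=(512, SIZE))

for name, pools in [("current", current),
                    ("suggested", suggested),
                    ("current + extra pool", with_extra_pool)]:
    print(f"{name:>20}: {pgs_per_osd(pools, NUM_OSDS):.1f} PGs per OSD")
```

Under that assumption the current and suggested layouts land in the same range (roughly 205 versus 230 PGs per OSD), which fits the "not really that far off" remark, while every extra pool pushes the per-OSD count up noticeably.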

> >

> >

> > > I just upgraded the cluster to Jewel

> > > today. Will watch if the problem remains.

> > >

> > Jewel improvements might mask things, but I'd venture that your problems

> > were caused by your HW not being sufficient for the load.

> >

> > As in, do you use SSD journals, etc?

> > How many IOPS do you need/expect from your CephFS?

> > How many objects are in there?

> >

> > Christian

> >

> > > Thank you.

> > >

> > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> > >

> > > > I'm not at all sure that rados cppool actually captures everything (it

> > > > might). Doug has been working on some similar stuff for disaster

> > > > recovery testing and can probably walk you through moving over.

> > > >

> > > > But just how large *is* your metadata pool in relation to others?

> > > > Having a too-large pool doesn't cost much unless it's

> > > > grossly-inflated, and having a nice distribution of your folders is

> > > > definitely better than not.

> > > > -Greg

> > > >

> > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:

> > > > > Hi,

> > > > >

> > > > > >     Is there any way to change the metadata pool for a cephfs without losing

> > > > > > any existing data? I know how to clone the metadata pool using rados cppool.

> > > > > > But the filesystem still links to the original metadata pool no matter what

> > > > > > you name it.

> > > > >

> > > > > >     The motivation here is to decrease the pg_num of the metadata pool. I

> > > > > > created this cephfs cluster some time ago, before I realized that I

> > > > > > shouldn't assign a large pg_num to such a small pool.

> > > > >

> > > > > >     I'm not sure if I can delete the fs and re-create it using the existing

> > > > > > data pool and the cloned metadata pool.

> > > > >

> > > > >     Thank you.

> > > > >

> > > > >

> > > > > Zhang Di

> > > > >

> > > > > _______________________________________________

> > > > > ceph-users mailing list

> > > > > ceph-users@xxxxxxxxxxxxxx

> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> > > > >

> > > >

> >

> >

> > --

> > Christian Balzer        Network/Systems Engineer

> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications

> > http://www.gol.com/

> >

--

Christian Balzer        Network/Systems Engineer

chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications

http://www.gol.com/
