Re: cephfs change metadata pool?

Hello,

On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:

> I also tried 4K write bench. The IOPS is ~420. 

That's what people usually mean when talking about IOPS (4KB blocks).
This number is pretty low; my guess would be network latency on your
1Gb/s public network for the most part.
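
If you want to confirm that, a plain ping between a client and an OSD
node already shows the raw round-trip time Ceph pays on every request
(the hostname below is just a placeholder for one of your nodes):

# 100 pings, then look at the min/avg/max summary line
ping -c 100 osd-node1

On 1GbE you typically see 0.1-0.3ms per round-trip, and Ceph adds
several such hops plus journal/disk latency on top of that.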

You should run atop on your storage nodes while running a test like
this and see whether the OSDs (HDDs) are also very busy.
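
If you don't have atop handy, iostat (from sysstat) does the job as
well; something along these lines on each storage node while the
bench runs:

# extended per-device stats, refreshed every second
iostat -x 1

If %util on the HDDs sits near 100% the disks are your bottleneck; if
they are mostly idle, look at the network instead.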

Lastly, rados bench gives you some basic numbers, but it is not the
same as real client I/O; for that you want to run fio inside a VM or,
in your case, on a mounted CephFS.
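
A minimal sketch (assuming the CephFS is mounted at /mnt/cephfs,
adjust path and runtime to taste):

fio --name=cephfs-4k --directory=/mnt/cephfs --rw=randwrite --bs=4k \
    --size=1g --ioengine=libaio --direct=1 --runtime=60 --time_based

Unlike rados bench this also gives you latency percentiles, which is
what your users actually feel.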

> I used to have better bandwidth when I used the same network for both
> the cluster and clients. Now the bandwidth must be limited by the 1G
> ethernet.
That's the bandwidth you also see in your 4MB block tests below.
For small I/Os the real killer is latency, though.
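To put a number on it: rados bench keeps 16 requests in flight by
default, so your ~420 IOPS work out to roughly 16 / 420 ≈ 38ms average
latency per 4KB write. With SSD journals and a low-latency network a
few milliseconds per write would be more typical, so most of that time
is likely network round-trips and queueing.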

> What would you suggest I do?
> 
That depends on your budget mostly (switch ports, client NICs).

A uniform, single 10Gb/s network would be better in all aspects than the
split network you have now.

Christian

> Thanks,
> 
> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> 
> > Hello,
> >     Sorry for the misunderstanding about IOPS. Here are some summary
> > stats of my benchmark (does 20-30 IOPS seem normal to you?):
> >
> > ceph osd pool create test 512 512
> >
> > rados bench -p test 10 write --no-cleanup
> >
> > Total time run:         10.480383
> > Total writes made:      288
> > Write size:             4194304
> > Object size:            4194304
> > Bandwidth (MB/sec):     109.92
> > Stddev Bandwidth:       11.9926
> > Max bandwidth (MB/sec): 124
> > Min bandwidth (MB/sec): 80
> > Average IOPS:           27
> > Stddev IOPS:            3
> > Max IOPS:               31
> > Min IOPS:               20
> > Average Latency(s):     0.579105
> > Stddev Latency(s):      0.19902
> > Max latency(s):         1.32831
> > Min latency(s):         0.245505
> >
> > rados bench -p test 10 seq
> > Total time run:       10.340724
> > Total reads made:     288
> > Read size:            4194304
> > Object size:          4194304
> > Bandwidth (MB/sec):   111.404
> > Average IOPS:         27
> > Stddev IOPS:          2
> > Max IOPS:             31
> > Min IOPS:             22
> > Average Latency(s):   0.564858
> > Max latency(s):       1.65278
> > Min latency(s):       0.141504
> >
> > rados bench -p test 10 rand
> > Total time run:       10.546251
> > Total reads made:     293
> > Read size:            4194304
> > Object size:          4194304
> > Bandwidth (MB/sec):   111.13
> > Average IOPS:         27
> > Stddev IOPS:          2
> > Max IOPS:             32
> > Min IOPS:             24
> > Average Latency(s):   0.57092
> > Max latency(s):       1.8631
> > Min latency(s):       0.161936
> >
> >
> > On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >>
> >> Hello,
> >>
> >> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> >>
> >> > I am using 10G infiniband for the cluster network and 1G ethernet
> >> > for public.
> >> Hmm, very unbalanced, but I guess that's HW you already had.
> >>
> >> > Because I don't have enough slots on the node, I am using three
> >> > files on the OS drive (SSD) for journaling, which really improved
> >> > but did not entirely solve the problem.
> >> >
> >> If you can, use partitions instead of files, less overhead.
> >> What model SSD is that?
> >>
> >> Also putting the meta-data pool on SSDs might help.
> >>
> >> > I am quite happy with the current IOPS, which range from 200 MB/s to 400
> >> > MB/s sequential write, depending on the block size.
> >> That's not IOPS, that's bandwidth, throughput.
> >>
> >> > But the problem is, when I transfer data to the cephfs at a rate
> >> > below 100MB/s, I can observe the slow/blocked requests warnings
> >> > after a few minutes via "ceph -w".
> >>
> >> I doubt the speed as such has anything to do with this; it's more
> >> likely the actual block size and IOPS numbers.
> >>
> >> As always, watch your storage nodes with atop (or iostat) during such
> >> scenarios/tests and spot the bottlenecks.
> >>
> >> > It's not specific to any particular OSDs. So I started to wonder
> >> > whether the configuration is correct or whether upgrading to
> >> > Jewel can solve it.
> >> >
> >> Jewel is likely to help in general, but can't fix insufficient HW or
> >> broken configurations.
> >>
> >> > There are about 5,000,000 objects currently in the cluster.
> >> >
> >> You're probably not hitting this, but read the recent filestore
> >> merge and split threads, including the entirety of this thread:
> >> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
> >>
> >> Christian
> >>
> >> > Thanks for the hints.
> >> >
> >> > On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx>
> >> wrote:
> >> >
> >> > >
> >> > > Hello,
> >> > >
> >> > > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >> > >
> >> > > > It's a 5 node cluster. Each node has 3 OSDs. I set pg_num =
> >> > > > 512 for both cephfs_data and cephfs_metadata. I experienced
> >> > > > some slow/blocked requests issues when I was using hammer
> >> > > > 0.94.x and prior. So I was thinking the pg_num might be too
> >> > > > large for metadata.
> >> > >
> >> > > Very, VERY much doubt this.
> >> > >
> >> > > Your "ideal" values for a cluster of this size (are you
> >> > > planning to grow it?) would be about 1024 PGs for data and 128
> >> > > or 256 PGs for meta-data.
> >> > >
> >> > > Not really that far off and, more importantly, not overloading
> >> > > the OSDs with too many PGs in total. Or do you have more pools?
> >> > >
> >> > >
> >> > > > I just upgraded the cluster to Jewel today. Will watch if
> >> > > > the problem remains.
> >> > > >
> >> > > Jewel improvements might mask things, but I'd venture that
> >> > > your problems were caused by your HW not being sufficient for
> >> > > the load.
> >> > >
> >> > > As in, do you use SSD journals, etc?
> >> > > How many IOPS do you need/expect from your CephFS?
> >> > > How many objects are in there?
> >> > >
> >> > > Christian
> >> > >
> >> > > > Thank you.
> >> > > >
> >> > > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum
> >> > > > <gfarnum@xxxxxxxxxx> wrote:
> >> > > >
> >> > > > > I'm not at all sure that rados cppool actually captures
> >> > > > > everything (it might). Doug has been working on some
> >> > > > > similar stuff for disaster recovery testing and can
> >> > > > > probably walk you through moving over.
> >> > > > >
> >> > > > > But just how large *is* your metadata pool in relation to others?
> >> > > > > Having a too-large pool doesn't cost much unless it's
> >> > > > > grossly-inflated, and having a nice distribution of your
> >> > > > > folders is definitely better than not.
> >> > > > > -Greg
> >> > > > >
> >> > > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang
> >> > > > > <zhangdibio@xxxxxxxxx> wrote:
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > >     Is there any way to change the metadata pool for a
> >> > > > > > cephfs without losing any existing data? I know how to
> >> > > > > > clone the metadata pool using rados cppool. But the
> >> > > > > > filesystem still links to the original metadata pool no
> >> > > > > > matter what you name it.
> >> > > > > >
> >> > > > > >     The motivation here is to decrease the pg_num of the
> >> > > > > > metadata pool. I created this cephfs cluster some time
> >> > > > > > ago and didn't realize that I shouldn't assign a large
> >> > > > > > pg_num to such a small pool.
> >> > > > > >
> >> > > > > >     I'm not sure if I can delete the fs and re-create it
> >> > > > > > using the existing data pool and the cloned metadata
> >> > > > > > pool.
> >> > > > > >
> >> > > > > >     Thank you.
> >> > > > > >
> >> > > > > >
> >> > > > > > Zhang Di
> >> > > > > >
> >> > > > > > _______________________________________________
> >> > > > > > ceph-users mailing list
> >> > > > > > ceph-users@xxxxxxxxxxxxxx
> >> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > >
> >> > > > >
> >> > >
> >> > >
> >> > > --
> >> > > Christian Balzer        Network/Systems Engineer
> >> > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> > > http://www.gol.com/
> >> > >
> >>
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
> >>
> >
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


