Hello,

On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:

> I am using 10G infiniband for cluster network and 1G ethernet for public.

Hmm, very unbalanced, but I guess that's HW you already had.

> Because I don't have enough slots on the node, I am using three files on
> the OS drive (SSD) for journaling, which really improved things but did
> not entirely solve the problem.
>
If you can, use partitions instead of files, less overhead.
What model SSD is that?
Also, putting the meta-data pool on SSDs might help.

> I am quite happy with the current IOPS, which range from 200 MB/s to 400
> MB/s sequential write, depending on the block size.

That's not IOPS, that's bandwidth (throughput).

> But the problem is, when I transfer data to the cephfs at a rate below
> 100 MB/s, I can observe the slow/blocked requests warnings after a few
> minutes via "ceph -w".

I doubt the transfer speed has anything to do with this; the actual block
sizes and IOPS numbers are what matter.
As always, watch your storage nodes with atop (or iostat) during such
scenarios/tests and spot the bottlenecks.
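For example, something along these lines (just a sketch; adjust the
intervals, and sysstat/atop need to be installed on the nodes):

  # on each storage node, while the slow requests are showing up
  iostat -x 2        # extended per-device stats, 2 second interval
  atop 2             # or interactively, watch the DSK lines

  # from any node with an admin keyring, see which OSDs the
  # blocked requests are stuck on
  ceph health detail

"ceph health detail" lists the OSDs the blocked ops are sitting on, which
together with the iostat/atop output usually points straight at the
overloaded device.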
> It's not specific to any particular OSDs. So I started to doubt if the
> configuration is correct or upgrading to Jewel can solve it.
>
Jewel is likely to help in general, but it can't fix insufficient HW or
broken configurations.

> There are about 5,000,000 objects currently in the cluster.
>
You're probably not hitting this, but read the recent filestore merge and
split threads, including the entirety of this one:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html

Christian

> Thanks for the hints.
>
> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > Hello,
> >
> > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >
> > > It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for
> > > both cephfs_data and cephfs_metadata. I experienced some slow/blocked
> > > requests issues when I was using hammer 0.94.x and prior. So I was
> > > wondering if the pg_num is too large for metadata.
> >
> > Very, VERY much doubt this.
> >
> > Your "ideal" values for a cluster of this size (are you planning to
> > grow it?) would be about 1024 PGs for data and 128 or 256 PGs for
> > meta-data.
> >
> > Not really that far off and, more importantly, not overloading the
> > OSDs with too many PGs in total. Or do you have more pools?
> >
> > > I just upgraded the cluster to Jewel today. Will watch if the
> > > problem remains.
> >
> > Jewel improvements might mask things, but I'd venture that your
> > problems were caused by your HW not being sufficient for the load.
> >
> > As in, do you use SSD journals, etc.?
> > How many IOPS do you need/expect from your CephFS?
> > How many objects are in there?
> >
> > Christian
> >
> > > Thank you.
> > >
> > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx>
> > > wrote:
> > >
> > > > I'm not at all sure that rados cppool actually captures everything
> > > > (it might). Doug has been working on some similar stuff for
> > > > disaster recovery testing and can probably walk you through moving
> > > > over.
> > > >
> > > > But just how large *is* your metadata pool in relation to others?
> > > > Having a too-large pool doesn't cost much unless it's grossly
> > > > inflated, and having a nice distribution of your folders is
> > > > definitely better than not.
> > > > -Greg
> > > >
> > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx>
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > Is there any way to change the metadata pool for a cephfs
> > > > > without losing any existing data? I know how to clone the
> > > > > metadata pool using rados cppool, but the filesystem still links
> > > > > to the original metadata pool no matter what you name it.
> > > > >
> > > > > The motivation here is to decrease the pg_num of the metadata
> > > > > pool. I created this cephfs cluster some time ago, and I didn't
> > > > > realize that I shouldn't assign a large pg_num to such a small
> > > > > pool.
> > > > >
> > > > > I'm not sure if I can delete the fs and re-create it using the
> > > > > existing data pool and the cloned metadata pool.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Zhang Di
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> > Christian Balzer           Network/Systems Engineer
> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com