Re: cephfs change metadata pool?

Christian Balzer <chibi@xxxxxxx> · Thu, 14 Jul 2016 13:04:45 +0900

Hello,

On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:

> Hi,
> 	I switched to using only the InfiniBand network. For 4KB writes, the IOPS didn't improve much.

That's mostly going to be bound by latencies (as I just wrote in the other
thread), both network and internal Ceph ones.
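To put rough numbers on that: with a fixed number of requests in flight,
throughput is capped at concurrency divided by per-request latency (Little's
law). Below is a minimal Python sketch, assuming rados bench's default of 16
concurrent operations for the 4MB write run quoted further down (that command
did not pass -t). The function names are made up for illustration.

```python
def latency_bound_iops(concurrency, avg_latency_s):
    """Upper bound on IOPS when `concurrency` requests are kept in flight."""
    return concurrency / avg_latency_s


def implied_latency_ms(concurrency, iops):
    """Mean per-request latency (ms) implied by an observed IOPS figure."""
    return 1000.0 * concurrency / iops


# 4MB write run quoted below: average latency 0.579105 s, assumed -t 16.
print(round(latency_bound_iops(16, 0.579105), 1))

# 4KB run on the 32-OSD cluster: about 1050 IOPS with -t 32.
print(round(implied_latency_ms(32, 1050), 1))
```

The first figure comes out close to the average of 27 IOPS reported for the
4MB run, and the second suggests roughly 30 ms per 4KB write on the 32-OSD
cluster, so per-request latency rather than disk or link bandwidth sets the
ceiling for small writes.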

The cluster I described in the other thread has 32 OSDs and does about
1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".
So about half with your 15 OSDs isn't all that unexpected.
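As a back-of-the-envelope check of that "about half", here is a small Python
sketch that scales the 32-OSD figure by OSD count. Linear scaling is an
assumption made purely for illustration (it ignores differences in hardware,
network and bench concurrency), and the function name is hypothetical.

```python
def scale_iops_by_osds(ref_iops, ref_osds, target_osds):
    """Naive linear scaling of an IOPS figure with the number of OSDs."""
    return ref_iops * target_osds / ref_osds


# Reference: about 1050 IOPS on 32 OSDs; target cluster: 15 OSDs.
estimate = scale_iops_by_osds(1050, 32, 15)
print(round(estimate))

# The ~420 IOPS reported for the 15-OSD cluster, relative to that estimate.
print(round(420 / estimate, 2))
```

Under that assumption the estimate is a little under 500 IOPS, so the
reported ~420 is in the same ballpark.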

Once again, to get something more realistic use fio.

> I also logged into the OSD nodes and atop showed the disks are not always
> at 100% busy. Please check a snapshot of one node below:

When you do the 4KB bench (for 60 seconds or so), also watch the CPU
usage; rados bench is a killer there.

Christian

> 
> DSK |          sdc  | busy     72% |  read    20/s |  write   86/s | KiB/w     13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |
> DSK |          sda  | busy     47% |  read     0/s |  write  589/s | KiB/w      4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |
> DSK |          sdb  | busy     31% |  read    14/s |  write   77/s | KiB/w     10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |
> DSK |          sdd  | busy     19% |  read     4/s |  write   50/s | KiB/w     11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |
> NET | transport     | tcpi   656/s |  tcpo   655/s |  udpi     0/s | udpo     0/s  | tcpao    0/s |  tcppo    0/s |  tcprs    0/s |
> NET | network       | ipi    657/s |  ipo    655/s |  ipfrw    0/s | deliv  657/s  |              |  icmpi    0/s |  icmpo    0/s |
> NET | p10p1     0%  | pcki     0/s |  pcko     0/s |  si    0 Kbps | so    1 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> NET | ib0     ----  | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> NET | lo      ----  | pcki    19/s |  pcko    19/s |  si   14 Kbps | so   14 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> 	
> 	/dev/sda is the OS and journaling SSD. The other three are OSDs.
> 
> 	Am I missing anything?
> 
> 	Thanks,
> 
> Zhang, Di
> Postdoctoral Associate
> Baylor College of Medicine
> 
> > On Jul 13, 2016, at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> > 
> >> I also tried 4K write bench. The IOPS is ~420. 
> > 
> > That's what people usually mean (4KB blocks) when talking about IOPS.
> > This number is pretty low; my guess would be network latency on your 1Gb/s
> > network for the most part.
> > 
> > You should run atop on your storage nodes while running a test like this
> > and see if the OSDs (HDDs) are also very busy.
> > 
> > Lastly, rados bench gives you some basic numbers, but it is not the same
> > as real client I/O. For that you want to run fio inside a VM or, in your
> > case, on a mounted CephFS.
> > 
> >> I used to have better
> >> bandwidth when I used the same network for both the cluster and clients. Now
> >> the bandwidth must be limited by the 1G ethernet.
> > That's the bandwidth you also see in your 4MB block tests below.
> > For small I/Os the real killer is latency, though.
> > 
> >> What would you suggest I do?
> >> 
> > That depends on your budget mostly (switch ports, client NICs).
> > 
> > A uniform, single 10Gb/s network would be better in all aspects than the
> > split network you have now.
> > 
> > Christian
> > 
> >> Thanks,
> >> 
> >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> >> 
> >>> Hello,
> >>>    Sorry for the misunderstanding about IOPS. Here are some summary stats
> >>> of my benchmark (does 20-30 IOPS seem normal to you?):
> >>> 
> >>> ceph osd pool create test 512 512
> >>> 
> >>> rados bench -p test 10 write --no-cleanup
> >>> 
> >>> Total time run:         10.480383
> >>> Total writes made:      288
> >>> Write size:             4194304
> >>> Object size:            4194304
> >>> Bandwidth (MB/sec):     109.92
> >>> Stddev Bandwidth:       11.9926
> >>> Max bandwidth (MB/sec): 124
> >>> Min bandwidth (MB/sec): 80
> >>> Average IOPS:           27
> >>> Stddev IOPS:            3
> >>> Max IOPS:               31
> >>> Min IOPS:               20
> >>> Average Latency(s):     0.579105
> >>> Stddev Latency(s):      0.19902
> >>> Max latency(s):         1.32831
> >>> Min latency(s):         0.245505
> >>> 
> >>> rados bench -p test 10 seq
> >>> Total time run:       10.340724
> >>> Total reads made:     288
> >>> Read size:            4194304
> >>> Object size:          4194304
> >>> Bandwidth (MB/sec):   111.404
> >>> Average IOPS          27
> >>> Stddev IOPS:          2
> >>> Max IOPS:             31
> >>> Min IOPS:             22
> >>> Average Latency(s):   0.564858
> >>> Max latency(s):       1.65278
> >>> Min latency(s):       0.141504
> >>> 
> >>> rados bench -p test 10 rand
> >>> Total time run:       10.546251
> >>> Total reads made:     293
> >>> Read size:            4194304
> >>> Object size:          4194304
> >>> Bandwidth (MB/sec):   111.13
> >>> Average IOPS:         27
> >>> Stddev IOPS:          2
> >>> Max IOPS:             32
> >>> Min IOPS:             24
> >>> Average Latency(s):   0.57092
> >>> Max latency(s):       1.8631
> >>> Min latency(s):       0.161936
> >>> 
> >>> 
> >>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>> 
> >>>> 
> >>>> Hello,
> >>>> 
> >>>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> >>>> 
> >>>>> I am using 10G infiniband for cluster network and 1G ethernet for
> >>>>> public.
> >>>> Hmm, very unbalanced, but I guess that's HW you already had.
> >>>> 
> >>>>> Because I don't have enough slots on the node, I am using three files on
> >>>>> the OS drive (SSD) for journaling, which really improved but did not
> >>>>> entirely solve the problem.
> >>>>> 
> >>>> If you can, use partitions instead of files, less overhead.
> >>>> What model SSD is that?
> >>>> 
> >>>> Also putting the meta-data pool on SSDs might help.
> >>>> 
> >>>>> I am quite happy with the current IOPS, which range from 200 MB/s to 400
> >>>>> MB/s sequential write, depending on the block size.
> >>>> That's not IOPS, that's bandwidth, throughput.
> >>>> 
> >>>>> But the problem is,
> >>>>> when I transfer data to the cephfs at a rate below 100MB/s, I can observe
> >>>>> the slow/blocked requests warnings after a few minutes via "ceph -w".
> >>>> 
> >>>> I doubt the speed has anything to do with this; more likely it is the
> >>>> actual block size and IOPS numbers.
> >>>> 
> >>>> As always, watch your storage nodes with atop (or iostat) during such
> >>>> scenarios/tests and spot the bottlenecks.
> >>>> 
> >>>>> It's
> >>>>> not specific to any particular OSD. So I started to wonder whether the
> >>>>> configuration is correct, or whether upgrading to Jewel can solve it.
> >>>>> 
> >>>> Jewel is likely to help in general, but can't fix insufficient HW or
> >>>> broken configurations.
> >>>> 
> >>>>> There are about 5,000,000 objects currently in the cluster.
> >>>>> 
> >>>> You're probably not hitting this, but read the recent filestore merge and
> >>>> split threads, including the entirety of this thread:
> >>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
> >>>> 
> >>>> Christian
> >>>> 
> >>>>> Thanks for the hints.
> >>>>> 
> >>>>> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>>>> 
> >>>>>> 
> >>>>>> Hello,
> >>>>>> 
> >>>>>> On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> >>>>>> 
> >>>>>>> It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both
> >>>>>>> cephfs_data and cephfs_metadata. I experienced some slow/blocked request
> >>>>>>> issues when I was using hammer 0.94.x and prior. So I was wondering whether
> >>>>>>> the pg_num is too large for metadata.
> >>>>>> 
> >>>>>> Very, VERY much doubt this.
> >>>>>> 
> >>>>>> Your "ideal" values for a cluster of this size (are you planning to grow
> >>>>>> it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
> >>>>>> 
> >>>>>> Not really that far off and more importantly not overloading the OSDs with
> >>>>>> too many PGs in total. Or do you have more pools?
> >>>>>> 
> >>>>>> 
> >>>>>>> I just upgraded the cluster to Jewel
> >>>>>>> today. Will watch if the problem remains.
> >>>>>>> 
> >>>>>> Jewel improvements might mask things, but I'd venture that your problems
> >>>>>> were caused by your HW not being sufficient for the load.
> >>>>>> 
> >>>>>> As in, do you use SSD journals, etc?
> >>>>>> How many IOPS do you need/expect from your CephFS?
> >>>>>> How many objects are in there?
> >>>>>> 
> >>>>>> Christian
> >>>>>> 
> >>>>>>> Thank you.
> >>>>>>> 
> >>>>>>> On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>>>>>> 
> >>>>>>>> I'm not at all sure that rados cppool actually captures everything (it
> >>>>>>>> might). Doug has been working on some similar stuff for disaster
> >>>>>>>> recovery testing and can probably walk you through moving over.
> >>>>>>>> 
> >>>>>>>> But just how large *is* your metadata pool in relation to others?
> >>>>>>>> Having a too-large pool doesn't cost much unless it's
> >>>>>>>> grossly-inflated, and having a nice distribution of your folders is
> >>>>>>>> definitely better than not.
> >>>>>>>> -Greg
> >>>>>>>> 
> >>>>>>>> On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>> 
> >>>>>>>>>    Is there any way to change the metadata pool for a cephfs without losing
> >>>>>>>>> any existing data? I know how to clone the metadata pool using rados cppool.
> >>>>>>>>> But the filesystem still links to the original metadata pool no matter what
> >>>>>>>>> you name it.
> >>>>>>>>> 
> >>>>>>>>>    The motivation here is to decrease the pg_num of the metadata pool. I
> >>>>>>>>> created this cephfs cluster some time ago, when I didn't realize that I
> >>>>>>>>> shouldn't assign a large pg_num to such a small pool.
> >>>>>>>>> 
> >>>>>>>>>    I'm not sure if I can delete the fs and re-create it using the existing
> >>>>>>>>> data pool and the cloned metadata pool.
> >>>>>>>>> 
> >>>>>>>>>    Thank you.
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Zhang Di
> >>>>>>>>> 
> >>>>>>>>> _______________________________________________
> >>>>>>>>> ceph-users mailing list
> >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> --
> >>>>>> Christian Balzer        Network/Systems Engineer
> >>>>>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >>>>>> http://www.gol.com/
> >>>>>> 
> >>> 
> >>> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/