Re: cephfs change metadata pool?

Di Zhang <zhangdibio@xxxxxxxxx> · Wed, 20 Jul 2016 12:36:44 -0500

Update:
    After upgrading to Jewel and changing journaling to SSD, I no longer get the slow/blocked request warnings during normal data copying.
    Thank you all.

Zhang Di

On Wed, Jul 13, 2016 at 11:04 PM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:

> Hi,

>       I changed to only use the infiniband network. For the 4KB write, the IOPS doesn’t improve much.

That's mostly going to be bound by latencies (as I just wrote in the other

thread), both network and internal Ceph ones.
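
As a rough cross-check of the latency point, Little's law ties these figures together: with a fixed number of writes in flight, IOPS is roughly that number divided by the average per-write latency. The sketch below is illustrative Python, not something run on either cluster. The helper names are made up, and the 16 in-flight operations used for the 4 MB run assume rados bench was left at its default -t, which the thread does not state.

```python
# Little's law for a closed-loop benchmark:
#   IOPS ~= ops_in_flight / average_latency_seconds

def implied_latency_ms(iops, ops_in_flight):
    """Average per-op latency (ms) implied by a sustained IOPS figure."""
    return ops_in_flight / iops * 1000.0

def implied_iops(latency_s, ops_in_flight):
    """IOPS implied by an average per-op latency in seconds."""
    return ops_in_flight / latency_s

# 32-OSD cluster: about 1050 IOPS with "-t 32" (figures from this thread).
lat_32_osd = implied_latency_ms(1050, 32)

# 4 MB write run quoted further down: average latency 0.579105 s.
# ASSUMPTION: 16 ops in flight (rados bench default, not stated in the thread).
iops_4mb = implied_iops(0.579105, 16)

print(f"{lat_32_osd:.1f} ms per 4 KB write")   # 30.5 ms
print(f"{iops_4mb:.1f} IOPS for 4 MB writes")  # 27.6, close to the reported 27
```

Read this way, roughly 30 ms per 4 KB write is what caps the small-block result, which is why a faster link alone changes little.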

The cluster I described in the other thread has 32 OSDs and does about

1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".

So about half with your 15 OSDs isn't all that unexpected.
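
The "about half" estimate can be sanity-checked by scaling per OSD. This is only a back-of-the-envelope sketch: it assumes IOPS scale linearly with OSD count and that the two clusters are otherwise comparable, neither of which is guaranteed. The function name is hypothetical.

```python
def scale_by_osd_count(ref_iops, ref_osds, target_osds):
    """Naive linear scaling of a reference IOPS figure to another OSD count."""
    return ref_iops * target_osds / ref_osds

expected = scale_by_osd_count(1050, 32, 15)  # reference figures from this thread
observed = 420                               # the ~420 IOPS reported for 15 OSDs

print(f"expected ~{expected:.0f} IOPS, observed {observed} ({observed / expected:.0%})")
# expected ~492 IOPS, observed 420 (85%)
```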

Once again, to get something more realistic use fio.

> I also logged into the OSD nodes and atop showed the disks are not always

> at 100% busy. Please check a snapshot of one node below:

When you do the 4KB bench (for 60 seconds or so), also watch the CPU

usage, rados bench is a killer there.

Christian

>

> DSK |          sdc  | busy     72% |  read    20/s |  write   86/s | KiB/w     13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |

> DSK |          sda  | busy     47% |  read     0/s |  write  589/s | KiB/w      4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |

> DSK |          sdb  | busy     31% |  read    14/s |  write   77/s | KiB/w     10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |

> DSK |          sdd  | busy     19% |  read     4/s |  write   50/s | KiB/w     11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |

> NET | transport     | tcpi   656/s |  tcpo   655/s |  udpi     0/s | udpo     0/s  | tcpao    0/s |  tcppo    0/s |  tcprs    0/s |

> NET | network       | ipi    657/s |  ipo    655/s |  ipfrw    0/s | deliv  657/s  |              |  icmpi    0/s |  icmpo    0/s |

> NET | p10p1     0%  | pcki     0/s |  pcko     0/s |  si    0 Kbps | so    1 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |

> NET | ib0     ----  | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |

> NET | lo      ----  | pcki    19/s |  pcko    19/s |  si   14 Kbps | so   14 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |

>

>       /dev/sda is the OS and journaling SSD. The other three are OSDs.

>

>       Am I missing anything?

>

>       Thanks,

>

>

>

>

> Zhang, Di

> Postdoctoral Associate

> Baylor College of Medicine

>

> > On Jul 13, 2016, at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:

> >

> >

> > Hello,

> >

> > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:

> >

> >> I also tried 4K write bench. The IOPS is ~420.

> >

> > That's what people usually mean (4KB blocks) when talking about IOPS.

> > This number is pretty low, my guess would be network latency on your 1Gb/s

> > network for the most part.

> >

> > You should run atop on your storage nodes while running a test like this

> > and see if the OSDs (HDDs) are also very busy.

> >

> > Lastly the rados bench gives you some basic numbers but it is not the same

> > as real client I/O, for that you want to run fio inside a VM or in your

> > case on a mounted CephFS.

> >

> >> I used to have better

> >> bandwidth when I used the same network for both the cluster and clients. Now

> >> the bandwidth must be limited by the 1G ethernet.

> > That's the bandwidth you also see in your 4MB block tests below.

> > For small I/Os the real killer is latency, though.
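
To make the bandwidth versus IOPS distinction concrete, the two are linked by the I/O size: throughput = IOPS x bytes per operation. The sketch below recomputes the 4 MB write summary quoted further down from its own totals and then applies the same relation to the 4 KB result. Helper names are illustrative, and "MB" is treated as 1024*1024 bytes because that is what reproduces the reported figure.

```python
MIB = 1024 * 1024

def throughput_mib_s(ops, op_bytes, seconds):
    """Throughput in MiB/s for a run of fixed-size operations."""
    return ops * op_bytes / seconds / MIB

def ops_per_second(ops, seconds):
    """Operations completed per second (IOPS)."""
    return ops / seconds

# 4 MB write run from this thread: 288 writes of 4194304 bytes in 10.480383 s.
bw = throughput_mib_s(288, 4194304, 10.480383)
rate = ops_per_second(288, 10.480383)

# 4 KB run: about 420 IOPS moves very little data.
small = 420 * 4096 / MIB

print(f"4 MB writes: {bw:.2f} MiB/s at {rate:.2f} IOPS")  # 109.92 MiB/s at 27.48 IOPS
print(f"4 KB writes: {small:.2f} MiB/s at 420 IOPS")      # 1.64 MiB/s
```

So the 4 MB test is limited by the link's bandwidth, while the 4 KB test sits far below any bandwidth limit and is bounded by per-operation latency.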

> >

> >> What would you suggest to

> >> me to do?

> >>

> > That depends on your budget mostly (switch ports, client NICs).

> >

> > A uniform, single 10Gb/s network would be better in all aspects than the

> > split network you have now.

> >

> > Christian

> >

> >> Thanks,

> >>

> >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:

> >>

> >>> Hello,

> >>>    Sorry for the misunderstanding about IOPS. Here are some summary stats

> >>> of my benchmark (Does 20 - 30 IOPS seem normal to you?):

> >>>

> >>> ceph osd pool create test 512 512

> >>>

> >>> rados bench -p test 10 write --no-cleanup

> >>>

> >>> Total time run:         10.480383

> >>> Total writes made:      288

> >>> Write size:             4194304

> >>> Object size:            4194304

> >>> Bandwidth (MB/sec):     109.92

> >>> Stddev Bandwidth:       11.9926

> >>> Max bandwidth (MB/sec): 124

> >>> Min bandwidth (MB/sec): 80

> >>> Average IOPS:           27

> >>> Stddev IOPS:            3

> >>> Max IOPS:               31

> >>> Min IOPS:               20

> >>> Average Latency(s):     0.579105

> >>> Stddev Latency(s):      0.19902

> >>> Max latency(s):         1.32831

> >>> Min latency(s):         0.245505

> >>>

> >>> rados bench -p bench -p test 10 seq

> >>> Total time run:       10.340724

> >>> Total reads made:     288

> >>> Read size:            4194304

> >>> Object size:          4194304

> >>> Bandwidth (MB/sec):   111.404

> >>> Average IOPS:         27

> >>> Stddev IOPS:          2

> >>> Max IOPS:             31

> >>> Min IOPS:             22

> >>> Average Latency(s):   0.564858

> >>> Max latency(s):       1.65278

> >>> Min latency(s):       0.141504

> >>>

> >>> rados bench -p bench -p test 10 rand

> >>> Total time run:       10.546251

> >>> Total reads made:     293

> >>> Read size:            4194304

> >>> Object size:          4194304

> >>> Bandwidth (MB/sec):   111.13

> >>> Average IOPS:         27

> >>> Stddev IOPS:          2

> >>> Max IOPS:             32

> >>> Min IOPS:             24

> >>> Average Latency(s):   0.57092

> >>> Max latency(s):       1.8631

> >>> Min latency(s):       0.161936

> >>>

> >>>

> >>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:

> >>>

> >>>>

> >>>> Hello,

> >>>>

> >>>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:

> >>>>

> >>>>> I am using 10G infiniband for cluster network and 1G ethernet for public.

> >>>> Hmm, very unbalanced, but I guess that's HW you already had.

> >>>>

> >>>>> Because I don't have enough slots on the node, I am using three files on

> >>>>> the OS drive (SSD) for journaling, which really improved but not entirely

> >>>>> solved the problem.

> >>>>>

> >>>> If you can, use partitions instead of files, less overhead.

> >>>> What model SSD is that?

> >>>>

> >>>> Also putting the meta-data pool on SSDs might help.

> >>>>

> >>>>> I am quite happy with the current IOPS, which range from 200 MB/s to 400

> >>>>> MB/s sequential write, depending on the block size.

> >>>> That's not IOPS, that's bandwidth, throughput.

> >>>>

> >>>>> But the problem is,

> >>>>> when I transfer data to the cephfs at a rate below 100MB/s, I can observe

> >>>>> the slow/blocked requests warnings after a few minutes via "ceph -w".

> >>>>

> >>>> I doubt the speed has anything to do with this, but the actual block size

> >>>> and IOPS numbers.

> >>>>

> >>>> As always, watch your storage nodes with atop (or iostat) during such

> >>>> scenarios/tests and spot the bottlenecks.

> >>>>

> >>>>> It's

> >>>>> not specific to any particular OSDs. So I started to doubt if the

> >>>>> configuration is correct or upgrading to Jewel can solve it.

> >>>>>

> >>>> Jewel is likely to help in general, but can't fix insufficient HW or

> >>>> broken configurations.

> >>>>

> >>>>> There are about 5,000,000 objects currently in the cluster.

> >>>>>

> >>>> You're probably not hitting this, but read the recent filestore merge and

> >>>> split threads, including the entirety of this thread:

> >>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html

> >>>>

> >>>> Christian

> >>>>

> >>>>> Thanks for the hints.

> >>>>>

> >>>>> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:

> >>>>>

> >>>>>>

> >>>>>> Hello,

> >>>>>>

> >>>>>> On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:

> >>>>>>

> >>>>>>> It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both

> >>>>>>> cephfs_data and cephfs_metadata. I experienced some slow/blocked request

> >>>>>>> issues when I was using hammer 0.94.x and prior. So I was wondering whether the

> >>>>>>> pg_num is too large for metadata.

> >>>>>>

> >>>>>> Very, VERY much doubt this.

> >>>>>>

> >>>>>> Your "ideal" values for a cluster of this size (are you planning to grow

> >>>>>> it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.

> >>>>>>

> >>>>>> Not really that far off and more importantly not overloading the OSDs with

> >>>>>> too many PGs in total. Or do you have more pools?
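
The "not overloading the OSDs" remark can be put into numbers by counting PG copies per OSD. This is a hedged sketch only: the replica size of 3 is an assumption (the thread never states the pool size), the helper name is made up, and it ignores any pools other than the two CephFS ones.

```python
def pg_copies_per_osd(pool_pg_nums, replica_size, osd_count):
    """Average number of PG replicas each OSD carries across the given pools."""
    return sum(pool_pg_nums) * replica_size / osd_count

OSDS = 15      # 5 nodes x 3 OSDs, as described in this thread
REPLICAS = 3   # ASSUMPTION: replicated pools with size 3

current = pg_copies_per_osd([512, 512], REPLICAS, OSDS)          # data, metadata as configured
suggested_low = pg_copies_per_osd([1024, 128], REPLICAS, OSDS)   # 1024 data + 128 metadata
suggested_high = pg_copies_per_osd([1024, 256], REPLICAS, OSDS)  # 1024 data + 256 metadata

print(f"current:   {current:.1f} PG copies per OSD")             # 204.8
print(f"suggested: {suggested_low:.1f} to {suggested_high:.1f}")  # 230.4 to 256.0
```

Under that assumption the existing 512/512 split carries no more PG copies per OSD than the suggested layout, which fits the view that pg_num is unlikely to be behind the slow requests.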

> >>>>>>

> >>>>>>

> >>>>>>> I just upgraded the cluster to Jewel

> >>>>>>> today. Will watch if the problem remains.

> >>>>>>>

> >>>>>> Jewel improvements might mask things, but I'd venture that your problems

> >>>>>> were caused by your HW not being sufficient for the load.

> >>>>>>

> >>>>>> As in, do you use SSD journals, etc?

> >>>>>> How many IOPS do you need/expect from your CephFS?

> >>>>>> How many objects are in there?

> >>>>>>

> >>>>>> Christian

> >>>>>>

> >>>>>>> Thank you.

> >>>>>>>

> >>>>>>> On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> >>>>>>>

> >>>>>>>> I'm not at all sure that rados cppool actually captures everything (it

> >>>>>>>> might). Doug has been working on some similar stuff for disaster

> >>>>>>>> recovery testing and can probably walk you through moving over.

> >>>>>>>>

> >>>>>>>> But just how large *is* your metadata pool in relation to others?

> >>>>>>>> Having a too-large pool doesn't cost much unless it's

> >>>>>>>> grossly-inflated, and having a nice distribution of your folders is

> >>>>>>>> definitely better than not.

> >>>>>>>> -Greg

> >>>>>>>>

> >>>>>>>> On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:

> >>>>>>>>> Hi,

> >>>>>>>>>

> >>>>>>>>>    Is there any way to change the metadata pool for a cephfs without losing

> >>>>>>>>> any existing data? I know how to clone the metadata pool using rados cppool.

> >>>>>>>>> But the filesystem still links to the original metadata pool no matter what

> >>>>>>>>> you name it.

> >>>>>>>>>

> >>>>>>>>>    The motivation here is to decrease the pg_num of the metadata pool. I

> >>>>>>>>> created this cephfs cluster some time ago, when I didn't realize that I

> >>>>>>>>> shouldn't assign a large pg_num to such a small pool.

> >>>>>>>>>

> >>>>>>>>>    I'm not sure if I can delete the fs and re-create it using the existing

> >>>>>>>>> data pool and the cloned metadata pool.

> >>>>>>>>>

> >>>>>>>>>    Thank you.

> >>>>>>>>>

> >>>>>>>>>

> >>>>>>>>> Zhang Di

> >>>>>>>>>

> >>>>>>>>> _______________________________________________

> >>>>>>>>> ceph-users mailing list

> >>>>>>>>> ceph-users@xxxxxxxxxxxxxx

> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> >>>>>>>>>

> >>>>>>>>

> >>>>>>

> >>>>>>

> >>>>>> --

> >>>>>> Christian Balzer        Network/Systems Engineer

> >>>>>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications

> >>>>>> http://www.gol.com/

> >>>>>>

> >>>>

> >>>>


> >>>>

> >>>

> >>>

> >

> >


--

Christian Balzer        Network/Systems Engineer

chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications

http://www.gol.com/
