Hi,

I changed the setup to use only the infiniband network. For the 4KB writes, the IOPS didn't improve much. I also logged into the OSD nodes, and atop showed the disks are not always 100% busy. Please see a snapshot of one node below:
DSK | sdc       | busy 72% | read 20/s  | write 86/s  | KiB/w 13 | MBr/s 0.16 | MBw/s 1.12 | avio 6.69 ms |
DSK | sda       | busy 47% | read 0/s   | write 589/s | KiB/w 4  | MBr/s 0.00 | MBw/s 2.83 | avio 0.79 ms |
DSK | sdb       | busy 31% | read 14/s  | write 77/s  | KiB/w 10 | MBr/s 0.11 | MBw/s 0.76 | avio 3.42 ms |
DSK | sdd       | busy 19% | read 4/s   | write 50/s  | KiB/w 11 | MBr/s 0.03 | MBw/s 0.55 | avio 3.40 ms |
NET | transport | tcpi 656/s | tcpo 655/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 0/s |
NET | network   | ipi 657/s  | ipo 655/s  | ipfrw 0/s | deliv 657/s | icmpi 0/s | icmpo 0/s |
NET | p10p1 0%  | pcki 0/s   | pcko 0/s   | si 0 Kbps    | so 1 Kbps    | erri 0/s | erro 0/s | drpo 0/s |
NET | ib0 ----  | pcki 637/s | pcko 636/s | si 8006 Kbps | so 5213 Kbps | erri 0/s | erro 0/s | drpo 0/s |
NET | lo ----   | pcki 19/s  | pcko 19/s  | si 14 Kbps   | so 14 Kbps   | erri 0/s | erro 0/s | drpo 0/s |

/dev/sda is the OS and journaling SSD. The other three are OSDs.
Am I missing anything?
Thanks,
Zhang, Di
Postdoctoral Associate
Baylor College of Medicine
On Jul 13, 2016, at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,

On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:

I also tried 4K write bench. The IOPS is ~420.

That's what people usually mean (4KB blocks) when talking about IOPS. This number is pretty low; my guess would be network latency on your 1Gb network for the most part.

You should run atop on your storage nodes while running a test like this and see if the OSDs (HDDs) are also very busy.

Lastly, rados bench gives you some basic numbers, but it is not the same as real client I/O. For that you want to run fio inside a VM, or in your case on a mounted CephFS.
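Something along these lines would be a starting point for the 4KB random write case (a sketch only: the /mnt/cephfs path and the size/job/depth values are assumptions to adjust for your setup):

fio --name=4ktest --directory=/mnt/cephfs/fiotest --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=4 \
    --iodepth=32 --runtime=60 --time_based --group_reporting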
I used to have better bandwidth when I used the same network for both the cluster and the clients. Now the bandwidth must be limited by the 1G ethernet.

That's the bandwidth you also see in your 4MB block tests below. For small I/Os the real killer is latency, though.
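A quick way to put a number on that latency is to compare round-trip times over both links (the addresses are placeholders; intervals below 0.2s need root):

ping -c 1000 -i 0.01 <osd-node-1G-address>
ping -c 1000 -i 0.01 <osd-node-IB-address>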
What would you suggest I do?

That depends mostly on your budget (switch ports, client NICs). A uniform, single 10Gb/s network would be better in all aspects than the split network you have now.

Christian

Thanks,
On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hello,

Sorry for the misunderstanding about IOPS. Here are some summary stats from my benchmark (does 20-30 IOPS seem normal to you?):
ceph osd pool create test 512 512
rados bench -p test 10 write --no-cleanup
Total time run:         10.480383
Total writes made:      288
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.92
Stddev Bandwidth:       11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:           27
Stddev IOPS:            3
Max IOPS:               31
Min IOPS:               20
Average Latency(s):     0.579105
Stddev Latency(s):      0.19902
Max latency(s):         1.32831
Min latency(s):         0.245505
rados bench -p test 10 seq

Total time run:       10.340724
Total reads made:     288
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.404
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             31
Min IOPS:             22
Average Latency(s):   0.564858
Max latency(s):       1.65278
Min latency(s):       0.141504
rados bench -p test 10 rand

Total time run:       10.546251
Total reads made:     293
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.13
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             32
Min IOPS:             24
Average Latency(s):   0.57092
Max latency(s):       1.8631
Min latency(s):       0.161936
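For the 4KB write bench mentioned earlier, the same command takes a block size option, something like:

rados bench -p test 10 write -b 4096 -t 16 --no-cleanup

(-b sets the write size, -t the number of concurrent operations; 16 is the default.)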
On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
I am using 10G infiniband for the cluster network and 1G ethernet for the public network.

Hmm, very unbalanced, but I guess that's HW you already had.

Because I don't have enough slots on the node, I am using three files on the OS drive (SSD) for journaling, which really improved but did not entirely solve the problem.
If you can, use partitions instead of files; there is less overhead. What model SSD is that?
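Migrating a journal from a file to a partition goes roughly like this (an untested sketch; do one OSD at a time, and the OSD id 0 and partition /dev/sda5 are assumptions to substitute with your own):

systemctl stop ceph-osd@0
ceph-osd -i 0 --flush-journal
rm /var/lib/ceph/osd/ceph-0/journal
ln -s /dev/disk/by-partuuid/<partuuid-of-sda5> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
systemctl start ceph-osd@0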
Also putting the meta-data pool on SSDs might help.
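That means giving the SSDs their own CRUSH root and rule, then pointing the pool at that rule, e.g. (the rule id here stands in for whatever your SSD rule ends up as):

ceph osd pool set cephfs_metadata crush_ruleset <ssd-rule-id>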
I am quite happy with the current IOPS, which range from 200 MB/s to 400 MB/s sequential write, depending on the block size.
That's not IOPS, that's bandwidth, throughput.
But the problem is, when I transfer data to the cephfs at a rate below 100MB/s, I can observe the slow/blocked requests warnings after a few minutes via "ceph -w".

I doubt the speed has anything to do with this; the actual block sizes and IOPS numbers are the more likely culprits.
As always, watch your storage nodes with atop (or iostat) during such scenarios/tests and spot the bottlenecks.
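For example, on each storage node while the test runs (a hypothetical invocation; %util and await are the columns to watch):

iostat -xmt 1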
It's not specific to any particular OSDs, so I started to wonder whether the configuration is correct, or whether upgrading to Jewel can solve it.
Jewel is likely to help in general, but can't fix insufficient HW or broken configurations.
There are about 5,000,000 objects currently in the cluster.
You're probably not hitting this, but read the recent filestore merge and split threads, including the entirety of this thread: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
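For reference, those knobs live in the [osd] section of ceph.conf; the values below are purely illustrative, not a recommendation:

[osd]
filestore merge threshold = 40
filestore split multiple = 8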
Christian
Thanks for the hints.
On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both cephfs_data and cephfs_metadata. I experienced some slow/blocked requests issues when I was using hammer 0.94.x and prior, so I was wondering whether the pg_num is too large for the metadata pool.
Very, VERY much doubt this.
Your "ideal" values for a cluster of this size (are you planning to
grow
it?) would be about 1024 PGs for data and 128 or 256 PGs for
meta-data.
Not really that far off and more importantly not overloading the OSDs
with
too many PGs in total. Or do you have more pools?
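For reference, the usual rule of thumb is (number of OSDs * 100) / replica count, rounded to a power of two: with your 15 OSDs and (I assume) 3 replicas, that's 500, hence 512 or 1024 in total. Note that pg_num can be raised later but never lowered, e.g.:

ceph osd pool set cephfs_data pg_num 1024
ceph osd pool set cephfs_data pgp_num 1024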
I just upgraded the cluster to Jewel today. I will watch whether the problem remains.

Jewel improvements might mask things, but I'd venture that your problems were caused by your HW not being sufficient for the load.
As in, do you use SSD journals, etc? How many IOPS do you need/expect from your CephFS? How many objects are in there?
Christian
Thank you.
On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
I'm not at all sure that rados cppool actually captures everything (it might). Doug has been working on some similar stuff for disaster recovery testing and can probably walk you through moving over.

But just how large *is* your metadata pool in relation to the others? Having a too-large pool doesn't cost much unless it's grossly inflated, and having a nice distribution of your folders is definitely better than not.
-Greg
On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hi,
Is there any way to change the metadata pool for a cephfs without losing any existing data? I know how to clone the metadata pool using rados cppool, but the filesystem still links to the original metadata pool no matter what you name it.

The motivation here is to decrease the pg_num of the metadata pool. I created this cephfs cluster some time ago, and I didn't realize that I shouldn't assign a large pg_num to such a small pool.

I'm not sure if I can delete the fs and re-create it using the existing data pool and the cloned metadata pool.
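What I had in mind was roughly this (untested; run only after stopping all MDS daemons, cephfs_metadata_copy stands in for whatever the cloned pool is called, and the monitors may refuse non-empty pools without extra force flags):

ceph fs rm cephfs --yes-i-really-mean-it
ceph fs new cephfs cephfs_metadata_copy cephfs_data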
Thank you.
Zhang Di
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/