Re: Adding new OSDs, need to increase PGs?

Mike Dawson <mike.dawson@xxxxxxxxxxxx> · Tue, 03 Dec 2013 11:02:09 -0500

Robert,

Do you have rbd writeback cache enabled on these volumes? That could 
certainly explain the higher than expected write performance. Any chance 
you could re-test with rbd writeback on vs. off?

Thanks,
Mike Dawson

On 12/3/2013 10:37 AM, Robert van Leeuwen wrote:
Hi Mike,

I am using filebench within a KVM virtual machine (similar to the actual workload we will have).
Using 100% synchronous 4k writes with a 50GB file on a 100GB volume with 32 writer threads.
Also tried from multiple KVM machines from multiple hosts.
Aggregate performance stays at 2k+ IOPS.

The disks are 7200RPM 2.5 inch drives, no RAID whatsoever.
I agree the number of IOPS seems high.
Maybe the journal on SSD (2 x Intel 3500) helps a bit in this regard, but the SSDs were not maxed out yet.
The writes seem to be limited by the spinning disks:
As soon as the benchmark starts, they are at 100% utilization.
Utilization also dropped to 0% almost immediately after the benchmark finished, so it does not look like the disks are lagging behind the journal.

I did not really test reads yet. Since we have so much read cache (128 GB per node), I assume we will mostly be write limited.

Cheers,
Robert van Leeuwen


On 3 dec. 2013, at 16:15, "Mike Dawson" <mike.dawson@xxxxxxxxxxxx> wrote:

Robert,

Interesting results on the effect of # of PG/PGPs. My cluster struggles a bit under the strain of heavy random small-sized writes.

The IOPS you mention seem high to me given 30 drives and 3x replication unless they were pure reads or on high-rpm drives. Instead of assuming, I want to pose a few questions:

- How are you testing? rados bench, rbd bench, rbd bench with writeback cache, etc?

- Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k random reads, what do you get? If you test 100% 4k random writes, what do you get?

- What drives do you have? Any RAID involved under your OSDs?
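
The expectation behind "seem high" can be sketched as a back-of-envelope estimate. This is only an illustration: the per-drive figure of 100 IOPS is an assumed ballpark for a 7200 RPM disk, not a measurement from this cluster, and the function name and parameters are made up for the example.

```python
def estimated_write_iops(num_drives, iops_per_drive, replicas,
                         journal_on_data_disk=False):
    """Rough ceiling on client random-write IOPS for a replicated pool.

    Every client write is stored `replicas` times. If the journal shares
    the data disk, each copy costs two disk writes instead of one.
    """
    disk_writes_per_client_write = replicas * (2 if journal_on_data_disk else 1)
    return num_drives * iops_per_drive / disk_writes_per_client_write


# ASSUMPTION: ~100 random IOPS per spinning disk (rule of thumb only).
print(estimated_write_iops(num_drives=30, iops_per_drive=100, replicas=3))
print(estimated_write_iops(num_drives=30, iops_per_drive=100, replicas=3,
                           journal_on_data_disk=True))
```

Under these assumptions the ceiling is roughly 1000 write IOPS with separate journals, which is why 2000-2500 would look high unless caching or reads are involved.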

Thanks,
Mike Dawson

On 12/3/2013 1:31 AM, Robert van Leeuwen wrote:

On 2 dec. 2013, at 18:26, "Brian Andrus" <brian.andrus@xxxxxxxxxxx> wrote:

  Setting your pg_num and pgp_num to say... 1024 would A) increase data granularity, B) likely lend no noticeable increase to resource consumption, and C) allow some room for future OSDs to be added while still within range of acceptable pg numbers. You could probably safely double even that number if you plan on expanding at a rapid rate and want to avoid splitting PGs every time a node is added.

In general, you can conservatively err on the larger side when it comes to pg_num/pgp_num. Any excess resource utilization will be negligible (up to a certain point). If you have a comfortable amount of available RAM, you could experiment with increasing the multiplier in the equation you are using and see how it affects your final number.

The pg_num and pgp_num parameters can safely be changed before or after your new nodes are integrated.
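
As a minimal sketch of what changing these parameters looks like, the snippet below only builds and prints the two `ceph osd pool set` commands (a dry run). The pool name `rbd` and the helper name are placeholders for illustration, not taken from this thread.

```python
import shlex
from typing import List


def pg_resize_commands(pool: str, pg_num: int) -> List[List[str]]:
    """Return the commands that raise pg_num, then pgp_num, for one pool."""
    return [
        ["ceph", "osd", "pool", "set", pool, "pg_num", str(pg_num)],
        ["ceph", "osd", "pool", "set", pool, "pgp_num", str(pg_num)],
    ]


# ASSUMPTION: "rbd" is a placeholder pool name; substitute your own.
for cmd in pg_resize_commands("rbd", 1024):
    print(shlex.join(cmd))  # dry run: print instead of executing
```

To actually apply the change, each command list could be passed to `subprocess.run(cmd, check=True)` on a host with admin credentials.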

I would be a bit conservative with the PGs / PGPs.
I've experimented with the PG number a bit and noticed the random IO performance drop described below.
(This could be specific to our setup, but since the PG count is easily increased and impossible to decrease, I would be conservative.)

The setup:
3 OSD nodes with 128 GB RAM, 2 x 6-core CPUs (12 with HT).
Nodes have 10 OSDs running on 1 TB disks and 2 SSDs for journals.

We use a replica count of 3, so the optimum according to the formula is about 1000.
With 1000 PGs I got about 2000-2500 random 4k IOPS.
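
The formula is presumably the guideline from the Ceph placement-group documentation: (number of OSDs x 100) / replica count, usually rounded up to the next power of two. A minimal sketch of that arithmetic, with function names chosen only for this example:

```python
def recommended_pg_count(num_osds: int, replicas: int,
                         pgs_per_osd: int = 100) -> int:
    """Raw guideline value: (OSDs * target PGs per OSD) / replica count."""
    return (num_osds * pgs_per_osd) // replicas


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


# 3 nodes x 10 OSDs with a replica count of 3, as in the setup above.
raw = recommended_pg_count(num_osds=30, replicas=3)
print(raw, next_power_of_two(raw))
```

For this cluster the raw value is 1000, and rounding up gives the 1024 suggested earlier in the thread.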

Because the nodes are fast enough and I expect the cluster to be expanded with 3 more nodes, I set the PGs to 2000.
Performance dropped to about 1200-1400 IOPS.
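
To put those numbers in perspective, the sketch below works out how many PG copies each OSD carries at both settings and the relative IOPS drop. It assumes a single pool with PGs spread evenly over the 30 OSDs, which the thread does not confirm.

```python
def pg_copies_per_osd(pg_num: int, replicas: int, num_osds: int) -> float:
    """Average number of PG copies hosted by one OSD (even spread assumed)."""
    return pg_num * replicas / num_osds


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease going from `before` to `after`."""
    return (before - after) / before


for pg_num in (1000, 2000):
    print(pg_num, pg_copies_per_osd(pg_num, replicas=3, num_osds=30))

# Low and high ends of the reported IOPS ranges.
print(relative_drop(2000, 1200), relative_drop(2500, 1400))
```

Doubling the PG count doubles the per-OSD load from about 100 to about 200 PG copies, while the reported IOPS fell by roughly 40 to 44 percent.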

I noticed that the spinning disks were no longer maxing out at 100% usage.
Memory and CPU did not seem to be a problem.
Since I had the option to recreate the pool and I was not using the recommended settings, I did not really dive into the issue.
I will not stray too far from the recommended settings in the future though :)

Cheers,
Robert van Leeuwen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
