Re: Weighted Priority Queue testing

+ceph users

Hi,

 

Here is the first-cut result. I could only manage the 128TB boxes for now.

 

 

Ceph code base | Capacity      | Drive capacity | Compute nodes | Copies | Total data set | Failure domain | Fault injected    | Degraded PGs | Full recovery time | Last 1% recovery time
---------------+---------------+----------------+---------------+--------+----------------+----------------+-------------------+--------------+--------------------+----------------------
Hammer         | 2X128TB IF150 | 8TB            | 2             | 2      | ~80TB          | Chassis        | One OSD node down | ~20%         | ~24 hours          | ~3-4 hours
Hammer         | 2X128TB IF150 | 8TB            | 4             | 2      | ~80TB          | Chassis        | One OSD node down | ~10%         | 10 hours 3 min     | ~3 hours
Hammer         | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB         | Chassis        | One OSD node down | ~12.5%       | 7 hours 5 min      | ~2.5 hours
Jewel          | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB         | Chassis        | One OSD node down | ~12.5%       | 6 hours 10 min     | ~1 hour 30 min
Jewel + wpq    | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB         | Chassis        | One OSD node down | ~12.5%       | 8 hours 30 min     | ~4 hours 30 min

 

Summary:

------------

 

1. The first scenario has only 4 nodes, and since replication is at the chassis level, the single node remaining on the affected chassis takes all the recovery traffic. That seems to be the bottleneck: with host-level replication on a similar setup the recovery time is much lower (that data is not in this table).

 

2. In the second scenario I kept everything else the same but doubled the nodes per chassis, and the recovery time roughly halved.

 

3. For the third scenario I increased the amount of data in the cluster and also doubled the number of OSDs (each drive is 4TB now instead of 8TB). Recovery time came down further.

 

4. Moving to Jewel with everything else the same gave a further improvement, presumably because of the improved write performance in Jewel (?).

 

5. The last scenario is interesting. With WPQ, recovery was faster than in any other scenario for most of the run: the degraded PG percentage came down to 2% within 3 hours and to ~0.6% within 4 hours 15 minutes, but the last 0.6% took ~4 hours, which hurt the overall recovery time.

6. In fact, this long tail is hurting the overall recovery time in every other scenario as well. The related tracker I found is http://tracker.ceph.com/issues/15763
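
For anyone who wants to watch that tail, nothing special is needed; the standard status commands show it (rough sketch, output differs slightly between Hammer and Jewel):

  # overall degraded object percentage and current recovery rate
  ceph -s

  # which PGs are still degraded/unclean and which OSDs they map to
  ceph health detail
  ceph pg dump_stuck unclean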

 

Any feedback is much appreciated. We can discuss this in tomorrow’s performance call if needed.

 

Thanks & Regards

Somnath

 

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, May 04, 2016 11:47 AM
To: 'Mark Nelson'; Nick Fisk; Ben England; Kyle Bader
Cc: Sage Weil; Samuel Just
Subject: RE: Weighted Priority Queue testing

 

Thanks Mark, I will come back to you with some data on that. This is what I am planning to run.

 

1. Two IF150 chassis with 256TB of flash each, an 8-node cluster in total (4 servers on each chassis). I will generate ~100TB of data on the cluster.

 

2. I will test both host- and chassis-level replication with 2 copies (a rough setup sketch for the chassis-level case follows this list).

 

3. Heavy IO will be running throughout (different block sizes, 60% RW and 40% RR).
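
The chassis-level case in #2 can be set up roughly like this (just a sketch; it assumes chassis buckets already exist in the CRUSH hierarchy and uses a placeholder pool name):

  # rule that places replicas across chassis rather than hosts
  ceph osd crush rule create-simple replicate-chassis default chassis

  # look up the new rule's id, then point the test pool at it with 2 copies
  ceph osd crush rule dump replicate-chassis
  ceph osd pool set mypool crush_ruleset <rule-id>
  ceph osd pool set mypool size 2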

 

With Hammer it took me ~4 hours to complete recovery with host-level replication and a single host down, and ~12 hours for a single host down with chassis-level replication.
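
For context, a node-down run like this can be driven and timed roughly as follows (not necessarily the exact script I use; OSD ids and the init system are setup-specific):

  # stop all ceph-osd daemons on the victim node (systemd shown here)
  systemctl stop ceph-osd.target

  # mark that node's OSDs out so recovery starts right away instead of
  # waiting for mon_osd_down_out_interval, then time until HEALTH_OK
  ceph osd out 0 1 2 3
  ceph -s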

 

Bear with me till I find all the HW for this :-) Let me know if you guys want to add anything here.

 

Regards

Somnath

 

-----Original Message-----

From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: Wednesday, May 04, 2016 8:40 AM
To: Somnath Roy; Nick Fisk; Ben England; Kyle Bader
Cc: Sage Weil; Samuel Just
Subject: Weighted Priority Queue testing

 

Hi Guys,

 

I think all of you have expressed some interest in recovery testing either now or in the past, so I wanted to get folks together to talk.

We need to get the new weighted priority queue tested to:

 

a) see when/how it's breaking

b) hopefully see better behavior

 

It's available in Jewel through a simple ceph.conf change:

 

osd_op_queue = wpq
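
Spelled out as a ceph.conf fragment (either [global] or [osd] works, and the OSDs need a restart to pick it up; as far as I know it's not injectable at runtime). Setting it back to prio gives the old prioritized queue:

  [osd]
  # use the new weighted priority queue for the OSD op queue (Jewel+)
  osd_op_queue = wpq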

 

For those of you who have run cbt recovery tests in the past, it might be worth running some new stress tests comparing:

 

a) jewel + wpq

b) jewel + prio queue

c) hammer

 

In the past I've done this under various concurrent client workloads (say large sequential or small random writes).  I think Kyle has done quite a bit of this kind of testing in the recent past with Intel as well, so he might have some insights as to where we've been hurting recently.
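
To be concrete about the kind of background load I mean, something along these lines works, whether run by hand or wired into cbt (pool name, RBD image, sizes, and runtimes below are all placeholders):

  # large sequential writes against a scratch pool
  rados bench -p cbt-test 600 write -b 4194304 -t 16 --no-cleanup

  # small random writes against an existing RBD image via fio's rbd engine
  fio --name=randwrite-4k --ioengine=rbd --clientname=admin \
      --pool=cbt-test --rbdname=bench-img \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=600 --time_based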

 

Thanks,

Mark

