RE: Latency Improvement Report for ShardedOpWQ

Hi Somnath:
You mentioned: "There is still one global lock we have; this is to protect pg_for_processing() and this we can't get rid of since we need to maintain op order within a pg."

But for most object operations we only need to maintain ordering per object. Why do we need to maintain op order within a PG?
Can you explain this in detail?
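
For what it is worth, this is how I currently picture the two ordering models. It is only my own illustration with made-up types, not the real OSD data structures:

#include <deque>
#include <map>
#include <string>

// Per-PG ordering: one FIFO per PG, so two ops on *different* objects in the
// same PG still complete in submission order.
struct PGOrderedQueueSketch {
  std::map<int, std::deque<std::string>> per_pg;  // pg_id -> queued op ids
  void enqueue(int pg_id, const std::string& op) { per_pg[pg_id].push_back(op); }
};

// Per-object ordering: one FIFO per object, so ops on different objects in the
// same PG could in principle be handled in parallel or out of order.
struct ObjectOrderedQueueSketch {
  std::map<std::string, std::deque<std::string>> per_obj;  // oid -> queued op ids
  void enqueue(const std::string& oid, const std::string& op) {
    per_obj[oid].push_back(op);
  }
};

My question above is essentially whether the first model is strictly required for correctness, or whether the second one would already be enough for most object operations.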

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> Sent: Sunday, September 28, 2014 5:02 PM
> To: Dong Yuan
> Cc: ceph-devel
> Subject: RE: Latency Improvement Report for ShardedOpWQ
> 
> Dong,
> This is most likely due to lock contention.
> You can tweak the number of shards in the sharded WQ to see whether it
> improves this number or not.
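
(Side note, mostly for my own reference: as far as I can tell, the shard and per-shard thread counts are controlled by config options along the following lines. The option names are from my reading of the code, so please correct me if they are wrong.)

# ceph.conf sketch -- the values below are only examples
[osd]
osd_op_num_shards = 10               # more shards -> less contention on each shard lock
osd_op_num_threads_per_shard = 2     # worker threads serving each shard
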
> There is still one global lock: it protects pg_for_processing(), and we can't
> get rid of it since we need to maintain op order within a PG. This could be
> adding latency as well. I would suggest measuring this number at different
> stages within ShardedOpWQ::_process(), e.g. after dequeuing from the pqueue,
> and after taking the PG lock and popping the ops from pg_for_processing().
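
(For my own measurements I am planning to drop in something like the sketch below. The stage boundaries follow your suggestion, but the names and the commented-out calls are placeholders of mine, not the actual ShardedOpWQ::_process() code.)

#include <chrono>
#include <cstdint>
#include <cstdio>

using Clock = std::chrono::steady_clock;

static inline uint64_t usec_since(Clock::time_point t0) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
      Clock::now() - t0).count();
}

// Sketch of per-stage timing for one pass of a worker loop.
void process_one_op_with_probes() {
  const Clock::time_point t0 = Clock::now();

  // stage 1: dequeue the PG from the shard's pqueue (placeholder)
  // pg = sdata->pqueue.dequeue();
  const uint64_t t_after_pqueue = usec_since(t0);

  // stage 2: take the PG lock (placeholder)
  // pg->lock();
  const uint64_t t_after_pglock = usec_since(t0);

  // stage 3: pop the op from pg_for_processing (placeholder)
  // op = pg_for_processing[pg].front(); pg_for_processing[pg].pop_front();
  const uint64_t t_after_pop = usec_since(t0);

  std::printf("pqueue=%llu us  pglock=%llu us  pop=%llu us\n",
              (unsigned long long)t_after_pqueue,
              (unsigned long long)t_after_pglock,
              (unsigned long long)t_after_pop);
}
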
> 
> Also, keep in mind that a context switch happens here, and this could be
> expensive depending on the data copying etc. It's worth trying this experiment
> with the OSD threads pinned to actual physical cores.
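
(On the pinning idea, I was thinking of something along these lines: Linux-specific pthread affinity, with the core id just an example.)

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core (Linux only).
// Returns 0 on success, otherwise a pthread error code.
static int pin_this_thread_to_core(int core_id) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core_id, &cpuset);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

Alternatively, the whole OSD process can be pinned from outside with taskset (or numactl), which avoids code changes entirely.
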
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Dong Yuan [mailto:yuandong1222@xxxxxxxxx]
> Sent: Sunday, September 28, 2014 12:19 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Latency Improvement Report for ShardedOpWQ
> 
> Hi Somnath,
> 
> I totally agree with you.
> 
> I read the code for the sharded TP and the new OSD OpWQ. In the new
> implementation there is no single lock for all PGs; instead, each lock covers
> a subset of PGs (am I right?). That is very useful for reducing lock
> contention and thus increasing parallelism. It is awesome work!
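
(Inline note on how I read the sharding scheme: each PG maps to a shard and only that shard's mutex is taken, instead of one lock over the whole queue. The types below are purely illustrative, not the real ShardedOpWQ.)

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct ShardSketch {
  std::mutex lock;   // protects only this shard's queue
  // per-shard op queue would live here
};

struct ShardedQueueSketch {
  std::vector<ShardSketch> shards;

  explicit ShardedQueueSketch(std::size_t num_shards) : shards(num_shards) {}

  // Two PGs contend on a lock only if they map to the same shard.
  ShardSketch& shard_for_pg(uint64_t pg_id) {
    return shards[pg_id % shards.size()];
  }
};
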
> 
> While I am working on the latency of a single IO (mainly 4K random writes), I
> noticed that the OpWQ spends about 100+ us to transfer an IO from the msg
> dispatcher to an OpWQ worker thread. Do you have any ideas for reducing that
> time span?
> 
> Thanks for your help.
> Dong.
> 
> On 28 September 2014 13:46, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> wrote:
> > Hi Dong,
> > I don't think there is much benefit in the single-client scenario; a single
> > client has its own limitations. The benefit of the sharded TP is that a
> > single OSD scales much better as the number of clients grows, since it
> > increases parallelism (by reducing lock contention) at the filestore level.
> > A quick check could be like this:
> >
> > 1. Create a single-node, single-OSD cluster and apply load with an
> > increasing number of clients, e.g. 1, 3, 5, 8, 10. A small workload served
> > from memory should be ideal.
> > 2. Compare the code with the sharded TP against, say, firefly. You should
> > see that firefly does not scale as the number of clients increases.
> > 3. Try top -H in the two cases; you should see more threads working in
> > parallel with the sharded TP than with firefly.
> >
> > Also, I am sure this latency result will not hold under a heavier workload;
> > there you should see more contention and, as a result, higher latency.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Dong Yuan
> > Sent: Saturday, September 27, 2014 8:45 PM
> > To: ceph-devel
> > Subject: Latency Improvement Report for ShardedOpWQ
> >
> > ===== Test Purpose =====
> >
> > Measure whether, and by how much, the sharded OpWQ outperforms the
> > traditional OpWQ in a random-write scenario.
> >
> > ===== Test Case =====
> >
> > 4K object WriteFull, repeated 10,000 times.
> >
> > ===== Test Method =====
> >
> > Insert the following static probes into the code while running the tests, to
> > get the time span between enqueue into and dequeue from the OpWQ.
> >
> > Start: PG::enqueue_op, just before the osd->op_wq.enqueue call
> > End: OSD::dequeue_op.entry
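
(If I read this correctly, the measurement is conceptually equivalent to stamping each op at enqueue time and reading the stamp at dequeue time, roughly as below. This is only my own sketch of the idea; the actual numbers came from the static probes named above.)

#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

struct TimedOpSketch {
  // ... real op payload would go here ...
  Clock::time_point enqueue_stamp;
};

// Called where the "Start" probe fires (just before the work queue enqueue).
inline void mark_enqueued(TimedOpSketch& op) {
  op.enqueue_stamp = Clock::now();
}

// Called where the "End" probe fires (on entry to the dequeue path);
// returns the queueing delay in microseconds.
inline uint64_t queue_delay_us(const TimedOpSketch& op) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
      Clock::now() - op.enqueue_stamp).count();
}
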
> >
> > ===== Test Result =====
> >
> > Traditional OpWQ: 109 us (avg), 40 us (min)
> > ShardedOpWQ:       97 us (avg), 32 us (min)
> >
> > ===== Test Conclusion =====
> >
> > No remarkable improvement in latency.
> >
> >
> > --
> > Dong Yuan
> > Email:yuandong1222@xxxxxxxxx
> 
> 
> 
> --
> Dong Yuan
> Email:yuandong1222@xxxxxxxxx