> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> Sent: 20 January 2022 05:51 PM
> To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Santosh Shilimkar
> <santosh.shilimkar@xxxxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>;
> kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> rds-devel@xxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rama
> Nichanamatlu <rama.nichanamatlu@xxxxxxxxxx>; Rajesh
> Sivaramasubramaniom <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> asynchronous workers to flush the mr pool
>
> On Thu, Jan 20, 2022 at 11:57:02AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > Sent: 20 January 2022 04:41 PM
> > > To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Santosh Shilimkar
> > > <santosh.shilimkar@xxxxxxxxxx>; David S . Miller
> > > <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > the asynchronous workers to flush the mr pool
> > >
> > > On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > > Sent: 19 January 2022 08:26 PM
> > > > To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > > Cc: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>; Jason
> > > > Gunthorpe <jgg@xxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>;
> > > > kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > > the asynchronous workers to flush the mr pool
> > > >
> > > > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > > > -----Original Message-----
> > > > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > > > Sent: 19 January 2022 12:29 PM
> > > > > To: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>
> > > > > Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Praveen Kannoju
> > > > > <praveen.kannoju@xxxxxxxxxx>; David S . Miller
> > > > > <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused
> > > > > by the asynchronous workers to flush the mr pool
> > > > >
> > > > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > > > > >>
> > > > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@xxxxxxxxxx> wrote:
> > > > > > >>>
> > > > > > >>> This patch aims to reduce the number of asynchronous
> > > > > > >>> workers being spawned to execute the function "rds_ib_flush_mr_pool"
> > > > > > >>> during the high I/O situations.
> > > > > > >>> Synchronous call paths to this function "rds_ib_flush_mr_pool"
> > > > > > >>> will be executed without being disturbed. By reducing the
> > > > > > >>> number of processes contending to flush the mr pool, the
> > > > > > >>> total number of D state processes waiting to acquire the
> > > > > > >>> mutex lock will be greatly reduced, which otherwise were
> > > > > > >>> causing DB instance crash as the corresponding processes
> > > > > > >>> were not progressing while waiting to acquire the mutex lock.
> > > > > > >>>
> > > > > > >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > > > > >>> ---
> > > > > > >>>
> > > > > > >> […]
> > > > > > >>
> > > > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> > > > > > >>> index 8f070ee..6b640b5 100644
> > > > > > >>> --- a/net/rds/ib_rdma.c
> > > > > > >>> +++ b/net/rds/ib_rdma.c
> > > > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > > > >>>  	 */
> > > > > > >>>  	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > > > > > >>>  	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > > > >>> +	smp_wmb();
> > > > > > >>>  	if (free_all) {
> > > > > > >>>  		unsigned long flags;
> > > > > > >>>
> > > > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > > > >>>  	atomic_sub(nfreed, &pool->item_count);
> > > > > > >>>
> > > > > > >>>  out:
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > > > >>> +	smp_wmb();
> > > > > > >>>  	mutex_unlock(&pool->flush_lock);
> > > > > > >>>  	if (waitqueue_active(&pool->flush_wait))
> > > > > > >>>  		wake_up(&pool->flush_wait);
> > > > > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> > > > > > >>>
> > > > > > >>>  	/* If we've pinned too many pages, request a flush */
> > > > > > >>>  	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > > > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > > > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > > > > > >>> +		smp_rmb();
> > > > > > >>
> > > > > > >> You won’t need these explicit barriers since the above atomic
> > > > > > >> and write-once already issue them.
> > > > > > >
> > > > > > > No, they don't. Use smp_store_release() and smp_load_acquire()
> > > > > > > if you want to do something like this, but I still can't
> > > > > > > quite figure out if this usage of unlocked memory accesses
> > > > > > > makes any sense at all.
> > > > > > >
> > > > > > Indeed, I see that now, thanks. Yeah, these multi-variable
> > > > > > checks can indeed be racy, but they are under lock at least
> > > > > > for this code path. But there are a few hot-path places where
> > > > > > single-variable states are evaluated atomically instead of
> > > > > > under a heavy lock.
> > > > >
> > > > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > > > >
> > > > > Thanks
> > > > >
> > > > > > Regards,
> > > > > > Santosh
> > > > >
> > > > > Thank you Santosh, Leon and Jason for reviewing the patch.
> > > > >
> > > > > 1. Leon, the bool variable "flush_ongoing" introduced through
> > > > > the patch has to be accessed only after acquiring the mutex
> > > > > lock. Hence it is well protected.
> > > >
> > > > I don't see any lock in the rds_ib_free_mr() function where you
> > > > perform "if (!READ_ONCE(pool->flush_ongoing)) { ...".
> > > >
> > > > > 2. As the commit message already conveys the reason, the check
> > > > > being made in the function "rds_ib_free_mr" is only to avoid
> > > > > redundant asynchronous workers from being spawned.
> > > > >
> > > > > 3. The synchronous flush paths through the function
> > > > > "rds_free_mr", with either cookie=0 or the "invalidate" flag
> > > > > being set, have to be honoured as per the semantics, and hence
> > > > > these paths have to execute the function "rds_ib_flush_mr_pool"
> > > > > unconditionally.
> > > > >
> > > > > 4. It complicates the patch to identify where the function
> > > > > "rds_ib_flush_mr_pool" has been called from. And hence, this
> > > > > patch uses the state of the bool variable to stop the
> > > > > asynchronous workers.
> > > > >
> > > > > 5. We knew that "queue_delayed_work" ensures only one work is
> > > > > running, but avoiding these async workers during high-load
> > > > > situations made way for the allocation and synchronous callers,
> > > > > which would otherwise be waiting long for the flush to happen.
> > > > > A great reduction in the number of calls to the function
> > > > > "rds_ib_flush_mr_pool" has been observed with this patch.
> > > >
> > > > So if you understand that there is only one work in progress, why
> > > > do you say workerS?
> > > >
> > > > Thanks
> > > >
> > > > > 6. Jason, the only function, "rds_ib_free_mr", which accesses
> > > > > the introduced bool variable "flush_ongoing" to spawn a flush
> > > > > worker, does not crucially impact the availability of MRs,
> > > > > because the flush happens from the allocation path as well when
> > > > > necessary. Hence the load-store ordering is not essentially
> > > > > needed here, because of which we chose smp_rmb() and smp_wmb()
> > > > > over smp_load_acquire() and smp_store_release().
> > > > >
> > > > > Regards,
> > > > > Praveen.
> > > >
> > > > Jason,
> > > >
> > > > The relaxed ordering primitives smp_rmb() and smp_wmb() ensure to
> > > > provide guaranteed atomic memory operations READ_ONCE and
> > > > WRITE_ONCE, used in the functions "rds_ib_free_mr" and
> > > > "rds_ib_flush_mr_pool" correspondingly.
> > > >
> > > > Yes, the memory barrier primitives smp_load_acquire() and
> > > > smp_store_release() are even better. But, because of the
> > > > simplicity of the use of memory barriers in the patch, smp_rmb()
> > > > and smp_wmb() were chosen.
> > > >
> > > > Please let me know if you want me to switch to smp_load_acquire()
> > > > and smp_store_release().
> > > >
> > > > Leon,
> > > >
> > > > Avoiding the asynchronous worker from being spawned during the
> > > > high-load situations makes way for both the synchronous and
> > > > allocation paths to flush the mr pool and grab the mr without
> > > > waiting.
> > > >
> > > > Please let me know if you still have any queries in this respect
> > > > or if any modifications are needed.
> >
> > Thank you for your reply, Leon.
>
> > > I didn't get any answer on my questions.
> > > So let me repeat them.
> > > 1. Where can I see locks in rds_ib_free_mr() that protect
> > > concurrent change of your new flush_ongoing field?
> >
> > The flush_ongoing variable is only modified in the function
> > "rds_ib_flush_mr_pool" under the mutex lock. It is only being read
> > atomically in the function "rds_ib_free_mr()", with a memory barrier
> > in place. Hence a lock is not essential in this function
> > "rds_ib_free_mr()". Depending on the value being read, the decision
> > is taken whether to spawn the asynchronous worker or not.
>
> This is not how locking works.
>
> CPU1                                  CPU2
> rds_ib_flush_mr_pool
>   lock
>   context switch ->                   rds_ib_free_mr
>                                         READ old value of flush_ongoing
>   <- context switch
>   WRITE new value to flush_ongoing
>
> > > 2. Only one instance of the same work can be executed/scheduled;
> > > where do you see multiple/parallel workers (plural), and how is it
> > > possible?
> >
> > In my earlier comment, I have re-iterated the same point. I would
> > take back my word "workerS". By avoiding this asynchronous worker's
> > participation in flushing, the synchronous flush jobs (cookie=0 and
> > invalidate=1) as well as the allocation-path worker will be acquiring
> > the mutex lock in the function "rds_ib_flush_mr_pool" quickly,
> > thereby fetching the MR.
>
> This is a completely different scenario from what was presented before.
> You are interested in prioritizing synchronous operations over async.
> In such a case, why don't you simply cancel_delayed_work() in your sync
> flows?

Thank you, Leon. Will verify this approach and post the updated patch
for review.

> > I hope I am clear.
> >
> > > BTW, please fix your email client to reply inline.
> >
> > Fixed it. Thank you.
> > > >
> > > > Regards,
> > > > Praveen.
> >
> > Regards,
> > Praveen.