> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> Sent: 20 January 2022 04:41 PM
> To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
>
> On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > Sent: 19 January 2022 08:26 PM
> > To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> > Cc: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>; Jason Gunthorpe <jgg@xxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
> >
> > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > Sent: 19 January 2022 12:29 PM
> > > To: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>
> > > Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
> > >
> > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > > >>
> > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@xxxxxxxxxx> wrote:
> > > > >>>
> > > > >>> This patch aims to reduce the number of asynchronous workers
> > > > >>> being spawned to execute the function "rds_ib_flush_mr_pool"
> > > > >>> during high I/O situations. Synchronous call paths to the
> > > > >>> function "rds_ib_flush_mr_pool" will be executed without being
> > > > >>> disturbed. By reducing the number of processes contending to
> > > > >>> flush the mr pool, the total number of D-state processes waiting
> > > > >>> to acquire the mutex lock will be greatly reduced, which otherwise
> > > > >>> were causing DB instance crashes as the corresponding processes
> > > > >>> were not progressing while waiting to acquire the mutex lock.
> > > > >>>
> > > > >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > > >>> ---
> > > > >>>
> > > > >> [...]
> > > > >>
> > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> > > > >>> index 8f070ee..6b640b5 100644
> > > > >>> --- a/net/rds/ib_rdma.c
> > > > >>> +++ b/net/rds/ib_rdma.c
> > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > >>>  	 */
> > > > >>>  	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > > > >>>  	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > >>> +	smp_wmb();
> > > > >>>  	if (free_all) {
> > > > >>>  		unsigned long flags;
> > > > >>>
> > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > >>>  	atomic_sub(nfreed, &pool->item_count);
> > > > >>>
> > > > >>>  out:
> > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > >>> +	smp_wmb();
> > > > >>>  	mutex_unlock(&pool->flush_lock);
> > > > >>>  	if (waitqueue_active(&pool->flush_wait))
> > > > >>>  		wake_up(&pool->flush_wait);
> > > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> > > > >>>
> > > > >>>  	/* If we've pinned too many pages, request a flush */
> > > > >>>  	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > > > >>> +		smp_rmb();
> > > > >> You won't need these explicit barriers since the above atomic and
> > > > >> write-once already issue them.
> > > > >
> > > > > No, they don't. Use smp_store_release() and smp_load_acquire() if
> > > > > you want to do something like this, but I still can't quite figure
> > > > > out if this usage of unlocked memory accesses makes any sense at all.
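[Editor's aside: the smp_store_release()/smp_load_acquire() pairing Jason suggests can be sketched in userspace with C11 atomics. This is an illustrative analogue only, not the kernel code; the function and variable names here are hypothetical.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace analogue of the patch's flag, using C11 atomics:
 * atomic_store_explicit(..., memory_order_release) corresponds to
 * smp_store_release(), and atomic_load_explicit(..., memory_order_acquire)
 * corresponds to smp_load_acquire(). */
static atomic_bool flush_ongoing;

/* Would be called from the flush path (under flush_lock in the patch).
 * The release store guarantees that all writes made before it are
 * visible to any thread whose acquire load observes the new value. */
static void set_flush_ongoing(bool v)
{
	atomic_store_explicit(&flush_ongoing, v, memory_order_release);
}

/* Lockless read on the free path: the acquire load pairs with the
 * release store above, giving the ordering that smp_rmb()/smp_wmb()
 * alone do not clearly express. */
static bool flush_is_ongoing(void)
{
	return atomic_load_explicit(&flush_ongoing, memory_order_acquire);
}
```

The acquire/release pair documents intent at the two sites that actually communicate, which is why reviewers tend to prefer it over bare smp_rmb()/smp_wmb().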
> > > >
> > > > Indeed, I see that now, thanks. Yeah, these multi-variable checks
> > > > can indeed be racy, but they are under lock at least for this code path.
> > > > But there are a few hot-path places where single-variable states are
> > > > evaluated atomically instead of taking a heavy lock.
> > >
> > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > >
> > > Thanks
> > >
> > > > Regards,
> > > > Santosh
> > >
> > > Thank you Santosh, Leon and Jason for reviewing the patch.
> > >
> > > 1. Leon, the bool variable "flush_ongoing" introduced through the patch has to
> > > be accessed only after acquiring the mutex lock. Hence it is well protected.
> >
> > I don't see any lock in the rds_ib_free_mr() function where you perform
> > "if (!READ_ONCE(pool->flush_ongoing)) { ...".
> >
> > > 2. As the commit message already conveys, the check made in the function
> > > "rds_ib_free_mr" is only there to avoid redundant asynchronous workers
> > > being spawned.
> > >
> > > 3. The synchronous flush paths through the function "rds_free_mr", with
> > > either cookie=0 or the "invalidate" flag set, have to be honoured as per
> > > the semantics, and hence these paths have to execute the function
> > > "rds_ib_flush_mr_pool" unconditionally.
> > >
> > > 4. It complicates the patch to identify where the function
> > > "rds_ib_flush_mr_pool" has been called from, and hence this patch uses the
> > > state of the bool variable to stop the asynchronous workers.
> > >
> > > 5. We know that "queue_delayed_work" ensures only one work is running, but
> > > avoiding these async workers during high-load situations makes way for the
> > > allocation and synchronous callers, which would otherwise wait long for the
> > > flush to happen. A great reduction in the number of calls to the function
> > > "rds_ib_flush_mr_pool" has been observed with this patch.
> >
> > So if you understand that there is only one work in progress, why do you say workerS?
> >
> > Thanks
> >
> > >
> > > 6. Jason, the only function, "rds_ib_free_mr", which accesses the introduced
> > > bool variable "flush_ongoing" to spawn a flush worker, does not crucially
> > > impact the availability of MRs, because the flush happens from the allocation
> > > path as well when necessary. Hence load-store ordering is not essential here,
> > > which is why we chose smp_rmb() and smp_wmb() over smp_load_acquire() and
> > > smp_store_release().
> > >
> > > Regards,
> > > Praveen.
> >
> > Jason,
> >
> > The ordering primitives smp_rmb() and smp_wmb() provide the ordering
> > guarantees for the atomic memory operations READ_ONCE and WRITE_ONCE used in
> > the functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool" respectively.
> >
> > Yes, the memory-barrier primitives smp_load_acquire() and smp_store_release()
> > are even better. But, because of the simplicity of the memory-barrier usage
> > in the patch, smp_rmb() and smp_wmb() were chosen.
> >
> > Please let me know if you want me to switch to smp_load_acquire() and
> > smp_store_release().
> >
> > Leon,
> >
> > Preventing the asynchronous worker from being spawned during high-load
> > situations makes way for both the synchronous and allocation paths to flush
> > the mr pool and grab the MR without waiting.
> >
> > Please let me know if you still have any queries in this respect or if any
> > modifications are needed.
>
> Thank you for your reply, Leon.
>
> I didn't get any answer to my questions, so let me repeat them.
>
> 1. Where can I see locks in rds_ib_free_mr() that protect concurrent change
> of your new flush_ongoing field?

The flush_ongoing variable is only modified in the function "rds_ib_flush_mr_pool",
under the mutex lock. It is only read atomically in the function "rds_ib_free_mr()",
with a memory barrier in place, hence a lock is not essential in "rds_ib_free_mr()".
Depending on the value read, a decision is taken whether to spawn the asynchronous
worker or not.
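[Editor's aside: the lockless gating described above, reading the flag without a lock and only queueing the delayed work when no flush is in flight, can be sketched in plain C. This is an illustrative model; the struct fields and the max_items/5 threshold loosely mirror the patch, and none of these names are the real rds_ib_mr_pool definitions.]

```c
#include <stdbool.h>

/* Simplified model of the pool state consulted by rds_ib_free_mr().
 * In the kernel these are atomic_t counters read with atomic_read(),
 * and flush_ongoing is read with READ_ONCE(); here plain fields suffice
 * to show the decision logic. */
struct mr_pool {
	int free_pinned;
	int max_free_pinned;
	int dirty_count;
	int max_items;
	bool flush_ongoing;
};

/* Return true when an async flush worker should be queued: the pool is
 * over one of its thresholds AND no flush is already running. The second
 * condition is the patch's addition; it suppresses redundant workers
 * while rds_ib_flush_mr_pool() holds flush_lock. */
static bool should_queue_flush(const struct mr_pool *pool)
{
	bool over_limit = pool->free_pinned >= pool->max_free_pinned ||
			  pool->dirty_count >= pool->max_items / 5;

	return over_limit && !pool->flush_ongoing;
}
```

The check is only an optimization hint: a stale read of flush_ongoing merely queues one extra (deduplicated) work item or delays one queueing, which is why the author argues strict ordering is not critical here.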
> 2. Only one instance of the same work can be executed/scheduled at a time;
> where do you see multiple/parallel workers (plural), and how is that possible?

In my earlier comment I reiterated the same point, and I take back my word
"workerS". By keeping this asynchronous worker out of the flushing, the
synchronous flush jobs (cookie=0 and invalidate=1) as well as the allocation-path
worker acquire the mutex lock in the function "rds_ib_flush_mr_pool" quickly,
thereby fetching the MR. I hope I am clear.

> BTW, please fix your email client to reply inline.

Fixed it. Thank you.

> > Regards,
> > Praveen.

Regards,
Praveen.