> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> Sent: 20 January 2022 05:51 PM
> To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Santosh Shilimkar
> <santosh.shilimkar@xxxxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>;
> kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> rds-devel@xxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rama
> Nichanamatlu <rama.nichanamatlu@xxxxxxxxxx>; Rajesh
> Sivaramasubramaniom <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> asynchronous workers to flush the mr pool
>
> On Thu, Jan 20, 2022 at 11:57:02AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > Sent: 20 January 2022 04:41 PM
> > > To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Santosh Shilimkar
> > > <santosh.shilimkar@xxxxxxxxxx>; David S . Miller
> > > <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > the asynchronous workers to flush the mr pool
> > >
> > > On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > > Sent: 19 January 2022 08:26 PM
> > > > To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > > Cc: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>; Jason
> > > > Gunthorpe <jgg@xxxxxxxx>; David S . Miller <davem@xxxxxxxxxxxxx>;
> > > > kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > > the asynchronous workers to flush the mr pool
> > > >
> > > > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > > > -----Original Message-----
> > > > > From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> > > > > Sent: 19 January 2022 12:29 PM
> > > > > To: Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx>
> > > > > Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Praveen Kannoju
> > > > > <praveen.kannoju@xxxxxxxxxx>; David S . Miller
> > > > > <davem@xxxxxxxxxxxxx>; kuba@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > > > linux-rdma@xxxxxxxxxxxxxxx; rds-devel@xxxxxxxxxxxxxx;
> > > > > linux-kernel@xxxxxxxxxxxxxxx; Rama Nichanamatlu
> > > > > <rama.nichanamatlu@xxxxxxxxxx>; Rajesh Sivaramasubramaniom
> > > > > <rajesh.sivaramasubramaniom@xxxxxxxxxx>
> > > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused
> > > > > by the asynchronous workers to flush the mr pool
> > > > >
> > > > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > > > > >>
> > > > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@xxxxxxxxxx> wrote:
> > > > > > >>>
> > > > > > >>> This patch aims to reduce the number of asynchronous
> > > > > > >>> workers being spawned to execute the function "rds_ib_flush_mr_pool"
> > > > > > >>> during the high I/O situations.
> > > > > > >>> Synchronous call paths to this function "rds_ib_flush_mr_pool"
> > > > > > >>> will be executed without being disturbed. By reducing the
> > > > > > >>> number of processes contending to flush the mr pool, the
> > > > > > >>> total number of D state processes waiting to acquire the
> > > > > > >>> mutex lock will be greatly reduced, which otherwise were
> > > > > > >>> causing DB instance crash as the corresponding processes
> > > > > > >>> were not progressing while waiting to acquire the mutex lock.
> > > > > > >>>
> > > > > > >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@xxxxxxxxxx>
> > > > > > >>> ---
> > > > > > >>>
> > > > > > >> […]
> > > > > > >>
> > > > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> > > > > > >>> index 8f070ee..6b640b5 100644
> > > > > > >>> --- a/net/rds/ib_rdma.c
> > > > > > >>> +++ b/net/rds/ib_rdma.c
> > > > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > > > >>>  	 */
> > > > > > >>>  	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > > > > > >>>  	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > > > >>> +	smp_wmb();
> > > > > > >>>  	if (free_all) {
> > > > > > >>>  		unsigned long flags;
> > > > > > >>>
> > > > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > > > > >>>  	atomic_sub(nfreed, &pool->item_count);
> > > > > > >>>
> > > > > > >>>  out:
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > > > >>> +	smp_wmb();
> > > > > > >>>  	mutex_unlock(&pool->flush_lock);
> > > > > > >>>  	if (waitqueue_active(&pool->flush_wait))
> > > > > > >>>  		wake_up(&pool->flush_wait);
> > > > > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> > > > > > >>>
> > > > > > >>>  	/* If we've pinned too many pages, request a flush */
> > > > > > >>>  	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > > > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > > > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > > > > > >>> +		smp_rmb();
> > > > > > >>
> > > > > > >> You won’t need these explicit barriers since the above atomic
> > > > > > >> and write-once already issue them.
> > > > > > >
> > > > > > > No, they don't. Use smp_store_release() and smp_load_acquire()
> > > > > > > if you want to do something like this, but I still can't
> > > > > > > quite figure out if this usage of unlocked memory accesses
> > > > > > > makes any sense at all.
> > > > > > >
> > > > > > Indeed, I see that now, thanks. Yeah, these multi-variable
> > > > > > checks can indeed be racy, but they are under lock at least
> > > > > > for this code path. But there are a few hot-path places where
> > > > > > single-variable states are evaluated atomically instead of
> > > > > > under a heavy lock.
> > > > >
> > > > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > > > >
> > > > > Thanks
> > > > >
> > > > > > Regards,
> > > > > > Santosh
> > > > >
> > > > > Thank you Santosh, Leon and Jason for reviewing the patch.
> > > > >
> > > > > 1. Leon, the bool variable "flush_ongoing" introduced through
> > > > > the patch has to be accessed only after acquiring the mutex
> > > > > lock. Hence it is well protected.
> > > >
> > > > I don't see any lock in the rds_ib_free_mr() function where you
> > > > perform "if (!READ_ONCE(pool->flush_ongoing)) { ...".
> > > >
> > > > > 2. As the commit message already conveys the reason, the check
> > > > > being made in the function "rds_ib_free_mr" is only to avoid
> > > > > redundant asynchronous workers from being spawned.
> > > > >
> > > > > 3. The synchronous flush paths through the function
> > > > > "rds_free_mr", with either cookie=0 or the "invalidate" flag
> > > > > being set, have to be honoured as per the semantics, and hence
> > > > > these paths have to execute the function "rds_ib_flush_mr_pool"
> > > > > unconditionally.
> > > > >
> > > > > 4. It complicates the patch to identify where the function
> > > > > "rds_ib_flush_mr_pool" has been called from. And hence, this
> > > > > patch uses the state of the bool variable to stop the
> > > > > asynchronous workers.
> > > > >
> > > > > 5. We knew that "queue_delayed_work" ensures only one work is
> > > > > running, but avoiding these async workers during high-load
> > > > > situations made way for the allocation and synchronous callers,
> > > > > which would otherwise be waiting long for the flush to happen.
> > > > > A great reduction in the number of calls to the function
> > > > > "rds_ib_flush_mr_pool" has been observed with this patch.
> > > >
> > > > So if you understand that there is only one work in progress, why
> > > > do you say workerS?
> > > >
> > > > Thanks
> > > >
> > > > > 6. Jason, the only function, "rds_ib_free_mr", which accesses
> > > > > the introduced bool variable "flush_ongoing" to spawn a flush
> > > > > worker, does not crucially impact the availability of MRs,
> > > > > because the flush happens from the allocation path as well when
> > > > > necessary. Hence the load-store ordering is not essentially
> > > > > needed here, because of which we chose smp_rmb() and smp_wmb()
> > > > > over smp_load_acquire() and smp_store_release().
> > > > >
> > > > > Regards,
> > > > > Praveen.
> > > >
> > > > Jason,
> > > >
> > > > The relaxed ordering primitives smp_rmb() and smp_wmb() ensure to
> > > > provide guaranteed atomic memory operations READ_ONCE and
> > > > WRITE_ONCE, used in the functions "rds_ib_free_mr" and
> > > > "rds_ib_flush_mr_pool" correspondingly.
> > > >
> > > > Yes, the memory barrier primitives smp_load_acquire() and
> > > > smp_store_release() are even better. But, because of the
> > > > simplicity of the use of memory barriers in the patch, smp_rmb()
> > > > and smp_wmb() were chosen.
> > > >
> > > > Please let me know if you want me to switch to smp_load_acquire()
> > > > and smp_store_release().
> > > >
> > > > Leon,
> > > >
> > > > Avoiding the asynchronous worker from being spawned during the
> > > > high-load situations makes way for both the synchronous and
> > > > allocation paths to flush the mr pool and grab the mr without
> > > > waiting.
> > > >
> > > > Please let me know if you still have any queries in this respect
> > > > or if any modifications are needed.
> >
> > Thank you for your reply, Leon.
>
> > > I didn't get any answer on my questions.
> > > So let me repeat them.
> > > 1. Where can I see locks in rds_ib_free_mr() that protect
> > > concurrent change of your new flush_ongoing field?
> >
> > The flush_ongoing variable is only modified in the function
> > "rds_ib_flush_mr_pool" under the mutex lock. It is only being read
> > atomically in the function "rds_ib_free_mr()", with a memory barrier
> > in place. Hence a lock is not essential in this function
> > "rds_ib_free_mr()". Depending on the value being read, the decision
> > is taken whether to spawn the asynchronous worker or not.
>
> This is not how locking works.
>
> CPU1                                  CPU2
> rds_ib_flush_mr_pool
>   lock
>   context switch ->                   rds_ib_free_mr
>                                         READ old value of flush_ongoing
>   <- context switch
>   WRITE new value to flush_ongoing
>
> > > 2. Only one instance of the same work can be executed/scheduled;
> > > where do you see multiple/parallel workers (plural), and how is it
> > > possible?
> >
> > In my earlier comment, I have re-iterated the same point. I would
> > take back my word "workerS". By avoiding this asynchronous worker's
> > participation in flushing, the synchronous flush jobs (cookie=0 and
> > invalidate=1) as well as the allocation-path worker will be acquiring
> > the mutex lock in the function "rds_ib_flush_mr_pool" quickly,
> > thereby fetching the MR.
>
> This is a completely different scenario from what was presented before.
> You are interested in prioritizing synchronous operations over async.
> In such a case, why don't you simply cancel_delayed_work() in your sync
> flows?

Thank you, Leon. Will verify this approach and post the updated patch
for review.

> > I hope I am clear.
> >
> > > BTW, please fix your email client to reply inline.
> >
> > Fixed it. Thank you.
> > > >
> > > > Regards,
> > > > Praveen.
> >
> > Regards,
> > Praveen.