----- Original Message -----
> From: "Shaohua Li" <shli@xxxxxxxxxx>
> To: "Xiao Ni" <xni@xxxxxxxxxx>
> Cc: linux-raid@xxxxxxxxxxxxxxx, "ming lei" <ming.lei@xxxxxxxxxx>, ncroxon@xxxxxxxxxx, neilb@xxxxxxxx
> Sent: Wednesday, April 4, 2018 12:26:52 AM
> Subject: Re: [PATCH V3 1/1] MD: fix lock contention for flush bios
>
> On Mon, Apr 02, 2018 at 10:48:36PM -0400, Xiao Ni wrote:
> >
> >
> > ----- Original Message -----
> > > From: "Shaohua Li" <shli@xxxxxxxxxx>
> > > To: "Xiao Ni" <xni@xxxxxxxxxx>
> > > Cc: linux-raid@xxxxxxxxxxxxxxx, "ming lei" <ming.lei@xxxxxxxxxx>, ncroxon@xxxxxxxxxx, neilb@xxxxxxxx
> > > Sent: Monday, April 2, 2018 7:02:37 AM
> > > Subject: Re: [PATCH V3 1/1] MD: fix lock contention for flush bios
> > >
> > > On Wed, Mar 21, 2018 at 02:47:22PM +0800, Xiao Ni wrote:
> > > > There is lock contention when many processes send flush bios to an
> > > > md device, e.g. when creating many LVs on one RAID device and
> > > > running mkfs.xfs on each LV.
> > > >
> > > > Flush requests are currently handled sequentially: a new flush has
> > > > to wait for mddev->flush_bio to become NULL before it can take
> > > > mddev->lock.
> > > >
> > > > This patch removes mddev->flush_bio and handles flush bios
> > > > asynchronously. I ran a test with dbench -s 128 -t 4800; results:
> > > >
> > > > =================Without the patch============================
> > > > Operation                Count    AvgLat    MaxLat
> > > > --------------------------------------------------
> > > > Flush                     5239   142.590  3972.034
> > > > Close                    53114     0.176   498.236
> > > > LockX                      208     0.066     0.907
> > > > Rename                    2793     0.335     7.203
> > > > ReadX                    98100     0.020     2.280
> > > > WriteX                   67800   555.649  8238.498
> > > > Unlink                    7985     1.742   446.503
> > > > UnlockX                    208     0.058     1.013
> > > > FIND_FIRST               21035     0.141     3.147
> > > > SET_FILE_INFORMATION      6419     0.090     1.539
> > > > QUERY_FILE_INFORMATION   18244     0.007     0.130
> > > > QUERY_PATH_INFORMATION   55622     0.060     3.884
> > > > QUERY_FS_INFORMATION      9451     0.040     1.148
> > > > NTCreateX                63960     0.717   536.542
> > > >
> > > > Throughput 12.1782 MB/sec (sync open) 128 clients 128 procs max_latency=8238.513 ms
> > > >
> > > > =====================With the patch===========================
> > > > Operation                Count    AvgLat    MaxLat
> > > > --------------------------------------------------
> > > > Flush                    34858    36.484   668.243
> > > > Close                   379883     0.107   252.232
> > > > LockX                     1792     0.048     1.070
> > > > Rename                   21761     0.804   266.659
> > > > ReadX                   817947     0.021    42.891
> > > > WriteX                  254804   142.485   948.090
> > > > Unlink                   99665     3.590   899.816
> > > > UnlockX                   1792     0.056     1.240
> > > > FIND_FIRST              178857     0.187    23.287
> > > > SET_FILE_INFORMATION     41612     0.135    26.575
> > > > QUERY_FILE_INFORMATION   83691     0.007     2.589
> > > > QUERY_PATH_INFORMATION  470889     0.077    83.846
> > > > QUERY_FS_INFORMATION     82764     0.056    10.368
> > > > NTCreateX               512262     0.616   809.980
> > > >
> > > > Throughput 53.6545 MB/sec (sync open) 128 clients 128 procs max_latency=948.105 ms
> > > >
> > > > V3:
> > > > Shaohua suggested that a mempool is overkill. V3 allocates the
> > > > memory when the raid device is created and uses a simple bitmap
> > > > to record which resources are free.
> > >
> > > Sorry for the delay. The bitmap method is still too complicated. Can
> > > we do something like this instead:
> > >
> > > in mddev:
> > >     struct flush_info flush_infos[8 or 16];
> > > Maybe we don't even need a special struct.
> > >
> > > The info can include a lock. Every time we handle a flush, we select
> > > a flush_info by flush_infos[jhash(flush_request_bio_address)];
> > >
> > > We then take the lock and do whatever the current code does against
> > > that specific flush_info. Isn't this a simpler implementation? I
> > > think 8 or 16 locks should reduce most of the lock contention.
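To make the striping concrete, here is a minimal sketch of that idea; the
names (NR_FLUSH_INFOS, md_flush_info_for_bio) are placeholders, not the
actual patch:

/* sketch for drivers/md/md.c: per-slot locks selected by hashing the bio */
#include <linux/jhash.h>
#include <linux/spinlock.h>

#define NR_FLUSH_INFOS 16

struct flush_info {
	spinlock_t	lock;
	struct bio	*flush_bio;	/* in-flight flush for this slot */
};

/* in struct mddev: struct flush_info flush_infos[NR_FLUSH_INFOS]; */

static struct flush_info *md_flush_info_for_bio(struct mddev *mddev,
						struct bio *bio)
{
	/* hash the (low bits of the) bio address; each bio maps to one slot */
	u32 idx = jhash_1word((u32)(unsigned long)bio, 0) % NR_FLUSH_INFOS;

	return &mddev->flush_infos[idx];
}

Each slot has its own lock, so two flushes contend only when they hash to
the same slot; with 8 or 16 slots most of the contention on the single
mddev->lock should disappear.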
> >
> > Hi Shaohua
> >
> > There is a problem: if the hash gives the same result for several
> > bios, those flush bios all wait on one flush_info even while other
> > flush_infos are free. With the bitmap method, a flush request can
> > take any free flush_info in flush_infos[].
>
> If the hash algorithm is good, this shouldn't be a problem. And the
> bitmap method needs a global lock, which we want to avoid, I think.

OK, I'm trying to write it with the jhash method. There is one other
problem: when the number of raid disks is updated, the flush_info buffer
needs to be re-allocated and mddev->flush_info replaced with the new
buffer. I'm not sure of the right way to protect that; if flush bios are
being submitted to the md device at the same time, it will cause
problems. Is it right to suspend the md device (mddev->suspended = 1)
while updating the raid disks? Something like the sketch below.

Regards
Xiao
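For reference, a rough sketch of that suspend/resume approach, assuming
the existing mddev_suspend()/mddev_resume() helpers; the resize helper
and the dynamically allocated mddev->flush_infos field are hypothetical:

static int md_resize_flush_infos(struct mddev *mddev, int new_nr)
{
	struct flush_info *new_infos, *old_infos;
	int i;

	new_infos = kcalloc(new_nr, sizeof(*new_infos), GFP_KERNEL);
	if (!new_infos)
		return -ENOMEM;
	for (i = 0; i < new_nr; i++)
		spin_lock_init(&new_infos[i].lock);

	/* quiesce the array: wait for in-flight I/O and block new bios */
	mddev_suspend(mddev);
	old_infos = mddev->flush_infos;
	mddev->flush_infos = new_infos;
	mddev_resume(mddev);

	kfree(old_infos);
	return 0;
}

While the device is suspended, no new flush bios can enter
md_flush_request() and the in-flight ones have completed, so swapping the
pointer needs no extra locking.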