Re: [PATCH] md: Use optimal I/O size for last bitmap page

Dear Jonathan,


Thank you for your patch.

Am 17.02.23 um 00:52 schrieb Jonathan Derrick:

On 2/10/2023 10:32 AM, Song Liu wrote:

On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick wrote:

Any thoughts on this?

I am really sorry that I missed this patch.

On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
From: Jon Derrick <jonathan.derrick@xxxxxxxxx>

If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked to ensure it won't overrun the data or metadata.

This change helps increase performance by preventing unnecessary
device-side read-modify-writes due to non-atomic write unit sizes.

Example biosnoop log. Device LBA size 512, optimal I/O size 4k:
Before:
Time        Process        PID     Device      LBA        Size      Lat
0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02

After:
Time        Process        PID     Device      LBA        Size      Lat
18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02

Do we see significant improvements from io benchmarks?

Yes. With lbaf=512, optimal I/O size=4k:

Without patch:
   write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
With patch:
   write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets

It's going to be different for different drives, but this was a drive where a
drive-side read-modify-write has a huge penalty.

Nice. Can you share the drive model used for benchmarking? Maybe also add that to the commit message?

IIUC, fewer future HDDs will use a 512B LBA size. We probably won't need
such optimizations in the future.

Maybe. But many drives still ship with 512B formatting as default.

Is that the problem on “desktop” HDDs, where `PHY-SEC` and `LOG-SEC` are different?

$ lsblk -o NAME,MODEL,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME -d /dev/sda
NAME MODEL              ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sda  TOSHIBA DT01ACA050         0   4096      0    4096     512    1 bfq        64 128    0B

(Though these are not used in a RAID here; see the product overview [1].)


Kind regards,

Paul


[1]: https://toshiba.semicon-storage.com/content/dam/toshiba-ss-v3/master/en/storage/product/internal-specialty/cHDD-DT01ACAxxx-Product-Overview.pdf


Signed-off-by: Jon Derrick <jonathan.derrick@xxxxxxxxx>
---
  drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
  1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index e7cc6ba1b657..569297ea9b99 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
       rdev = NULL;
       while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
               int size = PAGE_SIZE;
+             int optimal_size = PAGE_SIZE;
               loff_t offset = mddev->bitmap_info.offset;

               bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
@@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                       int last_page_size = store->bytes & (PAGE_SIZE-1);
                       if (last_page_size == 0)
                               last_page_size = PAGE_SIZE;
-                     size = roundup(last_page_size,
-                                    bdev_logical_block_size(bdev));
+                     size = roundup(last_page_size, bdev_logical_block_size(bdev));
+                     if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
+                             optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
+                     else
+                             optimal_size = size;
               }
+
+
               /* Just make sure we aren't corrupting data or
                * metadata
                */
@@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                               goto bad_alignment;
               } else if (offset < 0) {
                       /* DATA  BITMAP METADATA  */
-                     if (offset
-                         + (long)(page->index * (PAGE_SIZE/512))
-                         + size/512 > 0)
+                     loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
+                     if (size != optimal_size &&
+                         off + optimal_size/512 <= 0)
+                             size = optimal_size;
+                     else if (off + size/512 > 0)
                               /* bitmap runs in to metadata */
                               goto bad_alignment;
                       if (rdev->data_offset + mddev->dev_sectors
@@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
                               goto bad_alignment;
               } else if (rdev->sb_start < rdev->data_offset) {
                       /* METADATA BITMAP DATA */
-                     if (rdev->sb_start
-                         + offset
-                         + page->index*(PAGE_SIZE/512) + size/512
-                         > rdev->data_offset)
+                     loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
+                     if (size != optimal_size &&
+                         off + optimal_size/512 <= rdev->data_offset)
+                             size = optimal_size;
+                     else if (off + size/512 > rdev->data_offset)
                               /* bitmap runs in to data */
                               goto bad_alignment;
               } else {


