Re: fstrim on raid1 LV with writemostly PV leads to system freeze

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Mon, 25 Sep 2023 10:58:31 +0800

Hi,

On 2023/09/22 6:03, Roman Mamedov wrote:
On Thu, 21 Sep 2023 17:45:24 -0400
Mike Snitzer <snitzer@xxxxxxxxxx> wrote:

I just verified that 6.5.0 does have this DM core fix (needed to
prevent excessive splitting of discard IO.. which could cause fstrim
to take longer for a DM device), but again 6.5.0 has this fix so it
isn't relevant:
be04c14a1bd2 dm: use op specific max_sectors when splitting abnormal io

Given your use of 'writemostly' I'm inferring you're using lvm2's
raid1 that uses MD raid1 code in terms of the dm-raid target.

Discards (more generic term for fstrim) are considered writes, so
writemostly really shouldn't matter... but I know that there have been
issues with MD's writemostly code (identified by others relatively
recently).

All said: hopefully someone more MD oriented can review your report
and help you further.

Mike

I've reported that write-mostly TRIM gets split into 1MB pieces, which can be
an order of magnitude slower on some SSDs: https://www.spinics.net/lists/raid/msg72471.html

Looks like I missed the report.

Based on code review, it's very clear where the discard bio is split:

raid1_write_request
 for (i = 0;  i < disks; i++)
  if (rdev && test_bit(WriteMostly, &rdev->flags))
   write_behind = true

 if (write_behind && bitmap)
  max_sectors = min_t(int, max_sectors, BIO_MAX_VECS * (PAGE_SIZE >> 9))
  // io size is 512 * (256 * (4k >> 9)) = 1M

 if (max_sectors < bio_sectors(bio))
  bio_split
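
To make the arithmetic above concrete, here is a minimal userspace sketch (not kernel code). The MODEL_* constants and both helper names are assumptions for illustration only: BIO_MAX_VECS is taken as 256 and PAGE_SIZE as 4 KiB, which matches the 1M figure in the comment above but may differ on other configurations.

```c
#include <stdint.h>

/* Assumed values: 256 bio vecs, 4 KiB pages, 512-byte sectors. */
#define MODEL_BIO_MAX_VECS 256u
#define MODEL_PAGE_SIZE    4096u
#define MODEL_SECTOR_SHIFT 9

/* Per-bio limit applied on the write-behind path, in 512-byte sectors. */
static inline uint64_t write_behind_max_sectors(void)
{
	return (uint64_t)MODEL_BIO_MAX_VECS *
	       (MODEL_PAGE_SIZE >> MODEL_SECTOR_SHIFT);
}

/*
 * Hypothetical helper: how many pieces a discard of discard_bytes
 * ends up as when every piece is capped at max_sectors.
 */
static inline uint64_t split_count(uint64_t discard_bytes, uint64_t max_sectors)
{
	uint64_t sectors = discard_bytes >> MODEL_SECTOR_SHIFT;

	return (sectors + max_sectors - 1) / max_sectors; /* round up */
}
```

Under these assumptions the cap is 2048 sectors (1 MiB), so a single 1 GiB discard would be cut into 1024 pieces.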

Roman and Kirill, can you test the following patch?

Thanks,
Kuai

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4b30a1742162..4963f864ef99 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1345,6 +1345,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
        int first_clone;
        int max_sectors;
        bool write_behind = false;
+       bool is_discard = (bio_op(bio) == REQ_OP_DISCARD);

        if (mddev_is_clustered(mddev) &&
             md_cluster_ops->area_resyncing(mddev, WRITE,
@@ -1405,7 +1406,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                 * write-mostly, which means we could allocate write behind
                 * bio later.
                 */
-               if (rdev && test_bit(WriteMostly, &rdev->flags))
+               if (!is_discard && rdev && test_bit(WriteMostly, &rdev->flags))
                        write_behind = true;

                if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
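
For anyone who wants to reason about the intended effect before testing, here is a rough userspace model of the decision in raid1_write_request, with and without the change. It is only a sketch: struct model_req, model_max_sectors and the 2048-sector cap (256 * (4096 >> 9), assuming 4 KiB pages) are made up for illustration and are not kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed write-behind cap: 256 * (4096 >> 9) sectors = 1 MiB. */
#define MODEL_WRITE_BEHIND_SECTORS 2048u

/* Hypothetical stand-in for the per-request state; not a kernel struct. */
struct model_req {
	bool is_discard;        /* bio_op(bio) == REQ_OP_DISCARD */
	bool has_write_mostly;  /* some rdev has the WriteMostly flag */
	bool has_bitmap;        /* the array has a bitmap */
	uint32_t max_sectors;   /* limit before the write-behind check */
};

/* Effective per-bio limit; 'patched' selects the behaviour with the diff. */
static inline uint32_t model_max_sectors(const struct model_req *r, bool patched)
{
	/* With the patch, discards never mark the request as write-behind. */
	bool write_behind = r->has_write_mostly &&
			    !(patched && r->is_discard);
	uint32_t max = r->max_sectors;

	if (write_behind && r->has_bitmap && max > MODEL_WRITE_BEHIND_SECTORS)
		max = MODEL_WRITE_BEHIND_SECTORS;
	return max;
}
```

In this model a discard on an array with a write-mostly member and a bitmap is capped at 2048 sectors without the change and keeps its original limit with it, while regular writes are capped either way.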



Nobody cared to reply, investigate or fix.

Maybe your system hasn't frozen either, and is just taking its time processing all
the tiny split requests.