Re: [RFC PATCH] dm: Avoid sending redundant empty flush bios to the same block device

YangYang <yang.yang@xxxxxxxx> · Sat, 11 May 2024 19:05:23 +0800

On 2024/5/9 22:20, Mike Snitzer wrote:
On Mon, Apr 22, 2024 at 06:05:40PM +0800, Yang Yang wrote:
__send_empty_flush() sends empty flush bios to every target in the
dm_table. However, if the num_targets exceeds the number of block
devices in the dm_table's device list, it could lead to multiple
invocations of __send_duplicate_bios() for the same block device.
Typically, a single thread sending numerous empty flush bios to one
block device is redundant, as these bios are likely to be merged by the
flush state machine. In scenarios where num_targets significantly
outweighs the number of block devices, such behavior may result in a
noteworthy decrease in performance.

This issue can be reproduced using this command line:
   for i in {0..1023}; do
     echo $((8000*$i)) 8000 linear /dev/sda2 $((16384*$i))
   done | dmsetup create example

Only _one_ dm_dev will be created for this example: /dev/sda2
So your bitmap is looking at a single bit.

Yes. And this patch has also been tested on a scenario with multiple
dm_devs using the following command line:
  for i in {0..1023}; do
    if [[ $i -gt 511 ]]; then
      echo $((8000*$i)) 8000 linear /dev/sda3 $((16384*$i))
    else
      echo $((8000*$i)) 8000 linear /dev/sda2 $((16384*$i))
    fi
  done | sudo dmsetup create example

With this fix, a random write with fsync workload executed with the
following fio command:

   fio --group_reporting --name=benchmark --filename=/dev/mapper/example \
       --ioengine=sync --invalidate=1 --numjobs=16 --rw=randwrite \
       --blocksize=4k --size=2G --time_based --runtime=30 --fdatasync=1

results in an increase from 857 KB/s to 30.8 MB/s of the write
throughput (3580% increase).

Signed-off-by: Yang Yang <yang.yang@xxxxxxxx>

I'm including my original fine-grained review comments inlined below
BUT, having wasted time reviewing this patch I'm left with the
conclusions:

1) this patch has serious issues.
2) it fixes an issue with a toy 'example' but ignores that not all
    targets are linear, each disparate target could have their own
    reason(s) for actually _needing_ the redundant flushes.

I chose this specific 'example' because the constructed scenario closely
resembles real-world use cases.
This is a real-life scenario that we have encountered:
1) Call fallocate(file_fd, 0, 0, SZ_8G)
2) Call ioctl(file_fd, FS_IOC_FIEMAP, fiemap). In situations of severe
file system fragmentation, fiemap->fm_mapped_extents may exceed 1000.
3) Create a dm-linear device based on fiemap->fm_extents
4) Create a snapshot-cow device based on the dm-linear device

I guess targets needing the redundant flushes you mentioned are targets
with ti->num_flush_bios > 1. I think they won't be affected, duplicate
flush bios are still send to dm_dev by __send_duplicate_bios().
The main purpose of this patch is to ensure that __send_duplicate_bios()
is called only once for each dm_dev.

Anyway, I will modify this patch to apply only to dm-linear and dm-stripe.

I'm inclined to never take this type of change.

---
  drivers/md/dm-core.h          |  1 +
  drivers/md/dm-table.c         |  7 +++++
  drivers/md/dm.c               | 59 +++++++++++++++++++++++++++++++++++
  include/linux/device-mapper.h |  6 ++++
  4 files changed, 73 insertions(+)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index e6757a30dcca..7e3f2168289f 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -217,6 +217,7 @@ struct dm_table {
  	/* a list of devices used by this table */
  	struct list_head devices;
  	struct rw_semaphore devices_lock;
+	unsigned short num_devices;
  
  	/* events get handed up using this callback */
  	void (*event_fn)(void *data);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 41f1d731ae5a..ddc60e498afb 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -2133,6 +2133,8 @@ void dm_table_postsuspend_targets(struct dm_table *t)
  
  int dm_table_resume_targets(struct dm_table *t)
  {
+	struct list_head *devices = dm_table_get_devices(t);
+	struct dm_dev_internal *dd;
  	unsigned int i;
  	int r = 0;
  
@@ -2159,6 +2161,11 @@ int dm_table_resume_targets(struct dm_table *t)
  			ti->type->resume(ti);
  	}
  
+	t->num_devices = 0;
+
+	list_for_each_entry(dd, devices, list)
+		dd->dm_dev->index = ++(t->num_devices);
+
  	return 0;
  }
  
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 56aa2a8b9d71..7297235291f6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -48,6 +48,8 @@
   */
  #define REQ_DM_POLL_LIST	REQ_DRV
  
+#define DM_MAX_TABLE_DEVICES	1024
+

This name is too general. This limit isn't imposed for anything other
than bounding the size of the bitmap used to avoid redundant flushes.

So maybe rename to: DM_MAX_REDUNDANT_FLUSH_BITMAP_DEVICES

Got it!

  static const char *_name = DM_NAME;
  
  static unsigned int major;
@@ -1543,10 +1545,38 @@ static unsigned int __send_duplicate_bios(struct clone_info *ci, struct dm_targe
  	return ret;
  }
  
+static inline bool has_redundant_flush(struct dm_table *t,
+		unsigned long **bitmap)
+{
+	if (t->num_devices < t->num_targets) {
+		/* Add a limit here to prevent excessive memory usage for bitmaps */
+		if (t->num_devices >= DM_MAX_TABLE_DEVICES)
+			return false;

OK, in practice I'm not aware of tables that require such an excessive
amount of devices.

Noted!

+		/* dm_dev's index starts from 1, so need plus 1 here */
+		*bitmap = bitmap_zalloc(t->num_devices + 1, GFP_KERNEL);
+		if (*bitmap)
+			return true;
+	}

This method is being used in the IO path, so you cannot use
GFP_KERNEL.  Please switch to GFP_NOIO.

Got it!

+
+	return false;
+}
+
+static int dm_get_dev_index(struct dm_target *ti, struct dm_dev *dev,
+				     sector_t start, sector_t len, void *data)
+{
+	unsigned short *index = data;
+	*index = dev->index;
+	return 0;
+}
+
  static void __send_empty_flush(struct clone_info *ci)
  {
  	struct dm_table *t = ci->map;
  	struct bio flush_bio;
+	unsigned long *handled_map;

Please rename, e.g.: handled_devices_bitmap

Got it!

+	unsigned int nr_handled = 0;
+	bool check = has_redundant_flush(t, &handled_map);
  
  	/*
  	 * Use an on-stack bio for this, it's safe since we don't
@@ -1562,17 +1592,46 @@ static void __send_empty_flush(struct clone_info *ci)
  
  	for (unsigned int i = 0; i < t->num_targets; i++) {
  		unsigned int bios;
+		unsigned short index = 0;
  		struct dm_target *ti = dm_table_get_target(t, i);
  
  		if (unlikely(ti->num_flush_bios == 0))
  			continue;
  
+		/*
+		 * If the num_targets is greater than the number of block devices
+		 * in the dm_table's devices list, __send_empty_flush() might
+		 * invoke __send_duplicate_bios() multiple times for the same
+		 * block device. This could lead to a substantial decrease in
+		 * performance when num_targets significantly exceeds the number
+		 * of block devices.
+		 * Ensure that __send_duplicate_bios() is only called once for
+		 * each block device.
+		 */
+		if (check) {
+			if (nr_handled == t->num_devices)
+				break;
+
+			if (ti->type->iterate_devices)
+				ti->type->iterate_devices(ti, dm_get_dev_index, &index);

You're looping through all data devices in a target, so you're only
getting the last device in the target's index.  That seems very
broken.

But each target in your test 'example' device (mentioned in the patch
header) only has a single device so you wouldn't have noticed this.

Perhaps the code comments are not sufficiently detailed, so allow me to
provide a detailed explanation of the execution flow of this code.

For device table only has a single device:
example: 0 8000 linear 8:2 0            -> dm_dev(8:2).index = 1
example: 8000 8000 linear 8:2 16384     -> dm_dev(8:2).index = 1
example: 16000 8000 linear 8:2 32768    -> dm_dev(8:2).index = 1
example: 24000 8000 linear 8:2 49152    -> dm_dev(8:2).index = 1
example: 32000 8000 linear 8:2 65536    -> dm_dev(8:2).index = 1
example: 40000 8000 linear 8:2 81920    -> dm_dev(8:2).index = 1
example: 48000 8000 linear 8:2 98304    -> dm_dev(8:2).index = 1
example: 56000 8000 linear 8:2 114688   -> dm_dev(8:2).index = 1
example: 64000 8000 linear 8:2 131072   -> dm_dev(8:2).index = 1

Before enter the loop:
    nr_handled = 0
    t->num_devices = 1

When i equals to 0:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=0, t->num_devices=1
        break;

    if (ti->type->iterate_devices)
        ti->type->iterate_devices(ti, dm_get_dev_index, &index) //index is set to 1

    if (index > 0) {
        if (__test_and_set_bit(index, handled_map)) //__test_and_set_bit() return 0
            continue;
        else
            nr_handled++;    //nr_handled = 1
    }

    bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios,
                     NULL, GFP_NOWAIT);  //__send_duplicate_bios() is called
}

When i equals to 1:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=1, t->num_devices=1, break the loop
        break;
}

__send_duplicate_bios() is called for device (8:2) only once.

======================================================================

For device table has two devices:
example: 0 8000 linear 8:2 0            -> dm_dev(8:2).index = 1
example: 8000 8000 linear 8:2 16384     -> dm_dev(8:2).index = 1
example: 16000 8000 linear 8:2 32768    -> dm_dev(8:2).index = 1
example: 24000 8000 linear 8:2 49152    -> dm_dev(8:2).index = 1
example: 32000 8000 linear 8:2 65536    -> dm_dev(8:2).index = 1
example: 40000 8000 linear 8:3 81920    -> dm_dev(8:3).index = 2
example: 48000 8000 linear 8:3 98304    -> dm_dev(8:3).index = 2
example: 56000 8000 linear 8:3 114688   -> dm_dev(8:3).index = 2
example: 64000 8000 linear 8:3 131072   -> dm_dev(8:3).index = 2

Before enter the loop:
    nr_handled = 0
    t->num_devices = 2

When i equals to 0:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=0, t->num_devices=2
        break;

    if (ti->type->iterate_devices)
        ti->type->iterate_devices(ti, dm_get_dev_index, &index) //index is set to 1

    if (index > 0) {
        if (__test_and_set_bit(index, handled_map)) //__test_and_set_bit() return 0
            continue;
        else
            nr_handled++;    //nr_handled = 1
    }

    bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios,
                     NULL, GFP_NOWAIT);    //__send_duplicate_bios() is called
}

When i equals to 1, 2, 3, or 4:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=1, t->num_devices=2
        break;

    if (ti->type->iterate_devices)
        ti->type->iterate_devices(ti, dm_get_dev_index, &index) //index is set to 1

    if (index > 0) {
        if (__test_and_set_bit(index, handled_map)) //__test_and_set_bit() return 1
            continue;
        else
            nr_handled++;
    }
}

When i equals to 5:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=1, t->num_devices=2
        break;

    if (ti->type->iterate_devices)
        ti->type->iterate_devices(ti, dm_get_dev_index, &index) //index is set to 2

    if (index > 0) {
        if (__test_and_set_bit(index, handled_map)) //__test_and_set_bit() return 0
            continue;
        else
            nr_handled++;    //nr_handled = 2
    }

    bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios,
                     NULL, GFP_NOWAIT); //__send_duplicate_bios() is called for the second time
}

When i equals to 6:
for (unsigned int i = 0; i < t->num_targets; i++) {
    if (nr_handled == t->num_devices) //nr_handled=2, t->num_devices=2, break the loop
        break;
}

__send_duplicate_bios() is called only once for each of devices (8:2) and (8:3).

+
+			if (index > 0) {
+				if (__test_and_set_bit(index, handled_map))
+					continue;
+				else
+					nr_handled++;

Think you really mean to set bits in the bitmap within the
iterate_devices callout.  So it should be renamed accordingly.

Why not count the first time a device is handled in nr_handled?
Also, it strikes me as strange that you're break'ing out this loop
early based on nr_handled... not seeing the point (and also it seems
broken, because it implies you aren't issuing flushes to targets at
the end).

+			}
+		}
+
  		atomic_add(ti->num_flush_bios, &ci->io->io_count);
  		bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios,
  					     NULL, GFP_NOWAIT);
  		atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
  	}
  
+	if (check)
+		bitmap_free(handled_map);
+
  	/*
  	 * alloc_io() takes one extra reference for submission, so the
  	 * reference won't reach 0 without the following subtraction
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 82b2195efaca..4a54b4f0a609 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -169,6 +169,12 @@ struct dm_dev {
  	struct dax_device *dax_dev;
  	blk_mode_t mode;
  	char name[16];
+
+	/*
+	 * sequential number for each dm_dev in dm_table's devices list,
+	 * start from 1
+	 */
+	unsigned short index;

Please update this comment to indicate that index=0 is (ab)used as a
flag in __send_empty_flush().

Noted!

Thanks for your comments.