Re: [fuse-devel] Writing to FUSE via mmap extremely slow (sometimes) on some machines?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Find attached a patch which introduces a min_bw and max_bw limit for a
backing_dev_info. As outlined in the commit description, this can be
used to work around the issue until we have a better understanding of
how a real solution would look like.

Could we include this change in Linux? What would be the next step?

Thanks,

On Mon, Mar 9, 2020 at 4:11 PM Michael Stapelberg
<michael+lkml@xxxxxxxxxxxxx> wrote:
>
> Thanks for clarifying. I have modified the mmap test program (see
> attached) to optionally read in the entire file when the WORKAROUND=
> environment variable is set, thereby preventing the FUSE reads in the
> write phase. I can now see a batch of reads, followed by a batch of
> writes.
>
> What’s interesting: when polling using “while :; do grep ^Bdi
> /sys/kernel/debug/bdi/0:93/stats; sleep 0.1; done” and running the
> mmap test program, I see:
>
> BdiDirtied:            3566304 kB
> BdiWritten:            3563616 kB
> BdiWriteBandwidth:       13596 kBps
>
> BdiDirtied:            3566304 kB
> BdiWritten:            3563616 kB
> BdiWriteBandwidth:       13596 kBps
>
> BdiDirtied:            3566528 kB (+224 kB) <-- starting to dirty pages
> BdiWritten:            3564064 kB (+448 kB) <-- starting to write
> BdiWriteBandwidth:       10700 kBps <-- only bandwidth update!
>
> BdiDirtied:            3668224 kB (+ 101696 kB) <-- all pages dirtied
> BdiWritten:            3565632 kB (+1568 kB)
> BdiWriteBandwidth:       10700 kBps
>
> BdiDirtied:            3668224 kB
> BdiWritten:            3665536 kB (+ 99904 kB) <-- all pages written
> BdiWriteBandwidth:       10700 kBps
>
> BdiDirtied:            3668224 kB
> BdiWritten:            3665536 kB
> BdiWriteBandwidth:       10700 kBps
>
> This seems to suggest that the bandwidth measurements only capture the
> rising slope of the transfer, but not the bulk of the transfer itself,
> resulting in inaccurate measurements. This effect is worsened when the
> test program doesn’t pre-read the output file and hence the kernel
> gets fewer FUSE write requests out.
>
> On Mon, Mar 9, 2020 at 3:36 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 9, 2020 at 3:32 PM Michael Stapelberg
> > <michael+lkml@xxxxxxxxxxxxx> wrote:
> > >
> > > Here’s one more thing I noticed: when polling
> > > /sys/kernel/debug/bdi/0:93/stats, I see that BdiDirtied and BdiWritten
> > > remain at their original values while the kernel sends FUSE read
> > > requests, and only goes up when the kernel transitions into sending
> > > FUSE write requests. Notably, the page dirtying throttling happens in
> > > the read phase, which is most likely why the write bandwidth is
> > > (correctly) measured as 0.
> > >
> > > Do we have any ideas on why the kernel sends FUSE reads at all?
> >
> > Memory writes (stores) need the memory page to be up-to-date wrt. the
> > backing file before proceeding.   This means that if the page hasn't
> > yet been cached by the kernel, it needs to be read first.
> >
> > Thanks,
> > Miklos
From 10c5fd0412ab71c14cca7a66c2407bfe3bb861af Mon Sep 17 00:00:00 2001
From: Michael Stapelberg <stapelberg@xxxxxxxxxx>
Date: Tue, 10 Mar 2020 15:48:20 +0100
Subject: [PATCH] backing_dev_info: introduce min_bw/max_bw limits

This allows working around long-standing significant performance issues when
using mmap with files on FUSE file systems such as ObjFS.

The page-writeback code tries to measure how quick file system backing devices
are able to write data.

Unfortunately, our usage pattern seems to hit an unfortunate code path: the
kernel only ever measures the (non-representative) rising slope of the starting
transfer, but the transfer is already over before it could possibly measure the
representative steady-state.

As a consequence, the FUSE write bandwidth sinks steadily down to 0 (!) and
heavily throttles page dirtying in programs trying to write to FUSE.

This patch adds a knob which allows avoiding this situation entirely on a
per-file-system basis by restricting the minimum/maximum bandwidth.

There are no negative effects expected from applying this patch.

See also the discussion on the Linux Kernel Mailing List:

https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@xxxxxxxxxxxxxx/

To inspect the measured bandwidth, check the BdiWriteBandwidth field in
e.g. /sys/kernel/debug/bdi/0:93/stats.

To pin the measured bandwidth to its default of 100 MB/s, use:

    echo 25600 > /sys/class/bdi/0:42/min_bw
    echo 25600 > /sys/class/bdi/0:42/max_bw
---
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/backing-dev.h      |  3 +++
 mm/backing-dev.c                 | 40 ++++++++++++++++++++++++++++++++
 mm/page-writeback.c              | 29 +++++++++++++++++++++++
 4 files changed, 74 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 4fc87dee005a..a29bcb8a799d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -200,6 +200,8 @@ struct backing_dev_info {
 	unsigned int capabilities; /* Device capabilities */
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
+	u64 min_bw;
+	u64 max_bw;
 
 	/*
 	 * Sum of avg_write_bw of wbs with dirty inodes.  > 0 if there are
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f88197c1ffc2..4490bd03aec1 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -111,6 +111,9 @@ static inline unsigned long wb_stat_error(void)
 int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio);
 int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 
+int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw);
+int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw);
+
 /*
  * Flags in backing_dev_info::capability
  *
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 62f05f605fb5..5c10d4425976 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -201,6 +201,44 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t min_bw_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned long long limit;
+	ssize_t ret;
+
+	ret = kstrtoull(buf, 10, &limit);
+	if (ret < 0)
+		return ret;
+
+	ret = bdi_set_min_bw(bdi, limit);
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+BDI_SHOW(min_bw, bdi->min_bw)
+
+static ssize_t max_bw_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned long long limit;
+	ssize_t ret;
+
+	ret = kstrtoull(buf, 10, &limit);
+	if (ret < 0)
+		return ret;
+
+	ret = bdi_set_max_bw(bdi, limit);
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+BDI_SHOW(max_bw, bdi->max_bw)
+
 static ssize_t stable_pages_required_show(struct device *dev,
 					  struct device_attribute *attr,
 					  char *page)
@@ -216,6 +254,8 @@ static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
+	&dev_attr_min_bw.attr,
+	&dev_attr_max_bw.attr,
 	&dev_attr_stable_pages_required.attr,
 	NULL,
 };
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..c7c9eebc4c56 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -713,6 +713,22 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw)
+{
+	spin_lock_bh(&bdi_lock);
+	bdi->min_bw = min_bw;
+	spin_unlock_bh(&bdi_lock);
+	return 0;
+}
+
+int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw)
+{
+	spin_lock_bh(&bdi_lock);
+	bdi->max_bw = max_bw;
+	spin_unlock_bh(&bdi_lock);
+	return 0;
+}
+
 static unsigned long dirty_freerun_ceiling(unsigned long thresh,
 					   unsigned long bg_thresh)
 {
@@ -1080,6 +1096,16 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
 	dtc->pos_ratio = pos_ratio;
 }
 
+static u64 clamp_bw(struct backing_dev_info *bdi, u64 bw) {
+	if (bdi->min_bw > 0 && bw < bdi->min_bw) {
+	  bw = bdi->min_bw;
+	}
+	if (bdi->max_bw > 0 && bw > bdi->max_bw) {
+	  bw = bdi->max_bw;
+	}
+	return bw;
+}
+
 static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 				      unsigned long elapsed,
 				      unsigned long written)
@@ -1103,12 +1129,15 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 	bw *= HZ;
 	if (unlikely(elapsed > period)) {
 		bw = div64_ul(bw, elapsed);
+		bw = clamp_bw(wb->bdi, bw);
 		avg = bw;
 		goto out;
 	}
 	bw += (u64)wb->write_bandwidth * (period - elapsed);
 	bw >>= ilog2(period);
 
+	bw = clamp_bw(wb->bdi, bw);
+
 	/*
 	 * one more level of smoothing, for filtering out sudden spikes
 	 */
-- 
2.25.1


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux