On Wed, Apr 19 2023 at 11:36P -0400, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > On Tue, Apr 18, 2023 at 03:12:04PM -0700, Sarthak Kukreti wrote: > > Introduce block request REQ_OP_PROVISION. The intent of this request > > is to request underlying storage to preallocate disk space for the given > > block range. Block devices that support this capability will export > > a provision limit within their request queues. > > > > This patch also adds the capability to call fallocate() in mode 0 > > on block devices, which will send REQ_OP_PROVISION to the block > > device for the specified range, > > > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@xxxxxxxxxxxx> > > --- > > block/blk-core.c | 5 ++++ > > block/blk-lib.c | 53 +++++++++++++++++++++++++++++++++++++++ > > block/blk-merge.c | 18 +++++++++++++ > > block/blk-settings.c | 19 ++++++++++++++ > > block/blk-sysfs.c | 8 ++++++ > > block/bounce.c | 1 + > > block/fops.c | 25 +++++++++++++----- > > include/linux/bio.h | 6 +++-- > > include/linux/blk_types.h | 5 +++- > > include/linux/blkdev.h | 16 ++++++++++++ > > 10 files changed, 147 insertions(+), 9 deletions(-) > > > > <cut to the fallocate part; the block/ changes look fine to /me/ at > first glance, but what do I know... ;)> > > > diff --git a/block/fops.c b/block/fops.c > > index d2e6be4e3d1c..e1775269654a 100644 > > --- a/block/fops.c > > +++ b/block/fops.c > > @@ -611,9 +611,13 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to) > > return ret; > > } > > > > +#define BLKDEV_FALLOC_FL_TRUNCATE \ > > At first I thought from this name that you were defining a new truncate > mode for fallocate, then I realized that this is mask for deciding if we > /want/ to truncate the pagecache. > > #define BLKDEV_FALLOC_TRUNCATE_MASK ? > > > + (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE | \ > > Ok, so discarding and writing zeroes truncates the page cache, makes > sense since we're "writing" directly to the block device. > > > + FALLOC_FL_NO_HIDE_STALE) > > Here things get tricky -- some of the FALLOC_FL mode bits are really an > opcode and cannot be specified together, whereas others select optional > behavior for certain opcodes. > > IIRC, the mutually exclusive opcodes are: > > PUNCH_HOLE > ZERO_RANGE > COLLAPSE_RANGE > INSERT_RANGE > (none of the above, for allocation) > > and the "variants on a theme are": > > KEEP_SIZE > NO_HIDE_STALE > UNSHARE_RANGE > > not all of which are supported by all the opcodes. > > Does it make sense to truncate the page cache if userspace passes in > mode == NO_HIDE_STALE? There's currently no defined meaning for this > combination, but I think this means we'll truncate the pagecache before > deciding if we're actually going to issue any commands. > > I think that's just a bug in the existing code -- it should be > validating that @mode is any of the supported combinations *before* > truncating the pagecache. > > Otherwise you could have a mkfs program that starts writing new fs > metadata, decides to provision the storage (say for a logging region), > doesn't realize it's running on an old kernel, and then oops the > provision attempt fails but have we now shredded the pagecache and lost > all the writes? While that just caused me to have an "oh shit, that's crazy" (in a scary way) belly laugh... (And obviously needs fixing independent of this patchset) Shouldn't mkfs first check that the underlying storage supports REQ_OP_PROVISION by verifying /sys/block/<dev>/queue/provision_max_bytes exists and is not 0? (Just saying, we need to add new features more defensively.. you just made the case based on this scenario's implications alone) Sarthak, please note I said "provision_max_bytes": all other ops (e.g. DISCARD, WRITE_ZEROES, etc) have <op>_max_bytes exported through sysfs, not <op>_max_sectors. Please export provision_max_bytes, e.g.: diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 202aa78f933e..2e5ac7b1ffbd 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -605,12 +605,12 @@ QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size"); QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size"); QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); -QUEUE_RO_ENTRY(queue_max_provision_sectors, "max_provision_sectors"); QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity"); QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes"); QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes"); QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); +QUEUE_RO_ENTRY(queue_provision_max, "provision_max_bytes"); QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes"); QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes"); QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");