Re: [RFC] 'discard sectors' block request.

Jens Axboe <jens.axboe@xxxxxxxxxx> · Tue, 5 Aug 2008 13:42:11 +0200

On Tue, Aug 05 2008, David Woodhouse wrote:
> On Sat, 2008-07-26 at 13:02 -0700, Andrew Morton wrote:
> > I seem to be hearing a lot of silence over support for SSD devices.  I
> > have this vague worry that there will be a large rollout of SSD
> > hardware and Linux will be found to have pants-around-ankles.
> > 
> > For example, support for the T13 Trim command.  There will be others,
> > but I surely don't know what they are.
> 
> Here's a first attempt at that. It implements basic support for
> 'discard' requests in the block layer, implements that in FTL (the flash
> translation layer used on PCMCIA flash cards), and in FAT.
> 
> It doesn't yet do any merging of discard requests, and everything in
> block/ needs careful review -- I don't know that code well at all. But
> it seems to pass at least basic testing: 
> 
>  modprobe mtdram
>  modprobe mtdchar
>  ftl_format /dev/mtd0
>  modprobe ftl
>  mkfs.msdos /dev/ftla
>  mount /dev/ftla /mnt/test
>  cp /etc/services /mnt/test
>  rm /mnt/test/services
> 
> ... and (with appropriate debugging added) you can see the sectors being
> thrown away. And 'od -t x1 /dev/ftla' will confirm that.
> 
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index a09ead1..e0ac1c8 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -258,7 +258,6 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
>  			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
>  		clear_bit(BIO_UPTODATE, &bio->bi_flags);
>  	}
> -
>  	complete(bio->bi_private);
>  }
>  
> @@ -315,3 +314,62 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
>  	return ret;
>  }
>  EXPORT_SYMBOL(blkdev_issue_flush);
> +
> +/**
> + * blkdev_issue_discard - queue a discard
> + * @bdev:	blockdev to issue discard for
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to discard
> + * @error_sector:	error sector
> + *
> + * Description:
> + *    Issue a discard request for the sectors in question. Caller can supply
> + *    room for storing the error offset in case of a discard error, if they
> + *    wish to.
> + */
> +int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> +			 unsigned nr_sects, sector_t *error_sector)
> +{
> +	DECLARE_COMPLETION_ONSTACK(wait);
> +	struct request_queue *q;
> +	struct bio *bio;
> +	int ret;
> +
> +	if (bdev->bd_disk == NULL)
> +		return -ENXIO;
> +
> +	q = bdev_get_queue(bdev);
> +	if (!q)
> +		return -ENXIO;
> +
> +	bio = bio_alloc(GFP_KERNEL, 0);
> +	if (!bio)
> +		return -ENOMEM;
> +
> +	bio->bi_end_io = bio_end_empty_barrier;
> +	bio->bi_private = &wait;
> +	bio->bi_bdev = bdev;
> +	bio->bi_sector = sector;
> +	bio->bi_size = nr_sects << 9;
> +	submit_bio(1 << BIO_RW_DISCARD, bio);
> +
> +	wait_for_completion(&wait);
> +
> +	/*
> +	 * The driver must store the error location in ->bi_sector, if
> +	 * it supports it. For non-stacked drivers, this should be copied
> +	 * from rq->sector.
> +	 */
> +	if (error_sector)
> +		*error_sector = bio->bi_sector;
> +
> +	ret = 0;
> +	if (bio_flagged(bio, BIO_EOPNOTSUPP))
> +		ret = -EOPNOTSUPP;
> +	else if (!bio_flagged(bio, BIO_UPTODATE))
> +		ret = -EIO;
> +
> +	bio_put(bio);
> +	return ret;
> +}
> +EXPORT_SYMBOL(blkdev_issue_discard);

I was going to suggest that if you want this sync, then you should cut
this code out of blkdev_issue_flush() and reuse that helper for
blkdev_issue_flush and blkdev_issue_discard(). But I don't think the
sync interface makes a lot of sense. Also, get rid of the error_sector
stuff. First of all it isn't really used consistently in the flush part
since most block drivers don't have this information (and/or fill it
in), and secondly you can just store that in bi_sector or whatever
instead for this request.

int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
			 unsigned nr_sects, bio_end_io_t *end_io)
{
	struct request_queue *q;
	struct bio *bio;
	int ret;

	if (bdev->bd_disk == NULL)
		return -ENXIO;

	q = bdev_get_queue(bdev);
	if (!q)
		return -ENXIO;

	bio = bio_alloc(GFP_KERNEL, 0);
	if (!bio)
		return -ENOMEM;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_size = nr_sects << 9;
	submit_bio(1 << BIO_RW_DISCARD, bio);
        return 0;
}

That would be the simpler async variant. Let the caller pass in the
end_io function as well, as the fs would definitely want to check for
EOPNOTSUPP or other errors itself on completion. Remember to do the
bio_put() in there as well at the end. Perhaps you want a sync interface
as well? Then I would suggest passing in the bio instead of allocating
it, or just add a blkdev_issue_discard_sync() with the above body moved
to a stsatic __blkdev_issue_discard().

Otherwise that akpm said, he covered basically all my points. Lets do a

static inline int bio_data_less(struct bio *bio)
{
        return !bio->bi_io_vec;
}

like you suggested for checking whether the bio carries data or not,
since we need that for the empty barrier as well (and other future bio
hints we want to pass down).

> @@ -1886,7 +1899,7 @@ static int blk_end_io(struct request *rq, int error, unsigned int nr_bytes,
>  	struct request_queue *q = rq->q;
>  	unsigned long flags = 0UL;
>  
> -	if (blk_fs_request(rq) || blk_pc_request(rq)) {
> +	if (blk_fs_request(rq) || blk_pc_request(rq) || blk_discard_rq(rq)) {
>  		if (__end_that_request_first(rq, error, nr_bytes))
>  			return 1;
>  
> @@ -1944,7 +1957,7 @@ EXPORT_SYMBOL_GPL(blk_end_request);
>   **/
>  int __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
>  {
> -	if (blk_fs_request(rq) || blk_pc_request(rq)) {
> +	if (blk_fs_request(rq) || blk_pc_request(rq) || blk_discard_rq(rq)) {
>  		if (__end_that_request_first(rq, error, nr_bytes))
>  			return 1;
>  	}

Do you need to call into __end_request_request_first()? You don't need
to fill error sector now, and there's no bio data to complete.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html