Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD

Shaohua Li <shli@xxxxxxxxxx> · Fri, 7 Mar 2014 15:57:33 +0800

ping!

On Tue, Feb 18, 2014 at 06:13:04PM +0800, Shaohua Li wrote:
> 
> This is a simple DM target that supports compression, for SSDs only. The
> underlying SSD must support a 512B sector size; the target itself only supports
> a 4k sector size.
> 
> Disk layout:
> |super|...meta...|..data...|
> 
> The storage unit is 4k (a block). The super block is 1 block; it stores the
> meta and data sizes and the compression algorithm. Meta is a bitmap. Each data
> block has 5 bits of meta.
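
For readers following along, here is a minimal user-space sketch of how 5-bit entries can be packed back to back in a byte array, so that an entry may straddle two bytes. It is only an illustration, not the patch's code: META_BITS, META_MASK, meta_get() and meta_set() are names made up for this example, and the low-bits-first layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define META_BITS 5u                       /* bits of meta per 4k block */
#define META_MASK ((1u << META_BITS) - 1u) /* 0x1f */

/* Read the 5-bit entry for block_index from a packed byte array. */
uint8_t meta_get(const uint8_t *bitmap, uint64_t block_index)
{
	uint64_t first_bit = block_index * META_BITS;
	size_t byte = (size_t)(first_bit >> 3);
	unsigned int offset = (unsigned int)(first_bit & 7);
	unsigned int window = bitmap[byte];

	/* Pull in the next byte only when the entry crosses a byte boundary. */
	if (offset + META_BITS > 8)
		window |= (unsigned int)bitmap[byte + 1] << 8;
	return (uint8_t)((window >> offset) & META_MASK);
}

/* Write the 5-bit entry for block_index, leaving neighbouring entries alone. */
void meta_set(uint8_t *bitmap, uint64_t block_index, uint8_t meta)
{
	uint64_t first_bit = block_index * META_BITS;
	size_t byte = (size_t)(first_bit >> 3);
	unsigned int offset = (unsigned int)(first_bit & 7);
	unsigned int mask = META_MASK << offset;
	unsigned int value = ((unsigned int)meta & META_MASK) << offset;

	assert(meta <= META_MASK);
	bitmap[byte] = (uint8_t)((bitmap[byte] & ~mask) | (value & 0xffu));
	if (offset + META_BITS > 8)
		bitmap[byte + 1] = (uint8_t)((bitmap[byte + 1] & ~(mask >> 8)) |
					     (value >> 8));
}
```

With this layout, block 1 occupies bits 5..9, so its entry is split 3 bits / 2 bits across bytes 0 and 1.
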
> 
> Data:
> The data of a block is compressed. The compressed data is rounded up to a
> multiple of 512B, which is the payload. On disk, the payload is stored at the
> beginning of the block's logical sectors. Let's look at an example. Say we store
> data to block A, which starts at sector B (A*8). Its original size is 4k and its
> compressed size is 1500. The compressed data (CD) will use 3 sectors (512B
> each). These 3 sectors are the payload, which is stored starting at sector B.
> 
> ---------------------------------------------------
> ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> ---------------------------------------------------
>     ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> 
> For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta bits
> to represent the payload size. The compressed size (1500) isn't stored in meta
> directly. Instead, we store it in the last 32 bits of the payload. In this
> example, we store it at the end of sector B+2. If compressed size +
> sizeof(32bits) crosses a sector boundary, the payload size increases by one
> sector. If the payload would use 8 sectors, we store the uncompressed data
> directly.
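
The payload sizing rule above can be sketched in user space as follows. This is a hedged illustration that follows the prose description (fall back to raw data when the payload would not be smaller than the block); the exact threshold used in the patch's insitu_comp_io_range_comp() differs slightly. payload_bytes(), SECTOR_SIZE and the parameter names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Return the payload size in bytes for raw_len bytes of data that compress
 * to compressed_len bytes. A return value equal to raw_len means "store the
 * data uncompressed".
 */
uint32_t payload_bytes(uint32_t raw_len, uint32_t compressed_len)
{
	uint32_t payload;

	assert(compressed_len > 0);

	/* Round the compressed data up to whole 512B sectors. */
	payload = (compressed_len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;

	/* The last 4 bytes of the payload hold the real compressed length. */
	if (payload - compressed_len < sizeof(uint32_t))
		payload += SECTOR_SIZE;

	/* No gain: keep the data as is (assumption taken from the prose). */
	if (payload >= raw_len)
		return raw_len;
	return payload;
}
```

For the example in the text, payload_bytes(4096, 1500) gives 1536, i.e. 3 sectors. A compressed size of 1534 leaves only 2 free bytes in the third sector, so the payload grows to 2048.
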
> 
> If the IO size is bigger than one block, we can store the data as an extent. The
> data of the whole extent is compressed and stored in a similar way to the above.
> The first block of the extent is the head; all the others are tails. If the
> extent is 1 block, that block is the head. We have 1 meta bit to indicate
> whether a block is a head or a tail. If the 4 meta bits of the head block can't
> hold the extent's payload size, we borrow the tail blocks' meta bits to store
> it. The maximum allowed extent size is 128k, so we don't compress/decompress
> too much data at once.
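
A rough sketch of the head/tail encoding, modelled on insitu_comp_set_extent() and insitu_comp_get_extent() in the patch but simplified to one byte per block. The mask values are assumptions (the header defining INSITU_COMP_LENGTH_MASK and INSITU_COMP_TAIL_MASK is not quoted here), and extent_encode()/extent_decode() are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define META_LENGTH_MASK  0x0fu /* assumed: low 4 bits, sector count 0..8 */
#define META_TAIL_MASK    0x10u /* assumed: bit 4 set means "tail block" */
#define SECTORS_PER_BLOCK 8u

/* Spread payload_sectors over the meta entries of an extent of `blocks`. */
void extent_encode(uint8_t *meta, size_t blocks, unsigned int payload_sectors)
{
	size_t i;

	assert(payload_sectors <= blocks * SECTORS_PER_BLOCK);
	for (i = 0; i < blocks; i++) {
		unsigned int n = payload_sectors < SECTORS_PER_BLOCK ?
				 payload_sectors : SECTORS_PER_BLOCK;

		payload_sectors -= n;
		/* The head keeps the tail bit clear; every other block sets it. */
		meta[i] = (uint8_t)(n | (i ? META_TAIL_MASK : 0u));
	}
}

/* Find the extent covering block `index` and sum up its payload sectors. */
void extent_decode(const uint8_t *meta, size_t total_blocks, size_t index,
		   size_t *head, size_t *blocks, unsigned int *payload_sectors)
{
	assert(index < total_blocks);

	/* Walk back to the head block. */
	while (index > 0 && (meta[index] & META_TAIL_MASK))
		index--;
	*head = index;
	*blocks = 1;
	*payload_sectors = meta[index] & META_LENGTH_MASK;

	/* Walk forward over the tails, adding the borrowed length bits. */
	for (index++; index < total_blocks && (meta[index] & META_TAIL_MASK);
	     index++) {
		(*blocks)++;
		*payload_sectors += meta[index] & META_LENGTH_MASK;
	}
}
```

For a 4-block extent with an 11-sector payload, the head stores 8 and the first tail stores the remaining 3, so the lookup from any block of the extent recovers 11 sectors.
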
> 
> Meta:
> Modifying data modifies meta too. Meta is written (flushed) to disk according
> to the meta write policy. We support writeback and writethrough modes. In
> writeback mode, meta is written to disk at an interval or on a FLUSH request.
> In writethrough mode, data and metadata are written to disk together.
> 
> Advantages:
> 1. simple. Since we store compressed data in-place, we don't need complicated
> disk data management.
> 2. efficient. For each 4k, we only need 5 bits of meta. 1T of data will use less
> than 200M of meta, so we can load all meta into memory. The actual compressed
> size is in the payload. So if an IO doesn't need RMW and we use writeback meta
> flush, we don't need extra IO for meta.
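
The "less than 200M" figure can be checked with a few lines of arithmetic. The helper below is a hypothetical stand-alone calculation (meta_bytes() is not part of the patch), assuming 4k blocks and 5 bits per block as stated above.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u
#define META_BITS  5u

/* Bytes of meta bitmap needed to describe data_bytes of 4k blocks. */
uint64_t meta_bytes(uint64_t data_bytes)
{
	uint64_t blocks;

	assert(data_bytes % BLOCK_SIZE == 0);
	blocks = data_bytes / BLOCK_SIZE;
	/* 5 bits per block, rounded up to whole bytes. */
	return (blocks * META_BITS + 7) / 8;
}
```

For 1 TiB of data this gives 2^28 blocks and 167772160 bytes of meta, i.e. 160 MiB.
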
> 
> Disadvantages:
> 1. hole. Since we store compressed data in-place, there are a lot of holes (in
> the above example, B+3 to B+7). Holes can hurt IO, because we can't merge IO.
> 2. 1:1 size. Compression doesn't change disk size. If the disk is 1T, we can
> only store 1T of data even with compression.
> 
> But this is for SSD only. Generally, SSD firmware has an FTL layer that maps
> disk sectors to NAND flash. High-end SSD firmware has a filesystem-like FTL.
> 1. hole. The disk has a lot of holes, but the SSD FTL can still store data
> contiguously in NAND. Even if we can't merge IO in the OS layer, the SSD
> firmware can.
> 2. 1:1 size. On one hand, we write compressed data to the SSD, which means less
> data is written to the SSD. This helps SSD garbage collection, and therefore
> write speed and lifetime. So even though this is a problem, the target is still
> helpful. On the other hand, an advanced SSD FTL can easily do thin
> provisioning. For example, if the NAND is 1T, we can let the SSD report itself
> as 2T and use it as the compressed target. With such an SSD, we don't have the
> 1:1 size issue.
> 
> So if the SSD FTL can map non-contiguous disk sectors to contiguous NAND and
> supports thin provisioning, the compressed target will work very well.
> 
> V2->V3:
> Updated with new bio iter API
> 
> V1->V2:
> 1. Change name to insitu_comp, cleanup code, add comments and doc
> 2. Improve performance (extent locking, dedicated workqueue)
> 
> Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
> ---
>  Documentation/device-mapper/insitu-comp.txt |   50 
>  drivers/md/Kconfig                          |    6 
>  drivers/md/Makefile                         |    1 
>  drivers/md/dm-insitu-comp.c                 | 1480 ++++++++++++++++++++++++++++
>  drivers/md/dm-insitu-comp.h                 |  158 ++
>  5 files changed, 1695 insertions(+)
> 
> Index: linux/drivers/md/Kconfig
> ===================================================================
> --- linux.orig/drivers/md/Kconfig	2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Kconfig	2014-02-17 17:34:45.423464815 +0800
> @@ -295,6 +295,12 @@ config DM_CACHE_CLEANER
>           A simple cache policy that writes back all data to the
>           origin.  Used when decommissioning a dm-cache.
>  
> +config DM_INSITU_COMPRESSION
> +       tristate "Insitu compression target"
> +       depends on BLK_DEV_DM
> +       ---help---
> +         Allow volume managers to insitu compress data for SSD.
> +
>  config DM_MIRROR
>         tristate "Mirror target"
>         depends on BLK_DEV_DM
> Index: linux/drivers/md/Makefile
> ===================================================================
> --- linux.orig/drivers/md/Makefile	2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Makefile	2014-02-17 17:34:45.423464815 +0800
> @@ -53,6 +53,7 @@ obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
>  obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
>  obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
>  obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
> +obj-$(CONFIG_DM_INSITU_COMPRESSION)		+= dm-insitu-comp.o
>  
>  ifeq ($(CONFIG_DM_UEVENT),y)
>  dm-mod-objs			+= dm-uevent.o
> Index: linux/drivers/md/dm-insitu-comp.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.c	2014-02-17 20:16:38.093360018 +0800
> @@ -0,0 +1,1480 @@
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/slab.h>
> +#include <linux/device-mapper.h>
> +#include <linux/dm-io.h>
> +#include <linux/crypto.h>
> +#include <linux/lzo.h>
> +#include <linux/kthread.h>
> +#include <linux/page-flags.h>
> +#include <linux/completion.h>
> +#include "dm-insitu-comp.h"
> +
> +#define DM_MSG_PREFIX "dm_insitu_comp"
> +
> +static struct insitu_comp_compressor_data compressors[] = {
> +	[INSITU_COMP_ALG_LZO] = {
> +		.name = "lzo",
> +		.comp_len = lzo_comp_len,
> +	},
> +	[INSITU_COMP_ALG_ZLIB] = {
> +		.name = "deflate",
> +	},
> +};
> +static int default_compressor;
> +
> +static struct kmem_cache *insitu_comp_io_range_cachep;
> +static struct kmem_cache *insitu_comp_meta_io_cachep;
> +
> +static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
> +static struct workqueue_struct *insitu_comp_wq;
> +
> +/* each block has 5 bits metadata */
> +static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
> +{
> +	u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +	int bits, offset;
> +	u8 data, ret = 0;
> +
> +	offset = first_bit & 7;
> +	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +	data = info->meta_bitmap[first_bit >> 3];
> +	ret = (data >> offset) & ((1 << bits) - 1);
> +
> +	if (bits < INSITU_COMP_META_BITS) {
> +		data = info->meta_bitmap[(first_bit >> 3) + 1];
> +		bits = INSITU_COMP_META_BITS - bits;
> +		ret |= (data & ((1 << bits) - 1)) <<
> +			(INSITU_COMP_META_BITS - bits);
> +	}
> +	return ret;
> +}
> +
> +static void insitu_comp_set_meta(struct insitu_comp_info *info,
> +	u64 block_index, u8 meta, bool dirty_meta)
> +{
> +	u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +	int bits, offset;
> +	u8 data;
> +	struct page *page;
> +
> +	offset = first_bit & 7;
> +	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +	data = info->meta_bitmap[first_bit >> 3];
> +	data &= ~(((1 << bits) - 1) << offset);
> +	data |= (meta & ((1 << bits) - 1)) << offset;
> +	info->meta_bitmap[first_bit >> 3] = data;
> +
> +	/*
> +	 * For writethrough, we write metadata directly. For writeback, if
> +	 * request is FUA, we do this too; otherwise we just dirty the page,
> +	 * which will be flushed out at an interval
> +	 */
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +		page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
> +		if (dirty_meta)
> +			SetPageDirty(page);
> +		else
> +			ClearPageDirty(page);
> +	}
> +
> +	if (bits < INSITU_COMP_META_BITS) {
> +		meta >>= bits;
> +		data = info->meta_bitmap[(first_bit >> 3) + 1];
> +		bits = INSITU_COMP_META_BITS - bits;
> +		data = (data >> bits) << bits;
> +		data |= meta & ((1 << bits) - 1);
> +		info->meta_bitmap[(first_bit >> 3) + 1] = data;
> +
> +		if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +			page = vmalloc_to_page(&info->meta_bitmap[
> +						(first_bit >> 3) + 1]);
> +			if (dirty_meta)
> +				SetPageDirty(page);
> +			else
> +				ClearPageDirty(page);
> +		}
> +	}
> +}
> +
> +/*
> + * set metadata for an extent since block @block_index, length is
> + * @logical_blocks.  The extent uses @data_sectors sectors
> + */
> +static void insitu_comp_set_extent(struct insitu_comp_req *req,
> +	u64 block_index, u16 logical_blocks, sector_t data_sectors)
> +{
> +	int i;
> +	u8 data;
> +
> +	for (i = 0; i < logical_blocks; i++) {
> +		data = min_t(sector_t, data_sectors, 8);
> +		data_sectors -= data;
> +		if (i != 0)
> +			data |= INSITU_COMP_TAIL_MASK;
> +		/* For FUA, we write out meta data directly */
> +		insitu_comp_set_meta(req->info, block_index + i, data,
> +					!(insitu_req_rw(req) & REQ_FUA));
> +	}
> +}
> +
> +/*
> + * get metadata for an extent covering block @block_index. @first_block_index
> + * returns the first block of the extent. @logical_sectors returns the extent
> + * length. @data_sectors returns the sectors the extent uses
> + */
> +static void insitu_comp_get_extent(struct insitu_comp_info *info,
> +	u64 block_index, u64 *first_block_index, u16 *logical_sectors,
> +	u16 *data_sectors)
> +{
> +	u8 data;
> +
> +	data = insitu_comp_get_meta(info, block_index);
> +	while (data & INSITU_COMP_TAIL_MASK) {
> +		block_index--;
> +		data = insitu_comp_get_meta(info, block_index);
> +	}
> +	*first_block_index = block_index;
> +	*logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
> +	*data_sectors = data & INSITU_COMP_LENGTH_MASK;
> +	block_index++;
> +	while (block_index < info->data_blocks) {
> +		data = insitu_comp_get_meta(info, block_index);
> +		if (!(data & INSITU_COMP_TAIL_MASK))
> +			break;
> +		*logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
> +		*data_sectors += data & INSITU_COMP_LENGTH_MASK;
> +		block_index++;
> +	}
> +}
> +
> +static int insitu_comp_access_super(struct insitu_comp_info *info,
> +	void *addr, int rw)
> +{
> +	struct dm_io_region region;
> +	struct dm_io_request req;
> +	unsigned long io_error = 0;
> +	int ret;
> +
> +	region.bdev = info->dev->bdev;
> +	region.sector = 0;
> +	region.count = INSITU_COMP_BLOCK_SIZE >> 9;
> +
> +	req.bi_rw = rw;
> +	req.mem.type = DM_IO_KMEM;
> +	req.mem.offset = 0;
> +	req.mem.ptr.addr = addr;
> +	req.notify.fn = NULL;
> +	req.client = info->io_client;
> +
> +	ret = dm_io(&req, 1, &region, &io_error);
> +	if (ret || io_error)
> +		return -EIO;
> +	return 0;
> +}
> +
> +static void insitu_comp_meta_io_done(unsigned long error, void *context)
> +{
> +	struct insitu_comp_meta_io *meta_io = context;
> +
> +	meta_io->fn(meta_io->data, error);
> +	kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
> +}
> +
> +static int insitu_comp_write_meta(struct insitu_comp_info *info,
> +	u64 start_page, u64 end_page, void *data,
> +	void (*fn)(void *data, unsigned long error), int rw)
> +{
> +	struct insitu_comp_meta_io *meta_io;
> +
> +	BUG_ON(end_page > info->meta_bitmap_pages);
> +
> +	meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
> +	if (!meta_io) {
> +		fn(data, -ENOMEM);
> +		return -ENOMEM;
> +	}
> +	meta_io->data = data;
> +	meta_io->fn = fn;
> +
> +	meta_io->io_region.bdev = info->dev->bdev;
> +	meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
> +					(start_page << (PAGE_SHIFT - 9));
> +	meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
> +
> +	atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
> +
> +	meta_io->io_req.bi_rw = rw;
> +	meta_io->io_req.mem.type = DM_IO_VMA;
> +	meta_io->io_req.mem.offset = 0;
> +	meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
> +						(start_page << PAGE_SHIFT);
> +	meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
> +	meta_io->io_req.notify.context = meta_io;
> +	meta_io->io_req.client = info->io_client;
> +
> +	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
> +	return 0;
> +}
> +
> +struct writeback_flush_data {
> +	struct completion complete;
> +	atomic_t cnt;
> +};
> +
> +static void writeback_flush_io_done(void *data, unsigned long error)
> +{
> +	struct writeback_flush_data *wb = data;
> +
> +	if (atomic_dec_return(&wb->cnt))
> +		return;
> +	complete(&wb->complete);
> +}
> +
> +static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
> +			struct writeback_flush_data *data)
> +{
> +	struct page *page;
> +	u64 start = 0, index;
> +	u32 pending = 0, cnt = 0;
> +	bool dirty;
> +	struct blk_plug plug;
> +
> +	blk_start_plug(&plug);
> +	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
> +		if (cnt == 256) {
> +			cnt = 0;
> +			cond_resched();
> +		}
> +
> +		page = vmalloc_to_page(info->meta_bitmap +
> +					(index << PAGE_SHIFT));
> +		dirty = TestClearPageDirty(page);
> +
> +		if (pending == 0 && dirty) {
> +			start = index;
> +			pending++;
> +			continue;
> +		} else if (pending == 0)
> +			continue;
> +		else if (pending > 0 && dirty) {
> +			pending++;
> +			continue;
> +		}
> +
> +		/* pending > 0 && !dirty */
> +		atomic_inc(&data->cnt);
> +		insitu_comp_write_meta(info, start, start + pending, data,
> +			writeback_flush_io_done, WRITE);
> +		pending = 0;
> +	}
> +
> +	if (pending > 0) {
> +		atomic_inc(&data->cnt);
> +		insitu_comp_write_meta(info, start, start + pending, data,
> +			writeback_flush_io_done, WRITE);
> +	}
> +	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
> +	blk_finish_plug(&plug);
> +}
> +
> +/* writeback thread flushes all dirty metadata to disk at an interval */
> +static int insitu_comp_meta_writeback_thread(void *data)
> +{
> +	struct insitu_comp_info *info = data;
> +	struct writeback_flush_data wb;
> +
> +	atomic_set(&wb.cnt, 1);
> +	init_completion(&wb.complete);
> +
> +	while (!kthread_should_stop()) {
> +		schedule_timeout_interruptible(
> +			msecs_to_jiffies(info->writeback_delay * 1000));
> +		insitu_comp_flush_dirty_meta(info, &wb);
> +	}
> +
> +	insitu_comp_flush_dirty_meta(info, &wb);
> +
> +	writeback_flush_io_done(&wb, 0);
> +	wait_for_completion(&wb.complete);
> +	return 0;
> +}
> +
> +static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
> +{
> +	struct dm_io_region region;
> +	struct dm_io_request req;
> +	unsigned long io_error = 0;
> +	struct blk_plug plug;
> +	int ret;
> +	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +
> +	len *= sizeof(unsigned long);
> +
> +	region.bdev = info->dev->bdev;
> +	region.sector = INSITU_COMP_META_START_SECTOR;
> +	region.count = (len + 511) >> 9;
> +
> +	req.mem.type = DM_IO_VMA;
> +	req.mem.offset = 0;
> +	req.mem.ptr.addr = info->meta_bitmap;
> +	req.notify.fn = NULL;
> +	req.client = info->io_client;
> +
> +	blk_start_plug(&plug);
> +	if (new) {
> +		memset(info->meta_bitmap, 0, len);
> +		req.bi_rw = WRITE_FLUSH;
> +		ret = dm_io(&req, 1, &region, &io_error);
> +	} else {
> +		req.bi_rw = READ;
> +		ret = dm_io(&req, 1, &region, &io_error);
> +	}
> +	blk_finish_plug(&plug);
> +
> +	if (ret || io_error) {
> +		info->ti->error = "Access metadata error";
> +		return -EIO;
> +	}
> +
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +		info->writeback_tsk = kthread_run(
> +			insitu_comp_meta_writeback_thread,
> +			info, "insitu_comp_writeback");
> +		if (IS_ERR(info->writeback_tsk)) { /* kthread_run() returns ERR_PTR */
> +			info->ti->error = "Create writeback thread error";
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		info->tfm[i] = crypto_alloc_comp(
> +			compressors[info->comp_alg].name, 0, 0);
> +		if (IS_ERR(info->tfm[i])) {
> +			info->tfm[i] = NULL;
> +			goto err;
> +		}
> +	}
> +	return 0;
> +err:
> +	for_each_possible_cpu(i) {
> +		if (info->tfm[i]) {
> +			crypto_free_comp(info->tfm[i]);
> +			info->tfm[i] = NULL;
> +		}
> +	}
> +	return -ENOMEM;
> +}
> +
> +static void insitu_comp_free_compressor(struct insitu_comp_info *info)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		if (info->tfm[i]) {
> +			crypto_free_comp(info->tfm[i]);
> +			info->tfm[i] = NULL;
> +		}
> +	}
> +}
> +
> +static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
> +{
> +	void *addr;
> +	struct insitu_comp_super_block *super;
> +	u64 total_blocks;
> +	u64 data_blocks, meta_blocks;
> +	u32 rem, cnt;
> +	bool new_super = false;
> +	int ret;
> +	ssize_t len;
> +
> +	total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
> +					INSITU_COMP_BLOCK_SHIFT;
> +	data_blocks = total_blocks - 1;
> +	rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
> +			INSITU_COMP_META_BITS);
> +	meta_blocks = data_blocks * INSITU_COMP_META_BITS;
> +	data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
> +
> +	cnt = rem;
> +	rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +	data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
> +				INSITU_COMP_META_BITS);
> +	meta_blocks += rem;
> +
> +	cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +	meta_blocks += 1;
> +	data_blocks += cnt - 1;
> +
> +	info->data_blocks = data_blocks;
> +	info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +
> +	addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
> +	if (!addr) {
> +		info->ti->error = "Cannot allocate super";
> +		return -ENOMEM;
> +	}
> +
> +	super = addr;
> +	ret = insitu_comp_access_super(info, addr, READ);
> +	if (ret)
> +		goto out;
> +
> +	if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
> +		if (le64_to_cpu(super->version) != INSITU_COMP_VERSION ||
> +		    le64_to_cpu(super->meta_blocks) != meta_blocks ||
> +		    le64_to_cpu(super->data_blocks) != data_blocks) {
> +			info->ti->error = "Super is invalid";
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
> +			info->ti->error =
> +					"Compression algorithm is not supported";
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	} else {
> +		super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
> +		super->version = cpu_to_le64(INSITU_COMP_VERSION);
> +		super->meta_blocks = cpu_to_le64(meta_blocks);
> +		super->data_blocks = cpu_to_le64(data_blocks);
> +		super->comp_alg = default_compressor;
> +		ret = insitu_comp_access_super(info, addr, WRITE_FUA);
> +		if (ret) {
> +			info->ti->error = "Access super fails";
> +			goto out;
> +		}
> +		new_super = true;
> +	}
> +
> +	info->comp_alg = super->comp_alg;
> +	if (insitu_comp_alloc_compressor(info)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
> +	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +	len *= sizeof(unsigned long);
> +	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
> +	if (!info->meta_bitmap) {
> +		ret = -ENOMEM;
> +		goto bitmap_err;
> +	}
> +
> +	ret = insitu_comp_init_meta(info, new_super);
> +	if (ret)
> +		goto meta_err;
> +
> +	goto out;	/* ret is 0 here; the super buffer must be freed too */
> +meta_err:
> +	vfree(info->meta_bitmap);
> +bitmap_err:
> +	insitu_comp_free_compressor(info);
> +out:
> +	kfree(addr);
> +	return ret;
> +}
> +
> +/*
> + * <dev> <writethrough>/<writeback> <meta_commit_delay>
> + */
> +static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct insitu_comp_info *info;
> +	char write_mode[15];
> +	int ret, i;
> +
> +	if (argc < 2) {
> +		ti->error = "Invalid argument count";
> +		return -EINVAL;
> +	}
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info) {
> +		ti->error = "Cannot allocate context";
> +		return -ENOMEM;
> +	}
> +	info->ti = ti;
> +
> +	if (sscanf(argv[1], "%14s", write_mode) != 1) {
> +		ti->error = "Invalid argument";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	if (strcmp(write_mode, "writeback") == 0) {
> +		if (argc != 3) {
> +			ti->error = "Invalid argument";
> +			ret = -EINVAL;
> +			goto err_para;
> +		}
> +		info->write_mode = INSITU_COMP_WRITE_BACK;
> +		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
> +			ti->error = "Invalid argument";
> +			ret = -EINVAL;
> +			goto err_para;
> +		}
> +	} else if (strcmp(write_mode, "writethrough") == 0) {
> +		info->write_mode = INSITU_COMP_WRITE_THROUGH;
> +	} else {
> +		ti->error = "Invalid argument";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +							&info->dev)) {
> +		ti->error = "Can't get device";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	info->io_client = dm_io_client_create();
> +	if (IS_ERR(info->io_client)) {
> +		ti->error = "Can't create io client";
> +		ret = -EINVAL;
> +		goto err_ioclient;
> +	}
> +
> +	if (bdev_logical_block_size(info->dev->bdev) != 512) {
> +		ti->error = "Logical block size must be 512 bytes";
> +		ret = -EINVAL;
> +		goto err_blocksize;
> +	}
> +
> +	ret = insitu_comp_read_or_create_super(info);
> +	if (ret)
> +		goto err_blocksize;
> +
> +	for (i = 0; i < BITMAP_HASH_LEN; i++) {
> +		info->bitmap_locks[i].io_running = 0;
> +		spin_lock_init(&info->bitmap_locks[i].wait_lock);
> +		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
> +	}
> +
> +	atomic64_set(&info->compressed_write_size, 0);
> +	atomic64_set(&info->uncompressed_write_size, 0);
> +	atomic64_set(&info->meta_write_size, 0);
> +	ti->num_flush_bios = 1;
> +	/* doesn't support discard yet */
> +	ti->per_bio_data_size = sizeof(struct insitu_comp_req);
> +	ti->private = info;
> +	return 0;
> +err_blocksize:
> +	dm_io_client_destroy(info->io_client);
> +err_ioclient:
> +	dm_put_device(ti, info->dev);
> +err_para:
> +	kfree(info);
> +	return ret;
> +}
> +
> +static void insitu_comp_dtr(struct dm_target *ti)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +		kthread_stop(info->writeback_tsk);
> +	insitu_comp_free_compressor(info);
> +	vfree(info->meta_bitmap);
> +	dm_io_client_destroy(info->io_client);
> +	dm_put_device(ti, info->dev);
> +	kfree(info);
> +}
> +
> +static u64 insitu_comp_sector_to_block(sector_t sect)
> +{
> +	return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
> +{
> +	return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
> +			BITMAP_HASH_MASK];
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_trylock_block(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req, u64 block_index)
> +{
> +	struct insitu_comp_hash_lock *hash_lock;
> +
> +	hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
> +
> +	spin_lock_irq(&hash_lock->wait_lock);
> +	if (!hash_lock->io_running) {
> +		hash_lock->io_running = 1;
> +		spin_unlock_irq(&hash_lock->wait_lock);
> +		return hash_lock;
> +	}
> +	list_add_tail(&req->sibling, &hash_lock->wait_list);
> +	spin_unlock_irq(&hash_lock->wait_lock);
> +	return NULL;
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +	struct list_head *list);
> +static void insitu_comp_unlock_block(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
> +{
> +	LIST_HEAD(pending_list);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&hash_lock->wait_lock, flags);
> +	/* wakeup all pending reqs to avoid live lock */
> +	list_splice_init(&hash_lock->wait_list, &pending_list);
> +	hash_lock->io_running = 0;
> +	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
> +
> +	insitu_comp_queue_req_list(info, &pending_list);
> +}
> +
> +static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
> +{
> +	insitu_comp_unlock_block(req->info, req, req->lock);
> +}
> +
> +/* See the comment on HASH_LOCK_SHIFT: each request only needs to take one lock */
> +static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
> +{
> +	u64 block_index, tmp;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +	tmp = insitu_comp_sector_to_block(insitu_req_end_sector(req) - 1);
> +	BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
> +			insitu_comp_block_hash_lock(req->info, tmp));
> +
> +	req->lock = insitu_comp_trylock_block(req->info, req, block_index);
> +	if (!req->lock)
> +		return 0;
> +
> +	return 1;
> +}
> +
> +static void insitu_comp_queue_req(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req)
> +{
> +	unsigned long flags;
> +	struct insitu_comp_io_worker *worker =
> +		&insitu_comp_io_workers[req->cpu];
> +
> +	spin_lock_irqsave(&worker->lock, flags);
> +	list_add_tail(&req->sibling, &worker->pending);
> +	spin_unlock_irqrestore(&worker->lock, flags);
> +
> +	queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +	struct list_head *list)
> +{
> +	struct insitu_comp_req *req;
> +	while (!list_empty(list)) {
> +		req = list_first_entry(list, struct insitu_comp_req, sibling);
> +		list_del_init(&req->sibling);
> +		insitu_comp_queue_req(info, req);
> +	}
> +}
> +
> +static void insitu_comp_get_req(struct insitu_comp_req *req)
> +{
> +	atomic_inc(&req->io_pending);
> +}
> +
> +static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
> +{
> +	kfree(io->decomp_data);
> +	kfree(io->comp_data);
> +	kmem_cache_free(insitu_comp_io_range_cachep, io);
> +}
> +
> +static void insitu_comp_put_req(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	if (atomic_dec_return(&req->io_pending))
> +		return;
> +
> +	if (req->stage == STAGE_INIT) /* waiting for locking */
> +		return;
> +
> +	if (req->stage == STAGE_READ_DECOMP ||
> +	    req->stage == STAGE_WRITE_COMP ||
> +	    req->result)
> +		req->stage = STAGE_DONE;
> +
> +	if (req->stage != STAGE_DONE) {
> +		insitu_comp_queue_req(req->info, req);
> +		return;
> +	}
> +
> +	while (!list_empty(&req->all_io)) {
> +		io = list_entry(req->all_io.next, struct insitu_comp_io_range,
> +			next);
> +		list_del(&io->next);
> +		insitu_comp_free_io_range(io);
> +	}
> +
> +	insitu_comp_unlock_req_range(req);
> +
> +	insitu_req_endio(req, req->result);
> +}
> +
> +static void insitu_comp_io_range_done(unsigned long error, void *context)
> +{
> +	struct insitu_comp_io_range *io = context;
> +
> +	if (error)
> +		io->req->result = error;
> +	insitu_comp_put_req(io->req);
> +}
> +
> +static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
> +	int len)
> +{
> +	if (compressors[info->comp_alg].comp_len)
> +		return compressors[info->comp_alg].comp_len(len);
> +	return len;
> +}
> +
> +/*
> + * caller should set region.sector, region.count. bi_rw. IO always to/from
> + * comp_data
> + */
> +static struct insitu_comp_io_range *
> +insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
> +	int decomp_len)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
> +	if (!io)
> +		return NULL;
> +
> +	io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
> +								GFP_NOIO);
> +	io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
> +	if (!io->decomp_data || !io->comp_data) {
> +		kfree(io->decomp_data);
> +		kfree(io->comp_data);
> +		kmem_cache_free(insitu_comp_io_range_cachep, io);
> +		return NULL;
> +	}
> +
> +	io->io_req.notify.fn = insitu_comp_io_range_done;
> +	io->io_req.notify.context = io;
> +	io->io_req.client = req->info->io_client;
> +	io->io_req.mem.type = DM_IO_KMEM;
> +	io->io_req.mem.ptr.addr = io->comp_data;
> +	io->io_req.mem.offset = 0;
> +
> +	io->io_region.bdev = req->info->dev->bdev;
> +
> +	io->decomp_len = decomp_len;
> +	io->comp_len = comp_len;
> +	io->req = req;
> +	return io;
> +}
> +
> +static void insitu_comp_req_copy(struct insitu_comp_req *req, off_t req_off, void *buf,
> +		ssize_t len, bool to_buf)
> +{
> +	struct bio *bio = req->bio;
> +	struct bvec_iter iter;
> +	off_t buf_off = 0;
> +	ssize_t size;
> +	void *addr;
> +
> +	iter = bio->bi_iter;
> +	bio_advance_iter(bio, &iter, req_off);
> +
> +	while (len) {
> +		addr = kmap_atomic(bio_iter_page(bio, iter));
> +		size = min_t(ssize_t, len, bio_iter_len(bio, iter));
> +		if (to_buf)
> +			memcpy(buf + buf_off, addr + bio_iter_offset(bio, iter),
> +				size);
> +		else
> +			memcpy(addr + bio_iter_offset(bio, iter), buf + buf_off,
> +				size);
> +		kunmap_atomic(addr);
> +
> +		buf_off += size;
> +		len -= size;
> +
> +		bio_advance_iter(bio, &iter, size);
> +	}
> +}
> +
> +/*
> + * return value:
> + * < 0 : error
> + * == 0 : ok
> + * == 1 : ok, but comp/decomp is skipped
> + * Compressed data size is roundup of 512, which makes the payload.
> + * We store the actual compressed length in the last u32 of the payload.
> + * If there is no free space, we add 512 to the payload size.
> + */
> +static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
> +	void *comp_data, unsigned int *comp_len, void *decomp_data,
> +	unsigned int decomp_len, bool do_comp)
> +{
> +	struct crypto_comp *tfm;
> +	u32 *addr;
> +	unsigned int actual_comp_len;
> +	int ret;
> +
> +	if (do_comp) {
> +		actual_comp_len = *comp_len;
> +
> +		tfm = info->tfm[get_cpu()];
> +		ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
> +			comp_data, &actual_comp_len);
> +		put_cpu();
> +
> +		atomic64_add(decomp_len, &info->uncompressed_write_size);
> +		if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
> +			*comp_len = decomp_len;
> +			atomic64_add(*comp_len, &info->compressed_write_size);
> +			return 1;
> +		}
> +
> +		*comp_len = round_up(actual_comp_len, 512);
> +		if (*comp_len - actual_comp_len < sizeof(u32))
> +			*comp_len += 512;
> +		atomic64_add(*comp_len, &info->compressed_write_size);
> +		addr = comp_data + *comp_len;
> +		addr--;
> +		*addr = cpu_to_le32(actual_comp_len);
> +	} else {
> +		if (*comp_len == decomp_len)
> +			return 1;
> +		addr = comp_data + *comp_len;
> +		addr--;
> +		actual_comp_len = le32_to_cpu(*addr);
> +
> +		tfm = info->tfm[get_cpu()];
> +		ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
> +			decomp_data, &decomp_len);
> +		put_cpu();
> +		if (ret)
> +			return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * compressed data is updated. We decompress it and fill req. If there is no
> + * valid compressed data, we just zero req
> + */
> +static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +	off_t req_off = 0;
> +	int ret;
> +
> +	req->stage = STAGE_READ_DECOMP;
> +
> +	if (req->result)
> +		return;
> +
> +	list_for_each_entry(io, &req->all_io, next) {
> +		ssize_t dst_off = 0, src_off = 0, len;
> +
> +		io->io_region.sector -= req->info->data_start;
> +
> +		/* Do decomp here */
> +		ret = insitu_comp_io_range_comp(req->info, io->comp_data,
> +			&io->comp_len, io->decomp_data, io->decomp_len, false);
> +		if (ret < 0) {
> +			req->result = -EIO;
> +			return;
> +		}
> +
> +		if (io->io_region.sector >= insitu_req_start_sector(req))
> +			dst_off = (io->io_region.sector - insitu_req_start_sector(req))
> +				<< 9;
> +		else
> +			src_off = (insitu_req_start_sector(req) - io->io_region.sector)
> +				<< 9;
> +		len = min_t(ssize_t, io->decomp_len - src_off,
> +			(insitu_req_sectors(req) << 9) - dst_off);
> +
> +		/* io range in all_io list is ordered for read IO */
> +		while (req_off != dst_off) {
> +			ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +					dst_off - req_off);
> +			insitu_comp_req_copy(req, req_off,
> +				empty_zero_page, size, false);
> +			req_off += size;
> +		}
> +
> +		if (ret == 1) /* uncompressed, valid data is in .comp_data */
> +			insitu_comp_req_copy(req, dst_off,
> +					io->comp_data + src_off, len, false);
> +		else
> +			insitu_comp_req_copy(req, dst_off,
> +					io->decomp_data + src_off, len, false);
> +		req_off = dst_off + len;
> +	}
> +
> +	while (req_off != (insitu_req_sectors(req) << 9)) {
> +		ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +			(insitu_req_sectors(req) << 9) - req_off);
> +		insitu_comp_req_copy(req, req_off, empty_zero_page,
> +			size, false);
> +		req_off += size;
> +	}
> +}
> +
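
The dst_off/src_off/len arithmetic in the read path above can be looked at in isolation. This is only a sketch under the assumption that all positions are given in 512B sectors relative to the data area; struct copy_window and extent_window() are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: where one extent's data lands in the request buffer. */
struct copy_window {
	int64_t dst_off;	/* byte offset into the request buffer */
	int64_t src_off;	/* byte offset into the extent's data */
	int64_t len;		/* number of bytes to copy */
};

/*
 * Hypothetical helper mirroring the overlap computation in the read path:
 * an extent starting at ext_sector with ext_bytes of uncompressed data is
 * clipped against a request of req_sectors sectors starting at req_sector.
 */
static struct copy_window extent_window(uint64_t ext_sector, int64_t ext_bytes,
					uint64_t req_sector,
					uint64_t req_sectors)
{
	struct copy_window w = { 0, 0, 0 };
	int64_t avail_src, avail_dst;

	if (ext_sector >= req_sector)
		/* Extent starts inside the request: skip into the request. */
		w.dst_off = (int64_t)((ext_sector - req_sector) << 9);
	else
		/* Extent starts before the request: skip into the extent. */
		w.src_off = (int64_t)((req_sector - ext_sector) << 9);

	avail_src = ext_bytes - w.src_off;
	avail_dst = (int64_t)(req_sectors << 9) - w.dst_off;
	w.len = avail_src < avail_dst ? avail_src : avail_dst;
	return w;
}
```

For a 16k extent at sector 0 and a 4k read at sector 8, this yields src_off = 4096, dst_off = 0 and len = 4096. Any gap between the end of one window and the dst_off of the next is what the read path fills from the zero page.
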
> +/*
> + * Read one extent's data from disk. The extent starts at block @block and has
> + * @data_sectors sectors of data.
> + */
> +static void insitu_comp_read_one_extent(struct insitu_comp_req *req, u64 block,
> +	u16 logical_sectors, u16 data_sectors)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	io = insitu_comp_create_io_range(req, data_sectors << 9,
> +		logical_sectors << 9);
> +	if (!io) {
> +		req->result = -EIO;
> +		return;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	list_add_tail(&io->next, &req->all_io);
> +
> +	io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
> +				req->info->data_start;
> +	io->io_region.count = data_sectors;
> +
> +	io->io_req.bi_rw = READ;
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +}
> +
> +static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
> +{
> +	u64 block_index, first_block_index;
> +	u16 logical_sectors, data_sectors;
> +
> +	req->stage = STAGE_READ_EXISTING;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +again:
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0)
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +
> +	block_index = first_block_index + (logical_sectors >>
> +				INSITU_COMP_BLOCK_SECTOR_SHIFT);
> +	/* the request might cover several extents */
> +	if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
> +			insitu_req_end_sector(req))
> +		goto again;
> +
> +	/* A shortcut if all data is in already */
> +	if (list_empty(&req->all_io))
> +		insitu_comp_handle_read_decomp(req);
> +}
> +
> +static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
> +{
> +	insitu_comp_get_req(req);
> +
> +	if (req->stage == STAGE_INIT) {
> +		if (!insitu_comp_lock_req_range(req)) {
> +			insitu_comp_put_req(req);
> +			return;
> +		}
> +
> +		insitu_comp_handle_read_read_existing(req);
> +	} else if (req->stage == STAGE_READ_EXISTING)
> +		insitu_comp_handle_read_decomp(req);
> +
> +	insitu_comp_put_req(req);
> +}
> +
> +static void insitu_comp_write_meta_done(void *context, unsigned long error)
> +{
> +	struct insitu_comp_req *req = context;
> +	insitu_comp_put_req(req);
> +}
> +
> +static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
> +{
> +	u64 bits = block * INSITU_COMP_META_BITS - !!end;
> +	/* (1 << 3) bits per byte */
> +	return bits >> (3 + PAGE_SHIFT);
> +}
> +
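
To make the meta page arithmetic above concrete, here is a small userspace sketch. It assumes 4k pages (the patch uses PAGE_SHIFT); META_BITS and SKETCH_PAGE_SHIFT are stand-in names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define META_BITS 5		/* INSITU_COMP_META_BITS in the patch */
#define SKETCH_PAGE_SHIFT 12	/* assumption: 4k pages */

/*
 * Userspace copy of the computation: index of the bitmap page holding
 * the first meta bit of @block or, when @end is set, the last meta bit
 * of the block just before @block (the end of a [start, end) range).
 */
static uint64_t meta_page_index(uint64_t block, int end)
{
	uint64_t bits;

	/* end is only meaningful for a non-empty range, so block > 0. */
	assert(!(end && block == 0));

	bits = block * META_BITS - (end ? 1 : 0);
	/* 8 bits per byte, (1 << SKETCH_PAGE_SHIFT) bytes per page */
	return bits >> (3 + SKETCH_PAGE_SHIFT);
}
```

A 4k page holds 32768 bits, which is not a multiple of 5, so the meta bits of block 6553 (bits 32765 to 32769) straddle pages 0 and 1. meta_page_index(6553, 0) is 0 while meta_page_index(6554, 1) is 1, so a write touching that block dirties both pages.
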
> +/*
> + * the request covers some extents partially. Decompress data of the extents,
> + * compress remaining valid data, and finally write them out
> + */
> +static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
> +	u64 *meta_start, u64 *meta_end, bool *handle_req)
> +{
> +	struct insitu_comp_req *req = io->req;
> +	sector_t start, count;
> +	unsigned int comp_len;
> +	off_t offset;
> +	u64 page_index;
> +	int ret;
> +
> +	io->io_region.sector -= req->info->data_start;
> +
> +	/* decompress original data */
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
> +			io->decomp_data, io->decomp_len, false);
> +	if (ret < 0) {
> +		req->result = -EINVAL;
> +		return -EIO;
> +	}
> +
> +	start = io->io_region.sector;
> +	count = io->decomp_len >> 9;
> +	if (start < insitu_req_start_sector(req) && start + count >
> +					insitu_req_end_sector(req)) {
> +		/* we don't split an extent */
> +		if (ret == 1) {
> +			memcpy(io->decomp_data, io->comp_data, io->decomp_len);
> +			insitu_comp_req_copy(req, 0,
> +			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> +			   9), insitu_req_sectors(req) << 9, true);
> +		} else {
> +			insitu_comp_req_copy(req, 0,
> +			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> +			   9), insitu_req_sectors(req) << 9, true);
> +			kfree(io->comp_data);
> +			/* New compressed len might be bigger */
> +			io->comp_data = kmalloc(insitu_comp_compressor_len(
> +				req->info, io->decomp_len), GFP_NOIO);
> +			io->comp_len = io->decomp_len;
> +			if (!io->comp_data) {
> +				req->result = -ENOMEM;
> +				return -EIO;
> +			}
> +			io->io_req.mem.ptr.addr = io->comp_data;
> +		}
> +		/* need compress data */
> +		ret = 0;
> +		offset = 0;
> +		*handle_req = false;
> +	} else if (start < insitu_req_start_sector(req)) {
> +		count = insitu_req_start_sector(req) - start;
> +		offset = 0;
> +	} else {
> +		offset = insitu_req_end_sector(req) - start;
> +		start = insitu_req_end_sector(req);
> +		count = count - offset;
> +	}
> +
> +	/* Original data is uncompressed, we don't need writeback */
> +	if (ret == 1) {
> +		comp_len = count << 9;
> +		goto handle_meta;
> +	}
> +
> +	/* assume compressing less data uses less space (at least 4k less data) */
> +	comp_len = io->comp_len;
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +		io->decomp_data + (offset << 9), count << 9, true);
> +	if (ret < 0) {
> +		req->result = -EIO;
> +		return -EIO;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	if (ret == 1)
> +		io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
> +	io->io_region.count = comp_len >> 9;
> +	io->io_region.sector = start + req->info->data_start;
> +
> +	io->io_req.bi_rw = insitu_req_rw(req);
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +handle_meta:
> +	insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +	page_index = insitu_comp_block_meta_page_index(start >>
> +					INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> +	if (*meta_start > page_index)
> +		*meta_start = page_index;
> +	page_index = insitu_comp_block_meta_page_index(
> +		(start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
> +	if (*meta_end < page_index)
> +		*meta_end = page_index;
> +	return 0;
> +}
> +
> +/* Compress data and write it out */
> +static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +	sector_t count;
> +	unsigned int comp_len;
> +	u64 meta_start = -1L, meta_end = 0, page_index;
> +	int ret;
> +	bool handle_req = true;
> +
> +	req->stage = STAGE_WRITE_COMP;
> +
> +	if (req->result)
> +		return;
> +
> +	list_for_each_entry(io, &req->all_io, next) {
> +		if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
> +						&handle_req))
> +			return;
> +	}
> +
> +	if (!handle_req)
> +		goto update_meta;
> +
> +	count = insitu_req_sectors(req);
> +	io = insitu_comp_create_io_range(req, count << 9, count << 9);
> +	if (!io) {
> +		req->result = -EIO;
> +		return;
> +	}
> +	insitu_comp_req_copy(req, 0, io->decomp_data, count << 9, true);
> +
> +	/* compress data */
> +	comp_len = io->comp_len;
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +		io->decomp_data, count << 9, true);
> +	if (ret < 0) {
> +		insitu_comp_free_io_range(io);
> +		req->result = -EIO;
> +		return;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	list_add_tail(&io->next, &req->all_io);
> +	io->io_region.sector = insitu_req_start_sector(req) + req->info->data_start;
> +	if (ret == 1)
> +		io->io_req.mem.ptr.addr = io->decomp_data;
> +	io->io_region.count = comp_len >> 9;
> +	io->io_req.bi_rw = insitu_req_rw(req);
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +	insitu_comp_set_extent(req,
> +		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +	page_index = insitu_comp_block_meta_page_index(
> +		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> +	if (meta_start > page_index)
> +		meta_start = page_index;
> +	page_index = insitu_comp_block_meta_page_index(
> +		(insitu_req_start_sector(req) + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		true);
> +	if (meta_end < page_index)
> +		meta_end = page_index;
> +update_meta:
> +	if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
> +						(insitu_req_rw(req) & REQ_FUA)) {
> +		insitu_comp_get_req(req);
> +		insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
> +			insitu_comp_write_meta_done, insitu_req_rw(req));
> +	}
> +}
> +
> +/* request might cover some extents partially, read them first */
> +static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
> +{
> +	u64 block_index, first_block_index;
> +	u16 logical_sectors, data_sectors;
> +
> +	req->stage = STAGE_READ_EXISTING;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0 && (first_block_index < block_index ||
> +	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +	    insitu_comp_sector_to_block(insitu_req_end_sector(req))))
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +
> +	if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
> +	    insitu_comp_sector_to_block(insitu_req_end_sector(req)))
> +		goto out;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_end_sector(req)) - 1;
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0 &&
> +	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +	    block_index + 1)
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +out:
> +	if (list_empty(&req->all_io))
> +		insitu_comp_handle_write_comp(req);
> +}
> +
> +static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
> +{
> +	insitu_comp_get_req(req);
> +
> +	if (req->stage == STAGE_INIT) {
> +		if (!insitu_comp_lock_req_range(req)) {
> +			insitu_comp_put_req(req);
> +			return;
> +		}
> +
> +		insitu_comp_handle_write_read_existing(req);
> +	} else if (req->stage == STAGE_READ_EXISTING)
> +		insitu_comp_handle_write_comp(req);
> +
> +	insitu_comp_put_req(req);
> +}
> +
> +/* For writeback mode */
> +static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
> +{
> +	struct writeback_flush_data wb;
> +
> +	atomic_set(&wb.cnt, 1);
> +	init_completion(&wb.complete);
> +
> +	insitu_comp_flush_dirty_meta(req->info, &wb);
> +
> +	writeback_flush_io_done(&wb, 0);
> +	wait_for_completion(&wb.complete);
> +
> +	insitu_req_endio(req, 0);
> +}
> +
> +static void insitu_comp_handle_request(struct insitu_comp_req *req)
> +{
> +	if (insitu_req_rw(req) & REQ_FLUSH)
> +		insitu_comp_handle_flush_request(req);
> +	else if (insitu_req_rw(req) & REQ_WRITE)
> +		insitu_comp_handle_write_request(req);
> +	else
> +		insitu_comp_handle_read_request(req);
> +}
> +
> +static void insitu_comp_do_request_work(struct work_struct *work)
> +{
> +	struct insitu_comp_io_worker *worker = container_of(work,
> +			struct insitu_comp_io_worker, work);
> +	LIST_HEAD(list);
> +	struct insitu_comp_req *req;
> +	struct blk_plug plug;
> +	bool repeat;
> +
> +	blk_start_plug(&plug);
> +again:
> +	spin_lock_irq(&worker->lock);
> +	list_splice_init(&worker->pending, &list);
> +	spin_unlock_irq(&worker->lock);
> +
> +	repeat = !list_empty(&list);
> +	while (!list_empty(&list)) {
> +		req = list_first_entry(&list, struct insitu_comp_req, sibling);
> +		list_del(&req->sibling);
> +
> +		insitu_comp_handle_request(req);
> +	}
> +	if (repeat)
> +		goto again;
> +	blk_finish_plug(&plug);
> +}
> +
> +static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +	struct insitu_comp_req *req;
> +
> +	req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
> +
> +	if ((bio->bi_rw & REQ_FLUSH) &&
> +			info->write_mode == INSITU_COMP_WRITE_THROUGH) {
> +		bio->bi_bdev = info->dev->bdev;
> +		return DM_MAPIO_REMAPPED;
> +	}
> +
> +	req->bio = bio;
> +	req->info = info;
> +	atomic_set(&req->io_pending, 0);
> +	INIT_LIST_HEAD(&req->all_io);
> +	req->result = 0;
> +	req->stage = STAGE_INIT;
> +
> +	req->cpu = raw_smp_processor_id();
> +	insitu_comp_queue_req(info, req);
> +
> +	return DM_MAPIO_SUBMITTED;
> +}
> +
> +/*
> + * INFO: uncompressed_data_size compressed_data_size metadata_size
> + * TABLE: writethrough/writeback commit_delay
> + */
> +static void insitu_comp_status(struct dm_target *ti, status_type_t type,
> +			  unsigned status_flags, char *result, unsigned maxlen)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +	unsigned int sz = 0;
> +
> +	switch (type) {
> +	case STATUSTYPE_INFO:
> +		DMEMIT("%llu %llu %llu",
> +			(unsigned long long)atomic64_read(&info->uncompressed_write_size),
> +			(unsigned long long)atomic64_read(&info->compressed_write_size),
> +			(unsigned long long)atomic64_read(&info->meta_write_size));
> +		break;
> +	case STATUSTYPE_TABLE:
> +		if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +			DMEMIT("%s %s %d", info->dev->name, "writeback",
> +				info->writeback_delay);
> +		else
> +			DMEMIT("%s %s", info->dev->name, "writethrough");
> +		break;
> +	}
> +}
> +
> +static int insitu_comp_iterate_devices(struct dm_target *ti,
> +				  iterate_devices_callout_fn fn, void *data)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +
> +	return fn(ti, info->dev, info->data_start,
> +		info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
> +}
> +
> +static void insitu_comp_io_hints(struct dm_target *ti,
> +			    struct queue_limits *limits)
> +{
> +	/* No blk_limits_logical_block_size */
> +	limits->logical_block_size = limits->physical_block_size =
> +		limits->io_min = INSITU_COMP_BLOCK_SIZE;
> +	blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
> +}
> +
> +static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> +			struct bio_vec *biovec, int max_size)
> +{
> +	/* Guarantee request can only cover one aligned 128k range */
> +	return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
> +			((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
> +}
> +
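
The merge limit above is easier to check with numbers. This is a userspace sketch of the same arithmetic done in 64-bit signed math; merge_limit() and MAX_SIZE are hypothetical names (MAX_SIZE stands in for INSITU_COMP_MAX_SIZE).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SIZE (128 * 1024)	/* stands in for INSITU_COMP_MAX_SIZE */

/*
 * Hypothetical helper: how many more bytes may be added to a bio that
 * starts at bi_sector and already carries bi_size bytes, so that the
 * bio stays inside one aligned 128k window.
 */
static int merge_limit(uint64_t bi_sector, unsigned int bi_size, int max_size)
{
	/* Byte offset of the bio start inside its aligned 128k window. */
	int64_t off = (int64_t)((bi_sector << 9) % MAX_SIZE);
	/* Bytes left in the window after what the bio already holds. */
	int64_t room = (int64_t)MAX_SIZE - (int64_t)bi_size - off;

	return max_size < room ? max_size : (int)room;
}
```

A bio starting at sector 248 sits 124k into its window, so merge_limit(248, 0, 8192) returns 4096, and once it holds 4096 bytes the limit drops to 0.
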
> +static struct target_type insitu_comp_target = {
> +	.name   = "insitu_comp",
> +	.version = {1, 0, 0},
> +	.module = THIS_MODULE,
> +	.ctr    = insitu_comp_ctr,
> +	.dtr    = insitu_comp_dtr,
> +	.map    = insitu_comp_map,
> +	.status = insitu_comp_status,
> +	.iterate_devices = insitu_comp_iterate_devices,
> +	.io_hints = insitu_comp_io_hints,
> +	.merge = insitu_comp_merge,
> +};
> +
> +static int __init insitu_comp_init(void)
> +{
> +	int r;
> +
> +	for (r = 0; r < ARRAY_SIZE(compressors); r++)
> +		if (crypto_has_comp(compressors[r].name, 0, 0))
> +			break;
> +	if (r >= ARRAY_SIZE(compressors)) {
> +		DMWARN("No crypto compressors are supported");
> +		return -EINVAL;
> +	}
> +
> +	default_compressor = r;
> +
> +	r = -ENOMEM;
> +	insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
> +		sizeof(struct insitu_comp_io_range), 0, 0, NULL);
> +	if (!insitu_comp_io_range_cachep) {
> +		DMWARN("Can't create io_range cache");
> +		goto err;
> +	}
> +
> +	insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
> +		sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
> +	if (!insitu_comp_meta_io_cachep) {
> +		DMWARN("Can't create meta_io cache");
> +		goto err;
> +	}
> +
> +	insitu_comp_wq = alloc_workqueue("insitu_comp_io",
> +		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> +	if (!insitu_comp_wq) {
> +		DMWARN("Can't create io workqueue");
> +		goto err;
> +	}
> +
> +	for_each_possible_cpu(r) {
> +		INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
> +		spin_lock_init(&insitu_comp_io_workers[r].lock);
> +		INIT_WORK(&insitu_comp_io_workers[r].work,
> +			insitu_comp_do_request_work);
> +	}
> +
> +	r = dm_register_target(&insitu_comp_target);
> +	if (r < 0) {
> +		DMWARN("target registration failed");
> +		goto err;
> +	}
> +	return 0;
> +err:
> +	if (insitu_comp_io_range_cachep)
> +		kmem_cache_destroy(insitu_comp_io_range_cachep);
> +	if (insitu_comp_meta_io_cachep)
> +		kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +	if (insitu_comp_wq)
> +		destroy_workqueue(insitu_comp_wq);
> +
> +	return r;
> +}
> +
> +static void __exit insitu_comp_exit(void)
> +{
> +	dm_unregister_target(&insitu_comp_target);
> +	kmem_cache_destroy(insitu_comp_io_range_cachep);
> +	kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +	destroy_workqueue(insitu_comp_wq);
> +}
> +
> +module_init(insitu_comp_init);
> +module_exit(insitu_comp_exit);
> +
> +MODULE_AUTHOR("Shaohua Li <shli@xxxxxxxxxx>");
> +MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
> +MODULE_LICENSE("GPL");
> Index: linux/drivers/md/dm-insitu-comp.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.h	2014-02-17 18:37:07.108425465 +0800
> @@ -0,0 +1,158 @@
> +#ifndef __DM_INSITU_COMPRESSION_H__
> +#define __DM_INSITU_COMPRESSION_H__
> +#include <linux/types.h>
> +
> +struct insitu_comp_super_block {
> +	__le64 magic;
> +	__le64 version;
> +	__le64 meta_blocks;
> +	__le64 data_blocks;
> +	u8 comp_alg;
> +} __attribute__((packed));
> +
> +#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
> +#define INSITU_COMP_VERSION 1
> +#define INSITU_COMP_ALG_LZO 0
> +#define INSITU_COMP_ALG_ZLIB 1
> +
> +#ifdef __KERNEL__
> +struct insitu_comp_compressor_data {
> +	char *name;
> +	int (*comp_len)(int comp_len);
> +};
> +
> +static inline int lzo_comp_len(int comp_len)
> +{
> +	return lzo1x_worst_compress(comp_len);
> +}
> +
> +/*
> + * Minimum logical sector size of this target is 4096 bytes, which is a block.
> + * Data of a block is compressed. Compressed data is rounded up to 512B, which
> + * is the payload. For each block, we have 5 bits of meta data. Bits 0 - 3 hold
> + * the payload length, 0 - 8 sectors. If compressed payload length is 8 sectors,
> + * we just store uncompressed data. Actual compressed data length is stored in
> + * the last 32 bits of payload if data is compressed. On disk, payload is stored
> + * at the beginning of the logical sector of the block. If IO size is bigger
> + * than one block, we store the whole data as an extent. Bit 4 marks a tail
> + * block of an extent. Max allowed extent size is 128k.
> + */
> +#define INSITU_COMP_BLOCK_SIZE 4096
> +#define INSITU_COMP_BLOCK_SHIFT 12
> +#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
> +
> +#define INSITU_COMP_MIN_SIZE 4096
> +/* Changing this requires changing HASH_LOCK_SHIFT too */
> +#define INSITU_COMP_MAX_SIZE (128 * 1024)
> +
> +#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
> +#define INSITU_COMP_TAIL_MASK (1 << 4)
> +#define INSITU_COMP_META_BITS 5
> +
> +#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
> +
> +enum INSITU_COMP_WRITE_MODE {
> +	INSITU_COMP_WRITE_BACK,
> +	INSITU_COMP_WRITE_THROUGH,
> +};
> +
> +/*
> + * A request can cover one aligned 128k (4k * (1 << 5)) range. Since the maximum
> + * request size is 128k, we only need to take one lock for each request.
> + */
> +#define HASH_LOCK_SHIFT 5
> +
> +#define BITMAP_HASH_SHIFT 9
> +#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
> +#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
> +
> +struct insitu_comp_hash_lock {
> +	int io_running;
> +	spinlock_t wait_lock;
> +	struct list_head wait_list;
> +};
> +
> +struct insitu_comp_info {
> +	struct dm_target *ti;
> +	struct dm_dev *dev;
> +
> +	int comp_alg;
> +	struct crypto_comp *tfm[NR_CPUS];
> +
> +	sector_t data_start;
> +	u64 data_blocks;
> +
> +	char *meta_bitmap;
> +	u64 meta_bitmap_bits;
> +	u64 meta_bitmap_pages;
> +	struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
> +
> +	enum INSITU_COMP_WRITE_MODE write_mode;
> +	unsigned int writeback_delay; /* in seconds */
> +	struct task_struct *writeback_tsk;
> +	struct dm_io_client *io_client;
> +
> +	atomic64_t compressed_write_size;
> +	atomic64_t uncompressed_write_size;
> +	atomic64_t meta_write_size;
> +};
> +
> +struct insitu_comp_meta_io {
> +	struct dm_io_request io_req;
> +	struct dm_io_region io_region;
> +	void *data;
> +	void (*fn)(void *data, unsigned long error);
> +};
> +
> +struct insitu_comp_io_range {
> +	struct dm_io_request io_req;
> +	struct dm_io_region io_region;
> +	void *decomp_data;
> +	unsigned int decomp_len;
> +	void *comp_data;
> +	unsigned int comp_len; /* For write, this is estimated */
> +	struct list_head next;
> +	struct insitu_comp_req *req;
> +};
> +
> +enum INSITU_COMP_REQ_STAGE {
> +	STAGE_INIT,
> +	STAGE_READ_EXISTING,
> +	STAGE_READ_DECOMP,
> +	STAGE_WRITE_COMP,
> +	STAGE_DONE,
> +};
> +
> +struct insitu_comp_req {
> +	struct bio *bio;
> +	struct insitu_comp_info *info;
> +	struct list_head sibling;
> +
> +	struct list_head all_io;
> +	atomic_t io_pending;
> +	enum INSITU_COMP_REQ_STAGE stage;
> +
> +	struct insitu_comp_hash_lock *lock;
> +	int result;
> +
> +	int cpu;
> +};
> +
> +#define insitu_req_start_sector(req) (req->bio->bi_iter.bi_sector)
> +#define insitu_req_end_sector(req) (bio_end_sector(req->bio))
> +#define insitu_req_rw(req) (req->bio->bi_rw)
> +#define insitu_req_sectors(req) (bio_sectors(req->bio))
> +
> +static inline void insitu_req_endio(struct insitu_comp_req *req, int error)
> +{
> +	bio_endio(req->bio, error);
> +}
> +
> +struct insitu_comp_io_worker {
> +	struct list_head pending;
> +	spinlock_t lock;
> +	struct work_struct work;
> +};
> +#endif
> +
> +#endif
> Index: linux/Documentation/device-mapper/insitu-comp.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/Documentation/device-mapper/insitu-comp.txt	2014-02-17 17:34:45.427464765 +0800
> @@ -0,0 +1,50 @@
> +This is a simple DM target supporting compression for SSD only. The underlying
> +SSD must support 512B sector size; the target only supports 4k sector size.
> +
> +Disk layout:
> +|super|...meta...|..data...|
> +
> +The storage unit is 4k (a block). The superblock is 1 block and stores the meta
> +and data sizes and the compression algorithm. Meta is a bitmap. For each data
> +block, there are 5 bits of meta.
> +
> +Data:
> +Data of a block is compressed. Compressed data is rounded up to 512B, which is
> +the payload. On disk, the payload is stored at the beginning of the block's
> +logical sectors. Let's look at an example. Say we store data to block A, which
> +starts at sector B (A*8); its original size is 4k and its compressed size is
> +1500 bytes. The compressed data (CD) will use 3 sectors of 512B each. These 3
> +sectors are the payload, which is stored starting at sector B.
> +
> +---------------------------------------------------
> +... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> +---------------------------------------------------
> +    ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> +
> +For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta bits
> +to represent the payload size. The compressed size (1500) isn't stored in meta
> +directly. Instead, we store it in the last 32 bits of the payload. In this
> +example, we store it at the end of sector B+2. If compressed size + sizeof(u32)
> +crosses a sector boundary, the payload grows by one sector. If the payload uses
> +8 sectors, we store the uncompressed data directly.
> +
> +If IO size is bigger than one block, we can store the data as an extent. Data
> +of the whole extent will be compressed and stored in a similar way as above.
> +The first block of the extent is the head; all others are tails. If the extent
> +is 1 block, the block is the head. We have 1 bit of meta to indicate whether a
> +block is head or tail. If the 4 meta bits of the head block can't hold the
> +extent payload size, we borrow tail block meta bits to store it. Max allowed
> +extent size is 128k, so we don't compress/decompress too much data at once.
> +
> +Meta:
> +Modifying data will modify meta too. Meta is written (flushed) to disk
> +according to the meta write policy. We support writeback and writethrough modes.
> +In writeback mode, meta is written to disk at an interval or on a FLUSH request.
> +In writethrough mode, data and meta data are written to disk together.
> +
> +=========================
> +Parameters: <dev> [<writethrough>|<writeback> <meta_commit_delay>]
> +   <dev>: underlying device
> +   <writethrough>: flush metadata to disk in writethrough mode
> +   <writeback>: flush metadata to disk in writeback mode
> +   <meta_commit_delay>: metadata flush interval in writeback mode, in seconds

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel