[PATCH] New device mapper target ZDM

ZDM presents a traditional block device for ZBC/ZAC zoned devices.

User space utilities in zdm-tools for creating, repairing and restoring
DM instances are available at: https://github.com/Seagate

Primary advantages of using a zoned translation layer (ZTL) over a
zone-caching model include:
Low memory usage: less than 25 MiB per instance, based on cache aging.
Consistent first fill performance.
Good random write performance.
User configurable to match different workloads and QoS.

Disadvantages
A small amount of disk is used for ZTL data and over-provisioning.
Lower random read performance.
Greater code complexity.

Signed-off-by: Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>
Signed-off-by: Vineet Agarwal <vineet.agarwal@xxxxxxxxxxx>
---
 Documentation/device-mapper/zdm.txt |   152 +
 MAINTAINERS                         |     7 +
 drivers/md/Kconfig                  |    20 +
 drivers/md/Makefile                 |     1 +
 drivers/md/dm-zdm.c                 |  2826 ++++++++++
 drivers/md/dm-zdm.h                 |   945 ++++
 drivers/md/libzdm.c                 | 10043 ++++++++++++++++++++++++++++++++++
 7 files changed, 13994 insertions(+)
 create mode 100644 Documentation/device-mapper/zdm.txt
 create mode 100644 drivers/md/dm-zdm.c
 create mode 100644 drivers/md/dm-zdm.h
 create mode 100644 drivers/md/libzdm.c

diff --git a/Documentation/device-mapper/zdm.txt b/Documentation/device-mapper/zdm.txt
new file mode 100644
index 0000000..3384d7d
--- /dev/null
+++ b/Documentation/device-mapper/zdm.txt
@@ -0,0 +1,152 @@
+Overview of Host Aware ZBC/ZAC Device Mapper
+  - Zone size (256MiB)
+  - v4.10 ZBC support or later ...
+
+ZDM presents a traditional block device for ZBC/ZAC zoned devices.
+
+User space utilities in zdm-tools for creating, repairing and restoring
+DM instances are available at: https://github.com/stancheff and https://github.com/Seagate
+
+ZDM uses a zoned translation layer [ZTL] which shares similarities with
+an FTL. It is sometimes referred to as an STL [shingled translation layer];
+while the physical media restrictions differ, the core principles remain.
+
+Primary advantages of using a zoned translation layer (ZTL) over a
+zone-caching model include:
+    Low memory usage: less than 25 MiB per instance, based on cache aging.
+    Consistent first fill performance.
+    Good random write performance.
+    User configurable to match different workloads and QoS.
+
+Disadvantages
+    A small amount of disk is used for ZTL data and over-provisioning.
+    Lower random read performance.
+    Greater code complexity.
+
+
+        Initial Setup (Reported size / addressable space)
+                
+On initial setup ZDM and the zdmadm tool calculate the amount of space
+needed for ZDM internal metadata (lookup tables, superblocks, crcs, etc)
+as well as a percentage of over provisioned space needed for garbage
+collection.
+
+This effectively reduces the amount of addressable space reported by
+the constructed target block device.
+
+Garbage collection by default is a transparent background activity. To
+facilitate this, ZDM reports 'trim' (discard) support.
+
+The current ZTL I/O model is to request a contiguous block of space large
+enough to hold the current request, map the request onto the allocated space,
+and record the mapping in the ZTL. The initial update of the ZTL is held in an
+extent table that is periodically migrated to the on-disc format and persisted.
+Extent table blocks that have not yet been migrated are incorporated into the
+superblock stream during flush/fua operations. Read operations consult this
+hierarchy to determine the current location of the requested blocks.
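+
+A minimal sketch of the read-side lookup follows. It is an illustration only
+(the struct and zlt_lookup() are hypothetical stand-ins, not the in-kernel
+types); in the driver the lookup is performed by current_mapping() and
+current_map_range().
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* Illustrative types only -- not the in-kernel structures. */
+    struct extent {
+            uint64_t tlba;  /* exposed (target) LBA of the extent */
+            uint64_t blba;  /* backing LBA it currently maps to */
+            uint32_t len;   /* extent length in 4k blocks */
+    };
+
+    /* Stand-in for reading the persisted forward map (ZLT). */
+    static uint64_t zlt_lookup(uint64_t tlba)
+    {
+            (void)tlba;
+            return 0;       /* 0 == unmapped / discarded */
+    }
+
+    /*
+     * Newest mapping wins: consult the in-memory extent cache before
+     * falling back to the on-disc ZLT.
+     */
+    static uint64_t resolve_tlba(const struct extent *cache, size_t count,
+                                 uint64_t tlba)
+    {
+            size_t i;
+
+            for (i = 0; i < count; i++) {
+                    const struct extent *e = &cache[i];
+
+                    if (tlba >= e->tlba && tlba < e->tlba + e->len)
+                            return e->blba + (tlba - e->tlba);
+            }
+            return zlt_lookup(tlba);
+    }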
+
+
+        De-dupe Write Optimization
+
+Write I/Os which are entirely zeroed are mapped into the discard extent
+tables. The zeroed I/O is not written to disc; on read the empty block
+is generated.
+
+
+        Delayed Writeback Optimization
+
+An optional 1ms delay can be enabled to allow contiguous I/Os to be
+merged. This is mainly used for RAID 4/5/6 as the MD-RAID layer
+presents a 4K block mixed read/write workload which is far from
+optimal for ZDM.
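+
+The effect of the hold-off is illustrated below: a tLBA-sorted batch of held
+4K writes collapses into a few larger extents before allocation. This is a
+sketch only (struct pending and coalesce() are hypothetical names); the
+in-tree queueing path is zoned_map() and znd_bio_merge_dispatch() in
+dm-zdm.c.
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    struct pending {
+            uint64_t tlba;  /* starting tLBA of the held write */
+            uint32_t blks;  /* length in 4k blocks */
+    };
+
+    /*
+     * Coalesce a tLBA-sorted batch of held writes in place.
+     * Returns the number of merged extents left in reqs[].
+     */
+    static size_t coalesce(struct pending *reqs, size_t n)
+    {
+            size_t i, out = 0;
+
+            for (i = 0; i < n; i++) {
+                    if (out && reqs[out - 1].tlba + reqs[out - 1].blks ==
+                               reqs[i].tlba)
+                            reqs[out - 1].blks += reqs[i].blks;  /* extend */
+                    else
+                            reqs[out++] = reqs[i];  /* start a new extent */
+            }
+            return out;
+    }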
+
+
+        Garbage Collection
+
+Background garbage collection operations are initiated on a periodic trigger
+and on allocation failures (unable to allocate free blocks for writing).
+The zone selected for cleaning is chosen by estimating the number of stale
+blocks in the zone, either determined directly from remapped blocks or estimated
+from discard requests. A zone stale ratio is assigned based on the estimated
+number of stale blocks in the zone. This is the number of free blocks that
+will be recovered once the active blocks in the zone are relocated and the zone
+is reset.
+There are several knobs presented via zdmadm to determine how aggressively
+garbage collection operates, as well as the ability to halt automatic garbage
+collection entirely for critical operations. The primary knobs are how stale
+a zone must be and the number of empty zones which remain available. The
+defaults are to always initiate GC on a fully stale zone, and to start
+considering lesser stale ratios as the disc fills, once less than 25% of all
+zones are unused.
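+
+A minimal sketch of the default candidate test follows. The constants mirror
+the gc_prio_def, gc_prio_low and gc_wm_low defaults set in zoned_ctr();
+treating the stale ratio as a 16-bit value is an assumption made for the
+illustration, and the in-tree selection is handled by gc_can_cherrypick()
+and related code.
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /*
+     * stale_ratio: stale blocks scaled to 0..0xffff (0xffff == fully stale).
+     * free_pct:    percentage of zones currently unused.
+     */
+    static bool gc_zone_is_candidate(uint16_t stale_ratio, unsigned int free_pct)
+    {
+            /* A (nearly) fully stale zone is always worth reclaiming. */
+            if (stale_ratio >= 0xff00)
+                    return true;
+
+            /* Below 25% free zones, accept partially stale zones too. */
+            if (free_pct < 25 && stale_ratio >= 0x7fff)
+                    return true;
+
+            return false;
+    }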
+                
+
+        Implementation Overview
+
+-> .map(): DM target map handler
+When a target has I/O to process, this handler is called.
+
+ - Map incoming bio (bi_iter.bi_sector) onto ZDM exposed address space.
+ - Flag queue as no merge to avoid I/O being re-ordered by any io
+   elevator that may be enabled.
+ - Determine if I/O is a discard, read, or write. If it is a write determine
+   if the I/O is also flagged with FLUSH/FUA. Note: REQ_OP_FLUSH is a write op.
+    - Record the discard via the trim extent cache.
+    - Handle the read
+    - Handle the write. See .end_io() for completion handling.
+ - If flush is in effect queue the metadata worker and wait for completion.
+
+-> .end_io(): DM target endio handler
+When a bio completes this endio handler is notified.
+
+ - If completed operation is a write, update the appropriate zone wp to
+   indicate the wp has actually advanced. If the zone is full update the
+   stale ratio calculation of the zone and its containing bin.
+
+Metadata worker:
+ - On initialization, load previous state from disc
+ - Migrate extent cache entries to ZTL table entries
+ - on flush/fua, push all dirty ZLT blocks, extent cache, superblock et al.
+   to disc and perform an explicit media flush.
+
+Periodic activity (background worker):
+ - If last flush > 30s, queue a flush operation.
+ - If last I/O > DISCARD_IDLE_MSECS process some trim extent cache
+ - Scan for GC candidates and issue GC if found
+ - If cache memory usage is high try to release old blocks of ZLT
+
+GC worker:
+ - Load reverse map for zone and locate forward map ZLT entries.
+ - Build the current zone valid block map (tLBA -> bLBA).
+ - Loop until all valid data is moved out of zone: 
+   - Read valid tLBA entries
+   - Write (relocate) tLBA entries to new zone.
+ - Update map information for all moved (still valid) blocks.
+ - Reset WP on newly cleared zone and add zone to the free pool.
+
+Notable code [struct map_pool]:
+ - array merge sort and insert over non-contiguous pages
+ - binary search over non-contiguous pages
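+
+A minimal sketch of the page-split binary search follows (illustrative types
+only; see struct map_pool for the in-tree layout):
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    #define PER_PAGE (4096 / sizeof(uint64_t))  /* keys per 4k page */
+
+    /* A sorted key array spread over non-contiguous 4k pages. */
+    struct split_array {
+            uint64_t **pages;  /* one pointer per 4k page */
+            size_t count;      /* total number of keys */
+    };
+
+    static uint64_t sa_get(const struct split_array *sa, size_t idx)
+    {
+            return sa->pages[idx / PER_PAGE][idx % PER_PAGE];
+    }
+
+    /* Classic binary search by logical index; only sa_get() knows the layout. */
+    static long sa_bsearch(const struct split_array *sa, uint64_t key)
+    {
+            size_t lo = 0, hi = sa->count;
+
+            while (lo < hi) {
+                    size_t mid = lo + (hi - lo) / 2;
+                    uint64_t v = sa_get(sa, mid);
+
+                    if (v == key)
+                            return (long)mid;
+                    if (v < key)
+                            lo = mid + 1;
+                    else
+                            hi = mid;
+            }
+            return -1;  /* not found */
+    }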
+
+
+        Data Layout
+
+See: z_mapped_sync / z_mapped_init
+
+Zone #0: 
+   Block 0 - Reserved for zdmadm
+   Superblock layout:
+    - 512 blocks per generation (1-512, 513-1024, 1025-1536)
+   WP use/free cache
+    - 2048-~2200 [actual limit: (2 x <zone count>) / (PAGE_SIZE / sizeof(u32))]
+
+ZLT layout:
+   Zone #1 - z:   Forward map
+   Zone #z+1 - y: Reverse map
+   Zone #y+1:     CRC-16 over forward/reverse map.
+
+Data zones #y+2 to end.
+
+Future plans:
+  Enable 3 generations of WP use/free similar to current superblock scheme.
+  Pack ZLT layout (currently zone aligned) for better CMR utilization.
+  Restore/Expand stream id support when feature is merged.
diff --git a/MAINTAINERS b/MAINTAINERS
index f593300..67962182 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13036,6 +13036,13 @@ L:	zd1211-devs@xxxxxxxxxxxxxxxxxxxxx (subscribers-only)
 S:	Maintained
 F:	drivers/net/wireless/zydas/zd1211rw/
 
+ZDM ZONED DEVICE MAPPER TARGET
+M:	Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>
+L:	dm-devel@xxxxxxxxxx
+S:	Maintained
+F:	drivers/md/dm-zdm.*
+F:	drivers/md/libzdm.c
+
 ZPOOL COMPRESSED PAGE STORAGE API
 M:	Dan Streetman <ddstreet@xxxxxxxx>
 L:	linux-mm@xxxxxxxxx
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 02a5345..d0cdb8a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -336,6 +336,26 @@ config DM_ERA
          over time.  Useful for maintaining cache coherency when using
          vendor snapshots.
 
+config DM_ZDM
+       tristate "ZDM: Zoned based device target (EXPERIMENTAL)"
+       depends on BLK_DEV_DM
+       default n
+       select LIBCRC32C
+       ---help---
+         This device-mapper target implements a translation layer
+         for zoned block devices (ZBC/ZAC).
+         ZDM performs write serialization and copy-on-write of data to
+         ensure forward-only writing within zones. ZDM also includes a
+         garbage collection model to reclaim stale blocks and remap
+         in-use blocks.
+
+         Use zdmadm to create, repair and/or restore ZDM instances.
+
+         To compile this code as a module, choose M here: the module will
+         be called dm-zdm.
+
+         If unsure, say N.
+
 config DM_MIRROR
        tristate "Mirror target"
        depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1a..1a35fc3 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_ZDM)		+= dm-zdm.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-zdm.c b/drivers/md/dm-zdm.c
new file mode 100644
index 0000000..ace8151
--- /dev/null
+++ b/drivers/md/dm-zdm.c
@@ -0,0 +1,2826 @@
+/*
+ * Kernel Device Mapper for abstracting ZAC/ZBC devices as normal
+ * block devices for linux file systems.
+ *
+ * Copyright (C) 2015,2016 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>
+ *
+ * Bio queue support and metadata relocation by:
+ * Vineet Agarwal <vineet.agarwal@xxxxxxxxxxx>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "dm.h"
+#include <linux/dm-io.h>
+#include <linux/init.h>
+#include <linux/mempool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/random.h>
+#include <linux/crc32c.h>
+#include <linux/crc16.h>
+#include <linux/sort.h>
+#include <linux/ctype.h>
+#include <linux/types.h>
+#include <linux/timer.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kfifo.h>
+#include <linux/bsearch.h>
+#include "dm-zdm.h"
+
+#define PRIu64 "llu"
+#define PRIx64 "llx"
+#define PRId32 "d"
+#define PRIx32 "x"
+#define PRIu32 "u"
+
+#define BIOSET_RESV		256
+#define READ_PRIO_DEPTH		256
+
+/**
+ * _zdisk() - Return a pretty ZDM name.
+ * @znd: ZDM Instance
+ *
+ * Return: ZDM/backing device pretty name.
+ */
+static inline char *_zdisk(struct zdm *znd)
+{
+	return znd->bdev_name;
+}
+
+#define Z_ERR(znd, fmt, arg...) \
+	pr_err("dm-zdm(%s): " fmt "\n", _zdisk(znd), ## arg)
+
+#define Z_INFO(znd, fmt, arg...) \
+	pr_info("dm-zdm(%s): " fmt "\n", _zdisk(znd), ## arg)
+
+#define Z_DBG(znd, fmt, arg...) \
+	pr_debug("dm-zdm(%s): " fmt "\n", _zdisk(znd), ## arg)
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+static void do_io_work(struct work_struct *work);
+static int block_io(struct zdm *, enum dm_io_mem_type, void *, sector_t,
+		    unsigned int, u8 op, unsigned int op_flgs, int);
+static int znd_async_io(struct zdm *znd,
+			enum dm_io_mem_type dtype,
+			void *data, sector_t block, unsigned int nDMsect,
+			unsigned int op, unsigned int opf, int queue,
+			io_notify_fn callback, void *context);
+static int zoned_bio(struct zdm *znd, struct bio *bio);
+static int zoned_map_write(struct zdm *znd, struct bio*, u64 s_zdm);
+static sector_t get_dev_size(struct dm_target *ti);
+static int dmz_reset_wp(struct zdm *znd, u64 z_id);
+static int dmz_open_zone(struct zdm *znd, u64 z_id);
+static int dmz_close_zone(struct zdm *znd, u64 z_id);
+static int dmz_report_zones(struct zdm *znd, u64 z_id, struct blk_zone *zones,
+			    unsigned int *nz, gfp_t gfp);
+static void activity_timeout(unsigned long data);
+static void zoned_destroy(struct zdm *);
+static int gc_can_cherrypick(struct zdm *znd, u32 sid, int delay, gfp_t gfp);
+static void bg_work_task(struct work_struct *work);
+static void on_timeout_activity(struct zdm *znd, int delay);
+static int zdm_create_proc_entries(struct zdm *znd);
+static void zdm_remove_proc_entries(struct zdm *znd);
+
+#if ENABLE_SEC_METADATA
+/**
+ * znd_get_backing_dev() - Get the backing device.
+ * @znd: ZDM Instance
+ * @block: Logical sector / tlba
+ *
+ * Return: Backing device
+ */
+static struct block_device *znd_get_backing_dev(struct zdm *znd,
+						sector_t *block)
+{
+	struct block_device *bdev = NULL;
+
+	switch (znd->meta_dst_flag) {
+
+	case DST_TO_PRI_DEVICE:
+		bdev = znd->dev->bdev;
+		break;
+
+	case DST_TO_SEC_DEVICE:
+		if (*block < (znd->data_lba << Z_SHFT4K)) {
+			bdev = znd->meta_dev->bdev;
+		} else {
+			bdev = znd->dev->bdev;
+			/*
+			 * Align data drive starting LBA to zone boundary
+			 * to ensure wp sync.
+			 */
+			*block = *block  - (znd->data_lba << Z_SHFT4K)
+						+ znd->sec_zone_align;
+		}
+		break;
+
+	case DST_TO_BOTH_DEVICE:
+		if (*block < (znd->data_lba << Z_SHFT4K))
+			bdev = znd->meta_dev->bdev;
+		else
+			bdev = znd->dev->bdev;
+		break;
+	}
+	return bdev;
+}
+#endif
+/**
+ * get_bdev_bd_inode() - Get primary backing device inode
+ * @znd: ZDM Instance
+ *
+ * Return: backing device inode
+ */
+static inline struct inode *get_bdev_bd_inode(struct zdm *znd)
+{
+	return znd->dev->bdev->bd_inode;
+}
+
+#include "libzdm.c"
+
+#define BIO_CACHE_SECTORS (IO_VCACHE_PAGES * Z_BLOCKS_PER_DM_SECTOR)
+
+/**
+ * bio_stream() - Decode stream id from BIO.
+ * @bio: Bio to decode.
+ *
+ * Return: stream_id
+ */
+static inline u32 bio_stream(struct bio *bio)
+{
+	u32 stream_id = 0x40;
+
+	/*
+	 * Since adding stream id to a BIO is not yet in mainline we just
+	 * use this heuristic to try to skip unnecessary co-mingling of data.
+	 */
+	if (bio->bi_opf & REQ_META) {
+		stream_id = 0xfe; /* upper level metadata */
+	} else if (bio->bi_iter.bi_size < (Z_C4K * 4)) {
+		/* avoid XFS meta/data churn in extent maps */
+		stream_id = 0xfd; /* 'hot' upper level data */
+
+#if 0 /* bio_get_streamid() is available */
+	} else {
+		unsigned int id = bio_get_streamid(bio);
+
+		/* high 8 bits is hash of PID, low 8 bits is hash of inode# */
+		stream_id = id >> 8;
+		if (stream_id == 0)
+			stream_id++;
+		if (stream_id >= 0xfc)
+			stream_id--;
+#endif
+	}
+
+	return stream_id;
+}
+
+/**
+ * zoned_map_discard() - Handle a discard request.
+ * @znd: ZDM Instance
+ * @bio: struct bio holding discard information
+ * @s_zdm: tlba being discarded.
+ *
+ * Return: 0 on success, otherwise error code.
+ */
+static int zoned_map_discard(struct zdm *znd, struct bio *bio, u64 s_zdm)
+{
+	int rcode = DM_MAPIO_SUBMITTED;
+	u32 blks = bio->bi_iter.bi_size >> PAGE_SHIFT;
+	unsigned long flags;
+	int redundant = 0;
+	const gfp_t gfp = GFP_ATOMIC;
+	int err;
+
+	spin_lock_irqsave(&znd->stats_lock, flags);
+	if (znd->is_empty)
+		redundant = 1;
+	else if ((s_zdm + blks) >= znd->htlba)
+		redundant = 1;
+	spin_unlock_irqrestore(&znd->stats_lock, flags);
+
+	if (!redundant) {
+		err = z_mapped_discard(znd, s_zdm, blks, gfp);
+		if (err < 0) {
+			bio->bi_error = err;
+			rcode = err;
+		}
+	}
+	bio->bi_iter.bi_sector = 8;
+	bio_endio(bio);
+	return rcode;
+}
+
+/**
+ * is_non_wp_zone() - Test zone # to see if it flagged as conventional.
+ * @znd: ZDM Instance
+ * @z_id: Zone #
+ *
+ * Return: 1 if conventional zone. 0 if sequential write zone.
+ */
+static int is_non_wp_zone(struct zdm *znd, u64 z_id)
+{
+	u32 gzoff = z_id % 1024;
+	struct meta_pg *wpg = &znd->wp[z_id >> 10];
+	u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+	return (wp & Z_WP_NON_SEQ) ? 1 : 0;
+}
+
+/**
+ * struct zone_action - Zone command to be issued to the backing device.
+ * @work: Work item used to queue the command on the zone action workqueue.
+ * @znd: ZDM Instance.
+ * @s_addr: Starting sector (512-byte units) of the zone.
+ * @op: Zone operation (e.g. REQ_OP_ZONE_RESET).
+ * @op_f: Operation flags.
+ * @wait: Non-zero to wait for completion.
+ * @wp_err: Result of the zone command.
+ */
+struct zone_action {
+	struct work_struct work;
+	struct zdm *znd;
+	u64 s_addr;
+	unsigned int op;
+	unsigned int op_f;
+	int wait;
+	int wp_err;
+};
+
+/**
+ * za_endio() - Bio endio handler for asynchronous zone actions.
+ * @bio: Bio being completed.
+ *
+ * Report zone command failures, free the zone_action when it was issued
+ * asynchronously, and drop the bio reference.
+static void za_endio(struct bio *bio)
+{
+	struct zone_action *za = bio->bi_private;
+
+	switch (bio_op(bio)) {
+	case REQ_OP_ZONE_RESET:
+		/* find the zone and reset the wp on it */
+		break;
+	default:
+		pr_err("%s: unexpected op: %d\n", __func__, bio_op(bio));
+		break;
+	}
+
+	if (bio->bi_error) {
+		struct zdm *znd = za->znd;
+
+		Z_ERR(znd, "Zone Cmd: LBA: %" PRIx64 " -> %d failed.",
+		      za->s_addr, bio->bi_error);
+		Z_ERR(znd, "ZAC/ZBC support disabled.");
+
+		znd->bdev_is_zoned = 0;
+	}
+
+	if (!za->wait)
+		kfree(za);
+
+	bio_put(bio);
+}
+
+
+/**
+ * do_zone_action_work() - Issue a 'zone action' to the backing device.
+ * @work: Work to do.
+ */
+static void do_zone_action_work(struct work_struct *work)
+{
+	struct zone_action *za = container_of(work, struct zone_action, work);
+	struct zdm *znd = za->znd;
+	struct block_device *bdev = znd->dev->bdev;
+	const gfp_t gfp = GFP_ATOMIC;
+	struct bio *bio = bio_alloc_bioset(gfp, 1, znd->bio_set);
+
+	if (bio) {
+		bio->bi_iter.bi_sector = za->s_addr;
+		bio->bi_bdev = bdev;
+		bio->bi_vcnt = 0;
+		bio->bi_iter.bi_size = 0;
+		bio_set_op_attrs(bio, za->op, za->op_f);
+		if (!za->wait) {
+			bio->bi_private = za;
+			bio->bi_end_io = za_endio;
+			submit_bio(bio);
+			return;
+		}
+		za->wp_err = submit_bio_wait(bio);
+		bio_put(bio);
+	} else {
+		za->wp_err = -ENOMEM;
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+	}
+}
+
+/**
+ * dmz_zone_action() - Issue a 'zone action' to the backing device (via worker).
+ * @znd: ZDM Instance
+ * @z_id: Zone # to act on.
+ * @op: Zone operation (e.g. REQ_OP_ZONE_RESET).
+ * @op_f: Operation flags.
+ * @wait: Non-zero to wait for the command to complete.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int dmz_zone_action(struct zdm *znd, u64 z_id, unsigned int op,
+			   unsigned int op_f, int wait)
+{
+	int wp_err = 0;
+	u64 z_offset = zone_to_sector(z_id);
+	struct zone_action za = {
+		.znd = znd,
+		.s_addr = z_offset,
+		.op = op,
+		.op_f = op_f,
+		.wait = wait,
+		.wp_err = 0,
+	};
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+		z_offset = zone_to_sector(z_id) + znd->sec_zone_align
+				 + (znd->sec_dev_start_sect << Z_SHFT4K);
+		za.s_addr = z_offset;
+	}
+#endif
+
+	if (is_non_wp_zone(znd, z_id))
+		return wp_err;
+
+	if (!znd->bdev_is_zoned)
+		return wp_err;
+
+	if (!wait) {
+		struct zone_action *zact = kzalloc(sizeof(*zact), GFP_ATOMIC);
+
+		if (!zact)
+			return -ENOMEM;
+
+		memcpy(zact, &za, sizeof(za));
+		INIT_WORK(&zact->work, do_zone_action_work);
+		queue_work(znd->zone_action_wq, &zact->work);
+		return 0;
+	}
+
+	/*
+	 * Issue the synchronous I/O from a different thread
+	 * to avoid generic_make_request recursion.
+	 */
+	INIT_WORK_ONSTACK(&za.work, do_zone_action_work);
+	queue_work(znd->zone_action_wq, &za.work);
+	flush_workqueue(znd->zone_action_wq);
+	destroy_work_on_stack(&za.work);
+	wp_err = za.wp_err;
+
+	if (wait && wp_err) {
+		struct hd_struct *p = znd->dev->bdev->bd_part;
+
+		Z_ERR(znd, "Zone Cmd: LBA: %" PRIx64 " (%" PRIx64
+			   " [Z:%" PRIu64 "] -> %d failed.",
+		      za.s_addr, za.s_addr + p->start_sect, z_id, wp_err);
+
+		Z_ERR(znd, "ZAC/ZBC support disabled.");
+		znd->bdev_is_zoned = 0;
+		wp_err = -ENOTSUPP;
+	}
+	return wp_err;
+}
+
+/**
+ * dmz_reset_wp() - Reset write pointer for zone z_id.
+ * @znd: ZDM Instance
+ * @z_id: Zone # to reset.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int dmz_reset_wp(struct zdm *znd, u64 z_id)
+{
+	return dmz_zone_action(znd, z_id, REQ_OP_ZONE_RESET, 0, 1);
+}
+
+/**
+ * dmz_open_zone() - Open zone for writing.
+ * @znd: ZDM Instance
+ * @z_id: Zone # to open.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int dmz_open_zone(struct zdm *znd, u64 z_id)
+{
+	return 0;
+}
+
+/**
+ * dmz_close_zone() - Close zone to writing.
+ * @znd: ZDM Instance
+ * @z_id: Zone # to close.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int dmz_close_zone(struct zdm *znd, u64 z_id)
+{
+	return 0;
+}
+
+/**
+ * dmz_report_zones() - issue report zones from z_id zones after zdstart
+ * @znd: ZDM Instance
+ * @z_id: Zone past zdstart
+ * @zones: Array of blk_zone structures to fill in.
+ * @nz: In: size of @zones. Out: number of zones reported.
+ * @gfp: Allocation flags.
+ *
+ * Return: -ENOTSUPP or 0 on success
+ */
+static int dmz_report_zones(struct zdm *znd, u64 z_id, struct blk_zone *zones,
+			    unsigned int *nz, gfp_t gfp)
+{
+	int wp_err = -ENOTSUPP;
+
+	if (znd->bdev_is_zoned) {
+		struct block_device *bdev = znd->dev->bdev;
+		u64 s_addr = zone_to_sector(z_id);
+
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_SEC_DEVICE)
+			s_addr = zone_to_sector(z_id) + znd->sec_zone_align;
+#endif
+		wp_err = blkdev_report_zones(bdev, s_addr, zones, nz, gfp);
+		if (wp_err) {
+			Z_ERR(znd, "Report Zones: LBA: %" PRIx64
+			      " [Z:%" PRIu64 " -> %d failed.",
+			      s_addr, z_id + znd->zdstart, wp_err);
+			Z_ERR(znd, "ZAC/ZBC support disabled.");
+			znd->bdev_is_zoned = 0;
+			wp_err = -ENOTSUPP;
+		}
+	}
+	return wp_err;
+}
+
+/**
+ * _zoned_map() - Map a single bio, recording any metadata error.
+ * @znd: ZDM Instance
+ * @bio: bio to be mapped.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int _zoned_map(struct zdm *znd, struct bio *bio)
+{
+	int err;
+
+	err = zoned_bio(znd, bio);
+	if (err < 0) {
+		znd->meta_result = err;
+		err = 0;
+	}
+
+	return err;
+
+}
+
+/**
+ * znd_bio_merge_dispatch() - Bio queue dispatch kthread.
+ * @arg: ZDM Instance (struct zdm *).
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int znd_bio_merge_dispatch(void *arg)
+{
+	struct zdm *znd = (struct zdm *)arg;
+	struct list_head *jlst = &znd->bio_srt_jif_lst_head;
+	struct zdm_q_node *node;
+	struct bio *bio = NULL;
+	unsigned long flags;
+	long timeout;
+
+	Z_INFO(znd, "znd_bio_merge_dispatch kthread [started]");
+	while (!kthread_should_stop() || atomic_read(&znd->enqueued)) {
+		node = NULL;
+		timeout = znd->queue_delay;
+		spin_lock_irqsave(&znd->zdm_bio_q_lck, flags);
+		node = list_first_entry_or_null(jlst, struct zdm_q_node, jlist);
+		if (node) {
+			bio = node->bio;
+			list_del(&node->jlist);
+			spin_unlock_irqrestore(&znd->zdm_bio_q_lck, flags);
+			ZDM_FREE(znd, node, sizeof(struct zdm_q_node), KM_14);
+			atomic_dec(&znd->enqueued);
+			if (bio) {
+				_zoned_map(znd, bio);
+				timeout = 0;
+			}
+		} else {
+			spin_unlock_irqrestore(&znd->zdm_bio_q_lck, flags);
+			if (znd->queue_depth <= atomic_read(&znd->enqueued))
+				timeout *= 100;
+		}
+		if (timeout)
+			schedule_timeout_interruptible(timeout);
+	}
+	z_flush_bdev(znd, GFP_KERNEL);
+	Z_INFO(znd, "znd_bio_merge_dispatch kthread [stopped]");
+	return 0;
+}
+
+/**
+ * zoned_map() - Handle an incoming BIO
+ * @ti: Device Mapper Target Instance
+ * @bio: The BIO to disposition.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int zoned_map(struct dm_target *ti, struct bio *bio)
+{
+	struct zdm *znd = ti->private;
+	struct zdm_q_node *q_node;
+	unsigned long flags;
+	u32 op;
+
+	if (znd->queue_depth == 0 && atomic_read(&znd->enqueued) == 0)
+		return _zoned_map(znd, bio);
+
+	op = bio_op(bio);
+	if (op == REQ_OP_READ || op == REQ_OP_DISCARD)
+		return _zoned_map(znd, bio);
+
+	q_node = ZDM_ALLOC(znd, sizeof(struct zdm_q_node), KM_14, GFP_ATOMIC);
+	if (unlikely(!q_node)) {
+		Z_INFO(znd, "Bio Q allocation failed");
+		return _zoned_map(znd, bio);
+	}
+
+	q_node->bio = bio;
+	q_node->jiffies = jiffies;
+	q_node->bi_sector = bio->bi_iter.bi_sector;
+	INIT_LIST_HEAD(&(q_node->jlist));
+
+	spin_lock_irqsave(&znd->zdm_bio_q_lck, flags);
+	list_add_tail(&(q_node->jlist), &(znd->bio_srt_jif_lst_head));
+	atomic_inc(&znd->enqueued);
+	spin_unlock_irqrestore(&znd->zdm_bio_q_lck, flags);
+	wake_up_process(znd->bio_kthread);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/**
+ * zoned_actual_size() - Set number of 4k blocks available on block device.
+ * @ti: Device Mapper Target Instance
+ * @znd: ZDM Instance
+ */
+static void zoned_actual_size(struct dm_target *ti, struct zdm *znd)
+{
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+		znd->nr_blocks = (i_size_read(get_bdev_bd_inode(znd))
+				 - (znd->sec_zone_align << Z_SHFT_SEC)) / Z_C4K;
+	} else
+		znd->nr_blocks = i_size_read(get_bdev_bd_inode(znd)) / Z_C4K;
+#else
+	znd->nr_blocks = i_size_read(get_bdev_bd_inode(znd)) / Z_C4K;
+#endif
+}
+
+/**
+ * zoned_ctr() - Create a ZDM Instance from DM Target Instance and args.
+ * @ti: Device Mapper Target Instance
+ * @argc: Number of args to handle.
+ * @argv: args to handle.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int zoned_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	const int reset_non_empty = false;
+	int create = 0;
+	int force = 0;
+	int zbc_probe = 1;
+	int zac_probe = 1;
+	int r;
+	struct zdm *znd;
+#if ENABLE_SEC_METADATA
+	char *meta_dev = NULL;
+#endif
+	u64 first_data_zone = 0;
+	u64 mz_md_provision = MZ_METADATA_ZONES;
+
+	BUILD_BUG_ON(Z_C4K != (sizeof(struct map_cache_page)));
+	BUILD_BUG_ON(Z_C4K != (sizeof(struct io_4k_block)));
+	BUILD_BUG_ON(Z_C4K != (sizeof(struct mz_superkey)));
+	BUILD_BUG_ON(PAGE_SIZE != sizeof(struct map_pool));
+
+	znd = ZDM_ALLOC(NULL, sizeof(*znd), KM_00, GFP_KERNEL);
+	if (!znd) {
+		ti->error = "Error allocating zdm structure";
+		return -ENOMEM;
+	}
+
+	znd->enable_trim = 0;
+	znd->queue_depth = 0;
+	znd->gc_prio_def = 0xff00;
+	znd->gc_prio_low = 0x7fff;
+	znd->gc_prio_high = 0x0400;
+	znd->gc_prio_crit = 0x0040;
+	znd->gc_wm_crit = 7;
+	znd->gc_wm_high = 5;
+	znd->gc_wm_low = 25;
+	znd->gc_status = 1;
+	znd->cache_ageout_ms = 9000;
+	znd->cache_size = 4096;
+	znd->cache_to_pagecache = 0;
+	znd->cache_reada = 64;
+	znd->journal_age = 3;
+	znd->queue_delay = msecs_to_jiffies(WB_DELAY_MS);
+
+	if (argc < 1) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	for (r = 1; r < argc; r++) {
+		if (isdigit(*argv[r])) {
+			int krc = kstrtoll(argv[r], 0, &first_data_zone);
+
+			if (krc != 0) {
+				DMERR("Failed to parse %s: %d", argv[r], krc);
+				first_data_zone = 0;
+			}
+		}
+		if (!strcasecmp("create", argv[r]))
+			create = 1;
+		if (!strcasecmp("load", argv[r]))
+			create = 0;
+		if (!strcasecmp("force", argv[r]))
+			force = 1;
+		if (!strcasecmp("nozbc", argv[r]))
+			zbc_probe = 0;
+		if (!strcasecmp("nozac", argv[r]))
+			zac_probe = 0;
+		if (!strcasecmp("discard", argv[r]))
+			znd->enable_trim = 1;
+		if (!strcasecmp("nodiscard", argv[r]))
+			znd->enable_trim = 0;
+		if (!strcasecmp("bio-queue", argv[r]))
+			znd->queue_depth = 1;
+		if (!strcasecmp("no-bio-queue", argv[r]))
+			znd->queue_depth = 0;
+
+		if (!strncasecmp("reserve=", argv[r], 8)) {
+			u64 mz_resv;
+			int krc = kstrtoll(argv[r] + 8, 0, &mz_resv);
+
+			if (krc == 0) {
+				if (mz_resv > mz_md_provision)
+					mz_md_provision = mz_resv;
+			} else {
+				DMERR("Reserved 'FAILED TO PARSE.' %s: %d",
+					argv[r]+8, krc);
+				mz_resv = 0;
+			}
+		}
+#if ENABLE_SEC_METADATA
+		if (!strncasecmp("meta=", argv[r], 5))
+			meta_dev = argv[r]+5;
+		if (!strcasecmp("mirror-md", argv[r]))
+			znd->meta_dst_flag = DST_TO_BOTH_DEVICE;
+#endif
+	}
+
+	znd->ti = ti;
+	ti->private = znd;
+	znd->zdstart = first_data_zone; /* IN ABSOLUTE COORDs */
+	znd->mz_provision = mz_md_provision;
+
+	r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &znd->dev);
+	if (r) {
+		ti->error = "Error opening backing device";
+		zoned_destroy(znd);
+		return -EINVAL;
+	}
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag != DST_TO_BOTH_DEVICE) {
+		if (meta_dev)
+			znd->meta_dst_flag = DST_TO_SEC_DEVICE;
+		else
+			znd->meta_dst_flag = DST_TO_PRI_DEVICE;
+	}
+#endif
+
+	if (znd->dev->bdev) {
+		u64 sect = get_start_sect(znd->dev->bdev) >> Z_SHFT4K;
+
+		bdevname(znd->dev->bdev, znd->bdev_name);
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+			u64 zone = dm_round_up(sect, Z_BLKSZ) - sect;
+
+			znd->sec_dev_start_sect = sect;
+			znd->sec_zone_align = zone << Z_SHFT4K;
+		} else
+			znd->start_sect = sect;
+#else
+		znd->start_sect = sect;
+#endif
+	}
+
+#if ENABLE_SEC_METADATA
+	if (meta_dev) {
+		u64 sect;
+
+		r = dm_get_device(ti, meta_dev, FMODE_READ | FMODE_WRITE,
+				&znd->meta_dev);
+		if (r) {
+			ti->error = "Error opening metadata device";
+			zoned_destroy(znd);
+			return -EINVAL;
+		}
+		bdevname(znd->meta_dev->bdev, znd->bdev_metaname);
+		sect = get_start_sect(znd->meta_dev->bdev) >> Z_SHFT4K;
+		if (znd->meta_dst_flag == DST_TO_SEC_DEVICE)
+			znd->start_sect = sect;
+		else
+			znd->sec_dev_start_sect = sect;
+	}
+#endif
+
+	/*
+	 * Set if this target needs to receive flushes regardless of
+	 * whether or not its underlying devices have support.
+	 */
+	ti->num_flush_bios = 1;
+	ti->flush_supported = true;
+
+	/*
+	 * Set if this target needs to receive discards regardless of
+	 * whether or not its underlying devices have support.
+	 */
+	ti->discards_supported = true;
+
+	/*
+	 * Set if the target required discard bios to be split
+	 * on max_io_len boundary.
+	 */
+	ti->split_discard_bios = false;
+
+	/*
+	 * Set if this target does not return zeroes on discarded blocks.
+	 */
+	ti->discard_zeroes_data_unsupported = false;
+
+	/*
+	 * Set if this target wants discard bios to be sent.
+	 */
+	ti->num_discard_bios = 1;
+
+	if (!znd->enable_trim) {
+		ti->discards_supported = false;
+		ti->num_discard_bios = 0;
+	}
+
+	zoned_actual_size(ti, znd);
+
+	r = do_init_zoned(ti, znd);
+	if (r) {
+		ti->error = "Error in zdm init";
+		zoned_destroy(znd);
+		return -EINVAL;
+	}
+	znd->filled_zone = NOZONE;
+
+	if (zac_probe || zbc_probe)
+		znd->bdev_is_zoned = 1;
+
+	r = zoned_init_disk(ti, znd, create, force);
+	if (r) {
+		ti->error = "Error in zdm init from disk";
+		zoned_destroy(znd);
+		return -EINVAL;
+	}
+	r = zoned_wp_sync(znd, reset_non_empty);
+	if (r) {
+		ti->error = "Error in zdm re-sync WP";
+		zoned_destroy(znd);
+		return -EINVAL;
+	}
+
+	update_all_stale_ratio(znd);
+
+	znd->bio_set = bioset_create(BIOSET_RESV, 0);
+	if (!znd->bio_set)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&(znd->bio_srt_jif_lst_head));
+	spin_lock_init(&znd->zdm_bio_q_lck);
+	znd->bio_kthread = kthread_run(znd_bio_merge_dispatch, znd, "zdm-io-%s",
+			znd->bdev_name);
+	if (IS_ERR(znd->bio_kthread)) {
+		r = PTR_ERR(znd->bio_kthread);
+		ti->error = "Couldn't alloc kthread";
+		zoned_destroy(znd);
+		return r;
+	}
+
+	r = zdm_create_proc_entries(znd);
+	if (r) {
+		ti->error = "Failed to create /proc entries";
+		zoned_destroy(znd);
+		return -EINVAL;
+	}
+
+	/* Restore any ZDM SB 'config' changes here */
+	mod_timer(&znd->timer, jiffies + msecs_to_jiffies(5000));
+
+	return 0;
+}
+
+/**
+ * zoned_dtr() - Deconstruct a ZDM Instance from DM Target Instance.
+ * @ti: Device Mapper Target Instance
+ */
+static void zoned_dtr(struct dm_target *ti)
+{
+	struct zdm *znd = ti->private;
+
+	if (znd->z_sballoc) {
+		struct mz_superkey *key_blk = znd->z_sballoc;
+		struct zdm_superblock *sblock = &key_blk->sblock;
+
+		sblock->flags = cpu_to_le32(0);
+		sblock->csum = sb_crc32(sblock);
+	}
+
+	wake_up_process(znd->bio_kthread);
+	wait_event(znd->wait_bio, atomic_read(&znd->enqueued) == 0);
+	kthread_stop(znd->bio_kthread);
+	zdm_remove_proc_entries(znd);
+	zoned_destroy(znd);
+}
+
+
+/**
+ * do_io_work() - Read or write data from a block device.
+ * @work: Work to be done.
+ */
+static void do_io_work(struct work_struct *work)
+{
+	struct z_io_req_t *req = container_of(work, struct z_io_req_t, work);
+	struct dm_io_request *io_req = req->io_req;
+	unsigned long error_bits = 0;
+
+	req->result = dm_io(io_req, 1, req->where, &error_bits);
+	if (error_bits)
+		DMERR("ERROR: dm_io_work error: %lx", error_bits);
+}
+
+/**
+ * _znd_async_io() - Issue I/O via dm_io async or sync (using worker thread).
+ * @znd: ZDM Instance
+ * @io_req: I/O request
+ * @data: Data for I/O
+ * @where: I/O region
+ * @dtype: I/O data type
+ * @queue: Use worker when true
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int _znd_async_io(struct zdm *znd, struct dm_io_request *io_req,
+			void *data, struct dm_io_region *where,
+			enum dm_io_mem_type dtype, int queue)
+{
+	int rcode;
+	unsigned long error_bits = 0;
+
+	switch (dtype) {
+	case DM_IO_KMEM:
+		io_req->mem.ptr.addr = data;
+		break;
+	case DM_IO_BIO:
+		io_req->mem.ptr.bio = data;
+		break;
+	case DM_IO_VMA:
+		io_req->mem.ptr.vma = data;
+		break;
+	default:
+		Z_ERR(znd, "page list not handled here ..  see dm-io.");
+		break;
+	}
+
+	if (queue) {
+		struct z_io_req_t req;
+
+		/*
+		 * Issue the synchronous I/O from a different thread
+		 * to avoid generic_make_request recursion.
+		 */
+		INIT_WORK_ONSTACK(&req.work, do_io_work);
+		req.where = where;
+		req.io_req = io_req;
+		queue_work(znd->io_wq, &req.work);
+
+		Z_DBG(znd, "%s: wait for %s io (%lx)",
+			__func__,
+			io_req->bi_op == REQ_OP_READ ? "R" : "W",
+			where->sector >> 3);
+
+		flush_workqueue(znd->io_wq);
+
+		Z_DBG(znd, "%s: cmplted %s io (%lx)",
+			__func__,
+			io_req->bi_op == REQ_OP_READ ? "R" : "W",
+			where->sector >> 3);
+		destroy_work_on_stack(&req.work);
+
+		rcode = req.result;
+		if (rcode < 0)
+			Z_ERR(znd, "ERROR: dm_io error: %d", rcode);
+		goto done;
+	}
+	rcode = dm_io(io_req, 1, where, &error_bits);
+	if (error_bits || rcode < 0)
+		Z_ERR(znd, "ERROR: dm_io error: %d -- %lx", rcode, error_bits);
+
+done:
+	return rcode;
+
+}
+
+/**
+ * znd_async_io() - Issue I/O via dm_io async or sync (using worker thread).
+ * @znd: ZDM Instance
+ * @dtype: Type of memory in data
+ * @data: Data for I/O
+ * @block: bLBA for I/O
+ * @nDMsect: Number of 512 byte blocks to read/write.
+ * @op: REQ_OP_READ or REQ_OP_WRITE
+ * @opf: bi_op_flags (Sync, Flush, FUA, ...)
+ * @queue: if true then use worker thread for I/O and wait.
+ * @callback: callback to use on I/O complete.
+ * @context: context to be passed to callback.
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int znd_async_io(struct zdm *znd,
+			enum dm_io_mem_type dtype,
+			void *data, sector_t block, unsigned int nDMsect,
+			unsigned int op, unsigned int opf, int queue,
+			io_notify_fn callback, void *context)
+{
+	int rcode;
+#if ENABLE_SEC_METADATA
+	struct dm_io_region where;
+#else
+	struct dm_io_region where = {
+		.bdev = znd->dev->bdev,
+		.sector = block,
+		.count = nDMsect,
+	};
+
+#endif
+	struct dm_io_request io_req = {
+		.bi_op = op,
+		.bi_op_flags = opf,
+		.mem.type = dtype,
+		.mem.offset = 0,
+		.mem.ptr.vma = data,
+		.client = znd->io_client,
+		.notify.fn = callback,
+		.notify.context = context,
+	};
+
+#if ENABLE_SEC_METADATA
+	where.bdev  = znd_get_backing_dev(znd, &block);
+	where.count = nDMsect;
+	where.sector = block;
+	if (op == REQ_OP_WRITE &&
+	    znd->meta_dst_flag == DST_TO_BOTH_DEVICE &&
+	    block < (znd->data_lba << Z_SHFT4K)) {
+		where.bdev = znd->meta_dev->bdev;
+		rcode = _znd_async_io(znd, &io_req, data, &where, dtype, queue);
+		where.bdev = znd->dev->bdev;
+		rcode = _znd_async_io(znd, &io_req, data, &where, dtype, queue);
+	} else
+
+/*
+ * do we need to check for OP_READ && DST_TO_SEC_DEVICE,
+ * also wouldn't DST_TO_BOTH_DEVICE prefer to read from DST_TO_SEC_DEVICE ?
+ */
+
+#endif
+		rcode = _znd_async_io(znd, &io_req, data, &where, dtype, queue);
+
+	return rcode;
+}
+
+/**
+ * block_io() - Issue sync I/O, optionally using a worker thread.
+ * @znd: ZDM Instance
+ * @dtype: Type of memory in data
+ * @data: Data for I/O
+ * @sector: bLBA for I/O [512 byte resolution]
+ * @nblks: Number of 512 byte blocks to read/write.
+ * @op: bi_op (Read/Write/Discard ... )
+ * @op_flags: bi_op_flags (Sync, Flush, FUA, ...)
+ * @queue: if true then use worker thread for I/O and wait.
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int block_io(struct zdm *znd,
+		    enum dm_io_mem_type dtype, void *data, sector_t sector,
+		    unsigned int nblks, u8 op, unsigned int op_flags, int queue)
+{
+	return znd_async_io(znd, dtype, data, sector, nblks, op,
+			    op_flags, queue, NULL, NULL);
+}
+
+/**
+ * read_block() - Issue sync read, optionally using a worker thread.
+ * @znd: ZDM Instance
+ * @dtype: Type of memory in data
+ * @data: Data for I/O
+ * @lba: bLBA for I/O [4k resolution]
+ * @count: Number of 4k blocks to read/write.
+ * @queue: if true then use worker thread for I/O and wait.
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int read_block(struct zdm *znd, enum dm_io_mem_type dtype,
+		      void *data, u64 lba, unsigned int count, int queue)
+{
+	sector_t block = lba << Z_SHFT4K;
+	unsigned int nDMsect = count << Z_SHFT4K;
+	int rc;
+
+	if (lba >= znd->nr_blocks) {
+		Z_ERR(znd, "Error reading past end of media: %llx.", lba);
+		rc = -EIO;
+		return rc;
+	}
+
+	rc = block_io(znd, dtype, data, block, nDMsect, REQ_OP_READ, 0, queue);
+	if (rc) {
+		Z_ERR(znd, "read error: %d -- R: %llx [%u dm sect] (Q:%d)",
+			rc, lba, nDMsect, queue);
+		dump_stack();
+	}
+
+	return rc;
+}
+
+/**
+ * writef_block() - Issue sync write, optionally using a worker thread.
+ * @znd: ZDM Instance
+ * @dtype: Type of memory in data
+ * @data: Data for I/O
+ * @lba: bLBA for I/O [4k resolution]
+ * @op_flags: bi_op_flags for bio (Sync/Flush/FUA)
+ * @count: Number of 4k blocks to read/write.
+ * @queue: if true then use worker thread for I/O and wait.
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int writef_block(struct zdm *znd, enum dm_io_mem_type dtype,
+			void *data, u64 lba, unsigned int op_flags,
+			unsigned int count, int queue)
+{
+	sector_t block = lba << Z_SHFT4K;
+	unsigned int nDMsect = count << Z_SHFT4K;
+	int rc;
+
+	rc = block_io(znd, dtype, data, block, nDMsect, REQ_OP_WRITE,
+		      op_flags, queue);
+	if (rc) {
+		Z_ERR(znd, "write error: %d W: %llx [%u dm sect] (Q:%d)",
+			rc, lba, nDMsect, queue);
+		dump_stack();
+	}
+
+	return rc;
+}
+
+/**
+ * write_block() - Issue sync write, optionally using a worker thread.
+ * @znd: ZDM Instance
+ * @dtype: Type of memory in data
+ * @data: Data for I/O
+ * @lba: bLBA for I/O [4k resolution]
+ * @count: Number of 4k blocks to read/write.
+ * @queue: if true then use worker thread for I/O and wait.
+ *
+ * Return 0 on success, otherwise error.
+ */
+static int write_block(struct zdm *znd, enum dm_io_mem_type dtype,
+		       void *data, u64 lba, unsigned int count, int queue)
+{
+	unsigned int op_flags = 0;
+
+	return writef_block(znd, dtype, data, lba, op_flags, count, queue);
+}
+
+/**
+ * struct zsplit_hook - Extra data attached to a hooked bio
+ * @znd: ZDM Instance to update on BIO completion.
+ * @endio: BIO's original bi_end_io handler
+ * @private: BIO's original bi_private data.
+ */
+struct zsplit_hook {
+	struct zdm *znd;
+	bio_end_io_t *endio;
+	void *private;
+};
+
+/**
+ * hook_bio() - Wrapper for hooking bio's endio function.
+ * @znd: ZDM Instance
+ * @split: Split bio to hook.
+ * @endiofn: End IO Function to hook with.
+ */
+static int hook_bio(struct zdm *znd, struct bio *split, bio_end_io_t *endiofn)
+{
+	struct zsplit_hook *hook = kmalloc(sizeof(*hook), GFP_NOIO);
+
+	if (!hook) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		return -ENOMEM;
+	}
+
+	/*
+	 * On endio report back to ZDM Instance and restore
+	 * the original bi_private and bi_end_io.
+	 * Since all of our splits are also chain'd we also
+	 * 'know' that bi_private will be the bio we sharded
+	 * and that bi_end_io is the bio_chain_endio helper.
+	 */
+	hook->znd = znd;
+	hook->private = split->bi_private; /* = bio */
+	hook->endio = split->bi_end_io; /* = bio_chain_endio */
+
+	/*
+	 * Now on complete the bio will call endiofn which is 'zsplit_endio'
+	 * and we can record the update WP location and restore the
+	 * original bi_private and bi_end_io
+	 */
+	split->bi_private = hook;
+	split->bi_end_io = endiofn;
+
+	return 0;
+}
+
+
+/**
+ * TODO: On write error such as this we can incr wp-used but we need
+ * to re-queue/re-map the write to a new location on disk?
+ *
+ * sd 0:0:1:0: [sdb] tag#1 FAILED Result: hostbyte=DID_SOFT_ERROR
+ *	driverbyte=DRIVER_OK
+ * sd 0:0:1:0: [sdb] tag#1
+ *	CDB: Write(16) 8a 00 00 00 00 00 06 d4 92 a0 00 00 00 08 00 00
+ *	blk_update_request: I/O error, dev sdb, sector 114594464
+ * exec scsi cmd failed,opcode:133
+ * sdb: command 1 failed
+ * sd 0:0:1:0: [sdb] tag#1
+ *	CDB: Write(16) 8a 00 00 00 00 00 00 01 0a 70 00 00 00 18 00 00
+ * mpt3sas_cm0:    sas_address(0x4433221105000000), phy(5)
+ * mpt3sas_cm0:    enclosure_logical_id(0x500605b0074854d0),slot(6)
+ * mpt3sas_cm0:    enclosure level(0x0000), connector name(     )
+ * mpt3sas_cm0:    handle(0x000a), ioc_status(success)(0x0000), smid(17)
+ * mpt3sas_cm0:    request_len(12288), underflow(12288), resid(-1036288)
+ * mpt3sas_cm0:    tag(65535), transfer_count(1048576), sc->result(0x00000000)
+ * mpt3sas_cm0:    scsi_status(check condition)(0x02),
+ *	scsi_state(autosense valid )(0x01)
+ * mpt3sas_cm0:    [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
+ * Aborting journal on device dm-0-8.
+ * EXT4-fs error (device dm-0):
+ *	ext4_journal_check_start:56: Detected aborted journal
+ * EXT4-fs (dm-0): Remounting filesystem read-only
+ */
+
+/**
+ * _common_endio() - Common endio handling to update the internal WP.
+ * @znd: ZDM Instance
+ * @bio: Bio being completed.
+ *
+ * On write completion, update the used-block accounting (and thus the WP)
+ * for the zone(s) the bio landed in.
+ */
+static void _common_endio(struct zdm *znd, struct bio *bio)
+{
+	u64 lba = bio->bi_iter.bi_sector >> Z_SHFT4K;
+	u32 blks = bio->bi_iter.bi_size / Z_C4K;
+
+#if ENABLE_SEC_METADATA
+	struct block_device *bdev = znd->dev->bdev;
+
+	if (bio->bi_bdev != bdev->bd_contains && bio->bi_bdev != bdev)
+		return;
+
+	switch (znd->meta_dst_flag) {
+	case DST_TO_PRI_DEVICE:
+	case DST_TO_BOTH_DEVICE:
+		if (bio_op(bio) == REQ_OP_WRITE && lba > znd->start_sect) {
+			lba -= znd->start_sect;
+			if (lba > 0)
+				increment_used_blks(znd, lba - 1, blks + 1);
+		}
+		break;
+	case DST_TO_SEC_DEVICE:
+		if (bio_op(bio) == REQ_OP_WRITE &&
+		    lba > znd->sec_dev_start_sect) {
+
+			lba = lba - znd->sec_dev_start_sect
+				 - (znd->sec_zone_align >> Z_SHFT4K);
+			if (lba > 0)
+				increment_used_blks(znd, lba - 1, blks + 1);
+		}
+		break;
+	}
+#else
+	if (bio_op(bio) == REQ_OP_WRITE && lba > znd->start_sect) {
+		lba -= znd->start_sect;
+		if (lba > 0)
+			increment_used_blks(znd, lba - 1, blks + 1);
+	}
+#endif
+}
+
+/**
+ * zoned_endio() - DM bio completion notification.
+ * @ti: DM Target instance.
+ * @bio: Bio being completed.
+ * @err: Error associated with bio.
+ *
+ * Non-split and non-dm_io bios end notification is here.
+ * Update the WP location for REQ_OP_WRITE bios.
+ */
+static int zoned_endio(struct dm_target *ti, struct bio *bio, int err)
+{
+	struct zdm *znd = ti->private;
+
+	_common_endio(znd, bio);
+	return 0;
+}
+
+/**
+ * zsplit_endio() - Bio endio tracking to update the internal WP.
+ * @bio: Bio being completed.
+ *
+ * Bios that are split for writing are usually split to land on a zone
+ * boundary. Forward the bio along the endio path and update the WP.
+ */
+static void zsplit_endio(struct bio *bio)
+{
+	struct zsplit_hook *hook = bio->bi_private;
+	struct bio *parent = hook->private;
+	struct zdm *znd = hook->znd;
+
+	_common_endio(znd, bio);
+
+	bio->bi_private = hook->private;
+	bio->bi_end_io = hook->endio;
+
+	/* On split bio's we are responsible for de-ref'ing and freeing */
+	bio_put(bio);
+	if (parent)
+		bio_endio(parent);
+
+	/* release our temporary private data */
+	kfree(hook);
+}
+
+/**
+ * zsplit_bio() - Split and chain a bio.
+ * @znd: ZDM Instance
+ * @bio: Bio to split
+ * @sectors: Number of sectors.
+ *
+ * Return: split bio.
+ */
+static struct bio *zsplit_bio(struct zdm *znd, struct bio *bio, int sectors)
+{
+	struct bio *split = bio;
+
+	if (bio_sectors(bio) > sectors) {
+		split = bio_split(bio, sectors, GFP_NOIO, znd->bio_set);
+		if (!split)
+			goto out;
+		bio_chain(split, bio);
+		if (bio_data_dir(bio) == REQ_OP_WRITE)
+			hook_bio(znd, split, zsplit_endio);
+	}
+out:
+	return split;
+}
+
+/**
+ * zm_cow() - Read Modify Write to write less than 4k size blocks.
+ * @znd: ZDM Instance
+ * @bio: Bio to write
+ * @s_zdm: tLBA
+ * @blks: number of blocks to RMW (should be 1).
+ * @origin: Current bLBA
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int zm_cow(struct zdm *znd, struct bio *bio, u64 s_zdm, u32 blks,
+		  u64 origin)
+{
+	int count = 1;
+	int use_wq = 1;
+	unsigned int bytes = bio_cur_bytes(bio);
+	u8 *data = bio_data(bio);
+	u8 *io = NULL;
+	u16 ua_off = bio->bi_iter.bi_sector & 0x0007;
+	u16 ua_size = bio->bi_iter.bi_size & 0x0FFF;	/* in bytes */
+	u32 mapped = 0;
+	u64 disk_lba = 0;
+
+	znd->is_empty = 0;
+	if (!znd->cow_block)
+		znd->cow_block = ZDM_ALLOC(znd, Z_C4K, PG_02, GFP_ATOMIC);
+
+	io = znd->cow_block;
+	if (!io)
+		return -EIO;
+
+	disk_lba = z_acquire(znd, Z_AQ_STREAM_ID, blks, &mapped);
+	if (!disk_lba || !mapped)
+		return -ENOSPC;
+
+	while (bytes) {
+		int ioer;
+		unsigned int iobytes = Z_C4K;
+		gfp_t gfp = GFP_ATOMIC;
+
+		/* ---------------------------------------------------------- */
+		if (origin) {
+			if (s_zdm != znd->cow_addr) {
+				Z_ERR(znd, "Copy block from %llx <= %llx",
+				      origin, s_zdm);
+				ioer = read_block(znd, DM_IO_KMEM, io, origin,
+						count, use_wq);
+				if (ioer)
+					return -EIO;
+
+				znd->cow_addr = s_zdm;
+			} else {
+				Z_ERR(znd, "Cached block from %llx <= %llx",
+				      origin, s_zdm);
+			}
+		} else {
+			memset(io, 0, Z_C4K);
+		}
+
+		if (ua_off)
+			iobytes -= ua_off * 512;
+
+		if (bytes < iobytes)
+			iobytes = bytes;
+
+		Z_ERR(znd, "Moving %u bytes from origin [offset:%u]",
+		      iobytes, ua_off * 512);
+
+		memcpy(io + (ua_off * 512), data, iobytes);
+
+		/* ---------------------------------------------------------- */
+
+		ioer = write_block(znd, DM_IO_KMEM, io, disk_lba, count, use_wq);
+		if (ioer)
+			return -EIO;
+
+		ioer = z_mapped_addmany(znd, s_zdm, disk_lba, mapped, gfp);
+		if (ioer) {
+			Z_ERR(znd, "%s: Map MANY failed.", __func__);
+			return -EIO;
+		}
+		increment_used_blks(znd, disk_lba, mapped);
+
+		data += iobytes;
+		bytes -= iobytes;
+		ua_size -= (ua_size > iobytes) ? iobytes : ua_size;
+		ua_off = 0;
+		disk_lba++;
+
+		if (bytes && (ua_size || ua_off)) {
+			s_zdm++;
+			origin = current_mapping(znd, s_zdm, gfp);
+		}
+	}
+	bio_endio(bio);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/**
+ * Write 4k blocks from cache to lba.
+ * Move any remaining 512 byte blocks to the start of cache and update
+ * the @_blen count.
+ */
+static int zm_write_cache(struct zdm *znd, struct io_dm_block *dm_vbuf,
+			  u64 lba, u32 *_blen)
+{
+	int use_wq    = 1;
+	int cached    = *_blen;
+	int blks      = cached >> 3;
+	int sectors   = blks << 3;
+	int remainder = cached - sectors;
+	int err;
+
+	err = write_block(znd, DM_IO_VMA, dm_vbuf, lba, blks, use_wq);
+	if (!err) {
+		if (remainder)
+			memcpy(dm_vbuf[0].data,
+			       dm_vbuf[sectors].data, remainder * 512);
+		*_blen = remainder;
+	}
+	return err;
+}
+
+/**
+ * zm_write_bios() - Map and write bios.
+ * @znd: ZDM Instance
+ * @bio: Bio to be written.
+ * @s_zdm: tLBA for mapping.
+ *
+ * Return: DM_MAPIO_SUBMITTED or negative on error.
+ */
+static int zm_write_bios(struct zdm *znd, struct bio *bio, u64 s_zdm)
+{
+	struct bio *split = NULL;
+	u32 acqflgs = Z_AQ_STREAM_ID | bio_stream(bio);
+	u64 lba     = 0;
+	u32 mapped  = 0;
+	int err     = -EIO;
+	int done    = 0;
+	int sectors;
+	u32 blks;
+#if ENABLE_SEC_METADATA
+	sector_t sector;
+#endif
+	znd->is_empty = 0;
+	do {
+		blks = dm_div_up(bio->bi_iter.bi_size, Z_C4K);
+		lba = z_acquire(znd, acqflgs, blks, &mapped);
+		if (!lba && mapped)
+			lba = z_acquire(znd, acqflgs, mapped, &mapped);
+
+		if (!lba) {
+			if (atomic_read(&znd->gc_throttle) == 0) {
+				err = -ENOSPC;
+				goto out;
+			}
+
+			Z_ERR(znd, "Throttle input ... Mandatory GC.");
+			if (delayed_work_pending(&znd->gc_work)) {
+				mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+				flush_delayed_work(&znd->gc_work);
+			}
+			continue;
+		}
+
+		sectors = mapped << Z_SHFT4K;
+		split = zsplit_bio(znd, bio, sectors);
+		if (split == bio)
+			done = 1;
+
+		if (!split) {
+			err = -ENOMEM;
+			Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+			goto out;
+		}
+
+#if ENABLE_SEC_METADATA
+		sector = lba << Z_SHFT4K;
+		split->bi_bdev = znd_get_backing_dev(znd, &sector);
+		split->bi_iter.bi_sector = sector;
+#else
+		split->bi_iter.bi_sector = lba << Z_SHFT4K;
+#endif
+		submit_bio(split);
+		err = z_mapped_addmany(znd, s_zdm, lba, mapped, GFP_ATOMIC);
+		if (err) {
+			Z_ERR(znd, "%s: Map MANY failed.", __func__);
+			err = DM_MAPIO_REQUEUE;
+			goto out;
+		}
+		s_zdm += mapped;
+	} while (!done);
+	err = DM_MAPIO_SUBMITTED;
+
+out:
+	return err;
+}
+
+/**
+ * zm_write_pages() - Copy bio pages to 4k aligned buffer. Write and map buffer.
+ * @znd: ZDM Instance
+ * @bio: Bio to be written.
+ * @s_zdm: tLBA for mapping.
+ *
+ * Return: DM_MAPIO_SUBMITTED or negative on error.
+ */
+static int zm_write_pages(struct zdm *znd, struct bio *bio, u64 s_zdm)
+{
+	u32 blks     = dm_div_up(bio->bi_iter.bi_size, Z_C4K);
+	u64 lba      = 0;
+	u32 blen     = 0; /* total: IO_VCACHE_PAGES * 8 */
+	u32 written  = 0;
+	int avail    = 0;
+	u32 acqflgs  = Z_AQ_STREAM_ID | bio_stream(bio);
+	int err;
+	gfp_t gfp = GFP_ATOMIC;
+	struct bvec_iter start;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	struct io_4k_block *io_vcache;
+	struct io_dm_block *dm_vbuf = NULL;
+
+	znd->is_empty = 0;
+	MutexLock(&znd->vcio_lock);
+	io_vcache = get_io_vcache(znd, gfp);
+	if (!io_vcache) {
+		Z_ERR(znd, "%s: FAILED to get SYNC CACHE.", __func__);
+		err = -ENOMEM;
+		goto out;
+	}
+
+	dm_vbuf = (struct io_dm_block *)io_vcache;
+
+	/* USE: dm_vbuf for dumping bio pages to disk ... */
+	start = bio->bi_iter; /* struct implicit copy */
+	do {
+		u64 alloc_ori = 0;
+		u32 mcount = 0;
+		u32 mapped = 0;
+
+reacquire:
+		/*
+		 * When lba is zero, no blocks were allocated.
+		 * Retry with a smaller request.
+		 */
+		lba = z_acquire(znd, acqflgs, blks - written, &mapped);
+		if (!lba && mapped)
+			lba = z_acquire(znd, acqflgs, mapped, &mapped);
+
+		if (!lba) {
+			if (atomic_read(&znd->gc_throttle) == 0) {
+				err = -ENOSPC;
+				goto out;
+			}
+
+			Z_ERR(znd, "Throttle input ... Mandatory GC.");
+			if (delayed_work_pending(&znd->gc_work)) {
+				mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+				flush_delayed_work(&znd->gc_work);
+			}
+			goto reacquire;
+		}
+
+		/* this may be redundant .. if we have lba we have mapped > 0 */
+		if (lba && mapped)
+			avail += mapped * 8; /* claimed pages in dm blocks */
+
+		alloc_ori = lba;
+
+		/* copy [upto mapped] pages to buffer */
+		__bio_for_each_segment(bv, bio, iter, start) {
+			int issue_write = 0;
+			unsigned int boff;
+			void *src;
+
+			if (avail <= 0) {
+				Z_ERR(znd, "%s: TBD: Close Z# %llu",
+					__func__, alloc_ori >> 16);
+				start = iter;
+				break;
+			}
+
+			src = kmap_atomic(bv.bv_page);
+			boff = bv.bv_offset;
+			memcpy(dm_vbuf[blen].data, src + boff, bv.bv_len);
+			kunmap_atomic(src);
+			blen   += bv.bv_len / 512;
+			avail  -= bv.bv_len / 512;
+
+			if ((blen >= (mapped * 8)) ||
+			    (blen >= (BIO_CACHE_SECTORS - 8)))
+				issue_write = 1;
+
+			/*
+			 * If there is less than one 4k block of room in our
+			 * cache, send the available blocks to disk.
+			 */
+			if (issue_write) {
+				int blks = blen / 8;
+
+				err = zm_write_cache(znd, dm_vbuf, lba, &blen);
+				if (err) {
+					Z_ERR(znd, "%s: bio-> %" PRIx64
+					      " [%d of %d blks] -> %d",
+					      __func__, lba, blen, blks, err);
+					bio->bi_error = err;
+					bio_endio(bio);
+					goto out;
+				}
+
+				if (mapped < blks) {
+					Z_ERR(znd, "ERROR: Bad write %"
+					      PRId32 " beyond alloc'd space",
+					      mapped);
+				}
+
+				lba     += blks;
+				written += blks;
+				mcount  += blks;
+				mapped  -= blks;
+
+				if (mapped == 0) {
+					bio_advance_iter(bio, &iter, bv.bv_len);
+					start = iter;
+					break;
+				}
+			}
+		} /* end: __bio_for_each_segment */
+		if ((mapped > 0) && ((blen / 8) > 0)) {
+			int blks = blen / 8;
+
+			err = zm_write_cache(znd, dm_vbuf, lba, &blen);
+			if (err) {
+				Z_ERR(znd, "%s: bio-> %" PRIx64
+				      " [%d of %d blks] -> %d",
+				      __func__, lba, blen, blks, err);
+				bio->bi_error = err;
+				bio_endio(bio);
+				goto out;
+			}
+
+			if (mapped < blks) {
+				Z_ERR(znd, "ERROR: [2] Bad write %"
+				      PRId32 " beyond alloc'd space",
+				      mapped);
+			}
+
+			lba     += blks;
+			written += blks;
+			mcount  += blks;
+			mapped  -= blks;
+
+		}
+		err = z_mapped_addmany(znd, s_zdm, alloc_ori, mcount, gfp);
+		if (err) {
+			Z_ERR(znd, "%s: Map MANY failed.", __func__);
+			err = DM_MAPIO_REQUEUE;
+			/*
+			 * FIXME:
+			 * Ending the BIO here is causing a GFP:
+			 -       DEBUG_PAGEALLOC
+			 -    in Workqueue:
+			 -        writeback bdi_writeback_workfn (flush-252:0)
+			 -    backtrace:
+			 -      __map_bio+0x7a/0x280
+			 -      __split_and_process_bio+0x2e3/0x4e0
+			 -      ? __split_and_process_bio+0x22/0x4e0
+			 -      ? generic_start_io_acct+0x5/0x210
+			 -      dm_make_request+0x6b/0x100
+			 -      generic_make_request+0xc0/0x110
+			 -      ....
+			 -
+			 - bio->bi_error = err;
+			 - bio_endio(bio);
+			 */
+			goto out;
+		}
+		increment_used_blks(znd, alloc_ori, mcount);
+
+		if (written < blks)
+			s_zdm += written;
+
+		if (written == blks && blen > 0)
+			Z_ERR(znd, "%s: blen: %d un-written blocks!!",
+			      __func__, blen);
+	} while (written < blks);
+	bio_endio(bio);
+	err = DM_MAPIO_SUBMITTED;
+
+out:
+	put_io_vcache(znd, io_vcache);
+	mutex_unlock(&znd->vcio_lock);
+
+	return err;
+}
+
+/**
+ * is_empty_page() - Scan memory range for any set bits.
+ * @pg: The start of memory to be scanned.
+ * @len: Number of bytes to check (should be long aligned)
+ * Return: 0 if any bits are set, 1 if all bits are 0.
+ */
+static int is_empty_page(void *pg, size_t len)
+{
+	unsigned long *chk = pg;
+	size_t count = len / sizeof(*chk);
+	size_t entry;
+
+	for (entry = 0; entry < count; entry++) {
+		if (chk[entry])
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * is_zero_bio() - Scan bio to see if all bytes are 0.
+ * @bio: The bio to be scanned.
+ * Return: 1 if all bits are 0. 0 if any bits in bio are set.
+ */
+static int is_zero_bio(struct bio *bio)
+{
+	int is_empty = 0;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	/* Scan bio to determine if it is zero'd */
+	bio_for_each_segment(bv, bio, iter) {
+		unsigned int boff;
+		void *src;
+
+		src = kmap_atomic(bv.bv_page);
+		boff = bv.bv_offset;
+		is_empty = is_empty_page(src + boff, bv.bv_len);
+		kunmap_atomic(src);
+
+		if (!is_empty)
+			break;
+	} /* end: __bio_for_each_segment */
+
+	return is_empty;
+}
+
+/**
+ * is_bio_aligned() - Test bio and bio_vec for 4k aligned pages.
+ * @bio: Bio to be tested.
+ * Return: 1 if bio is 4k aligned, 0 if not.
+ */
+static int is_bio_aligned(struct bio *bio)
+{
+	int aligned = 1;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	bio_for_each_segment(bv, bio, iter) {
+		if ((bv.bv_offset & 0x0FFF) || (bv.bv_len & 0x0FFF)) {
+			aligned = 0;
+			break;
+		}
+	}
+	return aligned;
+}
+
+/**
+ * zoned_map_write() - Write a bio by the fastest safe method.
+ * @znd: ZDM Instance
+ * @bio: Bio to be written
+ * @s_zdm: tLBA for mapping.
+ *
+ * Bios that are less than 4k need RMW.
+ * Bios that are single pages are deduped and written or discarded.
+ * Bios that are multiple pages with 4k aligned bvecs are written as bio(s).
+ * Bios that are multiple pages and misaligned are copied to an aligned buffer
+ * and submitted as new I/O.
+ */
+static int zoned_map_write(struct zdm *znd, struct bio *bio, u64 s_zdm)
+{
+	u32 blks     = dm_div_up(bio->bi_iter.bi_size, Z_C4K);
+	u16 ua_off   = bio->bi_iter.bi_sector & 0x0007;
+	u16 ua_size  = bio->bi_iter.bi_size & 0x0FFF;	/* in bytes */
+	int rcode    = -EIO;
+	unsigned long flags;
+
+	if (ua_size || ua_off) {
+		u64 origin;
+
+		origin = current_mapping(znd, s_zdm, GFP_ATOMIC);
+		if (origin) {
+			rcode = zm_cow(znd, bio, s_zdm, blks, origin);
+			spin_lock_irqsave(&znd->stats_lock, flags);
+			if (znd->htlba < (s_zdm + blks))
+				znd->htlba = s_zdm + blks;
+			spin_unlock_irqrestore(&znd->stats_lock, flags);
+		}
+		return rcode;
+	}
+
+	/*
+	 * For larger bios test for 4k alignment.
+	 * When bios are mis-aligned we must copy the
+	 * mis-aligned pages into a new bio and submit it.
+	 * [The 4k alignment requests on our queue may be ignored
+	 *  by mis-behaving layers that are not 4k safe].
+	 */
+	if (is_bio_aligned(bio)) {
+		if (is_zero_bio(bio)) {
+			rcode = zoned_map_discard(znd, bio, s_zdm);
+		} else {
+			rcode = zm_write_bios(znd, bio, s_zdm);
+			spin_lock_irqsave(&znd->stats_lock, flags);
+			if (znd->htlba < (s_zdm + blks))
+				znd->htlba = s_zdm + blks;
+			spin_unlock_irqrestore(&znd->stats_lock, flags);
+		}
+	} else {
+		rcode = zm_write_pages(znd, bio, s_zdm);
+		spin_lock_irqsave(&znd->stats_lock, flags);
+		if (znd->htlba < (s_zdm + blks))
+			znd->htlba = s_zdm + blks;
+		spin_unlock_irqrestore(&znd->stats_lock, flags);
+	}
+
+	return rcode;
+}
+
+/**
+ * zm_read_bios() - Read bios from device
+ * @znd: ZDM Instance
+ * @bio: Bio to read
+ * @s_zdm: tLBA to read from.
+ *
+ * Return DM_MAPIO_SUBMITTED or negative on error.
+ */
+static int zm_read_bios(struct zdm *znd, struct bio *bio, u64 s_zdm)
+{
+	struct bio *split = NULL;
+	int rcode = DM_MAPIO_SUBMITTED;
+	u64 blba;
+	u32 blks;
+	int sectors;
+	int count;
+	u16 ua_off;
+	u16 ua_size;
+	gfp_t gfp = GFP_ATOMIC;
+#if ENABLE_SEC_METADATA
+	sector_t sector;
+#endif
+
+	do {
+		count = blks = dm_div_up(bio->bi_iter.bi_size, Z_C4K);
+		ua_off = bio->bi_iter.bi_sector & 0x0007;
+		ua_size = bio->bi_iter.bi_size & 0x0FFF;
+		blba = current_map_range(znd, s_zdm, &count, gfp);
+		s_zdm += count;
+		sectors = (count << Z_SHFT4K);
+		if (ua_size)
+			sectors += (ua_size >> SECTOR_SHIFT) - 8;
+
+		split = zsplit_bio(znd, bio, sectors);
+		if (!split) {
+			rcode = -ENOMEM;
+			Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+			goto out;
+		}
+		if (blba) {
+#if ENABLE_SEC_METADATA
+			sector = (blba << Z_SHFT4K) + ua_off;
+			split->bi_bdev = znd_get_backing_dev(znd, &sector);
+			split->bi_iter.bi_sector = sector;
+#else
+			split->bi_iter.bi_sector = (blba << Z_SHFT4K) + ua_off;
+#endif
+			submit_bio(split);
+		} else {
+			zero_fill_bio(split);
+			bio_endio(split);
+		}
+	} while (split != bio);
+
+out:
+	return rcode;
+}
+
+#define REQ_CHECKPOINT (REQ_FLUSH_SEQ | REQ_PREFLUSH | REQ_FUA)
+
+/**
+ * zoned_bio() - Handle an incoming BIO.
+ * @znd: ZDM Instance
+ */
+static int zoned_bio(struct zdm *znd, struct bio *bio)
+{
+	bool is_write = op_is_write(bio_op(bio));
+	u64 s_zdm = (bio->bi_iter.bi_sector >> Z_SHFT4K) + znd->md_end;
+	int rcode = DM_MAPIO_SUBMITTED;
+	struct request_queue *q;
+	bool op_is_flush = false;
+	bool do_flush_workqueue = false;
+	bool do_end_bio = false;
+
+	/* map to backing device ... NOT dm-zdm device */
+	bio->bi_bdev = znd->dev->bdev;
+
+	q = bdev_get_queue(bio->bi_bdev);
+	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, q);
+
+	if (is_write && znd->meta_result) {
+		if (!(bio_op(bio) == REQ_OP_DISCARD)) {
+			rcode = znd->meta_result;
+			Z_ERR(znd, "MAP ERR (meta): %d", rcode);
+			goto out;
+		}
+	}
+
+	if (is_write) {
+		if ((bio->bi_opf & REQ_CHECKPOINT) ||
+		     bio_op(bio) == REQ_OP_FLUSH) {
+
+#if ENABLE_SEC_METADATA
+			if (znd->meta_dst_flag != DST_TO_SEC_DEVICE)
+				bio->bi_opf &= ~REQ_CHECKPOINT;
+#else
+			bio->bi_opf &= ~REQ_CHECKPOINT;
+#endif
+			set_bit(DO_SYNC, &znd->flags);
+			set_bit(DO_FLUSH, &znd->flags);
+			op_is_flush = true;
+			do_flush_workqueue = true;
+		}
+	}
+
+	if (znd->last_op_is_flush && op_is_flush && bio->bi_iter.bi_size == 0) {
+		do_end_bio = true;
+		goto out;
+	}
+
+	Z_DBG(znd, "%s: U:%lx sz:%u (%lu) -> s:%"PRIx64"-> %s%s", __func__,
+	      bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
+	      bio->bi_iter.bi_size / PAGE_SIZE, s_zdm,
+	      op_is_flush ? "F" : (is_write ? "W" : "R"),
+	      bio_op(bio) == REQ_OP_DISCARD ? "+D" : "");
+
+	if (bio->bi_iter.bi_size) {
+		if (bio_op(bio) == REQ_OP_DISCARD) {
+			rcode = zoned_map_discard(znd, bio, s_zdm);
+		} else if (is_write) {
+			const gfp_t gfp = GFP_ATOMIC;
+			const int gc_wait = 0;
+
+			rcode = zoned_map_write(znd, bio, s_zdm);
+			if (znd->z_gc_free < (znd->gc_wm_crit + 2))
+				gc_immediate(znd, gc_wait, gfp);
+
+		} else {
+			rcode = zm_read_bios(znd, bio, s_zdm);
+		}
+		znd->age = jiffies;
+	} else {
+		do_end_bio = true;
+	}
+
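+	/* More than 25 MiB in use: ask the meta worker to trim caches */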
+	if (znd->memstat > 25 << 20)
+		set_bit(DO_MEMPOOL, &znd->flags);
+
+	if (test_bit(DO_FLUSH, &znd->flags) ||
+	    test_bit(DO_SYNC, &znd->flags) ||
+	    test_bit(DO_MAPCACHE_MOVE, &znd->flags) ||
+	    test_bit(DO_MEMPOOL, &znd->flags)) {
+		if (!test_bit(DO_METAWORK_QD, &znd->flags) &&
+		    !work_pending(&znd->meta_work)) {
+			set_bit(DO_METAWORK_QD, &znd->flags);
+			queue_work(znd->meta_wq, &znd->meta_work);
+		}
+	}
+
+	if (znd->trim->count > MC_HIGH_WM ||
+	    znd->unused->count > MC_HIGH_WM ||
+	    znd->wbjrnl->count > MC_HIGH_WM ||
+	    znd->ingress->count > MC_HIGH_WM)
+		do_flush_workqueue = true;
+
+	if (do_flush_workqueue && work_pending(&znd->meta_work))
+		flush_workqueue(znd->meta_wq);
+
+out:
+	if (do_end_bio) {
+		if (znd->meta_result) {
+			bio->bi_error = znd->meta_result;
+			znd->meta_result = 0;
+		}
+		bio_endio(bio);
+	}
+
+	Z_DBG(znd, "%s: ..... -> s:%"PRIx64"-> rc: %d", __func__, s_zdm, rcode);
+
+	znd->last_op_is_flush = op_is_flush;
+
+	return rcode;
+}
+
+/**
+ * _do_mem_purge() - conditionally trigger a reduction of cache memory
+ * @znd: ZDM Instance
+ */
+static inline int _do_mem_purge(struct zdm *znd)
+{
+	const int pool_size = znd->cache_size >> 1;
+	int do_work = 0;
+
+	if (atomic_read(&znd->incore) > pool_size) {
+		set_bit(DO_MEMPOOL, &znd->flags);
+		if (!work_pending(&znd->meta_work))
+			do_work = 1;
+	}
+	return do_work;
+}
+
+/**
+ * on_timeout_activity() - Periodic background task execution.
+ * @znd: ZDM Instance
+ * @delay: Delay metric for periodic GC
+ *
+ * NOTE: Executed as a worker task queued from a timer.
+ */
+static void on_timeout_activity(struct zdm *znd, int delay)
+{
+	int max_tries = 1;
+
+	if (test_bit(ZF_FREEZE, &znd->flags))
+		return;
+
+	if (is_expired_msecs(znd->flush_age, 30000)) {
+		if (!test_bit(DO_METAWORK_QD, &znd->flags) &&
+		    !work_pending(&znd->meta_work)) {
+			Z_DBG(znd, "Periodic FLUSH");
+			set_bit(DO_SYNC, &znd->flags);
+			set_bit(DO_FLUSH, &znd->flags);
+			set_bit(DO_METAWORK_QD, &znd->flags);
+			queue_work(znd->meta_wq, &znd->meta_work);
+			znd->flush_age = jiffies_64;
+		}
+	}
+
+	if (is_expired_msecs(znd->age, DISCARD_IDLE_MSECS))
+		max_tries = 20;
+
+	do {
+		int count;
+
+		count = unmap_deref_chunk(znd, 2048, 0, GFP_KERNEL);
+		if (count == -EAGAIN) {
+			if (!work_pending(&znd->meta_work)) {
+				set_bit(DO_METAWORK_QD, &znd->flags);
+				queue_work(znd->meta_wq, &znd->meta_work);
+			}
+			break;
+		}
+		if (count != 1 || --max_tries < 0)
+			break;
+
+		if (test_bit(ZF_FREEZE, &znd->flags))
+			return;
+
+	} while (is_expired_msecs(znd->age, DISCARD_IDLE_MSECS));
+
+	gc_queue_with_delay(znd, delay, GFP_KERNEL);
+
+	if (_do_mem_purge(znd))
+		queue_work(znd->meta_wq, &znd->meta_work);
+}
+
+/**
+ * bg_work_task() - periodic background worker
+ * @work: context for worker thread
+ */
+static void bg_work_task(struct work_struct *work)
+{
+	struct zdm *znd;
+	const int delay = 1;
+
+	if (!work)
+		return;
+
+	znd = container_of(work, struct zdm, bg_work);
+	on_timeout_activity(znd, delay);
+}
+
+/**
+ * activity_timeout() - Handler for timer used to trigger background worker.
+ * @data: context for timer.
+ */
+static void activity_timeout(unsigned long data)
+{
+	struct zdm *znd = (struct zdm *) data;
+
+	if (!work_pending(&znd->bg_work))
+		queue_work(znd->bg_wq, &znd->bg_work);
+
+	if (!test_bit(ZF_FREEZE, &znd->flags))
+		mod_timer(&znd->timer, jiffies + msecs_to_jiffies(2500));
+}
+
+/**
+ * get_dev_size() - Report accessible size of device to upper layer.
+ * @ti: DM Target
+ *
+ * Return: Size in 512 byte sectors
+ */
+static sector_t get_dev_size(struct dm_target *ti)
+{
+	struct zdm *znd = ti->private;
+	u64 sz = i_size_read(get_bdev_bd_inode(znd));	/* size in bytes. */
+	u64 lut_resv = znd->gz_count * znd->mz_provision;
+
+	/*
+	 * NOTE: `sz` should match `ti->len` when the dm_table
+	 *       is setup correctly
+	 */
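+	/* Each reserved zone is Z_SMR_SZ_BYTES: 4 KiB << 16 = 256 MiB */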
+	sz -= (lut_resv * Z_SMR_SZ_BYTES);
+
+	return to_sector(sz);
+}
+
+/**
+ * zoned_iterate_devices() - Iterate over devices call fn() at each.
+ * @ti: DM Target
+ * @fn: Function for each callout
+ * @data: Context for fn().
+ */
+static int zoned_iterate_devices(struct dm_target *ti,
+				 iterate_devices_callout_fn fn, void *data)
+{
+	struct zdm *znd = ti->private;
+	int rc;
+
+	rc = fn(ti, znd->dev, 0, get_dev_size(ti), data);
+	return rc;
+}
+
+/**
+ * zoned_io_hints() - The place to tweak queue limits for DM targets
+ * @ti: DM Target
+ * @limits: queue_limits for this DM target
+ */
+static void zoned_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct zdm *znd = ti->private;
+	u64 io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+
+	/*
+	 * If the system-determined stacked limits are compatible with the
+	 * zdm device's blocksize (io_opt is a factor) do not override them.
+	 */
+	if (io_opt_sectors < 8 || do_div(io_opt_sectors, 8)) {
+		blk_limits_io_min(limits, 0);
+		blk_limits_io_opt(limits, 8 << SECTOR_SHIFT);
+	}
+
+	limits->logical_block_size =
+		limits->physical_block_size =
+		limits->io_min = Z_C4K;
+	if (znd->enable_trim) {
+		limits->discard_alignment = Z_C4K;
+		limits->discard_granularity = Z_C4K;
+		limits->max_discard_sectors = 1 << 20;
+		limits->max_hw_discard_sectors = 1 << 20;
+		limits->discard_zeroes_data = 1;
+	}
+}
+
+/**
+ * zoned_status() - Report status of DM Target
+ * @ti: DM Target
+ * @type: Type of status to report.
+ * @status_flags: Flags
+ * @result: Fill in with status.
+ * @maxlen: Maximum number of bytes for result.
+ */
+static void zoned_status(struct dm_target *ti, status_type_t type,
+			 unsigned int status_flags, char *result,
+			 unsigned int maxlen)
+{
+	struct zdm *znd = (struct zdm *) ti->private;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+
+	case STATUSTYPE_TABLE:
+		scnprintf(result, maxlen, "%s Z#%u", znd->dev->name,
+			 znd->zdstart);
+		break;
+	}
+}
+
+/* -------------------------------------------------------------------------- */
+/* --ProcFS Support Routines------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+#if defined(CONFIG_PROC_FS)
+
+/**
+ * struct zone_info_entry - Proc zone entry.
+ * @zone: Zone Index
+ * @info: Info (WP/Used).
+ */
+struct zone_info_entry {
+	u32 zone;
+	u32 info;
+};
+
+/**
+ * Start writing out our proc entry.
+ */
+static void *proc_wp_start(struct seq_file *seqf, loff_t *pos)
+{
+	struct zdm *znd = seqf->private;
+
+	if (*pos == 0)
+		znd->wp_proc_at = *pos;
+	return &znd->wp_proc_at;
+}
+
+/**
+ * Increment to our next zone.
+ */
+static void *proc_wp_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+	struct zdm *znd = seqf->private;
+	u32 zone = ++znd->wp_proc_at;
+
+	return zone < znd->zone_count ? &znd->wp_proc_at : NULL;
+}
+
+/**
+ * Stop ... a place to free resources that we don't hold .. [noop].
+ */
+static void proc_wp_stop(struct seq_file *seqf, void *v)
+{
+}
+
+/**
+ * Write as many entries as possible.
+ */
+static int proc_wp_show(struct seq_file *seqf, void *v)
+{
+	int err = 0;
+	struct zdm *znd = seqf->private;
+	u32 zone = znd->wp_proc_at;
+	u32 out = 0;
+
+	while (zone < znd->zone_count) {
+		u32 gzno  = zone >> GZ_BITS;
+		u32 gzoff = zone & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		struct zone_info_entry entry;
+
+		entry.zone = zone;
+		entry.info = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+		err = seq_write(seqf, &entry, sizeof(entry));
+		if (err) {
+			/*
+			 * write failure is temporary ..
+			 * just return and try again
+			 */
+			err = 0;
+			goto out;
+		}
+		out++;
+		zone = ++znd->wp_proc_at;
+	}
+
+out:
+	if (err)
+		Z_ERR(znd, "%s: %llu -> %d", __func__, znd->wp_proc_at, err);
+
+	return err;
+}
+
+/**
+ * zdm_wp_ops() - Seq_file operations for retrieving WP via proc fs
+ */
+static const struct seq_operations zdm_wp_ops = {
+	.start	= proc_wp_start,
+	.next	= proc_wp_next,
+	.stop	= proc_wp_stop,
+	.show	= proc_wp_show
+};
+
+/**
+ * zdm_wp_open() - Need to migrate our private data to the seq_file
+ */
+static int zdm_wp_open(struct inode *inode, struct file *file)
+{
+	/* seq_open will populate file->private_data with a seq_file */
+	int err = seq_open(file, &zdm_wp_ops);
+
+	if (!err) {
+		struct zdm *znd = PDE_DATA(inode);
+		struct seq_file *seqf = file->private_data;
+
+		seqf->private = znd;
+	}
+	return err;
+}
+
+/**
+ * zdm_wp_fops() - File operations for retrieving WP via proc fs
+ */
+static const struct file_operations zdm_wp_fops = {
+	.open		= zdm_wp_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+
+/**
+ * Start writing out our proc entry.
+ */
+static void *proc_used_start(struct seq_file *seqf, loff_t *pos)
+{
+	struct zdm *znd = seqf->private;
+
+	if (*pos == 0)
+		znd->wp_proc_at = *pos;
+	return &znd->wp_proc_at;
+}
+
+/**
+ * Increment to our next zone
+ */
+static void *proc_used_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+	struct zdm *znd = seqf->private;
+	u32 zone = ++znd->wp_proc_at;
+
+	return zone < znd->zone_count ? &znd->wp_proc_at : NULL;
+}
+
+/**
+ * Stop ... a place to free resources that we don't hold .. [noop].
+ */
+static void proc_used_stop(struct seq_file *seqf, void *v)
+{
+}
+
+/**
+ * proc_used_show() - Write as many 'used' entries as possible.
+ * @seqf: seq_file I/O handler
+ * @v: An unused parameter.
+ */
+static int proc_used_show(struct seq_file *seqf, void *v)
+{
+	int err = 0;
+	struct zdm *znd = seqf->private;
+	u32 zone = znd->wp_proc_at;
+	u32 out = 0;
+
+	while (zone < znd->zone_count) {
+		u32 gzno  = zone >> GZ_BITS;
+		u32 gzoff = zone & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		struct zone_info_entry entry;
+
+		entry.zone = zone;
+		entry.info = le32_to_cpu(wpg->zf_est[gzoff]);
+
+		err = seq_write(seqf, &entry, sizeof(entry));
+		if (err) {
+			/*
+			 * write failure is temporary ..
+			 * just return and try again
+			 */
+			err = 0;
+			goto out;
+		}
+		out++;
+		zone = ++znd->wp_proc_at;
+	}
+
+out:
+	if (err)
+		Z_ERR(znd, "%s: %llu -> %d", __func__, znd->wp_proc_at, err);
+
+	return err;
+}
+
+/**
+ * zdm_used_ops() - Seq_file Ops for retrieving 'used' state via proc fs
+ */
+static const struct seq_operations zdm_used_ops = {
+	.start	= proc_used_start,
+	.next	= proc_used_next,
+	.stop	= proc_used_stop,
+	.show	= proc_used_show
+};
+
+/**
+ * zdm_used_open() - Need to migrate our private data to the seq_file
+ */
+static int zdm_used_open(struct inode *inode, struct file *file)
+{
+	/* seq_open will populate file->private_data with a seq_file */
+	int err = seq_open(file, &zdm_used_ops);
+
+	if (!err) {
+		struct zdm *znd = PDE_DATA(inode);
+		struct seq_file *seqf = file->private_data;
+
+		seqf->private = znd;
+	}
+	return err;
+}
+
+/**
+ * zdm_used_fops() - File operations for retrieving 'used' state via proc fs
+ */
+static const struct file_operations zdm_used_fops = {
+	.open		= zdm_used_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+/**
+ * zdm_status_show() - Dump the status structure via proc fs
+ */
+static int zdm_status_show(struct seq_file *seqf, void *unused)
+{
+	struct zdm *znd = seqf->private;
+	struct zdm_ioc_status status;
+	u32 zone;
+
+	memset(&status, 0, sizeof(status));
+	for (zone = 0; zone < znd->zone_count; zone++) {
+		u32 gzno  = zone >> GZ_BITS;
+		u32 gzoff = zone & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp_at = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_VALUE_MASK;
+
+		status.b_used += wp_at;
+		status.b_available += Z_BLKSZ - wp_at;
+	}
+	status.map_cache_entries = znd->mc_entries;
+	status.discard_cache_entries = znd->dc_entries;
+	status.b_discard = znd->discard_count;
+	status.journal_pages = znd->wbjrnl->size / PAGE_SIZE;
+	status.journal_entries = znd->wbjrnl->count;
+
+	/*  fixed array of ->fwd_tm and ->rev_tm */
+	status.m_zones = znd->zone_count;
+
+	status.memstat = znd->memstat;
+	memcpy(status.bins, znd->bins, sizeof(status.bins));
+	status.mlut_blocks = atomic_read(&znd->incore);
+
+	return seq_write(seqf, &status, sizeof(status));
+}
+
+/**
+ * zdm_status_open() - Open seq_file from file.
+ * @inode: Our data is stuffed here, Retrieve it.
+ * @file: file objected used by seq_file.
+ */
+static int zdm_status_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, zdm_status_show, PDE_DATA(inode));
+}
+
+/**
+ * zdm_status_fops() - File operations to chain to zdm_status_open.
+ */
+static const struct file_operations zdm_status_fops = {
+	.open = zdm_status_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+/**
+ * zdm_info_show() - Report some information as text.
+ * @seqf: Sequence file for writing
+ * @unused: Not used.
+ */
+static int zdm_info_show(struct seq_file *seqf, void *unused)
+{
+	struct zdm *znd = seqf->private;
+	int bin;
+
+	seq_printf(seqf, "On device:     %s\n", _zdisk(znd));
+	seq_printf(seqf, "Data Zones:    %u\n", znd->data_zones);
+	seq_printf(seqf, "Empty Zones:   %u\n", znd->z_gc_free);
+	seq_printf(seqf, "Cached Pages:  %u\n", znd->mc_entries);
+	seq_printf(seqf, "Discard Pages: %u\n", znd->dc_entries);
+	seq_printf(seqf, "ZTL Pages:     %d\n", atomic_read(&znd->incore));
+	seq_printf(seqf, "   in ZTL:     %d\n", znd->in_zlt);
+	seq_printf(seqf, "   in LZY:     %d\n", znd->in_lzy);
+	seq_printf(seqf, "RAM in Use:    %lu [%lu MiB]\n",
+		   znd->memstat, dm_div_up(znd->memstat, 1 << 20));
+	seq_printf(seqf, "Zones GC'd:    %u\n", znd->gc_events);
+	seq_printf(seqf, "GC Throttle:   %d\n", atomic_read(&znd->gc_throttle));
+	seq_printf(seqf, "Ingress InUse: %u / %u\n",
+		znd->ingress->count, znd->ingress->size);
+	seq_printf(seqf, "Unused InUse:  %u / %u\n",
+		znd->unused->count, znd->unused->size);
+	seq_printf(seqf, "Discard InUse: %u / %u\n",
+		znd->trim->count, znd->trim->size);
+	seq_printf(seqf, "WB Jrnl InUse: %u / %u\n",
+		znd->wbjrnl->count, znd->wbjrnl->size);
+
+	seq_printf(seqf, "queue-depth=%u\n", znd->queue_depth);
+	seq_printf(seqf, "gc-prio-def=%u\n", znd->gc_prio_def);
+	seq_printf(seqf, "gc-prio-low=%u\n", znd->gc_prio_low);
+	seq_printf(seqf, "gc-prio-high=%u\n", znd->gc_prio_high);
+	seq_printf(seqf, "gc-prio-crit=%u\n", znd->gc_prio_crit);
+	seq_printf(seqf, "gc-wm-crit=%u\n", znd->gc_wm_crit);
+	seq_printf(seqf, "gc-wm-high=%u\n", znd->gc_wm_high);
+	seq_printf(seqf, "gc-wm-low=%u\n", znd->gc_wm_low);
+	seq_printf(seqf, "gc-status=%u\n", znd->gc_status);
+	seq_printf(seqf, "cache-ageout-ms=%u\n", znd->cache_ageout_ms);
+	seq_printf(seqf, "cache-size=%u\n", znd->cache_size);
+	seq_printf(seqf, "cache-to-pagecache=%u\n", znd->cache_to_pagecache);
+	seq_printf(seqf, "cache-reada=%u\n", znd->cache_reada);
+	seq_printf(seqf, "journal-age=%u\n", znd->journal_age);
+
+	for (bin = 0; bin < ARRAY_SIZE(znd->bins); bin++) {
+		if (znd->max_bins[bin])
+			seq_printf(seqf, "#%d: %d/%d\n",
+				bin, znd->bins[bin], znd->max_bins[bin]);
+	}
+
+#if ALLOC_DEBUG
+	seq_printf(seqf, "Max Allocs:    %u\n", znd->hw_allocs);
+#endif
+
+	return 0;
+}
+
+/**
+ * zdm_info_open() - Open seq_file from file.
+ * @inode: Our data is stuffed here, Retrieve it.
+ * @file: file objected used by seq_file.
+ */
+static int zdm_info_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, zdm_info_show, PDE_DATA(inode));
+}
+
+
+/**
+ * zdm_info_write() - Handle writes to /proc/zdm_sdXn/status
+ * @file: In memory file structure attached to ../status
+ * @buffer: User space buffer being written to status
+ * @count: Number of bytes in write
+ * @ppos: pseudo position within stream that buffer is starting at.
+ */
+static ssize_t zdm_info_write(struct file *file, const char __user *buffer,
+			  size_t count, loff_t *ppos)
+{
+	struct zdm *znd = PDE_DATA(file_inode(file));
+	char *user_data = NULL;
+
+	if (count > 32768)
+		return -EINVAL;
+
+	user_data = vmalloc(count+1);
+	if (!user_data) {
+		Z_ERR(znd, "Out of space for user buffer .. %zu", count + 1);
+		return -ENOMEM;
+	}
+	if (copy_from_user(user_data, buffer, count)) {
+		vfree(user_data);
+		return -EFAULT;
+	}
+	user_data[count] = 0;
+
+	/* do stuff */
+
+	/* echo a (possibly truncated) copy of user's input to kernel log */
+	if (count > 30)
+		user_data[30] = 0;
+	Z_ERR(znd, "User sent %zu bytes at offset %lld ... %s",
+	      count, *ppos, user_data);
+
+	vfree(user_data);
+
+	return count;
+}
+
+/**
+ * zdm_info_fops() - File operations to chain to zdm_info_open.
+ */
+static const struct file_operations zdm_info_fops = {
+	.open = zdm_info_open,
+	.read = seq_read,
+	.write = zdm_info_write,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+/**
+ * zdm_create_proc_entries() - Create proc entries for ZDM utilities
+ * @znd: ZDM Instance
+ */
+static int zdm_create_proc_entries(struct zdm *znd)
+{
+	snprintf(znd->proc_name, sizeof(znd->proc_name), "zdm_%s", _zdisk(znd));
+
+	znd->proc_fs = proc_mkdir(znd->proc_name, NULL);
+	if (!znd->proc_fs)
+		return -ENOMEM;
+
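+	/*
+	 * This creates /proc/zdm_<bdev>/ holding wp.bin, free.bin,
+	 * data.bin and status (the PROC_* names from dm-zdm.h).
+	 */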
+	proc_create_data(PROC_WP, 0, znd->proc_fs, &zdm_wp_fops, znd);
+	proc_create_data(PROC_FREE, 0, znd->proc_fs, &zdm_used_fops, znd);
+	proc_create_data(PROC_DATA, 0, znd->proc_fs, &zdm_status_fops, znd);
+	proc_create_data(PROC_STATUS, 0, znd->proc_fs, &zdm_info_fops, znd);
+
+	return 0;
+}
+
+/**
+ * zdm_remove_proc_entries() - Remove proc entries
+ * @znd: ZDM Instance
+ */
+static void zdm_remove_proc_entries(struct zdm *znd)
+{
+	remove_proc_subtree(znd->proc_name, NULL);
+}
+
+#else /* !CONFIG_PROC_FS */
+
+static int zdm_create_proc_entries(struct zdm *znd)
+{
+	(void)znd;
+	return 0;
+}
+static void zdm_remove_proc_entries(struct zdm *znd)
+{
+	(void)znd;
+}
+#endif /* CONFIG_PROC_FS */
+
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static void start_worker(struct zdm *znd)
+{
+	clear_bit(ZF_FREEZE, &znd->flags);
+	atomic_set(&znd->suspended, 0);
+	mod_timer(&znd->timer, jiffies + msecs_to_jiffies(5000));
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static void stop_worker(struct zdm *znd)
+{
+	set_bit(ZF_FREEZE, &znd->flags);
+	atomic_set(&znd->suspended, 1);
+	zoned_io_flush(znd);
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static void zoned_postsuspend(struct dm_target *ti)
+{
+	struct zdm *znd = ti->private;
+
+	stop_worker(znd);
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static void zoned_resume(struct dm_target *ti)
+{
+	/* TODO */
+}
+
+struct marg {
+	const char *prefix;
+	u32 *value;
+};
+
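+/*
+ * A minimal usage sketch (the target name 'zdm0' here is hypothetical;
+ * use whatever name the target was created with):
+ *   dmsetup message zdm0 0 gc-wm-low=5 cache-size=4096
+ * Arguments not matching a prefix in the table below are logged and
+ * otherwise ignored.
+ */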
+/**
+ * zoned_message() - dmsetup message sent to target.
+ * @ti: Target Instance
+ * @argc: Number of arguments
+ * @argv: Array of arguments
+ */
+static int zoned_message(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct zdm *znd = ti->private;
+	int iter;
+	struct marg margs[] = {
+		{ "queue-depth=",       &znd->queue_depth        },
+		{ "gc-prio-def=",       &znd->gc_prio_def        },
+		{ "gc-prio-low=",       &znd->gc_prio_low        },
+		{ "gc-prio-high=",      &znd->gc_prio_high       },
+		{ "gc-prio-crit=",      &znd->gc_prio_crit       },
+		{ "gc-wm-crit=",        &znd->gc_wm_crit         },
+		{ "gc-wm-high=",        &znd->gc_wm_high         },
+		{ "gc-wm-low=",         &znd->gc_wm_low          },
+		{ "gc-status=",         &znd->gc_status          },
+		{ "cache-ageout-ms=",   &znd->cache_ageout_ms    },
+		{ "cache-size=",        &znd->cache_size         },
+		{ "cache-to-pagecache=", &znd->cache_to_pagecache },
+		{ "cache-reada=",       &znd->cache_reada        },
+		{ "journal-age=",       &znd->journal_age        },
+	};
+
+	for (iter = 0; iter < argc; iter++) {
+		u64 tmp;
+		int ii, err;
+		bool handled = false;
+
+		for (ii = 0; ii < ARRAY_SIZE(margs); ii++) {
+			const char *opt = margs[ii].prefix;
+			int len = strlen(opt);
+
+			if (!strncasecmp(argv[iter], opt, len)) {
+				err = kstrtou64(argv[iter] + len, 0, &tmp);
+				if (err) {
+					Z_ERR(znd, "Invalid arg %s\n",
+						argv[iter]);
+					continue;
+				}
+				*margs[ii].value = tmp;
+				handled = true;
+				break;
+			}
+		}
+		if (!handled)
+			Z_ERR(znd, "Message: %s not handled.", argv[iter]);
+	}
+
+	if (znd->queue_depth == 0 && atomic_read(&znd->enqueued) > 0)
+		wake_up_process(znd->bio_kthread);
+
+	return 0;
+}
+
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static int zoned_preresume(struct dm_target *ti)
+{
+	struct zdm *znd = ti->private;
+
+	start_worker(znd);
+	return 0;
+}
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static struct target_type zoned_target = {
+	.name = "zdm",
+	.module = THIS_MODULE,
+	.version = {1, 0, 0},
+	.ctr = zoned_ctr,
+	.dtr = zoned_dtr,
+	.map = zoned_map,
+	.end_io = zoned_endio,
+	.postsuspend = zoned_postsuspend,
+	.preresume = zoned_preresume,
+	.resume = zoned_resume,
+	.status = zoned_status,
+	.message = zoned_message,
+	.iterate_devices = zoned_iterate_devices,
+	.io_hints = zoned_io_hints
+};
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+static int __init dm_zoned_init(void)
+{
+	int rcode = dm_register_target(&zoned_target);
+
+	if (rcode)
+		DMERR("zdm target registration failed: %d", rcode);
+
+	return rcode;
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+static void __exit dm_zoned_exit(void)
+{
+	dm_unregister_target(&zoned_target);
+}
+
+module_init(dm_zoned_init);
+module_exit(dm_zoned_exit);
+
+MODULE_DESCRIPTION(DM_NAME " zdm target for Host Aware/Managed drives.");
+MODULE_AUTHOR("Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-zdm.h b/drivers/md/dm-zdm.h
new file mode 100644
index 0000000..335a28c
--- /dev/null
+++ b/drivers/md/dm-zdm.h
@@ -0,0 +1,945 @@
+/*
+ * Kernel Device Mapper for abstracting ZAC/ZBC devices as normal
+ * block devices for linux file systems.
+ *
+ * Copyright (C) 2015,2016 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>
+ *
+ * Bio queue support and metadata relocation by:
+ * Vineet Agarwal <vineet.agarwal@xxxxxxxxxxx>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _DM_ZONED_H
+#define _DM_ZONED_H
+
+#define ALLOC_DEBUG		0
+#define ADBG_ENTRIES		65536
+
+#define ENABLE_SEC_METADATA	1
+
+#define WB_DELAY_MS		1
+
+#define NORMAL			GFP_KERNEL
+#define CRIT			GFP_NOIO
+
+#define DM_MSG_PREFIX		"zdm"
+
+#define PROC_WP			"wp.bin"
+#define PROC_FREE		"free.bin"
+#define PROC_DATA		"data.bin"
+#define PROC_STATUS		"status"
+
+#define ZDM_RESERVED_ZNR	0
+#define ZDM_CRC_STASH_ZNR	1 /* first 64 blocks */
+#define ZDM_RMAP_ZONE		2
+#define ZDM_SECTOR_MAP_ZNR	3
+#define ZDM_DATA_START_ZNR	4
+
+#define Z_WP_GC_FULL		(1u << 31)
+#define Z_WP_GC_ACTIVE		(1u << 30)
+#define Z_WP_GC_TARGET		(1u << 29)
+#define Z_WP_GC_READY		(1u << 28)
+#define Z_WP_GC_BITS		(0xFu << 28)
+
+#define Z_WP_GC_PENDING		(Z_WP_GC_FULL|Z_WP_GC_ACTIVE)
+#define Z_WP_NON_SEQ		(1u << 27)
+#define Z_WP_RRECALC		(1u << 26)
+#define Z_WP_RESV_02		(1u << 25)
+#define Z_WP_RESV_03		(1u << 24)
+
+#define Z_WP_VALUE_MASK		(~0u >> 8)
+#define Z_WP_FLAGS_MASK		(~0u << 24)
+#define Z_WP_STREAM_MASK	Z_WP_FLAGS_MASK
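+/*
+ * Example: the low 24 bits of a wp_alloc entry hold the write pointer,
+ * the upper 8 bits hold GC/type flags:
+ *   wp    = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_VALUE_MASK;
+ *   flags = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_FLAGS_MASK;
+ */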
+
+#define Z_AQ_GC			(1u << 31)
+#define Z_AQ_META		(1u << 30)
+#define Z_AQ_NORMAL		(1u << 29)
+#define Z_AQ_STREAM_ID		(1u << 28)
+#define Z_AQ_STREAM_MASK	(0xFF)
+#define Z_MDJRNL_SID		0xff
+#define Z_AQ_META_STREAM	(Z_AQ_META | Z_AQ_STREAM_ID | Z_MDJRNL_SID)
+
+#define Z_C4K			(4096ul)
+#define Z_SHFT_SEC		(9)
+#define Z_BLOCKS_PER_DM_SECTOR	(Z_C4K/512)
+#define MZ_METADATA_ZONES	(8ul)
+#define Z_SHFT4K		(3)
+
+#define Z_UNSORTED		(Z_C4K / sizeof(struct map_cache_entry))
+#define Z_MAP_MAX		(Z_UNSORTED - 1)
+
+#define Z_MCE_MAX	((PAGE_SIZE - sizeof(struct zdm *) - sizeof(int) * 4) \
+				/ sizeof(void*))
+#define MCE_SHIFT	(ilog2(PAGE_SIZE/sizeof(struct map_cache_entry)))
+#define MCE_MASK	((1 << MCE_SHIFT) - 1)
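+/*
+ * With 16-byte map_cache_entry records a 4k page holds 256 entries
+ * (Z_UNSORTED); one is reserved as a header so Z_MAP_MAX is 255.
+ * MCE_SHIFT/MCE_MASK split a map_pool index into a page number and
+ * an offset within that page.
+ */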
+
+#define LBA_SB_START		1
+#define SUPERBLOCK_MAGIC	0x5a6f4e65ul	/* ZoNe */
+#define SUPERBLOCK_CSUM_XOR	146538381
+#define MIN_ZONED_VERSION	1
+#define Z_VERSION		1
+#define MAX_ZONED_VERSION	1
+
+#define ZONE_SECT_BITS		19
+#define Z_BLKBITS		16
+#define Z_BLKSZ			(1ul << Z_BLKBITS)
+#define Z_SMR_SZ_BYTES		(Z_C4K << Z_BLKBITS)
+
+#define UUID_LEN		16
+
+#define Z_TYPE_SMR		2
+#define Z_TYPE_SMR_HA		1
+#define Z_VPD_INFO_BYTE		8
+
+/* hash map for CRC's and Lookup Tables */
+#define HCRC_ORDER		4
+#define HLUT_ORDER		8
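+/*
+ * Used as hashtable bucket orders: 2^8 = 256 buckets for the lookup
+ * tables, 2^4 = 16 for the CRC pages (see DECLARE_HASHTABLE() below).
+ */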
+
+/*
+ * INCR = distance between superblock writes from LBA 1.
+ *
+ * Superblock at LBA 1, 512, 1024 ...
+ * Maximum reserved size of each is 400 cache blocks (map, and discard cache)
+ *
+ * At 1536 - 1664 are WP + ZF_EST blocks
+ *
+ * From 1664 to the end of the zone [Z1 LBA] is metadata write-back
+ * journal blocks.
+ *
+ * The wb journal is used for metadata blocks between Z1 LBA and data_start LBA.
+ * When Z1 LBA == data_start LBA the journal is disabled and used for the
+ * key FWD lookup table blocks [the FWD blocks needed to map the FWD table].
+ * Ex. When the FWD table consumes 64 zones (16TiB) then a lookup table
+ * for 64*64 -> 4096 entries {~4 4k pages for 32bit LBAs) needs to be
+ * maintained within the superblock generations.
+ *
+ * MAX_CACHE_SYNC - Space in SB INCR reserved for map/discard cache blocks.
+ * CACHE_COPIES   - Number of SB to keep (Active, Previous, Backup)
+ *
+ */
+#define MAX_SB_INCR_SZ		512ul
+#define MAX_CACHE_SYNC		400ul
+#define CACHE_COPIES		3
+/* WP is 4k, and ZF_Est is 4k written in pairs: Hence 2x */
+#define MAX_WP_BLKS		64
+#define WP_ZF_BASE		(MAX_SB_INCR_SZ * CACHE_COPIES)
+#define FWD_KEY_BLOCKS		8
+#define FWD_KEY_BASE		(WP_ZF_BASE + (MAX_WP_BLKS * 2))
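+/*
+ * Sanity check of the layout described above: WP_ZF_BASE works out to
+ * 512 * 3 = 1536 and FWD_KEY_BASE to 1536 + (64 * 2) = 1664, matching
+ * the "1536 - 1664" WP + ZF_EST range noted in the comment block.
+ */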
+
+#define MC_HIGH_WM		4096
+#define MC_MOVE_SZ		512
+
+#define WB_JRNL_MIN		4096u
+#define WB_JRNL_MAX		WB_JRNL_MIN /* 16384u */
+#define WB_JRNL_BLKS		(WB_JRNL_MAX >> 10)
+#define WB_JRNL_IDX		(FWD_KEY_BASE + FWD_KEY_BLOCKS)
+#define WB_JRNL_BASE		(WB_JRNL_IDX + WB_JRNL_BLKS)
+
+#define IO_VCACHE_ORDER		8
+#define IO_VCACHE_PAGES		(1 << IO_VCACHE_ORDER)  /* 256 pages => 1MiB */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+enum superblock_flags_t {
+	SB_DIRTY = 1,
+};
+
+struct z_io_req_t {
+	struct dm_io_region *where;
+	struct dm_io_request *io_req;
+	struct work_struct work;
+	int result;
+};
+
+#define MAX_EXTENT_ORDER	17
+
+#define LBA48_BITS		45
+#define LBA48_XBITS		(64 - LBA48_BITS)
+#if LBA48_XBITS > MAX_EXTENT_ORDER
+    #define EXTENT_CEILING	(1 << MAX_EXTENT_ORDER)
+#else
+    #define EXTENT_CEILING	(1 << LBA48_XBITS)
+#endif
+#define LBA48_CEILING		(1ull << LBA48_BITS)
+
+#define Z_LOWER48		(~0ul >> LBA48_XBITS)
+#define Z_UPPER16		(~Z_LOWER48 >> LBA48_BITS)
+
+#define STREAM_SIZE		256
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+enum gc_opt_t {
+	GC_OFF = 0,
+	GC_ON,
+	GC_FORCE,
+};
+
+/**
+ * enum pg_flag_enum - Map Pg flags
+ * @IS_DIRTY: Block is modified from on-disk copy.
+ * @IS_STALE: ??
+ * @IS_FLUSH: If flush was issued when IS_DIRTY was not set.
+ *
+ * @IS_FWD: Is part of Forward ZLT or CRC table.
+ * @IS_REV: Is part of Reverse ZLT or CRC table.
+ * @IS_CRC: Is part of CRC table.
+ * @IS_LUT: Is part of ZTL table.
+ *
+ * @WB_JOURNAL: If WB is targeted to Journal or LUT entries.
+ *
+ * @R_IN_FLIGHT: Async read in progress.
+ * @W_IN_FLIGHT: ASync write in progress.
+ * @DELAY_ADD: Spinlock was busy for zltlst, added to lazy for transit.
+ * @STICKY: ???
+ * @IS_READA: Block will be pulled for Read Ahead. Cleared when used.
+ * @IS_DROPPED: Has been cleaned/expired from zlt_lst and is awaiting free().
+ *
+ * @IS_LAZY: Had been added to 'lazy' lzy_lst.
+ * @IN_ZLT: Had been added to 'inpool' zlt_lst.
+ *
+ */
+enum pg_flag_enum {
+	IS_DIRTY,         /*     1  */
+	IS_STALE,         /*     2  */
+	IS_FLUSH,         /*     4  */
+
+	IS_FWD,           /*     8  */
+	IS_REV,           /*    10  */
+	IS_CRC,           /*    20  */
+	IS_LUT,           /*    40  */
+
+	WB_RE_CACHE,      /*    80  */
+	IN_WB_JOURNAL,    /*   100  */
+
+	R_IN_FLIGHT,      /*   200  */
+	R_CRC_PENDING,    /*   400  */
+	W_IN_FLIGHT,      /*   800  */
+	DELAY_ADD,        /*   1000 */
+	STICKY,           /*   2000 */
+	IS_READA,         /*   4000 */
+	IS_DROPPED,       /*   8000 */
+	IS_LAZY,          /*  10000 */
+	IN_ZLT,           /*  20000 */
+	IS_ALLOC,         /*  40000 */
+	R_SCHED,          /*  80000 */
+};
+
+/**
+ * enum gc_flags_enum - Garbage Collection [GC] states.
+ */
+enum gc_flags_enum {
+	GC_IN_PROGRESS,
+	DO_GC_INIT,		/* -> GC_MD_MAP                          */
+	DO_GC_MD_MAP,		/* -> GC_MD_MAP   | GC_READ    | GC_DONE */
+	DO_GC_READ,		/* -> GC_WRITE    | GC_MD_SYNC           */
+	DO_GC_WRITE,		/* -> GC_CONTINUE                        */
+	DO_GC_CONTINUE,		/* -> GC_READ     | GC_MD_SYNC           */
+	DO_GC_MD_SYNC,		/* -> DO_GC_MD_ZLT                       */
+	DO_GC_MD_ZLT,		/* -> GC_DONE                            */
+	DO_GC_DONE,
+	DO_GC_COMPLETE,
+};
+
+/**
+ * enum znd_flags_enum - zdm state/feature/action flags
+ */
+enum znd_flags_enum {
+	ZF_FREEZE,
+	ZF_POOL_FWD,
+	ZF_POOL_REV,
+	ZF_POOL_CRCS,
+
+	ZF_RESV_1,
+	ZF_RESV_2,
+	ZF_RESV_3,
+	ZF_RESV_4,
+
+	DO_MAPCACHE_MOVE,
+	DO_MEMPOOL,
+	DO_SYNC,
+	DO_FLUSH,
+	DO_ZDM_RELOAD,
+	DO_GC_NO_PURGE,
+	DO_METAWORK_QD,
+};
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+struct zdm;
+
+/**
+ * struct map_addr - A page of map table
+ * @dm_s:    full map on dm layer
+ * @zone_id: z_id match zone_list_t.z_id
+ * @pg_idx:  entry in lut (0-1023)
+ * @lut_s:   sector table lba
+ * @lut_r:   reverse table lba
+ *
+ * Longer description of this structure.
+ */
+struct map_addr {
+	u64 dm_s;
+	u64 lut_s;
+	u64 lut_r;
+
+	u32 zone_id;
+	u32 pg_idx;
+};
+
+/**
+ * struct map_cache_entry - Sector to LBA mapping.
+ * @tlba: tlba
+ * @physical: blba or number of blocks
+ *
+ * Longer description of this structure.
+ */
+struct map_cache_entry {
+	__le64 tlba;		/* record type [16 bits] + logical sector # */
+	__le64 bval;	/* csum 16 [16 bits] + 'physical' block lba */
+} __packed;
+
+
+#define MCE_NO_ENTRY	0x4000
+#define MCE_NO_MERGE	0x8000
+
+/* One temporary buffer each for overlay/split/mergeback. */
+struct map_cache_page {
+	struct map_cache_entry header;
+	struct map_cache_entry maps[Z_MAP_MAX];
+} __packed;
+
+struct gc_map_cache_data {
+	struct map_cache_entry header;
+	struct map_cache_entry maps[Z_BLKSZ];
+} __packed;
+
+/**
+ * enum map_type_enum - Map Cache pool types
+ * @IS_INGRESS: Is an ingress map pool.
+ * @IS_WBJRNL: Is a meta data write-back journal pool.
+ * @IS_TRIM: Is a discard (trim) pool.
+ */
+enum map_type_enum {
+	IS_JRNL_PG = 1,
+	IS_POST_MAP,
+
+	IS_WBJRNL,
+	IS_INGRESS,
+	IS_TRIM,
+	IS_UNUSED,
+};
+
+/**
+ * struct gc_map_cache - An array of expected re-mapped blocks
+ * @gc_mcd: map_cache_entry array
+ * @cached_lock:
+ * @jcount:
+ * @jsorted:
+ * @jsize:
+ * @map_content: See map_type_enum
+ *
+ * Working set of mapping information during GC/zone reclamation.
+ */
+struct gc_map_cache {
+	struct gc_map_cache_data *gc_mcd;
+	spinlock_t cached_lock;
+	int jcount;
+	int jsorted;
+	int jsize;
+	int map_content;
+};
+
+/**
+ * struct map_pool - Mapping cache (ingress, discard, unused, journal)
+ * @count:	Number of current entries.
+ * @sorted:	Number of 'sorted' entries.
+ * @size:	Total size of array.
+ * @isa:	See map_type_enum
+ * @znd:	Reference to ZDM Instance
+ * @pgs:	Array of pages, each containing an array of entries.
+ *
+ * Working set of mapping information
+ */
+struct map_pool {
+	int count;
+	int sorted;
+	int size;
+	int isa;
+	struct zdm *znd;
+	struct map_cache_entry *pgs[Z_MCE_MAX];
+};
+
+/**
+ * union map_pg_data - A page of map data
+ * @addr:	Array of LBA's in little endian.
+ * @crc:	Array of 16 bit CRCs in little endian.
+ */
+union map_pg_data {
+	__le32 *addr;
+	__le16 *crc;
+};
+
+/**
+ * struct map_pg - A page of map table
+ *
+ * @data:	A block/page [4K] of table entries
+ * @refcount:	In use (reference counted).
+ * @age:	Most recent access time in jiffies
+ * @lba:	Logical position (use lookups to find actual)
+ * @last_write:	Last known LBA written on disk
+ * @hentry:	List of same hash entries.
+ * @md_lock:	Spinlock held when data is modified.
+ * @flags:	State and Type flags.
+ * @zltlst:	Entry in lookup table list.
+ * @lazy:	Entry in lazy list.
+ * @znd:	ZDM Instance (for async metadata I/O)
+ * @crc_pg:	Backing CRC page (for IS_LUT pages)
+ * @lba48_in:	Mapped LBA to be read.
+ * @event:	I/O completion event (async).
+ * @io_error:	I/O error (async).
+ * @io_count:	Number of I/O attempts.
+ * @hotness:	Additional time to keep cache block in memory.
+ * @index:	Offset from base [array index].
+ * @md_crc:	TBD: Use to test for dirty/clean during I/O instead of locking.
+ *
+ * Longer description of this structure.
+ */
+struct map_pg {
+	union map_pg_data data;
+	atomic_t refcount;
+	u64 age;
+	u64 lba;
+	u64 last_write;
+	struct hlist_node hentry;
+
+	spinlock_t md_lock;
+
+	unsigned long flags;
+	struct list_head zltlst;
+
+	/* in flight/cache hit tracking */
+	struct list_head lazy;
+
+	/* for async io */
+	struct zdm *znd;
+	struct map_pg *crc_pg;
+	u64 lba48_in;
+	struct completion event;
+	int io_error;
+	int io_count;
+	int hotness;
+	int index;
+	u16 gen;
+	__le16 md_crc;
+};
+
+/**
+ * struct map_crc - Map to backing crc16.
+ *
+ * @lba:	Backing CRC's LBA
+ * @pg_no:	Page [index] of table entry if applicable.
+ * @pg_idx:	Offset within page (From: zdm::md_crcs when table is null)
+ *
+ * Longer description of this structure.
+ */
+struct map_crc {
+	u64 lba;
+	int pg_no;
+	int pg_idx;
+};
+
+/**
+ * struct gc_state - A page of map table
+ * @znd: ZDM Instance
+ * @pgs: map pages held during GC.
+ * @gc_flags: See gc_flags_enum
+ * @r_ptr: Next read in zone.
+ * @w_ptr: Next write in target zone.
+ * @nblks: Number of blocks in I/O
+ * @result: GC operation result.
+ * @z_gc: Zone undergoing compaction
+ * @tag: System wide serial number (debugging).
+ *
+ * Longer description of this structure.
+ */
+struct gc_state {
+	struct zdm *znd;
+	struct completion gc_complete;
+	unsigned long gc_flags;
+	atomic_t refcount;
+
+	u32 r_ptr;
+	u32 w_ptr;
+
+	u32 nblks;		/* 1-65536 */
+	u32 z_gc;
+
+	int is_cpick;
+	int result;
+};
+
+/**
+ * struct mpinfo - Map to backing lookup table.
+ *
+ * @htable:       backing hash table
+ * @crc:          backing crc16 detail.
+ * @index:        index [page] of table entry. Use map_addr::pg_idx for offset.
+ * @bit_type:     IS_LUT or IS_CRC
+ * @bit_dir:      IS_FWD or IS_REV
+ *
+ * Longer description of this structure.
+ */
+struct mpinfo {
+	struct hlist_head *htable;
+	struct map_crc crc;
+	spinlock_t *lock;
+	int ht_order;
+	int index;
+	int bit_type;
+	int bit_dir;
+};
+
+/**
+ * struct meta_pg - A page of zone WP mapping.
+ *
+ * @wp_alloc: Bits 23-0: wp alloc location.   Bits 31-24: GC Flags, Type Flags
+ * @zf_est:   Bits 23-0: free block count.    Bits 31-24: Stream Id
+ * @wp_used:  Bits 23-0: wp written location. Bits 31-24: Ratio ReCalc flag.
+ * @lba: pinned LBA in conventional/preferred zone.
+ * @wplck: spinlock held during data updates.
+ * @flags: IS_DIRTY flag
+ *
+ * One page is used per 1024 zones on media.
+ * For an 8TB drive this uses 30 entries or about 360k RAM.
+ */
+struct meta_pg {
+	__le32 *wp_alloc;
+	__le32 *zf_est;
+	__le32 *wp_used;
+	u64 lba;
+	spinlock_t wplck;
+	unsigned long flags;
+};
+
+/**
+ * struct zdm_superblock - A page of map table
+ * @uuid:
+ * @nr_zones:
+ * @magic:
+ * @zdstart:
+ * @version:
+ * @packed_meta:
+ * @flags:
+ * @csum:
+ *
+ * Longer description of this structure.
+ */
+struct zdm_superblock {
+	u8 uuid[UUID_LEN];	/* 16 */
+	__le64 nr_zones;	/*  8 */
+	__le64 magic;		/*  8 */
+	__le32 resvd;		/*  4 */
+	__le32 zdstart;		/*  4 */
+	__le32 version;		/*  4 */
+	__le32 packed_meta;	/*  4 */
+	__le32 flags;		/*  4 */
+	__le32 csum;		/*  4 */
+} __packed;			/* 56 */
+
+/**
+ * struct mz_superkey - A page of map table
+ * @sig0:           8 - Native endian
+ * @sig1:           8 - Little endian
+ * @sblock:        56 -
+ * @stream:      1024 -
+ * @reserved:    2982 -
+ * @gc_resv:        4 -
+ * @meta_resv:      4 -
+ * @generation:     8 -
+ * @key_crc:        2 -
+ * @magic:          8 -
+ *
+ * Longer description of this structure.
+ */
+struct mz_superkey {
+	u64 sig0;
+	__le64 sig1;
+	struct zdm_superblock sblock;
+	__le32 stream[STREAM_SIZE];
+	__le32 gc_resv;
+	__le32 meta_resv;
+	__le16 n_crcs;
+	__le16 crcs[MAX_CACHE_SYNC];
+	__le16 md_crc;
+	__le16 wp_crc[64];
+	__le16 zf_crc[64];
+	__le16 discards;
+	__le16 maps;
+	__le16 unused;
+	__le16 wbjrnld;
+	u8 reserved[1904];
+	__le32 crc32;
+	__le64 generation;
+	__le64 magic;
+} __packed;
+
+/**
+ * struct io_4k_block - Sector to LBA mapping.
+ * @data:	A 4096 byte block
+ *
+ * Longer description of this structure.
+ */
+struct io_4k_block {
+	u8 data[Z_C4K];
+};
+
+/**
+ * struct io_dm_block - Sector to LBA mapping.
+ * @data:	A 512 byte block
+ *
+ * Longer description of this structure.
+ */
+struct io_dm_block {
+	u8 data[512];
+};
+
+struct stale_tracking {
+	u32 binsz;
+	u32 count;
+	int bins[STREAM_SIZE];
+};
+
+/*
+ * struct zdm_io_q - Queueing and ordering the BIO.
+ */
+struct zdm_q_node {
+	struct bio *bio;
+	struct list_head jlist;
+	sector_t bi_sector;
+	unsigned long jiffies;
+};
+
+struct zdm_bio_chain {
+	struct zdm *znd;
+	struct bio *bio[BIO_MAX_PAGES];
+	atomic_t num_bios; /*Zero based counter & used as idx to bio[]*/
+};
+
+#if ENABLE_SEC_METADATA
+enum meta_dst_flags {
+	DST_TO_PRI_DEVICE,
+	DST_TO_SEC_DEVICE,
+	DST_TO_BOTH_DEVICE,
+};
+#endif
+
+/*
+ *
+ * Partition -----------------------------------------------------------------+
+ *     Table ---+                                                             |
+ *              |                                                             |
+ *   SMR Drive |^-------------------------------------------------------------^|
+ * CMR Zones     ^^^^^^^^^
+ *  meta data    |||||||||
+ *
+ *   Remaining partition is filesystem data
+ *
+ */
+
+/**
+ * struct zdm - A page of map table
+ * @ti:		dm_target entry
+ * @dev:	dm_dev entry
+ * @meta_dev:	device for keeping secondary metadata
+ * @sec_zone_align: Sectors required to align data start to zone boundary
+ * @meta_dst_flag:	See: enum meta_dst_flags
+ * @mclist:	list of pages of in-memory LBA mappings.
+ * @mclck:	in memory map-cache lock (spinlock)
+ *
+ * @zltpool:		pages of lookup table entries
+ * @zlt_lck:	zltpool: memory pool lock
+ * @lzy_pool:
+ * @lzy_lock:
+ * @fwd_tm:
+ * @rev_tm:
+ *
+ * @bg_work:	background worker (periodic GC/Mem purge)
+ * @bg_wq:	background work queue
+ * @stats_lock:
+ * @gc_active:	Current GC state
+ * @gc_lock:	GC Lock
+ * @gc_work:	GC Worker
+ * @gc_wq:	GC Work Queue
+ * @data_zones:	# of data zones on device
+ * @gz_count:	# of 256G mega-zones
+ * @nr_blocks:	4k blocks on backing device
+ * @md_start:	LBA at start of metadata pool
+ * @data_lba:	LBA at start of data pool
+ * @zdstart:	ZONE # at start of data pool (first Possible WP ZONE)
+ * @start_sect:	where ZDM partition starts (RAW LBA)
+ * @sec_dev_start_sect: secondary device partition starts. (RAW LBA)
+ * @flags:	See: enum znd_flags_enum
+ * @gc_backlog:
+ * @gc_io_buf:
+ * @io_vcache[32]:
+ * @io_vcache_flags:
+ * @z_sballoc:
+ * @super_block:
+ * @z_mega:
+ * @meta_wq:
+ * @gc_postmap:
+ * @jrnl_map:
+ * @io_client:
+ * @io_wq:
+ * @zone_action_wq:
+ * @timer:
+ * @bins:	Memory usage accounting/reporting.
+ * @bdev_name:
+ * @memstat:
+ * @suspended:
+ * @gc_mz_pref:
+ * @mz_provision:	Number of zones per 1024 of over-provisioning.
+ * @is_empty:		For fast discards on initial format
+ *
+ * Longer description of this structure.
+ */
+struct zdm {
+	struct dm_target *ti;
+	struct dm_dev *dev;
+
+#if ENABLE_SEC_METADATA
+	struct dm_dev *meta_dev;
+	u64 sec_zone_align;
+	u8 meta_dst_flag;
+#endif
+
+	struct list_head zltpool;
+	struct list_head lzy_pool;
+
+	struct work_struct bg_work;
+	struct workqueue_struct *bg_wq;
+
+	spinlock_t zlt_lck;
+	spinlock_t lzy_lck;
+	spinlock_t stats_lock;
+	spinlock_t mapkey_lock; /* access LUT and CRC array of pointers */
+	spinlock_t ct_lock; /* access LUT and CRC array of pointers */
+
+	struct mutex gc_wait;
+	struct mutex pool_mtx;
+	struct mutex mz_io_mutex;
+	struct mutex vcio_lock;
+
+	/* Primary ingress cache */
+	struct map_pool *ingress;
+	struct map_pool *in[2];  /* active and update */
+	spinlock_t in_rwlck;
+
+	/* unused blocks cache */
+	struct map_pool *unused;
+	struct map_pool *_use[2]; /* active and update */
+	spinlock_t unused_rwlck;
+
+	struct map_pool *trim;
+	struct map_pool *trim_mp[2]; /* active and update */
+	spinlock_t trim_rwlck;
+
+	struct map_pool *wbjrnl;
+	struct map_pool *_wbj[2]; /* active and update */
+	spinlock_t wbjrnl_rwlck;
+
+	mempool_t *mempool_pages;
+	mempool_t *mempool_maps;
+	mempool_t *mempool_wset;
+
+	struct task_struct *bio_kthread;
+	struct list_head bio_srt_jif_lst_head;
+	spinlock_t zdm_bio_q_lck;
+
+	mempool_t *bp_q_node;
+	mempool_t *bp_chain_vec;
+	atomic_t enqueued;
+	wait_queue_head_t wait_bio;
+	struct bio_set *bio_set;
+
+	struct gc_state *gc_active;
+	spinlock_t gc_lock;
+	struct delayed_work gc_work;
+	struct workqueue_struct *gc_wq;
+
+	u64 nr_blocks;
+	u64 start_sect;
+#if ENABLE_SEC_METADATA
+	u64 sec_dev_start_sect;
+#endif
+	u64 htlba;
+
+	u64 md_start;
+	u64 md_end;
+	u64 data_lba;
+
+	unsigned long flags;
+
+	u64 r_base;
+	u64 s_base;
+	u64 c_base;
+	u64 c_mid;
+	u64 c_end;
+
+	u64 sk_low;   /* unused ? */
+	u64 sk_high;  /* unused ? */
+
+	struct meta_pg *wp;
+	DECLARE_HASHTABLE(fwd_hm, HLUT_ORDER);
+	DECLARE_HASHTABLE(rev_hm, HLUT_ORDER);
+	DECLARE_HASHTABLE(fwd_hcrc, HCRC_ORDER);
+	DECLARE_HASHTABLE(rev_hcrc, HCRC_ORDER);
+	__le16 *md_crcs;   /* one of crc16's for fwd, 1 for rev */
+	u32 crc_count;
+	u32 map_count;
+
+	void *z_sballoc;
+	struct mz_superkey *bmkeys;
+	struct zdm_superblock *super_block;
+
+	struct work_struct meta_work;
+	sector_t last_w;
+	u8 *cow_block;
+	u64 cow_addr;
+	u32 zone_count;
+	u32 data_zones;
+	u32 dz_start;
+	u32 gz_count;
+	u32 zdstart;
+	u32 z_gc_free;
+	atomic_t incore;
+	u32 discard_count;
+	u32 z_current;
+	u32 z_meta_resv;
+	u32 z_gc_resv;
+	u32 gc_events;
+	int mc_entries;
+	int journal_entries;
+	int dc_entries;
+	int in_zlt;
+	int in_lzy;
+	int meta_result;
+	struct stale_tracking stale;
+
+	int gc_backlog;
+	void *gc_io_buf;
+	struct mutex gc_vcio_lock;
+	struct io_4k_block *io_vcache[32];
+	unsigned long io_vcache_flags;
+	u64 age;
+	u64 flush_age;
+	struct workqueue_struct *meta_wq;
+	struct gc_map_cache gc_postmap;
+	struct dm_io_client *io_client;
+	struct workqueue_struct *io_wq;
+	struct workqueue_struct *zone_action_wq;
+	struct timer_list timer;
+	bool last_op_is_flush;
+
+	u32 bins[40];
+	u32 max_bins[40];
+	char bdev_name[BDEVNAME_SIZE];
+	char bdev_metaname[BDEVNAME_SIZE];
+	char proc_name[BDEVNAME_SIZE+4];
+	struct proc_dir_entry *proc_fs;
+	loff_t wp_proc_at;
+	loff_t used_proc_at;
+
+	size_t memstat;
+	atomic_t suspended;
+	atomic_t gc_throttle;
+
+#if ALLOC_DEBUG
+	atomic_t allocs;
+	int hw_allocs;
+	void **alloc_trace;
+#endif
+	long queue_delay;
+
+	u32 queue_depth;
+	u32 gc_prio_def;
+	u32 gc_prio_low;
+	u32 gc_prio_high;
+	u32 gc_prio_crit;
+	u32 gc_wm_crit;
+	u32 gc_wm_high;
+	u32 gc_wm_low;
+	u32 gc_status;
+	u32 cache_ageout_ms;
+	u32 cache_size;
+	u32 cache_to_pagecache;
+	u32 cache_reada;
+	u32 journal_age;
+
+	u32 filled_zone;
+	u16 mz_provision;
+	unsigned bdev_is_zoned:1;
+	unsigned is_empty:1;
+	unsigned enable_trim:1;
+};
+
+/**
+ * struct zdm_ioc_request - Sector to LBA mapping.
+ * @result_size:
+ * @megazone_nr:
+ *
+ * Longer description of this structure.
+ */
+struct zdm_ioc_request {
+	u32 result_size;
+	u32 megazone_nr;
+};
+
+/**
+ * struct zdm_ioc_status - Sector to LBA mapping.
+ * @b_used: Number of blocks used
+ * @b_available: Number of blocks free
+ * @b_discard: Number of blocks stale
+ * @m_zones: Number of zones.
+ * @mc_entries: Mem cache blocks in use
+ * @dc_entries: Discard cache blocks in use.
+ * @mlut_blocks:
+ * @crc_blocks:
+ * @memstat: Total memory in use by ZDM via *alloc()
+ * @bins: Allocation by subsystem.
+ *
+ * This status structure is used to pass run-time information to
+ * user spaces tools (zdm-tools) for diagnostics and tuning.
+ */
+struct zdm_ioc_status {
+	u64 b_used;
+	u64 b_available;
+	u64 b_discard;
+	u64 m_zones;
+	u32 map_cache_entries;
+	u32 discard_cache_entries;
+	u64 mlut_blocks;
+	u64 crc_blocks;
+	u64 memstat;
+	u32 journal_pages;
+	u32 journal_entries;
+	u32 bins[40];
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _DM_ZONED_H */
diff --git a/drivers/md/libzdm.c b/drivers/md/libzdm.c
new file mode 100644
index 0000000..5f89a2e
--- /dev/null
+++ b/drivers/md/libzdm.c
@@ -0,0 +1,10043 @@
+/*
+ * Kernel Device Mapper for abstracting ZAC/ZBC devices as normal
+ * block devices for linux file systems.
+ *
+ * Copyright (C) 2015,2016 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@xxxxxxxxxxx>
+ *
+ * Bio queue support and metadata relocation by:
+ * Vineet Agarwal <vineet.agarwal@xxxxxxxxxxx>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#define BUILD_NO		127
+
+#define EXTRA_DEBUG		0
+
+#define MAX_PER_PAGE(x)		(PAGE_SIZE / sizeof(*(x)))
+#define DISCARD_IDLE_MSECS	2000
+#define DISCARD_MAX_INGRESS	150
+
+/*
+ *  For performance tuning:
+ *   Q? smaller strips give smoother performance
+ *      a single drive I/O is 8 (or 32?) blocks?
+ *   A? Does not seem to ...
+ */
+#define GC_MAX_STRIPE		256
+#define REPORT_ORDER		7
+#define REPORT_FILL_PGS		65 /* 65 -> min # pages for 4096 descriptors */
+#define MAX_WSET		4096
+#define SYNC_MAX		MAX_WSET
+
+#define MZTEV_UNUSED		(cpu_to_le32(0xFFFFFFFFu))
+#define MZTEV_NF		(cpu_to_le32(0xFFFFFFFEu))
+
+#define Z_TABLE_MAGIC		0x123456787654321Eul
+#define Z_KEY_SIG		0xFEDCBA987654321Ful
+
+#define Z_CRC_4K		4096
+#define MAX_ZONES_PER_MZ	1024
+
+#define GC_READ			(1ul << 15)
+#define GC_WROTE		(1ul << 14)
+#define GC_DROP			(1ul << 13)
+
+#define BAD_ADDR		(~0ul)
+#define MC_INVALID		(cpu_to_le64(BAD_ADDR))
+#define NOZONE			(~0u)
+
+#define GZ_BITS			10
+#define GZ_MMSK			((1u << GZ_BITS) - 1)
+
+#define CRC_BITS		11
+#define CRC_MMSK		((1u << CRC_BITS) - 1)
+
+#define MD_CRC_INIT		(cpu_to_le16(0x5249u))
+
+#define ISCT_BASE		1
+
+#define MC_HEAD			0
+#define MC_INTERSECT		1
+#define MC_TAIL			2
+#define MC_SKIP			2
+
+static int map_addr_calc(struct zdm *, u64 dm_s, struct map_addr *out);
+static int zoned_io_flush(struct zdm *znd);
+static int zoned_wp_sync(struct zdm *znd, int reset_non_empty);
+static void cache_if_dirty(struct zdm *znd, struct map_pg *pg, int wq);
+static int write_if_dirty(struct zdm *, struct map_pg *, int use_wq, int snc);
+static void gc_work_task(struct work_struct *work);
+static void meta_work_task(struct work_struct *work);
+static u64 mcache_greatest_gen(struct zdm *, int, u64 *, u64 *);
+static u64 mcache_find_gen(struct zdm *, u64 base, int, u64 *out);
+static int find_superblock(struct zdm *znd, int use_wq, int do_init);
+static int sync_mapped_pages(struct zdm *znd, int sync, int drop);
+static struct io_4k_block *get_io_vcache(struct zdm *znd, gfp_t gfp);
+static int put_io_vcache(struct zdm *znd, struct io_4k_block *cache);
+static struct map_pg *gme_noio(struct zdm *znd, u64 lba);
+static struct map_pg *get_map_entry(struct zdm *znd, u64 lba,
+			     int ahead, int async, int noio, gfp_t gfp);
+static struct map_pg *do_gme_io(struct zdm *znd, u64 lba,
+				int ahead, int async, gfp_t gfp);
+static void put_map_entry(struct map_pg *);
+static int cache_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp,
+		    struct mpinfo *mpi);
+static int _pool_read(struct zdm *znd, struct map_pg **wset, int count);
+static int wait_for_map_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp);
+static int move_to_map_tables(struct zdm *znd, struct map_cache_entry *maps,
+			      int count, struct map_pg **pgs, int npgs);
+static int zlt_move_unused(struct zdm *znd, struct map_cache_entry *maps,
+			   int count, struct map_pg **pgs, int npgs);
+static int unused_add(struct zdm *znd, u64 addr, u64, u32 count, gfp_t gfp);
+static int _cached_to_tables(struct zdm *znd, u32 zone, gfp_t gfp);
+static void update_stale_ratio(struct zdm *znd, u32 zone);
+static int zoned_create_disk(struct dm_target *ti, struct zdm *znd);
+static int do_init_zoned(struct dm_target *ti, struct zdm *znd);
+static void update_all_stale_ratio(struct zdm *znd);
+static int unmap_deref_chunk(struct zdm *znd, u32 blks, int, gfp_t gfp);
+static u64 z_lookup_journal_cache(struct zdm *znd, u64 addr);
+static u64 z_lookup_ingress_cache(struct zdm *znd, u64 addr);
+static int z_lookup_trim_cache(struct zdm *znd, u64 addr);
+static u64 z_lookup_table(struct zdm *znd, u64 addr, gfp_t gfp);
+static u64 current_mapping(struct zdm *znd, u64 addr, gfp_t gfp);
+static u64 current_map_range(struct zdm *znd, u64 addr, u32 *range, gfp_t gfp);
+static int do_sort_merge(struct map_pool *to, struct map_pool *src,
+			 struct map_cache_entry *chng, int nchgs, int drop);
+static int ingress_add(struct zdm *znd, u64 addr, u64 target, u32 count, gfp_t);
+static int z_mapped_addmany(struct zdm *znd, u64 addr, u64, u32, gfp_t);
+static int md_journal_add_map(struct zdm *znd, u64 addr, u64 lba);
+static int md_handle_crcs(struct zdm *znd);
+static int z_mapped_sync(struct zdm *znd);
+static int z_mapped_init(struct zdm *znd);
+static u64 z_acquire(struct zdm *znd, u32 flags, u32 nblks, u32 *nfound);
+static __le32 sb_crc32(struct zdm_superblock *sblock);
+static int update_map_entry(struct zdm *, struct map_pg *,
+			    struct map_addr *, u64, int);
+static int read_block(struct zdm *, enum dm_io_mem_type,
+		      void *, u64, unsigned int, int);
+static int write_block(struct zdm *, enum dm_io_mem_type,
+		       void *, u64, unsigned int, int);
+static int writef_block(struct zdm *ti, enum dm_io_mem_type dtype,
+			void *data, u64 lba, unsigned int op_flags,
+			unsigned int count, int queue);
+static int zoned_init_disk(struct dm_target *ti, struct zdm *znd,
+			   int create, int force);
+
+#define MutexLock(m)	test_and_lock((m), __LINE__)
+
+/**
+ * test_and_lock() - mutex test and lock (tracing)
+ * @m: mutex
+ * @lineno: line number
+ */
+static __always_inline void test_and_lock(struct mutex *m, int lineno)
+{
+	if (!mutex_trylock(m)) {
+		pr_debug("mutex stall at %d\n", lineno);
+		mutex_lock(m);
+	}
+}
+
+/**
+ * deref_pg() - Decrement refcount on page of ZLT
+ * @pg: Page of ZLT map
+ */
+static __always_inline void deref_pg(struct map_pg *pg)
+{
+#if 0 /* ALLOC_DEBUG: keep a single decrement when this check is enabled */
+	if (atomic_dec_return(&pg->refcount) < 0) {
+		pr_err("Excessive %"PRIx64": %d who dunnit?\n",
+			pg->lba, atomic_read(&pg->refcount));
+		dump_stack();
+	}
+#else
+	atomic_dec(&pg->refcount);
+#endif
+}
+
+/**
+ * ref_pg() - Increment refcount on page of ZLT
+ * @pg: Page of ZLT map
+ */
+static __always_inline void ref_pg(struct map_pg *pg)
+{
+	atomic_inc(&pg->refcount);
+#if 0 /* ALLOC_DEBUG */
+	if (atomic_read(&pg->refcount) > 20) {
+		pr_err("Excessive %d who dunnit?\n",
+			atomic_read(&pg->refcount));
+		dump_stack();
+	}
+#endif
+}
+
+/**
+ * getref_pg() - Read the refcount
+ * @pg: Page of ZLT map
+ */
+static __always_inline int getref_pg(struct map_pg *pg)
+{
+	return atomic_read(&pg->refcount);
+}
+
+/**
+ * crc16_md() - 16 bit CRC on metadata blocks
+ * @data: Block of metadata.
+ * @len:  Number of bytes in block.
+ *
+ * Return: 16 bit CRC.
+ */
+static inline u16 crc16_md(void const *data, size_t len)
+{
+	const u16 init = 0xFFFF;
+	const u8 *p = data;
+
+	return crc16(init, p, len);
+}
+
+/**
+ * crc_md_le16() - 16 bit CRC on metadata blocks in little endian
+ * @data: Block of metadata.
+ * @len:  Number of bytes in block.
+ *
+ * Return: 16 bit CRC.
+ */
+static inline __le16 crc_md_le16(void const *data, size_t len)
+{
+	u16 crc = crc16_md(data, len);
+
+	return cpu_to_le16(crc);
+}
+
+/**
+ * crcpg() - 32 bit CRC [NOTE: 32c is HW assisted on Intel]
+ * @data: Block of metadata [4K bytes].
+ *
+ * Return: 32 bit CRC.
+ */
+static inline u32 crcpg(void *data)
+{
+	return crc32c(~0u, data, Z_CRC_4K) ^ SUPERBLOCK_CSUM_XOR;
+}
+
+/**
+ * crc32c_le32() - 32 bit CRC [NOTE: 32c is HW assisted on Intel]
+ * @init: Starting value (usually: ~0u)
+ * @data: Data to be CRC'd.
+ * @sz: Number of bytes to be CRC'd
+ *
+ * Return: 32 bit CRC in little endian format.
+ */
+static inline __le32 crc32c_le32(u32 init, void *data, u32 sz)
+{
+	return cpu_to_le32(crc32c(init, data, sz));
+}
+
+/**
+ * le64_to_lba48() - Return the lower 48 bits of LBA
+ * @enc: 64 bit LBA + flags
+ * @flg: optional 16 bits of classification.
+ *
+ * Return: 48 bits of LBA [and flg].
+ */
+static inline u64 le64_to_lba48(__le64 enc, u32 *flg)
+{
+	const u64 lba64 = le64_to_cpu(enc);
+
+	if (flg)
+		*flg = (lba64 >> LBA48_BITS) & Z_UPPER16;
+
+	return lba64 & Z_LOWER48;
+}
+
+/**
+ * lba48_to_le64() - Encode 48 bits of lba + 16 bits of flags.
+ * @flags: flags to encode.
+ * @lba48: LBA to encode
+ *
+ * Return: Little endian u64.
+ */
+static inline __le64 lba48_to_le64(u32 flags, u64 lba48)
+{
+	u64 high_bits = flags;
+
+	return (high_bits << LBA48_BITS) | (lba48 & Z_LOWER48);
+}
+
+/**
+ * sb_test_flag() - Test if flag is set in Superblock.
+ * @sb: zdm_superblock.
+ * @bit_no: superblock flag
+ *
+ * Return: non-zero if flag is set.
+ */
+static inline int sb_test_flag(struct zdm_superblock *sb, int bit_no)
+{
+	u32 flags = le32_to_cpu(sb->flags);
+
+	return (flags & (1 << bit_no)) ? 1 : 0;
+}
+
+/**
+ * sb_set_flag() - Set a flag in superblock.
+ * @sb: zdm_superblock.
+ * @bit_no: superblock flag
+ */
+static inline void sb_set_flag(struct zdm_superblock *sb, int bit_no)
+{
+	u32 flags = le32_to_cpu(sb->flags);
+
+	flags |= (1 << bit_no);
+	sb->flags = cpu_to_le32(flags);
+}
+
+/**
+ * zone_to_sector() - Calculate starting LBA of zone
+ * @zone: zone number (0 based)
+ *
+ * Return: LBA at start of zone.
+ */
+static inline u64 zone_to_sector(u64 zone)
+{
+	return zone << ZONE_SECT_BITS;
+}
+
+/**
+ * is_expired_msecs() - Determine if age + msecs is older than now.
+ * @age: jiffies at last access
+ * @msecs: msecs of extra time.
+ *
+ * Return: non-zero if block is expired.
+ */
+static inline int is_expired_msecs(u64 age, u32 msecs)
+{
+	int expired = 1;
+
+	if (age) {
+		u64 expire_at = age + msecs_to_jiffies(msecs);
+
+		expired = time_after64(jiffies_64, expire_at);
+	}
+	return expired;
+}
+
+/**
+ * is_expired() - Determine if age is older than znd->cache_ageout_ms.
+ * @znd: ZDM instance
+ * @age: jiffies at last access
+ *
+ * Return: non-zero if block is expired.
+ */
+static inline int is_expired(struct zdm *znd, u64 age)
+{
+	return is_expired_msecs(age, znd->cache_ageout_ms);
+}
+
+/**
+ * _calc_zone() - Determine zone number from addr
+ * @znd: ZDM instance
+ * @addr: 4k sector number
+ *
+ * Return: znum or 0xFFFFFFFF if addr is in metadata space.
+ */
+static inline u32 _calc_zone(struct zdm *znd, u64 addr)
+{
+	u32 znum = NOZONE;
+
+	if (addr < znd->md_start)
+		return znum;
+
+	addr -= znd->md_start;
+	znum = addr >> Z_BLKBITS;
+
+	return znum;
+}
+
+/**
+ * lazy_pool_add - Set a flag and add map page to the lazy pool
+ * @znd: ZDM Instance
+ * @expg: Map table page.
+ * @bit: Flag to set on page.
+ *
+ * Lazy pool is used for deferred adding and delayed removal.
+ */
+static __always_inline
+void lazy_pool_add(struct zdm *znd, struct map_pg *expg, int bit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->lzy_lck, flags);
+	if (!test_bit(IS_LAZY, &expg->flags)) {
+		set_bit(IS_LAZY, &expg->flags);
+
+		list_add(&expg->lazy, &znd->lzy_pool);
+		znd->in_lzy++;
+	}
+	set_bit(bit, &expg->flags);
+	spin_unlock_irqrestore(&znd->lzy_lck, flags);
+}
+
+/**
+ * lazy_pool_splice - Add a list of pages to the lazy pool
+ * @znd: ZDM Instance
+ * @list: List of map table page to add.
+ *
+ * Lazy pool is used for deferred adding and delayed removal.
+ */
+static __always_inline
+void lazy_pool_splice(struct zdm *znd, struct list_head *list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->lzy_lck, flags);
+	list_splice_tail(list, &znd->lzy_pool);
+	spin_unlock_irqrestore(&znd->lzy_lck, flags);
+}
+
+/**
+ * zlt_pool_splice - Add a list of pages to the zlt pool
+ * @znd: ZDM Instance
+ * @list: List of map table page to add.
+ */
+static __always_inline
+void zlt_pool_splice(struct zdm *znd, struct list_head *list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->zlt_lck, flags);
+	list_splice_tail(list, &znd->zltpool);
+	spin_unlock_irqrestore(&znd->zlt_lck, flags);
+}
+
+/**
+ * pool_add() - Add metadata block to zltlst
+ * @znd: ZDM instance
+ * @expg: current metadata block to add to zltlst list.
+ */
+static inline int pool_add(struct zdm *znd, struct map_pg *expg)
+{
+	int rcode = 0;
+	unsigned long flags;
+
+	/* undrop from journal will be in lazy list */
+	if (test_bit(IS_LAZY, &expg->flags)) {
+		set_bit(DELAY_ADD, &expg->flags);
+		return rcode;
+	}
+	if (test_bit(IN_ZLT, &expg->flags)) {
+		return rcode;
+	}
+
+	if (spin_trylock_irqsave(&znd->zlt_lck, flags)) {
+		if (test_bit(IN_ZLT, &expg->flags)) {
+			Z_ERR(znd, "Double list_add from:");
+			dump_stack();
+		} else {
+			set_bit(IN_ZLT, &expg->flags);
+			list_add(&expg->zltlst, &znd->zltpool);
+			znd->in_zlt++;
+		}
+		spin_unlock_irqrestore(&znd->zlt_lck, flags);
+		rcode = 0;
+	} else {
+		lazy_pool_add(znd, expg, DELAY_ADD);
+	}
+
+	return rcode;
+}
+
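+/**
+ * _io_pending() - Test if a map page still has I/O or CRC work outstanding.
+ * @pg: Page of ZLT map
+ *
+ * Return: true while the page data is not yet resident and stable.
+ */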
+static __always_inline bool _io_pending(struct map_pg *pg)
+{
+	if (!pg->data.addr ||
+	    test_bit(IS_ALLOC, &pg->flags) ||
+	    test_bit(R_IN_FLIGHT, &pg->flags) ||
+	    test_bit(R_SCHED, &pg->flags))
+		return true;
+
+	if (test_bit(R_CRC_PENDING, &pg->flags) &&
+	    test_bit(IS_LUT, &pg->flags))
+		return true;
+
+	return false;
+}
+
+/**
+ * zlt_pgs_in_use() - Get number of zlt pages in use.
+ * @znd: ZDM instance
+ *
+ * Return: Current in zlt count.
+ */
+static inline int zlt_pgs_in_use(struct zdm *znd)
+{
+	return znd->in_zlt;
+}
+
+/**
+ * low_cache_mem() - test if cache memory is running low
+ * @znd: ZDM instance
+ *
+ * Return: non-zero if cache memory is low.
+ */
+static inline int low_cache_mem(struct zdm *znd)
+{
+	return zlt_pgs_in_use(znd) > (znd->cache_size >> 1);
+}
+
+/**
+ * to_table_entry() - Deconstruct metadata page into mpinfo
+ * @znd: ZDM instance
+ * @lba: Address (4k resolution)
+ * @ra: Is speculative (may be beyond FWD IDX)
+ * @mpi: mpinfo describing the entry (OUT, Required)
+ *
+ * Return: Index into the selected table, or -1 if @lba is not in a
+ *         metadata range.
+ */
+static int to_table_entry(struct zdm *znd, u64 lba, int ra, struct mpinfo *mpi)
+{
+	int index = -1;
+
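+	/* classify lba into fwd LUT, rev LUT, fwd CRC, or rev CRC regions */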
+	mpi->lock = &znd->mapkey_lock;
+
+	if (lba >= znd->s_base && lba < znd->r_base) {
+		index		= lba - znd->s_base;
+		mpi->htable	= znd->fwd_hm;
+		mpi->ht_order	= HASH_BITS(znd->fwd_hm);
+		mpi->bit_type	= IS_LUT;
+		mpi->bit_dir	= IS_FWD;
+		mpi->crc.pg_no	= index >> CRC_BITS;
+		mpi->crc.lba    = znd->c_base + mpi->crc.pg_no;
+		mpi->crc.pg_idx	= index & CRC_MMSK;
+		if (index < 0 ||  index >= znd->map_count) {
+			if (ra)
+				goto done;
+			Z_ERR(znd, "%s: FWD BAD IDX %"PRIx64" %d of %d",
+				__func__, lba, index, znd->map_count);
+			dump_stack();
+		}
+	} else if (lba >= znd->r_base && lba < znd->c_base) {
+		index		= lba - znd->r_base;
+		mpi->htable	= znd->rev_hm;
+		mpi->ht_order	= HASH_BITS(znd->rev_hm);
+		mpi->bit_type	= IS_LUT;
+		mpi->bit_dir	= IS_REV;
+		mpi->crc.pg_no	= index >> CRC_BITS;
+		mpi->crc.lba    = znd->c_mid + mpi->crc.pg_no;
+		mpi->crc.pg_idx	= index & CRC_MMSK;
+		if (index < 0 ||  index >= znd->map_count) {
+			if (ra)
+				goto done;
+			Z_ERR(znd, "%s: REV BAD IDX %"PRIx64" %d of %d",
+				__func__, lba, index, znd->map_count);
+			dump_stack();
+		}
+	} else if (lba >= znd->c_base && lba < znd->c_mid) {
+		index		= lba - znd->c_base;
+		mpi->htable	= znd->fwd_hcrc;
+		mpi->ht_order	= HASH_BITS(znd->fwd_hcrc);
+		mpi->lock	= &znd->ct_lock;
+		mpi->bit_type	= IS_CRC;
+		mpi->bit_dir	= IS_FWD;
+		mpi->crc.lba    = ~0ul;
+		mpi->crc.pg_no	= 0;
+		mpi->crc.pg_idx	= index & CRC_MMSK;
+		if (index < 0 ||  index >= znd->crc_count) {
+			if (ra)
+				goto done;
+			Z_ERR(znd, "%s: CRC BAD IDX %"PRIx64" %d of %d",
+				__func__, lba, index, znd->crc_count);
+			dump_stack();
+		}
+	} else if (lba >= znd->c_mid && lba < znd->c_end) {
+		index		= lba - znd->c_mid;
+		mpi->htable	= znd->rev_hcrc;
+		mpi->ht_order	= HASH_BITS(znd->rev_hcrc);
+		mpi->lock	= &znd->ct_lock;
+		mpi->bit_type	= IS_CRC;
+		mpi->bit_dir	= IS_REV;
+		mpi->crc.lba    = ~0ul;
+		mpi->crc.pg_no	= 1;
+		mpi->crc.pg_idx	= (1 << CRC_BITS) + (index & CRC_MMSK);
+		if (index < 0 ||  index >= znd->crc_count) {
+			if (ra)
+				goto done;
+			Z_ERR(znd, "%s: CRC BAD IDX %"PRIx64" %d of %d",
+				__func__, lba, index, znd->crc_count);
+			dump_stack();
+		}
+	} else {
+		Z_ERR(znd, "** Corrupt lba %" PRIx64 " not in range.", lba);
+		znd->meta_result = -EIO;
+		dump_stack();
+	}
+done:
+	mpi->index = index;
+	return index;
+}
+
+/**
+ * get_htbl_entry() - Get map_pg from Hashtable using Map Page Info
+ * @znd: ZDM Instance
+ * @mpi: Map Page Information descriptor.
+ *
+ * Returns the current map_pg or NULL if map_pg is not in core.
+ */
+static inline struct map_pg *get_htbl_entry(struct zdm *znd,
+					    struct mpinfo *mpi)
+{
+	struct map_pg *obj;
+	struct hlist_head *hlist;
+
+	hlist = &mpi->htable[hash_min(mpi->index, mpi->ht_order)];
+	hlist_for_each_entry(obj, hlist, hentry)
+		if (obj->index == mpi->index)
+			return obj;
+
+	return NULL;
+}
+
+/**
+ * add_htbl_entry() - Add a map_pg to Hashtable using Map Page Info
+ * @znd: ZDM Instance
+ * @mpi: Map Page Information descriptor.
+ * @pg: Map Page to add
+ *
+ * Returns 1 if the page was added, 0 if the page was already added.
+ *
+ * Since locks are not held between lookup and new page allocation races
+ * can happen, caller is responsible for cleanup.
+ */
+static inline int add_htbl_entry(struct zdm *znd, struct mpinfo *mpi,
+				 struct map_pg *pg)
+{
+	struct hlist_head *hlist;
+	struct map_pg *obj = get_htbl_entry(znd, mpi);
+
+	if (obj) {
+		if (obj->lba != pg->lba) {
+			Z_ERR(znd, "Page %" PRIx64 " already added %" PRIx64,
+				obj->lba, pg->lba);
+			dump_stack();
+		}
+		return 0;
+	}
+
+	hlist = &mpi->htable[hash_min(mpi->index, mpi->ht_order)];
+	hlist_add_head(&pg->hentry, hlist);
+
+	return 1;
+}
+
+/**
+ * htbl_lut_hentry() - Get hash table head for map page lookup
+ * @znd: ZDM Instance
+ * @is_fwd: If the entry is for the forward map or reverse map.
+ * @idx: The map page index.
+ *
+ * Returns the corresponding hash entry head.
+ */
+static inline struct hlist_head *htbl_lut_hentry(struct zdm *znd, int is_fwd,
+						 int idx)
+{
+	struct hlist_head *hlist;
+
+	if (is_fwd)
+		hlist = &znd->fwd_hm[hash_min(idx, HASH_BITS(znd->fwd_hm))];
+	else
+		hlist = &znd->rev_hm[hash_min(idx, HASH_BITS(znd->rev_hm))];
+
+	return hlist;
+}
+
+/**
+ * htbl_crc_hentry() - Get hash table head for CRC page entry
+ * @znd: ZDM Instance
+ * @is_fwd: If the entry is for the forward map or reverse map.
+ * @idx: The map page index.
+ *
+ * Returns the corresponding hash entry head.
+ */
+static inline struct hlist_head *htbl_crc_hentry(struct zdm *znd, int is_fwd,
+						 int idx)
+{
+	struct hlist_head *hlist;
+
+	if (is_fwd)
+		hlist = &znd->fwd_hcrc[hash_min(idx, HASH_BITS(znd->fwd_hcrc))];
+	else
+		hlist = &znd->rev_hcrc[hash_min(idx, HASH_BITS(znd->rev_hcrc))];
+
+	return hlist;
+}
+
+/**
+ * del_htbl_entry() - Delete an entry from a hash table
+ * @znd: ZDM Instance
+ * @pg: Map Page to add
+ *
+ * Returns 1 if the page was removed, 0 if the page was not found.
+ *
+ * Since locks are not held between lookup and page removal races
+ * can happen, caller is responsible for cleanup.
+ */
+static inline int del_htbl_entry(struct zdm *znd, struct map_pg *pg)
+{
+	struct map_pg *obj;
+	struct hlist_head *hlist;
+	int is_fwd = test_bit(IS_FWD, &pg->flags);
+
+	if (test_bit(IS_LUT, &pg->flags))
+		hlist = htbl_lut_hentry(znd, is_fwd, pg->index);
+	else
+		hlist = htbl_crc_hentry(znd, is_fwd, pg->index);
+
+	hlist_for_each_entry(obj, hlist, hentry) {
+		if (obj->index == pg->index) {
+			hash_del(&pg->hentry);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * is_ready_for_gc() - Test zone flags for GC sanity and ready flag.
+ * @znd: ZDM instance
+ * @z_id: Zone number (0 based)
+ *
+ * Return: non-zero if zone is suitable for GC.
+ */
+static inline int is_ready_for_gc(struct zdm *znd, u32 z_id)
+{
+	u32 gzno  = z_id >> GZ_BITS;
+	u32 gzoff = z_id & GZ_MMSK;
+	struct meta_pg *wpg = &znd->wp[gzno];
+	u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+	u32 used = le32_to_cpu(wpg->wp_used[gzoff]) & Z_WP_VALUE_MASK;
+
+	if (((wp & Z_WP_GC_BITS) == Z_WP_GC_READY) && (used == Z_BLKSZ))
+		return 1;
+	return 0;
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+/*
+ *  generic-ish n-way alloc/free
+ *  Use kmalloc for small (< 4k) allocations.
+ *  Use vmalloc for multi-page allocations.
+ *  Except:
+ *  Use multi-page allocations for dm_io'd pages that are frequently hit.
+ *
+ *  NOTE: ALL allocations are zero'd before returning.
+ *	  alloc/free count is tracked for dynamic analysis.
+ */
+#define GET_PGS		0x020000
+#define GET_ZPG		0x040000
+#define GET_KM		0x080000
+#define GET_VM		0x100000
+
+#define xx_01    (GET_ZPG |  1) /* unused */
+#define xx_13    (GET_ZPG | 13) /* unused */
+#define xx_17    (GET_ZPG | 17) /* unused */
+
+#define xx_26    (GET_KM  | 26) /* unused */
+#define xx_28    (GET_KM  | 28) /* unused */
+#define xx_29    (GET_KM  | 29) /* unused */
+#define xx_30    (GET_KM  | 30) /* unused */
+
+#define PG_02    (GET_ZPG |  2) /* CoW [RMW] block */
+#define PG_05    (GET_ZPG |  5) /* superblock */
+#define PG_06    (GET_ZPG |  6) /* WP: Alloc, Used, Shadow */
+#define PG_08    (GET_ZPG |  8) /* map_pool data block */
+#define PG_09    (GET_ZPG |  9) /* mc pg (copy/sync) */
+#define PG_10    (GET_ZPG | 10) /* superblock: temporary */
+#define PG_11    (GET_ZPG | 11) /* superblock: temporary */
+#define PG_27    (GET_ZPG | 27) /* map_pg data block */
+
+#define KM_00    (GET_KM  |  0) /* ZDM: Instance */
+#define KM_07    (GET_KM  |  7) /* mcache struct */
+#define KM_14    (GET_KM  | 14) /* bio pool: queue */
+#define KM_15    (GET_KM  | 15) /* bio pool: chain vec */
+#define KM_16    (GET_KM  | 16) /* gc descriptor */
+#define KM_18    (GET_KM  | 18) /* wset : sync */
+#define KM_19    (GET_KM  | 19) /* wset */
+#define KM_20    (GET_KM  | 20) /* map_pg struct */
+#define KM_21    (GET_KM  | 21) /* wp array (of ptrs) */
+#define KM_25    (GET_KM  | 25) /* map_pool struct... */
+#define KM_26    (GET_KM  | 26) /* gc pgs** */
+
+#define VM_01    (GET_VM  |  1) /* wb journal */
+#define VM_03    (GET_VM  |  3) /* gc postmap */
+#define VM_04    (GET_VM  |  4) /* gc io buffer: 1MiB*/
+#define VM_12    (GET_VM  | 12) /* vm io cache */
+
+/* alloc'd page of order > 0 */
+#define MP_22    (GET_PGS | 22) /* Metadata CRCs */
+#define MP_23    (GET_PGS | 23) /* WB Journal map */
+
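+/* ZDM_FREE() also clears the caller's pointer to help catch double frees */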
+#define ZDM_FREE(z, _p, sz, id) \
+	do { zdm_free((z), (_p), (sz), (id)); (_p) = NULL; } while (0)
+
+#define ZDM_ALLOC(z, sz, id, gfp)       zdm_alloc((z), (sz), (id), (gfp))
+#define ZDM_CALLOC(z, n, sz, id, gfp)   zdm_calloc((z), (n), (sz), (id), (gfp))
+
+/**
+ * zdm_free_debug() - Extra cleanup for memory debugging.
+ * @znd: ZDM instance
+ * @p: memory to be released.
+ * @sz: allocated size.
+ * @id: allocation bin.
+ *
+ * Additional alloc/free debugging and statistics handling.
+ */
+static inline void zdm_free_debug(struct zdm *znd, void *p, size_t sz, int id)
+{
+	unsigned long flags;
+#if ALLOC_DEBUG
+	int iter;
+	int okay = 0;
+#endif
+
+	spin_lock_irqsave(&znd->stats_lock, flags);
+	if (sz > znd->memstat)
+		Z_ERR(znd, "Free'd more mem than allocated? %d", id);
+
+	if (sz > znd->bins[id]) {
+		Z_ERR(znd, "Free'd more mem than allocated? %d", id);
+		dump_stack();
+	}
+
+#if ALLOC_DEBUG
+	for (iter = 0; iter < ADBG_ENTRIES; iter++) {
+		if (p == znd->alloc_trace[iter]) {
+			znd->alloc_trace[iter] = NULL;
+			okay = 1;
+			atomic_dec(&znd->allocs);
+			break;
+		}
+	}
+	if (!okay) {
+		Z_ERR(znd, "Free'd something *NOT* allocated? %d", id);
+		dump_stack();
+	}
+#endif
+
+	znd->memstat -= sz;
+	znd->bins[id] -= sz;
+	spin_unlock_irqrestore(&znd->stats_lock, flags);
+}
+
+/**
+ * zdm_free() - Unified free by allocation 'code'
+ * @znd: ZDM instance
+ * @p: memory to be released.
+ * @sz: allocated size.
+ * @code: allocation code (bin and strategy).
+ *
+ * This (ugly) unified scheme helps to find leaks and monitor usage
+ *   via ioctl tools.
+ */
+static void zdm_free(struct zdm *znd, void *p, size_t sz, u32 code)
+{
+	int id    = code & 0x00FFFF;
+	int flag  = code & 0xFF0000;
+
+	if (!p)
+		goto out_invalid;
+
+	if (znd)
+		zdm_free_debug(znd, p, sz, id);
+
+#if ALLOC_DEBUG
+	memset(p, 0x69, sz); /* DEBUG */
+#endif
+
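+	/* dedicated mempool bins first, then the generic strategy by flag */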
+	switch (code) {
+	case PG_08:
+	case PG_09:
+	case PG_27:
+		BUG_ON(!znd->mempool_pages);
+		mempool_free(virt_to_page(p), znd->mempool_pages);
+		return;
+	case KM_14:
+		BUG_ON(!znd->bp_q_node);
+		BUG_ON(sz != sizeof(struct zdm_q_node));
+		mempool_free(p, znd->bp_q_node);
+		return;
+	case KM_15:
+		BUG_ON(!znd->bp_chain_vec);
+		BUG_ON(sz != sizeof(struct zdm_bio_chain));
+		mempool_free(p, znd->bp_chain_vec);
+		return;
+	case KM_18:
+	case KM_19:
+		BUG_ON(!znd->mempool_wset);
+		mempool_free(p, znd->mempool_wset);
+		return;
+	case KM_20:
+		BUG_ON(!znd->mempool_maps);
+		BUG_ON(sz != sizeof(struct map_pg));
+		mempool_free(p, znd->mempool_maps);
+		return;
+	default:
+		break;
+	}
+
+	switch (flag) {
+	case GET_ZPG:
+		free_page((unsigned long)p);
+		break;
+	case GET_PGS:
+		free_pages((unsigned long)p, ilog2(sz >> PAGE_SHIFT));
+		break;
+	case GET_KM:
+		kfree(p);
+		break;
+	case GET_VM:
+		vfree(p);
+		break;
+	default:
+		Z_ERR(znd,
+		      "zdm_free %p scheme %x not mapped.", p, code);
+		break;
+	}
+	return;
+
+out_invalid:
+	Z_ERR(znd, "double zdm_free %p [%d]", p, id);
+	dump_stack();
+}
+
+/**
+ * zdm_alloc_debug() - Extra tracking for memory debugging.
+ * @znd: ZDM instance
+ * @p: memory that was allocated.
+ * @sz: allocated size.
+ * @id: allocation bin.
+ *
+ * Additional alloc/free debugging and statistics handling.
+ */
+static inline
+void zdm_alloc_debug(struct zdm *znd, void *p, size_t sz, int id)
+{
+	unsigned long flags;
+#if ALLOC_DEBUG
+	int iter;
+	int count;
+#endif
+
+	spin_lock_irqsave(&znd->stats_lock, flags);
+
+#if ALLOC_DEBUG
+	atomic_inc(&znd->allocs);
+	count = atomic_read(&znd->allocs);
+	if (atomic_read(&znd->allocs) < ADBG_ENTRIES) {
+		for (iter = 0; iter < ADBG_ENTRIES; iter++) {
+			if (!znd->alloc_trace[iter]) {
+				znd->alloc_trace[iter] = p;
+				break;
+			}
+		}
+	} else {
+		Z_ERR(znd, "Exceeded max debug alloc trace");
+	}
+#endif
+
+	znd->memstat += sz;
+	znd->bins[id] += sz;
+	if (znd->bins[id] / sz > znd->max_bins[id])
+		znd->max_bins[id] = znd->bins[id] / sz;
+	spin_unlock_irqrestore(&znd->stats_lock, flags);
+}
+
+/**
+ * zdm_alloc() - Unified alloc by 'code':
+ * @znd: ZDM instance
+ * @sz: allocated size.
+ * @code: allocation code (bin and strategy).
+ * @gfp: kernel allocation flags.
+ *
+ * There are a few things (like dm_io) that seem to need pages and not just
+ *   kmalloc'd memory.
+ *
+ * This (ugly) unified scheme helps to find leaks and monitor usage
+ *   via ioctl tools.
+ */
+static void *zdm_alloc(struct zdm *znd, size_t sz, int code, gfp_t gfp)
+{
+	struct page *page = NULL;
+	void *pmem = NULL;
+	int id   = code & 0x00FFFF;
+	int flag = code & 0xFF0000;
+	gfp_t gfp_mask = GFP_KERNEL;
+	int zeroed = 0;
+
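+	/* treat any hint other than GFP_KERNEL as atomic (no sleeping) */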
+	if (znd && gfp != GFP_KERNEL)
+		gfp_mask = GFP_ATOMIC;
+
+	switch (code) {
+	case PG_08:
+	case PG_09:
+	case PG_27:
+		BUG_ON(!znd->mempool_pages);
+		page = mempool_alloc(znd->mempool_pages, gfp_mask);
+		if (page)
+			pmem = page_address(page);
+		goto out;
+	case KM_14:
+		BUG_ON(!znd->bp_q_node);
+		BUG_ON(sz != sizeof(struct zdm_q_node));
+		pmem = mempool_alloc(znd->bp_q_node, gfp_mask);
+		goto out;
+	case KM_15:
+		BUG_ON(!znd->bp_chain_vec);
+		BUG_ON(sz != sizeof(struct zdm_bio_chain));
+		pmem = mempool_alloc(znd->bp_chain_vec, gfp_mask);
+		goto out;
+	case KM_18:
+	case KM_19:
+		BUG_ON(!znd->mempool_wset);
+		pmem = mempool_alloc(znd->mempool_wset, gfp_mask);
+		goto out;
+	case KM_20:
+		BUG_ON(!znd->mempool_maps);
+		BUG_ON(sz != sizeof(struct map_pg));
+		pmem = mempool_alloc(znd->mempool_maps, gfp_mask);
+		goto out;
+	default:
+		break;
+	}
+
+	if (flag == GET_VM)
+		might_sleep();
+
+	if (gfp_mask == GFP_KERNEL)
+		might_sleep();
+
+	zeroed = 1;
+	switch (flag) {
+	case GET_ZPG:
+		pmem = (void *)get_zeroed_page(gfp_mask);
+		if (!pmem && gfp_mask == GFP_ATOMIC) {
+			if (znd)
+				Z_ERR(znd, "No atomic for %d, try noio.", id);
+			pmem = (void *)get_zeroed_page(GFP_NOIO);
+		}
+		break;
+	case GET_PGS:
+		zeroed = 0;
+		page = alloc_pages(gfp_mask, ilog2(sz >> PAGE_SHIFT));
+		if (page)
+			pmem = page_address(page);
+		break;
+	case GET_KM:
+		pmem = kzalloc(sz, gfp_mask);
+		if (!pmem && gfp_mask == GFP_ATOMIC) {
+			if (znd)
+				Z_ERR(znd, "No atomic for %d, try noio.", id);
+			pmem = kzalloc(sz, GFP_NOIO);
+		}
+		break;
+	case GET_VM:
+		pmem = vzalloc(sz);
+		break;
+	default:
+		Z_ERR(znd, "zdm alloc scheme for %u unknown.", code);
+		break;
+	}
+
+out:
+	if (pmem) {
+		if (znd)
+			zdm_alloc_debug(znd, pmem, sz, id);
+		if (!zeroed)
+			memset(pmem, 0, sz);
+	} else {
+		if (znd)
+			Z_ERR(znd, "Out of memory. %d", id);
+		dump_stack();
+	}
+
+	return pmem;
+}
+
+/**
+ * zdm_calloc() - Unified alloc by 'code':
+ * @znd: ZDM instance
+ * @n: number of elements in array.
+ * @sz: allocation size of each element.
+ * @c: allocation strategy (VM, KM, PAGE, N-PAGES).
+ * @q: kernel allocation flags.
+ *
+ * calloc is just a zeroed memory array alloc.
+ * all zdm_alloc schemes are for zeroed memory so no extra memset needed.
+ */
+static void *zdm_calloc(struct zdm *znd, size_t n, size_t sz, int c, gfp_t q)
+{
+	return zdm_alloc(znd, sz * n, c, q);
+}
+
+/**
+ * get_io_vcache() - Get a pre-allocated pool of memory for IO.
+ * @znd: ZDM instance
+ * @gfp: Allocation flags if no pre-allocated pool can be found.
+ *
+ * Return: Pointer to pool memory or NULL.
+ */
+static struct io_4k_block *get_io_vcache(struct zdm *znd, gfp_t gfp)
+{
+	struct io_4k_block *cache = NULL;
+	int avail;
+
+	might_sleep();
+
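+	/* a set bit in io_vcache_flags reserves the matching vcache slot */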
+	for (avail = 0; avail < ARRAY_SIZE(znd->io_vcache); avail++) {
+		if (!test_and_set_bit(avail, &znd->io_vcache_flags)) {
+			cache = znd->io_vcache[avail];
+			if (!cache)
+				znd->io_vcache[avail] = cache =
+					ZDM_CALLOC(znd, IO_VCACHE_PAGES,
+						sizeof(*cache), VM_12, gfp);
+			if (cache)
+				break;
+		}
+	}
+	return cache;
+}
+
+/**
+ * put_io_vcache() - Return a pre-allocated pool of memory for IO.
+ * @znd: ZDM instance
+ * @cache: Allocated cache entry.
+ *
+ * Return: 0 on success, -ENOENT if @cache was not from the pool.
+ */
+static int put_io_vcache(struct zdm *znd, struct io_4k_block *cache)
+{
+	int err = -ENOENT;
+	int avail;
+
+	if (cache) {
+		for (avail = 0; avail < ARRAY_SIZE(znd->io_vcache); avail++) {
+			if (cache == znd->io_vcache[avail]) {
+				WARN_ON(!test_and_clear_bit(avail,
+					&znd->io_vcache_flags));
+				err = 0;
+				break;
+			}
+		}
+	}
+	return err;
+}
+
+/**
+ * map_value() - translate a lookup table entry to a Sector #, or LBA.
+ * @znd: ZDM instance
+ * @delta: little endian map entry.
+ *
+ * Return: LBA or 0 if invalid.
+ */
+static inline u64 map_value(struct zdm *znd, __le32 delta)
+{
+	u64 mval = 0ul;
+
+	if ((delta != MZTEV_UNUSED) && (delta != MZTEV_NF))
+		mval = le32_to_cpu(delta);
+
+	return mval;
+}
+
+/**
+ * map_encode() - Encode a Sector # or LBA to a lookup table entry value.
+ * @znd: ZDM instance
+ * @to_addr: address to encode.
+ * @value: encoded value
+ *
+ * Return: 0.
+ */
+static int map_encode(struct zdm *znd, u64 to_addr, __le32 *value)
+{
+	int err = 0;
+
+	*value = MZTEV_UNUSED;
+	if (to_addr != BAD_ADDR)
+		*value = cpu_to_le32((u32)to_addr);
+
+	return err;
+}
+
+/**
+ * do_map_pool_free() - Free a map pool
+ * @mp: Map pool
+ */
+static void do_map_pool_free(struct map_pool *mp)
+{
+	u32 npgs;
+	u32 iter;
+
+	if (!mp)
+		return;
+
+	npgs = mp->size >> MCE_SHIFT;
+	for (iter = 0; iter < npgs; iter++) {
+		if (mp->pgs[iter]) {
+			ZDM_FREE(mp->znd, mp->pgs[iter], Z_C4K, PG_08);
+			mp->pgs[iter] = NULL;
+		}
+	}
+	ZDM_FREE(mp->znd, mp, sizeof(*mp), KM_25);
+}
+
+/**
+ * mce_at() - Get a map cache entry from a map pool
+ * @mp: Map pool
+ * @idx: Entry to grab
+ */
+static inline struct map_cache_entry *mce_at(struct map_pool *mp, int idx)
+{
+	u32 pg_no  = idx >> MCE_SHIFT;
+	u32 pg_idx = idx & MCE_MASK;
+	struct map_cache_entry *pg;
+
+	pg = mp->pgs[pg_no];
+	if (pg)
+		return &pg[pg_idx];
+
+	dump_stack();
+	return NULL;
+}
+
+/**
+ * _alloc_map_pool() - Allocate a pool
+ * @znd: ZDM instance
+ * @count: Number of entries desired
+ * @gfp: allocation mask
+ */
+static struct map_pool *_alloc_map_pool(struct zdm *znd, u32 count, gfp_t gfp)
+{
+	struct map_pool *mp = NULL;
+	struct map_cache_entry *pg;
+	u32 npgs;
+	u32 iter;
+
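+	/* enough 4k pages of entries to cover @count, plus spare headroom */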
+	npgs = count >> MCE_SHIFT;
+	if (npgs < 2)
+		npgs = 2;
+	npgs += (count & MCE_MASK) ? 2 : 1;
+	if (npgs >= Z_MCE_MAX)
+		goto out;
+
+	mp = ZDM_ALLOC(znd, sizeof(*mp), KM_25, gfp);
+	if (!mp)
+		goto out;
+
+	mp->znd = znd;
+	mp->size = npgs << MCE_SHIFT;
+	for (iter = 0; iter < npgs; iter++) {
+		pg = ZDM_ALLOC(znd, Z_C4K, PG_08, gfp);
+		if (!pg) {
+			do_map_pool_free(mp);
+			mp = NULL;
+			goto out;
+		}
+		mp->pgs[iter] = pg;
+	}
+out:
+	return mp;
+}
+
+/**
+ * mp_grow() - Add a page to a map pool
+ * @cur: Current pool
+ * @pool: Pair of pools to grow.
+ * @gfp: allocation mask
+ */
+static void mp_grow(struct map_pool *cur, struct map_pool **pool, gfp_t gfp)
+{
+	struct map_pool *mp;
+	struct map_cache_entry *pg;
+	u32 npgs;
+	u32 iter;
+	int idx;
+
+	npgs = cur->count >> MCE_SHIFT;
+	if (npgs < 2)
+		npgs = 2;
+	npgs += (cur->count & MCE_MASK) ? 2 : 1;
+	if (npgs > Z_MCE_MAX)
+		npgs = Z_MCE_MAX;
+
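+	/* grow both pools of the pair so mp_pick() can toggle between them */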
+	for (idx = 0; idx < 2; idx++) {
+		mp = pool[idx];
+		for (iter = mp->size >> MCE_SHIFT; iter < npgs; iter++) {
+			if (mp->pgs[iter])
+				continue;
+			pg = ZDM_ALLOC(mp->znd, Z_C4K, PG_08, gfp);
+			if (!pg) {
+				Z_ERR(mp->znd, "Unable to expand pool size.");
+				break;
+			}
+			mp->pgs[iter] = pg;
+		}
+		mp->size = iter << MCE_SHIFT;
+	}
+}
+
+/**
+ * mp_pick() - Toggle/Init the next map pool
+ * @cur: Current pool
+ * @pool: Pair of next pools
+ */
+static struct map_pool *mp_pick(struct map_pool *cur, struct map_pool **pool)
+{
+	struct map_pool *mp = (cur != pool[1]) ? pool[1] : pool[0];
+
+	mp->sorted = mp->count = 0;
+	return mp;
+}
+
+/**
+ * warn_bad_lba() - Warn if a given LBA is not valid (esp. if beyond a WP)
+ * @znd: ZDM instance
+ * @lba48: 48 bit lba.
+ *
+ * Return: non-zero if lba is not valid.
+ */
+static inline int warn_bad_lba(struct zdm *znd, u64 lba48)
+{
+#define FMT_ERR "LBA %" PRIx64 " is not valid: Z# %u, off:%x wp:%x"
+	int rcode = 0;
+	u32 zone;
+
+	if (lba48 < znd->md_start)
+		return rcode;
+
+	zone = _calc_zone(znd, lba48);
+	if (zone < znd->zone_count) { /* FIXME: MAYBE? md_end */
+		u32 gzoff = zone & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[zone >> GZ_BITS];
+		u32 wp_at = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_VALUE_MASK;
+		u16 off = (lba48 - znd->md_start) % Z_BLKSZ;
+
+		if (off >= wp_at) {
+			rcode = 1;
+			Z_ERR(znd, FMT_ERR, lba48, zone, off, wp_at);
+			dump_stack();
+		}
+	} else {
+		rcode = 1;
+		Z_ERR(znd, "LBA is not valid - Z# %u, count %u",
+		      zone, znd->zone_count);
+	}
+
+	return rcode;
+}
+
+/**
+ * mapped_free() - Release a page of lookup table entries.
+ * @znd: ZDM instance
+ * @mapped: mapped page struct to free.
+ */
+static void mapped_free(struct zdm *znd, struct map_pg *mapped)
+{
+	unsigned long flags;
+
+	if (mapped) {
+		spin_lock_irqsave(&mapped->md_lock, flags);
+		WARN_ON(test_bit(IS_DIRTY, &mapped->flags));
+		if (mapped->data.addr) {
+			ZDM_FREE(znd, mapped->data.addr, Z_C4K, PG_27);
+			atomic_dec(&znd->incore);
+		}
+		if (mapped->crc_pg) {
+			deref_pg(mapped->crc_pg);
+			mapped->crc_pg = NULL;
+		}
+		spin_unlock_irqrestore(&mapped->md_lock, flags);
+		ZDM_FREE(znd, mapped, sizeof(*mapped), KM_20);
+	}
+}
+
+/**
+ * flush_map() - write dirty map entries to disk.
+ * @znd: ZDM instance
+ * @hmap: Hash table of mapped pages.
+ * @count: Number of hash buckets in @hmap.
+ * Return: non-zero on error.
+ */
+static int flush_map(struct zdm *znd, struct hlist_head *hmap, u32 count)
+{
+	struct map_pg *pg;
+	const int use_wq = 1;
+	const int sync = 1;
+	int err = 0;
+	u32 ii;
+
+	if (!hmap)
+		return err;
+
+	for (ii = 0; ii < count; ii++) {
+		hlist_for_each_entry(pg, &hmap[ii], hentry) {
+			if (pg && pg->data.addr) {
+				cache_if_dirty(znd, pg, use_wq);
+				err |= write_if_dirty(znd, pg, use_wq, sync);
+			}
+		}
+	}
+
+	return err;
+}
+
+/**
+ * zoned_io_flush() - flush all pending IO.
+ * @znd: ZDM instance
+ */
+static int zoned_io_flush(struct zdm *znd)
+{
+	int err = 0;
+	unsigned long flags;
+
+	set_bit(ZF_FREEZE, &znd->flags);
+	atomic_inc(&znd->gc_throttle);
+
+	mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+	flush_delayed_work(&znd->gc_work);
+
+	set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+	set_bit(DO_MEMPOOL, &znd->flags);
+	set_bit(DO_SYNC, &znd->flags);
+	set_bit(DO_FLUSH, &znd->flags);
+	queue_work(znd->meta_wq, &znd->meta_work);
+	flush_workqueue(znd->meta_wq);
+	flush_workqueue(znd->bg_wq);
+
+	mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+	flush_delayed_work(&znd->gc_work);
+	atomic_dec(&znd->gc_throttle);
+
+	spin_lock_irqsave(&znd->lzy_lck, flags);
+	INIT_LIST_HEAD(&znd->lzy_pool);
+	spin_unlock_irqrestore(&znd->lzy_lck, flags);
+
+	spin_lock_irqsave(&znd->zlt_lck, flags);
+	INIT_LIST_HEAD(&znd->zltpool);
+	spin_unlock_irqrestore(&znd->zlt_lck, flags);
+
+	err = flush_map(znd, znd->fwd_hm, ARRAY_SIZE(znd->fwd_hm));
+	if (err)
+		goto out;
+
+	err = flush_map(znd, znd->rev_hm, ARRAY_SIZE(znd->rev_hm));
+	if (err)
+		goto out;
+
+	err = flush_map(znd, znd->fwd_hcrc, ARRAY_SIZE(znd->fwd_hcrc));
+	if (err)
+		goto out;
+
+	err = flush_map(znd, znd->rev_hcrc, ARRAY_SIZE(znd->rev_hcrc));
+	if (err)
+		goto out;
+
+	set_bit(DO_SYNC, &znd->flags);
+	set_bit(DO_FLUSH, &znd->flags);
+	queue_work(znd->meta_wq, &znd->meta_work);
+	flush_workqueue(znd->meta_wq);
+
+out:
+	return err;
+}
+
+/**
+ * release_table_pages() - flush and free all table map entries.
+ * @znd: ZDM instance
+ */
+static void release_table_pages(struct zdm *znd)
+{
+	int entry;
+	struct map_pg *expg;
+	struct hlist_node *_tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->mapkey_lock, flags);
+	hash_for_each_safe(znd->fwd_hm, entry, _tmp, expg, hentry) {
+		hash_del(&expg->hentry);
+		mapped_free(znd, expg);
+	}
+	hash_for_each_safe(znd->rev_hm, entry, _tmp, expg, hentry) {
+		hash_del(&expg->hentry);
+		mapped_free(znd, expg);
+	}
+	spin_unlock_irqrestore(&znd->mapkey_lock, flags);
+
+	spin_lock_irqsave(&znd->ct_lock, flags);
+	hash_for_each_safe(znd->fwd_hcrc, entry, _tmp, expg, hentry) {
+		hash_del(&expg->hentry);
+		mapped_free(znd, expg);
+	}
+	hash_for_each_safe(znd->rev_hcrc, entry, _tmp, expg, hentry) {
+		hash_del(&expg->hentry);
+		mapped_free(znd, expg);
+	}
+	spin_unlock_irqrestore(&znd->ct_lock, flags);
+}
+
+/**
+ * _release_wp() - free all WP alloc/usage/used data.
+ * @znd: ZDM instance
+ * @wp: Object to free.
+ */
+static void _release_wp(struct zdm *znd, struct meta_pg *wp)
+{
+	u32 gzno;
+
+	for (gzno = 0; gzno < znd->gz_count; gzno++) {
+		struct meta_pg *wpg = &wp[gzno];
+
+		if (wpg->wp_alloc)
+			ZDM_FREE(znd, wpg->wp_alloc, Z_C4K, PG_06);
+		if (wpg->zf_est)
+			ZDM_FREE(znd, wpg->zf_est, Z_C4K, PG_06);
+		if (wpg->wp_used)
+			ZDM_FREE(znd, wpg->wp_used, Z_C4K, PG_06);
+	}
+	ZDM_FREE(znd, wp, znd->gz_count * sizeof(*wp), KM_21);
+	znd->wp = NULL;
+}
+
+/**
+ * _free_map_pools() - Free map pools holding extent cache blocks.
+ * @znd: ZDM instance
+ */
+static void _free_map_pools(struct zdm *znd)
+{
+	int idx;
+
+	for (idx = 0; idx < 2; idx++) {
+		do_map_pool_free(znd->in[idx]);
+		znd->in[idx] = NULL;
+		do_map_pool_free(znd->_use[idx]);
+		znd->_use[idx] = NULL;
+		do_map_pool_free(znd->trim_mp[idx]);
+		znd->trim_mp[idx] = NULL;
+		do_map_pool_free(znd->_wbj[idx]);
+		znd->_wbj[idx] = NULL;
+	}
+}
+
+/**
+ * zoned_destroy() - Teardown a zdm device mapper instance.
+ * @znd: ZDM instance
+ */
+static void zoned_destroy(struct zdm *znd)
+{
+	int purge;
+
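+	/* quiesce timers and workqueues before releasing metadata and pools */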
+	if (znd->timer.data)
+		del_timer_sync(&znd->timer);
+
+	if (znd->gc_wq && znd->meta_wq)
+		if (zoned_io_flush(znd))
+			Z_ERR(znd, "sync/flush failure");
+
+	release_table_pages(znd);
+
+	if (znd->dev) {
+		dm_put_device(znd->ti, znd->dev);
+		znd->dev = NULL;
+	}
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dev) {
+		dm_put_device(znd->ti, znd->meta_dev);
+		znd->meta_dev = NULL;
+	}
+#endif
+	if (znd->io_wq) {
+		destroy_workqueue(znd->io_wq);
+		znd->io_wq = NULL;
+	}
+	if (znd->zone_action_wq) {
+		destroy_workqueue(znd->zone_action_wq);
+		znd->zone_action_wq = NULL;
+	}
+	if (znd->bg_wq) {
+		destroy_workqueue(znd->bg_wq);
+		znd->bg_wq = NULL;
+	}
+	if (znd->gc_wq) {
+		destroy_workqueue(znd->gc_wq);
+		znd->gc_wq = NULL;
+	}
+	if (znd->meta_wq) {
+		destroy_workqueue(znd->meta_wq);
+		znd->meta_wq = NULL;
+	}
+	if (znd->io_client)
+		dm_io_client_destroy(znd->io_client);
+	if (znd->wp)
+		_release_wp(znd, znd->wp);
+	if (znd->md_crcs)
+		ZDM_FREE(znd, znd->md_crcs, Z_C4K * 2, MP_22);
+	if (znd->gc_io_buf)
+		ZDM_FREE(znd, znd->gc_io_buf, Z_C4K * GC_MAX_STRIPE, VM_04);
+	if (znd->gc_postmap.gc_mcd) {
+		size_t sz = sizeof(struct gc_map_cache_data);
+
+		ZDM_FREE(znd, znd->gc_postmap.gc_mcd, sz, VM_03);
+	}
+
+	for (purge = 0; purge < ARRAY_SIZE(znd->io_vcache); purge++) {
+		size_t vcsz = IO_VCACHE_PAGES * sizeof(struct io_4k_block *);
+
+		if (znd->io_vcache[purge]) {
+			if (test_and_clear_bit(purge, &znd->io_vcache_flags))
+				Z_ERR(znd, "sync cache entry %d still in use!",
+				      purge);
+			ZDM_FREE(znd, znd->io_vcache[purge], vcsz, VM_12);
+		}
+	}
+	if (znd->z_sballoc)
+		ZDM_FREE(znd, znd->z_sballoc, Z_C4K, PG_05);
+
+	if (znd->cow_block)
+		ZDM_FREE(znd, znd->cow_block, Z_C4K, PG_02);
+
+	_free_map_pools(znd);
+
+	if (znd->bio_set) {
+		struct bio_set *bs = znd->bio_set;
+
+		znd->bio_set = NULL;
+		bioset_free(bs);
+	}
+	if (znd->mempool_pages) {
+		mempool_destroy(znd->mempool_pages);
+		znd->mempool_pages = NULL;
+	}
+	if (znd->mempool_maps) {
+		mempool_destroy(znd->mempool_maps);
+		znd->mempool_maps = NULL;
+	}
+	if (znd->mempool_wset) {
+		mempool_destroy(znd->mempool_wset);
+		znd->mempool_wset = NULL;
+	}
+	if (znd->bp_q_node) {
+		mempool_destroy(znd->bp_q_node);
+		znd->bp_q_node = NULL;
+	}
+	if (znd->bp_chain_vec) {
+		mempool_destroy(znd->bp_chain_vec);
+		znd->bp_chain_vec = NULL;
+	}
+	ZDM_FREE(NULL, znd, sizeof(*znd), KM_00);
+}
+
+/**
+ * _init_streams() - Setup initial conditions for streams and reserved zones.
+ * @znd: ZDM instance
+ */
+static void _init_streams(struct zdm *znd)
+{
+	u64 stream;
+
+	for (stream = 0; stream < ARRAY_SIZE(znd->bmkeys->stream); stream++)
+		znd->bmkeys->stream[stream] = cpu_to_le32(NOZONE);
+
+	znd->z_meta_resv = cpu_to_le32(znd->zone_count - 2);
+	znd->z_gc_resv   = cpu_to_le32(znd->zone_count - 1);
+	znd->z_gc_free   = znd->data_zones - 2;
+}
+
+/**
+ * _init_mdcrcs() - Setup initial values for empty CRC blocks.
+ * @znd: ZDM instance
+ */
+static void _init_mdcrcs(struct zdm *znd)
+{
+	int idx;
+
+	for (idx = 0; idx < Z_C4K; idx++)
+		znd->md_crcs[idx] = MD_CRC_INIT;
+}
+
+/**
+ * _init_wp() - Setup initial usage for empty data zones.
+ * @znd: ZDM instance
+ * @gzno: Group zone number (index of @wpg in znd->wp).
+ * @wpg: Write pointer metadata page to initialize.
+ */
+static void _init_wp(struct zdm *znd, u32 gzno, struct meta_pg *wpg)
+{
+	u32 gzcount = 1 << GZ_BITS;
+	u32 iter;
+
+	if (znd->zone_count < ((gzno+1) << GZ_BITS))
+		gzcount = znd->zone_count & GZ_MMSK;
+
+	/* mark as empty */
+	for (iter = 0; iter < gzcount; iter++)
+		wpg->zf_est[iter] = cpu_to_le32(Z_BLKSZ);
+
+	/* mark as n/a -- full */
+	gzcount = 1 << GZ_BITS;
+	for (; iter < gzcount; iter++) {
+		wpg->wp_alloc[iter] = cpu_to_le32(~0u);
+		wpg->wp_used[iter] = cpu_to_le32(~0u);
+	}
+}
+
+/**
+ * _alloc_wp() - Allocate needed WP (meta_pg) objects
+ * @znd: ZDM instance
+ */
+static struct meta_pg *_alloc_wp(struct zdm *znd)
+{
+	struct meta_pg *wp;
+	u32 gzno;
+	const gfp_t gfp = GFP_KERNEL;
+
+	wp = ZDM_CALLOC(znd, znd->gz_count, sizeof(*znd->wp), KM_21, gfp);
+	if (!wp)
+		goto out;
+	for (gzno = 0; gzno < znd->gz_count; gzno++) {
+		struct meta_pg *wpg = &wp[gzno];
+
+		spin_lock_init(&wpg->wplck);
+		wpg->lba      = 2048ul + (gzno * 2);
+		wpg->wp_alloc = ZDM_ALLOC(znd, Z_C4K, PG_06, gfp);
+		wpg->zf_est   = ZDM_ALLOC(znd, Z_C4K, PG_06, gfp);
+		wpg->wp_used  = ZDM_ALLOC(znd, Z_C4K, PG_06, gfp);
+		if (!wpg->wp_alloc || !wpg->zf_est || !wpg->wp_used) {
+			_release_wp(znd, wp);
+			wp = NULL;
+			goto out;
+		}
+		_init_wp(znd, gzno, wpg);
+		set_bit(IS_DIRTY, &wpg->flags);
+		clear_bit(IS_FLUSH, &wpg->flags);
+	}
+
+out:
+	return wp;
+}
+
+/**
+ * is_conventional() - Determine if zone is conventional.
+ * @dentry: Zone descriptor entry.
+ *
+ * Return: 1 if zone type is conventional.
+ */
+static inline bool is_conventional(struct blk_zone *dentry)
+{
+	return BLK_ZONE_TYPE_CONVENTIONAL == dentry->type;
+}
+
+/**
+ * is_zone_reset() - Determine if zone is reset / ready for writing.
+ * @dentry: Zone descriptor entry.
+ *
+ * Return: 1 if zone condition is empty or zone type is conventional.
+ */
+static inline bool is_zone_reset(struct blk_zone *dentry)
+{
+	return (BLK_ZONE_COND_EMPTY == dentry->cond ||
+		BLK_ZONE_TYPE_CONVENTIONAL == dentry->type);
+}
+
+/**
+ * normalize_zone_wp() - Decode write pointer as # of blocks from start
+ * @znd: ZDM Instance
+ * @zone: Zone descriptor entry.
+ *
+ * Return: Write Pointer as number of blocks from start of zone.
+ */
+static inline u32 normalize_zone_wp(struct zdm *znd, struct blk_zone *zone)
+{
+	return zone->wp - zone->start;
+}
+
+/**
+ * _dec_wp_avail_by_lost() - Update free count due to lost/unusable blocks.
+ * @wpg: Write pointer metadata page.
+ * @gzoff: Zone entry in page.
+ * @lost: Number of blocks 'lost'.
+ */
+static inline
+void _dec_wp_avail_by_lost(struct meta_pg *wpg, u32 gzoff, u32 lost)
+{
+	u32 est = le32_to_cpu(wpg->zf_est[gzoff]) + lost;
+
+	if (est > Z_BLKSZ)
+		est = Z_BLKSZ;
+	wpg->zf_est[gzoff] = cpu_to_le32(est);
+}
+
+/**
+ * zoned_wp_sync() - Re-Sync expected WP location with drive
+ * @znd: ZDM Instance
+ * @reset_non_empty: Reset the non-empty zones.
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int zoned_wp_sync(struct zdm *znd, int reset_non_empty)
+{
+	struct blk_zone *zones;
+	int rcode = 0;
+	u32 rcount = 0;
+	u32 iter;
+	const gfp_t gfp = GFP_KERNEL;
+
+	zones = kcalloc(Z_C4K, sizeof(*zones), gfp);
+	if (!zones) {
+		rcode = -ENOMEM;
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		goto out;
+	}
+
+	for (iter = znd->dz_start; iter < znd->zone_count; iter++) {
+		u32 entry = (iter - znd->dz_start) % Z_C4K;
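+		/* one meta_pg tracks 1024 zones; 10 mirrors GZ_BITS elsewhere */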
+		u32 gzno  = iter >> 10;
+		u32 gzoff = iter & ((1 << 10) - 1);
+		struct meta_pg *wpg = &znd->wp[gzno];
+		struct blk_zone *dentry;
+		u32 wp_flgs;
+		u32 wp_at;
+		u32 wp;
+
+		if (entry == 0) {
+			unsigned int nz = Z_C4K;
+			int err = dmz_report_zones(znd, iter, zones, &nz, gfp);
+
+			if (err) {
+				Z_ERR(znd, "report zones-> %d", err);
+				if (err != -ENOTSUPP)
+					rcode = err;
+				goto out;
+			}
+			rcount = nz;
+		}
+
+		if (entry >= rcount)
+			break;
+
+		dentry = &zones[entry];
+		if (reset_non_empty && !is_conventional(dentry)) {
+			int err = 0;
+
+			if (!is_zone_reset(dentry))
+				err = dmz_reset_wp(znd, iter);
+
+			if (err) {
+				Z_ERR(znd, "reset wp-> %d", err);
+				if (err != -ENOTSUPP)
+					rcode = err;
+				goto out;
+			}
+			wp = wp_at = 0;
+			wpg->wp_alloc[gzoff] = cpu_to_le32(0);
+			wpg->zf_est[gzoff] = cpu_to_le32(Z_BLKSZ);
+			wpg->wp_used[gzoff] = wpg->wp_alloc[gzoff];
+			continue;
+		}
+
+		wp = normalize_zone_wp(znd, dentry);
+		wp >>= Z_SHFT4K; /* 512 sectors to 4k sectors */
+		wp_at = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_VALUE_MASK;
+		wp_flgs = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_FLAGS_MASK;
+
+		if (is_conventional(dentry)) {
+			wp = wp_at;
+			wp_flgs |= Z_WP_NON_SEQ;
+		} else {
+			wp_flgs &= ~Z_WP_NON_SEQ;
+		}
+
+		if (wp > wp_at) {
+			u32 lost = wp - wp_at;
+
+			wp_at = wp;
+			_dec_wp_avail_by_lost(wpg, gzoff, lost);
+
+			Z_ERR(znd, "Z#%u z:%x [wp:%x rz:%x] lost %u blocks.",
+			      iter, gzoff, wp_at, wp, lost);
+		}
+		wpg->wp_alloc[gzoff] = cpu_to_le32(wp_at|wp_flgs);
+		wpg->wp_used[gzoff] = cpu_to_le32(wp_at);
+	}
+
+out:
+	kfree(zones);
+
+	return rcode;
+}
+
+/**
+ * _init_map_pools() - Construct map pools
+ * @znd: ZDM Instance
+ *
+ * Return: 0 on success, otherwise error.
+ */
+static int _init_map_pools(struct zdm *znd)
+{
+	int rcode = 0;
+	int idx;
+
+	for (idx = 0; idx < 2; idx++) {
+		znd->in[idx] = _alloc_map_pool(znd, 1, GFP_KERNEL);
+		znd->_use[idx] = _alloc_map_pool(znd, 1, GFP_KERNEL);
+		znd->trim_mp[idx] = _alloc_map_pool(znd, 1, GFP_KERNEL);
+		znd->_wbj[idx] = _alloc_map_pool(znd, 1, GFP_KERNEL);
+
+		if (!znd->in[idx] || !znd->_use[idx] ||
+		    !znd->trim_mp[idx] || !znd->_wbj[idx]) {
+			znd->ti->error = "couldn't allocate map pools.";
+			rcode = -ENOMEM;
+			goto out;
+		}
+	}
+
+	for (idx = 0; idx < 2; idx++) {
+		znd->in[idx]->isa =  IS_INGRESS;
+		znd->_use[idx]->isa = IS_UNUSED;
+		znd->trim_mp[idx]->isa = IS_TRIM;
+		znd->_wbj[idx]->isa = IS_WBJRNL;
+		if (idx == 0) {
+			znd->ingress = znd->in[idx];
+			znd->unused = znd->_use[idx];
+			znd->trim = znd->trim_mp[idx];
+			znd->wbjrnl = znd->_wbj[idx];
+		}
+	}
+out:
+	return rcode;
+}
+
+/**
+ * do_init_zoned() - Initialize a zdm device mapper instance
+ * @ti: DM Target Info
+ * @znd: ZDM Target
+ *
+ * Return: 0 on success.
+ *
+ * Setup the zone pointer table and do a one time calculation of some
+ * basic limits.
+ *
+ *  While start of partition may not be zone aligned
+ *  md_start, data_lba and md_end are all zone aligned.
+ *  From start of partition [or device] to md_start some conv/pref
+ *  space is required for superblocks, memcache, zone pointers, crcs
+ *  and optionally pinned forward lookup blocks.
+ *
+ *    0 < znd->md_start <= znd->data_lba <= znd->md_end
+ *
+ *  incoming [FS sectors] are linearly mapped after md_end.
+ *  And blocks following data_lba are serialized into zones either with
+ *  explicit stream id support from BIOs [FUTURE], or implicitly by LBA
+ *  or type of data.
+ */
+static int do_init_zoned(struct dm_target *ti, struct zdm *znd)
+{
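+	/* 256 MiB zones: bytes >> 28 = zones; zone << 19 = 512-byte sectors */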
+	u64 part_sz = i_size_read(get_bdev_bd_inode(znd));
+	u64 zone_count = part_sz >> 28;
+	u64 gz_count = (zone_count + 1023) >> 10;
+	u64 overprovision = znd->mz_provision * gz_count;
+	u64 zltblks = (znd->zdstart << 19) - znd->start_sect;
+	u64 blocks = (zone_count - overprovision) << 19;
+	u64 data_zones = (blocks >> 19) + overprovision;
+	u64 mapct;
+	u64 crcct;
+	u32 mz_min = 0; /* cache */
+	const gfp_t gfp = GFP_KERNEL;
+	int rcode = 0;
+
+	INIT_LIST_HEAD(&znd->zltpool);
+	INIT_LIST_HEAD(&znd->lzy_pool);
+
+	spin_lock_init(&znd->zlt_lck);
+	spin_lock_init(&znd->lzy_lck);
+	spin_lock_init(&znd->stats_lock);
+	spin_lock_init(&znd->mapkey_lock);
+	spin_lock_init(&znd->ct_lock);
+	spin_lock_init(&znd->gc_lock);
+	spin_lock_init(&znd->gc_postmap.cached_lock);
+
+	spin_lock_init(&znd->in_rwlck);
+	spin_lock_init(&znd->unused_rwlck);
+	spin_lock_init(&znd->trim_rwlck);
+	spin_lock_init(&znd->wbjrnl_rwlck);
+
+	mutex_init(&znd->pool_mtx);
+	mutex_init(&znd->gc_wait);
+	mutex_init(&znd->gc_vcio_lock);
+	mutex_init(&znd->vcio_lock);
+	mutex_init(&znd->mz_io_mutex);
+
+	/*
+	 * The space from the start of the partition [znd->start_sect]
+	 * to the first zone used for data [znd->zdstart]
+	 * is the number of blocks reserved for fLUT, rLUT and CRCs [zltblks]
+	 */
+	blocks -= zltblks;
+	data_zones = (blocks >> 19) + overprovision;
+
+	pr_err("ZDM: Part Sz %llu\n", part_sz);
+	pr_err("ZDM: DM Size %llu\n", blocks);
+	pr_err("ZDM: Zones %llu -- Reported Data Zones %llu\n",
+	       zone_count, blocks >> 19);
+	pr_err("ZDM: Data Zones %llu\n", (blocks >> 19) + overprovision);
+	pr_err("ZDM: GZones %llu\n", gz_count);
+
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+		/*
+		 * When metadata is *only* on secondary we can reclaim the
+		 * space reserved for metadata on the primary.
+		 * We can add back up to: zltblks
+		 * However data should still be zone aligned so:
+		 *     zltblks & ((1ull << 19) - 1)
+		 */
+		u64 t = (part_sz - (znd->sec_zone_align << Z_SHFT_SEC)) >> 12;
+
+// FIXME: Add back the zones used for SBlock + fLUT rLUT and CRCs ?
+
+		data_zones = t >> Z_BLKBITS;
+//		gz_count = dm_div_up(zone_count, MAX_ZONES_PER_MZ);
+	}
+	mapct = dm_div_up(zone_count << Z_BLKBITS, 1024);
+	crcct = dm_div_up(mapct, 2048);
+#else
+	mapct = dm_div_up(zone_count << Z_BLKBITS, 1024);
+	crcct = dm_div_up(mapct, 2048);
+#endif
+
+	/* pool of single pages (order 0) */
+	znd->mempool_pages = mempool_create_page_pool(4096, 0);
+	znd->mempool_maps = mempool_create_kmalloc_pool(4096,
+						sizeof(struct map_pg));
+	znd->mempool_wset = mempool_create_kmalloc_pool(3,
+					sizeof(struct map_pg*) * MAX_WSET);
+
+	if (!znd->mempool_pages || !znd->mempool_maps || !znd->mempool_wset) {
+		ti->error = "couldn't allocate mempools";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->bp_q_node = mempool_create_kmalloc_pool(
+				1024, sizeof(struct zdm_q_node));
+	znd->bp_chain_vec = mempool_create_kmalloc_pool(
+				256, sizeof(struct zdm_bio_chain));
+	if (!znd->bp_q_node || !znd->bp_chain_vec) {
+		ti->error = "couldn't allocate bio queue mempools";
+		rcode = -ENOMEM;
+		goto out;
+	}
+
+	znd->zone_count = zone_count;
+	znd->data_zones = data_zones;
+	znd->gz_count = gz_count;
+	znd->crc_count = crcct;
+	znd->map_count = mapct;
+
+	znd->md_start = dm_round_up(znd->start_sect, Z_BLKSZ) - znd->start_sect;
+	if (znd->md_start < (WB_JRNL_BASE + WB_JRNL_MIN)) {
+		znd->md_start += Z_BLKSZ;
+		mz_min++;
+	}
+
+	/* md_start - lba of first full zone in partition addr space */
+	znd->s_base = znd->md_start;
+	mz_min += gz_count;
+	if (mz_min < znd->zdstart)
+		set_bit(ZF_POOL_FWD, &znd->flags);
+
+	znd->r_base = znd->s_base + (gz_count << Z_BLKBITS);
+	mz_min += gz_count;
+	if (mz_min < znd->zdstart)
+		set_bit(ZF_POOL_REV, &znd->flags);
+
+	znd->c_base = znd->r_base + (gz_count << Z_BLKBITS);
+	znd->c_mid  = znd->c_base + (gz_count * 0x20);
+	znd->c_end  = znd->c_mid + (gz_count * 0x20);
+	mz_min++;
+
+	if (mz_min < znd->zdstart) {
+		Z_ERR(znd, "Conv space for CRCs: Seting ZF_POOL_CRCS");
+		set_bit(ZF_POOL_CRCS, &znd->flags);
+	}
+
+	if (test_bit(ZF_POOL_FWD, &znd->flags)) {
+		znd->sk_low = znd->sk_high = 0;
+	} else {
+		znd->sk_low = znd->s_base;
+		znd->sk_high = znd->sk_low + (gz_count * 0x40);
+	}
+
+	/* logical *ending* lba for meta [bumped up to next zone alignment] */
+	znd->md_end = dm_round_up(znd->c_end, 1 << Z_BLKBITS);
+
+	/* actual starting lba for data pool */
+	znd->data_lba = znd->md_end;
+	if (!test_bit(ZF_POOL_CRCS, &znd->flags))
+		znd->data_lba = znd->c_base;
+	if (!test_bit(ZF_POOL_REV, &znd->flags))
+		znd->data_lba = znd->r_base;
+	if (!test_bit(ZF_POOL_FWD, &znd->flags))
+		znd->data_lba = znd->s_base;
+
+	/* NOTE: md_end == data_lba => all meta is in conventional zones. */
+	Z_INFO(znd, "ZDM #%u", BUILD_NO);
+	Z_INFO(znd, "  Bio Queue %s",
+	       znd->queue_depth > 0 ? "enabled" : "disabled");
+	Z_INFO(znd, "  Trim %s", znd->enable_trim ? "enabled" : "disabled");
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag != DST_TO_PRI_DEVICE)
+		Z_INFO(znd, "  Metadata %s secondary %s",
+		       znd->meta_dst_flag == DST_TO_BOTH_DEVICE ?
+		       "mirrored with" : "only on", znd->bdev_metaname);
+#endif
+
+	Z_INFO(znd, "  Starting Zone# %u", znd->zdstart);
+	Z_INFO(znd, "  Starting Sector: %" PRIx64, znd->start_sect);
+	Z_INFO(znd, "  Metadata range [%" PRIx64 ", %" PRIx64 "]",
+	       znd->md_start, znd->md_end);
+	Z_INFO(znd, "  Data LBA %" PRIx64, znd->data_lba);
+	Z_INFO(znd, "  Size: %" PRIu64" / %" PRIu64,
+	       blocks >> 3, part_sz >> 12);
+	Z_INFO(znd, "  Zones: %u total %u data",
+	       znd->zone_count, znd->data_zones);
+
+#if ALLOC_DEBUG
+	znd->alloc_trace = vzalloc(sizeof(*znd->alloc_trace) * ADBG_ENTRIES);
+	if (!znd->alloc_trace) {
+		ti->error = "couldn't allocate in-memory mem debug trace";
+		rcode = -ENOMEM;
+		goto out;
+	}
+#endif
+
+	znd->z_sballoc = ZDM_ALLOC(znd, Z_C4K, PG_05, GFP_KERNEL);
+	if (!znd->z_sballoc) {
+		ti->error = "couldn't allocate in-memory superblock";
+		rcode = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	* MD journal space: 15360 [15]
+	* Journal Index     688
+	* Journal data      698
+	* Metadata start    ff00
+	* Forward table     ff00
+	* Reverse table     2ff00
+	* Forward CRC table 4ff00
+	* Reverse CRC table 4ff40
+	* Normal data start 5ff00
+	*/
+
+	Z_INFO(znd, "Metadata start    %" PRIx64, znd->md_start);
+	Z_INFO(znd, "Forward table     %" PRIx64, znd->s_base );
+	Z_INFO(znd, "Reverse table     %" PRIx64, znd->r_base );
+	Z_INFO(znd, "Forward CRC table %" PRIx64, znd->c_base );
+	Z_INFO(znd, "Reverse CRC table %" PRIx64, znd->c_mid );
+	Z_INFO(znd, "Normal data start %" PRIx64, znd->data_lba);
+
+	znd->dz_start = znd->data_lba >> Z_BLKBITS;
+	znd->z_current = znd->dz_start;
+	znd->gc_postmap.gc_mcd = ZDM_ALLOC(znd, sizeof(struct gc_map_cache_data),
+					VM_03, GFP_KERNEL);
+	znd->md_crcs = ZDM_ALLOC(znd, Z_C4K * 2, MP_22, GFP_KERNEL);
+	znd->gc_io_buf = ZDM_CALLOC(znd, GC_MAX_STRIPE, Z_C4K, VM_04, gfp);
+	znd->wp = _alloc_wp(znd);
+	znd->io_vcache[0] = ZDM_CALLOC(znd, IO_VCACHE_PAGES,
+				sizeof(struct io_4k_block), VM_12, gfp);
+	znd->io_vcache[1] = ZDM_CALLOC(znd, IO_VCACHE_PAGES,
+				sizeof(struct io_4k_block), VM_12, gfp);
+
+	if (!znd->gc_postmap.gc_mcd || !znd->md_crcs || !znd->gc_io_buf ||
+	    !znd->wp || !znd->io_vcache[0] || !znd->io_vcache[1]) {
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->gc_postmap.jsize = Z_BLKSZ;
+	znd->gc_postmap.map_content = IS_POST_MAP;
+	_init_mdcrcs(znd);
+
+	znd->io_client = dm_io_client_create();
+	if (!znd->io_client) {
+		rcode = -ENOMEM;
+		goto out;
+	}
+
+	rcode = _init_map_pools(znd);
+	if (rcode)
+		goto out;
+
+#define ZDM_WQ (__WQ_LEGACY | WQ_MEM_RECLAIM)
+
+	znd->meta_wq = alloc_ordered_workqueue("zwq_md_%s", ZDM_WQ, znd->bdev_name);
+	if (!znd->meta_wq) {
+		ti->error = "couldn't start metadata workqueue";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->gc_wq = alloc_ordered_workqueue("zwq_gc_%s", ZDM_WQ, znd->bdev_name);
+	if (!znd->gc_wq) {
+		ti->error = "couldn't start GC workqueue.";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->bg_wq = alloc_ordered_workqueue("zwq_bg_%s", ZDM_WQ, znd->bdev_name);
+	if (!znd->bg_wq) {
+		ti->error = "couldn't start background workqueue.";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->io_wq = alloc_ordered_workqueue("zwq_io_%s", ZDM_WQ, znd->bdev_name);
+	if (!znd->io_wq) {
+		ti->error = "couldn't start DM I/O workqueue";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	znd->zone_action_wq = alloc_ordered_workqueue("zwq_za_%s", ZDM_WQ, znd->bdev_name);
+	if (!znd->zone_action_wq) {
+		ti->error = "couldn't start zone action workqueue";
+		rcode = -ENOMEM;
+		goto out;
+	}
+	init_waitqueue_head(&znd->wait_bio);
+	INIT_WORK(&znd->meta_work, meta_work_task);
+	INIT_WORK(&znd->bg_work, bg_work_task);
+	INIT_DELAYED_WORK(&znd->gc_work, gc_work_task);
+	setup_timer(&znd->timer, activity_timeout, (unsigned long)znd);
+	znd->last_w = BAD_ADDR;
+	set_bit(DO_SYNC, &znd->flags);
+
+out:
+	return rcode;
+}
+
+/**
+ * check_metadata_version() - Test ZDM version for compatibility.
+ * @sblock: Super block
+ *
+ * Return 0 if valid, or -EINVAL if the version is not recognized.
+ */
+static int check_metadata_version(struct zdm_superblock *sblock)
+{
+	u32 metadata_version = le32_to_cpu(sblock->version);
+
+	if (metadata_version < MIN_ZONED_VERSION
+	    || metadata_version > MAX_ZONED_VERSION) {
+		DMERR("Unsupported metadata version %u found.",
+		      metadata_version);
+		DMERR("Only versions between %u and %u supported.",
+		      MIN_ZONED_VERSION, MAX_ZONED_VERSION);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * sb_crc32() - CRC check for superblock.
+ * @sblock: Superblock to check.
+ */
+static __le32 sb_crc32(struct zdm_superblock *sblock)
+{
+	const __le32 was = sblock->csum;
+	u32 crc;
+
+	sblock->csum = 0;
+	crc = crc32c(~(u32) 0u, sblock, sizeof(*sblock)) ^ SUPERBLOCK_CSUM_XOR;
+
+	sblock->csum = was;
+	return cpu_to_le32(crc);
+}
+
+/**
+ * sb_check() - Check the superblock to see if it is valid and not corrupt.
+ * @sblock: Superblock to check.
+ */
+static int sb_check(struct zdm_superblock *sblock)
+{
+	__le32 csum_le;
+
+	if (le64_to_cpu(sblock->magic) != SUPERBLOCK_MAGIC) {
+		DMERR("sb_check failed: magic %" PRIx64 ": wanted %lx",
+		      le64_to_cpu(sblock->magic), SUPERBLOCK_MAGIC);
+		return -EILSEQ;
+	}
+
+	csum_le = sb_crc32(sblock);
+	if (csum_le != sblock->csum) {
+		DMERR("sb_check failed: csum %u: wanted %u",
+		      csum_le, sblock->csum);
+		return -EILSEQ;
+	}
+
+	return check_metadata_version(sblock);
+}
+
+/**
+ * zoned_create_disk() - Initialize the on-disk format of a zdm device mapper.
+ * @ti: DM Target Instance
+ * @znd: ZDM Instance
+ */
+static int zoned_create_disk(struct dm_target *ti, struct zdm *znd)
+{
+	const int reset_non_empty = 1;
+	struct zdm_superblock *sblock = znd->super_block;
+	int err;
+
+	memset(sblock, 0, sizeof(*sblock));
+	generate_random_uuid(sblock->uuid);
+	sblock->magic = cpu_to_le64(SUPERBLOCK_MAGIC);
+	sblock->version = cpu_to_le32(Z_VERSION);
+	sblock->zdstart = cpu_to_le32(znd->zdstart);
+
+	err = zoned_wp_sync(znd, reset_non_empty);
+
+	return err;
+}
+
+/**
+ * zoned_repair() - Attempt easy on-line fixes.
+ * @znd: ZDM Instance
+ *
+ * Repair an otherwise good device mapper instance that was not cleanly removed.
+ */
+static int zoned_repair(struct zdm *znd)
+{
+	Z_INFO(znd, "Is Dirty .. zoned_repair consistency fixer TODO!!!.");
+	return -ENOMEM;
+}
+
+/**
+ * zoned_init_disk() - Init from existing or re-create DM Target (ZDM)
+ * @ti: DM Target Instance
+ * @znd: ZDM Instance
+ * @create: Create if not found.
+ * @force: Force create even if it looks like a ZDM was here.
+ *
+ * Locate the existing SB on disk and re-load or create the device-mapper
+ * instance based on the existing disk state.
+ */
+static int zoned_init_disk(struct dm_target *ti, struct zdm *znd,
+			   int create, int force)
+{
+	struct mz_superkey *key_blk = znd->z_sballoc;
+
+	int jinit = 1;
+	int n4kblks = 1;
+	int use_wq = 1;
+	int rc = 0;
+	u32 zdstart = znd->zdstart;
+
+	memset(key_blk, 0, sizeof(*key_blk));
+
+	znd->super_block = &key_blk->sblock;
+
+	znd->bmkeys = key_blk;
+	znd->bmkeys->sig0 = Z_KEY_SIG;
+	znd->bmkeys->sig1 = cpu_to_le64(Z_KEY_SIG);
+	znd->bmkeys->magic = cpu_to_le64(Z_TABLE_MAGIC);
+
+	_init_streams(znd);
+
+	znd->stale.binsz = STREAM_SIZE;
+	if (znd->zone_count < STREAM_SIZE)
+		znd->stale.binsz = znd->zone_count;
+	else if ((znd->zone_count / STREAM_SIZE) > STREAM_SIZE)
+		znd->stale.binsz = dm_div_up(znd->zone_count, STREAM_SIZE);
+	znd->stale.count = dm_div_up(znd->zone_count, znd->stale.binsz);
+
+	Z_ERR(znd, "Bin: Sz %u, count %u", znd->stale.binsz, znd->stale.count);
+
+	if (create && force) {
+		Z_ERR(znd, "Force Creating a clean instance.");
+	} else if (find_superblock(znd, use_wq, 1)) {
+		u64 sb_lba = 0;
+		u64 generation;
+
+		Z_INFO(znd, "Found existing superblock");
+		if (zdstart != znd->zdstart) {
+			if (force) {
+				Z_ERR(znd, "  (force) zdstart: %u <- %u",
+					zdstart, znd->zdstart);
+			} else {
+				znd->zdstart = zdstart;
+				jinit = 0;
+			}
+		}
+
+		generation = mcache_greatest_gen(znd, use_wq, &sb_lba, NULL);
+		Z_DBG(znd, "Generation: %" PRIu64 " @ %" PRIx64,
+			generation, sb_lba);
+
+		rc = read_block(znd, DM_IO_KMEM, key_blk, sb_lba,
+				n4kblks, use_wq);
+		if (rc) {
+			ti->error = "Superblock read error.";
+			return rc;
+		}
+	}
+
+	rc = sb_check(znd->super_block);
+	if (rc) {
+		jinit = 0;
+		if (create) {
+			DMWARN("Check failed .. creating superblock.");
+			zoned_create_disk(ti, znd);
+			znd->super_block->nr_zones =
+				cpu_to_le64(znd->data_zones);
+			DMWARN("in-memory superblock created.");
+			znd->is_empty = 1;
+		} else {
+			ti->error = "Superblock check failed.";
+			return rc;
+		}
+	}
+
+	if (sb_test_flag(znd->super_block, SB_DIRTY)) {
+		int repair_check = zoned_repair(znd);
+
+		if (!force) {
+			/* if repair failed -- don't load from disk */
+			if (repair_check)
+				jinit = 0;
+		} else if (repair_check && jinit) {
+			Z_ERR(znd, "repair failed, force enabled loading ...");
+		}
+	}
+
+	if (jinit) {
+		Z_ERR(znd, "INIT: Reloading DM Zoned metadata from DISK");
+		znd->zdstart = le32_to_cpu(znd->super_block->zdstart);
+		set_bit(DO_ZDM_RELOAD, &znd->flags);
+		queue_work(znd->meta_wq, &znd->meta_work);
+		Z_ERR(znd, "Waiting for load to complete.");
+		flush_workqueue(znd->meta_wq);
+	}
+
+	Z_ERR(znd, "ZONED: Build No %d marking superblock dirty.", BUILD_NO);
+
+	/* write the 'dirty' flag back to disk. */
+	sb_set_flag(znd->super_block, SB_DIRTY);
+	znd->super_block->csum = sb_crc32(znd->super_block);
+
+	return 0;
+}
+
+/**
+ * gc_lba_cmp() - Compare on tlba48 ignoring high 16 bits.
+ * @x1: Page of map cache
+ * @x2: Page of map cache
+ *
+ * Return: -1 if less than, 1 if greater than, 0 if equal.
+ */
+static int gc_lba_cmp(const void *x1, const void *x2)
+{
+	const struct map_cache_entry *r1 = x1;
+	const struct map_cache_entry *r2 = x2;
+	const u64 v1 = le64_to_lba48(r1->tlba, NULL);
+	const u64 v2 = le64_to_lba48(r2->tlba, NULL);
+
+	return (v1 < v2) ? -1 : ((v1 > v2) ? 1 : 0);
+}
+
+/**
+ * bsrch_n() - A binary search that understands the internal @mp layout
+ * @key: Key (tLBA) being sought
+ * @mp: Map pool to search
+ *
+ * Return: Index of the extent containing @key, or -1 if not found.
+ */
+static int bsrch_n(u64 key, struct map_pool *mp)
+{
+	struct map_cache_entry *mce;
+	u64 v2;
+	size_t lower = 0;
+	size_t upper = mp->count;
+	size_t idx;
+	u32 rng;
+
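+	/*
+	 * Each entry describes an extent [tlba, tlba + rng); bisect
+	 * until @key falls inside one of them.
+	 */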
+	while (lower < upper) {
+		idx = (lower + upper) / 2;
+		mce = mce_at(mp, idx);
+		if (!mce)
+			goto not_found;
+
+		v2 = le64_to_lba48(mce->tlba, NULL);
+		le64_to_lba48(mce->bval, &rng);
+
+		if (key < v2)
+			upper = idx;
+		else if (key >= (v2 + rng))
+			lower = idx + 1;
+		else
+			return idx;
+	}
+
+not_found:
+	return -1;
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+/**
+ * increment_used_blks() - Update the 'used' WP when data hits disk.
+ * @znd: ZDM Instance
+ * @lba: blba if bio completed.
+ * @blks: number of blocks of bio completed.
+ *
+ * Called from a BIO end_io function so it must not sleep or deadlock.
+ * The 'critical' piece here is ensuring that the wp is advanced to 0x10000.
+ * Secondary is triggering filled_zone, which ultimately sets
+ *   Z_WP_GC_READY in wp_alloc. While important, this flag could also be set
+ *   during other non-critical passes over wp_alloc and wp_used, such
+ *   as during update_stale_ratio().
+ */
+static void increment_used_blks(struct zdm *znd, u64 lba, u32 blks)
+{
+	u32 zone = _calc_zone(znd, lba);
+	u32 wwp = ((lba - znd->md_start) - (zone << Z_BLKBITS)) + blks;
+
+	if (lba < znd->md_end)
+		return;
+
+	if (zone < znd->zone_count) {
+		u32 gzno  = zone >> GZ_BITS;
+		u32 gzoff = zone & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 used, uflags;
+
+		used = le32_to_cpu(wpg->wp_used[gzoff]);
+		uflags = used & Z_WP_FLAGS_MASK;
+		used &= Z_WP_VALUE_MASK;
+
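+		/*
+		 * wp_used only ever advances; completions that land
+		 * behind it are ignored.
+		 */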
+		if (wwp > used) {
+			wpg->wp_used[gzoff] = cpu_to_le32(wwp | uflags);
+			/* signal zone closure */
+			if (wwp == Z_BLKSZ)
+				znd->filled_zone = zone;
+		}
+	}
+}
+
+/**
+ * _current_mapping() - Lookup a logical sector address to find the disk LBA
+ * @znd: ZDM instance
+ * @addr: Logical LBA of page.
+ * @trim: If the trim/discard cache should be consulted.
+ * @jrnl: If the write-back journal cache should be consulted.
+ * @gfp: Memory allocation rule
+ *
+ * Return: Disk LBA or 0 if not found.
+ */
+static u64 _current_mapping(struct zdm *znd, u64 addr,
+			    bool trim, bool jrnl, gfp_t gfp)
+{
+	u64 found = 0ul;
+
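+	/*
+	 * Metadata LBAs resolve via the write-back journal cache (or
+	 * map 1:1); data LBAs consult the trim cache, the ingress
+	 * cache, and finally the on-disk lookup table.
+	 */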
+	if (addr < znd->data_lba) {
+		if (jrnl)
+			found = z_lookup_journal_cache(znd, addr);
+		if (!found)
+			found = addr;
+		goto out;
+	}
+	if (!found && trim && z_lookup_trim_cache(znd, addr))
+		goto out;
+	if (!found) {
+		found = z_lookup_ingress_cache(znd, addr);
+		if (found == ~0ul) {
+			found = 0ul;
+			goto out;
+		}
+	}
+	if (!found)
+		found = z_lookup_table(znd, addr, gfp);
+out:
+	return found;
+}
+
+/**
+ * current_mapping() - Lookup a logical sector address to find the disk LBA
+ * @znd: ZDM instance
+ * @addr: Logical LBA of page.
+ * @gfp: Memory allocation rule
+ *
+ * NOTE: Discard cache is checked.
+ *
+ * Return: Disk LBA or 0 if not found.
+ */
+static u64 current_mapping(struct zdm *znd, u64 addr, gfp_t gfp)
+{
+	const bool trim = true;
+	const bool jrnl = true;
+
+	return _current_mapping(znd, addr, trim, jrnl, gfp);
+}
+
+static int _common_intersect(struct map_pool *mcache, u64 key, u32 range,
+			     struct map_cache_page *mc_pg);
+
+
+/**
+ * _lookup_table_range() - resolve a sector mapping via ZLT mapping
+ * @znd: ZDM Instance
+ * @addr: Address to resolve (via FWD map).
+ * @range: In/out number of contiguous map entries in the extent.
+ * @noio: Do not issue I/O to page in missing map entries.
+ * @gfp: Current allocation flags.
+ */
+static u64 _lookup_table_range(struct zdm *znd, u64 addr, u32 *range,
+			       int noio, gfp_t gfp)
+{
+	struct map_addr maddr;
+	struct map_pg *pg;
+	u64 tlba = 0;
+	const int ahead = (gfp == GFP_ATOMIC) ? 1 : znd->cache_reada;
+	const int async = 0;
+
+	map_addr_calc(znd, addr, &maddr);
+	pg = get_map_entry(znd, maddr.lut_s, ahead, async, noio, gfp);
+	if (pg) {
+		ref_pg(pg);
+		wait_for_map_pg(znd, pg, gfp);
+		if (pg->data.addr) {
+			unsigned long flags;
+			u64 tgt;
+			__le32 delta;
+			u32 limit = *range;
+			u32 num;
+			u32 idx;
+
+			spin_lock_irqsave(&pg->md_lock, flags);
+			delta = pg->data.addr[maddr.pg_idx];
+			tlba = map_value(znd, delta);
+			for (num = 0; num < limit; num++) {
+				idx = maddr.pg_idx + num;
+				if (idx >= 1024)
+					break;
+				tgt = map_value(znd, pg->data.addr[idx]);
+				if (tlba && tgt != (tlba + num))
+					break;
+				if (!tlba && tgt)
+					break;
+			}
+			*range = num;
+			spin_unlock_irqrestore(&pg->md_lock, flags);
+			pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+			clear_bit(IS_READA, &pg->flags);
+		} else {
+			Z_ERR(znd, "lookup page has no data. retry.");
+			*range = 0;
+			tlba = ~0ul;
+		}
+		deref_pg(pg);
+		put_map_entry(pg);
+	} else {
+		*range = 0;
+		tlba = ~0ul;
+	}
+	return tlba;
+}
+
+/**
+ * __map_rng() - Find the current bLBA for @addr
+ * @znd: ZDM instance
+ * @addr: tLBA to find
+ * @range: in/out number of entries in extent
+ * @trim: If trim/discard cache should be consulted.
+ * @noio: Do not issue I/O to page in missing map entries.
+ * @gfp: allocation mask
+ */
+static u64 __map_rng(struct zdm *znd, u64 addr, u32 *range, bool trim,
+		     int noio, gfp_t gfp)
+{
+	struct map_cache_page *wpg = NULL;
+	u64 found = 0ul;
+	unsigned long flags;
+	u32 blks = *range;
+
+	if (addr < znd->data_lba) {
+		*range = blks = 1;
+		found = z_lookup_journal_cache(znd, addr);
+		if (!found)
+			found = addr;
+		goto out;
+	}
+
+	if (!found) {
+		wpg = ZDM_ALLOC(znd, sizeof(*wpg), PG_09, gfp);
+		if (!wpg) {
+			Z_ERR(znd, "%s: Out of memory.", __func__);
+			goto out;
+		}
+	}
+
+	if (!found) {
+		struct map_cache_entry *maps = wpg->maps;
+		u64 iadr;
+		u64 itgt;
+		u32 num;
+		u32 offset;
+		int isects = 0;
+
+		if (trim) {
+			spin_lock_irqsave(&znd->trim_rwlck, flags);
+			isects = _common_intersect(znd->trim, addr, blks, wpg);
+			spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+		}
+		if (isects) {
+			iadr = le64_to_lba48(maps[ISCT_BASE].tlba, NULL);
+			itgt = le64_to_lba48(maps[ISCT_BASE].bval, &num);
+			if (addr == iadr) {
+				if (blks > num)
+					blks = num;
+				*range = blks;
+				goto out;
+			} else if (addr > iadr) {
+				num -= addr - iadr;
+				if (blks > num)
+					blks = num;
+				*range = blks;
+				goto out;
+			} else if (addr < iadr) {
+				blks -= iadr - addr;
+				*range = blks;
+			}
+		}
+
+		memset(wpg, 0, sizeof(*wpg));
+		spin_lock_irqsave(&znd->in_rwlck, flags);
+		isects = _common_intersect(znd->ingress, addr, blks, wpg);
+		spin_unlock_irqrestore(&znd->in_rwlck, flags);
+		if (isects) {
+			iadr = le64_to_lba48(maps[ISCT_BASE].tlba, NULL);
+			itgt = le64_to_lba48(maps[ISCT_BASE].bval, &num);
+			if (addr == iadr) {
+				if (blks > num)
+					blks = num;
+				*range = blks;
+				found = itgt;
+				goto out;
+			} else if (addr > iadr) {
+				offset = addr - iadr;
+				itgt += offset;
+				num -= offset;
+
+				if (blks > num)
+					blks = num;
+				*range = blks;
+				found = itgt;
+				goto out;
+			} else if (addr < iadr) {
+				blks -= iadr - addr;
+				*range = blks;
+			}
+		}
+	}
+	if (!found) {
+		found = _lookup_table_range(znd, addr, &blks, noio, gfp);
+		*range = blks;
+		if (!noio && found == ~0ul) {
+			Z_ERR(znd, "%s: invalid addr? %llx", __func__, addr);
+			dump_stack();
+		}
+	}
+out:
+	if (wpg)
+		ZDM_FREE(znd, wpg, Z_C4K, PG_09);
+
+	return found;
+}
+
+/**
+ * current_map_range() - Find the current bLBA for @addr
+ * @znd: ZDM instance
+ * @addr: tLBA to find
+ * @range: in/out number of entries in extent
+ * @gfp: allocation mask
+ */
+static u64 current_map_range(struct zdm *znd, u64 addr, u32 *range, gfp_t gfp)
+{
+	const bool trim = true;
+	const int noio = 0;
+	u64 lba;
+
+	if (*range == 1)
+		return current_mapping(znd, addr, gfp);
+
+	lba = __map_rng(znd, addr, range, trim, noio, GFP_ATOMIC);
+
+	return lba;
+}
+
+/**
+ * _backref_mcache() - Scan map pool @mcache for matching bLBA
+ * @znd: ZDM instance
+ * @mcache: Map pool to scan
+ * @blba: (bLBA) referenced by ingress/journal extent
+ */
+static u64 _backref_mcache(struct zdm *znd, struct map_pool *mcache, u64 blba)
+{
+	u64 addr = 0ul;
+	int idx;
+
+	for (idx = 0; idx < mcache->count; idx++) {
+		u32 count;
+		struct map_cache_entry *mce = mce_at(mcache, idx);
+		u64 bval = le64_to_lba48(mce->bval, &count);
+
+		if (count && bval <= blba && blba < (bval + count)) {
+			addr = le64_to_lba48(mce->tlba, NULL) + (blba - bval);
+			break;
+		}
+	}
+	return addr;
+}
+
+/**
+ * backref_cache() - Scan ingress and metadata journal for matching bLBA
+ * @znd: ZDM instance
+ * @blba: (bLBA) referenced by ingress/journal extent
+ */
+static u64 backref_cache(struct zdm *znd, u64 blba)
+{
+	u64 addr = 0ul;
+
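+	/*
+	 * bLBAs below md_start have no cached back-reference. The
+	 * metadata region resolves through the write-back journal;
+	 * data-region bLBAs fall through to the ingress cache.
+	 */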
+	if (blba < znd->md_start)
+		return addr;
+
+	addr = _backref_mcache(znd, znd->wbjrnl, blba);
+	if (addr)
+		return addr;
+
+	if (blba < znd->md_end)
+		return addr;
+
+	addr = _backref_mcache(znd, znd->ingress, blba);
+	return addr;
+}
+
+/**
+ * gc_sort_lba() - Sort an unsorted map cache page
+ * @znd: ZDM instance
+ * @postmap: GC map cache to sort.
+ *
+ * Sort the map cache entries if they are not already sorted.
+ */
+static void gc_sort_lba(struct zdm *znd, struct gc_map_cache *postmap)
+{
+	if (postmap->jcount > 1 && postmap->jsorted < postmap->jcount) {
+		struct map_cache_entry *base = &postmap->gc_mcd->maps[0];
+
+		sort(base, postmap->jcount, sizeof(*base), gc_lba_cmp, NULL);
+		postmap->jsorted = postmap->jcount;
+	}
+}
+
+/**
+ * z_lookup_journal_cache_nlck() - Scan wb journal entries for addr
+ * @znd: ZDM Instance
+ * @addr: Address [tLBA] to find.
+ */
+static u64 z_lookup_journal_cache_nlck(struct zdm *znd, u64 addr)
+{
+	u64 found = 0ul;
+	int at;
+
+	at = bsrch_n(addr, znd->wbjrnl);
+	if (at != -1) {
+		struct map_cache_entry *mce = mce_at(znd->wbjrnl, at);
+		u32 nelem;
+		u32 flags;
+		u64 tlba = le64_to_lba48(mce->tlba, &flags);
+		u64 bval = le64_to_lba48(mce->bval, &nelem);
+
+		if (bval)
+			bval += (addr - tlba);
+
+		found = bval;
+	}
+
+	return found;
+}
+
+/**
+ * z_lookup_journal_cache() - Find extent in metadata journal extents
+ * @znd: ZDM instance
+ * @addr: (tLBA) address to find
+ */
+static u64 z_lookup_journal_cache(struct zdm *znd, u64 addr)
+{
+	unsigned long flags;
+	u64 found;
+
+	spin_lock_irqsave(&znd->wbjrnl_rwlck, flags);
+	found = z_lookup_journal_cache_nlck(znd, addr);
+	spin_unlock_irqrestore(&znd->wbjrnl_rwlck, flags);
+
+	return found;
+}
+
+/**
+ * z_lookup_ingress_cache_nlck() - Scan mcache entries for addr
+ * @znd: ZDM Instance
+ * @addr: Address [tLBA] to find.
+ */
+static inline u64 z_lookup_ingress_cache_nlck(struct zdm *znd, u64 addr)
+{
+	u64 found = 0ul;
+	int at;
+
+	if (znd->ingress->count != znd->ingress->sorted)
+		Z_ERR(znd, " ** NOT SORTED");
+
+	at = bsrch_n(addr, znd->ingress);
+	if (at != -1) {
+		struct map_cache_entry *mce = mce_at(znd->ingress, at);
+		u32 nelem;
+		u32 flags;
+		u64 tlba = le64_to_lba48(mce->tlba, &flags);
+		u64 bval = le64_to_lba48(mce->bval, &nelem);
+
+		if (flags & MCE_NO_ENTRY)
+			goto done;
+
+		if (bval)
+			bval += (addr - tlba);
+
+		found = bval ? bval : ~0ul;
+	}
+
+done:
+	return found;
+}
+
+/**
+ * z_lookup_trim_cache_nlck() - Find extent in trim extents
+ * @znd: ZDM instance
+ * @addr: (tLBA) address to find
+ */
+static inline int z_lookup_trim_cache_nlck(struct zdm *znd, u64 addr)
+{
+	struct map_cache_entry *mce;
+	int found = 0;
+	u32 flags;
+	int at;
+
+	at = bsrch_n(addr, znd->trim);
+	if (at != -1) {
+		mce = mce_at(znd->trim, at);
+		(void)le64_to_lba48(mce->tlba, &flags);
+		found = (flags & MCE_NO_ENTRY) ? 0 : 1;
+	}
+
+	return found;
+}
+
+/**
+ * z_lookup_unused_cache_nlck() - Find extent in unused extents
+ * @znd: ZDM instance
+ * @addr: (tLBA) address to find
+ */
+static u64 z_lookup_unused_cache_nlck(struct zdm *znd, u64 addr)
+{
+	struct map_cache_entry *mce;
+	u64 found = 0ul;
+	u32 flags;
+	int at;
+
+	at = bsrch_n(addr, znd->unused);
+	if (at != -1) {
+		mce = mce_at(znd->unused, at);
+		found = le64_to_lba48(mce->tlba, &flags);
+		if (flags & MCE_NO_ENTRY)
+			found = 0ul;
+	}
+
+	return found;
+}
+
+/**
+ * z_lookup_ingress_cache() - Find extent in ingress extents
+ * @znd: ZDM instance
+ * @addr: (tLBA) address to find
+ */
+static u64 z_lookup_ingress_cache(struct zdm *znd, u64 addr)
+{
+	unsigned long flags;
+	u64 found;
+
+	spin_lock_irqsave(&znd->in_rwlck, flags);
+	found = z_lookup_ingress_cache_nlck(znd, addr);
+	spin_unlock_irqrestore(&znd->in_rwlck, flags);
+
+	return found;
+}
+
+/**
+ * z_lookup_trim_cache() - Find extent in trim/discard extents
+ * @znd: ZDM instance
+ * @addr: (tLBA) address to find
+ */
+static int z_lookup_trim_cache(struct zdm *znd, u64 addr)
+{
+	unsigned long flags;
+	int found;
+
+	spin_lock_irqsave(&znd->trim_rwlck, flags);
+	found = z_lookup_trim_cache_nlck(znd, addr);
+	spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
+	return found;
+}
+
+/**
+ * z_flush_bdev() - Request backing device flushed to disk.
+ * @znd: ZDM instance
+ * @gfp: Allocation flags for issuing the flush.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int z_flush_bdev(struct zdm *znd, gfp_t gfp)
+{
+	int err;
+	sector_t bi_done;
+
+	err = blkdev_issue_flush(znd->dev->bdev, gfp, &bi_done);
+	if (err)
+		Z_ERR(znd, "%s: flush failing sector %lu!", __func__, bi_done);
+
+	return err;
+}
+
+/**
+ * pg_delete() - Free a map_pg with spin locks held.
+ * @znd: ZDM Instance
+ * @expg: Page being released.
+ *
+ * Forced inline as it is 'optional' and because it is called with
+ * spin locks held and only from a single caller.
+ */
+static __always_inline int pg_delete(struct zdm *znd, struct map_pg *expg)
+{
+	int req_flush = 0;
+	int dropped = 0;
+	int is_lut = test_bit(IS_LUT, &expg->flags);
+	unsigned long flags;
+	unsigned long mflgs;
+	spinlock_t *lock = is_lut ? &znd->mapkey_lock : &znd->ct_lock;
+
+	if (test_bit(R_IN_FLIGHT, &expg->flags))
+		return req_flush;
+
+	if (!spin_trylock_irqsave(lock, flags))
+		return req_flush;
+
+	if (test_bit(IN_WB_JOURNAL, &expg->flags)) {
+		req_flush = !test_bit(IS_FLUSH, &expg->flags);
+		if (req_flush)
+			goto out;
+		if (test_bit(IS_CRC, &expg->flags))
+			goto out;
+		if (!test_bit(IS_DROPPED, &expg->flags))
+			goto out;
+
+		spin_lock_irqsave(&expg->md_lock, mflgs);
+		if (expg->data.addr) {
+			void *pg = expg->data.addr;
+
+			Z_DBG(znd, "** jrnl pg %"PRIx64" data dropped (%"
+				   PRIx64 ") %04x",
+			      expg->lba, expg->last_write, crc16_md(pg, Z_C4K));
+
+			expg->data.addr = NULL;
+			__smp_mb();
+			init_completion(&expg->event);
+			set_bit(IS_ALLOC, &expg->flags);
+			ZDM_FREE(znd, pg, Z_C4K, PG_27);
+			req_flush = !test_bit(IS_FLUSH, &expg->flags);
+			atomic_dec(&znd->incore);
+		}
+		spin_unlock_irqrestore(&expg->md_lock, mflgs);
+	} else if (test_bit(IS_DROPPED, &expg->flags)) {
+		unsigned long zflgs;
+
+		if (!spin_trylock_irqsave(&znd->zlt_lck, zflgs))
+			goto out;
+		del_htbl_entry(znd, expg);
+
+		spin_lock_irqsave(&expg->md_lock, mflgs);
+		if (test_and_clear_bit(IS_LAZY, &expg->flags)) {
+			clear_bit(IS_DROPPED, &expg->flags);
+			clear_bit(DELAY_ADD, &expg->flags);
+			if (expg->data.addr) {
+				void *pg = expg->data.addr;
+
+				expg->data.addr = NULL;
+				__smp_mb();
+				init_completion(&expg->event);
+				set_bit(IS_ALLOC, &expg->flags);
+				ZDM_FREE(znd, pg, Z_C4K, PG_27);
+				atomic_dec(&znd->incore);
+			}
+			list_del(&expg->lazy);
+			znd->in_lzy--;
+			req_flush = !test_bit(IS_FLUSH, &expg->flags);
+			dropped = 1;
+		} else {
+			Z_ERR(znd, "Detected double list del.");
+		}
+		if (dropped && expg->crc_pg) {
+			deref_pg(expg->crc_pg);
+			expg->crc_pg = NULL;
+		}
+		spin_unlock_irqrestore(&expg->md_lock, mflgs);
+		if (dropped)
+			ZDM_FREE(znd, expg, sizeof(*expg), KM_20);
+		spin_unlock_irqrestore(&znd->zlt_lck, zflgs);
+	}
+
+out:
+	spin_unlock_irqrestore(lock, flags);
+	return req_flush;
+}
+
+/**
+ * manage_lazy_activity() - Migrate delayed 'add' entries to the 'ZTL'
+ * @znd: ZDM Instance
+ *
+ * The lzy list is used to perform less critical activities that could
+ * be done via the ZTL primary list but gives a second chance when
+ *   - Adding, if taking the spin lock would have blocked.
+ *   - Deleting, if the cache entry turns out to be 'hotter' than
+ *     the default; we can catch it and make it 'hotter' before the
+ *     hotness indicator is lost.
+ */
+static int manage_lazy_activity(struct zdm *znd)
+{
+	struct map_pg *expg;
+	struct map_pg *_tpg;
+	int want_flush = 0;
+	const u32 msecs = znd->cache_ageout_ms;
+	struct map_pg **wset = NULL;
+	int entries = 0;
+	unsigned long flags;
+	LIST_HEAD(movelist);
+
+	wset = ZDM_CALLOC(znd, sizeof(*wset), MAX_WSET, KM_19, GFP_KERNEL);
+	if (!wset) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&znd->lzy_lck, flags);
+	expg = list_first_entry_or_null(&znd->lzy_pool, typeof(*expg), lazy);
+	if (!expg || (&expg->lazy == &znd->lzy_pool))
+		goto out;
+
+	_tpg = list_next_entry(expg, lazy);
+	while (&expg->lazy != &znd->lzy_pool) {
+		/*
+		 * this should never happen:
+		 */
+		if (test_bit(IS_DIRTY, &expg->flags))
+			set_bit(DELAY_ADD, &expg->flags);
+
+		if (test_bit(WB_RE_CACHE, &expg->flags)) {
+			if (!expg->data.addr) {
+				if (entries < MAX_WSET) {
+					ref_pg(expg);
+					wset[entries] = expg;
+					entries++;
+					clear_bit(WB_RE_CACHE, &expg->flags);
+				}
+			} else {
+				set_bit(IS_DIRTY, &expg->flags);
+				clear_bit(IS_FLUSH, &expg->flags);
+				clear_bit(WB_RE_CACHE, &expg->flags);
+			}
+		}
+
+		/*
+		 * Migrate pg to the zltlst list
+		 */
+		if (test_bit(DELAY_ADD, &expg->flags)) {
+			if (!test_bit(IN_ZLT, &expg->flags)) {
+				list_del(&expg->lazy);
+				znd->in_lzy--;
+				clear_bit(IS_LAZY, &expg->flags);
+				clear_bit(IS_DROPPED, &expg->flags);
+				clear_bit(DELAY_ADD, &expg->flags);
+				set_bit(IN_ZLT, &expg->flags);
+				list_add(&expg->zltlst, &movelist);
+				znd->in_zlt++;
+			} else {
+				Z_ERR(znd, "** ZLT double add? %"PRIx64,
+					expg->lba);
+			}
+		} else {
+			/*
+			 * Delete page
+			 */
+			if (!test_bit(IN_ZLT, &expg->flags) &&
+			    !test_bit(R_IN_FLIGHT, &expg->flags) &&
+			     getref_pg(expg) == 0 &&
+			     test_bit(IS_FLUSH, &expg->flags) &&
+			     test_bit(IS_DROPPED, &expg->flags) &&
+			     is_expired_msecs(expg->age, msecs))
+				want_flush |= pg_delete(znd, expg);
+		}
+		expg = _tpg;
+		_tpg = list_next_entry(expg, lazy);
+	}
+
+out:
+	spin_unlock_irqrestore(&znd->lzy_lck, flags);
+
+	if (entries > 0)
+		_pool_read(znd, wset, entries);
+
+	if (!list_empty(&movelist))
+		zlt_pool_splice(znd, &movelist);
+
+	if (wset)
+		ZDM_FREE(znd, wset, sizeof(*wset) * MAX_WSET, KM_19);
+
+	return want_flush;
+}
+
+/**
+ * pg_toggle_wb_journal() - Handle bouncing from WB JOURNAL <-> WB DIRECT
+ * @znd: ZDM Instance
+ * @expg: Page of map data being toggled.
+ *
+ */
+static inline void pg_toggle_wb_journal(struct zdm *znd, struct map_pg *expg)
+{
+	if (test_bit(IN_WB_JOURNAL, &expg->flags)) {
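+		/*
+		 * Journal-resident pages whose copy has aged more than
+		 * journal_age generations are flagged for re-cache.
+		 */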
+		short delta = (le64_to_cpu(znd->bmkeys->generation) & 0xfff) -
+			      expg->gen;
+
+		if (abs(delta) > znd->journal_age)
+			set_bit(WB_RE_CACHE, &expg->flags);
+	}
+}
+
+/**
+ * mark_clean_flush_zlt() - Mark all non-dirty ZLT blocks as 'FLUSH'
+ * @znd: ZDM instance
+ * @wb_toggle: Also age-check pages held in the write-back journal.
+ *
+ * After a FLUSH/FUA these blocks are on disk and redundant FLUSH
+ * can be skipped if the block is later ejected.
+ */
+static void mark_clean_flush_zlt(struct zdm *znd, bool wb_toggle)
+{
+	struct map_pg *expg = NULL;
+	struct map_pg *_tpg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->zlt_lck, flags);
+	znd->flush_age = jiffies_64;
+	if (list_empty(&znd->zltpool))
+		goto out;
+
+	expg = list_last_entry(&znd->zltpool, typeof(*expg), zltlst);
+	if (!expg || &expg->zltlst == (&znd->zltpool))
+		goto out;
+
+	_tpg = list_prev_entry(expg, zltlst);
+	while (&expg->zltlst != &znd->zltpool) {
+		ref_pg(expg);
+		if (wb_toggle && !test_bit(R_IN_FLIGHT, &expg->flags))
+			pg_toggle_wb_journal(znd, expg);
+		deref_pg(expg);
+		expg = _tpg;
+		_tpg = list_prev_entry(expg, zltlst);
+	}
+
+out:
+	spin_unlock_irqrestore(&znd->zlt_lck, flags);
+}
+
+/**
+ * mark_clean_flush_lzy() - Mark all non-dirty ZLT blocks as 'FLUSH'
+ * @znd: ZDM instance
+ * @wb_toggle: Also age-check pages held in the write-back journal.
+ *
+ * After a FLUSH/FUA these blocks are on disk and redundant FLUSH
+ * can be skipped if the block is later ejected.
+ */
+static void mark_clean_flush_lzy(struct zdm *znd, bool wb_toggle)
+{
+	struct map_pg *expg = NULL;
+	struct map_pg *_tpg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&znd->lzy_lck, flags);
+	expg = list_first_entry_or_null(&znd->lzy_pool, typeof(*expg), lazy);
+	if (!expg || (&expg->lazy == &znd->lzy_pool))
+		goto out;
+
+	_tpg = list_next_entry(expg, lazy);
+	while (&expg->lazy != &znd->lzy_pool) {
+		ref_pg(expg);
+		if (!test_bit(R_IN_FLIGHT, &expg->flags)) {
+			if (!test_bit(IS_DIRTY, &expg->flags))
+				set_bit(IS_FLUSH, &expg->flags);
+			if (wb_toggle)
+				pg_toggle_wb_journal(znd, expg);
+		}
+		deref_pg(expg);
+		expg = _tpg;
+		_tpg = list_next_entry(expg, lazy);
+	}
+
+out:
+	spin_unlock_irqrestore(&znd->lzy_lck, flags);
+}
+
+/**
+ * mark_clean_flush() - Mark all non-dirty ZLT/LZY blocks as 'FLUSH'
+ * @znd: ZDM instance
+ * @wb_toggle: Also age-check pages held in the write-back journal.
+ *
+ * After a FLUSH/FUA these blocks are on disk and redundant FLUSH
+ * can be skipped if the block is later ejected.
+ */
+static void mark_clean_flush(struct zdm *znd, bool wb_toggle)
+{
+	mark_clean_flush_zlt(znd, wb_toggle);
+	mark_clean_flush_lzy(znd, wb_toggle);
+}
+
+/**
+ * do_sync_metadata() - Write ZDM state to disk.
+ * @znd: ZDM instance
+ * @sync: Perform a full metadata sync.
+ * @drop: Number of in-core map pages to try to evict.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int do_sync_metadata(struct zdm *znd, int sync, int drop)
+{
+	int err = 0;
+
+	MutexLock(&znd->pool_mtx);
+	if (manage_lazy_activity(znd)) {
+		if (!test_bit(DO_FLUSH, &znd->flags))
+			Z_ERR(znd, "Extra flush [MD Cache too small]");
+		set_bit(DO_FLUSH, &znd->flags);
+	}
+	mutex_unlock(&znd->pool_mtx);
+
+	/* if drop is non-zero, DO_FLUSH may be set on return */
+	err = sync_mapped_pages(znd, sync, drop);
+	if (err) {
+		Z_ERR(znd, "Uh oh: sync_mapped_pages -> %d", err);
+		goto out;
+	}
+
+	/*
+	 * If we are lucky then this sync will get us to a 'clean'
+	 * state and the follow-on bdev flush is redundant and skipped.
+	 *
+	 * If not, we will suffer a performance stall because blocks
+	 * were ejected.
+	 *
+	 * TODO: On Sync/Flush/FUA we can mark all of our clean ZLT
+	 *       as flushed and we can bypass elevating the drop count
+	 *       to trigger a flush for such already flushed blocks.
+	 */
+	if (test_bit(DO_FLUSH, &znd->flags)) {
+		err = z_mapped_sync(znd);
+		if (err) {
+			Z_ERR(znd, "Uh oh. z_mapped_sync -> %d", err);
+			goto out;
+		}
+		md_handle_crcs(znd);
+	}
+
+	if (test_and_clear_bit(DO_FLUSH, &znd->flags)) {
+		err = z_flush_bdev(znd, GFP_KERNEL);
+		if (err) {
+			Z_ERR(znd, "Uh oh. flush_bdev failed. -> %d", err);
+			goto out;
+		}
+		mark_clean_flush(znd, false);
+	}
+
+out:
+	return err;
+}
+
+/**
+ * do_zdm_reload_from_disc() - Restore ZDM state from disk.
+ * @znd: ZDM instance
+ *
+ * Return: 0 on success or -errno value
+ */
+static int do_zdm_reload_from_disc(struct zdm *znd)
+{
+	int err = 0;
+
+	if (test_and_clear_bit(DO_ZDM_RELOAD, &znd->flags))
+		err = z_mapped_init(znd);
+
+	return err;
+}
+
+/**
+ * do_move_map_cache_to_table() - Migrate map cache entries to lookup tables
+ * @znd: ZDM instance
+ * @locked: mz_io_mutex is already held by the caller.
+ * @gfp: Current allocation flags.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int do_move_map_cache_to_table(struct zdm *znd, int locked, gfp_t gfp)
+{
+	int err = 0;
+
+	if (!locked && gfp != GFP_KERNEL)
+		return err;
+
+	if (test_and_clear_bit(DO_MAPCACHE_MOVE, &znd->flags) ||
+	    test_bit(DO_SYNC, &znd->flags)) {
+		if (!locked)
+			MutexLock(&znd->mz_io_mutex);
+		err = _cached_to_tables(znd, znd->zone_count, gfp);
+		if (!locked)
+			mutex_unlock(&znd->mz_io_mutex);
+	}
+
+	if (znd->ingress->count > MC_MOVE_SZ || znd->unused->count > MC_MOVE_SZ)
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+
+	if (znd->trim->count > MC_HIGH_WM)
+		unmap_deref_chunk(znd, MC_HIGH_WM, 0, gfp);
+
+	return err;
+}
+
+/**
+ * do_sync_to_disk() - Write ZDM state to disk.
+ * @znd: ZDM instance
+ *
+ * Return: 0 on success or -errno value
+ */
+static int do_sync_to_disk(struct zdm *znd)
+{
+	int err = 0;
+	int drop = 0;
+	int sync = 0;
+
+	if (test_and_clear_bit(DO_SYNC, &znd->flags))
+		sync = 1;
+	else if (test_bit(DO_FLUSH, &znd->flags))
+		sync = 1;
+
+	if (sync || test_and_clear_bit(DO_MEMPOOL, &znd->flags)) {
+		int pool_size = znd->cache_size >> 1;
+
+		/**
+		 * Trust our cache miss algo
+		 */
+		if (is_expired_msecs(znd->age, znd->cache_ageout_ms * 2))
+			pool_size = 3;
+		else if (is_expired_msecs(znd->age, znd->cache_ageout_ms))
+			pool_size = (znd->cache_size >> 3);
+
+		if (atomic_read(&znd->incore) > pool_size)
+			drop = atomic_read(&znd->incore) - pool_size;
+	}
+	if (sync || drop)
+		err = do_sync_metadata(znd, sync, drop);
+
+	return err;
+}
+
+/**
+ * meta_work_task() - Worker for metadata activity.
+ * @work: Work struct containing ZDM instance.
+ */
+static void meta_work_task(struct work_struct *work)
+{
+	int err = 0;
+	int locked = 0;
+	struct zdm *znd;
+
+	if (!work)
+		return;
+
+	znd = container_of(work, struct zdm, meta_work);
+	if (!znd)
+		return;
+
+	err = do_zdm_reload_from_disc(znd);
+
+	if (test_bit(DO_MAPCACHE_MOVE, &znd->flags) ||
+	    test_bit(DO_FLUSH, &znd->flags) ||
+	    test_bit(DO_SYNC, &znd->flags)) {
+		MutexLock(&znd->mz_io_mutex);
+		locked = 1;
+	}
+
+	/*
+	 * Reduce memory pressure on map cache list of arrays
+	 * by pushing them into the sector map lookup tables
+	 */
+	if (!err) {
+		err = do_move_map_cache_to_table(znd, locked, GFP_KERNEL);
+		if (err == -EAGAIN || err == -EBUSY) {
+			err = 0;
+			set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		}
+	}
+
+	/* force a consistent set of meta data out to disk */
+	if (!err)
+		err = do_sync_to_disk(znd);
+
+	if (locked)
+		mutex_unlock(&znd->mz_io_mutex);
+
+	znd->age = jiffies_64;
+	if (err < 0)
+		znd->meta_result = err;
+
+	clear_bit(DO_METAWORK_QD, &znd->flags);
+}
+
+/**
+ * next_generation() - Increment generation number for superblock.
+ * @znd: ZDM instance
+ */
+static inline u64 next_generation(struct zdm *znd)
+{
+	u64 generation = le64_to_cpu(znd->bmkeys->generation);
+
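+	/* Generation 0 is reserved; skip it on first use and on wrap. */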
+	if (generation == 0)
+		generation = 2;
+
+	generation++;
+	if (generation == 0)
+		generation++;
+
+	return generation;
+}
+
+/**
+ * next_bio() - Allocate a new bio and chain it to existing bio.
+ * @znd: ZDM instance
+ * @bio: bio to chain and submit
+ * @gfp: memory allocation mask
+ * @nr_pages: number of pages (segments) for new bio
+ * @bs: Bio set used for allocation
+ */
+static struct bio *next_bio(struct zdm *znd, struct bio *bio, gfp_t gfp,
+			    unsigned int nr_pages, struct bio_set *bs)
+{
+	struct bio *new = bio_alloc_bioset(gfp, nr_pages, bs);
+#if ENABLE_SEC_METADATA
+	sector_t sector;
+#endif
+
+	if (bio && new) {
+		bio_chain(bio, new);
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+			sector = bio->bi_iter.bi_sector;
+			bio->bi_bdev = znd_get_backing_dev(znd, &sector);
+			bio->bi_iter.bi_sector = sector;
+		}
+#endif
+		submit_bio(bio);
+	}
+
+	return new;
+}
+
+/**
+ * bio_add_km() - Add kmalloc'd memory to a bio
+ * @bio: bio to add the pages to
+ * @kmem: kmalloc'd memory
+ * @pgs: number of pages to add
+ */
+static unsigned int bio_add_km(struct bio *bio, void *kmem, int pgs)
+{
+	unsigned int added = 0;
+	unsigned int len = pgs << PAGE_SHIFT;
+	struct page *pg;
+	unsigned long addr = (unsigned long)kmem;
+
+	if (addr && bio) {
+		pg = virt_to_page((void *)addr);
+		if (pg) {
+			added = bio_add_page(bio, pg, len, 0);
+			if (added != len)
+				pr_err("Failed to add %u to bio\n", len);
+		} else {
+			pr_err("Invalid pg?\n");
+		}
+	} else {
+		pr_err("Invalid addr / bio?\n");
+	}
+	return added;
+}
+
+/**
+ * bio_add_wp() - Add a WP/ZF page pair to a bio
+ * @bio: bio to add WP block to
+ * @znd: ZDM instance
+ * @idx: index of page set to add
+ */
+static int bio_add_wp(struct bio *bio, struct zdm *znd, int idx)
+{
+	int len;
+	struct meta_pg *wpg = &znd->wp[idx];
+
+	znd->bmkeys->wp_crc[idx] = crc_md_le16(wpg->wp_alloc, Z_CRC_4K);
+	len = bio_add_km(bio, wpg->wp_alloc, 1);
+	znd->bmkeys->zf_crc[idx] = crc_md_le16(wpg->zf_est, Z_CRC_4K);
+	len += bio_add_km(bio, wpg->zf_est, 1);
+
+	return len;
+}
+
+/**
+ * is_key_page() - Probe block for magic and crc to see if it is recognized.
+ * @_data: ZDM instance
+ */
+static inline int is_key_page(void *_data)
+{
+	int is_key = 0;
+	struct mz_superkey *data = _data;
+
+	/* Starts with Z_KEY_SIG and ends with magic */
+
+	if (le64_to_cpu(data->sig1) == Z_KEY_SIG &&
+	    le64_to_cpu(data->magic) == Z_TABLE_MAGIC) {
+		__le32 orig = data->crc32;
+		__le32 crc_check;
+
+		data->crc32 = 0;
+		crc_check = cpu_to_le32(crc32c(~0u, data, Z_CRC_4K));
+		data->crc32 = orig;
+		if (crc_check == orig)
+			is_key = 1;
+	}
+	return is_key;
+}
+
+
+struct on_sync {
+	u64 lba;
+	struct map_cache_page **pgs;
+	int cpg;
+	int npgs;
+	int cached;
+	int off;
+	int discards;
+	int maps;
+	int unused;
+	int wbjrnl;
+	int n_writes;
+	int n_blocks;
+};
+
+/**
+ * _fill_mcp() - Fill a map cache page from a map pool
+ * @mcp: Page of map data
+ * @mp: Map pool
+ * @from: Starting index of @mp to copy to @mcp
+ */
+static int _fill_mcp(struct map_cache_page *mcp, struct map_pool *mp, int from)
+{
+	int count = 0;
+	int iter;
+
+	memset(mcp->maps, 0, sizeof(mcp->maps));
+	for (count = 0, iter = from;
+	     iter < mp->count && count < ARRAY_SIZE(mcp->maps);
+	     count++, iter++)
+		memcpy(&mcp->maps[count], mce_at(mp, iter), sizeof(*mcp->maps));
+
+	return count;
+}
+
+#if ENABLE_SEC_METADATA
+/**
+ * znd_bio_copy() - "Clone" or mirror bio / segments
+ * @znd: ZDM instance
+ * @bio: source bio
+ * @clone: mirror bio for secondary target
+ */
+static void znd_bio_copy(struct zdm *znd, struct bio *bio, struct bio *clone)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+
+	clone->bi_iter.bi_sector = bio->bi_iter.bi_sector;
+	clone->bi_bdev = znd->meta_dev->bdev;
+	clone->bi_opf = bio->bi_opf;
+	clone->bi_iter.bi_size = bio->bi_iter.bi_size;
+	clone->bi_vcnt = 0;
+
+	bio_for_each_segment(bv, bio, iter)
+		clone->bi_io_vec[clone->bi_vcnt++] = bv;
+}
+#endif
+
+/**
+ * add_mpool() - Add a set of map_pool pages to a bio
+ * @znd: ZDM instance
+ * @bio: bio of primary target
+ * @mpool: Map pool
+ * @osc: On-sync accounting state.
+ * @clone: mirror bio of secondary target
+ * @gfp: allocation mask
+ */
+static struct bio *add_mpool(struct zdm *znd, struct bio *bio,
+			     struct map_pool *mpool, struct on_sync *osc,
+			     struct bio *clone, gfp_t gfp)
+{
+	int idx = 0;
+
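+	/*
+	 * Pack map pool entries into 4k cache pages, chaining a fresh
+	 * bio whenever the current one reaches BIO_MAX_PAGES.
+	 */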
+	while (idx < mpool->count) {
+		struct map_cache_page *pg;
+		int count;
+
+		pg = ZDM_ALLOC(znd, sizeof(*pg), PG_09, gfp);
+		if (!pg) {
+			Z_ERR(znd, "%s: alloc pool page.", __func__);
+			goto out;
+		}
+
+		count = _fill_mcp(pg, mpool, idx);
+		pg->header.bval = lba48_to_le64(count, 0);
+		pg->header.tlba = 0ul;
+		znd->bmkeys->crcs[idx+osc->off] = crc_md_le16(pg, Z_CRC_4K);
+
+		bio_add_km(bio, pg, 1);
+
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE)
+			znd_bio_copy(znd, bio, clone);
+#endif
+		if (osc->cpg < osc->npgs) {
+			osc->pgs[osc->cpg] = pg;
+			osc->cpg++;
+		}
+		osc->cached++;
+		osc->lba++;
+
+		switch (mpool->isa) {
+		case IS_TRIM:
+			osc->discards++;
+			break;
+		case IS_INGRESS:
+			osc->maps++;
+			break;
+		case IS_UNUSED:
+			osc->unused++;
+			break;
+		case IS_WBJRNL:
+			osc->wbjrnl++;
+			break;
+		default:
+			break;
+		}
+
+		if (osc->cached == BIO_MAX_PAGES) {
+			bio = next_bio(znd, bio, gfp, BIO_MAX_PAGES,
+					znd->bio_set);
+			if (!bio) {
+				Z_ERR(znd, "%s: alloc bio.", __func__);
+				goto out;
+			}
+
+			bio->bi_iter.bi_sector = osc->lba << Z_SHFT4K;
+			bio->bi_bdev = znd->dev->bdev;
+			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			bio->bi_iter.bi_size = 0;
+#if ENABLE_SEC_METADATA
+			if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE) {
+				clone = next_bio(znd, clone, gfp, BIO_MAX_PAGES,
+						znd->bio_set);
+				znd_bio_copy(znd, bio, clone);
+			}
+#endif
+			osc->n_writes++;
+			osc->n_blocks += osc->cached;
+			osc->cached = 0;
+		}
+
+		idx += count;
+		if (count < Z_MAP_MAX)
+			break;
+
+	}
+	osc->off += idx;
+
+out:
+	return bio;
+}
+
+
+/**
+ * z_mapped_sync() - Write cache entries and bump superblock generation.
+ * @znd: ZDM instance
+ */
+static int z_mapped_sync(struct zdm *znd)
+{
+	struct on_sync osc;
+	struct bio *bio = NULL;
+	struct bio *clone = NULL;
+	u64 generation = next_generation(znd);
+	u64 modulo = CACHE_COPIES;
+	u64 incr = MAX_SB_INCR_SZ;
+#if ENABLE_SEC_METADATA
+	sector_t sector;
+#endif
+	int rc = 1;
+	int idx = 0;
+	int more_data = 1;
+	const gfp_t gfp = GFP_KERNEL;
+
+	memset(&osc, 0, sizeof(osc));
+	osc.npgs = SYNC_MAX;
+	osc.pgs = ZDM_CALLOC(znd, sizeof(*osc.pgs), osc.npgs, KM_18, gfp);
+	if (!osc.pgs) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		return -ENOMEM;
+	}
+
+	/* write dirty WP/ZF_EST blocks */
+	osc.lba = WP_ZF_BASE; /* have 3 copies as well */
+	for (idx = 0; idx < znd->gz_count; idx++) {
+		struct meta_pg *wpg = &znd->wp[idx];
+
+		if (test_bit(IS_DIRTY, &wpg->flags)) {
+			if (bio)
+				osc.n_writes++, osc.n_blocks += 2;
+
+			bio = next_bio(znd, bio, gfp, 2, znd->bio_set);
+			if (!bio) {
+				Z_ERR(znd, "%s: alloc bio.", __func__);
+				rc = -ENOMEM;
+				goto out;
+			}
+			bio->bi_iter.bi_sector = osc.lba << Z_SHFT4K;
+			bio->bi_bdev = znd->dev->bdev;
+			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			bio->bi_iter.bi_size = 0;
+			if (!bio_add_wp(bio, znd, idx)) {
+				rc = -EIO;
+				goto out;
+			}
+#if ENABLE_SEC_METADATA
+			if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE) {
+				clone = next_bio(znd, clone, gfp, 2,
+								znd->bio_set);
+				if (!clone) {
+					Z_ERR(znd, "%s: alloc bio.", __func__);
+					rc = -ENOMEM;
+					bio_put(bio);
+					goto out;
+				}
+				znd_bio_copy(znd, bio, clone);
+			}
+#endif
+			/* FIXME? only clear on 3rd copy */
+			clear_bit(IS_DIRTY, &wpg->flags);
+
+			Z_DBG(znd, "%d# -- WP: %04x | ZF: %04x",
+			      idx, znd->bmkeys->wp_crc[idx],
+				   znd->bmkeys->zf_crc[idx]);
+		}
+		osc.lba += 2;
+	}
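+	/*
+	 * Rotate the cache/superblock copy across CACHE_COPIES
+	 * locations, selected by generation number.
+	 */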
+	osc.lba = (generation % modulo) * incr;
+	if (osc.lba == 0)
+		osc.lba++;
+	if (bio)
+		osc.n_writes++, osc.n_blocks += 2;
+
+	bio = next_bio(znd, bio, gfp, BIO_MAX_PAGES, znd->bio_set);
+	if (!bio) {
+		Z_ERR(znd, "%s: alloc bio.", __func__);
+		rc = -ENOMEM;
+		goto out;
+	}
+	bio->bi_iter.bi_sector = osc.lba << Z_SHFT4K;
+	bio->bi_bdev = znd->dev->bdev;
+	bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+	bio->bi_iter.bi_size = 0;
+
+	osc.cached = 0;
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE) {
+		clone = next_bio(znd, clone, gfp, BIO_MAX_PAGES, znd->bio_set);
+		if (!clone) {
+			Z_ERR(znd, "%s: alloc bio.", __func__);
+			rc = -ENOMEM;
+			bio_put(bio);
+			goto out;
+		}
+		znd_bio_copy(znd, bio, clone);
+	}
+#endif
+	znd->bmkeys->generation = cpu_to_le64(generation);
+	znd->bmkeys->gc_resv = cpu_to_le32(znd->z_gc_resv);
+	znd->bmkeys->meta_resv = cpu_to_le32(znd->z_meta_resv);
+	osc.lba += osc.cached;
+
+	/* for znd->ingress, znd->trim, znd->unused, znd->wbjrnl allocate
+	 * pages and copy data ....
+	 */
+	bio = add_mpool(znd, bio, znd->ingress, &osc, clone, gfp);
+	bio = add_mpool(znd, bio, znd->trim, &osc, clone, gfp);
+	bio = add_mpool(znd, bio, znd->unused, &osc, clone, gfp);
+	bio = add_mpool(znd, bio, znd->wbjrnl, &osc, clone, gfp);
+
+	znd->bmkeys->md_crc = crc_md_le16(znd->md_crcs, Z_CRC_4K << 1);
+	znd->bmkeys->discards = cpu_to_le16(osc.discards);
+	znd->bmkeys->maps = cpu_to_le16(osc.maps);
+	znd->bmkeys->unused = cpu_to_le16(osc.unused);
+	znd->bmkeys->wbjrnld = cpu_to_le16(osc.wbjrnl);
+	znd->bmkeys->n_crcs = cpu_to_le16(osc.maps + osc.discards +
+					  osc.unused + osc.wbjrnl);
+	znd->bmkeys->crc32 = 0;
+	znd->bmkeys->crc32 = cpu_to_le32(crc32c(~0u, znd->bmkeys, Z_CRC_4K));
+	if (osc.cached < (BIO_MAX_PAGES - 3)) {
+		bio_add_km(bio, znd->z_sballoc, 1);
+		bio_add_km(bio, znd->md_crcs, 2);
+		more_data = 0;
+	}
+
+#if ENABLE_SEC_METADATA
+	if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE)
+		znd_bio_copy(znd, bio, clone);
+#endif
+	if (unlikely(more_data)) {
+		bio = next_bio(znd, bio, gfp, 3, znd->bio_set);
+		bio->bi_iter.bi_sector = osc.lba << Z_SHFT4K;
+		bio->bi_bdev = znd->dev->bdev;
+		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+		bio->bi_iter.bi_size = 0;
+
+		osc.n_writes++, osc.n_blocks += osc.cached;
+		osc.cached = 0;
+
+		bio_add_km(bio, znd->z_sballoc, 1);
+		bio_add_km(bio, znd->md_crcs, 2);
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_BOTH_DEVICE) {
+			clone = next_bio(znd, clone, gfp, 3, znd->bio_set);
+			znd_bio_copy(znd, bio, clone);
+		}
+#endif
+	}
+	osc.cached += 3;
+
+	if (bio) {
+		bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH_FUA);
+#if ENABLE_SEC_METADATA
+		if (znd->meta_dst_flag == DST_TO_SEC_DEVICE) {
+			sector = bio->bi_iter.bi_sector;
+			bio->bi_bdev = znd_get_backing_dev(znd, &sector);
+			bio->bi_iter.bi_sector = sector;
+		}
+		if (clone && znd->meta_dst_flag == DST_TO_BOTH_DEVICE) {
+			znd_bio_copy(znd, bio, clone);
+			rc = submit_bio_wait(clone);
+			bio_put(clone);
+		}
+#endif
+		rc = submit_bio_wait(bio);
+		osc.n_writes++, osc.n_blocks += osc.cached;
+		bio_put(bio);
+		clear_bit(DO_FLUSH, &znd->flags);
+		mark_clean_flush(znd, true);
+	}
+
+	for (idx = 0; idx < osc.cpg; idx++)
+		ZDM_FREE(znd, osc.pgs[idx], Z_C4K, PG_09);
+
+	Z_DBG(znd, "Sync/Flush: %d writes / %d blocks written [gen %"PRIx64"]",
+		osc.n_writes, osc.n_blocks, generation);
+out:
+	if (osc.pgs)
+		ZDM_FREE(znd, osc.pgs, sizeof(*osc.pgs) * osc.npgs, KM_18);
+
+	return rc;
+}
+
+/**
+ * zoned_personality() - Update zdstart value from superblock
+ * @znd: ZDM instance
+ * @sblock: ZDM superblock.
+ */
+static inline
+void zoned_personality(struct zdm *znd, struct zdm_superblock *sblock)
+{
+	znd->zdstart = le32_to_cpu(sblock->zdstart);
+}
+
+/**
+ * find_superblock_at() - Find superblock following lba
+ * @znd: ZDM instance
+ * @lba: Lba to start scanning for superblock.
+ * @use_wq: If a work queue is needed for scanning.
+ * @do_init: Set zdstart from found superblock.
+ */
+static
+int find_superblock_at(struct zdm *znd, u64 lba, int use_wq, int do_init)
+{
+	int found = 0;
+	int nblks = 1;
+	int rc = -ENOMEM;
+	u32 count = 0;
+	u64 *data = ZDM_ALLOC(znd, Z_C4K, PG_10, GFP_KERNEL);
+
+	if (!data) {
+		Z_ERR(znd, "No memory for finding superblock ..");
+		return 0;
+	}
+	if (lba == 0)
+		lba++;
+
+	lba += WB_JRNL_BLKS;
+	do {
+		rc = read_block(znd, DM_IO_KMEM, data, lba, nblks, use_wq);
+		if (rc) {
+			Z_ERR(znd, "%s: read @%" PRIu64 " [%d blks] %p -> %d",
+			      __func__, lba, nblks, data, rc);
+			goto out;
+		}
+		if (is_key_page(data)) {
+			struct mz_superkey *kblk = (struct mz_superkey *) data;
+			struct zdm_superblock *sblock = &kblk->sblock;
+			int err = sb_check(sblock);
+
+			if (!err) {
+				found = 1;
+				if (do_init)
+					zoned_personality(znd, sblock);
+			}
+			goto out;
+		}
+		if (data[0] == 0 && data[1] == 0) {
+			/* No SB here. */
+			Z_ERR(znd, "FGen: Invalid block %" PRIx64 "?", lba);
+			if (count > 16)
+				goto out;
+		}
+		lba++;
+		count++;
+		if (count > MAX_CACHE_SYNC) {
+			Z_ERR(znd, "FSB: Too deep to be useful.");
+			goto out;
+		}
+	} while (!found);
+
+out:
+	ZDM_FREE(znd, data, Z_C4K, PG_10);
+	return found;
+}
+
+/**
+ * find_superblock() - Find (any) superblock
+ * @znd: ZDM instance
+ * @use_wq: If a work queue is needed for scanning.
+ * @do_init: Set/Retrieve zdstart from found super block.
+ */
+static int find_superblock(struct zdm *znd, int use_wq, int do_init)
+{
+	int found = 0;
+	int iter = 0;
+	u64 lba = LBA_SB_START;
+	u64 last = MAX_SB_INCR_SZ * CACHE_COPIES;
+
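+	/* Probe each of the CACHE_COPIES candidate locations in turn. */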
+	do {
+		found = find_superblock_at(znd, lba, use_wq, do_init);
+		if (found)
+			break;
+		iter++;
+		lba = MAX_SB_INCR_SZ * iter;
+	} while (lba < last);
+
+	return found;
+}
+
+/**
+ * mcache_find_gen() - Find the super block following lba and get gen#
+ * @znd: ZDM instance
+ * @lba: LBA to start scanning for the super block.
+ * @use_wq: If a work queue is needed for scanning.
+ * @sb_lba: LBA where the super block was found.
+ */
+static u64 mcache_find_gen(struct zdm *znd, u64 lba, int use_wq, u64 *sb_lba)
+{
+	u64 generation = 0;
+	int nblks = 1;
+	int rc = 1;
+	int done = 0;
+	u32 count = 0;
+	u64 *data = ZDM_ALLOC(znd, Z_C4K, PG_11, GFP_KERNEL);
+
+	if (!data) {
+		Z_ERR(znd, "No memory for finding generation ..");
+		return 0;
+	}
+	do {
+		rc = read_block(znd, DM_IO_KMEM, data, lba, nblks, use_wq);
+		if (rc) {
+			Z_ERR(znd, "%s: mcache-> %" PRIu64
+				    " [%d blks] %p -> %d",
+			      __func__, lba, nblks, data, rc);
+			goto out;
+		}
+		if (is_key_page(data)) {
+			struct mz_superkey *kblk = (struct mz_superkey *) data;
+
+			generation = le64_to_cpu(kblk->generation);
+			done = 1;
+			if (sb_lba)
+				*sb_lba = lba;
+			goto out;
+		}
+		lba++;
+		count++;
+		if (count > MAX_CACHE_SYNC) {
+			Z_ERR(znd, "FGen: Too deep to be useful.");
+			goto out;
+		}
+	} while (!done);
+
+out:
+	ZDM_FREE(znd, data, Z_C4K, PG_11);
+	return generation;
+}
+
+/**
+ * cmp_gen() - compare two u64 numbers considering rollover
+ * @left: a u64
+ * @right: a u64
+ * Return: 0 if equal, 1 if @right is considered newer, -1 otherwise.
+ */
+static inline int cmp_gen(u64 left, u64 right)
+{
+	int result = 0;
+
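+	/* A @left of BAD_ADDR with a wrapped delta compares as oldest. */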
+	if (left != right) {
+		u64 delta = (left > right) ? left - right : right - left;
+
+		result = (left > right) ? -1 : 1;
+		if (delta > 0xFFFFFFFF) {
+			if (left == BAD_ADDR)
+				result = 1;
+		} else {
+			if (right > left)
+				result = 1;
+		}
+	}
+
+	return result;
+}
+
+/**
+ * mcache_greatest_gen() - Pick the lba where the super block should start.
+ * @znd: ZDM instance
+ * @use_wq: If a workqueue is needed for IO.
+ * @sb: LBA of super block itself.
+ * @s_lba: LBA where sync data starts (in front of the super block).
+ */
+static u64 mcache_greatest_gen(struct zdm *znd, int use_wq, u64 *sb, u64 *s_lba)
+{
+	u64 lba = LBA_SB_START;
+	u64 gen_no[CACHE_COPIES] = { 0ul, 0ul, 0ul };
+	u64 gen_lba[CACHE_COPIES] = { 0ul, 0ul, 0ul };
+	u64 gen_sb[CACHE_COPIES] = { 0ul, 0ul, 0ul };
+	u64 incr = MAX_SB_INCR_SZ;
+	int locations = ARRAY_SIZE(gen_lba);
+	int pick = 0;
+	int idx;
+
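+	/*
+	 * Probe each copy location for its generation and keep the
+	 * newest one (rollover aware via cmp_gen).
+	 */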
+	for (idx = 0; idx < locations; idx++) {
+		u64 *pAt = &gen_sb[idx];
+
+		gen_lba[idx] = lba;
+		gen_no[idx] = mcache_find_gen(znd, lba, use_wq, pAt);
+		if (gen_no[idx])
+			pick = idx;
+		lba = (idx + 1) * incr;
+	}
+
+	for (idx = 0; idx < locations; idx++) {
+		if (cmp_gen(gen_no[pick], gen_no[idx]) > 0)
+			pick = idx;
+	}
+
+	if (gen_no[pick]) {
+		if (s_lba)
+			*s_lba = gen_lba[pick];
+		if (sb)
+			*sb = gen_sb[pick];
+	}
+
+	return gen_no[pick];
+}
+
+/**
+ * count_stale_blocks() - Number of stale blocks covered by meta_pg.
+ * @znd: ZDM instance
+ * @gzno: Meta page # to scan.
+ * @wpg: Meta page to scan.
+ */
+static u64 count_stale_blocks(struct zdm *znd, u32 gzno, struct meta_pg *wpg)
+{
+	u32 gzcount = 1 << GZ_BITS;
+	u32 iter;
+	u64 stale = 0;
+
+	if ((gzno << GZ_BITS) > znd->zone_count)
+		gzcount = znd->zone_count & GZ_MMSK;
+
+	/* mark as empty */
+	for (iter = 0; iter < gzcount; iter++) {
+		u32 wp = le32_to_cpu(wpg->wp_alloc[iter]) & Z_WP_VALUE_MASK;
+		u32 nf = le32_to_cpu(wpg->zf_est[iter]) & Z_WP_VALUE_MASK;
+
+		if (wp > (Z_BLKSZ - nf))
+			stale += (wp - (Z_BLKSZ - nf));
+	}
+
+	return stale;
+}
+
+/**
+ * do_load_cache() - Read a series of map cache blocks to restore from disk.
+ * @znd: ZDM Instance
+ * @type: Map cache list (IS_INGRESS, IS_TRIM, IS_UNUSED or IS_WBJRNL)
+ * @lba: Starting LBA for reading
+ * @idx: Index of the saved/expected CRC for this block.
+ * @wq: True when I/O needs to use worker thread.
+ *
+ * FIXME ... read into znd->ingress and znd->trim and ...
+ */
+static int do_load_cache(struct zdm *znd, int type, u64 lba, int idx, int wq)
+{
+	unsigned long flags;
+	struct map_cache_page *work_pg = NULL;
+	int rc = -ENOMEM;
+	const gfp_t gfp = GFP_KERNEL;
+	u32 count;
+	__le16 crc;
+	int blks = 1;
+
+	work_pg = ZDM_ALLOC(znd, sizeof(*work_pg), PG_09, gfp);
+	if (!work_pg) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		goto out;
+	}
+
+	rc = read_block(znd, DM_IO_KMEM, work_pg, lba, blks, wq);
+	if (rc) {
+		Z_ERR(znd, "%s: pg -> %" PRIu64
+			   " [%d blks] %p -> %d",
+		      __func__, lba, blks, work_pg, rc);
+		goto out;
+	}
+	crc = crc_md_le16(work_pg, Z_CRC_4K);
+	if (crc != znd->bmkeys->crcs[idx]) {
+		rc = -EIO;
+		Z_ERR(znd, "%s: bad crc %" PRIu64,  __func__, lba);
+		goto out;
+	}
+	(void)le64_to_lba48(work_pg->header.bval, &count);
+
+	switch (type) {
+	case IS_INGRESS:
+		mp_grow(znd->ingress, znd->in, gfp);
+		spin_lock_irqsave(&znd->in_rwlck, flags);
+		if (count) {
+			struct map_cache_entry *maps = work_pg->maps;
+			struct map_pool *mp;
+
+			mp = mp_pick(znd->ingress, znd->in);
+			rc = do_sort_merge(mp, znd->ingress, maps, count, 0);
+			znd->ingress = mp;
+			__smp_mb();
+			if (rc)
+				Z_ERR(znd, "SortMerge failed: %d [%d]", rc,
+				      __LINE__);
+			if (znd->ingress->count > MC_MOVE_SZ)
+				set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		}
+		spin_unlock_irqrestore(&znd->in_rwlck, flags);
+		break;
+	case IS_WBJRNL:
+		mp_grow(znd->wbjrnl, znd->_wbj, gfp);
+		spin_lock_irqsave(&znd->wbjrnl_rwlck, flags);
+		if (count) {
+			struct map_cache_entry *maps = work_pg->maps;
+			struct map_pool *mp;
+
+			mp = mp_pick(znd->wbjrnl, znd->_wbj);
+			rc = do_sort_merge(mp, znd->wbjrnl, maps, count, 0);
+			znd->wbjrnl = mp;
+			__smp_mb();
+			if (rc)
+				Z_ERR(znd, "SortMerge failed: %d [%d]", rc,
+				      __LINE__);
+			if (znd->wbjrnl->count > MC_MOVE_SZ)
+				set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		}
+		spin_unlock_irqrestore(&znd->wbjrnl_rwlck, flags);
+		break;
+	case IS_TRIM:
+		mp_grow(znd->trim, znd->trim_mp, gfp);
+		spin_lock_irqsave(&znd->trim_rwlck, flags);
+		if (count) {
+			struct map_cache_entry *maps = work_pg->maps;
+			struct map_pool *mp;
+
+			mp = mp_pick(znd->trim, znd->trim_mp);
+			rc = do_sort_merge(mp, znd->trim, maps, count, 0);
+			znd->trim = mp;
+			__smp_mb();
+			if (rc)
+				Z_ERR(znd, "DSortMerge failed: %d [%d]", rc,
+				      __LINE__);
+
+			rc = 0;
+		}
+		if (znd->trim->count > MC_MOVE_SZ)
+			set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+		break;
+	case IS_UNUSED:
+		mp_grow(znd->unused, znd->_use, gfp);
+		spin_lock_irqsave(&znd->unused_rwlck, flags);
+		if (count) {
+			struct map_pool *mp;
+			struct map_cache_entry *maps;
+
+			maps = work_pg->maps;
+			mp = mp_pick(znd->unused, znd->_use);
+			rc = do_sort_merge(mp, znd->unused, maps, count, 0);
+			znd->unused = mp;
+			__smp_mb();
+			if (rc)
+				Z_ERR(znd, "USortMerge failed: %d [%d]", rc,
+				      __LINE__);
+			if (znd->unused->count > MC_MOVE_SZ)
+				set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		}
+		spin_unlock_irqrestore(&znd->unused_rwlck, flags);
+		break;
+	default:
+		rc = -EFAULT;
+	}
+
+out:
+	if (work_pg)
+		ZDM_FREE(znd, work_pg, Z_C4K, PG_09);
+	return rc;
+}
+
+/**
+ * do_load_map_cache() - Read a series of map cache blocks to restore from disk.
+ * @znd: ZDM Instance
+ * @lba: Starting LBA for reading
+ * @idx: Index of the saved/expected CRC for this block.
+ * @wq: True when I/O needs to use worker thread.
+ */
+static int do_load_map_cache(struct zdm *znd, u64 lba, int idx, int wq)
+{
+	return do_load_cache(znd, IS_INGRESS, lba, idx, wq);
+}
+
+/**
+ * do_load_discard_cache() - Read DISCARD (trim) map cache blocks.
+ * @znd: ZDM Instance
+ * @lba: Starting LBA for reading
+ * @idx: Index of the saved/expected CRC for this block.
+ * @wq: True when I/O needs to use worker thread.
+ */
+static int do_load_discard_cache(struct zdm *znd, u64 lba, int idx, int wq)
+{
+	return do_load_cache(znd, IS_TRIM, lba, idx, wq);
+}
+
+/**
+ * do_load_unused_cache() - Read a series of map cache blocks to restore ZDM.
+ * @znd: ZDM Instance
+ * @lba: Starting LBA for reading
+ * @idx: Index of the saved/expected CRC for this block.
+ * @wq: True when I/O needs to use worker thread.
+ */
+static int do_load_unused_cache(struct zdm *znd, u64 lba, int idx, int wq)
+{
+	return do_load_cache(znd, IS_UNUSED, lba, idx, wq);
+}
+
+/**
+ * do_load_wbjrnl_cache() - Read write-back journal map cache blocks.
+ * @znd: ZDM Instance
+ * @lba: Starting LBA for reading
+ * @idx: Index of the saved/expected CRC for this block.
+ * @wq: True when I/O needs to use worker thread.
+ */
+static int do_load_wbjrnl_cache(struct zdm *znd, u64 lba, int idx, int wq)
+{
+	return do_load_cache(znd, IS_WBJRNL, lba, idx, wq);
+}
+
+/**
+ * z_mapped_init() - Re-Load an existing ZDM instance from the block device.
+ * @znd: ZDM instance
+ *
+ * FIXME: Discard extent read-back does not match z_mapped_sync writing
+ */
+static int z_mapped_init(struct zdm *znd)
+{
+	int nblks = 1;
+	int wq = 0;
+	int rc = 1;
+	int idx = 0;
+	int jcount = 0;
+	u64 sblba = 0;
+	u64 lba = 0;
+	u64 generation;
+	__le32 crc_chk;
+	gfp_t gfp = GFP_KERNEL;
+	struct io_4k_block *io_vcache;
+
+	MutexLock(&znd->vcio_lock);
+	io_vcache = get_io_vcache(znd, gfp);
+
+	if (!io_vcache) {
+		Z_ERR(znd, "%s: FAILED to get SYNC CACHE.", __func__);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	generation = mcache_greatest_gen(znd, wq, &sblba, &lba);
+	if (generation == 0) {
+		rc = -ENODATA;
+		goto out;
+	}
+
+	if (lba == 0)
+		lba++;
+
+	/* read superblock */
+	rc = read_block(znd, DM_IO_VMA, io_vcache, sblba, nblks, wq);
+	if (rc)
+		goto out;
+
+	memcpy(znd->bmkeys, io_vcache, sizeof(*znd->bmkeys));
+
+	/* read in map cache */
+	for (idx = 0; idx < le16_to_cpu(znd->bmkeys->maps); idx++) {
+		rc = do_load_map_cache(znd, lba++, jcount++, wq);
+		if (rc)
+			goto out;
+	}
+
+	/* read in discard cache */
+	for (idx = 0; idx < le16_to_cpu(znd->bmkeys->discards); idx++) {
+		rc = do_load_discard_cache(znd, lba++, jcount++, wq);
+		if (rc)
+			goto out;
+	}
+
+	/* read in unused cache */
+	for (idx = 0; idx < le16_to_cpu(znd->bmkeys->unused); idx++) {
+		rc = do_load_unused_cache(znd, lba++, jcount++, wq);
+		if (rc)
+			goto out;
+	}
+
+	/* read in wbjrnl cache */
+	for (idx = 0; idx < le16_to_cpu(znd->bmkeys->wbjrnld); idx++) {
+		rc = do_load_wbjrnl_cache(znd, lba++, jcount++, wq);
+		if (rc)
+			goto out;
+	}
+
+	/* skip re-read of superblock */
+	if (lba == sblba)
+		lba++;
+
+	/* read in CRC pgs */
+	rc = read_block(znd, DM_IO_KMEM, znd->md_crcs, lba, 2, wq);
+	if (rc)
+		goto out;
+
+	crc_chk = znd->bmkeys->crc32;
+	znd->bmkeys->crc32 = 0;
+	znd->bmkeys->crc32 = cpu_to_le32(crc32c(~0u, znd->bmkeys, Z_CRC_4K));
+
+	if (crc_chk != znd->bmkeys->crc32) {
+		Z_ERR(znd, "Bad Block Map KEYS!");
+		Z_ERR(znd, "Key CRC: Ex: %04x vs %04x <- calculated",
+		      le32_to_cpu(crc_chk),
+		      le32_to_cpu(znd->bmkeys->crc32));
+		rc = -EIO;
+		goto out;
+	}
+
+	if (jcount != le16_to_cpu(znd->bmkeys->n_crcs)) {
+		Z_ERR(znd, " ... mcache entries: found = %u, expected = %u",
+		      jcount, le16_to_cpu(znd->bmkeys->n_crcs));
+		rc = -EIO;
+		goto out;
+	}
+
+	crc_chk = crc_md_le16(znd->md_crcs, Z_CRC_4K * 2);
+	if (crc_chk != znd->bmkeys->md_crc) {
+		Z_ERR(znd, "CRC of CRC PGs: Ex %04x vs %04x  <- calculated",
+		      le16_to_cpu(znd->bmkeys->md_crc),
+		      le16_to_cpu(crc_chk));
+		rc = -EIO;
+		goto out;
+	}
+
+	/*
+	 * Read write pointers / free counters.
+	 */
+	lba = WP_ZF_BASE;
+	znd->discard_count = 0;
+	for (idx = 0; idx < znd->gz_count; idx++) {
+		struct meta_pg *wpg = &znd->wp[idx];
+		__le16 crc_wp;
+		__le16 crc_zf;
+
+		rc = read_block(znd, DM_IO_KMEM, wpg->wp_alloc, lba, 1, wq);
+		if (rc)
+			goto out;
+		crc_wp = crc_md_le16(wpg->wp_alloc, Z_CRC_4K);
+		if (znd->bmkeys->wp_crc[idx] != crc_wp)
+			Z_ERR(znd, "WP @ %d does not match written.", idx);
+
+		rc = read_block(znd, DM_IO_KMEM, wpg->zf_est, lba + 1, 1, wq);
+		if (rc)
+			goto out;
+		crc_zf = crc_md_le16(wpg->zf_est, Z_CRC_4K);
+		if (znd->bmkeys->zf_crc[idx] != crc_zf)
+			Z_ERR(znd, "ZF @ %d does not match written.", idx);
+
+		Z_DBG(znd, "%d# -- WP: %04x [%04x] | ZF: %04x [%04x]",
+		      idx, znd->bmkeys->wp_crc[idx], crc_wp,
+			   znd->bmkeys->zf_crc[idx], crc_zf);
+
+		if (znd->bmkeys->wp_crc[idx] == crc_wp &&
+		    znd->bmkeys->zf_crc[idx] == crc_zf)
+			znd->discard_count += count_stale_blocks(znd, idx, wpg);
+
+		lba += 2;
+	}
+	znd->z_gc_resv   = le32_to_cpu(znd->bmkeys->gc_resv);
+	znd->z_meta_resv = le32_to_cpu(znd->bmkeys->meta_resv);
+
+out:
+	put_io_vcache(znd, io_vcache);
+	mutex_unlock(&znd->vcio_lock);
+	return rc;
+}
+
+/**
+ * mpool_try_merge() - Merge or Insert extent to pool
+ * @mcache: Map pool
+ * @entry: Map pool entry
+ * @tlba: tlba of extent
+ * @num: size of extent
+ * @blba: target of extent
+ */
+static int mpool_try_merge(struct map_pool *mcache, int entry, u64 tlba,
+			   u32 num, u64 blba)
+{
+	struct map_cache_entry *mce = mce_at(mcache, entry);
+	u32 nelem;
+	u32 flags;
+	u64 addr = le64_to_lba48(mce->tlba, &flags);
+	u64 bval = le64_to_lba48(mce->bval, &nelem);
+	int rc = 0;
+
+	if (flags & (MCE_NO_MERGE|MCE_NO_ENTRY))
+		goto out;
+
+	if ((num + nelem) < EXTENT_CEILING) {
+		u64 extent = (bval ? bval + nelem : 0);
+
+		if ((addr + nelem) == tlba && blba == extent) {
+			nelem += num;
+			mce->bval = lba48_to_le64(nelem, bval);
+			rc = 1;
+			goto out;
+		}
+	}
+
+out:
+	return rc;
+}
+
+/**
+ * mp_mrg_ins() - Merge or Insert extent to pool
+ * @mcache: Map pool
+ * @tlba: tlba of extent
+ * @flg: extent flag
+ * @num: size of extent
+ * @blba: target of extent
+ * @sorted: Is sorted flag.
+ */
+static int mp_mrg_ins(struct map_pool *mcache, u64 tlba, u32 flg, u32 num,
+		      u64 blba, int sorted)
+{
+	struct map_cache_entry *mce;
+	u64 prev;
+	int top = mcache->count;
+	int rc = 0;
+	bool do_ins = false;
+	bool do_mrg;
+
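+	/*
+	 * Try to extend the last extent in the pool; otherwise append,
+	 * but only while the pool stays sorted.
+	 */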
+	if (top < mcache->size) {
+		do_ins = do_mrg = true;
+
+		if (flg & MCE_NO_ENTRY) {
+			pr_err("mp_insert() no entry flag invalid!!\n");
+			dump_stack();
+		}
+
+		if (top == 0 || (flg & MCE_NO_MERGE))
+			do_mrg = false;
+		if (do_mrg) {
+			if (mpool_try_merge(mcache, top - 1, tlba, num, blba)) {
+				rc = 1;
+				do_ins = false;
+			}
+		}
+	}
+	if (do_ins && top > 0) {
+		mce = mce_at(mcache, top - 1);
+		prev = le64_to_lba48(mce->tlba, NULL);
+		if (prev > tlba)
+			do_ins = false;
+	}
+
+	if (do_ins) {
+		mce = mce_at(mcache, top);
+		mce->tlba = lba48_to_le64(flg, tlba);
+		mce->bval = lba48_to_le64(num, blba);
+		if (mcache->sorted == mcache->count)
+			++mcache->sorted, ++mcache->count;
+		rc = 1;
+	}
+
+	if (mcache->count != mcache->sorted)
+		pr_err(" ** NOT SORTED %d/%d\n", mcache->sorted, mcache->count);
+
+	return rc;
+}
+
+/**
+ * mp_mrg() - Merge extent to pool
+ * @mp: Map pool
+ * @tlba: tlba of extent
+ * @flg: extent flag
+ * @num: size of extent
+ * @blba: target of extent
+ */
+static int mp_mrg(struct map_pool *mp, u64 tlba, u32 flg, u32 num, u64 blba)
+{
+	return mp_mrg_ins(mp, tlba, flg, num, blba, 1);
+
+}
+
+/**
+ * mp_insert() - Insert extent to pool
+ * @mp: Map pool
+ * @tlba: tlba of extent
+ * @flg: extent flag
+ * @num: size of extent
+ * @blba: target of extent
+ */
+static int mp_insert(struct map_pool *mp, u64 tlba, u32 flg, u32 num, u64 blba)
+{
+	return mp_mrg_ins(mp, tlba, flg, num, blba, 0);
+}
+
+/**
+ * mpool_split() - Split @eval around the extent [tlba, tlba + num)
+ * @eval: Pool entry being split
+ * @tlba: tlba of extent
+ * @num: size of extent
+ * @blba: target of extent
+ * @cache: Receives the head, intersect and tail pieces of the split
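+ *
+ * For example, splitting the extent {tlba 100, len 100 -> blba 500}
+ * around an update of {tlba 150, len 10} yields a head {100, 50 -> 500},
+ * an intersect {150, 10 -> 550} and a tail {160, 40 -> 560}.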
+ */
+static int mpool_split(struct map_cache_entry *eval, u64 tlba, u32 num,
+		       u64 blba, struct map_cache_entry *cache)
+{
+	u32 nelem;
+	u32 flags;
+	u64 addr = le64_to_lba48(eval->tlba, &flags);
+	u64 bval = le64_to_lba48(eval->bval, &nelem);
+
+	if (num == 0) {
+		pr_err("%s: 0'd split? %llx/%u -> %llx", __func__,
+			tlba, num, blba);
+		dump_stack();
+	}
+
+	if (addr < tlba) {
+		u32 delta = tlba - addr;
+
+		if (nelem < delta) {
+			pr_err("Split does not overlap? E:%u | D:%u\n",
+				nelem, delta);
+			delta = nelem;
+		}
+
+		cache[MC_HEAD].tlba = lba48_to_le64(0, addr);
+		cache[MC_HEAD].bval = lba48_to_le64(delta, bval);
+		nelem -= delta;
+		addr += delta;
+		if (bval)
+			bval += delta;
+	} else if (addr > tlba) {
+		u32 delta = addr - tlba;
+
+		if (num < delta) {
+			pr_err("Split does not overlap? N:%u / D:%u\n",
+				num, delta);
+			delta = num;
+		}
+		cache[MC_HEAD].tlba = lba48_to_le64(0, tlba);
+		cache[MC_HEAD].bval = lba48_to_le64(delta, blba);
+		num -= delta;
+		tlba += delta;
+		if (blba)
+			blba += delta;
+	}
+
+	if (num >= nelem) {
+		num -= nelem;
+
+		/* note: Intersect will be updated by caller */
+		cache[MC_INTERSECT].tlba = lba48_to_le64(0, addr);
+		cache[MC_INTERSECT].bval = lba48_to_le64(nelem, bval);
+		if (num) {
+			tlba += nelem;
+			if (blba)
+				blba += nelem;
+			cache[MC_TAIL].tlba = lba48_to_le64(0, tlba);
+			cache[MC_TAIL].bval = lba48_to_le64(num, blba);
+		}
+	} else {
+		nelem -= num;
+		cache[MC_INTERSECT].tlba = lba48_to_le64(0, addr);
+		cache[MC_INTERSECT].bval = lba48_to_le64(num, bval);
+		addr += num;
+		if (bval)
+			bval += num;
+		if (nelem) {
+			cache[MC_TAIL].tlba = lba48_to_le64(0, addr);
+			cache[MC_TAIL].bval = lba48_to_le64(nelem, bval);
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * do_sort_merge() - Perform a sorted merge of @src and @chng into @to
+ * @to: New map pool to contain updated data
+ * @src: Current map pool source
+ * @chng: Changes to be merged
+ * @nchgs: Number of entries in @chng
+ * @drop: Drop empty extents from src / chng
+ */
+static int do_sort_merge(struct map_pool *to, struct map_pool *src,
+			 struct map_cache_entry *chng, int nchgs, int drop)
+{
+	struct map_cache_entry *s_mpe;
+	u64 c_addr = ~0u;
+	u64 c_lba = 0;
+	u32 c_num = 0;
+	u32 c_flg = 0;
+	int c_idx = 0;
+	u64 s_addr = ~0u;
+	u64 s_lba = 0;
+	u32 s_flg = 0;
+	u32 s_num = 0;
+	int s_idx = 0;
+	int err = 0;
+
+	if (to->count) {
+		pr_err("Sort Merge target is non-empty?\n");
+		dump_stack();
+	}
+
+
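+	/*
+	 * Two-way merge: walk @src and @chng in tLBA order, always
+	 * emitting the lower address first. When both hold the same
+	 * address the change entry wins. MCE_NO_ENTRY and zero-address
+	 * entries are dropped along the way.
+	 */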
+	do {
+		while (s_idx < src->count) {
+			s_mpe = mce_at(src, s_idx);
+			s_addr = le64_to_lba48(s_mpe->tlba, &s_flg);
+			s_lba = le64_to_lba48(s_mpe->bval, &s_num);
+			if (s_flg & MCE_NO_ENTRY)
+				s_addr = ~0ul;
+			else if (s_addr)
+				break;
+			s_idx++;
+		}
+		while (c_idx < nchgs) {
+			c_addr = le64_to_lba48(chng[c_idx].tlba, &c_flg);
+			c_lba = le64_to_lba48(chng[c_idx].bval, &c_num);
+			if (c_addr)
+				break;
+			c_idx++;
+		}
+		if (s_idx >= src->count)
+			s_addr = ~0ul;
+		if (c_idx >= nchgs)
+			c_addr = ~0ul;
+
+		if (s_addr == ~0ul && c_addr == ~0ul)
+			break;
+
+		if (s_addr == c_addr) {
+			int add = (c_flg & MCE_NO_ENTRY) ? 0 : 1;
+
+			if (add && !mp_mrg(to, c_addr, c_flg, c_num, c_lba)) {
+				pr_err("Failed to (overwrite) insert %llx\n",
+				       c_addr);
+				err = -EIO;
+				goto out;
+			}
+			c_idx++;
+			s_idx++;
+		} else if (s_addr < c_addr) {
+			int add = (s_flg & MCE_NO_ENTRY) ? 0 : 1;
+
+			if (add && s_num &&
+			    !mp_mrg(to, s_addr, s_flg, s_num, s_lba)) {
+				pr_err("Failed to (cur) insert %llx\n", s_addr);
+				err = -EIO;
+				goto out;
+			}
+			s_idx++;
+		} else {
+			int add = (c_flg & MCE_NO_ENTRY) ? 0 : 1;
+
+			if (add && c_num &&
+			    !mp_mrg(to, c_addr, c_flg, c_num, c_lba)) {
+				pr_err("Failed to (new) insert %llx\n", c_addr);
+				err = -EIO;
+				goto out;
+			}
+			c_idx++;
+		}
+	} while (s_idx < src->count || c_idx < nchgs);
+
+out:
+
+	return err;
+}
+
+/**
+ * _common_intersect() - Find pool entries that overlap a range
+ * @mcache: Pool for intersect
+ * @key: addr of map entry
+ * @range: Number of (contiguous) map entries.
+ * @mc_pg: Page of data to merge
+ */
+static int _common_intersect(struct map_pool *mcache, u64 key, u32 range,
+			     struct map_cache_page *mc_pg)
+{
+	struct map_cache_entry *mpe;
+	struct map_cache_entry *out = mc_pg->maps;
+	int avail = ARRAY_SIZE(mc_pg->maps);
+	int count = 0;
+	int fill = ISCT_BASE;
+	int idx;
+
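+	/*
+	 * Copy every pool entry overlapping [key, key + range) into the
+	 * intersect slots of @mc_pg (every MC_SKIP-th slot, leaving room
+	 * for the head/tail pieces a later split will produce) and pin
+	 * the original entry with MCE_NO_MERGE.
+	 */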
+	for (idx = 0; idx < mcache->count; idx++) {
+		u32 melem;
+		u32 flags;
+		u64 addr;
+
+		mpe = mce_at(mcache, idx);
+		addr = le64_to_lba48(mpe->tlba, &flags);
+
+		if (flags & MCE_NO_ENTRY)
+			continue;
+
+		(void)le64_to_lba48(mpe->bval, &melem);
+		if (key < (addr + melem) && addr < (key + range)) {
+			if (fill < avail) {
+				flags |= MCE_NO_MERGE;
+				mpe->tlba = lba48_to_le64(flags, addr);
+				memcpy(&out[fill], mpe, sizeof(*out));
+			}
+			fill += MC_SKIP;
+			count++;
+		}
+	}
+	return count;
+}
+
+/**
+ * __mrg_splt() - Split intersected extents around a new mapping
+ * @znd: ZDM instance
+ * @issue_unused: Record displaced blocks in the unused cache
+ * @mc_pg: Page of data to merge
+ * @entries: Number of entries in @mc_pg
+ * @addr: addr of map entry
+ * @count: Number of (contiguous) map entries.
+ * @target: bLBA
+ * @gfp: allocation mask.
+ */
+static int __mrg_splt(struct zdm *znd, bool issue_unused,
+		      struct map_cache_page *mc_pg, int entries,
+		      u64 addr, u32 count, u64 target, gfp_t gfp)
+{
+	int idx;
+	u64 lba_new = target;
+	u64 unused;
+	u64 lba_was;
+	int in_use = ISCT_BASE + (entries * MC_SKIP);
+	struct map_cache_entry *mce = mc_pg->maps;
+
+	for (idx = ISCT_BASE; idx < in_use; idx += MC_SKIP) {
+		struct map_cache_entry cache[3];
+		u32 decr = count;
+
+		memset(cache, 0, sizeof(cache));
+		mpool_split(&mce[idx], addr, count, target, cache);
+
+		/* head:       *before* addr */
+		/* intersect:  *includes* addr */
+		/* tail:       *MAYBE* addr *OR* mce */
+
+		/* copy cache* entries back to mc_pg */
+		mce[idx - 1] = cache[MC_HEAD];
+
+		/* Check: if addr was before isect entry */
+		unused = le64_to_lba48(cache[MC_HEAD].tlba, NULL);
+		if (unused == addr) {
+			lba_was = le64_to_lba48(cache[MC_HEAD].bval, &decr);
+			if (addr)
+				addr += decr;
+			if (target)
+				lba_new = lba_was + decr;
+			count -= decr;
+		}
+
+		unused = le64_to_lba48(cache[MC_INTERSECT].tlba, NULL);
+		lba_was = le64_to_lba48(cache[MC_INTERSECT].bval, &decr);
+
+		if (unused != addr)
+			Z_ERR(znd, "FAIL: addr [%llx] != intersect [%llx]",
+			      addr, unused);
+
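+		/*
+		 * When @issue_unused is set and the intersected extent
+		 * pointed at a different bLBA than the incoming mapping,
+		 * its old blocks are now stale; queue them on the
+		 * unused cache.
+		 */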
+		if (issue_unused && lba_was != lba_new && decr > 0)
+			unused_add(znd, lba_was, unused, decr, GFP_ATOMIC);
+
+		mce[idx].tlba = lba48_to_le64(0, addr);
+		mce[idx].bval = lba48_to_le64(decr, target);
+
+		if (addr)
+			addr += decr;
+		if (target)
+			lba_new = lba_was + decr;
+		count -= decr;
+
+		mce[idx + 1] = cache[MC_TAIL];
+	}
+	return 0;
+}
+
+/**
+ * _common_merges() - Common merge ranges
+ * @znd: ZDM instance
+ * @mc_pg: Page of data to merge
+ * @entries: Number of entries in @mc_pg
+ * @addr: addr of map entry
+ * @count: Number of (contiguous) map entries.
+ * @target: bLBA
+ * @gfp: allocation mask.
+ */
+static int _common_merges(struct zdm *znd,
+			  struct map_cache_page *mc_pg, int entries,
+			  u64 addr, u32 count, u64 target, gfp_t gfp)
+{
+	return __mrg_splt(znd, true, mc_pg, entries, addr, count, target, gfp);
+}
+
+/**
+ * unused_merges() - Merge unused ranges
+ * @znd: ZDM instance
+ * @mc_pg: Page of data to merge
+ * @entries: Number of entries in @mc_pg
+ * @addr: addr of map entry
+ * @count: Number of (contiguous) map entries.
+ * @target: bLBA
+ * @gfp: allocation mask.
+ */
+static int unused_merges(struct zdm *znd,
+			 struct map_cache_page *mc_pg, int entries,
+			 u64 addr, u32 count, u64 target, gfp_t gfp)
+{
+	return __mrg_splt(znd, false, mc_pg, entries, addr, count, target, gfp);
+}
+
+/**
+ * _common_drop() - Drop empty extents
+ * @znd: ZDM instance
+ * @mc_pg: Page of data to drop
+ * @key: addr of map entry
+ * @range: Number of (contiguous) map entries.
+ * @entries: Number of entries in @mc_pg
+ */
+static int _common_drop(struct zdm *znd, struct map_cache_page *mc_pg,
+			u64 key, u32 range, int entries)
+{
+	int idx;
+	struct map_cache_entry *maps = mc_pg->maps;
+	int in_use = 0;
+
+	if (entries)
+		in_use = ISCT_BASE + (entries * MC_SKIP);
+
+	for (idx = 0; idx < in_use; idx++) {
+		u32 melem;
+		u32 flags;
+		u64 addr = le64_to_lba48(maps[idx].tlba, &flags);
+		u64 bval = le64_to_lba48(maps[idx].bval, &melem);
+
+		if (key < (addr+melem) && addr < (key + range)) {
+			flags |= MCE_NO_ENTRY;
+			maps[idx].tlba = lba48_to_le64(flags, addr);
+			Z_DBG(znd, " .. drop: [%llx, %llx} {+%u}",
+			      addr, addr + melem, melem);
+		}
+		(void) bval;
+	}
+	return 0;
+}
+
+/**
+ * unused_update() - Add or reuse a range of unused extents
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @source: bLBA
+ * @count: Number of (contiguous) map entries to add.
+ * @drop: When set, do not insert a new extent if nothing overlaps.
+ * @gfp: allocation mask.
+ */
+static int unused_update(struct zdm *znd, u64 addr, u64 source, u32 count,
+			 bool drop, gfp_t gfp)
+{
+	struct map_cache_entry *maps = NULL;
+	struct map_cache_page *m_pg = NULL;
+	unsigned long flags;
+	int rc = 0;
+	int matches;
+	int avail;
+
+	if (addr < znd->data_lba) /* FIXME? */
+		return rc;
+
+	m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+	if (!m_pg)
+		return -ENOMEM;
+
+	source = 0ul; /* this is not needed again */
+
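+	/*
+	 * If the sorted merge target fills up (-EBUSY below) the map
+	 * cache is flushed to the ZLT and the update is retried.
+	 */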
+resubmit:
+	rc = 0;
+	mp_grow(znd->unused, znd->_use, gfp);
+	spin_lock_irqsave(&znd->unused_rwlck, flags);
+	maps = m_pg->maps;
+	avail = ARRAY_SIZE(m_pg->maps);
+	matches = _common_intersect(znd->unused, addr, count, m_pg);
+	if (drop && matches == 0)
+		goto out_unlock;
+
+	if (matches == 0) {
+		const u32 sflg = 0;
+		int in = mp_insert(znd->unused, addr, sflg, count, source);
+
+		if (in != 1) {
+			maps[ISCT_BASE].tlba = lba48_to_le64(sflg, addr);
+			maps[ISCT_BASE].bval = lba48_to_le64(count, source);
+			matches = 1;
+		}
+	}
+	if (matches) {
+		struct map_pool *mp;
+		const int drop = 0;
+
+		unused_merges(znd, m_pg, matches, addr, count, source, gfp);
+		if (drop)
+			_common_drop(znd, m_pg, addr, count, matches);
+
+		mp = mp_pick(znd->unused, znd->_use);
+		rc = do_sort_merge(mp, znd->unused, maps, avail, drop);
+		if (unlikely(rc)) {
+			Z_ERR(znd, "USortMerge failed: %d [%d]", rc, __LINE__);
+			rc = -EBUSY;
+		} else {
+			znd->unused = mp;
+			__smp_mb();
+		}
+	}
+out_unlock:
+	spin_unlock_irqrestore(&znd->unused_rwlck, flags);
+
+	if (rc == -EBUSY) {
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		do_move_map_cache_to_table(znd, 0, gfp);
+		goto resubmit;
+	}
+
+	if (znd->unused->count > MC_MOVE_SZ)
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+
+	if (m_pg)
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+
+	return rc;
+}
+
+/**
+ * unused_add() - Add a range of unused extents
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @from: bLBA
+ * @count: Number of (contiguous) map entries to add.
+ * @gfp: allocation mask.
+ */
+static int unused_add(struct zdm *znd, u64 addr, u64 from, u32 count, gfp_t gfp)
+{
+	return unused_update(znd, addr, from, count, false, gfp);
+}
+
+/**
+ * unused_reuse() - Remove a range of unused extents
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @count: Number of (contiguous) map entries to reuse.
+ * @gfp: allocation mask.
+ */
+static int unused_reuse(struct zdm *znd, u64 addr, u64 count, gfp_t gfp)
+{
+	return unused_update(znd, addr, 0ul, count, true, gfp);
+}
+
+/**
+ * trim_deref_range() - Remove a range of trim extents
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @count: Number of (contiguous) map entries to remove.
+ * @lba: lba on backing device.
+ * @gfp: allocation mask.
+ *
+ * On Ingress we may need to punch a hole in a discard extent
+ */
+static int trim_deref_range(struct zdm *znd, u64 addr, u32 count, u64 lba,
+			    gfp_t gfp)
+{
+	struct map_cache_page *m_pg = NULL;
+	unsigned long flags;
+	int err = 0;
+	int avail;
+	int matches;
+
+	m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+	if (!m_pg)
+		return -ENOMEM;
+resubmit:
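+	/*
+	 * Intersect the new mapping with the existing trim extents,
+	 * split them around [addr, addr + count) and drop the overlapped
+	 * pieces, then rebuild the trim pool with a sorted merge.
+	 */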
+	mp_grow(znd->trim, znd->trim_mp, gfp);
+	spin_lock_irqsave(&znd->trim_rwlck, flags);
+	avail = ARRAY_SIZE(m_pg->maps);
+	matches = _common_intersect(znd->trim, addr, count, m_pg);
+	if (matches) {
+		struct map_pool *mp;
+		const int drop = 1;
+
+		_common_merges(znd, m_pg, matches, addr, count, lba, gfp);
+		_common_drop(znd, m_pg, addr, count, matches);
+		mp = mp_pick(znd->trim, znd->trim_mp);
+		err = do_sort_merge(mp, znd->trim, m_pg->maps, avail, drop);
+		if (unlikely(err)) {
+			Z_ERR(znd, "DSortMerge failed: %d [%d]", err, __LINE__);
+			err = -EBUSY;
+		} else {
+			znd->trim = mp;
+			__smp_mb();
+		}
+	}
+	spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
+	if (err == -EBUSY) {
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		do_move_map_cache_to_table(znd, 0, gfp);
+		goto resubmit;
+	}
+	if (m_pg)
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+
+	return err;
+}
+
+/**
+ * ref_crc_pgs() - Ref CRC pages on an array of pages
+ * @pgs: Array of pages
+ * @cpg: Number of pages in use
+ * @npgs: Size of the @pgs array
+ */
+static int ref_crc_pgs(struct map_pg **pgs, int cpg, int npgs)
+{
+	if (pgs) {
+		int count = cpg;
+		int idx;
+
+		for (idx = 0; idx < count; idx++) {
+			if (pgs[idx]->crc_pg && cpg < npgs) {
+				if (cpg > 0 && pgs[cpg - 1] == pgs[idx]->crc_pg)
+					continue;
+				pgs[cpg] = pgs[idx]->crc_pg;
+				ref_pg(pgs[idx]->crc_pg);
+				cpg++;
+			}
+		}
+	}
+	return cpg;
+}
+
+/**
+ * deref_all_pgs() - Deref the array of pages
+ * @znd: ZDM Instance
+ * @pgs: Array of pages
+ * @npgs: Size of the @pgs array
+ */
+static void deref_all_pgs(struct zdm *znd, struct map_pg **pgs, int npgs)
+{
+	const int fastage = low_cache_mem(znd);
+	const u64 decr = msecs_to_jiffies(znd->cache_ageout_ms - 1);
+	int cpg;
+
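+	/*
+	 * Under memory pressure, back-date each page by nearly a full
+	 * cache ageout period so it becomes an early eviction candidate.
+	 */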
+	for (cpg = 0; cpg < npgs; cpg++) {
+		if (pgs[cpg] == NULL)
+			break;
+		if (fastage) {
+			pgs[cpg]->age = jiffies_64;
+			if (pgs[cpg]->age > decr)
+				pgs[cpg]->age -= decr;
+		}
+		deref_pg(pgs[cpg]);
+		pgs[cpg] = NULL;
+	}
+}
+
+/**
+ * working_set() - Manage a page of working set
+ * @wset: Working set
+ * @mp: Map pool
+ * @count: Number of entries to pull
+ * @create: create (or refresh) the working set
+ */
+static void working_set(struct map_cache_entry *wset, struct map_pool *mp,
+		        int count, int create)
+{
+	struct map_cache_entry *mce;
+	u64 addr;
+	u64 conf;
+	u32 flags;
+	u32 incl = min(count, mp->count);
+	int iter;
+	bool do_clear;
+
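+	/*
+	 * On create: pin the first @count pool entries with MCE_NO_MERGE
+	 * and copy them into @wset. On refresh: keep a slot only if the
+	 * pool entry at the same index still has the same tLBA and is
+	 * still pinned, otherwise clear it.
+	 */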
+	for (iter = 0; iter < count; iter++) {
+		do_clear = false;
+		if (iter < incl) {
+			mce = mce_at(mp, iter);
+			addr = le64_to_lba48(mce->tlba, &flags);
+			if (create) {
+				flags |= MCE_NO_MERGE;
+				mce->tlba = lba48_to_le64(flags, addr);
+				memcpy(&wset[iter], mce, sizeof(*wset));
+				continue;
+			}
+			conf = le64_to_lba48(wset[iter].tlba, NULL);
+			if (conf == addr && (flags & MCE_NO_MERGE))
+				memcpy(&wset[iter], mce, sizeof(*wset));
+			else
+				do_clear = true;
+		} else {
+			do_clear = true;
+		}
+		if (do_clear)
+			memset(&wset[iter], 0, sizeof(*wset));
+	}
+}
+
+/**
+ * unmap_deref_chunk() - Migrate a chunk of discarded blocks to ingress cache
+ * @znd: ZDM Instance
+ * @minblks: Minimum number of blocks to move
+ * @more: Migrate blocks even when the trim cache is mostly empty.
+ * @gfp: GFP allocation mask
+ */
+static int unmap_deref_chunk(struct zdm *znd, u32 minblks, int more, gfp_t gfp)
+{
+	u64 lba;
+	unsigned long flags;
+	int iter;
+	int squash = 0;
+	int err = 0;
+	int moving = 5;
+	const int trim = false;
+	int noio = 0;
+
+	if (znd->ingress->count > MC_HIGH_WM || znd->unused->count > MC_HIGH_WM)
+		goto out;
+
+	if (znd->trim->count < 10 && !more)
+		goto out;
+
+	if (gfp == GFP_KERNEL) {
+		if (!spin_trylock_irqsave(&znd->gc_postmap.cached_lock, flags))
+			goto out;
+	} else {
+		spin_lock_irqsave(&znd->gc_postmap.cached_lock, flags);
+	}
+	spin_unlock_irqrestore(&znd->gc_postmap.cached_lock, flags);
+
+	moving = 5;
+	for (iter = 0; iter < moving; iter++) {
+		struct map_cache_entry *mce;
+		u64 tlba, addr;
+		u32 flgs, blks, range, decr;
+
+		spin_lock_irqsave(&znd->trim_rwlck, flags);
+		mce = mce_at(znd->trim, iter);
+		if (moving < znd->trim->count) {
+			moving = znd->trim->count;
+			if (iter >= moving) {
+				spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+				break;
+			}
+		}
+		tlba = addr = le64_to_lba48(mce->tlba, &flgs);
+		if (flgs & MCE_NO_ENTRY) {
+			spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+			continue;
+		}
+		(void)le64_to_lba48(mce->bval, &blks);
+		addr = tlba;
+		decr = blks;
+		flgs |= MCE_NO_MERGE;
+		mce->tlba = lba48_to_le64(flgs, tlba);
+		spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
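+		/*
+		 * First pass: resolve the current mapping with I/O
+		 * allowed and the trim lock dropped. A second pass below
+		 * re-validates the entry under the lock (noio) before
+		 * moving it to the ingress cache.
+		 */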
+		do {
+			range = blks;
+			lba = __map_rng(znd, tlba, &range, trim,
+					noio, GFP_ATOMIC);
+			if (lba == ~0ul) {
+				Z_DBG(znd, "%s: __map_rng failed [%llx+%u].",
+				      __func__, tlba, blks);
+				err = -EBUSY;
+				goto out;
+			}
+			if (range) {
+				blks -= range;
+				tlba += range;
+			}
+		} while (blks > 0);
+
+		spin_lock_irqsave(&znd->trim_rwlck, flags);
+		mce = mce_at(znd->trim, iter);
+		if (moving < znd->trim->count) {
+			moving = znd->trim->count;
+			if (iter >= moving) {
+				spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+				break;
+			}
+		}
+
+		tlba = addr;
+		(void)le64_to_lba48(mce->bval, &blks);
+		if (decr != blks ||
+		    tlba != le64_to_lba48(mce->tlba, NULL)) {
+			spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+			goto out;
+		}
+
+		noio = 1;
+		do {
+			range = blks;
+			lba = __map_rng(znd, tlba, &range, trim,
+					noio, GFP_ATOMIC);
+			if (lba == ~0ul) {
+				err = -EBUSY;
+				spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+				goto out;
+			}
+			if (range) {
+				if (lba) {
+					err = ingress_add(znd, tlba, 0ul, range,
+							  GFP_ATOMIC);
+					if (err) {
+						spin_unlock_irqrestore(
+							&znd->trim_rwlck, flags);
+						goto out;
+					}
+				}
+				blks -= range;
+				tlba += range;
+			}
+		} while (blks > 0);
+		mce->tlba = lba48_to_le64(MCE_NO_ENTRY, tlba);
+		squash = 1;
+		spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
+		if (znd->ingress->count > MC_HIGH_WM ||
+		    znd->unused->count > MC_HIGH_WM)
+			goto out;
+
+		if (minblks < decr)
+			break;
+		minblks -= decr;
+	}
+
+out:
+	/* remove the MCE_NO_ENTRY maps from the table */
+	if (squash) {
+		struct map_cache_page *m_pg = NULL;
+		int avail = 0;
+		struct map_pool *mp;
+		const int drop = 1;
+		int rc;
+
+		m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+		if (!m_pg)
+			return -ENOMEM;
+
+resubmit:
+		mp_grow(znd->trim, znd->trim_mp, gfp);
+		spin_lock_irqsave(&znd->trim_rwlck, flags);
+		mp = mp_pick(znd->trim, znd->trim_mp);
+		rc = do_sort_merge(mp, znd->trim, m_pg->maps, avail, drop);
+		if (unlikely(rc)) {
+			Z_ERR(znd, "DSortMerge failed: %d [%d]", rc, __LINE__);
+			rc = -EBUSY;
+		} else {
+			znd->trim = mp;
+			__smp_mb();
+		}
+		spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
+		if (rc == -EBUSY) {
+			set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+			do_move_map_cache_to_table(znd, 0, gfp);
+			goto resubmit;
+		}
+
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+	}
+
+	if (znd->trim->count > MC_MOVE_SZ || znd->ingress->count > MC_MOVE_SZ)
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+
+	return err;
+}
+
+/**
+ * gc_map_drop() - Notify active GC queue of updated mapping(s)
+ * @znd: ZDM instance
+ * @addr: starting address being discarded
+ * @count: number of blocks being discarded
+ */
+static int gc_map_drop(struct zdm *znd, u64 addr, u32 count)
+{
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *data = post->gc_mcd;
+	u64 last_addr = addr + count;
+	u64 tlba;
+	unsigned long flags;
+	u32 t_flgs;
+	int idx;
+
+	spin_lock_irqsave(&post->cached_lock, flags);
+	for (idx = 0; idx < post->jcount; idx++) {
+		if (data->maps[idx].tlba == MC_INVALID)
+			continue;
+		tlba = le64_to_lba48(data->maps[idx].tlba, &t_flgs);
+		if (tlba == Z_LOWER48)
+			continue;
+		if (t_flgs & GC_DROP)
+			continue;
+		if (tlba >= addr && tlba < last_addr) {
+			u64 bval = le64_to_lba48(data->maps[idx].bval, NULL);
+
+			Z_DBG(znd, "GC postmap #%d: %llx in range "
+				   "[%llx, %llx) is stale. {%llx}",
+			      idx, tlba, addr, addr+count, bval);
+
+			t_flgs |= GC_DROP;
+			data->maps[idx].tlba = lba48_to_le64(t_flgs, tlba);
+			data->maps[idx].bval = lba48_to_le64(0, bval);
+		}
+	}
+	spin_unlock_irqrestore(&post->cached_lock, flags);
+
+	return 0;
+}
+
+/**
+ * ingress_add() - Add a tLBA/bLBA mapping
+ * @znd: ZDM instance
+ * @addr: starting tLBA being mapped
+ * @lba: starting allocated block (bLBA)
+ * @count: number of blocks being mapped
+ * @gfp: allocation mask
+ */
+static int ingress_add(struct zdm *znd, u64 addr, u64 lba, u32 count,
+		       gfp_t gfp)
+{
+	struct map_cache_page *m_pg = NULL;
+	unsigned long flags;
+	int rc = 0;
+	int matches;
+	int avail;
+
+	m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+	if (!m_pg)
+		return -ENOMEM;
+
+	if (znd->ingress->count > MC_MOVE_SZ)
+		do_move_map_cache_to_table(znd, 0, gfp);
+
+resubmit:
+	mp_grow(znd->ingress, znd->in, gfp);
+	spin_lock_irqsave(&znd->in_rwlck, flags);
+	avail = ARRAY_SIZE(m_pg->maps);
+	matches = _common_intersect(znd->ingress, addr, count, m_pg);
+	if (matches == 0) {
+		const u32 sflg = 0;
+		int in = mp_insert(znd->ingress, addr, sflg, count, lba);
+
+		if (in != 1) {
+			m_pg->maps[ISCT_BASE].tlba = lba48_to_le64(sflg, addr);
+			m_pg->maps[ISCT_BASE].bval = lba48_to_le64(count, lba);
+			matches = 1;
+		}
+	}
+	if (matches) {
+		struct map_pool *mp;
+		const int drop = 0;
+
+		_common_merges(znd, m_pg, matches, addr, count, lba, gfp);
+		mp = mp_pick(znd->ingress, znd->in);
+		rc = do_sort_merge(mp, znd->ingress, m_pg->maps, avail, drop);
+		if (unlikely(rc)) {
+			Z_ERR(znd, "SortMerge failed: %d [%d]", rc, __LINE__);
+			dump_stack();
+			rc = -EBUSY;
+		} else {
+			znd->ingress = mp;
+			__smp_mb();
+		}
+	}
+	spin_unlock_irqrestore(&znd->in_rwlck, flags);
+
+	if (znd->ingress->count > MC_MOVE_SZ)
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+
+	if (rc == -EBUSY) {
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		do_move_map_cache_to_table(znd, 0, gfp);
+		memset(m_pg->maps, 0, sizeof(*m_pg));
+		goto resubmit;
+	}
+
+	if (m_pg)
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+
+	return rc;
+}
+
+/**
+ * unmap_overwritten() - Update the unused cache with new mapping data
+ * @znd: ZDM instance
+ * @addr: starting address being overwritten
+ * @count: number of blocks being overwritten
+ * @gfp: allocation mask
+ */
+static int unmap_overwritten(struct zdm *znd, u64 addr, u32 count, gfp_t gfp)
+{
+	u64 lba;
+	u64 tlba = addr;
+	u32 blks = count;
+	u32 range;
+	const bool trim = true; /* check discard entries */
+	const int noio = 0;
+	int err = 0;
+
+	do {
+		range = blks;
+		lba = __map_rng(znd, tlba, &range, trim, noio, gfp);
+
+		if (lba == ~0ul || range == 0 || lba < znd->data_lba)
+			goto out;
+
+		if (lba) {
+			u32 zone = _calc_zone(znd, lba);
+			u32 gzno  = zone >> GZ_BITS;
+			u32 gzoff = zone & GZ_MMSK;
+			struct meta_pg *wpg = &znd->wp[gzno];
+
+			if (zone < znd->zone_count) {
+				_dec_wp_avail_by_lost(wpg, gzoff, range);
+				update_stale_ratio(znd, zone);
+				err = unused_add(znd, lba, tlba, range, gfp);
+				if (err)
+					goto out;
+			}
+		}
+		if (range) {
+			blks -= range;
+			tlba += range;
+		}
+	} while (blks > 0);
+
+out:
+	return err;
+}
+
+/**
+ * do_add_map() - Add multiple entries into the map_cache
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @lba: lba on backing device.
+ * @count: Number of (contiguous) map entries to add.
+ * @gc: drop from gc post map (if true).
+ * @gfp: allocation mask.
+ */
+static int do_add_map(struct zdm *znd, u64 addr, u64 lba, u32 count, bool gc,
+		      gfp_t gfp)
+{
+	int rc = 0;
+
+	if (addr < znd->data_lba)
+		return rc;
+
+	if (gc)
+		gc_map_drop(znd, addr, count);
+
+	/*
+	 * When mapping new (non-discard) entries we need to punch out any
+	 * entries in the 'trim' table.
+	 */
+	if (lba) {
+		rc = trim_deref_range(znd, addr, count, lba, gfp);
+		if (rc)
+			goto out;
+		rc = unmap_overwritten(znd, addr, count, gfp);
+		if (rc)
+			goto out;
+		rc = unused_reuse(znd, lba, count, gfp);
+		if (rc)
+			goto out;
+	}
+	rc = ingress_add(znd, addr, lba, count, gfp);
+
+out:
+	return rc;
+}
+
+/**
+ * z_mapped_addmany() - Add multiple entries into the map_cache
+ * @znd: ZDM instance
+ * @addr: tLBA
+ * @lba: lba on backing device.
+ * @count: Number of (contiguous) map entries to add.
+ * @gfp: allocation mask.
+ */
+static int z_mapped_addmany(struct zdm *znd, u64 addr, u64 lba, u32 count,
+			    gfp_t gfp)
+{
+	return do_add_map(znd, addr, lba, count, true, gfp);
+}
+
+/**
+ * predictive_zfree() - Predict stale count
+ * @znd: ZDM instance
+ * @addr: starting address being discarded
+ * @count: number of blocks being discarded
+ * @gfp: allocation mask
+ */
+static void predictive_zfree(struct zdm *znd, u64 addr, u32 count, gfp_t gfp)
+{
+	u64 lba;
+	u32 offset;
+	u32 num;
+	const bool trim = false;
+	const bool jrnl = true;
+
+	for (offset = 0; offset < count; ) {
+		num = count - offset;
+		if (num == 0)
+			break;
+		if (num > 64)
+			num = 64;
+		lba = _current_mapping(znd, addr + offset, trim, jrnl, gfp);
+		if (lba == 0 || lba == ~0ul)
+			break;
+		if (lba && lba >= znd->data_lba) {
+			u32 zone = _calc_zone(znd, lba);
+			u32 limit = (lba - znd->md_start) & ~(Z_BLKSZ - 1);
+			u32 gzno  = zone >> GZ_BITS;
+			u32 gzoff = zone & GZ_MMSK;
+			struct meta_pg *wpg = &znd->wp[gzno];
+
+			if (limit && num > limit)
+				num = limit;
+
+			_dec_wp_avail_by_lost(wpg, gzoff, num);
+			update_stale_ratio(znd, zone);
+		}
+		offset += num;
+	}
+}
+
+/**
+ * z_mapped_discard() - Add a discard extent to the mapping cache
+ * @znd: ZDM Instance
+ * @addr: Address being discarded.
+ * @count: Number of blocks being discarded.
+ * @gfp: Allocation mask.
+ */
+static int z_mapped_discard(struct zdm *znd, u64 addr, u32 count, gfp_t gfp)
+{
+	const u64 lba = 0ul;
+	struct map_cache_page *m_pg = NULL;
+	struct map_cache_entry *maps = NULL;
+	unsigned long flags;
+	int rc = 0;
+	int matches;
+	int avail;
+
+	m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+	if (!m_pg)
+		return -ENOMEM;
+
+	maps = m_pg->maps;
+	avail = ARRAY_SIZE(m_pg->maps);
+
+	predictive_zfree(znd, addr, count, gfp);
+	gc_map_drop(znd, addr, count);
+
+resubmit:
+	mp_grow(znd->trim, znd->trim_mp, gfp);
+	spin_lock_irqsave(&znd->trim_rwlck, flags);
+	matches = _common_intersect(znd->trim, addr, count, m_pg);
+	if (matches == 0) {
+		const u32 sflg = 0;
+
+		if (mp_insert(znd->trim, addr, sflg, count, lba) != 1) {
+			maps[ISCT_BASE].tlba = lba48_to_le64(0, addr);
+			maps[ISCT_BASE].bval = lba48_to_le64(count, lba);
+			matches = 1;
+		}
+	}
+	if (matches) {
+		struct map_pool *mp;
+		const int drop = 0;
+
+		_common_merges(znd, m_pg, matches, addr, count, lba, gfp);
+		mp = mp_pick(znd->trim, znd->trim_mp);
+		rc = do_sort_merge(mp, znd->trim, maps, avail, drop);
+		if (unlikely(rc)) {
+			Z_ERR(znd, "DSortMerge failed: %d [%d]", rc, __LINE__);
+			rc = -EBUSY;
+		} else {
+			znd->trim = mp;
+			__smp_mb();
+		}
+		rc = 0;
+	}
+	if (znd->trim->count > MC_MOVE_SZ)
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+	spin_unlock_irqrestore(&znd->trim_rwlck, flags);
+
+	if (rc == -EBUSY) {
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		do_move_map_cache_to_table(znd, 0, gfp);
+		goto resubmit;
+	}
+
+	if (m_pg)
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+
+	return rc;
+}
+
+/**
+ * alloc_pg() - Allocate a map page
+ * @znd: ZDM instance
+ * @entry: entry (in mpi table) to update on allocation.
+ * @lba: LBA associated with the page of ZLT
+ * @mpi: Map page information (lookup table entry, bit flags, etc)
+ * @ahead: Flag to set READA flag on page
+ * @gfp: Allocation flags (for _ALLOC)
+ */
+static struct map_pg *alloc_pg(struct zdm *znd, int entry, u64 lba,
+			       struct mpinfo *mpi, int ahead, gfp_t gfp)
+{
+	struct map_pg *found = ZDM_ALLOC(znd, sizeof(*found), KM_20, gfp);
+	unsigned long flags;
+
+	if (found) {
+		found->lba = lba;
+		spin_lock_init(&found->md_lock);
+		set_bit(mpi->bit_dir, &found->flags);
+		set_bit(mpi->bit_type, &found->flags);
+		found->age = jiffies_64;
+		found->index = entry;
+		INIT_LIST_HEAD(&found->zltlst);
+		INIT_LIST_HEAD(&found->lazy);
+		INIT_HLIST_NODE(&found->hentry);
+		found->znd = znd;
+		found->crc_pg = NULL; /* redundant */
+		ref_pg(found);
+		if (ahead)
+			set_bit(IS_READA, &found->flags);
+		init_completion(&found->event);
+		set_bit(IS_ALLOC, &found->flags);
+
+		/*
+		 * Allocation done. Check whether a concurrent allocation
+		 * raced us; if so, drop this page.
+		 */
+		spin_lock_irqsave(mpi->lock, flags);
+		if (!add_htbl_entry(znd, mpi, found))
+			ZDM_FREE(znd, found, sizeof(*found), KM_20);
+		spin_unlock_irqrestore(mpi->lock, flags);
+	} else {
+		Z_ERR(znd, "NO MEM for mapped_t !!!");
+	}
+	return found;
+}
+
+/**
+ * _maybe_undrop() - If a page is on its way out of cache pull it back.
+ * @znd: ZDM instance
+ * @pg: Page to claim
+ *
+ * When a table page is being dropped from the cache it may transition
+ * through the lazy pool. If a page is caught in the lazy pool it is
+ * deemed to be 'warm': not hot enough to be frequently hit but clearly
+ * too warm to be dropped quickly. Give it a boost to keep it in cache
+ * longer.
+ */
+static __always_inline int _maybe_undrop(struct zdm *znd, struct map_pg *pg)
+{
+	int undrop = 0;
+	unsigned long flags;
+
+	if (test_bit(IS_DROPPED, &pg->flags)) {
+		spin_lock_irqsave(&znd->lzy_lck, flags);
+		if (test_bit(IS_DROPPED, &pg->flags) &&
+		    test_bit(IS_LAZY, &pg->flags)) {
+			clear_bit(IS_DROPPED, &pg->flags);
+			set_bit(DELAY_ADD, &pg->flags);
+		}
+		if (pg->data.addr && pg->hotness < znd->cache_ageout_ms)
+			pg->hotness += (znd->cache_ageout_ms >> 1);
+		if (!pg->data.addr) {
+			if (!test_bit(IS_ALLOC, &pg->flags)) {
+				Z_ERR(znd, "Undrop no pg? %"PRIx64, pg->lba);
+				init_completion(&pg->event);
+				set_bit(IS_ALLOC, &pg->flags);
+			}
+		}
+		pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+		undrop = 1;
+		spin_unlock_irqrestore(&znd->lzy_lck, flags);
+	}
+	ref_pg(pg);
+	return undrop;
+}
+
+/**
+ * _load_backing_pages() - Cache a backing page
+ * @znd: ZDM instance
+ * @lba: Logical LBA of page.
+ * @gfp: Memory allocation rule
+ *
+ * When metadata is pooled with data the FWD table lookup can
+ * be recursive (the page needed to resolve the FWD entry is
+ * itself on disk). The recursion is never deep but it can
+ * be avoided or mitigated by keeping such 'key' pages in cache.
+ */
+static int _load_backing_pages(struct zdm *znd, u64 lba, gfp_t gfp)
+{
+	const int raflg = 1;
+	int rc = 0;
+	int entry;
+	unsigned long flags;
+	struct mpinfo mpi;
+	struct map_addr maddr;
+	struct map_pg *found;
+
+	gfp = GFP_ATOMIC;
+	if (lba < znd->data_lba)
+		goto out;
+
+	map_addr_calc(znd, lba, &maddr);
+	entry = to_table_entry(znd, maddr.lut_s, 0, &mpi);
+	if (entry > -1) {
+		spin_lock_irqsave(mpi.lock, flags);
+		found = get_htbl_entry(znd, &mpi);
+		if (found)
+			_maybe_undrop(znd, found);
+		spin_unlock_irqrestore(mpi.lock, flags);
+		if (!found)
+			found = alloc_pg(znd, entry, lba, &mpi, raflg, gfp);
+		if (found) {
+			if (!found->data.addr) {
+				rc = cache_pg(znd, found, gfp, &mpi);
+				if (rc < 0 && rc != -EBUSY)
+					znd->meta_result = rc;
+			}
+			deref_pg(found);
+
+			if (getref_pg(found) != 0)
+				Z_ERR(znd, "Backing page with elevated ref: %u",
+				      getref_pg(found));
+		}
+	}
+
+out:
+	return rc;
+}
+
+/**
+ * _load_crc_page() - Cache a page of CRC
+ * @znd: ZDM instance
+ * @lba: Logical LBA of page.
+ * @gfp: Memory allocation rule
+ *
+ * When a table page is cached the page containing its CRC is also pulled
+ * into cache. Rather than defer it to cache_pg() it's brought into the
+ * cache here.
+ */
+static int _load_crc_page(struct zdm *znd, struct mpinfo *mpi, gfp_t gfp)
+{
+	const int raflg = 0;
+	int rc = 0;
+	int entry;
+	unsigned long flags;
+	struct map_pg *found;
+	u64 base = (mpi->bit_dir == IS_REV) ? znd->c_mid : znd->c_base;
+
+	base += mpi->crc.pg_no;
+	entry = to_table_entry(znd, base, 0, mpi);
+
+	if (mpi->bit_type != IS_CRC)
+		return rc;
+	if (entry >= znd->crc_count)
+		return rc;
+
+	spin_lock_irqsave(mpi->lock, flags);
+	found = get_htbl_entry(znd, mpi);
+	if (found)
+		_maybe_undrop(znd, found);
+	spin_unlock_irqrestore(mpi->lock, flags);
+	if (!found)
+		found = alloc_pg(znd, entry, base, mpi, raflg, GFP_ATOMIC);
+	if (found) {
+		if (!found->data.crc) {
+			rc = cache_pg(znd, found, gfp, mpi);
+			if (rc < 0 && rc != -EBUSY)
+				znd->meta_result = rc;
+		}
+		deref_pg(found);
+	}
+	return rc;
+}
+
+/**
+ * put_map_entry() - Decrement refcount of mapped page.
+ * @pg: mapped page
+ */
+static inline void put_map_entry(struct map_pg *pg)
+{
+	if (pg)
+		deref_pg(pg);
+}
+
+/**
+ * gme_noio() - Get a page of the ZLT only if it is already in memory.
+ * @znd: ZDM instance
+ * @lba: Logical LBA of the map entry
+ */
+static struct map_pg *gme_noio(struct zdm *znd, u64 lba)
+{
+	struct map_pg *found = NULL;
+	struct mpinfo mpi;
+	unsigned long flags;
+	int entry = to_table_entry(znd, lba, 0, &mpi);
+
+	if (entry < 0)
+		goto out;
+
+	spin_lock_irqsave(mpi.lock, flags);
+	found = get_htbl_entry(znd, &mpi);
+	spin_unlock_irqrestore(mpi.lock, flags);
+
+	if (found) {
+		spinlock_t *lock = &found->md_lock;
+
+		spin_lock_irqsave(lock, flags);
+		if (_io_pending(found))
+			found = NULL;
+		else
+			ref_pg(found);
+		spin_unlock_irqrestore(lock, flags);
+	}
+
+out:
+	return found;
+}
+
+/**
+ * do_gme_io() - Find a page of LUT or CRC table map.
+ * @znd: ZDM instance
+ * @lba: Logical LBA of page.
+ * @ra: Number of blocks to read ahead
+ * @async: If cache_pg needs to wait on disk
+ * @gfp: Memory allocation rule
+ *
+ * Return: struct map_pg * or NULL on error.
+ *
+ * Page will be loaded from disk if it is not already in core memory.
+ */
+static struct map_pg *do_gme_io(struct zdm *znd, u64 lba,
+				int ra, int async, gfp_t gfp)
+{
+	struct mpinfo mpi;
+	struct map_pg **ahead;
+	struct map_pg *pg = NULL;
+	int entry;
+	int iter;
+	int range;
+	u32 count;
+	int rc;
+
+	entry = to_table_entry(znd, lba, 0, &mpi);
+	if (entry < 0) {
+		Z_ERR(znd, "%s: bad lba? %llx", __func__, lba);
+		dump_stack();
+		return NULL;
+	}
+
+	ahead = ZDM_ALLOC(znd, PAGE_SIZE, PG_09, gfp);
+	if (!ahead)
+		return NULL;
+
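+	/*
+	 * Clamp read-ahead to what fits in one page of map_pg pointers,
+	 * and reduce it further for shallow queues or when cache memory
+	 * is low.
+	 */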
+	if (ra > MAX_PER_PAGE(ahead))
+		ra = MAX_PER_PAGE(ahead);
+	if (ra > 16 && znd->queue_depth == 0)
+		ra = 16;
+	if (ra > 2 && low_cache_mem(znd))
+		ra = 2;
+
+	if (mpi.bit_type == IS_LUT) {
+		count = znd->map_count;
+
+		if (mpi.bit_dir == IS_FWD)
+			_load_backing_pages(znd, lba, gfp);
+		else if (ra > 4)
+			ra = 4;
+
+		_load_crc_page(znd, &mpi, gfp);
+		range = entry + ra;
+	} else {
+		count = znd->crc_count;
+
+		/* CRCs cover 2k pages, so only pull two extra */
+		range = entry + 2;
+	}
+	if (range > count)
+		range = count;
+
+	iter = 0;
+	while (entry < range) {
+		int want_cached = 1;
+		unsigned long flags;
+		struct map_pg *found;
+
+		entry = to_table_entry(znd, lba, iter, &mpi);
+		if (entry < 0)
+			break;
+
+		spin_lock_irqsave(mpi.lock, flags);
+		found = get_htbl_entry(znd, &mpi);
+		if (found &&
+		    _maybe_undrop(znd, found) &&
+		    test_bit(IS_READA, &found->flags)
+		    && iter > 0)
+			want_cached = 0;
+
+		if (found) {
+			if (want_cached)
+				found->age = jiffies_64
+					   + msecs_to_jiffies(found->hotness);
+			else
+				found->age += msecs_to_jiffies(
+						znd->cache_ageout_ms >> 1);
+		}
+		spin_unlock_irqrestore(mpi.lock, flags);
+
+		if (want_cached && !found)
+			found = alloc_pg(znd, entry, lba, &mpi, iter, gfp);
+
+		if (found) {
+			if (want_cached)
+				ahead[iter] = found;
+			else
+				deref_pg(found);
+		}
+		iter++;
+		entry++;
+		lba++;
+	}
+	ra = iter;
+
+	/*
+	 * Each entry in ahead has an elevated refcount.
+	 * Only allow the target of do_gme_io() to remain elevated.
+	 */
+	for (iter = 0; iter < ra; iter++) {
+		pg = ahead[iter];
+
+		if (pg) {
+			if (!pg->data.addr) {
+				to_table_entry(znd, pg->lba, iter, &mpi);
+				rc = cache_pg(znd, pg, gfp, &mpi);
+				if (rc < 0) {
+					ahead[iter] = NULL;
+					if (iter == 0 && pg->io_count > 2) {
+						znd->meta_result = pg->io_error;
+						Z_ERR(znd,
+						    "%s: cache_pg failed? %llx",
+						    __func__, lba);
+						dump_stack();
+					}
+				}
+			}
+			if (iter > 0)
+				deref_pg(pg);
+		}
+	}
+
+	pg = ahead[0];
+	if (pg && !async) {
+		/*
+		 * if ahead[0] is queued but not yet available ... wait for
+		 * the io to complete ..
+		 */
+		rc = wait_for_map_pg(znd, pg, gfp);
+		if (rc == 0) {
+			pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+		} else {
+			deref_pg(pg);
+			pg = NULL;
+		}
+	}
+
+	if (ahead)
+		ZDM_FREE(znd, ahead, PAGE_SIZE, PG_09);
+
+	return pg;
+}
+
+/**
+ * metadata_dirty_fling() - Force a ZLT block into cache and flag it dirty.
+ * @znd: ZDM Instance
+ * @addr: Current lba to consider.
+ * @gfp: Allocation mask.
+ *
+ * Used when data and ZDM's metadata are co-mingled. If @addr is a block
+ * of ZDM's metadata it needs to be relocated. Since we only relocate
+ * blocks that are dirty and in the cache, if this block is metadata,
+ * force it into the cache and flag it as dirty.
+ */
+static int metadata_dirty_fling(struct zdm *znd, u64 addr, gfp_t gfp)
+{
+	struct map_pg *pg = NULL;
+	unsigned long flags;
+	int is_flung = 0;
+	const int noio = 0;
+
+	/*
+	 * Blocks at or after the start of the data pool are never ZDM
+	 * metadata, so there is nothing to fling for them.
+	 */
+	if (addr >= znd->data_lba)
+		return is_flung;
+
+	if (z_lookup_journal_cache(znd, addr) == 0)
+		return is_flung;
+
+	pg = get_map_entry(znd, addr, 4, 0, noio, gfp);
+	if (pg) {
+		ref_pg(pg);
+		wait_for_map_pg(znd, pg, gfp);
+	}
+	if (pg && pg->data.addr) {
+		if (!test_bit(IS_DIRTY, &pg->flags)) {
+			if (!test_bit(IN_WB_JOURNAL, &pg->flags))
+				Z_ERR(znd, "!JOURNAL flagged: %llx [%llx]",
+					pg->lba, pg->last_write);
+
+			spin_lock_irqsave(&pg->md_lock, flags);
+			pg->age = jiffies_64;
+			clear_bit(IS_READA, &pg->flags);
+			set_bit(IS_DIRTY, &pg->flags);
+			clear_bit(IS_FLUSH, &pg->flags);
+			is_flung = 1;
+			spin_unlock_irqrestore(&pg->md_lock, flags);
+		}
+	} else {
+		is_flung = -EAGAIN;
+	}
+
+	if (pg)
+		deref_pg(pg);
+	put_map_entry(pg);
+
+	return is_flung;
+}
+
+/**
+ * z_do_copy_more() - GC transition to read more blocks.
+ * @gc_state: GC State to be updated.
+ */
+static inline void z_do_copy_more(struct gc_state *gc_entry)
+{
+	unsigned long flags;
+	struct zdm *znd = gc_entry->znd;
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	set_bit(DO_GC_READ, &gc_entry->gc_flags);
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+}
+
+/**
+ * gc_post_add() - Add a tLBA and current bLBA origin.
+ * @znd: ZDM Instance
+ * @addr: tLBA
+ * @lba: bLBA
+ *
+ * Return: 1 if tLBA is added, 0 if block was stale.
+ *
+ * Stale block checks are performed before tLBA is added.
+ * Add a non-stale block to the list of blocks for moving and
+ * metadata updating.
+ */
+static int gc_post_add(struct zdm *znd, u64 addr, u64 lba)
+{
+	struct gc_map_cache *post = &znd->gc_postmap;
+	int handled = 0;
+
+	if (post->jcount < post->jsize) {
+		struct gc_map_cache_data *data = post->gc_mcd;
+
+		data->maps[post->jcount].tlba = lba48_to_le64(0, addr);
+		data->maps[post->jcount].bval = lba48_to_le64(1, lba);
+		post->jcount++;
+		handled = 1;
+	} else {
+		Z_ERR(znd, "*CRIT* post overflow L:%" PRIx64 "-> S:%" PRIx64,
+		      lba, addr);
+	}
+	return handled;
+}
+
+/**
+ * z_zone_gc_metadata_to_ram() - Load affected metadata blocks to ram.
+ * @gc_entry: Compaction event in progress
+ *
+ * Return: 0, otherwise errno.
+ *
+ * Use the reverse ZLT to find the forward ZLT entries that need to be
+ * remapped in this zone.
+ * When complete the znd->gc_postmap have a map of all the non-stale
+ * blocks remaining in the zone.
+ */
+static int z_zone_gc_metadata_to_ram(struct gc_state *gc_entry)
+{
+	struct zdm *znd = gc_entry->znd;
+	u64 from_lba = (gc_entry->z_gc << Z_BLKBITS) + znd->md_end;
+	struct map_pg *rev_pg = NULL;
+	struct map_pg *fwd_pg = NULL;
+	struct map_cache_page *_pg = NULL;
+	struct map_addr ori;
+	unsigned long flags;
+	unsigned long tflgs;
+	unsigned long iflgs;
+	unsigned long uflgs;
+	unsigned long jflgs;
+	int idx;
+	const int async = 0;
+	int noio = 0;
+	int cpg = 0;
+	int err = 0;
+	int md;
+	gfp_t gfp = GFP_KERNEL;
+	struct map_pg **pgs = NULL;
+	u64 lut_r = BAD_ADDR;
+	u64 lut_s = BAD_ADDR;
+
+	pgs = ZDM_CALLOC(znd, sizeof(*pgs), MAX_WSET, KM_19, gfp);
+	if (!pgs) {
+		err = -ENOMEM;
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		goto out;
+	}
+
+	/* pull all of the affected struct map_pg and crc pages into memory: */
+	for (idx = 0; idx < Z_BLKSZ; idx++) {
+		u64 blba = from_lba + idx;
+		u64 tlba = 0ul;
+		u64 update;
+		__le32 ORencoded;
+
+		spin_lock_irqsave(&znd->wbjrnl_rwlck, jflgs);
+		spin_lock_irqsave(&znd->in_rwlck, iflgs);
+		update = backref_cache(znd, blba);
+		spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+		spin_unlock_irqrestore(&znd->wbjrnl_rwlck, jflgs);
+
+		map_addr_calc(znd, blba, &ori);
+		if (lut_r != ori.lut_r) {
+			if (rev_pg)
+				deref_pg(rev_pg);
+			put_map_entry(rev_pg);
+			rev_pg = get_map_entry(znd, ori.lut_r, 4, async,
+					       noio, gfp);
+			if (!rev_pg) {
+				err = -ENOMEM;
+				Z_ERR(znd, "%s: ENOMEM @ %d",
+				      __func__, __LINE__);
+				goto out;
+			}
+
+			if (!rev_pg->data.addr) {
+				struct mpinfo mpi;
+
+				to_table_entry(znd, rev_pg->lba, 0, &mpi);
+				cache_pg(znd, rev_pg, gfp, &mpi);
+			}
+
+			if (_io_pending(rev_pg)) {
+				Z_ERR(znd, "*** gme: IO PENDING: %" PRIx64
+					   " (R:%" PRIx64
+					   ") [Flgs %lx] -> failed.",
+					blba, ori.lut_r, rev_pg->flags);
+				wait_for_map_pg(znd, rev_pg, gfp);
+			}
+			if (pgs && cpg < MAX_WSET) {
+				pgs[cpg] = rev_pg;
+				ref_pg(rev_pg);
+				cpg++;
+			}
+			ref_pg(rev_pg);
+			lut_r = rev_pg->lba;
+		}
+
+		if (update) {
+			tlba = update;
+		} else if (rev_pg && rev_pg->data.addr) {
+			ref_pg(rev_pg);
+			spin_lock_irqsave(&rev_pg->md_lock, flags);
+			ORencoded = rev_pg->data.addr[ori.pg_idx];
+			spin_unlock_irqrestore(&rev_pg->md_lock, flags);
+			if (ORencoded != MZTEV_UNUSED)
+				tlba = map_value(znd, ORencoded);
+			deref_pg(rev_pg);
+		}
+
+		if (!tlba)
+			continue;
+
+		map_addr_calc(znd, tlba, &ori);
+		if (lut_s != ori.lut_s) {
+			if (fwd_pg)
+				deref_pg(fwd_pg);
+			put_map_entry(fwd_pg);
+			fwd_pg = get_map_entry(znd, ori.lut_s, 4, async,
+					       noio, gfp);
+			if (fwd_pg && !_io_pending(rev_pg)) {
+				if (pgs && cpg < MAX_WSET) {
+					pgs[cpg] = fwd_pg;
+					ref_pg(fwd_pg);
+					cpg++;
+				}
+				ref_pg(fwd_pg);
+				lut_s = fwd_pg->lba;
+			}
+		}
+		md = metadata_dirty_fling(znd, tlba, gfp);
+		if (md < 0) {
+			err = md;
+			goto out;
+		}
+	}
+
+	cpg = ref_crc_pgs(pgs, cpg, MAX_WSET);
+
+	/* all pages are resident; build the postmap under the cache locks: */
+	spin_lock_irqsave(&znd->trim_rwlck, tflgs);
+	spin_lock_irqsave(&znd->in_rwlck, iflgs);
+	spin_lock_irqsave(&znd->unused_rwlck, uflgs);
+	spin_lock_irqsave(&znd->gc_postmap.cached_lock, flags);
+
+	noio = 1;
+	for (idx = 0; idx < Z_BLKSZ; idx++) {
+		__le32 ORencoded;
+		u64 blba = from_lba + idx;
+		u64 update = backref_cache(znd, blba);
+		u64 tlba = 0ul;
+
+		map_addr_calc(znd, blba, &ori);
+		rev_pg = get_map_entry(znd, ori.lut_r, 4, async, noio, gfp);
+		if (!rev_pg) {
+			err = -EAGAIN;
+
+			Z_ERR(znd, "*** gme: %llx (R:%llx) -> failed.",
+				blba, ori.lut_r);
+
+			znd->gc_postmap.jcount = 0;
+			znd->gc_postmap.jsorted = 0;
+			spin_unlock_irqrestore(
+				&znd->gc_postmap.cached_lock, flags);
+			spin_unlock_irqrestore(&znd->unused_rwlck, uflgs);
+			spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+			spin_unlock_irqrestore(&znd->trim_rwlck, tflgs);
+
+			rev_pg = get_map_entry(znd, ori.lut_r, 4,
+					       async, 0, gfp);
+			if (!rev_pg)
+				goto out;
+
+			spin_lock_irqsave(&znd->trim_rwlck, tflgs);
+			spin_lock_irqsave(&znd->in_rwlck, iflgs);
+			spin_lock_irqsave(&znd->unused_rwlck, uflgs);
+			spin_lock_irqsave(&znd->gc_postmap.cached_lock, flags);
+		}
+
+		if (update) {
+			u64 conf = blba;
+			u64 jlba = 0ul;
+
+			if (update < znd->data_lba)
+				jlba = z_lookup_journal_cache_nlck(znd, update);
+			else
+				conf = z_lookup_ingress_cache_nlck(znd, update);
+
+			if (jlba)
+				conf = jlba;
+
+			if (conf != blba) {
+				Z_ERR(znd, "*** BACKREF BAD!!");
+				Z_ERR(znd, "*** BREF %llx -> %llx -> %llx",
+					blba, update, conf);
+				Z_ERR(znd, "*** BACKREF BAD!!");
+			}
+			tlba = update;
+		} else if (rev_pg && rev_pg->data.addr) {
+			unsigned long flags;
+
+			ref_pg(rev_pg);
+			spin_lock_irqsave(&rev_pg->md_lock, flags);
+			ORencoded = rev_pg->data.addr[ori.pg_idx];
+			spin_unlock_irqrestore(&rev_pg->md_lock, flags);
+
+			if (ORencoded != MZTEV_UNUSED)
+				tlba = map_value(znd, ORencoded);
+			deref_pg(rev_pg);
+		}
+		put_map_entry(rev_pg);
+
+		if (!tlba)
+			continue;
+		if (z_lookup_trim_cache_nlck(znd, tlba))
+			continue;
+		if (tlba < znd->data_lba &&
+		    z_lookup_journal_cache_nlck(znd, tlba))
+			continue;
+		update = z_lookup_ingress_cache_nlck(znd, tlba);
+		if (update == ~0ul)
+			continue;
+		if (update && update != blba)
+			continue;
+		if (z_lookup_unused_cache_nlck(znd, blba))
+			continue;
+		gc_post_add(znd, tlba, blba);
+	}
+	gc_sort_lba(znd, &znd->gc_postmap);
+	spin_unlock_irqrestore(&znd->gc_postmap.cached_lock, flags);
+	spin_unlock_irqrestore(&znd->unused_rwlck, uflgs);
+	spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+	spin_unlock_irqrestore(&znd->trim_rwlck, tflgs);
+
+	if (znd->gc_postmap.jcount == Z_BLKSZ) {
+		struct gc_map_cache *post = &znd->gc_postmap;
+		struct gc_map_cache_data *data = post->gc_mcd;
+		u64 addr;
+		u64 blba;
+		u64 curr;
+		u32 count;
+		int at;
+
+		_pg = ZDM_ALLOC(znd, sizeof(*_pg), PG_09, gfp);
+		err = -EBUSY;
+		for (idx = 0; idx < post->jcount; idx++) {
+			addr = le64_to_lba48(data->maps[idx].tlba, NULL);
+			blba = le64_to_lba48(data->maps[idx].bval, &count);
+
+			if (addr == Z_LOWER48 || addr == 0ul) {
+				Z_ERR(znd, "Bad GC Add of bogus source");
+				continue;
+			}
+
+			curr = current_mapping(znd, addr, gfp);
+			if (curr != blba) {
+				addr = Z_LOWER48;
+				data->maps[idx].tlba = lba48_to_le64(0, addr);
+				data->maps[idx].bval = lba48_to_le64(0, 0ul);
+				err = 0;
+				continue;
+			}
+			if (_pg) {
+				at = _common_intersect(znd->trim, addr, 1, _pg);
+				if (at)
+					Z_ERR(znd, "GC ADD: TRIM Lookup FAIL:"
+						   " found via isect %d", at);
+			}
+		}
+	}
+
+out:
+	if (_pg)
+		ZDM_FREE(znd, _pg, Z_C4K, PG_09);
+
+	if (pgs) {
+		deref_all_pgs(znd, pgs, MAX_WSET);
+		ZDM_FREE(znd, pgs, sizeof(*pgs) * MAX_WSET, KM_19);
+	}
+
+	if (fwd_pg)
+		deref_pg(fwd_pg);
+	put_map_entry(fwd_pg);
+
+	if (rev_pg)
+		deref_pg(rev_pg);
+	put_map_entry(rev_pg);
+
+	return err;
+}
+
+/**
+ * append_blks() - Read (more) blocks into buffer.
+ * @znd: ZDM Instance
+ * @lba: Starting blba
+ * @io_buf: Buffer to read into
+ * @count: Number of blocks to read.
+ */
+static int append_blks(struct zdm *znd, u64 lba,
+		       struct io_4k_block *io_buf, int count)
+{
+	int rcode = 0;
+	int rc;
+	u32 chunk;
+	struct io_4k_block *io_vcache;
+
+	MutexLock(&znd->gc_vcio_lock);
+	io_vcache = get_io_vcache(znd, GFP_KERNEL);
+	if (!io_vcache) {
+		Z_ERR(znd, "%s: FAILED to get SYNC CACHE.", __func__);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	for (chunk = 0; chunk < count; chunk += IO_VCACHE_PAGES) {
+		u32 nblks = count - chunk;
+
+		if (nblks > IO_VCACHE_PAGES)
+			nblks = IO_VCACHE_PAGES;
+
+		rc = read_block(znd, DM_IO_VMA, io_vcache, lba, nblks, 0);
+		if (rc) {
+			Z_ERR(znd, "Reading error ... disable zone: %u",
+				(u32)(lba >> 16));
+			rcode = -EIO;
+			goto out;
+		}
+		memcpy(&io_buf[chunk], io_vcache, nblks * Z_C4K);
+		lba += nblks;
+	}
+out:
+	put_io_vcache(znd, io_vcache);
+	mutex_unlock(&znd->gc_vcio_lock);
+
+	return rcode;
+}
+
+/**
+ * set_gc_read_flag() - Flag a GC entry as 'read'
+ * @entry: Map entry
+ */
+static inline void set_gc_read_flag(struct map_cache_entry *entry)
+{
+	u32 flgs;
+	u64 addr;
+
+	addr = le64_to_lba48(entry->tlba, &flgs);
+	flgs |= GC_READ;
+	entry->tlba = lba48_to_le64(flgs, addr);
+}
+
+/**
+ * is_valid_and_not_dropped() - Check that a GC map entry is still live.
+ * @mce: Map entry
+ */
+static bool is_valid_and_not_dropped(struct map_cache_entry *mce)
+{
+	u32 num;
+	u32 tflg;
+	u64 addr = le64_to_lba48(mce->tlba, &tflg);
+	u64 targ = le64_to_lba48(mce->bval, &num);
+
+	return (targ != Z_LOWER48 && addr != Z_LOWER48 &&
+		num != 0 && !(tflg & GC_DROP));
+}
+
+/**
+ * z_zone_gc_read() - Read (up to) a buffer worth of data from zone.
+ * @gc_entry: Active GC state
+ */
+static int z_zone_gc_read(struct gc_state *gc_entry)
+{
+	struct zdm *znd = gc_entry->znd;
+	struct io_4k_block *io_buf = znd->gc_io_buf;
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *mcd = post->gc_mcd;
+	unsigned long flags;
+	unsigned long gflgs;
+	u64 start_lba;
+	u64 lba;
+	u32 num;
+	int nblks;
+	int rcode = 0;
+	int fill = 0;
+	int idx;
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	idx = gc_entry->r_ptr;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+	spin_lock_irqsave(&post->cached_lock, gflgs);
+	do {
+		nblks = 0;
+
+		while (idx < post->jcount) {
+			if (is_valid_and_not_dropped(&mcd->maps[idx]))
+				break;
+			idx++;
+		}
+		if (idx >= post->jcount)
+			goto out_finished;
+
+		/* schedule the first block */
+		start_lba = le64_to_lba48(mcd->maps[idx].bval, &num);
+		set_gc_read_flag(&mcd->maps[idx]);
+		nblks = 1;
+		idx++;
+
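+		/*
+		 * Extend the read while the following postmap entry is
+		 * still live and physically contiguous with the blocks
+		 * gathered so far, up to the stripe buffer limit.
+		 */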
+		while (idx < post->jcount && (nblks+fill) < GC_MAX_STRIPE) {
+			bool nothing_to_add = true;
+
+			if (is_valid_and_not_dropped(&mcd->maps[idx])) {
+				lba = le64_to_lba48(mcd->maps[idx].bval, &num);
+				if (lba == (start_lba + nblks)) {
+					set_gc_read_flag(&mcd->maps[idx]);
+					nblks++;
+					idx++;
+					nothing_to_add = false;
+				}
+			}
+			if (nothing_to_add)
+				break;
+		}
+		if (nblks) {
+			int err;
+
+			spin_lock_irqsave(&znd->gc_lock, flags);
+			gc_entry->r_ptr = idx;
+			spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+			spin_unlock_irqrestore(&post->cached_lock, gflgs);
+			err = append_blks(znd, start_lba, &io_buf[fill], nblks);
+			spin_lock_irqsave(&post->cached_lock, gflgs);
+			if (err) {
+				rcode = err;
+				goto out;
+			}
+			fill += nblks;
+		}
+	} while (fill < GC_MAX_STRIPE);
+
+out_finished:
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	gc_entry->nblks = fill;
+	gc_entry->r_ptr = idx;
+	if (fill > 0)
+		set_bit(DO_GC_WRITE, &gc_entry->gc_flags);
+	else
+		set_bit(DO_GC_MD_SYNC, &gc_entry->gc_flags);
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+out:
+	spin_unlock_irqrestore(&post->cached_lock, gflgs);
+
+	return rcode;
+}
+
+/**
+ * z_zone_gc_write() - Write (up to) a buffer worth of data to WP.
+ * @gc_entry: Active GC state
+ * @stream_id: Stream Id to prefer for allocation.
+ */
+static int z_zone_gc_write(struct gc_state *gc_entry, u32 stream_id)
+{
+	struct zdm *znd = gc_entry->znd;
+	struct io_4k_block *io_buf = znd->gc_io_buf;
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *mcd = post->gc_mcd;
+	unsigned long flags;
+	unsigned long clflgs;
+	u32 aq_flags = Z_AQ_GC | Z_AQ_STREAM_ID | stream_id;
+	u64 lba;
+	u32 nblks;
+	u32 n_out = 0;
+	u32 updated;
+	u32 avail;
+	u32 nwrt;
+	int err = 0;
+	int idx;
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	idx = gc_entry->w_ptr;
+	nblks = gc_entry->nblks;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+	spin_lock_irqsave(&post->cached_lock, clflgs);
+	while (n_out < nblks) {
+		const enum dm_io_mem_type io = DM_IO_VMA;
+		const unsigned int oflg = REQ_PRIO;
+
+		spin_unlock_irqrestore(&post->cached_lock, clflgs);
+		/*
+		 * When lba is zero blocks were not allocated.
+		 * Retry with the smaller request
+		 */
+		avail = nblks - n_out;
+		do {
+			nwrt = 0;
+			lba = z_acquire(znd, aq_flags, avail, &nwrt);
+			if (!lba && !nwrt) {
+				err = -ENOSPC;
+				goto out;
+			}
+			avail = nwrt;
+		} while (!lba && nwrt);
+
+		err = writef_block(znd, io, &io_buf[n_out], lba, oflg, nwrt, 0);
+		spin_lock_irqsave(&post->cached_lock, clflgs);
+		if (err) {
+			Z_ERR(znd, "Write %d blocks to %"PRIx64". ERROR: %d",
+			      nwrt, lba, err);
+			goto out;
+		}
+
+		/*
+		 * nwrt blocks were written starting from lba ...
+		 * update the postmap to point to the new lba(s)
+		 */
+		updated = 0;
+		while (idx < post->jcount && updated < nwrt) {
+			u32 num;
+			u32 rflg;
+			u64 addr = le64_to_lba48(mcd->maps[idx].tlba, &rflg);
+
+			(void)le64_to_lba48(mcd->maps[idx].bval, &num);
+			if (rflg & GC_READ) {
+				rflg |= GC_WROTE;
+				mcd->maps[idx].tlba = lba48_to_le64(rflg, addr);
+				mcd->maps[idx].bval = lba48_to_le64(num, lba);
+				lba++;
+				updated++;
+			}
+			idx++;
+		}
+		spin_lock_irqsave(&znd->gc_lock, flags);
+		gc_entry->w_ptr = idx;
+		spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+		n_out += nwrt;
+
+		if (updated < nwrt && idx >= post->jcount) {
+			Z_ERR(znd, "GC: Failed accounting: %d/%d. Map %d/%d",
+				updated, nwrt, idx, post->jcount);
+		}
+	}
+	Z_DBG(znd, "Write %d blocks from %d", gc_entry->nblks, gc_entry->w_ptr);
+	set_bit(DO_GC_CONTINUE, &gc_entry->gc_flags);
+
+out:
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	gc_entry->nblks = 0;
+	gc_entry->w_ptr = idx;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+	spin_unlock_irqrestore(&post->cached_lock, clflgs);
+
+	return err;
+}
+
+/**
+ * gc_finalize() - Final sanity check on GC'd block map.
+ * @gc_entry: Active GC state
+ *
+ * gc_postmap is expected to be empty (all blocks originally
+ * scheduled to be moved to a new zone have been accounted for).
+ */
+static int gc_finalize(struct gc_state *gc_entry)
+{
+	unsigned long clflgs;
+	struct zdm *znd = gc_entry->znd;
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *mcd = post->gc_mcd;
+	u64 addr;
+	u64 lba;
+	u32 flgs;
+	u32 count;
+	int err = 0;
+	int idx;
+	int entries;
+
+	spin_lock_irqsave(&post->cached_lock, clflgs);
+	entries = post->jcount;
+	post->jcount = 0;
+	post->jsorted = 0;
+	for (idx = 0; idx < entries; idx++) {
+		if (mcd->maps[idx].tlba == MC_INVALID)
+			continue;
+
+		addr = le64_to_lba48(mcd->maps[idx].tlba, &flgs);
+		lba = le64_to_lba48(mcd->maps[idx].bval, &count);
+
+		if (lba == Z_LOWER48 || count == 0)
+			continue;
+		if (addr == Z_LOWER48 || (flgs & GC_DROP))
+			continue;
+
+		Z_ERR(znd, "GC: Failed to move %"PRIx64" from %"PRIx64
+			   " {flgs: %x %s%s%s} [%d]",
+		      addr, lba, flgs,
+		      flgs & GC_READ ? "r" : "",
+		      flgs & GC_WROTE ? "w" : "",
+		      flgs & GC_DROP ? "X" : "",
+		      idx);
+	}
+	spin_unlock_irqrestore(&post->cached_lock, clflgs);
+
+	return err;
+}
+
+/**
+ * clear_gc_target_flag() - Clear any zone tagged as a GC target.
+ * @znd: ZDM Instance
+ *
+ * FIXME: Can we reduce the weight of this ?
+ *	Ex. execute as zones are closed and specify the zone to clear
+ *	    at GC completion/cleanup.
+ */
+static void clear_gc_target_flag(struct zdm *znd)
+{
+	unsigned long flags;
+	int z_id;
+
+	for (z_id = znd->dz_start; z_id < znd->zone_count; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp;
+
+		spin_lock_irqsave(&wpg->wplck, flags);
+		wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		if (wp & Z_WP_GC_TARGET) {
+			wp &= ~Z_WP_GC_TARGET;
+			wpg->wp_alloc[gzoff] = cpu_to_le32(wp);
+		}
+		set_bit(IS_DIRTY, &wpg->flags);
+		clear_bit(IS_FLUSH, &wpg->flags);
+		spin_unlock_irqrestore(&wpg->wplck, flags);
+	}
+}
+
+/**
+ * z_zone_gc_metadata_update() - Update ZLT as needed.
+ * @gc_entry: Active GC state
+ *
+ * Dispose or account for all blocks originally scheduled to be
+ * moved. Update ZLT (via map cache) for all moved blocks.
+ */
+static int z_zone_gc_metadata_update(struct gc_state *gc_entry)
+{
+	struct zdm *znd = gc_entry->znd;
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *mcd = post->gc_mcd;
+	unsigned long clflgs;
+	u32 used = post->jcount;
+	int err = 0;
+	int idx;
+
+	for (idx = 0; idx < post->jcount; idx++) {
+		int discard = 0;
+		int mapping = 0;
+		struct map_pg *mapped = NULL;
+		u32 num;
+		u32 tflg;
+		u64 addr = le64_to_lba48(mcd->maps[idx].tlba, &tflg);
+		u64 lba = le64_to_lba48(mcd->maps[idx].bval, &num);
+		struct mpinfo mpi;
+
+		if ((znd->s_base <= addr) && (addr < znd->md_end)) {
+			if (to_table_entry(znd, addr, 0, &mpi) >= 0)
+				mapped = get_htbl_entry(znd, &mpi);
+			mapping = 1;
+		}
+
+		if (mapping && !mapped)
+			Z_ERR(znd, "MD: addr: %" PRIx64 " -> lba: %" PRIx64
+				   " no mapping in ram.", addr, lba);
+
+		if (mapped) {
+			unsigned long flgs;
+			u32 in_z;
+
+			ref_pg(mapped);
+			spin_lock_irqsave(&mapped->md_lock, flgs);
+			in_z = _calc_zone(znd, mapped->last_write);
+			if (in_z != gc_entry->z_gc) {
+				Z_ERR(znd, "MD: %" PRIx64
+				      " Discarded - %" PRIx64
+				      " already flown to: %x",
+				      addr, mapped->last_write, in_z);
+				discard = 1;
+			} else if (mapped->data.addr &&
+				   test_bit(IS_DIRTY, &mapped->flags)) {
+				Z_ERR(znd,
+				      "MD: %" PRIx64 " Discarded - %"PRIx64
+				      " is in-flight",
+				      addr, mapped->last_write);
+				discard = 2;
+			}
+			if (!discard)
+				mapped->last_write = lba;
+			spin_unlock_irqrestore(&mapped->md_lock, flgs);
+			deref_pg(mapped);
+		}
+
+		spin_lock_irqsave(&post->cached_lock, clflgs);
+		if (discard == 1) {
+			Z_ERR(znd, "Dropped: %" PRIx64 " ->  %"PRIx64,
+			      addr, lba);
+			tflg |= GC_DROP;
+			mcd->maps[idx].tlba = lba48_to_le64(tflg, addr);
+			mcd->maps[idx].bval = lba48_to_le64(0, lba);
+		}
+		if (tflg & GC_DROP)
+			used--;
+		else if (lba && num)
+			increment_used_blks(znd, lba, 1);
+
+		spin_unlock_irqrestore(&post->cached_lock, clflgs);
+	}
+	return err;
+}
+
+/**
+ * z_zone_gc_metadata_zlt() - Push metadata to lookup table.
+ * @gc_entry: GC Entry
+ */
+static int z_zone_gc_metadata_zlt(struct gc_state *gc_entry)
+{
+	struct zdm *znd = gc_entry->znd;
+	struct gc_map_cache *post = &znd->gc_postmap;
+	struct gc_map_cache_data *mcd = post->gc_mcd;
+	u64 addr;
+	u64 lba;
+	u32 num;
+	u32 tflg;
+	int err = 0;
+	int idx;
+	const bool gc = false;
+	const gfp_t gfp = GFP_KERNEL;
+
+	for (idx = 0; idx < post->jcount; idx++) {
+		if (mcd->maps[idx].tlba == MC_INVALID)
+			continue;
+		addr = le64_to_lba48(mcd->maps[idx].tlba, &tflg);
+		lba = le64_to_lba48(mcd->maps[idx].bval, &num);
+		if (!num || addr == Z_LOWER48 || lba == Z_LOWER48 ||
+		    !(tflg & GC_READ) || !(tflg & GC_WROTE))
+			continue;
+		if (lba) {
+			if (tflg & GC_DROP)
+				err = unused_add(znd, lba, addr, num, gfp);
+			else
+				err = do_add_map(znd, addr, lba, num, gc, gfp);
+			if (err)
+				Z_ERR(znd, "ReIngress Post GC failure");
+
+			Z_DBG(znd, "GC Add: %llx -> %llx (%u) [%d] %s-> %d",
+			      addr, lba, num, idx,
+			      (tflg & GC_DROP) ? "U " : "", err);
+
+		}
+		/* mark entry as handled */
+		mcd->maps[idx].tlba = lba48_to_le64(0, Z_LOWER48);
+
+		if (test_bit(DO_MAPCACHE_MOVE, &znd->flags)) {
+			if (mutex_is_locked(&znd->mz_io_mutex))
+				return -EAGAIN;
+
+			if (do_move_map_cache_to_table(znd, 0, gfp))
+				Z_ERR(znd, "Move to tables post GC failure");
+		}
+	}
+	clear_gc_target_flag(znd);
+
+	return err;
+}
+
+/**
+ * _blkalloc() - Attempt to reserve blocks at z_at in ZDM znd
+ * @znd:	 ZDM instance.
+ * @z_at:	 Zone to write data to
+ * @flags:	 Acquisition type.
+ * @nblks:	 Number of blocks desired.
+ * @nfound:	 Number of blocks allocated or available.
+ *
+ * Attempt allocation of @nblks from the current WP of @z_at.
+ * When @nblks are not available 0 is returned and @nfound contains
+ * the number of blocks *available* but not *allocated*.
+ * When @nblks are available the starting LBA in 4k space is returned,
+ * the blocks are *allocated* and @nfound is set to @nblks.
+ *
+ * Return: LBA if request is met, otherwise 0. nfound will contain the
+ *	 available blocks remaining.
+ */
+static sector_t _blkalloc(struct zdm *znd, u32 z_at, u32 flags,
+			  u32 nblks, u32 *nfound)
+{
+#define ALLOC_STICKY (Z_WP_GC_TARGET|Z_WP_NON_SEQ|Z_WP_RRECALC)
+	unsigned long flgs;
+	sector_t found = 0;
+	u32 avail = 0;
+	int do_open_zone = 0;
+	u32 gzno  = z_at >> GZ_BITS;
+	u32 gzoff = z_at & GZ_MMSK;
+	struct meta_pg *wpg = &znd->wp[gzno];
+	u32 wp;
+	u32 wptr;
+	u32 gc_tflg;
+
+	if (gzno >= znd->gz_count || z_at >= znd->zone_count) {
+		Z_ERR(znd, "Invalid zone for allocation: %u", z_at);
+		dump_stack();
+		return 0ul;
+	}
+
+	spin_lock_irqsave(&wpg->wplck, flgs);
+	wp      = le32_to_cpu(wpg->wp_alloc[gzoff]);
+	gc_tflg = wp & ALLOC_STICKY;
+	wptr    = wp & ~ALLOC_STICKY;
+	if (wptr < Z_BLKSZ)
+		avail = Z_BLKSZ - wptr;
+
+#if 0 /* DEBUG START: Testing zm_write_pages() */
+	if (avail > 7)
+		avail = 7;
+#endif /* DEBUG END: Testing zm_write_pages() */
+
+	*nfound = avail;
+	if (nblks <= avail) {
+		u64 lba = ((u64)z_at << Z_BLKBITS) + znd->md_start;
+		u32 zf_est = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+
+		found = lba + wptr;
+		*nfound = nblks;
+		if (wptr == 0)
+			do_open_zone = 1;
+
+		wptr += nblks;
+		zf_est -= nblks;
+		if (wptr == Z_BLKSZ)
+			znd->discard_count += zf_est;
+
+		wptr |= gc_tflg;
+		if (flags & Z_AQ_GC)
+			wptr |= Z_WP_GC_TARGET;
+
+		if (flags & Z_AQ_STREAM_ID)
+			zf_est |= (flags & Z_AQ_STREAM_MASK) << 24;
+		else
+			zf_est |= le32_to_cpu(wpg->zf_est[gzoff])
+				& Z_WP_STREAM_MASK;
+
+		wpg->wp_alloc[gzoff] = cpu_to_le32(wptr);
+		wpg->zf_est[gzoff] = cpu_to_le32(zf_est);
+		set_bit(IS_DIRTY, &wpg->flags);
+		clear_bit(IS_FLUSH, &wpg->flags);
+	}
+	spin_unlock_irqrestore(&wpg->wplck, flgs);
+
+	if (do_open_zone)
+		dmz_open_zone(znd, z_at);
+
+	return found;
+}
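+
+/*
+ * Example (illustrative sketch, not an actual caller in this patch):
+ * asking _blkalloc() for 8 blocks from zone z_at:
+ *
+ *	u32 avail = 0;
+ *	sector_t lba = _blkalloc(znd, z_at, Z_AQ_NORMAL, 8, &avail);
+ *
+ * On success lba is non-zero, the 8 blocks are reserved at the zone WP
+ * and avail == 8. Otherwise lba is 0, nothing is reserved, and avail
+ * holds the number of blocks still open in z_at, so a caller retries
+ * with nblks <= avail or moves on to another zone (see z_acquire()).
+ */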
+
+/**
+ * update_stale_ratio() - Update the stale ratio for the bin containing a zone.
+ * @znd: ZDM instance
+ * @zone: Zone whose bin needs the update.
+ */
+static void update_stale_ratio(struct zdm *znd, u32 zone)
+{
+	u64 total_stale = 0;
+	u64 free_zones = 1;
+	unsigned long flgs;
+	u32 bin = zone / znd->stale.binsz;
+	u32 z_id = bin * znd->stale.binsz;
+	u32 s_end = z_id + znd->stale.binsz;
+
+	if (s_end > znd->zone_count)
+		s_end = znd->zone_count;
+
+	for (; z_id < s_end; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 stale = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+		u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_VALUE_MASK;
+		u32 wflg = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+		if (wflg & Z_WP_RRECALC) {
+			spin_lock_irqsave(&wpg->wplck, flgs);
+			wflg = le32_to_cpu(wpg->wp_alloc[gzoff])
+			     & ~Z_WP_RRECALC;
+			wpg->wp_alloc[gzoff] = cpu_to_le32(wflg);
+			spin_unlock_irqrestore(&wpg->wplck, flgs);
+		}
+
+		if (wp == Z_BLKSZ)
+			total_stale += stale;
+		else
+			free_zones++;
+	}
+
+	total_stale /= free_zones;
+	znd->stale.bins[bin] = (total_stale > ~0u) ? ~0u : total_stale;
+}
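+
+/*
+ * Example (numbers for illustration only): with stale.binsz == 8, if
+ * five zones in the bin are full with zf_est values 1000, 200, 50, 0
+ * and 300, and the remaining three zones are not yet full, then:
+ *
+ *	total_stale = 1000 + 200 + 50 + 0 + 300 = 1550
+ *	free_zones  = 1 + 3 = 4
+ *	stale.bins[bin] = 1550 / 4 = 387
+ *
+ * Emptier bins (more zones still open for writing) therefore report a
+ * lower ratio and are less likely to be selected for GC.
+ */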
+
+/**
+ * update_all_stale_ratio() - Update the stale ratio for all bins.
+ * @znd: ZDM instance
+ */
+static void update_all_stale_ratio(struct zdm *znd)
+{
+	u32 iter;
+
+	for (iter = 0; iter < znd->stale.count; iter += znd->stale.binsz)
+		update_stale_ratio(znd, iter);
+}
+
+/**
+ * gc_ref() - Increase refcount on gc entry
+ * @gc_entry: GC state to take a reference on.
+ */
+static void gc_ref(struct gc_state *gc_entry)
+{
+	if (gc_entry)
+		atomic_inc(&gc_entry->refcount);
+}
+
+/**
+ * gc_deref() - Deref a gc entry
+ * @gc_entry: GC state to release; freed when the count reaches zero.
+ */
+static void gc_deref(struct gc_state *gc_entry)
+{
+	if (gc_entry) {
+		struct zdm *znd = gc_entry->znd;
+
+		atomic_dec(&gc_entry->refcount);
+		if (atomic_read(&gc_entry->refcount) == 0) {
+			ZDM_FREE(znd, gc_entry, sizeof(*gc_entry), KM_16);
+		}
+	}
+}
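+
+/*
+ * Usage sketch (mirrors gc_immediate() below): take a reference while
+ * znd->gc_lock is held so the active entry cannot be freed underneath
+ * the caller, then drop it when done:
+ *
+ *	spin_lock_irqsave(&znd->gc_lock, flags);
+ *	if (znd->gc_active) {
+ *		gc_entry = znd->gc_active;
+ *		gc_ref(gc_entry);
+ *	}
+ *	spin_unlock_irqrestore(&znd->gc_lock, flags);
+ *
+ *	... use gc_entry ...
+ *
+ *	spin_lock_irqsave(&znd->gc_lock, flags);
+ *	gc_deref(gc_entry);
+ *	spin_unlock_irqrestore(&znd->gc_lock, flags);
+ */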
+
+/**
+ * z_zone_compact_queue() - Queue zone compaction.
+ * @znd: ZDM instance
+ * @z_gc: Zone to queue.
+ * @delay: Delay queue metric
+ * @cpick: Non-zero when the zone was cherry-picked (nearly empty).
+ * @gfp: Allocation scheme.
+ *
+ * Return: 1 on success, 0 if not queued/busy, negative on error.
+ */
+static
+int z_zone_compact_queue(struct zdm *znd, u32 z_gc, int delay, int cpick,
+			 gfp_t gfp)
+{
+	unsigned long flags;
+	int do_queue = 0;
+	int err = 0;
+	struct gc_state *gc_entry;
+
+	gc_entry = ZDM_ALLOC(znd, sizeof(*gc_entry), KM_16, gfp);
+	if (!gc_entry) {
+		Z_ERR(znd, "No Memory for compact!!");
+		return -ENOMEM;
+	}
+
+	gc_ref(gc_entry);
+	init_completion(&gc_entry->gc_complete);
+	gc_entry->znd = znd;
+	gc_entry->z_gc = z_gc;
+	gc_entry->is_cpick = cpick;
+	set_bit(DO_GC_INIT, &gc_entry->gc_flags);
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	znd->gc_backlog++;
+	if (znd->gc_active) {
+		gc_deref(gc_entry);
+		znd->gc_backlog--;
+	} else {
+		znd->gc_active = gc_entry;
+		do_queue = 1;
+	}
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+	if (do_queue) {
+		unsigned long tval = msecs_to_jiffies(delay);
+
+		if (queue_delayed_work(znd->gc_wq, &znd->gc_work, tval))
+			err = 1;
+	}
+
+	return err;
+}
+
+/**
+ * zone_zfest() - Return the stale/free block estimate (zf_est) for a zone.
+ * @znd: ZDM instance
+ * @z_id: Zone to query.
+ */
+static u32 zone_zfest(struct zdm *znd, u32 z_id)
+{
+	u32 gzno  = z_id >> GZ_BITS;
+	u32 gzoff = z_id & GZ_MMSK;
+	struct meta_pg *wpg = &znd->wp[gzno];
+
+	return le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+}
+
+/**
+ * gc_request_queued() - Called periodically to initiate GC
+ *
+ * @znd: ZDM instance
+ * @bin: Bin with stale zones to scan for GC
+ * @delay: Metric for delay queuing.
+ * @gfp: Default memory allocation scheme.
+ *
+ * Return: 1 if a GC is already active or was queued, otherwise 0.
+ */
+static int gc_request_queued(struct zdm *znd, int bin, int delay, gfp_t gfp)
+{
+	unsigned long flags;
+	int queued = 0;
+	u32 top_roi = NOZONE;
+	u32 stale = 0;
+	u32 z_gc = bin * znd->stale.binsz;
+	u32 s_end = z_gc + znd->stale.binsz;
+
+	if (znd->meta_result)
+		goto out;
+
+	if (test_bit(ZF_FREEZE, &znd->flags)) {
+		Z_ERR(znd, "Is frozen -- GC paused.");
+		goto out;
+	}
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	if (znd->gc_active)
+		queued = 1;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+	if (queued)
+		goto out;
+
+	if (z_gc < znd->dz_start)
+		z_gc = znd->dz_start;
+	if (s_end > znd->zone_count)
+		s_end = znd->zone_count;
+
+	/* scan for most stale zone in STREAM [top_roi] */
+	for (; z_gc < s_end; z_gc++) {
+		u32 gzno  = z_gc >> GZ_BITS;
+		u32 gzoff = z_gc & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp_v = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		u32 zfe = le32_to_cpu(wpg->zf_est[gzoff]);
+		u32 nfree = zfe & Z_WP_VALUE_MASK;
+		u32 sid = (zfe & Z_WP_STREAM_MASK) >> 24;
+		u32 wp_f = wp_v & Z_WP_FLAGS_MASK;
+
+		wp_v &= Z_WP_VALUE_MASK;
+		if (wp_v == 0)
+			continue;
+		if ((wp_f & Z_WP_GC_PENDING) != 0)
+			continue;
+
+		if (wp_v == Z_BLKSZ) {
+			stale += nfree;
+
+			if (sid == Z_MDJRNL_SID && nfree < znd->gc_prio_def)
+				continue;
+			if ((wp_f & Z_WP_GC_BITS) == Z_WP_GC_READY) {
+				if (top_roi == NOZONE)
+					top_roi = z_gc;
+				else if (nfree > zone_zfest(znd, top_roi))
+					top_roi = z_gc;
+			}
+		}
+	}
+
+	if (!delay && top_roi == NOZONE)
+		Z_ERR(znd, "No GC candidate in bin: %u -> %u", z_gc, s_end);
+
+	/* determine the cut-off for GC based on MZ overall staleness */
+	if (top_roi != NOZONE) {
+		int rc;
+		u32 state_metric = znd->gc_prio_def;
+		u32 n_empty = znd->z_gc_free;
+		int pctfree = n_empty * 100 / znd->data_zones;
+
+		/*
+		 * -> at less than 5 zones free switch to critical
+		 * -> at less than 5% zones free switch to HIGH
+		 * -> at less than 25% free switch to LOW
+		 * -> high level is 'cherry picking' near empty zones
+		 */
+		if (znd->z_gc_free < znd->gc_wm_crit)
+			state_metric = znd->gc_prio_crit;
+		else if (pctfree < znd->gc_wm_high)
+			state_metric = znd->gc_prio_high;
+		else if (pctfree < znd->gc_wm_low)
+			state_metric = znd->gc_prio_low;
+
+		if (zone_zfest(znd, top_roi) > state_metric) {
+			delay *= 5;
+			rc = z_zone_compact_queue(znd, top_roi, delay, 0, gfp);
+			if (rc == 1)
+				queued = 1;
+			else if (rc < 0)
+				Z_ERR(znd, "GC: Z#%u !Q: ERR: %d", top_roi, rc);
+		}
+
+		if (!delay && !queued) {
+			delay *= 5;
+			rc = z_zone_compact_queue(znd, top_roi, delay, 0, gfp);
+			if (rc == 1)
+				queued = 1;
+		}
+
+		if (!delay && !queued)
+			Z_ERR(znd, "GC: Z#%u !Q .. M: %u E: %u PCT: %d ZF: %u",
+				   top_roi, state_metric, n_empty, pctfree,
+				   zone_zfest(znd, top_roi));
+	}
+out:
+	return queued;
+}
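+
+/*
+ * Illustrative example of the watermark selection above (numbers made
+ * up): with 1000 data zones, z_gc_free == 40, gc_wm_crit == 5 (zones),
+ * gc_wm_high == 5 (%) and gc_wm_low == 25 (%):
+ *
+ *	pctfree = 40 * 100 / 1000 = 4  ->  below gc_wm_high
+ *	state_metric = gc_prio_high
+ *
+ * The most stale zone in the bin is queued when its zf_est exceeds
+ * state_metric; with delay == 0 a request is queued as a fallback even
+ * when it does not.
+ */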
+
+/**
+ * z_zone_gc_compact() - Primary compaction worker.
+ * @gc_entry: GC State
+ */
+static int z_zone_gc_compact(struct gc_state *gc_entry)
+{
+	unsigned long flags;
+	unsigned long wpflgs;
+	int err = 0;
+	struct zdm *znd = gc_entry->znd;
+	u32 z_gc = gc_entry->z_gc;
+	u32 gzno  = z_gc >> GZ_BITS;
+	u32 gzoff = z_gc & GZ_MMSK;
+	struct meta_pg *wpg = &znd->wp[gzno];
+
+	znd->age = jiffies_64;
+
+	/*
+	 * this could be a little smarter ... just check that any
+	 * MD mapped to the target zone has its DIRTY/FLUSH flags clear
+	 */
+	if (test_and_clear_bit(DO_GC_INIT, &gc_entry->gc_flags)) {
+		set_bit(GC_IN_PROGRESS, &gc_entry->gc_flags);
+		err = z_flush_bdev(znd, GFP_KERNEL);
+		if (err) {
+			gc_entry->result = err;
+			goto out;
+		}
+		set_bit(DO_GC_MD_MAP, &gc_entry->gc_flags);
+	}
+
+	/* If a SYNC is in progress and we can delay then postpone */
+	if (mutex_is_locked(&znd->mz_io_mutex) &&
+	    atomic_read(&znd->gc_throttle) == 0)
+		return -EAGAIN;
+
+	if (test_and_clear_bit(DO_GC_MD_MAP, &gc_entry->gc_flags)) {
+		int nak = 0;
+		u32 wp;
+
+		spin_lock_irqsave(&wpg->wplck, wpflgs);
+		wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		wp |= Z_WP_GC_FULL;
+		wpg->wp_alloc[gzoff] = cpu_to_le32(wp);
+		set_bit(IS_DIRTY, &wpg->flags);
+		clear_bit(IS_FLUSH, &wpg->flags);
+		spin_unlock_irqrestore(&wpg->wplck, wpflgs);
+
+		if (znd->gc_postmap.jcount > 0) {
+			Z_ERR(znd, "*** Unexpected data in postmap!!");
+			znd->gc_postmap.jcount = 0;
+			znd->gc_postmap.jsorted = 0;
+		}
+
+		err = z_zone_gc_metadata_to_ram(gc_entry);
+		if (err) {
+			if (err == -EAGAIN) {
+				set_bit(DO_GC_MD_MAP, &gc_entry->gc_flags);
+				znd->gc_postmap.jcount = 0;
+				znd->gc_postmap.jsorted = 0;
+
+				Z_ERR(znd, "*** metadata to ram, again!!");
+				return err;
+			}
+			if (err != -EBUSY) {
+				Z_ERR(znd,
+				      "DO_GC_MD_MAP state failed!! %d", err);
+				gc_entry->result = err;
+				goto out;
+			}
+		}
+
+		if (err == -EBUSY && znd->gc_postmap.jcount == Z_BLKSZ)
+			nak = 1;
+		else if (gc_entry->is_cpick && znd->gc_postmap.jcount > 64)
+			nak = 1;
+
+		if (nak) {
+			u32 non_seq;
+			u32 sid;
+
+			Z_DBG(znd, "Schedule 'move %u' aborting GC",
+			      znd->gc_postmap.jcount);
+
+			spin_lock_irqsave(&wpg->wplck, wpflgs);
+			non_seq = le32_to_cpu(wpg->wp_alloc[gzoff]);
+			non_seq &= ~Z_WP_GC_FULL;
+			wpg->wp_alloc[gzoff] = cpu_to_le32(non_seq);
+			sid = le32_to_cpu(wpg->zf_est[gzoff]);
+			sid &= Z_WP_STREAM_MASK;
+			sid |= (Z_BLKSZ - znd->gc_postmap.jcount);
+			wpg->zf_est[gzoff] = cpu_to_le32(sid);
+			set_bit(IS_DIRTY, &wpg->flags);
+			clear_bit(IS_FLUSH, &wpg->flags);
+			spin_unlock_irqrestore(&wpg->wplck, wpflgs);
+			clear_bit(GC_IN_PROGRESS, &gc_entry->gc_flags);
+			complete_all(&gc_entry->gc_complete);
+			update_stale_ratio(znd, gc_entry->z_gc);
+
+			spin_lock_irqsave(&znd->gc_lock, flags);
+			if (znd->gc_active && gc_entry == znd->gc_active) {
+				set_bit(DO_GC_COMPLETE, &gc_entry->gc_flags);
+				znd->gc_active = NULL;
+				__smp_mb();
+				gc_deref(gc_entry);
+			}
+			spin_unlock_irqrestore(&znd->gc_lock, flags);
+			znd->gc_postmap.jcount = 0;
+			znd->gc_postmap.jsorted = 0;
+			err = -EBUSY;
+			goto out;
+		}
+		if (znd->gc_postmap.jcount == 0)
+			set_bit(DO_GC_DONE, &gc_entry->gc_flags);
+		else
+			set_bit(DO_GC_READ, &gc_entry->gc_flags);
+
+		if (atomic_read(&znd->gc_throttle) == 0 &&
+		    znd->z_gc_free > znd->gc_wm_crit)
+			return -EAGAIN;
+	}
+
+next_in_queue:
+	znd->age = jiffies_64;
+
+	if (test_and_clear_bit(DO_GC_READ, &gc_entry->gc_flags)) {
+		err = z_zone_gc_read(gc_entry);
+		if (err < 0) {
+			Z_ERR(znd, "z_zone_gc_chunk issue failure: %d", err);
+			gc_entry->result = err;
+			goto out;
+		}
+		if (atomic_read(&znd->gc_throttle) == 0 &&
+		    znd->z_gc_free > znd->gc_wm_crit)
+			return -EAGAIN;
+	}
+
+	if (test_and_clear_bit(DO_GC_WRITE, &gc_entry->gc_flags)) {
+		u32 sid = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_STREAM_MASK;
+
+		err = z_zone_gc_write(gc_entry, sid >> 24);
+		if (err) {
+			Z_ERR(znd, "z_zone_gc_write issue failure: %d", err);
+			gc_entry->result = err;
+			goto out;
+		}
+		if (atomic_read(&znd->gc_throttle) == 0 &&
+		    znd->z_gc_free > znd->gc_wm_crit)
+			return -EAGAIN;
+	}
+
+	if (test_and_clear_bit(DO_GC_CONTINUE, &gc_entry->gc_flags)) {
+		z_do_copy_more(gc_entry);
+		goto next_in_queue;
+	}
+
+	znd->age = jiffies_64;
+	if (test_and_clear_bit(DO_GC_MD_SYNC, &gc_entry->gc_flags)) {
+		err = z_zone_gc_metadata_update(gc_entry);
+		gc_entry->result = err;
+		if (err) {
+			Z_ERR(znd, "Metadata error ... disable zone: %u",
+			      gc_entry->z_gc);
+			gc_entry->result = err;
+			goto out;
+		}
+		set_bit(DO_GC_MD_ZLT, &gc_entry->gc_flags);
+	}
+	znd->age = jiffies_64;
+	if (test_and_clear_bit(DO_GC_MD_ZLT, &gc_entry->gc_flags)) {
+		err = z_zone_gc_metadata_zlt(gc_entry);
+		if (err) {
+			if (err == -EAGAIN)
+				return err;
+
+			Z_ERR(znd, "Metadata error ... disable zone: %u",
+			      gc_entry->z_gc);
+
+			gc_entry->result = err;
+			goto out;
+		}
+		err = gc_finalize(gc_entry);
+		if (err) {
+			Z_ERR(znd, "GC: Failed to finalize: %d", err);
+			gc_entry->result = err;
+			goto out;
+		}
+		set_bit(DO_GC_DONE, &gc_entry->gc_flags);
+
+		/* flush *before* reset wp occurs to avoid data loss */
+		err = z_flush_bdev(znd, GFP_KERNEL);
+		if (err) {
+			gc_entry->result = err;
+			goto out;
+		}
+	}
+	if (test_and_clear_bit(DO_GC_DONE, &gc_entry->gc_flags)) {
+		u32 non_seq;
+		u32 reclaimed;
+
+		/* Release the zones for writing */
+		dmz_reset_wp(znd, gc_entry->z_gc);
+
+		spin_lock_irqsave(&wpg->wplck, wpflgs);
+		non_seq = le32_to_cpu(wpg->wp_alloc[gzoff]) & Z_WP_NON_SEQ;
+		reclaimed = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+		wpg->wp_alloc[gzoff] = cpu_to_le32(non_seq);
+		wpg->wp_used[gzoff] = cpu_to_le32(0u);
+		wpg->zf_est[gzoff] = cpu_to_le32(Z_BLKSZ);
+		znd->discard_count -= reclaimed;
+		znd->z_gc_free++;
+
+		/*
+		 * If we used a 'reserved' zone for GC/Meta then re-purpose
+		 * the just emptied zone as the new reserved zone. Releasing
+		 * the reserved zone into the normal allocation pool.
+		 */
+		if (znd->z_gc_resv & Z_WP_GC_ACTIVE)
+			znd->z_gc_resv = gc_entry->z_gc;
+		else if (znd->z_meta_resv & Z_WP_GC_ACTIVE)
+			znd->z_meta_resv = gc_entry->z_gc;
+		set_bit(IS_DIRTY, &wpg->flags);
+		clear_bit(IS_FLUSH, &wpg->flags);
+		spin_unlock_irqrestore(&wpg->wplck, wpflgs);
+		complete_all(&gc_entry->gc_complete);
+
+		znd->gc_events++;
+		update_stale_ratio(znd, gc_entry->z_gc);
+		spin_lock_irqsave(&znd->gc_lock, flags);
+		if (znd->gc_active && gc_entry == znd->gc_active) {
+			set_bit(DO_GC_COMPLETE, &gc_entry->gc_flags);
+			znd->gc_active = NULL;
+			__smp_mb();
+			gc_deref(gc_entry);
+		} else {
+			Z_ERR(znd, "GC: FAIL. FAIL.");
+		}
+		znd->gc_backlog--;
+		spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+		set_bit(DO_MEMPOOL, &znd->flags);
+		set_bit(DO_SYNC, &znd->flags);
+	}
+out:
+	return 0;
+}
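+
+/*
+ * Rough flag progression handled by this worker (a summary of the code
+ * above; some transitions are set by the read/write/copy helpers):
+ *
+ *	DO_GC_INIT     -> flush bdev, then DO_GC_MD_MAP
+ *	DO_GC_MD_MAP   -> pull zone metadata to RAM, then DO_GC_READ
+ *	                  (or DO_GC_DONE when no blocks need to move)
+ *	DO_GC_READ     -> read live blocks from the victim zone
+ *	DO_GC_WRITE    -> write them to a newly allocated zone
+ *	DO_GC_CONTINUE -> more blocks to copy, loop back
+ *	DO_GC_MD_SYNC  -> account for moved blocks, then DO_GC_MD_ZLT
+ *	DO_GC_MD_ZLT   -> push mappings to the lookup table, finalize,
+ *	                  then DO_GC_DONE
+ *	DO_GC_DONE     -> reset the zone WP and release it for reuse
+ *
+ * Any step may return -EAGAIN to yield; gc_work_task() re-queues the
+ * work in that case.
+ */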
+
+/**
+ * gc_work_task() - Worker thread for GC activity.
+ * @work: Work struct holding the ZDM instance to do work on ...
+ */
+static void gc_work_task(struct work_struct *work)
+{
+	struct gc_state *gc_entry = NULL;
+	unsigned long flags;
+	struct zdm *znd;
+	int err;
+
+	if (!work)
+		return;
+
+	znd = container_of(to_delayed_work(work), struct zdm, gc_work);
+	if (!znd)
+		return;
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	if (znd->gc_active)
+		gc_entry = znd->gc_active;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+	if (!gc_entry)
+		return;
+
+	err = z_zone_gc_compact(gc_entry);
+	if (-EAGAIN == err) {
+		int requeue = 0;
+		unsigned long tval = msecs_to_jiffies(10);
+
+		spin_lock_irqsave(&znd->gc_lock, flags);
+		if (znd->gc_active)
+			requeue = 1;
+		spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+		if (requeue) {
+			if (atomic_read(&znd->gc_throttle) > 0)
+				tval = 0;
+			if (znd->z_gc_free <= znd->gc_wm_crit)
+				tval = 0;
+			queue_delayed_work(znd->gc_wq, &znd->gc_work, tval);
+		}
+	} else {
+		const int delay = 10;
+
+		on_timeout_activity(znd, delay);
+	}
+}
+
+/**
+ * is_reserved() - Check to see if a zone is 'special'
+ * @znd: ZDM Instance
+ * @z_pref: Zone to be tested.
+ */
+static inline int is_reserved(struct zdm *znd, const u32 z_pref)
+{
+	const u32 gc   = znd->z_gc_resv   & Z_WP_VALUE_MASK;
+	const u32 meta = znd->z_meta_resv & Z_WP_VALUE_MASK;
+
+	return (gc == z_pref || meta == z_pref) ? 1 : 0;
+}
+
+/**
+ * gc_can_cherrypick() - Queue a GC for a zone in this bin if it will be cheap.
+ * @znd: ZDM Instance
+ * @bin: The bin (0 to 255)
+ * @delay: Delay metric
+ * @gfp: Allocation flags to use.
+ */
+static int gc_can_cherrypick(struct zdm *znd, u32 bin, int delay, gfp_t gfp)
+{
+	u32 z_id = bin * znd->stale.binsz;
+	u32 s_end = z_id + znd->stale.binsz;
+
+	if (z_id < znd->dz_start)
+		z_id = znd->dz_start;
+
+	if (s_end > znd->zone_count)
+		s_end = znd->zone_count;
+
+	for (; z_id < s_end; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		u32 nfree = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+
+		if (wp & Z_WP_RRECALC)
+			update_stale_ratio(znd, z_id);
+
+		if (((wp & Z_WP_GC_BITS) == Z_WP_GC_READY) &&
+		    ((wp & Z_WP_VALUE_MASK) == Z_BLKSZ) &&
+		    (nfree == Z_BLKSZ)) {
+			if (z_zone_compact_queue(znd, z_id, delay, 1, gfp))
+				return 1;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * gc_queue_with_delay() - Scan to see if a GC can/should be queued.
+ * @znd: ZDM Instance
+ * @delay: Delay metric
+ * @gfp: Allocation flags to use.
+ *
+ * Return 1 if gc in progress or queued. 0 otherwise.
+ */
+static int gc_queue_with_delay(struct zdm *znd, int delay, gfp_t gfp)
+{
+	int gc_idle = 0;
+	unsigned long flags;
+
+	if (znd->gc_status == GC_OFF)
+		return gc_idle;
+
+	spin_lock_irqsave(&znd->gc_lock, flags);
+	gc_idle = znd->gc_active ? 0 : 1;
+	spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+	if (gc_idle) {
+		int bin = 0;
+		int ratio = 0;
+		u32 iter;
+
+		if (znd->gc_status == GC_FORCE)
+			delay = 0;
+
+		/* Find highest ratio stream */
+		for (iter = 0; iter < znd->stale.count; iter++) {
+			if (znd->stale.bins[iter] > ratio) {
+				ratio = znd->stale.bins[iter];
+				bin = iter;
+			}
+		}
+
+		/* Cherrypick a zone in the stream */
+		if (gc_idle && gc_can_cherrypick(znd, bin, delay, gfp))
+			gc_idle = 0;
+
+		/* Otherwise cherrypick *something* */
+		for (iter = 0; gc_idle && (iter < znd->stale.count); iter++)
+			if (gc_idle && (bin != iter) &&
+			    gc_can_cherrypick(znd, iter, delay, gfp))
+				gc_idle = 0;
+
+		/* Otherwise compact a zone in the stream */
+		if (gc_idle && gc_request_queued(znd, bin, delay, gfp))
+			gc_idle = 0;
+
+		if (delay)
+			return !gc_idle;
+
+		/* Otherwise compact *something* */
+		for (iter = 0; gc_idle && (iter < znd->stale.count); iter++)
+			if (gc_idle && gc_request_queued(znd, iter, delay, gfp))
+				gc_idle = 0;
+	}
+	return !gc_idle;
+}
+
+/**
+ * gc_immediate() - Free up some space as soon as possible.
+ * @znd: ZDM Instance
+ * @wait: Non-zero to throttle and wait on any in-progress GC.
+ * @gfp: Allocation flags to use.
+ */
+static int gc_immediate(struct zdm *znd, int wait, gfp_t gfp)
+{
+	const int delay = 0;
+	int can_retry = 0;
+	int queued = 0;
+
+	if (wait) {
+		struct gc_state *gc_entry = NULL;
+		unsigned long flags;
+
+		atomic_inc(&znd->gc_throttle);
+		spin_lock_irqsave(&znd->gc_lock, flags);
+		if (znd->gc_active) {
+			gc_entry = znd->gc_active;
+			if (test_bit(GC_IN_PROGRESS, &gc_entry->gc_flags))
+				gc_ref(gc_entry);
+			else
+				gc_entry = NULL;
+		}
+		spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+		if (gc_entry) {
+			unsigned long to = 1;
+
+			if (test_bit(GC_IN_PROGRESS, &gc_entry->gc_flags)) {
+				to = wait_for_completion_io_timeout(
+					&gc_entry->gc_complete,
+					msecs_to_jiffies(15000));
+				if (!to) {
+					can_retry = 1;
+					Z_ERR(znd, "gc_imm: timeout: %lu", to);
+				}
+			}
+
+			spin_lock_irqsave(&znd->gc_lock, flags);
+			gc_deref(gc_entry);
+			spin_unlock_irqrestore(&znd->gc_lock, flags);
+
+			if (to && delayed_work_pending(&znd->gc_work)) {
+				mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+				can_retry = flush_delayed_work(&znd->gc_work);
+			}
+		}
+	}
+
+	queued = gc_queue_with_delay(znd, delay, gfp);
+	if (wait) {
+		atomic_inc(&znd->gc_throttle);
+		if (!can_retry && delayed_work_pending(&znd->gc_work)) {
+			mod_delayed_work(znd->gc_wq, &znd->gc_work, 0);
+			can_retry = flush_delayed_work(&znd->gc_work);
+		}
+		atomic_dec(&znd->gc_throttle);
+	}
+
+	if (!queued || !can_retry) {
+		/*
+		 * Couldn't find a zone with enough stale blocks, but we
+		 * may be able to once more discard extents have been
+		 * dereferenced ... so try again later.
+		 */
+		if (znd->trim->count > 0)
+			can_retry = 1;
+
+		unmap_deref_chunk(znd, 2048, 1, gfp);
+	}
+
+	if (wait)
+		atomic_dec(&znd->gc_throttle);
+
+	return can_retry | queued;
+}
+
+/**
+ * set_current() - Make zone the preferred zone for allocation.
+ * @znd: ZDM Instance
+ * @flags: Block allocation scheme (including stream id)
+ * @zone: The zone to make preferred.
+ *
+ * Once a zone is opened for allocation, future allocations will prefer
+ * the same zone, until the zone is full.
+ * Each stream id has its own preferred zone.
+ *
+ * NOTE: z_current is being deprecated in favor of assuming a default
+ *       stream id when nothing is provided.
+ */
+static inline void set_current(struct zdm *znd, u32 flags, u32 zone)
+{
+	if (flags & Z_AQ_STREAM_ID) {
+		u32 stream_id = flags & Z_AQ_STREAM_MASK;
+
+		znd->bmkeys->stream[stream_id] = cpu_to_le32(zone);
+	}
+	znd->z_current = zone;
+	if (znd->z_gc_free > 0)
+		znd->z_gc_free--;
+}
+
+/**
+ * next_open_zone() - Grab the next available zone
+ * @znd: ZDM Instance
+ * @z_at: Zone to start scanning from (presumably just filled).
+ *
+ * Return: NOZONE if no zone exists with space for writing.
+ *
+ * Scan through the available zones for an empty zone.
+ * If no empty zone is available then a zone that is not full is
+ * used instead.
+ */
+static u32 next_open_zone(struct zdm *znd, u32 z_at)
+{
+	u32 zone = NOZONE;
+	u32 z_id;
+
+	if (z_at > znd->zone_count)
+		z_at = znd->zone_count;
+
+	/* scan higher lba zones */
+	for (z_id = z_at; z_id < znd->zone_count; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+		if ((wp & Z_WP_VALUE_MASK) == 0) {
+			u32 check = gzno << GZ_BITS | gzoff;
+
+			if (!is_reserved(znd, check)) {
+				zone = check;
+				goto out;
+			}
+		}
+	}
+
+	/* scan lower lba zones */
+	for (z_id = znd->dz_start; z_id < z_at; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+		if ((wp & Z_WP_VALUE_MASK) == 0) {
+			u32 check = gzno << GZ_BITS | gzoff;
+
+			if (!is_reserved(znd, check)) {
+				zone = check;
+				goto out;
+			}
+		}
+	}
+
+	/* No empty zones .. start co-mingling streams */
+	for (z_id = znd->dz_start; z_id < znd->zone_count; z_id++) {
+		u32 gzno  = z_id >> GZ_BITS;
+		u32 gzoff = z_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+
+		if ((wp & Z_WP_VALUE_MASK) < Z_BLKSZ) {
+			u32 check = gzno << GZ_BITS | gzoff;
+
+			if (!is_reserved(znd, check)) {
+				zone = check;
+				goto out;
+			}
+		}
+	}
+
+out:
+	return zone;
+}
+
+/**
+ * zone_filled_cleanup() - Update the wp_alloc GC Ready flag based on wp_used.
+ * @znd: ZDM Instance
+ */
+static void zone_filled_cleanup(struct zdm *znd)
+{
+	if (znd->filled_zone != NOZONE) {
+		unsigned long wpflgs;
+		u32 zone = znd->filled_zone;
+		u32 gzno;
+		u32 gzoff;
+		u32 wp;
+		u32 used;
+		struct meta_pg *wpg;
+
+		znd->filled_zone = NOZONE;
+
+		gzno = zone >> GZ_BITS;
+		gzoff = zone & GZ_MMSK;
+		wpg = &znd->wp[gzno];
+
+		spin_lock_irqsave(&wpg->wplck, wpflgs);
+		wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		used = le32_to_cpu(wpg->wp_used[gzoff]) & Z_WP_VALUE_MASK;
+		if (used == Z_BLKSZ) {
+			if (Z_BLKSZ == (wp & Z_WP_VALUE_MASK)) {
+				wpg->wp_alloc[gzoff] = cpu_to_le32(wp
+						     | Z_WP_GC_READY);
+				set_bit(IS_DIRTY, &wpg->flags);
+				clear_bit(IS_FLUSH, &wpg->flags);
+			} else {
+				Z_ERR(znd, "Zone %u seems bogus. "
+					   "wp: %x used: %x",
+				      zone, wp, used);
+			}
+		}
+		spin_unlock_irqrestore(&wpg->wplck, wpflgs);
+
+		dmz_close_zone(znd, zone);
+		update_stale_ratio(znd, zone);
+		set_bit(DO_MAPCACHE_MOVE, &znd->flags);
+	}
+}
+
+/**
+ * z_acquire() - Allocate blocks for writing
+ * @znd: ZDM Instance
+ * @flags: Alloc strategy and stream id.
+ * @nblks: Number of blocks desired.
+ * @nfound: Number of blocks available.
+ *
+ * Return: LBA for writing, or 0 if no space could be allocated.
+ */
+static u64 z_acquire(struct zdm *znd, u32 flags, u32 nblks, u32 *nfound)
+{
+	sector_t found = 0;
+	u32 z_pref = znd->z_current;
+	u32 stream_id = 0;
+	u32 z_find;
+	const int wait = 1;
+	gfp_t gfp = (flags & Z_AQ_NORMAL) ? GFP_ATOMIC : GFP_KERNEL;
+
+	if (!(flags & Z_AQ_GC))
+		zone_filled_cleanup(znd);
+
+	if (flags & Z_AQ_STREAM_ID) {
+		stream_id = flags & Z_AQ_STREAM_MASK;
+		z_pref = le32_to_cpu(znd->bmkeys->stream[stream_id]);
+	}
+	if (z_pref >= znd->zone_count) {
+		z_pref = next_open_zone(znd, znd->z_current);
+		if (z_pref < znd->zone_count)
+			set_current(znd, flags, z_pref);
+	}
+
+	if (z_pref < znd->zone_count) {
+		found = _blkalloc(znd, z_pref, flags, nblks, nfound);
+		if (found || *nfound)
+			goto out;
+	}
+
+	if (znd->z_gc_free < znd->gc_wm_crit) {
+		Z_DBG(znd, "... alloc - gc low on free space.");
+		gc_immediate(znd, znd->z_gc_free < (znd->gc_wm_crit >> 1), gfp);
+	}
+
+retry:
+	z_find = next_open_zone(znd, znd->z_current);
+	if (z_find < znd->zone_count) {
+		found = _blkalloc(znd, z_find, flags, nblks, nfound);
+		if (found || *nfound) {
+			set_current(znd, flags, z_find);
+			goto out;
+		}
+	}
+
+	if (flags & Z_AQ_GC) {
+		u32 gresv = znd->z_gc_resv & Z_WP_VALUE_MASK;
+
+		Z_ERR(znd, "Using GC Reserve (%u)", gresv);
+		found = _blkalloc(znd, gresv, flags, nblks, nfound);
+		znd->z_gc_resv |= Z_WP_GC_ACTIVE;
+	}
+
+	if (flags & Z_AQ_META) {
+		int can_retry = gc_immediate(znd, wait, gfp);
+		u32 mresv = znd->z_meta_resv & Z_WP_VALUE_MASK;
+
+		Z_DBG(znd, "GC: Need META.");
+		if (can_retry)
+			goto retry;
+
+		Z_ERR(znd, "Using META Reserve (%u)", znd->z_meta_resv);
+		found = _blkalloc(znd, mresv, flags, nblks, nfound);
+	}
+
+out:
+	if (!found && (*nfound == 0)) {
+		if (gc_immediate(znd, wait, gfp))
+			goto retry;
+
+		Z_ERR(znd, "%s: -> Out of space.", __func__);
+	}
+	return found;
+}
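+
+/*
+ * Example (hypothetical caller; the exact flags used by real callers
+ * may differ): asking for 16 data blocks:
+ *
+ *	u32 avail = 0;
+ *	u64 lba = z_acquire(znd, Z_AQ_NORMAL, 16, &avail);
+ *
+ * When lba is non-zero all 16 blocks are reserved starting at lba.
+ * When lba is 0 and avail is non-zero, nothing was reserved but avail
+ * blocks remain in the current zone, so the caller retries with
+ * nblks <= avail. When both are 0 the instance is out of space (the
+ * error path above has already attempted gc_immediate()).
+ */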
+
+/**
+ * wset_cmp_wr() - Compare map pages on lba.
+ * @x1: map page
+ * @x2: map page
+ *
+ * Return -1, 0, or 1 if x1 < x2, equal, or >, respectively.
+ */
+static int wset_cmp_wr(const void *x1, const void *x2)
+{
+	const struct map_pg *v1 = *(const struct map_pg **)x1;
+	const struct map_pg *v2 = *(const struct map_pg **)x2;
+	int cmp = (v1->lba < v2->lba) ? -1 : ((v1->lba > v2->lba) ? 1 : 0);
+
+	return cmp;
+}
+
+/**
+ * wset_cmp_rd() - Compare map pages on last_write (falling back to lba).
+ * @x1: map page
+ * @x2: map page
+ *
+ * Return -1, 0, or 1 if x1 < x2, equal, or >, respectively.
+ */
+static int wset_cmp_rd(const void *x1, const void *x2)
+{
+	const struct map_pg *v1 = *(const struct map_pg **)x1;
+	const struct map_pg *v2 = *(const struct map_pg **)x2;
+	int cmp = (v1->last_write < v2->last_write) ? -1
+		: ((v1->last_write > v2->last_write) ? 1 : wset_cmp_wr(x1, x2));
+
+	return cmp;
+}
+
+/**
+ * is_dirty() - Test map page if is of bit_type and dirty.
+ * @expg: map page
+ * @bit_type: map page flag to test for...
+ *
+ * Return 1 if page has both the dirty and bit_type flags set.
+ *
+ * Note: bit_type is IS_CRC and IS_LUT most typically.
+ */
+static __always_inline int is_dirty(struct map_pg *expg, int bit_type)
+{
+	return (test_bit(bit_type, &expg->flags) &&
+		test_bit(IS_DIRTY, &expg->flags));
+}
+
+/**
+ * is_old_and_clean() - Test map page if it expired and can be dropped.
+ * @expg: map page
+ * @bit_type: map page flag to test for...
+ *
+ * Return 1 if page is clean, not in flight, and old.
+ */
+static __always_inline int is_old_and_clean(struct map_pg *expg, int bit_type)
+{
+	int is_match = 0;
+
+	if (!test_bit(IS_DIRTY, &expg->flags) &&
+	    is_expired(expg->znd, expg->age)) {
+		if (test_bit(R_IN_FLIGHT, &expg->flags))
+			pr_debug("%5"PRIx64": clean, exp, and in flight: %d\n",
+				 expg->lba, getref_pg(expg));
+		else if (getref_pg(expg) == 1)
+			is_match = 1;
+#if EXTRA_DEBUG
+		else if (test_bit(IS_LUT, &expg->flags))
+			pr_err("%5"PRIx64": clean, exp, and elev: %d\n",
+				 expg->lba, getref_pg(expg));
+#endif
+	}
+
+	(void)bit_type;
+
+	return is_match;
+}
+
+/**
+ * _pool_write() - Sort/Write an array of ZLT pages.
+ * @znd: ZDM Instance
+ * @wset: Array of pages to be written.
+ * @count: Number of entries.
+ *
+ *  NOTE: On entry all map_pg entries have elevated refcount from _pool_fill().
+ *	write_if_dirty() will dec the refcount when the block hits disk.
+ */
+static int _pool_write(struct zdm *znd, struct map_pg **wset, int count)
+{
+	const int use_wq = 0;
+	int iter;
+	struct map_pg *expg;
+	int err = 0;
+
+	/* write dirty table pages */
+	if (count <= 0)
+		goto out;
+
+	if (count > 1)
+		sort(wset, count, sizeof(*wset), wset_cmp_wr, NULL);
+
+	for (iter = 0; iter < count; iter++) {
+		expg = wset[iter];
+		if (expg) {
+			if (iter && expg->lba == wset[iter-1]->lba)
+				wset[iter] = NULL;
+			else
+				cache_if_dirty(znd, expg, use_wq);
+		}
+	}
+
+	for (iter = 0; iter < count; iter++) {
+		const int sync = 1;
+		/* REDUX: For Async WB:  int sync = (iter == last) ? 1 : 0; */
+
+		expg = wset[iter];
+		if (expg) {
+			err = write_if_dirty(znd, expg, use_wq, sync);
+			deref_pg(expg);
+			if (err) {
+				Z_ERR(znd, "Write failed: %d", err);
+				goto out;
+			}
+		}
+	}
+	err = count;
+
+out:
+	return err;
+}
+
+/**
+ * _pool_read() - Sort/Read an array of ZLT pages.
+ * @znd: ZDM Instance
+ * @wset: Array of pages to be read.
+ * @count: Number of entries.
+ *
+ *  NOTE: On entry all map_pg entries have elevated refcount from _pool_fill().
+ *	The refcount is dropped here once each page has been (re)loaded.
+ */
+static int _pool_read(struct zdm *znd, struct map_pg **wset, int count)
+{
+	int iter;
+	struct map_pg *expg;
+	struct map_pg *prev = NULL;
+	int err = 0;
+
+	/* read table pages back into memory */
+	if (count <= 0)
+		goto out;
+
+	if (count > 1)
+		sort(wset, count, sizeof(*wset), wset_cmp_rd, NULL);
+
+	for (iter = 0; iter < count; iter++) {
+		expg = wset[iter];
+		if (expg) {
+			if (!expg->data.addr) {
+				gfp_t gfp = GFP_KERNEL;
+				struct mpinfo mpi;
+				int rc;
+
+				to_table_entry(znd, expg->lba, 0, &mpi);
+				rc = cache_pg(znd, expg, gfp, &mpi);
+				if (!rc)
+					rc = wait_for_map_pg(znd, expg, gfp);
+				if (rc < 0 && rc != -EBUSY)
+					znd->meta_result = rc;
+			}
+			set_bit(IS_DIRTY, &expg->flags);
+			clear_bit(IS_FLUSH, &expg->flags);
+			clear_bit(IS_READA, &expg->flags);
+			deref_pg(expg);
+		}
+		if (prev && expg == prev)
+			Z_ERR(znd, "Dupe %"PRIx64" in pool_read list.",
+			      expg->lba);
+		prev = expg;
+	}
+
+out:
+	return err;
+}
+
+/**
+ * md_journal_add_map() - Add an entry to the map cache block mapping.
+ * @znd: ZDM Instance
+ * @addr: Address being added to journal.
+ * @lba: bLBA that @addr is being mapped to (0 to delete the mapping)
+ *
+ * Add a new journal wb entry.
+ */
+static int md_journal_add_map(struct zdm *znd, u64 addr, u64 lba)
+{
+	unsigned long flgs;
+	struct map_cache_page *m_pg = NULL;
+	const u32 count = 1;
+	int rc = 0;
+	int matches;
+	const gfp_t gfp = GFP_ATOMIC;
+
+	if (addr >= znd->data_lba) {
+		Z_ERR(znd, "%s: Error invalid addr.", __func__);
+		return rc;
+	}
+
+	m_pg = ZDM_ALLOC(znd, sizeof(*m_pg), PG_09, gfp);
+	if (!m_pg)
+		return -ENOMEM;
+
+resubmit:
+	mp_grow(znd->wbjrnl, znd->_wbj, gfp);
+	spin_lock_irqsave(&znd->wbjrnl_rwlck, flgs);
+	matches = _common_intersect(znd->wbjrnl, addr, count, m_pg);
+	if (matches == 0) {
+		const u32 sflg = 0;
+
+		if (mp_insert(znd->wbjrnl, addr, sflg, count, lba) != 1) {
+			m_pg->maps[ISCT_BASE].tlba = lba48_to_le64(0, addr);
+			m_pg->maps[ISCT_BASE].bval = lba48_to_le64(count, lba);
+			matches = 1;
+		}
+	}
+	if (matches) {
+		struct map_pool *mp;
+		const int avail = ARRAY_SIZE(m_pg->maps);
+		const int drop = 1;
+
+		_common_merges(znd, m_pg, matches, addr, count, lba, gfp);
+		mp = mp_pick(znd->wbjrnl, znd->_wbj);
+		rc = do_sort_merge(mp, znd->wbjrnl, m_pg->maps, avail, drop);
+		if (unlikely(rc)) {
+			Z_ERR(znd, "JSortMerge failed: %d [%d]", rc, __LINE__);
+			rc = -EBUSY;
+		} else {
+			znd->wbjrnl = mp;
+			__smp_mb();
+		}
+	}
+	spin_unlock_irqrestore(&znd->wbjrnl_rwlck, flgs);
+	if (rc == -EBUSY)
+		goto resubmit;
+
+	if (m_pg)
+		ZDM_FREE(znd, m_pg, sizeof(*m_pg), PG_09);
+
+	return rc;
+}
+
+/**
+ * z_metadata_lba() - Alloc a block for [lookup table] metadata.
+ * @znd: ZDM Instance
+ * @map: Block of metadata.
+ * @num: Number of blocks allocated.
+ *
+ * Return: lba or 0 on failure.
+ *
+ * When map->lba is less than data_lba the metadata is pinned to its logical
+ * location.
+ * When map->lba lands in data space it is dynamically allocated and intermixed
+ * within the datapool.
+ */
+static u64 z_metadata_lba(struct zdm *znd, struct map_pg *map, u32 *num)
+{
+	u64 jrnl_lba = map->lba;
+	u32 nblks = 1;
+	bool to_jrnl = true;
+
+	if (test_bit(IN_WB_JOURNAL, &map->flags)) {
+		short delta = (znd->bmkeys->generation & 0xfff) - map->gen;
+
+		if (abs(delta) > znd->journal_age) {
+			*num = 1;
+			map->gen = 0;
+			clear_bit(IN_WB_JOURNAL, &map->flags);
+			to_jrnl = false;
+		}
+	}
+	if (znd->journal_age == 0) {
+		to_jrnl = false;
+		*num = 1;
+	}
+	if (to_jrnl) {
+		jrnl_lba = z_acquire(znd, Z_AQ_META_STREAM, nblks, num);
+		if (!jrnl_lba) {
+			Z_ERR(znd, "Out of MD journal space?");
+			jrnl_lba = map->lba;
+		}
+		set_bit(IN_WB_JOURNAL, &map->flags);
+		map->gen = znd->bmkeys->generation & 0xfff;
+	}
+
+	return jrnl_lba;
+}
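+
+/*
+ * Example of the aging decision above (numbers for illustration): with
+ * znd->journal_age == 8, a page journalled at generation 0x120 and a
+ * current generation of 0x12b:
+ *
+ *	delta = (0x12b & 0xfff) - 0x120 = 11 > 8
+ *
+ * so the page is written back to its home location (map->lba) and
+ * leaves the write-back journal. A page still within the age window
+ * (abs(delta) <= journal_age) is re-journalled to a freshly acquired
+ * Z_AQ_META_STREAM block instead.
+ */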
+
+/**
+ * pg_update_crc() - Update CRC for page pg
+ * @znd: ZDM Instance
+ * @pg: Entry of lookup table or CRC page
+ * @md_crc: 16 bit crc of page.
+ *
+ * callback from dm_io notify.. cannot hold mutex here
+ */
+static void pg_update_crc(struct zdm *znd, struct map_pg *pg, __le16 md_crc)
+{
+	struct mpinfo mpi;
+
+	to_table_entry(znd, pg->lba, 0, &mpi);
+	if (pg->crc_pg) {
+		struct map_pg *crc_pg = pg->crc_pg;
+		int entry = mpi.crc.pg_idx;
+
+		if (crc_pg && crc_pg->data.crc) {
+			ref_pg(crc_pg);
+			if (crc_pg->data.crc[entry] != md_crc) {
+				crc_pg->data.crc[entry] = md_crc;
+				clear_bit(IS_READA, &crc_pg->flags);
+				set_bit(IS_DIRTY, &crc_pg->flags);
+				clear_bit(IS_FLUSH, &crc_pg->flags);
+				crc_pg->age = jiffies_64
+					    + msecs_to_jiffies(crc_pg->hotness);
+			}
+			if (crc_pg->lba != mpi.crc.lba)
+				Z_ERR(znd, "*** BAD CRC PG: %"PRIx64
+				      " != %" PRIx64,
+				      crc_pg->lba, mpi.crc.lba);
+
+			Z_DBG(znd, "Write if dirty (lut): %"
+				   PRIx64" -> %" PRIx64 " crc [%"
+				   PRIx64 ".%u] : %04x",
+			      pg->lba, pg->last_write,
+			      crc_pg->lba, mpi.crc.pg_idx, le16_to_cpu(md_crc));
+
+			deref_pg(crc_pg);
+		} else {
+			Z_ERR(znd, "**** What CRC Page !?!? %"PRIx64, pg->lba);
+		}
+		put_map_entry(crc_pg);
+
+	} else if (!test_bit(IS_LUT, &pg->flags)) {
+
+		Z_DBG(znd, "Write if dirty (crc): %"
+			   PRIx64" -> %" PRIx64 " crc[%u]:%04x",
+		      pg->lba, pg->last_write,
+		      mpi.crc.pg_idx, le16_to_cpu(md_crc));
+
+		znd->md_crcs[mpi.crc.pg_idx] = md_crc;
+	} else {
+		Z_ERR(znd, "unexpected state.");
+		dump_stack();
+	}
+}
+
+/**
+ * pg_journal_entry() - Add journal entry and flag in journal status.
+ * @znd: ZDM Instance
+ * @pg: The page of lookup table [or CRC] that was written.
+ * @gfp: Allocation scheme (not currently used here).
+ *
+ * Write has completed. Update index/map to the new location.
+ *
+ */
+static int pg_journal_entry(struct zdm *znd, struct map_pg *pg, gfp_t gfp)
+{
+	int rcode = 0;
+
+	if (pg->lba < znd->data_lba) {
+		u64 blba = 0ul; /* if not in journal clean map entry */
+
+		if (test_bit(IN_WB_JOURNAL, &pg->flags))
+			blba = pg->last_write;
+		rcode = md_journal_add_map(znd, pg->lba, blba);
+		if (rcode)
+			Z_ERR(znd, "%s: MD Journal failed.", __func__);
+		if (blba)
+			increment_used_blks(znd, blba, 1);
+	}
+	return rcode;
+}
+
+/**
+ * pg_written() - Handle accounting related to lookup table page writes
+ * @pg: The page of lookup table [or CRC] that was written.
+ * @error: non-zero if an error occurred.
+ *
+ * callback from dm_io notify.. cannot hold mutex here, cannot sleep.
+ */
+static int pg_written(struct map_pg *pg, unsigned long error)
+{
+	int rcode = 0;
+	struct zdm *znd = pg->znd;
+	__le16 md_crc;
+
+	if (error) {
+		Z_ERR(znd, "write_page: %" PRIx64 " -> %" PRIx64
+			   " ERR: %ld", pg->lba, pg->last_write, error);
+		rcode = -EIO;
+		goto out;
+	}
+
+	/*
+	 * Re-calculate CRC on current memory page. If unchanged then on-disk
+	 * is stable and in-memory is not dirty. Otherwise in memory changed
+	 * during write back so leave the dirty flag set. For the purpose of
+	 * the CRC table we assume that in-memory == on-disk although this
+	 * is not strictly true as the page could have updated post disk write.
+	 */
+
+	md_crc = crc_md_le16(pg->data.addr, Z_CRC_4K);
+	pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+	if (md_crc == pg->md_crc)
+		clear_bit(IS_DIRTY, &pg->flags);
+	else
+		Z_ERR(znd, "write: crc changed in flight.");
+
+	clear_bit(W_IN_FLIGHT, &pg->flags);
+	pg_update_crc(znd, pg, md_crc);
+
+	Z_DBG(znd, "write: %" PRIx64 " -> %" PRIx64 " -> %" PRIx64
+		   " crc:%04x [async]",
+	      pg->lba, pg->lba48_in, pg->last_write, le16_to_cpu(md_crc));
+
+	rcode = pg_journal_entry(znd, pg, GFP_ATOMIC);
+
+out:
+	return rcode;
+}
+
+/**
+ * on_pg_written() - A block of map table was written.
+ * @error: Any error code that occurred during the I/O.
+ * @context: The map_pg that was queued/written.
+ */
+static void on_pg_written(unsigned long error, void *context)
+{
+	struct map_pg *pg = context;
+	int rcode;
+
+	pg->last_write = pg->lba48_in;
+	rcode = pg_written(pg, error);
+	deref_pg(pg);
+	if (rcode < 0)
+		pg->znd->meta_result = rcode;
+}
+
+/**
+ * queue_pg() - Queue a map table page for writeback
+ * @znd: ZDM Instance
+ * @pg: The map table page to be written.
+ * @lba: The address to write the block to.
+ */
+static int queue_pg(struct zdm *znd, struct map_pg *pg, u64 lba)
+{
+	unsigned long flgs;
+	sector_t block = lba << Z_SHFT4K;
+	unsigned int nDMsect = 1 << Z_SHFT4K;
+	const int use_wq = 0;
+	int rc;
+
+	pg->znd = znd;
+	spin_lock_irqsave(&pg->md_lock, flgs);
+	pg->md_crc = crc_md_le16(pg->data.addr, Z_CRC_4K);
+	pg->lba48_in = lba;
+	spin_unlock_irqrestore(&pg->md_lock, flgs);
+
+	rc = znd_async_io(znd, DM_IO_KMEM, pg->data.addr, block, nDMsect,
+			  REQ_OP_WRITE, 0, use_wq, on_pg_written, pg);
+	if (rc) {
+		Z_ERR(znd, "queue error: %d Q: %" PRIx64 " [%u dm sect] (Q:%d)",
+		      rc, lba, nDMsect, use_wq);
+		dump_stack();
+	}
+
+	return rc;
+}
+
+/**
+ * cache_if_dirty() - Load a page of CRC's into memory.
+ * @znd: ZDM Instance
+ * @pg: The target page to ensure the cover CRC blocks is cached.
+ * @wq: If a queue is needed for I/O.
+ *
+ * The purpose of loading is to ensure the covering CRC pages are in memory
+ * so that when the async_io (write) completes the CRC accounting doesn't
+ * sleep and violate the callback() API rules.
+ */
+static void cache_if_dirty(struct zdm *znd, struct map_pg *pg, int wq)
+{
+	if (test_bit(IS_DIRTY, &pg->flags) && test_bit(IS_LUT, &pg->flags) &&
+	    pg->data.addr) {
+		unsigned long flgs;
+		struct map_pg *crc_pg;
+		struct mpinfo mpi;
+		const gfp_t gfp = GFP_ATOMIC;
+		const int async = 0; /* can this be async? */
+
+		to_table_entry(znd, pg->lba, 0, &mpi);
+		crc_pg = get_map_entry(znd, mpi.crc.lba, 4, async, 0, gfp);
+		if (!crc_pg) {
+			Z_ERR(znd, "Out of memory. No CRC Pg");
+			return;
+		}
+
+		if (pg->crc_pg)
+			return;
+
+		spin_lock_irqsave(&pg->md_lock, flgs);
+		if (!pg->crc_pg) {
+			ref_pg(crc_pg);
+			pg->crc_pg = crc_pg;
+			__smp_mb();
+		}
+		spin_unlock_irqrestore(&pg->md_lock, flgs);
+
+	}
+}
+
+/**
+ * write_if_dirty() - Write dirty (flagged as DIRTY) pages of the table map.
+ * @znd: ZDM instance.
+ * @pg: A page of table map data.
+ * @wq: Use worker queue for sync writes.
+ * @snc: Perform a sync (non-zero) or async write.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int write_if_dirty(struct zdm *znd, struct map_pg *pg, int wq, int snc)
+{
+	u32 nf;
+	u64 lba;
+	int rcode = 0;
+
+	if (!pg)
+		return rcode;
+
+	if (!test_bit(IS_DIRTY, &pg->flags) || !pg->data.addr)
+		goto out;
+
+	lba = z_metadata_lba(znd, pg, &nf);
+	if (lba && nf) {
+		int rcwrt;
+		int count = 1;
+		__le16 md_crc;
+
+		set_bit(W_IN_FLIGHT, &pg->flags);
+		if (!snc) {
+			rcode = queue_pg(znd, pg, lba);
+			goto out_queued;
+		}
+		md_crc = crc_md_le16(pg->data.addr, Z_CRC_4K);
+		rcwrt = write_block(znd, DM_IO_KMEM,
+				    pg->data.addr, lba, count, wq);
+		Z_DBG(znd, "write: %" PRIx64 " -> %" PRIx64 " -> %" PRIx64
+			   " crc:%04x [drty]", pg->lba, lba,
+		      pg->last_write, le16_to_cpu(md_crc));
+		if (rcwrt) {
+			Z_ERR(znd, "write_page: %" PRIx64 " -> %" PRIx64
+				   " ERR: %d", pg->lba, lba, rcwrt);
+			rcode = rcwrt;
+			goto out;
+		}
+		pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+		pg->last_write = pg->lba48_in = lba;
+		pg_update_crc(znd, pg, md_crc);
+
+		if (crc_md_le16(pg->data.addr, Z_CRC_4K) == md_crc)
+			clear_bit(IS_DIRTY, &pg->flags);
+		clear_bit(W_IN_FLIGHT, &pg->flags);
+		rcode = pg_journal_entry(znd, pg, GFP_ATOMIC);
+	} else {
+		Z_ERR(znd, "%s: Out of space for metadata?", __func__);
+		rcode = -ENOSPC;
+		goto out;
+	}
+
+out:
+	deref_pg(pg); /* ref'd by queue_pg */
+
+out_queued:
+	if (rcode < 0)
+		znd->meta_result = rcode;
+
+	return rcode;
+}
+
+/**
+ * _sync_dirty() - Write all *dirty* ZLT blocks to disk (journal->SYNC->home)
+ * @znd: ZDM instance
+ * @bit_type: MAP blocks then CRC blocks.
+ * @sync: If true write dirty blocks to disk
+ * @drop: Number of ZLT blocks to free.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int _sync_dirty(struct zdm *znd, int bit_type, int sync, int drop)
+{
+	struct map_pg *expg = NULL;
+	struct map_pg *_tpg;
+	struct map_pg **wset = NULL;
+	const u64 decr = msecs_to_jiffies(znd->cache_ageout_ms - 1);
+	LIST_HEAD(droplist);
+	unsigned long zflgs;
+	int err = 0;
+	int entries = 0;
+	int dlstsz = 0;
+
+	wset = ZDM_CALLOC(znd, sizeof(*wset), MAX_WSET, KM_19, GFP_KERNEL);
+	if (!wset) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&znd->zlt_lck, zflgs);
+	if (list_empty(&znd->zltpool))
+		goto writeback;
+
+	expg = list_last_entry(&znd->zltpool, typeof(*expg), zltlst);
+	if (!expg || &expg->zltlst == (&znd->zltpool))
+		goto writeback;
+
+	_tpg = list_prev_entry(expg, zltlst);
+	while (&expg->zltlst != &znd->zltpool) {
+		ref_pg(expg);
+
+		if (sync &&
+		    entries < MAX_WSET &&
+		    test_bit(WB_RE_CACHE, &expg->flags)) {
+			/*
+			 * Force inclusion to wset, where data will
+			 * conditionally be reloaded to core memory before
+			 * being scheduled for writing
+			 */
+			set_bit(IS_DIRTY, &expg->flags);
+			clear_bit(IS_FLUSH, &expg->flags);
+		}
+
+		if (sync && is_dirty(expg, bit_type)) {
+			if (entries < MAX_WSET) {
+				ref_pg(expg);
+				wset[entries] = expg;
+				clear_bit(WB_RE_CACHE, &expg->flags);
+				entries++;
+			}
+		} else if ((drop > 0 || low_cache_mem(znd)) &&
+			    is_old_and_clean(expg, bit_type)) {
+			int is_lut = test_bit(IS_LUT, &expg->flags);
+			unsigned long flags;
+			spinlock_t *lock;
+
+			lock = is_lut ? &znd->mapkey_lock : &znd->ct_lock;
+			spin_lock_irqsave(lock, flags);
+			if (getref_pg(expg) == 1) {
+				list_del(&expg->zltlst);
+				znd->in_zlt--;
+				clear_bit(IN_ZLT, &expg->flags);
+				if (drop > 0)
+					drop--;
+
+				expg->age = jiffies_64;
+				if (low_cache_mem(znd) && expg->age > decr)
+					expg->age -= decr;
+
+				if (test_bit(IS_LAZY, &expg->flags))
+					Z_ERR(znd, "** Pg is lazy && zlt %"
+					      PRIx64, expg->lba);
+
+				if (!test_bit(IS_LAZY, &expg->flags)) {
+					list_add(&expg->lazy, &droplist);
+					znd->in_lzy++;
+					set_bit(IS_LAZY, &expg->flags);
+					set_bit(IS_DROPPED, &expg->flags);
+					dlstsz++;
+				}
+				deref_pg(expg);
+				if (getref_pg(expg) > 0)
+					Z_ERR(znd, "Moving elv ref: %u",
+					      getref_pg(expg));
+			}
+			spin_unlock_irqrestore(lock, flags);
+		} else {
+			deref_pg(expg);
+		}
+		if (entries == MAX_WSET)
+			break;
+
+		expg = _tpg;
+		_tpg = list_prev_entry(expg, zltlst);
+	}
+
+writeback:
+	spin_unlock_irqrestore(&znd->zlt_lck, zflgs);
+
+	if (entries > 0) {
+		err = _pool_write(znd, wset, entries);
+		if (err < 0)
+			goto out;
+		if (entries == MAX_WSET)
+			err = -EBUSY;
+	}
+
+out:
+	if (!list_empty(&droplist))
+		lazy_pool_splice(znd, &droplist);
+
+	if (wset)
+		ZDM_FREE(znd, wset, sizeof(*wset) * MAX_WSET, KM_19);
+
+	return err;
+}
+
+/**
+ * _pool_handle_crc() - Wait for current map_pg's to be CRC verified.
+ * @znd: ZDM Instance
+ * @wset: Array of pages to wait on (check CRC's).
+ * @count: Number of entries.
+ *
+ *  NOTE: On entry all map_pg entries have elevated refcount from
+ *        md_handle_crcs().
+ */
+static int _pool_handle_crc(struct zdm *znd, struct map_pg **wset, int count)
+{
+	int iter;
+	struct map_pg *expg;
+	int err = 0;
+
+	/* wait for CRC checks on the collected table pages */
+	if (count <= 0)
+		goto out;
+
+	if (count > 1)
+		sort(wset, count, sizeof(*wset), wset_cmp_rd, NULL);
+
+	for (iter = 0; iter < count; iter++) {
+		expg = wset[iter];
+		if (expg) {
+			err = wait_for_map_pg(znd, expg, GFP_KERNEL);
+			deref_pg(expg);
+		}
+	}
+
+out:
+	return err;
+}
+
+/**
+ * md_handle_crcs() - Wait for pending CRC verification on ZLT pages.
+ * @znd: ZDM instance
+ *
+ * Collect pages flagged R_CRC_PENDING from the zltpool and wait for their
+ * CRC checks to complete via _pool_handle_crc().
+ *
+ * Return: 0 on success or -errno value
+ */
+static int md_handle_crcs(struct zdm *znd)
+{
+	int err = 0;
+	int entries = 0;
+	unsigned long zflgs;
+	struct map_pg *expg = NULL;
+	struct map_pg *_tpg;
+	struct map_pg **wset = NULL;
+
+	wset = ZDM_CALLOC(znd, sizeof(*wset), MAX_WSET, KM_19, GFP_KERNEL);
+	if (!wset) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&znd->zlt_lck, zflgs);
+	if (list_empty(&znd->zltpool))
+		goto writeback;
+
+	expg = list_last_entry(&znd->zltpool, typeof(*expg), zltlst);
+	if (!expg || &expg->zltlst == (&znd->zltpool))
+		goto writeback;
+
+	_tpg = list_prev_entry(expg, zltlst);
+	while (&expg->zltlst != &znd->zltpool) {
+		ref_pg(expg);
+		if (test_bit(R_CRC_PENDING, &expg->flags)) {
+			if (entries < MAX_WSET) {
+				ref_pg(expg);
+				wset[entries] = expg;
+				entries++;
+			}
+		}
+		deref_pg(expg);
+		if (entries == MAX_WSET)
+			break;
+
+		expg = _tpg;
+		_tpg = list_prev_entry(expg, zltlst);
+	}
+
+writeback:
+	spin_unlock_irqrestore(&znd->zlt_lck, zflgs);
+
+	if (entries > 0) {
+		err = _pool_handle_crc(znd, wset, entries);
+		if (err < 0)
+			goto out;
+		if (entries == MAX_WSET)
+			err = -EBUSY;
+	}
+
+out:
+	if (wset)
+		ZDM_FREE(znd, wset, sizeof(*wset) * MAX_WSET, KM_19);
+
+	return err;
+}
+
+/**
+ * sync_dirty() - Write all *dirty* ZLT blocks to disk (journal->SYNC->home)
+ * @znd: ZDM instance
+ * @bit_type: MAP blocks then CRC blocks.
+ * @sync: Write dirty blocks
+ * @drop: IN: # of pages to free.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int sync_dirty(struct zdm *znd, int bit_type, int sync, int drop)
+{
+	int err;
+
+	MutexLock(&znd->pool_mtx);
+	do {
+		err = _sync_dirty(znd, bit_type, sync, drop);
+		drop = 0;
+	} while (err == -EBUSY);
+
+	if (err > 0)
+		err = 0;
+	mutex_unlock(&znd->pool_mtx);
+
+	return err;
+}
+
+/**
+ * sync_mapped_pages() - Migrate lookup tables and crc pages to disk
+ * @znd: ZDM instance
+ * @sync: If dirty blocks need to be written.
+ * @drop: Number of blocks to drop.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int sync_mapped_pages(struct zdm *znd, int sync, int drop)
+{
+	int err;
+	int remove = drop ? 1 : 0;
+
+	if (low_cache_mem(znd) && (!sync || !drop)) {
+		sync = 1;
+		if (drop < 2048)
+			drop = 2048;
+	}
+
+	err = sync_dirty(znd, IS_LUT, sync, drop);
+
+	/* on error return */
+	if (err < 0)
+		return err;
+
+	/* TBD: purge CRC's on ref-count? */
+	err = sync_dirty(znd, IS_CRC, sync, remove);
+
+	return err;
+}
+
+/**
+ * map_addr_aligned() - Calculate lookup table location for a block.
+ * @znd: ZDM instance
+ * @dm_s: Logical sector (4k block), already offset-adjusted by the caller.
+ * @out: Calculated zone id, lookup table LBAs and page index.
+ *
+ * dm_s is a logical sector that maps 1:1 to the whole disk in 4k blocks.
+ * Here the logical LBA and field are calculated for the lookup table
+ * where the physical LBA can be read from disk.
+ */
+static int map_addr_aligned(struct zdm *znd, u64 dm_s, struct map_addr *out)
+{
+	u64 block	= dm_s >> 10;
+
+	out->zone_id	= block >> 6;
+	out->lut_s	= block + znd->s_base;
+	out->lut_r	= block + znd->r_base;
+	out->pg_idx	= dm_s & 0x3FF;
+
+
+	if (block > znd->map_count)
+		Z_ERR(znd, "%s: *** %llx > %x",
+			__func__, block, znd->map_count);
+
+	return 0;
+}
+
+/* -------------------------------------------------------------------------- */
+/* -------------------------------------------------------------------------- */
+
+/**
+ * map_addr_calc() - Calculate lookup table addresses for an origin block.
+ * @znd: ZDM instance
+ * @origin: address to calc
+ * @out: address, zone, crc, lut addr
+ *
+ * dm_s is a logical sector that maps 1:1 to the whole disk in 4k blocks
+ * Here the logical LBA and field are calculated for the lookup table
+ * where the physical LBA can be read from disk.
+ */
+static int map_addr_calc(struct zdm *znd, u64 origin, struct map_addr *out)
+{
+	u64 offset = znd->md_start;
+
+	out->dm_s = origin;
+	return map_addr_aligned(znd, origin - offset, out);
+}
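+
+/*
+ * Worked example (addresses for illustration only): for a 4k block at
+ * dm_s = origin - md_start = 0x12345:
+ *
+ *	block   = 0x12345 >> 10 = 0x48    (lookup table page)
+ *	zone_id = 0x48 >> 6     = 0x1
+ *	lut_s   = 0x48 + znd->s_base
+ *	lut_r   = 0x48 + znd->r_base
+ *	pg_idx  = 0x12345 & 0x3FF = 0x345 (entry within the page)
+ *
+ * i.e. each lookup table page covers 1024 blocks and each zone spans
+ * 64 lookup table pages (64 * 1024 * 4k = 256MiB).
+ */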
+
+/**
+ * crc_test_pg() - Test a pg against expected CRC
+ * @znd: ZDM instance
+ * @pg: Page of ZTL lookup table
+ * @gfp: allocation mask
+ * @mpi: Map information
+ */
+static int crc_test_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp,
+		       struct mpinfo *mpi)
+{
+	unsigned long mflgs;
+	int rcode = 0;
+	__le16 check;
+	__le16 expect = 0;
+
+	/*
+	 * Now check block crc
+	 */
+	check = crc_md_le16(pg->data.addr, Z_CRC_4K);
+	if (test_bit(IS_LUT, &pg->flags)) {
+		struct map_pg *crc_pg;
+		const int async = 0;
+
+		crc_pg = get_map_entry(znd, mpi->crc.lba, 1, async, 0, gfp);
+		if (crc_pg) {
+			ref_pg(crc_pg);
+			if (crc_pg->data.crc) {
+				spin_lock_irqsave(&crc_pg->md_lock, mflgs);
+				expect = crc_pg->data.crc[mpi->crc.pg_idx];
+				spin_unlock_irqrestore(&crc_pg->md_lock, mflgs);
+				crc_pg->age = jiffies_64
+					    + msecs_to_jiffies(crc_pg->hotness);
+			}
+			if (!pg->crc_pg) {
+				spin_lock_irqsave(&pg->md_lock, mflgs);
+				if (!pg->crc_pg) {
+					ref_pg(crc_pg);
+					pg->crc_pg = crc_pg;
+				}
+				spin_unlock_irqrestore(&pg->md_lock, mflgs);
+			}
+			deref_pg(crc_pg);
+		}
+		if (check != expect)
+			Z_ERR(znd, "Corrupt metadata (lut): %"
+				PRIx64" -> %" PRIx64 " -> %" PRIx64 " crc [%"
+				PRIx64 ".%u] : %04x != %04x",
+				pg->lba, pg->last_write, pg->lba48_in,
+				crc_pg->lba, mpi->crc.pg_idx,
+				le16_to_cpu(expect), le16_to_cpu(check));
+
+		put_map_entry(crc_pg);
+	} else {
+		expect = znd->md_crcs[mpi->crc.pg_idx];
+		if (check != expect)
+			Z_ERR(znd, "Corrupt metadata (CRC): %"
+				PRIx64" -> %" PRIx64 " crc [%u] : %04x != %04x",
+				pg->lba, pg->last_write,
+				mpi->crc.pg_idx,
+				le16_to_cpu(expect), le16_to_cpu(check));
+	}
+
+	if (check == expect) {
+		rcode = 1;
+	} else {
+		rcode = (pg->io_count < 3) ? -EBUSY : -EIO;
+		pg->io_count++;
+		pg->io_error = rcode;
+		Z_ERR(znd,
+		      "Corrupt metadata: %" PRIx64 " from %" PRIx64
+		      " [%04x != %04x (have)] flags: %lx",
+		      pg->lba, pg->lba48_in,
+		      le16_to_cpu(expect),
+		      le16_to_cpu(check),
+		      pg->flags);
+		dump_stack();
+	}
+
+	return rcode;
+}
+
+
+/**
+ * wait_for_map_pg() - Wait on bio of map_pg read ...
+ * @znd: ZDM instance
+ * @pg: Page to fill
+ * @gfp: Memory allocation rule
+ *
+ * Load a page of the sector lookup table that maps to pg->lba
+ * If pg->lba is not on disk return 0
+ *
+ * Return: 1 if page exists, 0 if unmodified, else -errno on error.
+ */
+static int wait_for_map_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp)
+{
+	int err = 0;
+
+	ref_pg(pg);
+
+	if (!pg->data.addr) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&pg->md_lock, flags);
+		if (!pg->data.addr && !test_bit(R_SCHED, &pg->flags))
+			err = -EBUSY;
+		spin_unlock_irqrestore(&pg->md_lock, flags);
+
+		if (err) {
+			Z_ERR(znd, "wait_for_map_pg %llx : %lx ... no page?",
+			      pg->lba, pg->flags);
+			dump_stack();
+			goto out;
+		}
+	}
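+
+	/*
+	 * A read is scheduled or in flight (R_SCHED/R_IN_FLIGHT/IS_ALLOC);
+	 * wait up to 15 seconds for completion before checking the CRC.
+	 */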
+	if (test_bit(R_IN_FLIGHT, &pg->flags) ||
+	    test_bit(R_SCHED, &pg->flags) ||
+	    test_bit(IS_ALLOC, &pg->flags)) {
+		unsigned long to;
+
+		to = wait_for_completion_io_timeout(&pg->event,
+						    msecs_to_jiffies(15000));
+		Z_DBG(znd, "read %" PRIx64 " to: %lu", pg->lba, to);
+		if (to) {
+			err = pg->io_error;
+			clear_bit(IS_ALLOC, &pg->flags);
+		} else {
+			err = -EBUSY;
+		}
+		if (err)
+			goto out;
+	}
+	if (test_and_clear_bit(R_CRC_PENDING, &pg->flags)) {
+		struct mpinfo mpi;
+
+		to_table_entry(znd, pg->lba, 0, &mpi);
+		err = crc_test_pg(znd, pg, gfp, &mpi);
+		if (err)
+			goto out;
+
+		pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+	}
+out:
+	deref_pg(pg);
+
+	return err;
+}
+
+/**
+ * _pg_read_complete() - Handle map_pg read complete
+ * @pg: Page of ZTL lookup table
+ * @err: I/O error
+ */
+static void _pg_read_complete(struct map_pg *pg, int err)
+{
+	pg->io_error = err;
+	if (err) {
+		pg->io_count++;
+	} else {
+		pg->age = jiffies_64;
+		set_bit(IS_FLUSH, &pg->flags);
+		set_bit(R_CRC_PENDING, &pg->flags);
+	}
+	clear_bit(R_IN_FLIGHT, &pg->flags);
+	complete_all(&pg->event);
+}
+
+/**
+ * map_pg_bio_endio() - async page complete handler
+ * @bio: block I/O structure
+ */
+static void map_pg_bio_endio(struct bio *bio)
+{
+	struct map_pg *pg = bio->bi_private;
+
+	ref_pg(pg);
+	_pg_read_complete(pg, bio->bi_error);
+	deref_pg(pg);
+	bio_put(bio);
+}
+
+/**
+ * read_pg() - Read a page of the lookup table (synchronous).
+ * @znd: ZDM instance
+ * @pg: Page to fill
+ * @gfp: memory allocation mask
+ */
+static int read_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp)
+{
+	const int count = 1;
+	const int wq = 1;
+	int rc;
+
+	ref_pg(pg);
+	rc = read_block(znd, DM_IO_KMEM,
+			pg->data.addr, pg->lba48_in, count, wq);
+	if (rc) {
+		Z_ERR(znd, "%s: read_block: ERROR: %d", __func__, rc);
+		pg->io_error = rc;
+		pg->io_count++;
+		goto out;
+	}
+	_pg_read_complete(pg, rc);
+	pool_add(pg->znd, pg);
+	rc = wait_for_map_pg(znd, pg, gfp);
+
+out:
+	deref_pg(pg);
+	return rc;
+}
+
+/**
+ * enqueue_pg() - Queue a lookup table page read (sync or async).
+ * @znd: ZDM instance
+ * @pg: Page to read
+ * @gfp: memory allocation mask
+ */
+static int enqueue_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp)
+{
+	int rc = -ENOMEM;
+	struct bio *bio;
+#if ENABLE_SEC_METADATA
+	sector_t sector;
+#endif
+
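+	/*
+	 * Use the synchronous read path when async queuing is disabled or
+	 * the caller is not using GFP_KERNEL allocations.
+	 */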
+	if (znd->queue_depth == 0 || gfp != GFP_KERNEL) {
+		rc = read_pg(znd, pg, gfp);
+		goto out;
+	}
+
+	bio = bio_alloc_bioset(gfp, 1, znd->bio_set);
+	if (bio) {
+		int len;
+
+		bio->bi_private = pg;
+		bio->bi_end_io = map_pg_bio_endio;
+
+#if ENABLE_SEC_METADATA
+		sector = pg->lba48_in << Z_SHFT4K;
+		bio->bi_bdev = znd_get_backing_dev(znd, &sector);
+		bio->bi_iter.bi_sector = sector;
+#else
+		bio->bi_iter.bi_sector = pg->lba48_in << Z_SHFT4K;
+		bio->bi_bdev = znd->dev->bdev;
+#endif
+		bio_set_op_attrs(bio, REQ_OP_READ, 0);
+		bio->bi_iter.bi_size = 0;
+		len = bio_add_km(bio, pg->data.addr, 1);
+		if (len) {
+			submit_bio(bio);
+			pool_add(pg->znd, pg);
+			rc = 0;
+		}
+	}
+
+out:
+	return rc;
+}
+
+/**
+ * empty_pg() - clear a page ZLT
+ * @znd: ZDM instance
+ * @pg: Page to 'empty'
+ */
+static int empty_pg(struct zdm *znd, struct map_pg *pg)
+{
+	int empty_val = test_bit(IS_LUT, &pg->flags) ? 0xff : 0;
+
+	memset(pg->data.addr, empty_val, Z_C4K);
+	set_bit(R_CRC_PENDING, &pg->flags);
+	clear_bit(R_IN_FLIGHT, &pg->flags);
+
+	return pool_add(znd, pg);
+}
+
+
+/**
+ * cache_pg() - Load a page of LUT/CRC into memory from disk, or default values.
+ * @znd: ZDM instance
+ * @pg: Page to fill
+ * @gfp: Memory allocation rule
+ * @mpi: Backing page locations.
+ *
+ * Return: 1 if page loaded from disk, 0 if empty, else -errno on error.
+ */
+static int cache_pg(struct zdm *znd, struct map_pg *pg, gfp_t gfp,
+		    struct mpinfo *mpi)
+{
+	unsigned long flags;
+	u64 lba48 = pg->lba;
+	void *kmem;
+	int rc = 0;
+	bool do_get_page = false;
+
+	ref_pg(pg);
+	spin_lock_irqsave(&pg->md_lock, flags);
+	__smp_mb();
+	if (!pg->data.addr) {
+		if (pg->lba < znd->data_lba) {
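+			/*
+			 * Metadata blocks (lba < data_lba) may be redirected
+			 * through the write-back journal; read from the
+			 * journal location when a mapping exists.
+			 */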
+			u64 at = z_lookup_journal_cache_nlck(znd, pg->lba);
+			if (at) {
+				set_bit(R_SCHED, &pg->flags);
+				pg->lba48_in = lba48 = at;
+				do_get_page = true;
+				if (!test_bit(IN_WB_JOURNAL, &pg->flags)) {
+					Z_ERR(znd,
+					      "Not jrnl flagged: %llx -> %llx",
+					      pg->lba, at);
+					pg->lba48_in = lba48 = pg->lba;
+				}
+			} else if (test_bit(IN_WB_JOURNAL, &pg->flags)) {
+				Z_ERR(znd, "jrnl flagged: %llx not in cache",
+				      pg->lba);
+			}
+		}
+		if (!do_get_page && test_and_clear_bit(IS_ALLOC, &pg->flags)) {
+			const bool trim = true;
+			const bool jrnl = false;
+
+			set_bit(R_SCHED, &pg->flags);
+			lba48 = _current_mapping(znd, pg->lba, trim, jrnl, gfp);
+			pg->lba48_in = lba48;
+			do_get_page = true;
+		} else if (!test_bit(R_SCHED, &pg->flags)) {
+			Z_ERR(znd, "IS_ALLOC not set and page is empty");
+			Z_ERR(znd, "cache_pg %llx : %lx ... no page?",
+			      pg->lba, pg->flags);
+			dump_stack();
+		}
+	}
+	spin_unlock_irqrestore(&pg->md_lock, flags);
+
+	if (!do_get_page)
+		goto out;
+
+	kmem = ZDM_ALLOC(znd, Z_C4K, PG_27, gfp);
+	if (kmem) {
+		spin_lock_irqsave(&pg->md_lock, flags);
+		__smp_mb();
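+		/*
+		 * Re-check under md_lock: another task may have installed a
+		 * backing page already; if so, free this allocation and bail.
+		 */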
+		if (unlikely(pg->data.addr)) {
+			ZDM_FREE(znd, kmem, Z_C4K, PG_27);
+			spin_unlock_irqrestore(&pg->md_lock, flags);
+			goto out;
+		}
+		pg->data.addr = kmem;
+		__smp_mb();
+		pg->znd = znd;
+		atomic_inc(&znd->incore);
+		pg->age = jiffies_64;
+		set_bit(R_IN_FLIGHT, &pg->flags);
+		clear_bit(R_SCHED, &pg->flags);
+		pg->io_error = 0;
+		spin_unlock_irqrestore(&pg->md_lock, flags);
+
+		if (lba48)
+			rc = enqueue_pg(znd, pg, gfp);
+		else
+			rc = empty_pg(znd, pg);
+
+		if (rc < 0) {
+			Z_ERR(znd, "%s: addr %" PRIx64 " error: %d",
+			      __func__, pg->lba, rc);
+			complete_all(&pg->event);
+			init_completion(&pg->event);
+			set_bit(IS_ALLOC, &pg->flags);
+			ZDM_FREE(znd, pg->data.addr, Z_C4K, PG_27);
+			atomic_dec(&znd->incore);
+			goto out;
+		}
+		pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+	} else {
+		spin_lock_irqsave(&pg->md_lock, flags);
+		clear_bit(R_SCHED, &pg->flags);
+		complete_all(&pg->event);
+		init_completion(&pg->event);
+		set_bit(IS_ALLOC, &pg->flags);
+		spin_unlock_irqrestore(&pg->md_lock, flags);
+		Z_ERR(znd, "%s: Out of memory.", __func__);
+		rc = -ENOMEM;
+	}
+
+out:
+	deref_pg(pg);
+	return rc;
+}
+
+/**
+ * z_lookup_table() - resolve a sector mapping via ZLT mapping
+ * @znd: ZDM Instance
+ * @addr: Address to resolve (via FWD map).
+ * @gfp: Current allocation flags.
+ */
+static u64 z_lookup_table(struct zdm *znd, u64 addr, gfp_t gfp)
+{
+	struct map_addr maddr;
+	struct map_pg *pg;
+	u64 tlba = 0;
+	const int async = 0;
+	const int noio = 0;
+	const int ahead = znd->cache_reada;
+	int err;
+
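+
+	/*
+	 * Resolve which forward lookup table page covers @addr, then read
+	 * its 32-bit entry for this sector.
+	 */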
+	map_addr_calc(znd, addr, &maddr);
+	pg = get_map_entry(znd, maddr.lut_s, ahead, async, noio, gfp);
+	if (pg) {
+		ref_pg(pg);
+		err = wait_for_map_pg(znd, pg, gfp);
+		if (err)
+			Z_ERR(znd, "%s: wait_for_map_pg -> %d", __func__, err);
+		if (pg->data.addr) {
+			unsigned long mflgs;
+			__le32 delta;
+
+			spin_lock_irqsave(&pg->md_lock, mflgs);
+			delta = pg->data.addr[maddr.pg_idx];
+			spin_unlock_irqrestore(&pg->md_lock, mflgs);
+			tlba = map_value(znd, delta);
+			pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+			clear_bit(IS_READA, &pg->flags);
+		}
+		deref_pg(pg);
+		put_map_entry(pg);
+	}
+	return tlba;
+}
+
+/**
+ * update_map_entry() - Migrate a memcache entry into a lookup table page.
+ * @znd: ZDM instance
+ * @pg: lookup table page to update.
+ * @maddr: map_addr
+ * @to_addr: LBA or sector #.
+ * @is_fwd: flag forward or reverse lookup table.
+ *
+ * When is_fwd is 1:
+ *  - maddr->dm_s is a sector -> lba.
+ *	 In this case the old lba is discarded and scheduled for cleanup
+ *	 by updating the reverse map lba tables, noting that this location
+ *	 is now unused.
+ * When is_fwd is 0:
+ *  - maddr->dm_s is an lba, lba -> dm_s.
+ *
+ * Return: non-zero on error.
+ */
+static int update_map_entry(struct zdm *znd, struct map_pg *pg,
+			    struct map_addr *maddr, u64 to_addr, int is_fwd)
+{
+	int err = -ENOMEM;
+
+	if (pg && pg->data.addr) {
+		u64 index = maddr->pg_idx;
+		unsigned long mflgs;
+		__le32 delta;
+		__le32 value;
+		int was_updated = 0;
+
+		ref_pg(pg);
+		spin_lock_irqsave(&pg->md_lock, mflgs);
+		delta = pg->data.addr[index];
+		err = map_encode(znd, to_addr, &value);
+		if (!err) {
+			/*
+			 * if the value is modified update the table and
+			 * place it on top of the active [zltlst] list
+			 * this will keep the chunk of lookup table in
+			 * memory.
+			 */
+			if (pg->data.addr[index] != value) {
+				pg->data.addr[index] = value;
+				pg->age = jiffies_64
+					    + msecs_to_jiffies(pg->hotness);
+				set_bit(IS_DIRTY, &pg->flags);
+				clear_bit(IS_FLUSH, &pg->flags);
+				clear_bit(IS_READA, &pg->flags);
+				was_updated = 1;
+			}
+		} else {
+			Z_ERR(znd, "*ERR* Mapping: %" PRIx64 " to %" PRIx64,
+			      to_addr, maddr->dm_s);
+		}
+		spin_unlock_irqrestore(&pg->md_lock, mflgs);
+
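+		/*
+		 * A forward-map update that replaced a prior mapping frees
+		 * the old physical block; record it via unused_add().
+		 */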
+		if (was_updated && is_fwd && (delta != MZTEV_UNUSED)) {
+			u64 old_phy = map_value(znd, delta);
+
+			err = unused_add(znd, old_phy, to_addr, 1, GFP_ATOMIC);
+		}
+		deref_pg(pg);
+	} else {
+		if (!pg)
+			Z_DBG(znd, "%s: no page?", __func__);
+		else if (!pg->data.addr)
+			Z_ERR(znd, "%s: %llx no data?", __func__, pg->lba);
+	}
+	return err;
+}
+
+/**
+ * __cached_to_tables() - Migrate map cache entries to ZLT
+ * @znd: ZDM instance
+ * @zone: zone to force migration for a partial memcache block
+ * @gfp: allocation mask
+ *
+ * Scan the memcache and move any full blocks to the lookup tables.
+ * If the partial memcache block contains lbas that map to @zone, force
+ * early migration of that block to ensure it is properly accounted
+ * for and migrated during an upcoming GC pass.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int __cached_to_tables(struct zdm *znd, u32 zone, gfp_t gfp)
+{
+	struct map_cache_page *wpg = NULL;
+	struct map_pg **wset = NULL;
+	unsigned long iflgs;
+	int err = -ENOMEM;
+	int once = 1;
+	int moved;
+
+	wpg = ZDM_ALLOC(znd, sizeof(*wpg), PG_09, gfp);
+	wset = ZDM_CALLOC(znd, sizeof(*wset), MAX_WSET, KM_19, gfp);
+	if (!wset || !wpg) {
+		Z_ERR(znd, "%s: ENOMEM @ %d", __func__, __LINE__);
+		goto out;
+	}
+
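+	/*
+	 * Each batch below runs in two passes: an I/O pass (wset supplied)
+	 * that faults in and references the needed ZLT pages, then a locked
+	 * no-I/O pass that applies the updates and sort-merges the consumed
+	 * entries out of the map pool.
+	 */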
+	err = 0;
+	once = znd->unused->count;
+	while (once || znd->unused->count > 5) {
+		struct map_pool *mp;
+		int moving = ARRAY_SIZE(wpg->maps);
+
+		spin_lock_irqsave(&znd->unused_rwlck, iflgs);
+		moving = min(moving, znd->unused->count);
+		moving = min(moving, 30);
+		working_set(wpg->maps, znd->unused, moving, 1);
+		spin_unlock_irqrestore(&znd->unused_rwlck, iflgs);
+
+		err = zlt_move_unused(znd, wpg->maps, moving, wset, MAX_WSET);
+		if (err < 0) {
+			deref_all_pgs(znd, wset, MAX_WSET);
+			if (err == -ENOMEM)
+				err = -EBUSY;
+			goto out;
+		}
+
+		mp_grow(znd->unused, znd->_use, gfp);
+		spin_lock_irqsave(&znd->unused_rwlck, iflgs);
+		working_set(wpg->maps, znd->unused, moving, 0);
+		err = zlt_move_unused(znd, wpg->maps, moving, NULL, 0);
+		if (err < 0) {
+			if (err == -ENOMEM)
+				err = -EBUSY;
+			spin_unlock_irqrestore(&znd->unused_rwlck, iflgs);
+			deref_all_pgs(znd, wset, MAX_WSET);
+			goto out;
+		}
+		mp = mp_pick(znd->unused, znd->_use);
+		moved = znd->unused->count;
+		err = do_sort_merge(mp, znd->unused, wpg->maps, moving, 1);
+		if (unlikely(err)) {
+			Z_ERR(znd, "USortMerge failed: %d [%d]", err, __LINE__);
+			err = -EBUSY;
+		} else {
+			znd->unused = mp;
+			__smp_mb();
+		}
+		moved -= znd->unused->count;
+		spin_unlock_irqrestore(&znd->unused_rwlck, iflgs);
+		memset(wpg, 0, sizeof(*wpg));
+		deref_all_pgs(znd, wset, MAX_WSET);
+		once = 0;
+		Z_DBG(znd, "Moved %d unused extents.", moved);
+		if (moved == 0 && err != -EBUSY)
+			break;
+	}
+
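+	/* Migrate ingress extents to the lookup tables in the same manner. */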
+	once = znd->ingress->count;
+	while (once || znd->ingress->count > 80) {
+		struct map_pool *mp;
+		int moving = ARRAY_SIZE(wpg->maps);
+
+		spin_lock_irqsave(&znd->in_rwlck, iflgs);
+		moving = min(moving, znd->ingress->count);
+		moving = min(moving, 60);
+
+		if (low_cache_mem(znd))
+			moving = 2;
+		else if (znd->ingress->count < 2048)
+			moving = min(moving, 15);
+
+		working_set(wpg->maps, znd->ingress, moving, 1);
+		spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+		err = move_to_map_tables(znd, wpg->maps, moving, wset, MAX_WSET);
+		if (err < 0) {
+			if (err == -ENOMEM)
+				err = -EBUSY;
+			deref_all_pgs(znd, wset, MAX_WSET);
+			goto out;
+		}
+
+		mp_grow(znd->ingress, znd->in, gfp);
+		spin_lock_irqsave(&znd->in_rwlck, iflgs);
+		working_set(wpg->maps, znd->ingress, moving, 0);
+		err = move_to_map_tables(znd, wpg->maps, moving, NULL, 0);
+		if (err < 0) {
+			if (err == -ENOMEM)
+				err = -EBUSY;
+			spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+			deref_all_pgs(znd, wset, MAX_WSET);
+			goto out;
+		}
+		mp = mp_pick(znd->ingress, znd->in);
+		moved = znd->ingress->count;
+		err = do_sort_merge(mp, znd->ingress, wpg->maps, moving, 1);
+		if (unlikely(err)) {
+			Z_ERR(znd, "ISortMerge failed: %d [%d]", err, __LINE__);
+			err = -EBUSY;
+		} else {
+			znd->ingress = mp;
+			__smp_mb();
+		}
+		moved -= znd->ingress->count;
+		spin_unlock_irqrestore(&znd->in_rwlck, iflgs);
+		memset(wpg, 0, sizeof(*wpg));
+		deref_all_pgs(znd, wset, MAX_WSET);
+		once = 0;
+		Z_DBG(znd, "Moved %d ingress extents.", moved);
+		if (moved == 0 && err != -EBUSY)
+			break;
+	}
+
+out:
+	if (wset)
+		ZDM_FREE(znd, wset, sizeof(*wset) * MAX_WSET, KM_19);
+	if (wpg)
+		ZDM_FREE(znd, wpg, Z_C4K, PG_09);
+
+	return err;
+}
+
+/**
+ * _cached_to_tables() - Migrate memcache entries to lookup tables
+ * @znd: ZDM instance
+ * @zone: zone to force migration for a partial memcache block
+ * @gfp: allocation mask
+ *
+ * Scan the memcache and move any full blocks to the lookup tables.
+ * If the partial memcache block contains lbas that map to @zone, force
+ * early migration of that block to ensure it is properly accounted
+ * for and migrated during an upcoming GC pass.
+ *
+ * Return: 0 on success or -errno value
+ */
+static int _cached_to_tables(struct zdm *znd, u32 zone, gfp_t gfp)
+{
+	return __cached_to_tables(znd, zone, gfp);
+}
+
+/**
+ * __do_gme_io() - Get a lookup table page entry, retrying a few times on failure.
+ * @znd: ZDM instance
+ * @lba: address of lookup table to retrieve.
+ * @ahead: Number of blocks to read ahead.
+ * @async: Issue async request
+ * @gfp: Allocation mask
+ */
+static struct map_pg *__do_gme_io(struct zdm *znd, u64 lba,
+			          int ahead, int async, gfp_t gfp)
+{
+	struct map_pg *pg;
+	int retries = 5;
+
+	do {
+		pg = do_gme_io(znd, lba, ahead, async, gfp);
+	} while (!pg && --retries > 0);
+
+	return pg;
+}
+
+/**
+ * get_map_entry() - Get map_pool lookup table page entry.
+ * @znd: ZDM instance
+ * @lba: address of lookup table to retrieve.
+ * @ahead: Number of blocks to read ahead.
+ * @async: Issue async request
+ * @noio: Do not issue I/O to disk, only retrieve from cache
+ * @gfp: Allocation mask
+ */
+static struct map_pg *get_map_entry(struct zdm *znd, u64 lba,
+				    int ahead, int async, int noio, gfp_t gfp)
+{
+	struct map_pg *pg;
+
+	if (noio)
+		pg = gme_noio(znd, lba);
+	else
+		pg = __do_gme_io(znd, lba, ahead, async, gfp);
+
+	return pg;
+}
+
+
+/**
+ * move_to_map_tables() - Migrate map_pool entries to fwd/rev table entries.
+ * @znd: ZDM instance
+ * @maps: map cache entries to migrate.
+ * @count: number of entries.
+ * @pgs: array of ref'd pages to build (I/O pass) or NULL (no-I/O pass).
+ * @npgs: size of the pgs array.
+ *
+ * Return: non-zero on error.
+ */
+static int move_to_map_tables(struct zdm *znd, struct map_cache_entry *maps,
+			      int count, struct map_pg **pgs, int npgs)
+{
+	struct map_pg *smtbl = NULL;
+	struct map_pg *rmtbl = NULL;
+	struct map_addr maddr = { .dm_s = 0ul };
+	struct map_addr rev = { .dm_s = 0ul };
+	u64 lut_s = BAD_ADDR;
+	u64 lut_r = BAD_ADDR;
+	int err = 0;
+	int is_fwd = 1;
+	int idx;
+	int cpg = 0;
+	const int async = 0;
+	int noio = 1;
+	gfp_t gfp = GFP_ATOMIC;
+
+	if (pgs) {
+		noio = 0;
+		gfp = GFP_KERNEL;
+	}
+
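+	/*
+	 * With a page array supplied this is the I/O pass: ZLT pages are
+	 * faulted in and referenced only. On the no-I/O pass the forward
+	 * and reverse table entries are actually updated.
+	 */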
+	for (idx = 0; idx < count; idx++) {
+		int e;
+		u32 flags;
+		u32 extent;
+		u64 addr = le64_to_lba48(maps[idx].tlba, &flags);
+		u64 blba = le64_to_lba48(maps[idx].bval, &extent);
+		u64 mapto;
+
+		if (maps[idx].tlba == 0 && maps[idx].bval == 0)
+			continue;
+
+		if (flags & MCE_NO_ENTRY)
+			continue;
+
+		for (e = 0; e < extent; e++) {
+			if (addr) {
+				map_addr_calc(znd, addr, &maddr);
+				if (lut_s != maddr.lut_s) {
+					if (smtbl)
+						deref_pg(smtbl);
+					put_map_entry(smtbl);
+					smtbl = get_map_entry(znd, maddr.lut_s,
+							      4, async, noio,
+							      gfp);
+					if (!smtbl) {
+						if (noio)
+							break;
+						err = -ENOMEM;
+						goto out;
+					}
+					if (pgs && cpg < npgs) {
+						pgs[cpg] = smtbl;
+						ref_pg(smtbl);
+						cpg++;
+					}
+					ref_pg(smtbl);
+					lut_s = smtbl->lba;
+				}
+				is_fwd = 1;
+				mapto = blba ? blba : BAD_ADDR;
+				if (noio)
+					err = update_map_entry(znd, smtbl,
+							       &maddr, mapto,
+							       is_fwd);
+				if (err < 0)
+					goto out;
+			}
+			if (blba) {
+				map_addr_calc(znd, blba, &rev);
+				if (lut_r != rev.lut_r) {
+					if (rmtbl)
+						deref_pg(rmtbl);
+					put_map_entry(rmtbl);
+					rmtbl = get_map_entry(znd, rev.lut_r, 4,
+							   async, noio, gfp);
+					if (!rmtbl) {
+						if (noio)
+							break;
+						err = -ENOMEM;
+						goto out;
+					}
+					if (pgs && cpg < npgs) {
+						pgs[cpg] = rmtbl;
+						ref_pg(rmtbl);
+						cpg++;
+					}
+					ref_pg(rmtbl);
+					lut_r = rmtbl->lba;
+				}
+				is_fwd = 0;
+				if (noio)
+					err = update_map_entry(znd, rmtbl, &rev,
+							       addr, is_fwd);
+				if (err == 1)
+					err = 0;
+				blba++;
+			}
+			if (addr)
+				addr++;
+			if (err < 0)
+				goto out;
+		}
+		if (noio && e == extent) {
+			u32 count;
+			u32 flgs;
+
+			addr = le64_to_lba48(maps[idx].tlba, &flgs);
+			blba = le64_to_lba48(maps[idx].bval, &count);
+			if (count == extent) {
+				flgs = MCE_NO_ENTRY;
+				count = 0;
+				maps[idx].bval = lba48_to_le64(count, blba);
+				maps[idx].tlba = lba48_to_le64(flgs, addr);
+			}
+		}
+	}
+out:
+	if (!err && !noio)
+		cpg = ref_crc_pgs(pgs, cpg, npgs);
+
+	if (smtbl)
+		deref_pg(smtbl);
+	if (rmtbl)
+		deref_pg(rmtbl);
+	put_map_entry(smtbl);
+	put_map_entry(rmtbl);
+	set_bit(DO_MEMPOOL, &znd->flags);
+
+	return err;
+}
+
+/**
+ * __move_unused() - Discard an overwritten block from the reverse map.
+ * @znd: ZDM instance
+ * @blba: the LBA to update in the reverse map.
+ * @plut_r: LBA of the current *_pg.
+ * @_pg: The active map_pg to update (or cache)
+ * @pgs: array of ref'd pages to build.
+ * @pcpg: current index into pgs.
+ * @npgs: size of pgs array.
+ */
+static int __move_unused(struct zdm *znd,
+			 u64 blba, u64 *plut_r, struct map_pg **_pg,
+			 struct map_pg **pgs, int *pcpg, int npgs)
+{
+	struct map_addr maddr;
+	struct map_pg *pg = *_pg;
+	u64 lut_r = *plut_r;
+	unsigned long mflgs;
+	int cpg = *pcpg;
+	int err = 0;
+	const int async = 0;
+	const int noio = pgs ? 0 : 1;
+	gfp_t gfp = GFP_KERNEL;
+
+	if (blba < znd->data_lba)
+		goto out;
+
+	map_addr_calc(znd, blba, &maddr);
+	if (lut_r != maddr.lut_r) {
+		put_map_entry(pg);
+		if (pg)
+			deref_pg(pg);
+
+		pg = get_map_entry(znd, maddr.lut_r, 4, async, noio, gfp);
+		if (!pg) {
+			err = noio ? -EBUSY : -ENOMEM;
+			goto out;
+		}
+		if (pgs && cpg < npgs) {
+			pgs[cpg] = pg;
+			ref_pg(pg);
+			cpg++;
+		}
+		ref_pg(pg);
+		lut_r = pg->lba;
+	}
+
+	/* on i/o pass .. only load and ref the map_pg's */
+	if (pgs)
+		goto out;
+
+	if (!pg) {
+		Z_ERR(znd, "No PG for %llx?", blba);
+		Z_ERR(znd, "  ... maddr.lut_r %llx [%llx]", maddr.lut_r, lut_r);
+		Z_ERR(znd, "  ... maddr.dm_s  %llx", maddr.dm_s);
+		Z_ERR(znd, "  ... maddr.pg_idx %x", maddr.pg_idx);
+
+		err = noio ? -EBUSY : -ENOMEM;
+		goto out;
+	}
+
+	ref_pg(pg);
+	spin_lock_irqsave(&pg->md_lock, mflgs);
+	if (!pg->data.addr) {
+		Z_ERR(znd, "PG w/o DATA !?!? %llx", pg->lba);
+		err = noio ? -EBUSY : -ENOMEM;
+		goto out_unlock;
+	}
+	if (maddr.pg_idx >= 1024) {
+		Z_ERR(znd, "Invalid pg index? %u", maddr.pg_idx);
+		err = noio ? -EBUSY : -ENOMEM;
+		goto out_unlock;
+	}
+
+	/*
+	 * if the value is modified update the table and
+	 * place it on top of the active [zltlst] list
+	 */
+	if (pg->data.addr[maddr.pg_idx] != MZTEV_UNUSED) {
+		unsigned long wflgs;
+		u32 gzno  = maddr.zone_id >> GZ_BITS;
+		u32 gzoff = maddr.zone_id & GZ_MMSK;
+		struct meta_pg *wpg = &znd->wp[gzno];
+		u32 wp;
+		u32 zf;
+		u32 stream_id;
+
+		pg->data.addr[maddr.pg_idx] = MZTEV_UNUSED;
+		pg->age = jiffies_64 + msecs_to_jiffies(pg->hotness);
+		set_bit(IS_DIRTY, &pg->flags);
+		clear_bit(IS_READA, &pg->flags);
+		clear_bit(IS_FLUSH, &pg->flags);
+
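+		/*
+		 * Account for the freed block: bump the zone's free-block
+		 * estimate and flag the write pointer for recalculation.
+		 */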
+		spin_lock_irqsave(&wpg->wplck, wflgs);
+		wp = le32_to_cpu(wpg->wp_alloc[gzoff]);
+		zf = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_VALUE_MASK;
+		stream_id = le32_to_cpu(wpg->zf_est[gzoff]) & Z_WP_STREAM_MASK;
+		if (wp > 0 && zf < Z_BLKSZ) {
+			zf++;
+			wpg->zf_est[gzoff] = cpu_to_le32(zf | stream_id);
+			wpg->wp_alloc[gzoff] = cpu_to_le32(wp | Z_WP_RRECALC);
+			set_bit(IS_DIRTY, &wpg->flags);
+			clear_bit(IS_FLUSH, &wpg->flags);
+		}
+		spin_unlock_irqrestore(&wpg->wplck, wflgs);
+		if ((wp & Z_WP_VALUE_MASK) == Z_BLKSZ)
+			znd->discard_count++;
+	} else {
+		Z_DBG(znd, "lba: %" PRIx64 " already reported as free?", blba);
+	}
+out_unlock:
+	spin_unlock_irqrestore(&pg->md_lock, mflgs);
+	deref_pg(pg);
+
+out:
+	*plut_r = lut_r;
+	*pcpg  = cpg;
+	*_pg = pg;
+
+	return err;
+}
+
+/**
+ * zlt_move_unused() - Move unused map_pool entries to the reverse map table
+ * @znd: ZDM instance
+ * @maps: Entries to be moved
+ * @count: Number of entries
+ * @pgs: array of ref'd pages to build.
+ * @npgs: size of pgs array.
+ */
+static int zlt_move_unused(struct zdm *znd, struct map_cache_entry *maps,
+			   int count, struct map_pg **pgs, int npgs)
+{
+	struct map_pg *pg = NULL;
+	u64 lut_r = BAD_ADDR;
+	int err = 0;
+	int cpg = 0;
+	int idx;
+
+	/* The journal being moved must remain stable, so sorting
+	 * is disabled. If a sort is desired due to an unsorted
+	 * page, the search devolves to a linear lookup.
+	 */
+	for (idx = 0; idx < count; idx++) {
+		int e;
+		u32 flags;
+		u32 extent;
+		u64 blba = le64_to_lba48(maps[idx].tlba, &flags);
+		u64 from = le64_to_lba48(maps[idx].bval, &extent);
+
+		if (maps[idx].tlba == 0 && maps[idx].bval == 0)
+			continue;
+
+		if (!blba) {
+			Z_ERR(znd, "%d - bogus rmap entry", idx);
+			continue;
+		}
+
+		for (e = 0; e < extent; e++) {
+			err = __move_unused(znd, blba + e, &lut_r, &pg,
+					    pgs, &cpg, npgs);
+			if (err < 0)
+				goto out;
+		}
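+		/*
+		 * On the no-I/O pass, mark the processed entry MCE_NO_ENTRY
+		 * with a zero extent so it is skipped from now on.
+		 */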
+		if (!pgs) {
+			flags |= MCE_NO_ENTRY;
+			extent = 0;
+			maps[idx].tlba = lba48_to_le64(flags, blba);
+			maps[idx].bval = lba48_to_le64(extent, from);
+		}
+	}
+out:
+	cpg = ref_crc_pgs(pgs, cpg, npgs);
+
+	if (pg)
+		deref_pg(pg);
+	put_map_entry(pg);
+	set_bit(DO_MEMPOOL, &znd->flags);
+
+	return err;
+}
-- 
2.10.2

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel



