Re: [RFC] Draft Linux kernel interfaces for ZBC drives

On Fri, Jan 31, 2014 at 11:08 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
>
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
>
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.
>
>                                                         - Ted
>
>
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.   That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.)
>  *
>  * This is large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t z_start;
>        __u32    z_length;
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_checkpoint_offset; /* offset */
>        __u32    z_flags;             /* full, ro, offline, reset_requested */
> };
>
> #define Z_FLAGS_FULL            0x0001
> #define Z_FLAGS_OFFLINE         0x0002
> #define Z_FLAGS_RO              0x0004
> #define Z_FLAG_RESET_REQUESTED  0x0008
>
> #define Z_FLAG_TYPE_MASK        0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL  0x0100
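
Just to check that I am reading the flag layout right: the low bits are
state and the zone type lives in its own nibble, so a test like the
following (my own sketch, not from your proposal; zs points at one of
the returned zone_status entries) would pick out sequential zones that
can still accept writes?

	if ((zs->z_flags & Z_FLAG_TYPE_MASK) == Z_FLAG_TYPE_SEQUENTIAL &&
	    !(zs->z_flags & (Z_FLAGS_FULL | Z_FLAGS_OFFLINE | Z_FLAGS_RO)))
		/* zone is sequential and still writable */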
>
>
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
>                               sector_t start_sector,
>                               int free_sectors_criteria,
>                               struct zone_status *ret_zones,
>                               int max_zones);
In this API, the caller would allocate the memory for ret_zones as
sizeof(struct zone_status) * max_zones, right?  When the return value
is less than max_zones, the remaining (max_zones - retval) entries are
memory we allocated but never use, since they contain no valid
zone_status structs.  As the drive ages and zones go read-only or
offline, such mismatches between the two values become more likely.
Could we instead pass a double pointer for ret_zones, so that the API
allocates exactly as much memory as it needs and the caller frees it?
I would like to know your views on this.  (The concern obviously does
not apply when querying a single zone_status, as in your example.)
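
To make that concrete, here is a sketch of the allocating variant I
have in mind (the _alloc name and everything outside the existing
prototype is hypothetical; error handling elided, and I am assuming a
kmalloc-family allocation so the caller uses kfree):

	/*
	 * Hypothetical variant: the function sizes and allocates
	 * *ret_zones itself and the caller frees it.  Returns the
	 * number of matching zones, or a negative errno.
	 */
	extern int blkdev_query_zones_alloc(struct block_device *bdev,
					    sector_t start_sector,
					    int free_sectors_criteria,
					    struct zone_status **ret_zones);

	/* caller side */
	struct zone_status *zones;
	int nr = blkdev_query_zones_alloc(bdev, 0, -3, &zones);

	if (nr < 0)
		return nr;
	/* ... use zones[0] through zones[nr - 1] ... */
	kfree(zones);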

>
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
>                                  sector_t start_sector);
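
For my own understanding, here is how I picture a file system using
the two calls together: first picking a zone for a new segment, then
rewinding it once garbage collection has migrated all live data out.
Only the two prototypes are from your proposal; the constants and the
surrounding code are invented:

	struct zone_status zone;
	sector_t write_pos;
	int n, err;

	/* Find the first zone with at least 1024 free sectors. */
	n = blkdev_query_zones(bdev, 0, 1024, &zone, 1);
	if (n != 1)
		return -ENOSPC;

	/* Sequential zones must be written exactly at the write pointer. */
	write_pos = zone.z_start + zone.z_write_ptr_offset;
	/* ... submit sequential writes starting at write_pos ... */

	/* Much later, once no live data remains in the zone. */
	err = blkdev_reset_zone_ptr(bdev, zone.z_start);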
>
>
> /*
>  * ----------------------------
>  */
>
> /*
>  * The zone_status structure could be a lot smaller.  If zones were a
>  * constant, fixed size, then we could address zones using a 16-bit
>  * integer instead of a 64-bit starting LBA number, and this
>  * structure could be half the size (12 bytes).
>  *
>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>  * only time they will be different is after a write is interrupted
>  * via an unexpected power removal.
>  *
>  * With the smaller structure, we could fit all of the zones in an 8TB
>  * SMR drive in 256k, which maybe we could afford to vmalloc()
>  */
> struct simplified_zone_status {
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_flags;
> };
>
> /* add a new flag */
> #define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */
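
A nice property of the fixed-size scheme, if it comes to pass: with
z_start and z_length gone, the mapping between a zone number and its
starting sector becomes pure arithmetic.  For 256 MB zones and
512-byte sectors that is a shift by 19 (the helper names below are
mine, not part of the proposal):

	#define ZONE_SECTOR_SHIFT	19	/* 2^28 bytes / 2^9 bytes per sector */

	static inline sector_t zone_to_sector(u32 zone_nr)
	{
		return (sector_t)zone_nr << ZONE_SECTOR_SHIFT;
	}

	static inline u32 sector_to_zone(sector_t sector)
	{
		return sector >> ZONE_SECTOR_SHIFT;
	}

And indeed 32,768 zones times the 8-byte struct above is 256k, which
matches your vmalloc() estimate.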

- Regards,
     Rohan