On Fri, Jan 31, 2014 at 11:08 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
>
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
>
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.
>
>                                         - Ted
>
>
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.  That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.)
>  *
>  * This is large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t z_start;
>        __u32    z_length;
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_checkpoint_offset; /* offset */
>        __u32    z_flags;             /* full, ro, offline, reset_requested */
> };
>
> #define Z_FLAGS_FULL            0x0001
> #define Z_FLAGS_OFFLINE         0x0002
> #define Z_FLAGS_RO              0x0004
> #define Z_FLAG_RESET_REQUESTED  0x0008
>
> #define Z_FLAG_TYPE_MASK         0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL   0x0100
>
>
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.
>  * If free_sectors_criteria is negative, then return the zones that
>  * match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
>                               sector_t start_sector,
>                               int free_sectors_criteria,
>                               struct zone_status *ret_zones,
>                               int max_zones);

In this API, the caller would allocate the memory for ret_zones as
sizeof(struct zone_status) * max_zones, right? The return value can be
less than max_zones, in which case we would have preallocated memory for
(max_zones - return value) entries that never get used, since they would
not contain valid zone_status structs. As the HDD ages it becomes more
prone to failures, so cases where the two values differ can happen.

Could we pass a double pointer for ret_zones instead, so that the API
allocates the memory and the caller frees it? I would like to know your
views on this. This concern would not apply to the single zone_status
example that you gave.

>
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_device *bdev,
>                                  sector_t start_sector);
>
>
> /*
>  * ----------------------------
>  */
>
> /*
>  * The zone_status structure could be a lot smaller if zones are a
>  * constant fixed size, then we could address zones using a 16-bit
>  * integer, instead of using a 64-bit starting lba number then this
>  * structure could be half the size (12 bytes).
>  *
>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.
>  * The only time they will be different is after a write is interrupted
>  * via an unexpected power removal.
>  *
>  * With the smaller structure, we could fit all of the zones in an 8TB
>  * SMR drive in 256k, which maybe we could afford to vmalloc()
>  */
> struct simplified_zone_status {
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_flags;
> };
>
> /* add a new flag */
> #define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

- Regards,
Rohan