Re: [RFC] Draft Linux kernel interfaces for ZBC drives

On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
> > extern int blkdev_query_zones(struct block_device *bdev,
> >                               sector_t start_sector,
> >                               int free_sectors_criteria,
> >                               struct zone_status *ret_zones,
> >                               int max_zones);
>
> In this api, the caller would allocate the memory for ret_zones as
> sizeof(struct zone_status) * max_zones, right? There can be a case
> where return value is less than max_zones, in this case we would be
> preallocating extra memory for (max_zones - ret val) that would not be
> used (since they would not contain valid zone_status structs). As the
> hdd ages, it can be prone to failures, instances of differences of the
> two values can happen. Can we pass a double pointer to ret_zones, so
> that the api allocates the memory and the caller can free it? Would
> like to know your views on this. This thing will be invalid for the
> single zone_status example that you gave.

I think you are making the assumption here that max_zones will
normally be the maximum number of zone available to the disk.  In
practice, this will never be true.  Consider that an 8TB SMR drive with
256 MB zones will have 32,768 zones.  The kernel will *not* want to
allocate 768k of non-swappable kernel memory on a regular basis.
(There is no guarantee there will be that number of contiguous pages
available, and if you use vmalloc() instead, it's slower since it
involves page table operations.)  Also, when will the kernel ever want
to see all of the zones all at once, anyway?

So it's likely that the caller will always be allocating a relatively
small number of zones (I suspect it will always be fewer than 128), and
if the caller needs more zones, it will simply call
blkdev_query_zones() with a larger start_sector value and get the next
128 zones.
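That chunked pattern can be sketched in userspace C.  Everything below
is hypothetical (the zone_status fields, the fake query routine, the
constants); it just models a caller that allocates one small
fixed-size batch and resumes each query from the last zone returned:

```c
/* Userspace sketch of chunked zone iteration; all names hypothetical. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define ZONE_SECTORS 524288u   /* 256 MB zones with 512-byte sectors */
#define NUM_ZONES    1024u     /* model a small drive */
#define QUERY_BATCH  128       /* small caller-allocated batch */

struct zone_status {
    sector_t start;       /* first sector of the zone */
    uint32_t length;      /* zone length in sectors */
    uint32_t write_ptr;   /* offset of write pointer within the zone */
};

/* Stand-in for blkdev_query_zones(): fill at most max_zones entries
 * starting from the zone containing start_sector; return the count. */
static int fake_query_zones(sector_t start_sector,
                            struct zone_status *ret_zones, int max_zones)
{
    unsigned zno = (unsigned)(start_sector / ZONE_SECTORS);
    int n = 0;

    while (zno < NUM_ZONES && n < max_zones) {
        ret_zones[n].start = (sector_t)zno * ZONE_SECTORS;
        ret_zones[n].length = ZONE_SECTORS;
        ret_zones[n].write_ptr = 0;
        n++;
        zno++;
    }
    return n;
}

/* Walk every zone in QUERY_BATCH-sized chunks instead of allocating
 * one huge array; return the total number of zones visited. */
static unsigned walk_all_zones(void)
{
    struct zone_status batch[QUERY_BATCH];
    sector_t pos = 0;
    unsigned total = 0;
    int n;

    while ((n = fake_query_zones(pos, batch, QUERY_BATCH)) > 0) {
        total += (unsigned)n;
        /* resume just past the last zone returned */
        pos = batch[n - 1].start + batch[n - 1].length;
    }
    return total;
}
```

The caller's footprint stays at QUERY_BATCH * sizeof(struct
zone_status), a couple of kilobytes, no matter how many zones the disk
has.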

So I don't believe your concern about preallocating extra memory for
zones that would not be used is a major issue.


My anticipation is that the kernel will be storing the information
returned by blkdev_query_zones() in a much more compact fashion (since
we don't need to store the write pointer if the zone is completely
full or completely empty, which I suspect will very often be the
case), and there will be a different interface that block device
drivers will use to send this information to the block device layer
library function, which will maintain this information in the compact
form.
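One possible shape for such a compact form (purely illustrative; the
real layout is up to the implementation) is two state bits per zone
plus a sparse side table of write pointers kept only for partially
written zones.  With 32,768 zones the state bits need just 8 KB,
versus roughly 768 KB for an array of full zone_status structs:

```c
/* Sketch of a compact zone table; all names and sizes hypothetical. */
#include <assert.h>
#include <stdint.h>

#define NZONES 32768u

enum zstate { Z_EMPTY = 0, Z_FULL = 1, Z_PARTIAL = 2 };

/* 2 bits of state per zone: 32,768 zones fit in 8 KB. */
static uint8_t state_bits[NZONES / 4];

/* Write pointers kept only for partial zones (sparse side table). */
struct wp_entry { uint32_t zone; uint32_t write_ptr; };
static struct wp_entry wp_table[256];
static unsigned wp_count;

static void set_state(uint32_t zone, enum zstate st)
{
    unsigned shift = (zone & 3) * 2;

    state_bits[zone / 4] &= (uint8_t)~(3u << shift);
    state_bits[zone / 4] |= (uint8_t)(st << shift);
}

static enum zstate get_state(uint32_t zone)
{
    return (enum zstate)((state_bits[zone / 4] >> ((zone & 3) * 2)) & 3);
}

/* Fold one zone's status into the compact form: full and empty zones
 * need no write pointer; only partial zones get a side-table entry. */
static void record_zone(uint32_t zone, uint32_t write_ptr, uint32_t zone_len)
{
    if (write_ptr == 0) {
        set_state(zone, Z_EMPTY);
    } else if (write_ptr == zone_len) {
        set_state(zone, Z_FULL);
    } else {
        set_state(zone, Z_PARTIAL);
        wp_table[wp_count].zone = zone;
        wp_table[wp_count].write_ptr = write_ptr;
        wp_count++;
    }
}

/* Load one empty, one full, and one partial zone; only the partial
 * zone consumes a side-table slot. */
static unsigned demo_load(void)
{
    record_zone(0, 0, 524288);        /* empty: no write pointer stored */
    record_zone(1, 524288, 524288);   /* full: no write pointer stored */
    record_zone(2, 12345, 524288);    /* partial: write pointer kept */
    return wp_count;
}
```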

I know that I still need to spec out some functions to make life
easier for the block device drivers that will be interfacing into the
ZBC maintenance layer.   They will probably look something like this:

extern int blkdev_set_zone_info(struct block_device *bdev,
                                struct zone_status *zone_info);

blkdev_set_zone_info() would get called once per zone when the block
device is initially set up.  My assumption is that the block device
layer will query the drive initially, and grab all of this
information, and keep it in the compressed form.  (Since querying this
data each time the OS needs it will likely be too expensive; even if
the ZBC commands don't have the same insanity as the non-queueable TRIM
command, the fact that we need to go out to the disk means that we
will need to send a disk command and wait for a command completion
interrupt, which would be sad.)
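As a toy illustration of that setup path (fake_set_zone_info and the
constants are invented; the real call would fold each descriptor into
the compact in-kernel table), here is a userspace stand-in that feeds
one descriptor per zone into the maintenance layer at initialization:

```c
/* Userspace toy of the one-call-per-zone setup path; names invented. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define NZONES       64u
#define ZONE_SECTORS 524288u

struct zone_status {
    sector_t start;
    uint32_t length;
    uint32_t write_ptr;
};

static unsigned zones_loaded;   /* descriptors the layer has accepted */

/* Stand-in for blkdev_set_zone_info(); here we only count calls. */
static int fake_set_zone_info(const struct zone_status *zs)
{
    (void)zs;
    zones_loaded++;
    return 0;
}

/* Called once at block device setup: query the drive, then hand every
 * zone descriptor to the maintenance layer, one call per zone. */
static unsigned init_zone_table(void)
{
    struct zone_status zs;
    unsigned i;

    for (i = 0; i < NZONES; i++) {
        zs.start = (sector_t)i * ZONE_SECTORS;
        zs.length = ZONE_SECTORS;
        zs.write_ptr = 0;       /* fresh drive: every zone empty */
        fake_set_zone_info(&zs);
    }
    return zones_loaded;
}
```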

I suspect we will also need commands such as these for the convenience
of the block device driver:

extern int blkdev_update_write_ptr(struct block_device *bdev,
                                   sector_t start_sector,
                                   u32 write_ptr);

extern int blkdev_update_zone_info(struct block_device *bdev,
                                   struct zone_status *zone_info);
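A minimal model of what the blkdev_update_write_ptr() semantics might
look like (hypothetical, userspace): after a write of nsectors
completes at the zone's write pointer, the driver advances it, and an
update that would run past the end of the zone is refused:

```c
/* Hypothetical model of advancing a zone's write pointer. */
#include <assert.h>
#include <stdint.h>

/* Per-zone record; lengths and offsets in sectors. */
struct zone {
    uint32_t length;
    uint32_t write_ptr;   /* offset of write pointer within the zone */
};

/* Advance the write pointer by the number of sectors just written;
 * reject an update that would cross the zone boundary. */
static int update_write_ptr(struct zone *z, uint32_t nsectors)
{
    if (nsectors > z->length - z->write_ptr)
        return -1;                /* would overflow the zone */
    z->write_ptr += nsectors;
    return 0;
}

/* Two sequential writes exactly fill an 8-sector zone. */
static uint32_t demo_fill(void)
{
    struct zone z = { 8, 0 };

    update_write_ptr(&z, 5);
    update_write_ptr(&z, 3);
    return z.write_ptr;
}

/* A write that would cross the zone boundary is rejected. */
static int demo_overfill(void)
{
    struct zone z = { 8, 7 };

    return update_write_ptr(&z, 2);
}
```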

And we will probably want to define that in blkdev_query_zones(), if
start_sector is not located at the beginning of a zone, the first
zone returned will be the zone containing the specified sector.  (We'll
need this in the event that the T10 committee allows for variable
sized zones, instead of the much simpler fixed-size zone design, since
given a sector number, the block driver or the file system above the
ZBC OS management layer would otherwise have no way of mapping a
sector number to a specific zone.)
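If zones do end up variable-sized, mapping a sector to its zone
reduces to a binary search over the sorted zone start sectors.  A
self-contained sketch, with an invented five-zone layout:

```c
/* Sector-to-zone lookup for variable-sized zones; layout invented. */
#include <assert.h>

typedef unsigned long long sector_t;

/* Zone i covers sectors [zone_start[i], zone_start[i+1]); the last
 * zone runs to the end of the device. */
static const sector_t zone_start[] = { 0, 64, 192, 448, 960 };
#define NZ ((int)(sizeof(zone_start) / sizeof(zone_start[0])))

/* Binary search for the greatest zone_start[] <= s, i.e. the index
 * of the zone containing sector s. */
static int zone_of_sector(sector_t s)
{
    int lo = 0, hi = NZ - 1;

    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (zone_start[mid] <= s)
            lo = mid;       /* zone mid starts at or before s */
        else
            hi = mid - 1;   /* zone mid starts after s */
    }
    return lo;
}
```

With fixed-size zones this whole lookup collapses to a single divide,
which is one argument for the simpler design.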

So I suspect that as we start implementing device-mapper SMR
simulators and actual SAS/SATA block device drivers which will
interface with the ZBC prototype drives, there may be other functions
we will need to implement in order to make life easier for these
systems.

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



