On Fri, Feb 21, 2014 at 9:19 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
>> >  extern int blkdev_query_zones(struct block_device *bdev,
>> >                                sector_t start_sector,
>> >                                int free_sectors_criteria,
>> >                                struct zone_status *ret_zones,
>> >                                int max_zones);
>>
>> In this API, the caller would allocate the memory for ret_zones as
>> sizeof(struct zone_status) * max_zones, right? There can be a case
>> where the return value is less than max_zones; in that case we would
>> have preallocated extra memory for (max_zones - return value) entries
>> that would not be used (since they would not contain valid zone_status
>> structs). As the hdd ages it can become prone to failures, so cases
>> where the two values differ can happen. Can we pass a double pointer
>> for ret_zones, so that the API allocates the memory and the caller
>> frees it? Would like to know your views on this. This would not apply
>> to the single zone_status example that you gave.
>
> I think you are making the assumption here that max_zones will
> normally be the maximum number of zones available to the disk.  In

No, not the maximum; anything greater than 1 and less than the maximum
number of available zones. Consider a case where, out of 32,768 zones,
the kernel queries for 1000 zones. It may be that information can be
obtained for only 700-800 of them, and for the remaining 200-300 it
cannot, due to some error condition. We would then have preallocated an
extra (200-300) * 24 bytes. This will only happen in the error path,
and I am not quite sure how probable it is. What are your views on
this?

> practice, this will never be true.  Consider that an 8TB SMR drive
> with 256 MB zones will have 32,768 zones.  The kernel will *not* want
> to allocate 768k of non-swappable kernel memory on a regular basis.
> (There is no guarantee there will be that number of contiguous pages
> available, and if you use vmalloc() instead, it's slower since it
> involves page table operations.)  Also, when will the kernel ever want
> to see all of the zones all at once, anyway?
>

Any filesystem that is SMR-aware could need this, right? For example
its block allocator, in order to optimise against fragmentation and the
like.

> So it's likely that the caller will always be allocating a relatively
> small number of zones (I suspect it will always be less than 128), and

Agreed, but this number has to be chosen well to keep the number of
disk reads down.

> if the caller needs more zones, it will simply call
> blkdev_query_zones() with a larger start_sector value and get the next
> 128 zones.
>
> So your concern about preallocating extra memory for zones that would
> not be used is, I don't believe, a major issue.
>

Yes, this could only happen with disk read errors and requests for more
than one zone's worth of information.
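For what it's worth, here is roughly how I picture the caller side,
just as a sketch against the signature quoted above. ZONES_PER_QUERY,
process_zone(), the z_start/z_len fields of struct zone_status, and
passing 0 for free_sectors_criteria to mean "any zone" are all
assumptions I made up for the example, not something from your
proposal:

#include <linux/blkdev.h>
#include <linux/slab.h>

#define ZONES_PER_QUERY 128   /* assumed batch size, per your "< 128" estimate */

static int walk_zones(struct block_device *bdev, sector_t start, sector_t end)
{
        struct zone_status *zones;
        int i, nr = 0;

        /* caller allocates one small, fixed buffer and reuses it */
        zones = kcalloc(ZONES_PER_QUERY, sizeof(*zones), GFP_KERNEL);
        if (!zones)
                return -ENOMEM;

        while (start < end) {
                nr = blkdev_query_zones(bdev, start,
                                        0, /* assumed to mean "any zone" */
                                        zones, ZONES_PER_QUERY);
                if (nr <= 0)
                        break;

                for (i = 0; i < nr; i++)
                        process_zone(&zones[i]);  /* hypothetical consumer */

                /* advance past the last zone returned (field names assumed) */
                start = zones[nr - 1].z_start + zones[nr - 1].z_len;
        }

        kfree(zones);
        return nr < 0 ? nr : 0;
}

With something like this the buffer stays around 3k (128 * 24 bytes) no
matter how large the disk is, so I take your point that the
over-allocation in the error path is not worth worrying about.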
> My anticipation is that the kernel will be storing the information
> returned by blkdev_query_zones() in a much more compact fashion
> (since we don't need to store the write pointer if the zone is
> completely full, or completely empty, which will very often be the
> case, I suspect), and there will be a different interface that will
> be used by block device drivers to send this information to the block
> device layer library function which will be maintaining this
> information in a compact form.
>
> I know that I still need to spec out some functions to make life
> easier for the block device drivers that will be interfacing into the
> ZBC maintenance layer.  They will probably look something like this:
>
>  extern int blkdev_set_zone_info(struct block_device *bdev,
>                                  struct zone_status *zone_info);
>
> blkdev_set_zone_info() would get called once per zone when the block
> device is initially set up.  My assumption is that the block

Would this happen every time the OS boots up? If so, won't it increase
the OS boot time?

> layer will query the drive initially, and grab all of this
> information, and keep it in the compressed form.  (Since querying
> this data each time the OS needs it will likely be too expensive;
> even if the ZBC commands don't have the same insanity as the
> non-queueable TRIM command, the fact that we need to go out to the
> disk means that we will need to send a disk command and wait for a
> command completion interrupt, which would be sad.)
>

Agreed.

> I suspect we will also need commands such as these for the
> convenience of the block device driver:
>
>  extern int blkdev_update_write_ptr(struct block_device *bdev,
>                                     sector_t start_sector,
>                                     u32 write_ptr);
>
>  extern int blkdev_update_zone_info(struct block_device *bdev,
>                                     struct zone_status *zone_info);
>

Will these update the on-disk state or only the in-memory state, i.e.
write-through or write-back?

> And we will probably want to define that in blkdev_query_zones(), if
> start_sector is not located at the beginning of a zone, the first
> zone returned will be the zone containing the specified sector.
> (We'll need this in the event that the T10 committee allows for
> variable sized zones, instead of the much simpler fixed-size zone
> design, since given a sector number, the block driver or the file
> system above the ZBC OS management layer would have no way of mapping
> a sector number to a specific zone.)
>
> So I suspect, as we start implementing device mapper SMR simulators
> and actual SAS/SATA block device drivers which will interface with
> the ZBC prototype drives, there may be other functions we will need
> to implement in order to make life easier for both of these systems.
>

I am interested in project core-04, the SMR simulator. I read a project
report related to it, from research conducted at UCSC:
http://www.ssrc.ucsc.edu/Papers/ssrctr-12-05.pdf
I would also like to know your inputs on core-04.

> Cheers,
>
>                                         - Ted

- Regards,
Rohan
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html