On 2020/09/29 3:58, Kanchan Joshi wrote:
[...]
> ZoneFS is better when it is about dealing at single-zone granularity,
> and direct-block seems better when it is about grouping zones (in
> various ways including striping). The latter case (i.e. grouping
> zones) requires more involved mapping, and I agree that it can be left
> to the application (for both ZoneFS and raw-block backends).
> But when an application tries that on ZoneFS, apart from mapping there
> would be additional cost of indirection/fd-management (due to
> file-on-files).

There is no indirection in zonefs. fd-to-struct file/inode conversion is very
fast and happens for every system call anyway, regardless of what the fd
represents. So I really do not understand what your worry is here. If you are
worried about overhead/performance, then please show numbers. If something is
wrong, we can work on fixing it.

> And if new features (zone-append for now) are available only on
> ZoneFS, it forces the application to use something that may not be most
> optimal for its need.

"may" is not enough to convince me...

> Coming to the original problem of plumbing append - I think divergence
> started because RWF_APPEND did not have any meaning for block device.
> Did I miss any other reason?

Correct.

> How about write-anywhere semantics (RWF_RELAXED_WRITE or
> RWF_ANONYMOUS_WRITE flag) on block-dev.

"write-anywhere" ? What do you mean ? That is not possible on zoned devices,
even with zone append, since you at least need to guarantee that zones have
enough unwritten space to accept an append command.

> Zone-append works a lot like write-anywhere on block-dev (or on any
> other file that combines multiple-zones, in non-sequential fashion).

That is an over-simplification that is not helpful at all. Zone append is not
"write anywhere" at all. And "write anywhere" is not a concept that exists on
regular block devices anyway. Writes only go to the offset that the user
decided, through lseek(), pwrite() or aio->aio_offset. It is not like the
block layer decides where the writes land. The same constraint applies to zone
append: the user decides the target zone. That is not "anywhere".
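To make this concrete, here is a rough user space sketch (purely
illustrative; the device path, zone size and offsets are made up for the
example):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative only: device path and 256 MiB zone size are made up. */
	int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	off_t zone_start = 4ULL * 256 * 1024 * 1024;	/* start of zone 4 */
	char buf[4096] __attribute__((aligned(4096))) = "payload";

	if (fd < 0)
		return 1;

	/*
	 * Regular write: the application chooses the exact byte offset.
	 * On a zoned device that offset must match the zone write pointer.
	 */
	pwrite(fd, buf, sizeof(buf), zone_start);

	/*
	 * A zone append is no different in that respect: the application
	 * still names the target zone (its start offset); only the exact
	 * offset within that zone is picked by the device and returned on
	 * completion. Neither case is "write anywhere".
	 */
	close(fd);
	return 0;
}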
Please be precise with wording and implied/desired semantics. Narrow down the
scope of your concept names for clarity. And talking about a "file that
combines multiple-zones" would mean that we are now back in FS land, not raw
block device file accesses anymore. So which one are we talking about ? It
looks like you are confusing what the application does, and how it uses
whatever usable interface to the device, with what that interface actually is.
It is very confusing.

>>> Also it seems difficult (compared to block dev) to fit simple-copy TP
>>> in ZoneFS. The new
>>> command needs: one NVMe drive, list of source LBAs and one destination
>>> LBA. In ZoneFS, we would deal with N+1 file-descriptors (N source zone
>>> files, and one destination zone file) for that. While with the block
>>> interface, we do not need more than one file-descriptor representing
>>> the entire device. With more zone-files, we face open/close overhead too.
>>
>> Are you expecting simple-copy to allow requests that are not zone aligned ? I
>> do not think that will ever happen. Otherwise, the gotcha cases for it would
>> be far too numerous. Simple-copy is essentially an optimized regular write
>> command. Similarly to that command, it will not allow copies over zone
>> boundaries and will need the destination LBA to be aligned to the destination
>> zone WP. I have not checked the TP though and given the NVMe NDA, I will stop
>> the discussion here.
>
> TP is ratified, if that is the problem you are referring to.

Ah. Yes. Got confused with ZRWA. Simple-copy is a different story anyway. Let's
not mix it into the zone append user interface please.

>> sendfile() could be used as the interface for simple-copy. Implementing that
>> in zonefs would not be that hard. What is your plan for the simple-copy
>> interface for the raw block device ? An ioctl ? sendfile() too ? As with any
>> other user level API, we should not be restricted to a particular device type
>> if we can avoid it, so in-kernel emulation of the feature is needed for
>> devices that do not have simple-copy or SCSI extended copy. sendfile() seems
>> to me like the best choice since all of that is already implemented there.
>
> At this moment, ioctl as sync and io-uring for async. sendfile() and
> copy_file_range() take two fds... with that we can represent a copy
> from one source zone to another zone.
> But it does not fit to represent a larger copy (from N source zones to
> one destination zone).

nvme passthrough ? If that does not fit your use case, then think of an
interface, its definition/semantics, and propose it. But again, use a different
thread. This is mixing up zone-append and simple copy, which I do not think are
directly related.

> Not sure if I am clear, perhaps sending an RFC would be better for
> discussion on simple-copy.

Separate this discussion from zone append please. Mixing up 2 problems in one
thread is not helpful to make progress.

-- 
Damien Le Moal
Western Digital Research
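P.S. For reference, the "two fds" case mentioned above would look roughly like
this with copy_file_range() (purely illustrative: it assumes zonefs wired up
copy_file_range(), which the discussion above suggests it does not yet do, and
the mount point and zone file names are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Made-up zonefs paths: one source zone file, one destination. */
	int src = open("/mnt/zonefs/seq/0", O_RDONLY);
	int dst = open("/mnt/zonefs/seq/1", O_WRONLY);
	off64_t src_off = 0;

	if (src < 0 || dst < 0)
		return 1;

	/*
	 * Destination offset left NULL: data is written at the destination
	 * file offset, i.e. at the destination zone write pointer. This
	 * covers one source zone to one destination zone. An N source
	 * zones to one destination zone copy does not fit this two-fd
	 * interface, which is the limitation discussed above.
	 */
	copy_file_range(src, &src_off, dst, NULL, 256 << 20, 0);

	close(src);
	close(dst);
	return 0;
}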