On 12/14/23 01:49, Jaegeuk Kim wrote:
> On 12/13, Damien Le Moal wrote:
>> On 12/13/23 04:03, Jaegeuk Kim wrote:
>>> On 12/12, Christoph Hellwig wrote:
>>>> On Tue, Dec 12, 2023 at 10:19:31AM -0800, Bart Van Assche wrote:
>>>>> "Fundamentally broken model" is your personal opinion. I don't know anyone
>>>>> else than you who considers zoned writes as a broken model.
>>>>
>>>> No Bart, it is not. Talk to Damien, talk to Martin, to Jens. Or just
>>>> look at all the patches you're sending to the list that play a never
>>>> ending whac-a-mole trying to bandaid over reordering that should be
>>>> perfectly fine. You're playing a long term losing game by trying to
>>>> prevent reordering that you can't win.
>>>
>>> As one of users of zoned devices, I disagree this is a broken model, but even
>>> better than the zone append model. When considering the filesystem performance,
>>> it is essential to place the data per file to get better bandwidth. And for
>>> NAND-based storage, filesystem is the right place to deal with the more efficient
>>> garbage collection based on the known data locations. That's why all the flash
>>> storage vendors adopted it in the JEDEC. Agreed that zone append is nice, but
>>> IMO, it's not practical for production.
>>
>> The work on btrfs is a counter argument to this statement. The initial zone
>> support based on regular writes was going nowhere as trying to maintain ordering
>> was too complex and/or too invasive. Using zone append for the data path solved
>> and simplified many things.
>
> We're in supporting zoned writes, and we don't see huge problem of reordering
> issues like you mention. I do agree there're pros and cons between the two, but
> I believe using which one depends on user behaviors. If there's a user, why it
> should be blocked? IOWs, why not just trying to support both?

We do support both... But:
1) Regular writes to zones are a user (= application) facing API.
   An application using a block device directly without an FS can directly drive
   the issuing of sequential writes to a zone. If there is an FS between the
   application and the device, the FS decides what to do (regular writes or zone
   append, and to which zone).
2) Zone append cannot be directly issued by applications to block devices. I am
   working on restoring zone append writes in zonefs as an alternative to this
   limitation.

Now, in the context of IO priorities, issuing sequential writes to the same zone
with different priorities really is a silly thing to do. Even if done in the
proper order, that would essentially mean that whoever does that (FS or
application) is creating priority inversion issues for himself and thus negating
any benefit one can achieve with IO priorities (that is, most of the time,
lowering tail latency for a class of IOs).

As I mentioned before, for applications that use the zoned block device
directly, I think we should just leave things as is, that is, let the writes
fail if they are reordered due to a nonsensical IO priority setup. That is a
nice way to warn the user that he/she is doing something silly.

For the FS case, it is a little more difficult given that the user may have a
sensible IO priority setup, e.g. assigning different IO priorities (cgroups,
ioprio_set or ionice) to different processes accessing different files. For that
case, if the FS decides to issue writes to these files to the same zone, then
the problem occurs. But back to the previous point: this is a silly thing to do
when writes have to be sequential. That is priority inversion right there.

The difficulty for an FS is, I think, that the FS cannot easily know the IO
priority until the BIO for the write is issued... So that is the problem that
needs fixing. Bart's proposed fix will, I think, address your issue. However, it
will also hide IO priority setup problems from users accessing the block device
directly. That I do not like.
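
To make the failure mode discussed above concrete, here is a toy Python model. It
is not kernel code: the Zone class and the dispatch() helper are hypothetical
names invented for this illustration. It assumes a sequential-write-required
zone that rejects any write not starting at the write pointer, and a scheduler
that simply serves lower priority values first.

```python
class Zone:
    """Toy sequential-write-required zone (assumption: simplified model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wp = 0  # write pointer: the only offset a write may start at

    def write(self, offset, length):
        # A write that does not start at the write pointer is rejected,
        # which is how a reordered sequential write shows up as an error.
        if offset != self.wp or offset + length > self.capacity:
            return False
        self.wp += length
        return True


def dispatch(writes, honor_priority):
    """Dispatch (prio, offset, length) writes to a single fresh zone.

    With honor_priority=True, writes are served lowest prio value first
    (submission order breaks ties), mimicking a priority-aware scheduler.
    Returns a list of (offset, succeeded) in dispatch order.
    """
    zone = Zone(capacity=64)
    indexed = list(enumerate(writes))
    if honor_priority:
        indexed.sort(key=lambda e: (e[1][0], e[0]))
    return [(off, zone.write(off, length)) for _, (_prio, off, length) in indexed]


# Three sequential writes to one zone; the middle one has a higher priority.
writes = [(2, 0, 8), (1, 8, 8), (2, 16, 8)]
in_order = dispatch(writes, honor_priority=False)
reordered = dispatch(writes, honor_priority=True)
```

In this sketch, in-order dispatch succeeds for all three writes, while the
priority-aware dispatch serves the write at offset 8 first, so that write and
the one at offset 16 both fail. Only the write at offset 0 lands.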
As I stated above, I think it is better to let writes fail in that case to
signal the priority inversion. There are *a lot* of IO priority SMR HDD users
out there. Literally millions of drives running with that, and not just for read
operations. So please understand my concerns.

A better solution may be to introduce a BIO flag that says "ignore IO
priorities". f2fs can use that to avoid reordering writes to the same zone due
to different IO priorities (again, *that* is the issue to fix in the first place
I think, because that is simply silly to do, even with a regular HDD or SSD
since that will break sequential write streams and thus impact performance,
increase device-level GC/WAF etc).

-- 
Damien Le Moal
Western Digital Research