On Sat, Mar 04, 2023 at 05:54:38PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> > On 3/4/23 17:47, Matthew Wilcox wrote:
> > > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > > We could implement a (virtual) zoned device, and expose each zone as a
> > > > block. That gives us the required large block characteristics, and with
> > > > a bit of luck we might be able to dial up to really large block sizes
> > > > like the 256M sizes on current SMR drives.
> > > > ublk might be a good starting point.
> > >
> > > Ummmm. Is supporting 256MB block sizes really a desired goal? I suggest
> > > that is far past the knee of the curve; if we can only write 256MB chunks
> > > as a single entity, we're looking more at a filesystem redesign than we
> > > are at making filesystems and the MM support 256MB size blocks.
> > >
> > Naa, not really. It _would_ be cool as we could get rid of all the kludges
> > which we have nowadays re sequential writes.
> > And, remember, 256M is just a number someone thought to be a good
> > compromise. If we end up with a lower number (16M?) we might be able
> > to convince the powers that be to change their zone size.
> > Heck, with 16M block size there wouldn't be a _need_ for zones in
> > the first place.
> >
> > But yeah, 256M is excessive. Initially I would shoot for something
> > like 2M.
>
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
>
> Luis and I are talking about larger LBA sizes. That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB. That's doable, even if somewhat
> suboptimal.

Yes.

> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs. I think there are different
> solutions to these problems, and people are working on both of these
> problems.
>
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

Hannes had replied to my suggestion about a way to *optimally*
*virtualize* a real storage controller with a larger LBA. In that
thread I was hinting at avoiding cache=passthrough on the hypervisor
and instead using something like cache=writeback, or even cache=unsafe
for experimentation, with virtio-blk-pci. For a more elaborate
description of these see [0], but the skinny is that cache=writeback
and cache=unsafe rely on the host page cache, while cache=passthrough
goes to the host storage controller. The latency overhead incurred by
whatever we use to replicate larger LBAs should be mitigated, so I
don't think a zoned storage device would be a good fit for it.

I was asking whether or not experimenting with a different PAGE_SIZE
for the host page cache might help replicate things a bit more
realistically, even if it was suboptimal for the host for the reasons
previously noted as stupid.

If sticking to the existing PAGE_SIZE on the host, another idea may be
to use tmpfs + huge pages so as to at least mitigate TLB lookups.

[0] https://github.com/linux-kdevops/kdevops/commit/94844c4684a51997cb327d2fb0ce491fe4429dfc

  Luis
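
P.S.: In case a concrete example helps, below is a rough, untested
sketch of the kind of setup I have in mind. The paths, the 16k logical
block size, the image size and the tmpfs mount point are all made-up
placeholders for illustration, not recommendations:

#!/usr/bin/env python3
# Sketch only: assumes qemu-system-x86_64 with KVM is available and that
# /mnt/lbs-tmpfs is a tmpfs mounted beforehand by the admin, e.g.:
#   mount -t tmpfs -o size=2G,huge=always tmpfs /mnt/lbs-tmpfs
import os

TMPFS_DIR = "/mnt/lbs-tmpfs"   # hypothetical huge-page backed tmpfs mount
IMG = os.path.join(TMPFS_DIR, "lbs.raw")
IMG_SIZE = 1 << 30             # 1 GiB sparse backing file
LBA_SIZE = 16384               # pretend-larger logical block size
CACHE_MODE = "writeback"       # or "unsafe", for experimentation only

def create_backing_file():
    # A sparse raw image living entirely on tmpfs, so guest I/O is served
    # from host memory instead of a real storage controller.
    with open(IMG, "wb") as f:
        f.truncate(IMG_SIZE)

def qemu_cmdline():
    return [
        "qemu-system-x86_64", "-enable-kvm", "-m", "4G", "-smp", "4",
        "-drive",
        f"file={IMG},if=none,id=lbs0,format=raw,cache={CACHE_MODE}",
        "-device",
        ("virtio-blk-pci,drive=lbs0,"
         f"logical_block_size={LBA_SIZE},physical_block_size={LBA_SIZE}"),
    ]

if __name__ == "__main__":
    create_backing_file()
    # Only print the command so the sketch stays harmless; bolt on your
    # usual boot disk / -kernel / -append options to actually run a guest.
    print(" ".join(qemu_cmdline()))

The idea being that the guest sees a virtio-blk device advertising a
larger logical block size while every access is absorbed by the host
page cache (huge page backed via tmpfs), so the latency of emulating
the larger LBA is mostly hidden.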