On Mon, Jun 20, 2022 at 2:50 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Jun 17, 2022, at 5:56 PM, Santosh S <santosh.letterz@xxxxxxxxx> wrote:
> >
> > On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
> >>
> >> On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
> >>> Dear ext4 developers,
> >>>
> >>> This is my test - preallocate a large file (2G) and then do
> >>> sequential 4K direct-io writes to that file, with fdatasync after
> >>> every write. I am preallocating using fallocate mode 0. I noticed
> >>> that if the 2G file is pre-written rather than fallocate'd I get
> >>> more than twice the throughput. I could reproduce this with fio.
> >>> The storage is NVMe. The kernel version is 5.3.18 on SUSE.
> >>>
> >>> Am I doing something wrong, or is this difference expected? Any
> >>> suggestion for getting better throughput without actually
> >>> pre-writing the file?
> >>
> >> This is, alas, expected. The reason for this is that when you use
> >> fallocate, the extent is marked as uninitialized, so that when you
> >> read from those newly allocated blocks, you don't see previously
> >> written data belonging to deleted files. These files could contain
> >> someone else's e-mail, or medical information, etc. So if we didn't
> >> do this, it would be a walking, talking HIPAA or PCI violation.
> >>
> >> So when you write to an fallocated region and then call
> >> fdatasync(2), we need to update the metadata blocks to clear the
> >> uninitialized bit, so that when you read from the file after a
> >> crash, you actually get the data that was written. So fdatasync(2)
> >> is quite a heavyweight operation, since it requires a journal
> >> commit because of the required metadata update. When you do an
> >> overwrite, there is no need to force a metadata update and journal
> >> commit, which is why write(2) plus fdatasync(2) is much lighter
> >> weight when you do an overwrite.
> >>
> >> What enterprise databases (e.g., Oracle Enterprise Database and
> >> IBM's Informix DB) tend to do is fallocate a chunk of space (say,
> >> 16MB or 32MB), because on legacy Unix OS's this tends to make some
> >> file systems' block allocators more likely to allocate a contiguous
> >> block range, and then immediately write zeros over that 16MB or
> >> 32MB, plus an fdatasync(2). This fdatasync(2) updates the extent
> >> tree once to mark that 16MB or 32MB of the database's tablespace
> >> file as initialized, so you only pay for the metadata update once,
> >> instead of every few dozen kilobytes as you write each database
> >> commit into the tablespace file.
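
(To make sure I follow the chunked approach described above, here is a
rough sketch of how I read it. The helper name, the 1MB zero buffer,
and the minimal error handling are placeholders I picked, not anything
the databases mentioned above actually do.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK_SIZE (16 * 1024 * 1024)          /* 16MB, as suggested above */

/* Extend the file by one preallocated, zero-filled chunk at 'offset',
 * paying the unwritten->initialized conversion once per chunk. */
static int extend_by_chunk(int fd, off_t offset)
{
        static char zeroes[1024 * 1024];       /* 1MB of zeros */
        size_t done = 0;
        ssize_t n;

        if (fallocate(fd, 0, offset, CHUNK_SIZE) != 0)
                return -1;
        while (done < CHUNK_SIZE) {
                n = pwrite(fd, zeroes, sizeof(zeroes), offset + done);
                if (n <= 0)
                        return -1;
                done += n;
        }
        /* One journal/metadata update for the whole chunk, instead of
         * one per small application write. */
        return fdatasync(fd);
}

If I understand correctly, later application writes that land inside a
chunk prepared this way are plain overwrites, so the per-write
fdatasync(2) no longer forces an extent-tree update.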
> >>
> >> There is also an old, out-of-tree patch which enables an fallocate
> >> mode called "no hide stale", which marks the extent tree blocks
> >> that are allocated using fallocate(2) as initialized. This
> >> substantially speeds things up, but it is potentially a walking,
> >> talking HIPAA or PCI violation, in that revealing previously
> >> written data is considered a horrible security violation by most
> >> file system developers.
> >>
> >> If you know, say, that a cluster file system is the only user of
> >> the file system, and all data is written encrypted at rest using a
> >> per-user key, such that exposing stale data is not a security
> >> disaster, the "no hide stale" flag could be "safe" in that highly
> >> specialized use case.
> >>
> >> But that assumes that file system authors can trust application
> >> writers not to do something stupid and insecure, and historically,
> >> file system authors (possibly with good reason, given bitter past
> >> experience) don't trust application writers to do something which
> >> is very easy and gooses performance, even if it has terrible side
> >> effects on either data robustness or data security.
> >>
> >> Effectively, the no-hide-stale flag could be considered an
> >> "attractive nuisance"[1], and so support for this feature has never
> >> been accepted into the mainline kernel, nor into any distro
> >> kernels, since the distribution companies don't want to be held
> >> liable for creating an "attractive nuisance" that might enable
> >> application authors to shoot themselves in the foot.
> >>
> >> [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
> >>
> >> In any case, the technique of fallocate(2) plus zero-fill-write
> >> plus fdatasync(2) isn't *that* slow, and it is only needed when you
> >> are first extending the tablespace file. In the steady state, most
> >> database applications tend to be overwriting space, so this isn't
> >> an issue.
> >>
> >> In any case, if you need to get that last 5% or so of performance
> >> --- say, if you are an enterprise database company interested in
> >> taking out a full-page advertisement on the back cover of Business
> >> Week magazine touting how your enterprise database benchmarks are
> >> better than the competition --- the simple solution is to use a raw
> >> block device. Of course, most end users want the convenience of the
> >> file system, but that's not the point if you are engaging in
> >> benchmarketing. :-)
> >>
> >> Cheers,
> >>
> >> - Ted
> >
> > Thank you for a comprehensive answer :-)
> >
> > I have one more question - when I gradually increase the I/O
> > transfer size, the performance degradation begins to lessen, and at
> > 32K it is similar to the "overwriting the file" case. I assume this
> > is because the metadata update is now spread over 32K of data rather
> > than 4K.
>
> When splitting unwritten extents, the ext4 code will write out zero
> blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb)
> to avoid having millions of very small extents in a file (e.g. in
> case of a pathological alternating 4KB write pattern). If your test
> is writing >= 32KB blocks then this no longer needs to be done. If
> writing smaller blocks then it makes sense that the speed is 1/2 the
> raw speed, because the file blocks are all being written twice (first
> with zeroes, then with actual data on a later write).
>
> 32KB (or 64KB) is a reasonable minimum size because any disk write
> will take the same time to write a single block or a whole sector,
> so doing writes in smaller units is not very efficient. Depending
> on the underlying storage (e.g. RAID-6) it might be more efficient
> to set extent_max_zeroout_kb=1024 or similar.
>
> > However, my understanding is that, in my case, an extent should
> > represent at most 128MiB of data, and so the clearing of the
> > uninitialized bit for an extent should happen once every 128MiB, so
> > then why is a higher transfer size making a difference?
>
> You are misunderstanding how uninitialized extents are cleared. The
> uninitialized extent is split into two/three parts, where only the
> extent that has data written to it (min 32KB) is set to
> "initialized", and the remaining one/two extents are left
> uninitialized. Otherwise, each write to an uninitialized extent would
> need up to 128MB of zeroes written to disk each time, which would be
> slow/high latency.
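
(For checking this on my side: something like the FIEMAP sketch below
should list which extents of a file are still marked unwritten, so the
split described above can be observed directly. The 64-extent buffer is
an arbitrary limit I chose; filefrag -v from e2fsprogs reports the same
information.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        /* room for up to 64 extent records */
        size_t sz = sizeof(struct fiemap) + 64 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        unsigned int i;
        int fd;

        if (argc < 2 || fm == NULL)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = ~0ULL;                 /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;       /* flush dirty data first */
        fm->fm_extent_count = 64;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return 1;

        for (i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];

                printf("extent %u: logical %llu, length %llu, %s\n", i,
                       (unsigned long long)e->fe_logical,
                       (unsigned long long)e->fe_length,
                       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
                               "unwritten" : "initialized");
        }
        close(fd);
        return 0;
}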
>
> Cheers, Andreas

Thank you, and sorry for the delay in responding.

What kind of write will stop an uninitialized extent from being split?
For example, I want to create a file, fallocate 512MB (mode 0), and
zero-fill it, but I want the file system to create only 4 extents, so
that they all fit in the inode itself and each extent covers a full
128MB (i.e. no splitting). Even if I issue large writes, my
understanding is that the kernel / hardware restrictions will
ultimately split the I/O into smaller chunks, and thereby cause the
extents to split. For example, this is what I see on my test system:

# cat /sys/block/nvme1n1/queue/max_hw_sectors_kb
128
# cat /sys/block/nvme1n1/queue/max_sectors_kb
128
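
To make the question concrete, the pattern I have in mind is roughly
the sketch below (the file name, the 8MB buffer size and the error
handling are placeholders, and I have not verified whether this
actually avoids the splitting):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define FILE_SIZE (512LL * 1024 * 1024)        /* 512MB, fallocate mode 0 */
#define BUF_SIZE  (8 * 1024 * 1024)            /* zero-fill in 8MB writes */

int main(void)
{
        char *buf = calloc(1, BUF_SIZE);       /* 8MB of zeros */
        off_t off;
        int fd;

        fd = open("/mnt/test/prealloc.dat",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || buf == NULL)
                return 1;
        if (fallocate(fd, 0, 0, FILE_SIZE) != 0)
                return 1;
        for (off = 0; off < FILE_SIZE; off += BUF_SIZE)
                if (pwrite(fd, buf, BUF_SIZE, off) != BUF_SIZE)
                        return 1;
        /* one fdatasync for the whole zero-filled region */
        if (fdatasync(fd) != 0)
                return 1;
        close(fd);
        free(buf);
        return 0;
}

I would then check the resulting layout with the FIEMAP program (or
filefrag -v) mentioned above.

Santosh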