On Fri, Jun 17, 2022 at 7:56 PM Santosh S <santosh.letterz@xxxxxxxxx> wrote:
>
> On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
> >
> > On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
> > > Dear ext4 developers,
> > >
> > > This is my test - preallocate a large file (2G) and then do sequential
> > > 4K direct-io writes to that file, with fdatasync after every write.
> > > I am preallocating using fallocate mode 0. I noticed that if the 2G
> > > file is pre-written rather than fallocate'd I get more than twice the
> > > throughput. I could reproduce this with fio. The storage is NVMe.
> > > Kernel version is 5.3.18 on SUSE.
> > >
> > > Am I doing something wrong or is this difference expected? Any
> > > suggestion to get a better throughput without actually pre-writing the
> > > file?
> >
> > This is, alas, expected. The reason for this is that when you use
> > fallocate, the extent is marked as uninitialized, so that when you
> > read from those newly allocated blocks, you don't see previously
> > written data belonging to deleted files. These files could contain
> > someone else's e-mail, or medical information, etc. So if we didn't
> > do this, it would be a walking, talking HIPAA or PCI violation.
> >
> > So when you write to an fallocated region, and then call fdatasync(2),
> > we need to update the metadata blocks to clear the uninitialized bit
> > so that when you read from the file after a crash, you actually get
> > the data that was written. So the fdatasync(2) operation is quite the
> > heavyweight operation, since it requires a journal commit because of
> > the required metadata update. When you do an overwrite, there is no
> > need to force a metadata update and journal update, which is why
> > write(2) plus fdatasync(2) is much lighter weight when you do an
> > overwrite.
> >
> > What enterprise databases (e.g., Oracle Enterprise Database and IBM's
> > Informix DB) tend to do is to fallocate a chunk of space (say,
> > 16MB or 32MB), because for Legacy Unix OS's, this tends to enable some
> > file system's block allocators to be more likely to allocate a
> > contiguous block range, and then immediately write zeros to that 16 or
> > 32MB, plus a fdatasync(2). This fdatasync(2) would update the extent
> > tree once to mark that 16MB or 32MB as initialized in the
> > database's tablespace file, so you only pay the metadata update once,
> > instead of every few dozen kilobytes as you write each database commit
> > into the tablespace file.
> >
> > There is also an old, out of tree patch which enables an fallocate
> > mode called "no hide stale", which marks the extent tree blocks which
> > are allocated using fallocate(2) as initialized. This substantially
> > speeds things up, but it is potentially a walking, talking, HIPAA or
> > PCI violation in that revealing previously written data is considered
> > a horrible security violation by most file system developers.
> >
> > If you know, say, that a cluster file system is the only user of the
> > file system, and all data is written encrypted at rest using a
> > per-user key, such that exposing stale data is not a security
> > disaster, the "no hide stale" flag could be "safe" in that highly
> > specialized use case.
> >
> > But that assumes that file system authors can trust application
> > writers not to do something stupid and insecure, and historically,
> > file system authors (possibly with good reason, given bitter past
> > experience) don't trust application writers to avoid doing something
> > which is very easy, and gooses performance, even if it has terrible
> > side effects on either data robustness or data security.
> >
> > Effectively, the no hide stale flag could be considered an "Attractive
> > Nuisance"[1] and so support for this feature has never been accepted
> > into the mainline kernel, nor into any distro kernels, since the
> > distribution companies don't want to be held liable for making an
> > "attractive nuisance" that might enable application authors to shoot
> > themselves in the foot.
> >
> > [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
> >
> > In any case, the technique of fallocate(2) plus zero-fill-write plus
> > fdatasync(2) isn't *that* slow, and is only needed when you are first
> > extending the tablespace file. In the steady state, most database
> > applications tend to be overwriting space, so this isn't an issue.
> >
> > In any case, if you need to get that last 5% or so of performance ---
> > say, if you are an enterprise database company interested in
> > taking a full page advertisement on the back cover of Business Week
> > Magazine touting how your enterprise database benchmarks are better
> > than the competition --- the simple solution is to use a raw block
> > device. Of course, most end users want the convenience of the file
> > system, but that's not the point if you are engaging in
> > benchmarketing. :-)
> >
> > Cheers,
> >
> > - Ted
>
> Thank you for a comprehensive answer :-)
>
> I have one more question - when I gradually increase the i/o transfer
> size the performance degradation begins to lessen and at 32K it is
> similar to the "overwriting the file" case. I assume this is because
> the metadata update is now spread over 32K of data rather than 4K.
> However, my understanding is that, in my case, an extent should
> represent the max 128MiB of data and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> then why is a higher transfer size making a difference?
>

I think I understand.
The metadata update cannot just be clearing the uninitialized bit; it
must also update the high-water mark that records the length of the
initialized part of the extent.

> Santosh
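
For anyone finding this thread later, the fallocate plus zero-fill plus
fdatasync technique Ted describes could be sketched roughly as below. This
is only an illustration under assumptions, not code from any database or
from ext4: the helper name extend_initialized and the 1 MiB write size are
made up for the example, the 16 MiB default chunk is the figure mentioned
in the thread, and it assumes Linux (os.posix_fallocate and os.fdatasync).

```python
import os

CHUNK = 16 * 1024 * 1024  # 16 MiB chunk size, as mentioned in the thread


def extend_initialized(fd, offset, length=CHUNK, block=1024 * 1024):
    """Grow a file by one chunk and leave that chunk initialized.

    Hypothetical helper: preallocates [offset, offset + length),
    zero-fills it, and calls fdatasync once for the whole chunk.
    Returns the offset just past the chunk.
    """
    # Reserve the space first so the allocator sees the whole range.
    os.posix_fallocate(fd, offset, length)

    zeros = bytes(block)  # write size is an arbitrary choice for this sketch
    pos = offset
    end = offset + length
    while pos < end:
        n = min(block, end - pos)
        # pwrite may write fewer bytes than requested, so advance by its result.
        pos += os.pwrite(fd, zeros[:n], pos)

    # One sync for the whole chunk, instead of one per small write.
    os.fdatasync(fd)
    return end
```

The idea is to call something like this only when the file has to grow;
later 4K write plus fdatasync calls inside the chunk are then overwrites
of already-written blocks, which is the cheaper case described above.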