On Mon, Jun 20, 2022 at 2:50 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Jun 17, 2022, at 5:56 PM, Santosh S <santosh.letterz@xxxxxxxxx> wrote:
> >
> > On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
> >>
> >> On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
> >>> Dear ext4 developers,
> >>>
> >>> This is my test - preallocate a large file (2G) and then do
> >>> sequential 4K direct-io writes to that file, with fdatasync after
> >>> every write. I am preallocating using fallocate mode 0. I noticed
> >>> that if the 2G file is pre-written rather than fallocate'd I get
> >>> more than twice the throughput. I could reproduce this with fio.
> >>> The storage is NVMe. The kernel version is 5.3.18 on SUSE.
> >>>
> >>> Am I doing something wrong, or is this difference expected? Any
> >>> suggestion for getting better throughput without actually
> >>> pre-writing the file?
> >>
> >> This is, alas, expected. The reason for this is that when you use
> >> fallocate, the extent is marked as uninitialized, so that when you
> >> read from those newly allocated blocks, you don't see previously
> >> written data belonging to deleted files. These files could contain
> >> someone else's e-mail, or medical information, etc. So if we didn't
> >> do this, it would be a walking, talking HIPAA or PCI violation.
> >>
> >> So when you write to an fallocated region and then call
> >> fdatasync(2), we need to update the metadata blocks to clear the
> >> uninitialized bit, so that when you read from the file after a
> >> crash, you actually get the data that was written. So fdatasync(2)
> >> is quite a heavyweight operation, since it requires a journal
> >> commit because of the required metadata update. When you do an
> >> overwrite, there is no need to force a metadata update and journal
> >> commit, which is why write(2) plus fdatasync(2) is much lighter
> >> weight when you do an overwrite.
> >>
> >> What enterprise databases (e.g., Oracle Enterprise Database and
> >> IBM's Informix DB) tend to do is fallocate a chunk of space (say,
> >> 16MB or 32MB), because on legacy Unix OS's this tends to make some
> >> file systems' block allocators more likely to allocate a contiguous
> >> block range, and then immediately write zeros over that 16MB or
> >> 32MB, plus an fdatasync(2). This fdatasync(2) updates the extent
> >> tree once to mark that 16MB or 32MB of the database's tablespace
> >> file as initialized, so you only pay for the metadata update once,
> >> instead of every few dozen kilobytes as you write each database
> >> commit into the tablespace file.
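
(To make sure I follow the chunked approach described above, here is a
rough sketch of how I read it. The helper name, the 1MB zero buffer,
and the minimal error handling are placeholders I picked, not anything
the databases mentioned above actually do.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK_SIZE (16 * 1024 * 1024)          /* 16MB, as suggested above */

/* Extend the file by one preallocated, zero-filled chunk at 'offset',
 * paying the unwritten->initialized conversion once per chunk. */
static int extend_by_chunk(int fd, off_t offset)
{
        static char zeroes[1024 * 1024];       /* 1MB of zeros */
        size_t done = 0;
        ssize_t n;

        if (fallocate(fd, 0, offset, CHUNK_SIZE) != 0)
                return -1;
        while (done < CHUNK_SIZE) {
                n = pwrite(fd, zeroes, sizeof(zeroes), offset + done);
                if (n <= 0)
                        return -1;
                done += n;
        }
        /* One journal/metadata update for the whole chunk, instead of
         * one per small application write. */
        return fdatasync(fd);
}

If I understand correctly, later application writes that land inside a
chunk prepared this way are plain overwrites, so the per-write
fdatasync(2) no longer forces an extent-tree update.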
> >>
> >> There is also an old, out-of-tree patch which enables an fallocate
> >> mode called "no hide stale", which marks the extent tree blocks
> >> that are allocated using fallocate(2) as initialized. This
> >> substantially speeds things up, but it is potentially a walking,
> >> talking HIPAA or PCI violation, in that revealing previously
> >> written data is considered a horrible security violation by most
> >> file system developers.
> >>
> >> If you know, say, that a cluster file system is the only user of
> >> the file system, and all data is written encrypted at rest using a
> >> per-user key, such that exposing stale data is not a security
> >> disaster, the "no hide stale" flag could be "safe" in that highly
> >> specialized use case.
> >>
> >> But that assumes that file system authors can trust application
> >> writers not to do something stupid and insecure, and historically,
> >> file system authors (possibly with good reason, given bitter past
> >> experience) don't trust application writers to do something which
> >> is very easy and gooses performance, even if it has terrible side
> >> effects on either data robustness or data security.
> >>
> >> Effectively, the no-hide-stale flag could be considered an
> >> "attractive nuisance"[1], and so support for this feature has never
> >> been accepted into the mainline kernel, nor into any distro
> >> kernels, since the distribution companies don't want to be held
> >> liable for creating an "attractive nuisance" that might enable
> >> application authors to shoot themselves in the foot.
> >>
> >> [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
> >>
> >> In any case, the technique of fallocate(2) plus zero-fill-write
> >> plus fdatasync(2) isn't *that* slow, and it is only needed when you
> >> are first extending the tablespace file. In the steady state, most
> >> database applications tend to be overwriting space, so this isn't
> >> an issue.
> >>
> >> In any case, if you need to get that last 5% or so of performance
> >> --- say, if you are an enterprise database company interested in
> >> taking out a full-page advertisement on the back cover of Business
> >> Week magazine touting how your enterprise database benchmarks are
> >> better than the competition --- the simple solution is to use a raw
> >> block device. Of course, most end users want the convenience of the
> >> file system, but that's not the point if you are engaging in
> >> benchmarketing. :-)
> >>
> >> Cheers,
> >>
> >> - Ted
> >
> > Thank you for a comprehensive answer :-)
> >
> > I have one more question - when I gradually increase the I/O
> > transfer size, the performance degradation begins to lessen, and at
> > 32K it is similar to the "overwriting the file" case. I assume this
> > is because the metadata update is now spread over 32K of data rather
> > than 4K.
>
> When splitting unwritten extents, the ext4 code will write out zero
> blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb)
> to avoid having millions of very small extents in a file (e.g. in
> case of a pathological alternating 4KB write pattern). If your test
> is writing >= 32KB blocks then this no longer needs to be done. If
> writing smaller blocks then it makes sense that the speed is 1/2 the
> raw speed, because the file blocks are all being written twice (first
> with zeroes, then with actual data on a later write).
>
> 32KB (or 64KB) is a reasonable minimum size because any disk write
> will take the same time to write a single block or a whole sector,
> so doing writes in smaller units is not very efficient. Depending
> on the underlying storage (e.g. RAID-6) it might be more efficient
> to set extent_max_zeroout_kb=1024 or similar.
>
> > However, my understanding is that, in my case, an extent should
> > represent at most 128MiB of data, and so the clearing of the
> > uninitialized bit for an extent should happen once every 128MiB, so
> > then why is a higher transfer size making a difference?
>
> You are misunderstanding how uninitialized extents are cleared. The
> uninitialized extent is split into two/three parts, where only the
> extent that has data written to it (min 32KB) is set to
> "initialized", and the remaining one/two extents are left
> uninitialized. Otherwise, each write to an uninitialized extent would
> need up to 128MB of zeroes written to disk each time, which would be
> slow/high latency.
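
(For checking this on my side: something like the FIEMAP sketch below
should list which extents of a file are still marked unwritten, so the
split described above can be observed directly. The 64-extent buffer is
an arbitrary limit I chose; filefrag -v from e2fsprogs reports the same
information.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        /* room for up to 64 extent records */
        size_t sz = sizeof(struct fiemap) + 64 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        unsigned int i;
        int fd;

        if (argc < 2 || fm == NULL)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = ~0ULL;                 /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;       /* flush dirty data first */
        fm->fm_extent_count = 64;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return 1;

        for (i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];

                printf("extent %u: logical %llu, length %llu, %s\n", i,
                       (unsigned long long)e->fe_logical,
                       (unsigned long long)e->fe_length,
                       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
                               "unwritten" : "initialized");
        }
        close(fd);
        return 0;
}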
>
> Cheers, Andreas

Thank you, and sorry for the delay in responding.

What kind of write will stop an uninitialized extent from being split?
For example, I want to create a file, fallocate 512MB (mode 0), and
zero-fill it, but I want the file system to create only 4 extents, so
that they all fit in the inode itself and each extent covers a full
128MB (i.e. no splitting). Even if I issue large writes, my
understanding is that the kernel / hardware restrictions will
ultimately split the I/O into smaller chunks, and thereby cause the
extents to split. For example, this is what I see on my test system:

# cat /sys/block/nvme1n1/queue/max_hw_sectors_kb
128
# cat /sys/block/nvme1n1/queue/max_sectors_kb
128
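
To make the question concrete, the pattern I have in mind is roughly
the sketch below (the file name, the 8MB buffer size and the error
handling are placeholders, and I have not verified whether this
actually avoids the splitting):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define FILE_SIZE (512LL * 1024 * 1024)        /* 512MB, fallocate mode 0 */
#define BUF_SIZE  (8 * 1024 * 1024)            /* zero-fill in 8MB writes */

int main(void)
{
        char *buf = calloc(1, BUF_SIZE);       /* 8MB of zeros */
        off_t off;
        int fd;

        fd = open("/mnt/test/prealloc.dat",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || buf == NULL)
                return 1;
        if (fallocate(fd, 0, 0, FILE_SIZE) != 0)
                return 1;
        for (off = 0; off < FILE_SIZE; off += BUF_SIZE)
                if (pwrite(fd, buf, BUF_SIZE, off) != BUF_SIZE)
                        return 1;
        /* one fdatasync for the whole zero-filled region */
        if (fdatasync(fd) != 0)
                return 1;
        close(fd);
        free(buf);
        return 0;
}

I would then check the resulting layout with the FIEMAP program (or
filefrag -v) mentioned above.

Santosh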