On Sun, Jun 28, 2020 at 9:04 AM <alexandrebfarias@xxxxxxxxx> wrote:

> I'm willing to perform further testing. There shouldn't be anything very
> special about my workload. I was working mostly with NodeJS 12 and React
> Native. VS Code (I should mention I make use of TabNine, which can be a
> huge drain on system resources). So, in a typical work session I'd have
> the Android emulator open, PostgreSQL, some Chrome tabs, VS Code,
> probably Emacs, plus the React Native metro server and an Express.js
> backend.

Databases and VM images are things Btrfs is bad at out of the box. Most of this has to do with the fsync-heavy behavior these applications inherit from other file systems. Btrfs is equipped to deal with an fsync-heavy world out of the box, using a tree log that's enabled by default, but it can still be slow for some workloads.

What I think is going on in your case: Btrfs is copy-on-write for everything by default. If the workload involves heavy writes on a small volume [1], the SSD gets no hinting about deallocated blocks. XFS will overwrite in place, and that is the hint the SSD firmware needs to erase those blocks and prepare them for fast writes. While you can turn copy-on-write off for data, it's always copy-on-write for metadata, and the workload you're describing is metadata heavy.

I don't think your SSD is bad. I think it's just (a) small for the workload and (b) not getting any hints about what's been freed up, so it can't prepare for future writes. The SSD ends up trying to erase blocks right at the moment of allocation, which is very slow for any SSD.

Also, the workload implies a lot of fsyncs. Other file systems need them; Btrfs really doesn't. But it's an fsync-dominated world, so it has to fit in, and while it has some optimizations for this, it can still be slower than XFS.

How to address this? Stick with what's working: use XFS. This is also consistent with Facebook still keeping these workloads on XFS. But if you really want to give Btrfs a shot at your workload, there are three possible optimizations:
1. Mount options space_cache=v2 (this will be the default soon) and discard=async. This might fix most of the problem. If I'm correct that the SSD is just inundated, this will give it the hints it needs to prepare blocks for fast writes, but not so aggressively that the hints themselves slow things down. (It's a fine line between getting no hints and a fire hose of them; discard=async is in between.)

2. Mount option flushoncommit (you'll get benign, but annoying, WARN_ONs in dmesg), and fsync = off in postgresql.conf (really, everywhere you can). Note: if you get a crash you'll lose the last ~30s of commits, but the database and the file system are expected to be consistent. The commit interval is configurable and defaults to 30s; I suggest leaving it there for testing. Changing it is mainly a risk vs. performance trade-off.

3. For VM images there are two schools of thought, depending on your risk tolerance.

   A. nodatacow (chattr +C). Use with cache=writeback. flushoncommit isn't necessary.

   B. datacow. Use with compression (mount option or chattr +c). Use with cache=unsafe. flushoncommit highly recommended.

Note 1: chattr +C/+c needs to be set at the time of the file's creation; it won't work after the fact. Set it on the containing directory before copying an image over.

Note 2: Invariably you will prefer the performance of B. But obviously it's not going to be a default configuration, because cache=unsafe basically drops fsyncs, and flushoncommit spams WARN_ONs.

And yeah, how would anyone know all of this? Is it an opportunity for docs (probably) or for desktop integration? Detect this workload, or ask the user? I'm not sure.

[1] From your email, the kickstart shows

> part btrfs.475 --fstype="btrfs" --ondisk=sda --size=93368

93G is likely making things worse for your SSD. It's small for this workload. Chances are that if it were bigger, it would cope better by effectively being over-provisioned, and it would more easily get erase blocks ready. But discard=async will mitigate this.
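To make option 1 concrete, here's a sketch of what the fstab entry might look like. The UUID and mount point are placeholders, not from the original report:

```
# /etc/fstab -- UUID and mount point are examples; adjust for your system
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  space_cache=v2,discard=async  0 0
```

The same options can be tested non-persistently with mount -o remount before committing them to fstab.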
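For option 2, the PostgreSQL side is a one-line change. This is only sensible for testing on a flushoncommit-mounted Btrfs; the setting is real, but treat the combination as an experiment:

```
# postgresql.conf -- testing only, alongside the flushoncommit mount option.
# A crash can lose the last ~30s of commits, but fs and db should stay consistent.
fsync = off
```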
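And since Note 1 trips people up, here's a sketch of the nodatacow setup from option 3A. The directory path and image name are examples only, assuming a libvirt-style layout:

```shell
# Create the directory first, then set No_COW on it; files created or
# copied into it afterward inherit the attribute.
mkdir /var/lib/libvirt/images/nocow
chattr +C /var/lib/libvirt/images/nocow

# Copying the image in creates a new file, which inherits +C.
cp /var/lib/libvirt/images/original.qcow2 /var/lib/libvirt/images/nocow/

# Verify: 'C' should appear in the attribute flags.
lsattr /var/lib/libvirt/images/nocow/original.qcow2
```

Running chattr +C on the existing image file instead would appear to succeed but not actually disable copy-on-write for data already written, which is why the directory has to get the flag first.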
--
Chris Murphy
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx