Re: User experience issue on btrfs

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sat, 4 Jul 2020 18:03:10 -0600

On Sat, Jul 4, 2020 at 3:29 PM Scott Schmit <i.grok@xxxxxxxxxxx> wrote:
>
> On Fri, Jul 03, 2020 at 10:37:43AM -0600, Chris Murphy wrote:
> > On Thu, Jul 2, 2020 at 10:29 PM Scott Schmit <i.grok@xxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jun 28, 2020 at 03:40:11PM -0600, Chris Murphy wrote:
> > > > Databases and VM images are things btrfs is bad at out of the box.
> > > > Most of this has to do with fsync dependency of other file systems.
> > > > Btrfs is equipped to deal with an fsync heavy world out of the box,
> > > > using treelog enabled by default. But can still be slow for some
> > > > workloads.
> > >
> > > Does this also impact mariadb databases?  I've noticed that since
> > > reinstalling my machine with mediawiki installed, the performance of the
> > > wiki has dropped noticeably when the cache is cold (just loading the
> > > pages, not editing them).
> >
> > Good question. A complete answer leads to a lot more questions.
> >
> > Mariadb has a couple older docs on this: one suggests using 'noatime'
> > mount option on all file systems [1] as an optimization, and
> > additionally for Btrfs to use 'nodatacow' [2]. It can be set per
> > directory or per file using 'chattr +C' before files are created - it
> > won't work after the fact. 'chattr +C' will make files behave like
> > it's on any other filesystem: all writes are overwrites instead of
> > copy-on-write, no checksums, no compression.
> >
> > Is this stale information? Is there something unrelated going on in
> > your case? Should databases setup these optimizations on behalf of
> > users? Does storage type make a difference? I'm just going to set
> > those aside for now.
>
> FWIW, neither /var/lib/mysql nor any of the files under it were set up
> with +C.

That's expected. There is precedent for optimizing automatically, e.g.
systemd-journald sets chattr +C on /var/log/journal when it detects
it's on Btrfs.

Rabbit hole sidebar: It's an open question if this is really needed on
SSD. The latency hit on HDD makes this optimization more useful. Also,
when rotating the journals, systemd submits the journal for
defragmentation by Btrfs. So we get some extra writes on SSDs because
of this, and since it's nodatacow, it can't be compressed. So lately I
'touch /etc/tmpfiles.d/journal-nocow.conf' to prevent journald from
setting /var/log/journal to nodatacow. The journals are sometimes
compressed as much as 10:1 using the *lowest* zstd compression level.
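
To check whether a directory currently carries the flag, a minimal
sketch along these lines might help. It assumes 'lsattr' from
e2fsprogs is installed; the function names has_nocow_flag and is_nocow
are made up for illustration. In lsattr output, uppercase 'C' is
No_COW and lowercase 'c' is the unrelated compression attribute.

```shell
# Succeeds if an lsattr-style flags field contains the No_COW flag.
# The match is case sensitive: 'C' is No_COW, 'c' is compression.
has_nocow_flag() {
    case "$1" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Hypothetical wrapper: read the flags of a single path with lsattr -d
# (first field of its output) and test them. Fails if lsattr is missing
# or the filesystem does not report attributes.
is_nocow() {
    flags=$(lsattr -d "$1" 2>/dev/null | awk '{ print $1 }')
    has_nocow_flag "$flags"
}
```

For example, 'is_nocow /var/log/journal && echo nodatacow' would print
only if the journal directory still has +C set.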

Is there much optimization possible here? The journals aren't of
significant size or number, so I don't think it really matters. But I
think it's useful to look into these issues for databases because COW
is relevant. Btrfs has it by default. But it also happens with reflink
copies on XFS, and with writes following snapshots on dm-thin.

> I'm not using noatime, but I am using relatime.  This isn't a terribly
> large wiki -- just a personal setup (about 19M if I'm measuring the
> right files).  It's also more of a server use case than a workstation
> one.  That said, I'd imagine it's on the order of what a developer might
> set up for testing.

Yeah I should have asked. I kinda doubt a database of this size would
see a meaningful performance difference between cow and nocow. You
could try letting it age for, say, a month as nodatacow, then go back
to datacow and let it age another month, and at the end of each month
compare with the 'filefrag' command. COW itself isn't the cause of
overhead; even SSDs use COW internally. But there is a fragmentation
factor, and the extents have a cumulative tracking cost, CPU and
memory wise. But even with extreme fragmentation of 19M, I don't know
for sure without testing it, but I wouldn't be surprised if it had no
or very low cost.
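
For the end-of-month comparison, a rough sketch like the following
could total up what filefrag reports. It assumes filefrag's usual
summary line format ("<path>: N extents found"); the function name
total_extents is made up for illustration.

```shell
# Sum the extent counts from filefrag summary lines read on stdin.
# Handles both "1 extent found" and "N extents found"; prints 0 when
# no summary lines are seen.
total_extents() {
    sed -n 's/.*: \([0-9][0-9]*\) extents\{0,1\} found.*/\1/p' |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

Usage would be something like
'find /var/lib/mysql -type f -exec filefrag {} + | total_extents',
run once per month and compared. One caveat: with compression enabled,
Btrfs compressed extents are at most 128KiB, so filefrag counts can
look high without the file being badly fragmented.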

>
> > To give the nodatacow suggestion a try:
> > ## shutdown the database
> > # mkdir /var/lib/mysql2
> > # chattr +C /var/lib/mysql2
> > # cp -a /var/lib/mysql/. /var/lib/mysql2/
> > # rm -rf /var/lib/mysql/
> > # mv /var/lib/mysql2/ /var/lib/mysql/
> > ## resume operation
>
> Doing the manipulations to make it nocow doesn't appear to have made a
> significant difference: I still see a delay between the raw page (sans
> CSS) loading and the CSS loading to make it look right.  I thought it
> had lessened when I tried it last night, but when I tried again today,
> it was back just as long.  When I was running on LVM+ext4, I remember no
> delay.  Maybe the database has nothing to do with it?

Maybe. But then where is it coming from? Another rabbit hole is
performance troubleshooting. bcc-tools has file system tracing tools
for this purpose, but I haven't dug into any of that.

> Incidentally... how does one handle chattr +C as part of tar backups and
> the like?

My expectation is that as it's a local optimization, it gets set when
it's copied (created) locally by inheriting +C from a directory. I'm
not sure if there's a way to store/restore file attributes with tar.
So what or who should set it? The distribution could do it at install
time, use an anaconda post-install script to set it on target
directories, or bake it into the rpm file. What about directories that
don't yet exist? Is it an application responsibility? These are the
questions. I kinda like the systemd approach. If the recommendation
changes, a future update can cause it to be unset. This has its own
drawback, as you can see from the 'touch' command earlier in this
email: I now inhibit the setting of +C on journals.
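
For the tar case, since +C is inherited at file creation, one way to
get it back on restore might be to set it on an empty target directory
first and extract into that. A minimal sketch, assuming a plain tar
archive; the function name restore_into_nocow and its arguments are
made up for illustration, and I haven't tested this on a real database
restore.

```shell
# Hypothetical helper: extract an archive into a directory that has +C
# set beforehand, so newly created files inherit nodatacow on Btrfs.
restore_into_nocow() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # +C only affects files created afterwards, so set it before
    # extracting. On non-Btrfs filesystems chattr fails; warn and
    # continue with an ordinary restore.
    chattr +C "$dest" 2>/dev/null ||
        echo "warning: could not set +C on $dest" >&2
    tar -xf "$archive" -C "$dest"
}
```

Usage would be 'restore_into_nocow backup.tar /var/lib/mysql' with the
database stopped and the target directory empty or not yet created.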

-- 
Chris Murphy