On Fri, Dec 23, 2022 at 9:06 AM Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:
>
> Dear Ilya,
>
>
> On 22.12.22 at 16:25, Ilya Dryomov wrote:
> > On Thu, Dec 22, 2022 at 3:41 PM Roose, Marco <marco.roose@xxxxxxxxxxxxx> wrote:
> >
> >> thanks for providing the revert. Using that commit all is fine:
> >>
> >> ~# uname -a
> >> Linux S1020-CephTest 6.1.0+ #1 SMP PREEMPT_DYNAMIC Thu Dec 22 14:30:22 CET
> >> 2022 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> ~# rsync -ah --progress /root/test-file_1000MB /mnt/ceph/test-file_1000MB
> >> sending incremental file list
> >> test-file_1000MB
> >>           1.00G 100%   90.53MB/s    0:00:10 (xfr#1, to-chk=0/1)
> >>
> >> I attach some ceph reports taken before, during and after an rsync on a bad
> >> kernel (5.6.0) for debugging.
> >
> > I see two CephFS data pools and one of them is nearfull:
> >
> >     "pool": 10,
> >     "pool_name": "cephfs_data",
> >     "create_time": "2020-11-22T08:19:53.701636+0100",
> >     "flags": 1,
> >     "flags_names": "hashpspool",
> >
> >     "pool": 11,
> >     "pool_name": "cephfs_data_ec",
> >     "create_time": "2020-11-22T08:22:01.779715+0100",
> >     "flags": 2053,
> >     "flags_names": "hashpspool,ec_overwrites,nearfull",
> >
> > How is this CephFS filesystem configured?  If you end up writing to
> > the cephfs_data_ec pool there, the slowness is expected.  nearfull makes
> > the client revert to synchronous writes so that it can properly return
> > an ENOSPC error when nearfull develops into full.  That is the whole point
> > of the commit that you landed upon when bisecting, so of course
> > reverting it helps:
> >
> > -     if (ceph_osdmap_flag(&fsc->client->osdc, CEPH_OSDMAP_NEARFULL))
> > +     if ((map_flags & CEPH_OSDMAP_NEARFULL) ||
> > +         (pool_flags & CEPH_POOL_FLAG_NEARFULL))
> >               iocb->ki_flags |= IOCB_DSYNC;
>
> Well, that effect is not documented in the commit message, and for the
> user it's a regression that the existing (previously working)
> configuration performs worse after updating the Linux kernel.  That
> violates Linux's no-regression policy, and at least needs to be better
> documented and explained.

Hi Paul,

This isn't a regression -- CephFS has always behaved this way.  In fact,
these states (nearfull and full) used to be global, meaning that filling
up some random pool, completely unrelated to CephFS, still led to
synchronous behavior!  This was fixed in the Mimic release: these states
became per-pool and the global CEPH_OSDMAP_NEARFULL and CEPH_OSDMAP_FULL
flags were deprecated.  The referenced commit just caught the kernel
client up with that OSD-side change -- which is a definite improvement.

Unfortunately this catch-up was almost two years late (Mimic went out in
2018).  Users shouldn't have noticed: our expectation was that the
global flags would continue to be set for older clients, ensuring that
such clients could revert to synchronous writes as before.  However, as
noted in the commit message, the deprecation change turned out to be
backwards incompatible by mistake and the net effect was that the global
flags just stopped being set.  As a result, for a Mimic (or later)
cluster + any kernel client combination, synchronous behavior just
vanished and _this_ was a regression.  After a while Yanhu noticed and
reported it.

So the commit in question actually fixes a regression rather than
introducing one.  You just happened to run into a case of a nearfull
pool with a newer cluster and an older kernel client.
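To make the check in the diff above a bit more concrete, here is a
minimal userspace C sketch of the decision the client ends up making.
The EX_* constants and must_write_sync() are stand-ins for illustration
only -- they are not the actual kernel symbols, just a model of the
global-flag vs per-pool-flag logic:

/*
 * Illustrative userspace model only -- not the actual kernel code.
 * It mimics the decision the CephFS client makes: force synchronous
 * writes when either the whole OSD map or the specific data pool
 * being written to is flagged nearfull.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EX_OSDMAP_NEARFULL    (1u << 0)  /* stand-in for CEPH_OSDMAP_NEARFULL */
#define EX_POOL_FLAG_NEARFULL (1u << 1)  /* stand-in for CEPH_POOL_FLAG_NEARFULL */

static bool must_write_sync(uint32_t map_flags, uint32_t pool_flags)
{
        /*
         * Pre-Mimic behavior: only the global map flag mattered, so any
         * nearfull pool in the cluster slowed CephFS down.  Post-Mimic
         * (and with the referenced commit): the flag on the pool actually
         * being written to is what triggers synchronous writes.
         */
        return (map_flags & EX_OSDMAP_NEARFULL) ||
               (pool_flags & EX_POOL_FLAG_NEARFULL);
}

int main(void)
{
        /* cephfs_data: no nearfull flag -> buffered writes stay fast. */
        printf("cephfs_data:    sync=%d\n", must_write_sync(0, 0));

        /* cephfs_data_ec: pool is nearfull -> writes become synchronous. */
        printf("cephfs_data_ec: sync=%d\n",
               must_write_sync(0, EX_POOL_FLAG_NEARFULL));
        return 0;
}

In your reports only cephfs_data_ec carries the nearfull flag, so only
writes that land in that pool go synchronous.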
Had the global -> per-pool flags change in Mimic been backwards
compatible as intended, you would have encountered a performance drop
immediately after the cephfs_data_ec pool had exceeded the nearfull
watermark.

The reason for the synchronous behavior is that, thanks to an advanced
caps system [1], CephFS clients can buffer pretty large amounts of data
as well as carry out many metadata operations locally.  If the pool is
nearing capacity, determining whether there is enough space left for all
that data is tricky.  Switching to performing writes synchronously
allows the client to generate an ENOSPC error in a timely manner.

[1] https://www.youtube.com/watch?v=VgNI5RQJGp0

Hope this explanation helps!

                Ilya