On Fri, Dec 23, 2022 at 9:06 AM Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:
>
> Dear Ilya,
>
>
> On 22.12.22 at 16:25, Ilya Dryomov wrote:
> > On Thu, Dec 22, 2022 at 3:41 PM Roose, Marco <marco.roose@xxxxxxxxxxxxx> wrote:
> >
> >> thanks for providing the revert. Using that commit all is fine:
> >>
> >> ~# uname -a
> >> Linux S1020-CephTest 6.1.0+ #1 SMP PREEMPT_DYNAMIC Thu Dec 22 14:30:22 CET
> >> 2022 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> ~# rsync -ah --progress /root/test-file_1000MB /mnt/ceph/test-file_1000MB
> >> sending incremental file list
> >> test-file_1000MB
> >>           1.00G 100%   90.53MB/s    0:00:10 (xfr#1, to-chk=0/1)
> >>
> >> I attach some ceph reports taken before, during and after an rsync on a bad
> >> kernel (5.6.0) for debugging.
> >
> > I see two CephFS data pools and one of them is nearfull:
> >
> >     "pool": 10,
> >     "pool_name": "cephfs_data",
> >     "create_time": "2020-11-22T08:19:53.701636+0100",
> >     "flags": 1,
> >     "flags_names": "hashpspool",
> >
> >     "pool": 11,
> >     "pool_name": "cephfs_data_ec",
> >     "create_time": "2020-11-22T08:22:01.779715+0100",
> >     "flags": 2053,
> >     "flags_names": "hashpspool,ec_overwrites,nearfull",
> >
> > How is this CephFS filesystem configured?  If you end up writing to
> > the cephfs_data_ec pool there, the slowness is expected.  nearfull makes
> > the client revert to synchronous writes so that it can properly return
> > an ENOSPC error when nearfull develops into full.  That is the whole point
> > of the commit that you landed upon when bisecting, so of course
> > reverting it helps:
> >
> > -     if (ceph_osdmap_flag(&fsc->client->osdc, CEPH_OSDMAP_NEARFULL))
> > +     if ((map_flags & CEPH_OSDMAP_NEARFULL) ||
> > +         (pool_flags & CEPH_POOL_FLAG_NEARFULL))
> >               iocb->ki_flags |= IOCB_DSYNC;
>
> Well, that effect is not documented in the commit message, and for the
> user it's a regression that the existing (previously working)
> configuration performs worse after updating the Linux kernel.  That
> violates Linux's no-regression policy, and at least needs to be better
> documented and explained.

Hi Paul,

This isn't a regression -- CephFS has always behaved this way.  In fact,
these states (nearfull and full) used to be global, meaning that filling
up some random pool, completely unrelated to CephFS, still led to
synchronous behavior!  This was fixed in the Mimic release: these states
became per-pool and the global CEPH_OSDMAP_NEARFULL and CEPH_OSDMAP_FULL
flags were deprecated.  The referenced commit just caught the kernel
client up with that OSD-side change -- which is a definite improvement.

Unfortunately this catch-up was almost two years late (Mimic went out in
2018).  Users shouldn't have noticed: our expectation was that the
global flags would continue to be set for older clients, ensuring that
such clients could revert to synchronous writes as before.  However, as
noted in the commit message, the deprecation change turned out to be
backwards incompatible by mistake and the net effect was that the global
flags just stopped being set.  As a result, for a Mimic (or later)
cluster + any kernel client combination, synchronous behavior just
vanished and _this_ was a regression.  After a while Yanhu noticed and
reported it.

So the commit in question actually fixes a regression rather than
introducing one.  You just happened to run into a case of a nearfull
pool with a newer cluster and an older kernel client.
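To make the check in the diff above a bit more concrete, here is a
minimal userspace C sketch of the decision the client ends up making.
The EX_* constants and must_write_sync() are stand-ins for illustration
only -- they are not the actual kernel symbols, just a model of the
global-flag vs per-pool-flag logic:

/*
 * Illustrative userspace model only -- not the actual kernel code.
 * It mimics the decision the CephFS client makes: force synchronous
 * writes when either the whole OSD map or the specific data pool
 * being written to is flagged nearfull.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EX_OSDMAP_NEARFULL    (1u << 0)  /* stand-in for CEPH_OSDMAP_NEARFULL */
#define EX_POOL_FLAG_NEARFULL (1u << 1)  /* stand-in for CEPH_POOL_FLAG_NEARFULL */

static bool must_write_sync(uint32_t map_flags, uint32_t pool_flags)
{
        /*
         * Pre-Mimic behavior: only the global map flag mattered, so any
         * nearfull pool in the cluster slowed CephFS down.  Post-Mimic
         * (and with the referenced commit): the flag on the pool actually
         * being written to is what triggers synchronous writes.
         */
        return (map_flags & EX_OSDMAP_NEARFULL) ||
               (pool_flags & EX_POOL_FLAG_NEARFULL);
}

int main(void)
{
        /* cephfs_data: no nearfull flag -> buffered writes stay fast. */
        printf("cephfs_data:    sync=%d\n", must_write_sync(0, 0));

        /* cephfs_data_ec: pool is nearfull -> writes become synchronous. */
        printf("cephfs_data_ec: sync=%d\n",
               must_write_sync(0, EX_POOL_FLAG_NEARFULL));
        return 0;
}

In your reports only cephfs_data_ec carries the nearfull flag, so only
writes that land in that pool go synchronous.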
Had the global -> per-pool flags change in Mimic been backwards
compatible as intended, you would have encountered a performance drop
immediately after the cephfs_data_ec pool had exceeded the nearfull
watermark.

The reason for the synchronous behavior is that, thanks to an advanced
caps system [1], CephFS clients can buffer pretty large amounts of data
as well as carry out many metadata operations locally.  If the pool is
nearing capacity, determining whether there is enough space left for all
that data is tricky.  Switching to performing writes synchronously
allows the client to generate an ENOSPC error in a timely manner.

[1] https://www.youtube.com/watch?v=VgNI5RQJGp0

Hope this explanation helps!

                Ilya