On Fri, 2019-08-16 at 14:12 +0200, Jonas Jelten wrote:
> Hi!
>
> I've missed your previous post, but we do have inline_data enabled on our cluster.
> We've not yet benchmarked, but the filesystem has a wide variety of file sizes, and it sounded like a good idea to speed
> up performance. We mount it with the kernel client only, and I've had the subjective impression that latency was better
> once we enabled the feature. Now that you say the kernel client has no write support for it, my impression is probably
> wrong.
>
> I think inline_data is a nice and easy way to improve performance when the CephFS metadata are on SSDs but the bulk data
> is on HDDs. So I'd vote against removal and would instead vouch for improvements of this feature :)
>
> If storage on the MDS is a problem, files could be stored on a different (e.g. SSD) pool instead, and the file size
> limit and pool selection could be configured via xattrs. And there was some idea to store small objects not in the OSD
> block, but only in the OSD's DB (which is more complicated to use than separate SSD-pool and HDD-pool, but when block.db
> is on an SSD the speed would be better). Maybe this could all be combined to have better small-file performance in CephFS!
>

The main problem is developer time and the maintenance burden this
feature represents. This is very much a non-trivial thing to implement.

Consider that the read() and write() codepaths in the kernel already
have 3 main branches each:

    buffered I/O (when Fcb caps are held)

    synchronous I/O (when Fcb caps are not held)

    O_DIRECT I/O

We could probably consolidate the O_DIRECT and sync I/O code somewhat,
but buffered is handled entirely differently.

Once we mix in inline_data support, we have to add a completely new
branch for each of those cases, effectively doubling the complexity.
We'd also need to add similar handling for mmap'ed I/O and for things
like copy_file_range.

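To make the doubling concrete, here is a toy Python model of the branch
count described above. It is only a sketch, not the kernel code: the
names IoMode, select_mode and code_paths are hypothetical, and the
ordering of the checks (O_DIRECT first, then the caps) is an assumption
of the sketch.

```python
from enum import Enum
from itertools import product


class IoMode(Enum):
    BUFFERED = "buffered"   # Fcb caps are held
    SYNC = "sync"           # Fcb caps are not held
    DIRECT = "O_DIRECT"


def select_mode(o_direct: bool, have_fcb_caps: bool) -> IoMode:
    """Pick one of the three main branches (hypothetical helper).

    Assumption: O_DIRECT is checked before the caps.
    """
    if o_direct:
        return IoMode.DIRECT
    return IoMode.BUFFERED if have_fcb_caps else IoMode.SYNC


def code_paths(inline_data_support: bool) -> list:
    """Enumerate the (operation, mode, layout) combinations to maintain."""
    layouts = ("regular", "inline") if inline_data_support else ("regular",)
    return list(product(("read", "write"), IoMode, layouts))


if __name__ == "__main__":
    print(len(code_paths(False)), "paths without inline_data support")
    print(len(code_paths(True)), "paths with inline_data support")
```

The model leaves out mmap'ed I/O and copy_file_range, which would each
need their own inline-aware handling on top of this.
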
But, even before that...I have some real concerns about the existing
handling, even with a single client.

While I haven't attempted to roll a testcase for it, I think we can
probably hit races where multiple tasks handling write page faults can
compete to uninline the data, potentially clobbering the others' writes.
Again, this is non-trivial to fix.

In summary I don't see a real future for this feature unless someone
wants to step up to own it and commit to fixing up these problems.

> On 16/08/2019 13.15, Jeff Layton wrote:
> > A couple of weeks ago, I sent a request to the mailing list asking
> > whether anyone was using the inline_data support in cephfs:
> >
> > https://docs.ceph.com/docs/mimic/cephfs/experimental-features/#inline-data
> >
> > I got exactly zero responses, so I'm going to formally propose that we
> > move to start deprecating this feature for Octopus.
> >
> > Why deprecate this feature?
> > ===========================
> > While the userland clients have support for both reading and writing,
> > the kernel only has support for reading, and aggressively uninlines
> > everything as soon as it needs to do any writing. That uninlining has
> > some rather nasty potential race conditions too that could cause data
> > corruption.
> >
> > We could work to fix this, and maybe add write support for the kernel,
> > but it adds a lot of complexity to the read and write codepaths in the
> > clients, which are already pretty complex. Given that there isn't a lot
> > of interest in this feature, I think we ought to just pull the plug on
> > it.
> >
> > How should we do this?
> > ======================
> > We should start by disabling this feature in master for Octopus.
> >
> > In particular, we should stop allowing users to call "fs set inline_data
> > true" on filesystems where it's disabled, and maybe throw a loud warning
> > about the feature being deprecated if the mds is started on a filesystem
> > that has it enabled.
> >
> > We could also consider creating a utility to crawl an existing
> > filesystem and uninline anything there, if there was need for it.
> >
> > Then, in a few release cycles, once we're past the point where someone
> > can upgrade directly from Nautilus (release Q or R?) we'd rip out
> > support for this feature entirely.
> >
> > Thoughts, comments, questions welcome.
> >

--
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx