Re: deprecating inline_data support for CephFS

On Fri, 2019-08-16 at 14:12 +0200, Jonas Jelten wrote:
> Hi!
> 
> I've missed your previous post, but we do have inline_data enabled on our cluster.
> We've not yet benchmarked, but the filesystem has a wide variety of file sizes, and it sounded like a good idea to speed
> up performance. We mount it with the kernel client only, and I've had the subjective impression that latency was better
> once we enabled the feature. Now that you say the kernel client has no write support for it, my impression is probably
> wrong.
>
> I think inline_data is a nice and easy way to improve performance when the CephFS metadata are on SSDs but the bulk data
> is on HDDs. So I'd vote against removal and would instead vouch for improvements of this feature :)
> 
> If storage on the MDS is a problem, files could be stored on a different (e.g. SSD) pool instead, and the file size
> limit and pool selection could be configured via xattrs. And there was some idea to store small objects not in the OSD
> block, but only in the OSD's DB (which is more complicated to use than separate SSD-pool and HDD-pool, but when block.db
> is on an SSD the speed would be better). Maybe this could all be combined to have better small-file performance in CephFS!
> 

The main problem is developer time and the maintenance burden this
feature represents. This is very much a non-trivial thing to implement.
Consider that the read() and write() codepaths in the kernel already
have 3 main branches each:

buffered I/O (when Fcb caps are held)
synchronous I/O (when Fcb caps are not held)
O_DIRECT I/O

We could probably consolidate the O_DIRECT and sync I/O code somewhat,
but buffered is handled entirely differently. Once we mix in inline_data
support, we have to add a completely new branch for each of those cases,
effectively doubling the complexity.
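
To make the branch count concrete, here is a toy Python model of the dispatch described above. It is not kernel code: the names (IOPath, pick_io_path, count_code_paths) are invented for illustration, and the order in which the conditions are checked is an assumption. It only counts how many distinct (path, inline-or-not) combinations one direction (read or write) would need to handle.

```python
from enum import Enum, auto
from itertools import product


class IOPath(Enum):
    """The three I/O branches named in the message (hypothetical names)."""
    BUFFERED = auto()   # Fcb caps are held
    SYNC = auto()       # Fcb caps are not held
    DIRECT = auto()     # file opened with O_DIRECT


def pick_io_path(o_direct: bool, has_fcb_caps: bool) -> IOPath:
    # Assumption: O_DIRECT is checked first, then caps decide
    # between buffered and synchronous I/O.
    if o_direct:
        return IOPath.DIRECT
    if has_fcb_caps:
        return IOPath.BUFFERED
    return IOPath.SYNC


def count_code_paths(inline_data_supported: bool) -> int:
    """Count distinct (I/O path, inline?) cases for one direction."""
    inline_states = (False, True) if inline_data_supported else (False,)
    cases = set()
    for o_direct, has_caps, inline in product(
        (False, True), (False, True), inline_states
    ):
        cases.add((pick_io_path(o_direct, has_caps), inline))
    return len(cases)


if __name__ == "__main__":
    print("without inline_data:", count_code_paths(False))
    print("with inline_data:   ", count_code_paths(True))
```

In this model each of the three paths gains an inline variant, which is the doubling referred to above, and that is before mmap and copy_file_range are considered.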

We'd also need to add similar handling for mmap'ed I/O and for things
like copy_file_range.

But, even before that...I have some real concerns about the existing
handling, even with a single client.

While I haven't attempted to roll a testcase for it, I think we can
probably hit races where multiple tasks handling write page faults can
compete to uninline the data, potentially clobbering the others' writes.
Again, this is non-trivial to fix.
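
The kind of race I have in mind can be sketched with a toy Python model. This is an assumption-laden illustration, not the kernel code and not a tested reproducer: the Inode class, uninline() and the byte values are all made up. The racy case is replayed as a fixed interleaving so the outcome is deterministic; the locked variant shows one possible shape of a fix (re-checking the inline state under a per-inode lock), which says nothing about how hard that is to do in the real page-fault path.

```python
import threading


class Inode:
    """Toy inode: data starts inline, later moves to a backing object."""

    def __init__(self, inline: bytes):
        self.inline = inline
        self.is_inline = True
        self.object = bytearray()      # stand-in for the data object
        self.lock = threading.Lock()   # hypothetical per-inode lock


def uninline(inode: Inode) -> None:
    # Copy the inline data out, overwriting whatever the object holds.
    inode.object = bytearray(inode.inline)
    inode.is_inline = False


def write(inode: Inode, offset: int, data: bytes) -> None:
    end = offset + len(data)
    if len(inode.object) < end:
        inode.object.extend(b"\0" * (end - len(inode.object)))
    inode.object[offset:end] = data


def racy_interleaving() -> bytes:
    """Replay: both tasks see 'inline' before either has uninlined."""
    inode = Inode(b"aaaa")
    a_saw_inline = inode.is_inline   # task A checks
    b_saw_inline = inode.is_inline   # task B checks, A not done yet
    if a_saw_inline:
        uninline(inode)
    write(inode, 0, b"AA")           # object is now b"AAaa"
    if b_saw_inline:
        uninline(inode)              # stale copy wipes out A's write
    write(inode, 2, b"BB")
    return bytes(inode.object)


def locked_write(inode: Inode, offset: int, data: bytes) -> None:
    with inode.lock:
        if inode.is_inline:          # check and uninline atomically
            uninline(inode)
        write(inode, offset, data)


def locked_run() -> bytes:
    inode = Inode(b"aaaa")
    tasks = [
        threading.Thread(target=locked_write, args=(inode, 0, b"AA")),
        threading.Thread(target=locked_write, args=(inode, 2, b"BB")),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return bytes(inode.object)


if __name__ == "__main__":
    print("racy:  ", racy_interleaving())
    print("locked:", locked_run())
```

In the racy replay task A's bytes are lost because task B uninlines from the stale inline copy; with the check and the uninline done under one lock, both writes survive whichever task runs first.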

In summary, I don't see a real future for this feature unless someone
wants to step up to own it and commit to fixing these problems.


> On 16/08/2019 13.15, Jeff Layton wrote:
> > A couple of weeks ago, I sent a request to the mailing list asking
> > whether anyone was using the inline_data support in cephfs:
> > 
> >     https://docs.ceph.com/docs/mimic/cephfs/experimental-features/#inline-data
> > 
> > I got exactly zero responses, so I'm going to formally propose that we
> > move to start deprecating this feature for Octopus.
> > 
> > Why deprecate this feature?
> > ===========================
> > While the userland clients have support for both reading and writing,
> > the kernel only has support for reading, and aggressively uninlines
> > everything as soon as it needs to do any writing. That uninlining has
> > some rather nasty potential race conditions too that could cause data
> > corruption.
> > 
> > We could work to fix this, and maybe add write support for the kernel,
> > but it adds a lot of complexity to the read and write codepaths in the
> > clients, which are already pretty complex. Given that there isn't a lot
> > of interest in this feature, I think we ought to just pull the plug on
> > it.
> > 
> > How should we do this?
> > ======================
> > We should start by disabling this feature in master for Octopus. 
> > 
> > In particular, we should stop allowing users to call "fs set inline_data
> > true" on filesystems where it's disabled, and maybe throw a loud warning
> > about the feature being deprecated if the mds is started on a filesystem
> > that has it enabled.
> > 
> > We could also consider creating a utility to crawl an existing
> > filesystem and uninline anything there, if there was need for it.
> > 
> > Then, in a few release cycles, once we're past the point where someone
> > can upgrade directly from Nautilus (release Q or R?) we'd rip out
> > support for this feature entirely.
> > 
> > Thoughts, comments, questions welcome.
> > 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


