On Tue, Mar 12, 2019 at 8:56 PM David C <dcsysengineer@xxxxxxxxx> wrote:
>
> Out of curiosity, are you guys re-exporting the fs to clients over something like NFS or running applications directly on the OSD nodes?

Kernel NFS + kernel CephFS can fall apart and deadlock itself in exciting ways... nfs-ganesha is so much better.

Paul

>
> On Tue, 12 Mar 2019, 18:28 Paul Emmerich, <paul.emmerich@xxxxxxxx> wrote:
>>
>> Mounting kernel CephFS on an OSD node works fine with recent kernels (4.14+) and enough RAM in the servers.
>>
>> We did encounter problems with older kernels, though.
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin <hector@xxxxxxxxxxxxxx> wrote:
>> >
>> > It's worth noting that most containerized deployments can effectively limit RAM for containers (cgroups), and the kernel has limits on how many dirty pages it can keep around.
>> >
>> > In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20% of your total RAM can be dirty FS pages. If you set up your containers such that their cumulative memory usage is capped below, say, 70% of RAM, then this might effectively guarantee that you will never hit this issue.
>> >
>> > On 08/03/2019 02:17, Tony Lill wrote:
>> > > AFAIR the issue is that under memory pressure, the kernel will ask cephfs to flush pages, but this in turn causes the OSD (MDS?) to require more memory to complete the flush (for network buffers, etc.). As long as cephfs and the OSDs are feeding from the same kernel mempool, you are susceptible. Containers don't protect you, but a full VM, such as Xen or KVM, would.
>> > >
>> > > So if you don't hit a low-memory situation, you will not see the deadlock, and you can run like this for years without a problem. I have. But you are most likely to run out of memory during recovery, so this could compound your problems.
>> > >
>> > > On 3/7/19 3:56 AM, Marc Roos wrote:
>> > >>
>> > >> A container uses the same kernel; the problem is with processes sharing that kernel.
>> > >>
>> > >> -----Original Message-----
>> > >> From: Daniele Riccucci [mailto:devster@xxxxxxxxxx]
>> > >> Sent: 07 March 2019 00:18
>> > >> To: ceph-users@xxxxxxxxxxxxxx
>> > >> Subject: Re: mount cephfs on ceph servers
>> > >>
>> > >> Hello,
>> > >> is the deadlock risk still an issue in containerized deployments? For example, with OSD daemons in containers and the filesystem mounted on the host machine?
>> > >> Thank you.
>> > >>
>> > >> Daniele
>> > >>
>> > >> On 06/03/19 16:40, Jake Grimmett wrote:
>> > >>> Just to add a "+1" to this datapoint, based on one month's usage on Mimic 13.2.4: essentially, "it works great for us".
>> > >>>
>> > >>> Prior to this, we had issues with the kernel driver on 12.2.2. This could have been due to limited RAM on the OSD nodes (128GB / 45 OSDs) and an older kernel.
>> > >>>
>> > >>> Upgrading the RAM to 256GB and using a RHEL 7.6-derived kernel has allowed us to reliably use the kernel driver.
>> > >>>
>> > >>> We keep 30 snapshots (one per day), have one active metadata server, and change several TB daily - it's much, *much* faster than with FUSE.
>> > >>>
>> > >>> The cluster has 10 OSD nodes, currently storing 2PB, using EC 8:2 coding.
>> > >>>
>> > >>> ta ta
>> > >>>
>> > >>> Jake
>> > >>>
>> > >>> On 3/6/19 11:10 AM, Hector Martin wrote:
>> > >>>> On 06/03/2019 12:07, Zhenshi Zhou wrote:
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> I'm going to mount CephFS from my Ceph servers for some reason, including the monitors, metadata servers and OSD servers. I know it's not a best practice, but what is the exact potential danger if I mount CephFS on its own servers?
>> > >>>>
>> > >>>> As a datapoint, I have been doing this on two machines (single-host Ceph clusters) for months with no ill effects. The FUSE client performs a lot worse than the kernel client, so I switched to the latter, and it's been working well with no deadlocks.
>> > >>>
>> >
>> > --
>> > Hector Martin (hector@xxxxxxxxxxxxxx)
>> > Public Key: https://mrcn.st/pub
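
For anyone who wants to sanity-check Hector's dirty_ratio / container-cap arithmetic on their own nodes, a rough sketch follows. It only reads MemTotal from /proc/meminfo and /proc/sys/vm/dirty_ratio; the 70% cumulative container cap is just the example figure from the discussion above, not a recommendation, and the check ignores swap, cgroup accounting details and any vm.dirty_bytes override - treat it as a back-of-the-envelope estimate, not a guarantee.

#!/usr/bin/env python3
"""Back-of-the-envelope headroom check for co-locating kernel CephFS
clients with containerized Ceph daemons. Linux only; the 70% container
cap is an assumed example figure, not a recommendation."""


def read_mem_total_kib():
    # MemTotal in /proc/meminfo is reported in kiB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in /proc/meminfo")


def read_dirty_ratio():
    # Percentage of RAM that may be filled with dirty page cache (default 20).
    with open("/proc/sys/vm/dirty_ratio") as f:
        return int(f.read().strip())


def headroom_report(container_cap_fraction=0.70):
    # container_cap_fraction: assumed cumulative cgroup memory limit for
    # all Ceph daemon containers, as a fraction of total RAM.
    total_kib = read_mem_total_kib()
    dirty_ratio = read_dirty_ratio()

    dirty_cap_kib = total_kib * dirty_ratio // 100
    container_cap_kib = int(total_kib * container_cap_fraction)
    headroom_kib = total_kib - dirty_cap_kib - container_cap_kib

    print("Total RAM:            %6d MiB" % (total_kib // 1024))
    print("vm.dirty_ratio:       %d%% -> up to %d MiB of dirty pages"
          % (dirty_ratio, dirty_cap_kib // 1024))
    print("Container memory cap: %6d MiB (%.0f%% of RAM, assumed)"
          % (container_cap_kib // 1024, container_cap_fraction * 100))
    if headroom_kib < 0:
        print("WARNING: dirty-page ceiling plus container cap exceeds RAM;")
        print("a flush under memory pressure could get this node into trouble.")
    else:
        print("Remaining headroom:   %6d MiB" % (headroom_kib // 1024))


if __name__ == "__main__":
    headroom_report()

On a 256GB node with the default dirty_ratio of 20 and the containers capped at 70% of RAM, that leaves roughly 10% of RAM (about 25GB) of slack for everything else.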