Re: Ceph-Fuse and mount namespaces

On 28.02.2018 at 15:18, David Turner wrote:
> If you run your container in privileged mode you can mount ceph-fuse inside of the containers instead of from the shared resource on the host. I used a configuration like this to run multi-tenancy speed tests of CephFS using ceph-fuse. The more mount points I used (1 per container), the more bandwidth I was able to utilize into Ceph. It was about 15 clients before any of the clients were slower than a single client, which is to say that my bandwidth usage went up linearly for the first 15 containers. After that it started slowing down per container, but I eventually got to the point around 70 containers before I maxed out the NIC on the container host.
> 
> Anyway. All of that is meant to say that I would recommend against 1 fuse mount point on the host for the containers, and to add the logic to mount the FS inside of the containers. Possibly check out the kernel mount options instead to prevent the need for privileged containers.

Thanks for sharing these results!
In our case, this will likely not make any difference, since the limiting factor is either the NICs of the OSD hosts (large files) or the SSDs holding the CephFS metadata.

The main issues I see with performing the ceph-fuse mount inside the container would be:
- Memory increase. I have seen clients using up to 400 MB each; with 28 jobs per host, that adds up to roughly 11 GB per node.
  Also, if users process a common set of files, won't the data be cached separately in each client (and the total number of caps held increase)?
- We would multiply our number of clients by a factor of 28, so we would end up with about 1100 clients. This would probably also increase the load on the MDS,
  since many more clients would be asking for caps and holding sessions (see the quick check sketched below).
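
A quick way to keep an eye on the session count would be to ask the active MDS directly, e.g. something like this (run on the MDS host; the daemon name "mds.mds1" is a placeholder for our actual MDS, and jq could be replaced by any other JSON-aware tool):

  ceph daemon mds.mds1 session ls | jq length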

So I'm unsure it's a good idea in our case (the network already saturates with just the single per-host clients running), but it's certainly worth looking into.
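
If we do try per-container mounts, the kernel client route David mentions would look roughly like this inside the container (the monitor addresses and the secretfile path are placeholders for our setup, so take it as a sketch only):

  mount -t ceph mon1.example.com:6789,mon2.example.com:6789:/ /cephfs \
        -o name=cephfs_baf,secretfile=/etc/ceph/cephfs_baf.secret,rw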

Cheers and thanks,
	Oliver

> 
> On Tue, Feb 27, 2018, 6:27 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx <mailto:freyermuth@xxxxxxxxxxxxxxxxxx>> wrote:
> 
>     Dear Cephalopodians,
> 
>     continuing a bit on the point raised in the other thread ("CephFS very unstable with many small files")
>     concerning the potentially unexpected behaviour of the ceph-fuse client with regard to mount namespaces, I did a first small experiment.
> 
>     First off: I did not see any bad behaviour which can be traced back to this directly, but maybe it is still worthwhile
>     to share the information.
> 
>     Here's what I did.
> 
>     1) Initially, cephfs is mounted fine:
>     [root@wn001 ~]# ps faux | grep ceph
>     root        1908 31.4  0.1 1485376 201392 ?      Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
> 
>     2) Now, I fire off a container as normal user:
>     $ singularity exec -B /cvmfs -B /cephfs /cvmfs/some_container_repository/singularity/SL6/default/1519725973/ bash
>     Welcome inside the SL6 container.
>     Singularity> ls /cephfs
>     benchmark  dd_test_rd.sh  dd_test.sh  grid  kern  port  user
>     Singularity> cd /cephfs
> 
>     All is fine and as expected. Singularity is one of many container runtimes; you may also use Charliecloud (more lightweight,
>     and good for learning from the code how things work) or runc (the reference implementation of OCI).
>     The following may also work with a clever arrangement of "unshare" calls (see e.g. https://sft.its.cern.ch/jira/projects/CVM/issues/CVM-1478 ).
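> 
>     A minimal reproduction of this without any container runtime might look like the following (an untested sketch, using util-linux's unshare):
>     # on the host, create a new private mount namespace with a shell inside it:
>     [root@wn001 ~]# unshare --mount --propagation private bash
>     # from a second shell on the host:
>     [root@wn001 ~]# umount /cephfs
>     # the shell inside the new namespace should still see the mount (and keep the fuse helper pinned):
>     [root@wn001 ~]# ls /cephfs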
> 
>     3) Now the experiment starts. On the host:
>     [root@wn001 ~]# umount /cephfs/
>     [root@wn001 ~]# ps faux | grep ceph
>     root        1908 31.4  0.1 1485376 201392 ?      Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
>     [root@wn001 ~]# ls /cephfs/
>     [root@wn001 ~]#
> 
>     => CephFS is unmounted, but the fuse helper keeps running!
>     The reason: it is still in use within the mount namespace inside the container.
>     But no file handle is visible in the host namespace, which is why the umount succeeds and returns.
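> 
>     One way to verify where the mount still lives (a rough check; <container-shell-pid> stands for the PID of the shell running inside the container):
>     [root@wn001 ~]# grep fuse.ceph-fuse /proc/self/mountinfo                      # empty after the umount
>     [root@wn001 ~]# grep fuse.ceph-fuse /proc/<container-shell-pid>/mountinfo     # still lists /cephfs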
> 
>     4) Now, in the container:
>     Singularity> ls
>     benchmark  dd_test_rd.sh  dd_test.sh  grid  kern  port  user
> 
>     I can also write and read just fine.
> 
>     5) Now the ugly part begins. On the host:
>     [root@wn001 ~]# mount /cephfs
>     2018-02-28 00:07:43.431425 7efddc61e040 -1 asok(0x5571340ae1c0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.cephfs_baf.asok': (17) File exists
>     2018-02-28 00:07:43.434597 7efddc61e040 -1 init, newargv = 0x5571340abb20 newargc=11
>     ceph-fuse[98703]: starting ceph client
>     ceph-fuse[98703]: starting fuse
>     [root@wn001 ~]# ps faux | grep ceph
>     root        1908 31.4  0.1 1485376 201392 ?      Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
>     root       98703  1.0  0.0 400268  9456 pts/2    Sl   00:07   0:00 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
> 
>     As you can see:
>     - Name collision for admin socket, since the helper is already running.
>     - A second helper for the same mountpoint was fired up!
>     - Of course, now cephfs is accessible on the host again.
>     - On a side-note, once I exit the container (and hence close the mount namespace), the "old" helper is finally freed.
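> 
>     If the admin socket collision is the main nuisance here, a per-PID socket path in the clients' ceph.conf should at least avoid that part (just a thought, untested in this setup):
>     [client]
>         admin socket = /var/run/ceph/$cluster-$name.$pid.asok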
> 
>     Hence, I am unsure what exactly happens during the internal "remount" when the ceph-fuse helper remounts the FS to make the kernel drop all internal caches.
> 
>     Since my kernel and FUSE experience is very limited, let me recollect what other FUSE filesystems do:
>     - sshfs does the same, i.e. one helper in the host namespace and one in the container namespace. But it does not have problems with e.g. an admin socket.
>     - CVMFS ( http://cvmfs.readthedocs.io/en/stable/ ) errors out in step (5), i.e. the admin cannot remount on the host anymore.
>       This is nasty, especially when combined with autofs and when the containers themselves are placed on CVMFS, which is why I opened https://sft.its.cern.ch/jira/projects/CVM/issues/CVM-1478 with them.
>       They need to enforce a single helper to prevent corruption (even though it's a network FS, it does heavy local on-disk caching).
>     - ntfs-3g has the only correct behaviour, IMHO.
>       I don't know how they pull it off, but when you are in (5) and issue "mount" on the host, no new fuse helper is started - instead, the existing fuse helper takes care of both the mount in the host namespace
>       and the mount in the container namespace.
>       They also need to do this to prevent corruption, since it's not a network FS.
> 
>     I'm unsure whether this is really a problem, and I did not yet clearly see it break anything with Ceph. In any case, I hope the information
>     is worthwhile and may trigger some ideas.
> 
>     Cheers,
>             Oliver
> 
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
