Dear Cephalopodians,

continuing a bit on the point raised in the other thread ("CephFS very unstable with many small files") concerning the potentially unexpected behaviour of the ceph-fuse client with regard to mount namespaces, I did a first small experiment.

First off: I did not see any bad behaviour which can be traced back to this directly, but maybe it is still worthwhile to share the information.

Here's what I did.

1) Initially, CephFS is mounted fine:

[root@wn001 ~]# ps faux | grep ceph
root       1908 31.4  0.1 1485376 201392 ?   Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw

2) Now, I fire off a container as a normal user:

$ singularity exec -B /cvmfs -B /cephfs /cvmfs/some_container_repository/singularity/SL6/default/1519725973/ bash
Welcome inside the SL6 container.
Singularity> ls /cephfs
benchmark  dd_test_rd.sh  dd_test.sh  grid  kern  port  user
Singularity> cd /cephfs

All is fine and as expected. Singularity is only one of many container runtimes; you could also use charliecloud (more lightweight, and good for learning from the code how things work) or runc (the reference implementation of OCI). The same situation may also be reproducible with a clever arrangement of plain "unshare" calls (see e.g. https://sft.its.cern.ch/jira/projects/CVM/issues/CVM-1478 , and the sketch in the PS at the end of this mail).

3) Now the experiment starts. On the host:

[root@wn001 ~]# umount /cephfs/
[root@wn001 ~]# ps faux | grep ceph
root       1908 31.4  0.1 1485376 201392 ?   Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
[root@wn001 ~]# ls /cephfs/
[root@wn001 ~]#

=> CephFS is unmounted, but the fuse helper keeps running! The reason: it is still in use within the mount namespace of the container. Since no file handle is visible in the host namespace, the umount succeeds and returns.

4) Now, in the container:

Singularity> ls
benchmark  dd_test_rd.sh  dd_test.sh  grid  kern  port  user

I can also write and read just fine.

5) Now the ugly part begins. On the host:

[root@wn001 ~]# mount /cephfs
2018-02-28 00:07:43.431425 7efddc61e040 -1 asok(0x5571340ae1c0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.cephfs_baf.asok': (17) File exists
2018-02-28 00:07:43.434597 7efddc61e040 -1 init, newargv = 0x5571340abb20 newargc=11
ceph-fuse[98703]: starting ceph client
ceph-fuse[98703]: starting fuse
[root@wn001 ~]# ps faux | grep ceph
root       1908 31.4  0.1 1485376 201392 ?   Sl   Feb25 983:26 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw
root      98703  1.0  0.0  400268   9456 pts/2 Sl  00:07   0:00 ceph-fuse --id=cephfs_baf --client_mountpoint=/ /cephfs -o rw

As you can see:
- There is a name collision for the admin socket, since the first helper is still running (see the config sketch a bit further down).
- A second helper for the same mountpoint was started!
- Of course, CephFS is now accessible on the host again.
- As a side note, once I exit the container (and hence close the mount namespace), the "old" helper is finally freed.

Hence, I am unsure what exactly happens during the internal "remount" which the ceph-fuse helper performs to make the kernel drop all internal caches.
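By the way, to convince oneself that after step (5) there really are two separate mount namespaces, each served by its own FUSE connection, something like the following should work on the host (a rough, untested sketch; <container-shell-pid> is a placeholder, take the real PID of the shell inside the container from ps):

[root@wn001 ~]# readlink /proc/$$/ns/mnt /proc/<container-shell-pid>/ns/mnt
=> Two different mnt:[...] values, i.e. two different mount namespaces.

[root@wn001 ~]# grep /cephfs /proc/$$/mountinfo
[root@wn001 ~]# nsenter --target <container-shell-pid> --mount -- grep /cephfs /proc/self/mountinfo
=> The anonymous device number of the /cephfs entry (the "0:NN" field) differs between the two namespaces, i.e. each mount is backed by its own FUSE connection and hence its own helper.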
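Coming back to the admin socket collision from (5): as far as I understand the Ceph defaults, both helpers resolve the same admin_socket path, since the default only contains the cluster and client name. Something like the following in ceph.conf on the client should (untested, just a sketch) make the path unique per process via the $pid metavariable:

[client.cephfs_baf]
    # default is effectively /var/run/ceph/$cluster-$name.asok, which is the
    # same for every ceph-fuse instance started with --id=cephfs_baf;
    # adding $pid should avoid the bind collision seen in step (5)
    admin socket = /var/run/ceph/$cluster-$name.$pid.asok

Of course this would only silence the collision; it does not change the fact that two helpers end up serving the same mountpoint.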
Since my kernel and FUSE experience is very limited, let me recollect what other FUSE file systems do:

- sshfs behaves the same, i.e. one helper in the host namespace and one helper in the container namespace. But it does not have problems with e.g. an admin socket.
- CVMFS ( http://cvmfs.readthedocs.io/en/stable/ ) errors out in step (5), i.e. the admin can not remount on the host anymore. This is nasty, especially when autofs is combined with containers placed on CVMFS, which is why I opened https://sft.its.cern.ch/jira/projects/CVM/issues/CVM-1478 with them. They need to enforce a single helper to prevent corruption (even though it is a network FS, they have heavy local on-disk caching).
- ntfs-3g has the only correct behaviour, IMHO. I don't know how they pull it off, but when you are in step (5) and issue "mount" on the host, no new fuse helper is started; instead, the existing fuse helper takes care of both the mount in the host namespace and the mount in the container namespace. They also need to do this to prevent corruption, since it is not a network FS.

I am unsure whether this is really a problem, and I did not yet clearly see it actually break anything with Ceph. In any case, I hope the information is worthwhile and may trigger some ideas.

Cheers,
	Oliver
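PS: Regarding the remark about "unshare" in step (2): I believe the same situation can be reproduced without any container runtime, roughly like this (untested sketch; newer unshare versions may already default to private propagation):

[root@wn001 ~]# unshare -m bash            # new mount namespace, /cephfs still mounted in it
[root@wn001 ~]# mount --make-rprivate /    # keep host umounts from propagating into it

...and then, from a second terminal on the host:

[root@wn001 ~]# umount /cephfs

=> Just like with the container, the ceph-fuse helper should keep running as long as the unshare'd shell keeps the old mount pinned in its namespace.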