This is the latest default kernel with CentOS7. We also tried a newer
kernel (from elrepo), a 4.4 that has the same problem, so I don't think
that is it. Thank you for the suggestion though.

We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
all of the issues. It's possible that a related issue is actually
permissions. Something may not be right with our config (or a bug) here.

While testing we noticed that there may actually be two issues here. I am
unsure, as we noticed that the most consistent way to reproduce our issue
is to use vim or sed -i which does in place renames:

[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:50 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 300 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root
[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied

Strangely, adding or deleting files works fine, it's only renaming that
fails. And strangely I was able to successfully edit the file on ftp02:

[root@ftp02 cron]# sed -i 's/^/#/' file
[root@ftp02 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:49 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 313 Jun 16 15:49 file
-rw------- 1 root root 2044 Jun 16 13:47 root

Then it worked on ftp01 this time:

[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2357 Jun 16 15:49 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 313 Jun 16 15:49 file
-rw------- 1 root root 2044 Jun 16 13:47 root

Then, I vim'd it successfully on ftp01... Then ran the sed again:

[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
[root@ftp01 cron]# ls -la
total 3
drwx------ 1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 300 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root

And now we have the zero file problem again:

[root@ftp02 cron]# ls -la
total 2
drwx------ 1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root 104 May 19 09:34 ..
-rw-r--r-- 1 root root 0 Jun 16 15:50 file
-rw------- 1 root root 2044 Jun 16 13:47 root

Anyway, I wonder how much of this issue is related to that cannot rename
issue above. Here are our security settings:

client.ftp01
        key: <redacted>
        caps: [mds] allow r, allow rw path=/ftp
        caps: [mon] allow r
        caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
client.ftp02
        key: <redacted>
        caps: [mds] allow r, allow rw path=/ftp
        caps: [mon] allow r
        caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data

/ftp is the directory on cephfs under which cron lives; the full path is
/ftp/cron.

I hope this helps and thank you for your time!

Jason

On 6/15/16, 4:43 PM, "John Spray" <jspray@xxxxxxxxxx> wrote:

>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress <jgress@xxxxxxxxxxxxx>
>wrote:
>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>> reproducible bug that is unfortunately pretty serious, as it leads to
>> data loss. Here is the situation:
>>
>> We have two systems, named ftp01 and ftp02. They are both running
>> CentOS 7.2, with this kernel release and ceph packages:
>>
>> kernel-3.10.0-327.18.2.el7.x86_64
>
>That is an old-ish kernel to be using with cephfs. It may well be the
>source of your issues.
>
>> [root@ftp01 cron]# rpm -qa | grep ceph
>> ceph-base-10.2.1-0.el7.x86_64
>> ceph-deploy-1.5.33-0.noarch
>> ceph-mon-10.2.1-0.el7.x86_64
>> libcephfs1-10.2.1-0.el7.x86_64
>> ceph-selinux-10.2.1-0.el7.x86_64
>> ceph-mds-10.2.1-0.el7.x86_64
>> ceph-common-10.2.1-0.el7.x86_64
>> ceph-10.2.1-0.el7.x86_64
>> python-cephfs-10.2.1-0.el7.x86_64
>> ceph-osd-10.2.1-0.el7.x86_64
>>
>> Mounted like so:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
>> And:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp02,secretfile=/etc/ceph/ftp02.secret 0 0
>>
>> This filesystem has 234GB worth of data on it, and I created another
>> subdirectory and mounted it, NFS style.
>>
>> Here were the steps to reproduce:
>>
>> First, I created a file (I was mounting /var/spool/cron on two systems)
>> on ftp01:
>> (crond is not running right now on either system to keep the variables
>> down)
>>
>> [root@ftp01 cron]# cp /tmp/root .
>>
>> Shows up on both fine:
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2043 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae root
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------ 1 root root 2043 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2043 Jun 15 15:50 root
>> [root@ftp02 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae root
>>
>> Now, I vim the file on one of them:
>> [root@ftp01 cron]# vim root
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> [root@ftp02 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> So far so good, right?
>> Then, a few seconds later:
>>
>> [root@ftp02 cron]# ls -la
>> total 0
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 0 Jun 15 15:50 root
>> [root@ftp02 cron]# cat root
>> [root@ftp02 cron]# md5sum root
>> d41d8cd98f00b204e9800998ecf8427e root
>>
>> And on ftp01:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917 root
>>
>> I later create a 'root2' on ftp02 and cause a similar issue. The end
>> results are two non-matching files:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 2044 Jun 15 15:50 root
>> -rw-r--r-- 1 root root 0 Jun 15 15:53 root2
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------ 1 root root 0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw------- 1 root root 0 Jun 15 15:50 root
>> -rw-r--r-- 1 root root 1503 Jun 15 15:53 root2
>>
>> We were able to reproduce this on two other systems with the same cephfs
>> filesystem. I have also seen cases where the file would just blank out
>> on both as well.
>>
>> We could not reproduce it with our dev/test cluster running the
>> development ceph version:
>>
>> ceph-10.2.2-1.g502540f.el7.x86_64
>
>Strange. In that cluster, was the same 3.x kernel in use? There
>aren't a whole lot of changes on the server side in v10.2.2 that I
>could imagine affecting this case.
>
>The best thing to do right now is to try using ceph-fuse in your
>production environment, to check that it is not exhibiting the same
>behaviour as the old kernel client.
>Once you confirm that, I would
>recommend upgrading your kernel to the most recent 4.x that you are
>comfortable with, and confirm that that also does not exhibit the bad
>behaviour.
>
>John
>
>> Is this a known bug with the current production Jewel release? If so,
>> will it be patched in the next release?
>>
>> Thank you very much,
>>
>> Jason Gress
>>
>> "This message and any attachments may contain confidential information.
>> If you have received this message in error, any use or distribution is
>> prohibited. Please notify us by reply e-mail if you have mistakenly
>> received this message, and immediately and permanently delete it and any
>> attachments. Thank you."
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
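
For anyone who wants to exercise the rename path without sed or vim: the failing step in the transcripts at the top of this message is the rename of a temporary file over the original ("sed: cannot rename ./sedfB2CkO"). Below is a minimal sketch of that write-then-rename pattern, assuming Python 3 is available on the clients. The names replace_in_place and comment_out are hypothetical helpers made up for illustration; this approximates what sed -i does and is not its actual implementation.

```python
import os
import tempfile


def replace_in_place(path, transform):
    """Rewrite `path` by writing a temp file next to it and renaming it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r") as src:
        lines = src.readlines()

    # Create the temp file in the same directory so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(prefix="sed", dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(transform(line) for line in lines)
            tmp.flush()
            os.fsync(tmp.fileno())
        # mkstemp creates the file with mode 0600, so copy the original permission bits.
        os.chmod(tmp_path, os.stat(path).st_mode & 0o7777)
        # This is the step that corresponds to the "cannot rename" error above.
        os.rename(tmp_path, path)
    except OSError:
        # Do not leave the temp file behind if any step fails.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def comment_out(line):
    """Equivalent of the sed expression s/^/#/ for a single line."""
    return "#" + line
```

Calling replace_in_place("file", comment_out) from inside the mounted directory should raise PermissionError at the os.rename step if the client hits the same failure, which separates the rename problem from the empty-file problem.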