Re: CephFS Bug found with CentOS 7.2

This is the latest default kernel for CentOS 7.  We also tried a newer 4.4
kernel (from elrepo), and it has the same problem, so I don't think the
kernel is the cause.  Thank you for the suggestion though.

We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
all of the issues.  It's possible that part of the problem is actually
permissions: either something is not right with our config, or this is a bug.

While testing, we noticed that there may actually be two issues here,
though I am unsure.  The most consistent way to reproduce our issue is to
use vim or sed -i, which edit in place by renaming a temporary file over
the original:

[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:50 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  300 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root
[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
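
A minimal sketch, assuming POSIX sh with GNU coreutils, of the
create/write/rename sequence that sed -i goes through.  Running the steps
one at a time shows which one is refused.  The function name rename_probe
and its messages are hypothetical, made up for this illustration:

```shell
# Hypothetical helper: replays the steps of "sed -i 's/^/#/' FILE"
# separately, so the failing step can be identified.
# Usage: rename_probe DIR FILE
rename_probe() {
    dir=$1
    file=$2
    # Step 1: create a temporary file in the same directory as the target.
    tmp=$(mktemp "$dir/probeXXXXXX") || { echo "create failed"; return 1; }
    # Step 2: write the edited contents into the temporary file.
    sed 's/^/#/' "$dir/$file" > "$tmp" || { echo "write failed"; rm -f "$tmp"; return 2; }
    # Step 3: rename the temporary file over the original.
    mv -f "$tmp" "$dir/$file" || { echo "rename failed"; rm -f "$tmp"; return 3; }
    echo "ok"
}
```

For example, "rename_probe /var/spool/cron file" on each client would print
"ok" or name the step that failed.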


Strangely, adding or deleting files works fine; only renaming fails.  Just
as strangely, I was able to successfully edit the file on ftp02:

[root@ftp02 cron]# sed -i 's/^/#/' file
[root@ftp02 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:49 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  313 Jun 16 15:49 file
-rw-------   1 root root 2044 Jun 16 13:47 root


Then it worked on ftp01 this time:
[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2357 Jun 16 15:49 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  313 Jun 16 15:49 file
-rw-------   1 root root 2044 Jun 16 13:47 root


Then, I vim'd it successfully on ftp01... Then ran the sed again:

[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  300 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root


And now we have the zero-length file problem again, this time on ftp02:

[root@ftp02 cron]# ls -la
total 2
drwx------   1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root    0 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root
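
For comparing the two clients, a small check along these lines can be run
on each mount.  It is only a sketch, assuming md5sum and wc are available;
the function name check_file is hypothetical.  It prints the size and md5
of a file and returns non-zero when the file is empty:

```shell
# Hypothetical helper: prints "<size> <md5>" for a file and returns
# non-zero if the file is zero-length (the symptom shown above).
# Usage: check_file PATH
check_file() {
    f=$1
    # Size in bytes; tr strips any padding that some wc builds emit.
    size=$(wc -c < "$f" | tr -d ' ')
    # First field of md5sum output is the checksum itself.
    sum=$(md5sum "$f" | cut -d' ' -f1)
    echo "$size $sum"
    [ "$size" -gt 0 ]
}
```

An empty file reports the checksum d41d8cd98f00b204e9800998ecf8427e, the
same value md5sum printed for the blanked-out file in the original report.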


Anyway, I wonder how much of this issue is related to the "cannot rename"
error above.  Here are our security settings:

client.ftp01
	key: <redacted>
	caps: [mds] allow r, allow rw path=/ftp
	caps: [mon] allow r
	caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
client.ftp02
	key: <redacted>
	caps: [mds] allow r, allow rw path=/ftp
	caps: [mon] allow r
	caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data


/ftp is the directory on cephfs under which cron lives; the full path is
/ftp/cron.

I hope this helps and thank you for your time!

Jason

On 6/15/16, 4:43 PM, "John Spray" <jspray@xxxxxxxxxx> wrote:

>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress <jgress@xxxxxxxxxxxxx>
>wrote:
>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>> reproducible bug that is unfortunately pretty serious, as it leads to
>> data loss.  Here is the situation:
>>
>> We have two systems, named ftp01 and ftp02.  They are both running
>> CentOS 7.2, with this kernel release and ceph packages:
>>
>> kernel-3.10.0-327.18.2.el7.x86_64
>
>That is an old-ish kernel to be using with cephfs.  It may well be the
>source of your issues.
>
>> [root@ftp01 cron]# rpm -qa | grep ceph
>> ceph-base-10.2.1-0.el7.x86_64
>> ceph-deploy-1.5.33-0.noarch
>> ceph-mon-10.2.1-0.el7.x86_64
>> libcephfs1-10.2.1-0.el7.x86_64
>> ceph-selinux-10.2.1-0.el7.x86_64
>> ceph-mds-10.2.1-0.el7.x86_64
>> ceph-common-10.2.1-0.el7.x86_64
>> ceph-10.2.1-0.el7.x86_64
>> python-cephfs-10.2.1-0.el7.x86_64
>> ceph-osd-10.2.1-0.el7.x86_64
>>
>> Mounted like so:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
>> And:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph _netdev,relatime,name=ftp02,secretfile=/etc/ceph/ftp02.secret 0 0
>>
>> This filesystem has 234GB worth of data on it, and I created another
>> subdirectory and mounted it, NFS style.
>>
>> Here were the steps to reproduce:
>>
>> First, I created a file (I was mounting /var/spool/cron on two systems)
>> on ftp01:
>> (crond is not running right now on either system to keep the variables
>> down)
>>
>> [root@ftp01 cron]# cp /tmp/root .
>>
>> Shows up on both fine:
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------   1 root root    0 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root 2043 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae  root
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------   1 root root 2043 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root 2043 Jun 15 15:50 root
>> [root@ftp02 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae  root
>>
>> Now, I vim the file on one of them:
>> [root@ftp01 cron]# vim root
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------   1 root root    0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917  root
>>
>> [root@ftp02 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917  root
>>
>> So far so good, right?  Then, a few seconds later:
>>
>> [root@ftp02 cron]# ls -la
>> total 0
>> drwx------   1 root root   0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root 104 May 19 09:34 ..
>> -rw-------   1 root root   0 Jun 15 15:50 root
>> [root@ftp02 cron]# cat root
>> [root@ftp02 cron]# md5sum root
>> d41d8cd98f00b204e9800998ecf8427e  root
>>
>> And on ftp01:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------   1 root root    0 Jun 15 15:51 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root 2044 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 7a0c346bbd2b61c5fe990bb277c00917  root
>>
>> I later create a 'root2' on ftp02 and cause a similar issue.  The end
>> results are two non-matching files:
>>
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx------   1 root root    0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root 2044 Jun 15 15:50 root
>> -rw-r--r--   1 root root    0 Jun 15 15:53 root2
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx------   1 root root    0 Jun 15 15:53 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-------   1 root root    0 Jun 15 15:50 root
>> -rw-r--r--   1 root root 1503 Jun 15 15:53 root2
>>
>> We were able to reproduce this on two other systems with the same cephfs
>> filesystem.  I have also seen cases where the file would just blank out
>> on both as well.
>>
>> We could not reproduce it with our dev/test cluster running the
>> development ceph version:
>>
>> ceph-10.2.2-1.g502540f.el7.x86_64
>
>Strange.  In that cluster, was the same 3.x kernel in use?  There
>aren't a whole lot of changes on the server side in v10.2.2 that I
>could imagine affecting this case.
>
>The best thing to do right now is to try using ceph-fuse in your
>production environment, to check that it is not exhibiting the same
>behaviour as the old kernel client.  Once you confirm that, I would
>recommend upgrading your kernel to the most recent 4.x that you are
>comfortable with, and confirm that that also does not exhibit the bad
>behaviour.
>
>John
>
>> Is this a known bug with the current production Jewel release?  If so,
>> will it be patched in the next release?
>>
>> Thank you very much,
>>
>> Jason Gress
>>
>> "This message and any attachments may contain confidential information.
>> If you have received this message in error, any use or distribution is
>> prohibited.  Please notify us by reply e-mail if you have mistakenly
>> received this message, and immediately and permanently delete it and any
>> attachments. Thank you."
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>