On Tuesday 31 January 2012 wrote Gregory Farnum:
> On Tue, Jan 31, 2012 at 4:00 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > Hi again!
> >
> > We are running Ceph 0.41 and kernel 3.2.2 with current for-linus code
> > (commit 3d882ce47de80e0294a536bec771b5651885b4d3) now.
> >
> > After some heavy workloads we see quite a few directories that cannot be
> > deleted, although ls and find show that they are empty. rmdir says they
> > are not empty.
> >
> > Additionally, ceph reports various weird size values for some, but not
> > all of them:
> > ls -la .tmp/tiny61/.mozilla/firefox/default.yat/
> > insgesamt 0
> > drwxr-xr-x 1 tiny61 users 18446744073705748665 25. Jan 10:02 .
> > drwxr-xr-x 1 tiny61 users 18446744073705748665 25. Jan 10:02 ..
> >
> > Is this a known or a new bug? Can it be related to .snap pseudo dirs? The
> > problem appeared without ever using snapshots, though.
>
> I believe this is new. Based on the odd sizes (that's a 64-bit -1
> interpreted as unsigned, fyi), my guess is that the "recursive
> accounting" statistics are off and that's leading the MDS to believe
> the directory is not empty even though it is. It's unlikely to be
> directly related to snapshots, though it's not impossible.
>
> Have you seen this on more than one MDS? If it's reproducible we could
> more easily figure out the cause; otherwise the best we can do is to
> maybe fix up the specific instance of it.

I had to recreate ceph fs several times today because of kernel problems. Now
I have only one dir that is wrong:

ls -la .tmp/tiny14/.config/pcmanfm/LXDE/
insgesamt 0
drwxr-xr-x 1 32252 users 393  1. Feb 15:19 .
drwxr-xr-x 1 32252 users   0  1. Feb 17:21 ..

This is probably caused by another reboot I had to do, although I think ceph
should have recovered here. Might also be caused by this setting that I tried
for a while, it is off now:

mds standby replay = true

With this setting, if the active mds gets killed, no mds is able to become
active, so everything hangs.
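
A side note on the odd size quoted above: the value shown by ls is close to, but not exactly, 2^64 - 1. The following is a minimal sketch (the helper name as_signed64 is made up for illustration) that reinterprets the reported number as a signed 64-bit integer. It only shows the arithmetic and says nothing about how the MDS stores the value internally.

```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


# Directory size as printed by ls for .tmp/tiny61/.mozilla/firefox/default.yat/
reported = 18446744073705748665
print(as_signed64(reported))  # -3802951
```

So the directory size reads as roughly minus 3.8 MB rather than exactly -1, which would still fit the guess that the recursive accounting went negative.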
Had to reboot again. Found that in mds log, the reported wrong size matches the
dir total:

2012-02-01 17:21:51.306561 4f830b70 mds.0.cache.dir(1000000b055) _fetched badness: got (but i already had) [inode 100000066c9 [2,head] /tiny14/.config/pcmanfm/LXDE.conf auth v4 s=393 n(v0 b393 1=1+0) (iversion lock) cr={4711=0-4194304@1} caps={5313=pAsLsXsFscr/-@1} | caps 0x1d13c600] mode 33188 mtime 2012-01-24 15:55:59.000000
2012-02-01 17:21:51.306646 4f830b70 log [ERR] : loaded dup inode 100000066c9 [2,head] v7 at /tiny14/.config/pcmanfm/LXDE/pcmanfm.conf, but inode 100000066c9.head v4 already exists at /tiny14/.config/pcmanfm/LXDE.conf
2012-02-01 17:21:51.349424 4f830b70 mds.0.cache.dir(100000066ae) mismatch between head items and fnode.fragstat! printing dentries
2012-02-01 17:21:51.349457 4f830b70 mds.0.cache.dir(100000066ae) get_num_head_items() = 2; fnode.fragstat.nfiles=0 fnode.fragstat.nsubdirs=1
2012-02-01 17:21:51.349493 4f830b70 mds.0.cache.dir(100000066ae) [dentry #1/tiny14/.config/pcmanfm/LXDE [2,head] auth (dversion lock) pv=0 v=16 inode=0x1cff3828 | inodepin 0x1b9f4de0]
2012-02-01 17:21:51.349521 4f830b70 mds.0.cache.dir(100000066ae) [dentry #1/tiny14/.config/pcmanfm/LXDE.conf [2,head] auth (dn xlock x=1 by 0x1ab21200) (dversion lock w=1 last_client=5313) pv=17 v=16 ap=2+2 inode=0x1d13c600 | request lock inodepin authpin 0x1b90f064]
2012-02-01 17:21:51.349552 4f830b70 mds.0.cache.dir(100000066ae) mismatch between child accounted_rstats and my rstats!
2012-02-01 17:21:51.349573 4f830b70 mds.0.cache.dir(100000066ae) total of child dentrys: n(v0 rc2012-02-01 15:19:55.517733 b786 3=2+1)
2012-02-01 17:21:51.349591 4f830b70 mds.0.cache.dir(100000066ae) my rstats: n(v3 rc2012-02-01 15:19:55.517733 b393 2=1+1)
2012-02-01 17:21:51.349616 4f830b70 mds.0.cache.dir(100000066ae) [dentry #1/tiny14/.config/pcmanfm/LXDE [2,head] auth (dversion lock) pv=0 v=16 inode=0x1cff3828 | inodepin 0x1b9f4de0] n(v0 rc2012-02-01 15:19:55.517733 b393 2=1+1)
2012-02-01 17:21:51.349643 4f830b70 mds.0.cache.dir(100000066ae) [dentry #1/tiny14/.config/pcmanfm/LXDE.conf [2,head] auth (dn xlock x=1 by 0x1ab21200) (dversion lock w=1 last_client=5313) pv=17 v=16 ap=2+2 inode=0x1d13c600 | request lock inodepin authpin 0x1b90f064] n(v0 b393 1=1+0)

Then killed the active mds, another takes over and suddenly the missing file
appears:

.tmp/tiny14/.config/pcmanfm/LXDE/
insgesamt 1
drwxr-xr-x 1 32252 users 393  1. Feb 15:19 .
drwxr-xr-x 1 32252 users   0  1. Feb 17:21 ..
-rw-r--r-- 1 root  root  393 24. Jan 15:55 pcmanfm.conf

Restarted the original mds, it does not appear in "ceph mds dump", although it
is running at 100% cpu. Same happened with other mds processes after killing
and starting, now I have only one left that is working correctly. Will leave
the cluster in this state now and have another look tomorrow - maybe the
spinning mds processes recover by some miracle.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
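
P.S. To make the rstats mismatch in the log excerpt above easier to read, here is a rough sketch that parses the n(...) summaries and compares them. The field meanings are an assumption based on how the log lines look (b = bytes, then total=files+subdirs); the regular expression and the helper names parse_rstat and combine are made up for this note and are not part of Ceph.

```python
import re

# Matches the "n(...)" summaries from the log. Assumed layout:
# n(v<version> [rc<date> <time>] [b<bytes>] <total>=<files>+<subdirs>)
RSTAT_RE = re.compile(r"n\(v\d+(?: rc\S+ \S+)?(?: b(\d+))? \d+=(\d+)\+(\d+)\)")


def parse_rstat(text):
    """Extract bytes, files and subdirs from one n(...) summary."""
    match = RSTAT_RE.search(text)
    if match is None:
        raise ValueError(f"no rstat summary found in {text!r}")
    rbytes, files, subdirs = match.groups()
    return {"bytes": int(rbytes or 0), "files": int(files), "subdirs": int(subdirs)}


def combine(a, b, sign=1):
    """Add (sign=1) or subtract (sign=-1) two parsed summaries field by field."""
    return {key: a[key] + sign * b[key] for key in a}


# Values copied from the log lines for dir 100000066ae.
lxde_dir = parse_rstat("n(v0 rc2012-02-01 15:19:55.517733 b393 2=1+1)")
lxde_conf = parse_rstat("n(v0 b393 1=1+0)")
parent = parse_rstat("n(v3 rc2012-02-01 15:19:55.517733 b393 2=1+1)")

children_total = combine(lxde_dir, lxde_conf)
difference = combine(children_total, parent, sign=-1)

print(children_total)  # {'bytes': 786, 'files': 2, 'subdirs': 1}
print(difference)      # {'bytes': 393, 'files': 1, 'subdirs': 0}
```

Under that reading, the children add up to the "total of child dentrys" line, and the surplus over "my rstats" is exactly one 393-byte file, the same figures as the LXDE.conf dentry that the [ERR] line reports as a dup inode. That is consistent with the dup inode being the source of the mismatch, but it does not prove it.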