On Thursday 02 February 2012, Gregory Farnum wrote:
> On Wed, Feb 1, 2012 at 9:02 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > ceph should have recovered here. Might also be caused by this setting
> > that I tried for a while, it is off now:
> > mds standby replay = true
> > With this setting, if the active mds gets killed, no mds is able to
> > become active, so everything hangs. Had to reboot again.
>
> Hrm. That setting simply tells the non-active MDSes that they should
> follow the journal of the active MDS(es). They should still go active
> if the MDS they're following fails, although it does slightly
> increase the chances of them running into the same bugs in code and
> dying at the same time.

Well, in this test the following MDS did not become active; it stayed
stuck in replay mode. Since takeover without standby replay is also fast
enough for us, I quickly turned the setting off again. I just wanted to
mention that there might be another bug lurking there.

> > Then killed the active mds, another takes over and suddenly the missing
> > file appears:
> > .tmp/tiny14/.config/pcmanfm/LXDE/
> > total 1
> > drwxr-xr-x 1 32252 users 393 Feb  1 15:19 .
> > drwxr-xr-x 1 32252 users   0 Feb  1 17:21 ..
> > -rw-r--r-- 1 root  root  393 Jan 24 15:55 pcmanfm.conf
>
> So were you able to delete that file once it reappeared?

It got deleted correctly by the nightly cleanup cron job.

> > Restarted the original mds, it does not appear in "ceph mds dump",
> > although it is running at 100% cpu. Same happened with other mds
> > processes after killing and starting, now I have only one left that is
> > working correctly.
>
> Remind me, you do only have one active MDS, correct?
> Did you look at the logs and see what the MDS was doing with 100% cpu?

Four MDS are defined, but with max mds = 1 only one of them is active.

> > Will leave the cluster in this state now and have another look tomorrow -
> > maybe the spinning mds processes recover by some miracle.

After yet another complete reboot it seems to be stable again now.

So far I can say that Ceph FS runs pretty stably under the following
conditions:
- 4 nodes with 8 to 12 CPU cores and 12 GB of RAM each (CPUs and RAM
  mostly unused by Ceph)
- Ceph 0.41
- 3 mon, 4 mds (max mds = 1), 4 osd; all 4 nodes are both server and
  client
- Kernel 3.1.10 with some hand-integrated patches for Ceph fixes (3.2.2
  was unstable, need to check)
- OSD storage on a normal hard drive, ext4 without journal (btrfs
  crashes about once per day)
- OSD journals on a separate SSD
- MDS does not get restarted without a reboot (yes, only a reboot helps)

From our experience, it could be a bit faster when writing into many
different files, which is our major workload. However, the last few
versions have already brought significant improvements in stability and
speed. The main problem we see is MDS takeover and recovery, which has
given us a lot of trouble so far.

Thanks once more for your good work!

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Managing directors: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
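
P.S.: In case it helps anyone who wants to reproduce the setup described
above, here is a rough sketch of the relevant ceph.conf pieces. The host
names, addresses and device paths are only placeholders, not our real
values, and the mon/mds/osd sections for the other three nodes look the
same:

    [global]
            # one active MDS, the other three stay on standby
            max mds = 1

    [mon.0]
            host = node1
            mon addr = 10.0.0.1:6789
            mon data = /srv/ceph/mon.$id

    [mds]
            # standby replay is switched off again, see above
            mds standby replay = false

    [mds.0]
            host = node1

    [osd]
            osd data = /srv/ceph/osd.$id    # ext4 without journal, on the hard drive
            osd journal = /dev/sdb1         # journal partition on the separate SSD

    [osd.0]
            host = node1

The ext4 file systems for the OSD data are created without a journal and
mounted with user xattrs enabled, roughly like this (device and mount
point are again placeholders):

    mkfs.ext4 -O ^has_journal /dev/sda3
    mount -o user_xattr,noatime /dev/sda3 /srv/ceph/osd.0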