Re: some questions about ceph deployment

Thanks for your reply, Sage. I think Ceph is a very good distributed
filesystem and we want to test it in a production environment. Your
reply is very important to us.

2010/9/18 Sage Weil <sage@xxxxxxxxxxxx>
>
> Sorry, I just realized this one slipped through the cracks!
>
> On Sat, 4 Sep 2010, FWDF wrote:
>
> > We use 3 servers to build a test system of ceph, configured as below:
> >
> > Host                 IP
> > client01             192.168.1.10
> > ceph01               192.168.2.50
> > ceph02               192.168.2.51
> >
> > The OS is Ubuntu 10.04 LTS and the version of ceph is v0.21.1
> >
> > ceph.conf:
> > [global]
> >         auth supported = cephx
> >         pid file = /var/run/ceph/$name.pid
> >         debug ms = 0
> >         keyring = /etc/ceph/keyring.bin
> > [mon]
> >         mon data = /mnt/ceph/data/mon$id
> >         debug ms = 1
> > [mon0]
> >         host = ceph01
> >         mon addr = 192.168.2.50:6789
> > [mds]
> >         keyring = /etc/ceph/keyring.$name
> >         debug ms = 1
> > [mds.ceph01]
> >         host = ceph01
> > [mds.ceph02]
> >         host = ceph02
> > [osd]
> >         sudo = true
> >         osd data = /mnt/ceph/osd$id/data
> >         keyring = /etc/ceph/keyring.$name
> >         osd journal = /mnt/ceph/osd$id/data/journal
> >         osd journal size = 100
> > [osd0]
> >         host = ceph01
> > [osd1]
> >         host = ceph01
> > [osd2]
> >         host = ceph01
> > [osd3]
> >         host = ceph01
> > [osd10]
> >         host = ceph02
> >
> > There are 4 HDDs in ceph01 and each HDD has one OSD, named osd0, osd1, osd2, osd3; there is 1 HDD in ceph02, named osd10. All these HDDs are formatted as btrfs and mounted at the mount points listed below:
> >
> > ceph01
> >     /dev/sdc1     /mnt/ceph/osd0/data        btrfs
> >     /dev/sdd1     /mnt/ceph/osd1/data        btrfs
> >     /dev/sde1     /mnt/ceph/osd2/data        btrfs
> >     /dev/sdf1     /mnt/ceph/osd3/data        btrfs
> >
> > ceph02
> >     /dev/sdb1     /mnt/ceph/osd10/data       btrfs
> >
> > Make ceph FileSystem:
> > root@ceph01:~# mkcephfs -c /etc/ceph/ceph.conf -a -k /etc/ceph/keyring.bin
> >
> > Startup ceph:
> > root@ceph01:~# /etc/init.d/ceph -a start
> >
> >         Then
> > root@ceph01:~# ceph -w
> > 10.09.01_17:56:19.337895   mds e17: 1/1/1 up {0=up:active}, 1 up:standby
> > 10.09.01_17:56:19.347184   osd e27: 5 osds: 5 up, 5 in
> > 10.09.01_17:56:19.349447     log ...
> > 10.09.01_17:56:19.373773   mon e1: 1 mons at 192.168.2.50:6789/0
> >
> > The ceph file system is mounted on client01 (192.168.1.10), ceph01 (192.168.2.50), and ceph02 (192.168.2.51) at /data/ceph. It works fine at the beginning: I can use ls, and reading and writing files is ok. After some files are written, I find I can't use ls -l /data/ceph until I umount ceph from ceph02, but one day later the same problem occurred again; then I umounted ceph from ceph01 and everything was ok.
> >
> > Q1:
> > Can the ceph filesystem be mounted on a member of the ceph cluster?
>
> Technically, yes, but you should be very careful doing so.  The problem is
> that when the kernel is low on memory it will force the client to write
> out dirty data so that it can reclaim those pages.  If the writeout
> depends on then waking up some user process (cosd daemon), doing a bunch
> of random work, and writing the data to disk (dirtying yet more memory),
> you can deadlock the system.

We not only mount ceph on a client in the same subnet but also mount
it on a remote client over the internet. In the first week everything
worked fine; the load is about 100 GB of writes and about 10 reads per
day. The files are almost read-only and their sizes range from a few
dozen MB to a few GB, so it is not a very heavy load. But in the second
week the client in the same subnet as the ceph cluster couldn't be
accessed and ceph couldn't be unmounted from it, while the remote
client could still access and unmount ceph.
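
For reference, the clients mount the filesystem with the kernel client,
roughly like this (monitor address from ceph.conf; the cephx key is the
admin key from keyring.bin, shown here only as a placeholder):

# mount -t ceph 192.168.2.50:6789:/ /data/ceph -o name=admin,secret=<admin key>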

Using 'ceph -s' and 'ceph osd dump -0' on ceph01, I found that 3 of the
4 osds were down (osd0, osd2, osd4). Using 'df -h', I could see that
/dev/sde1 (for osd0), /dev/sdd1 (for osd2) and /dev/sdc1 (for osd4)
were still mounted at their mount points.

I used the following command to restart the osds:

# /etc/init.d/ceph start osd0

[/etc/ceph/fetch_config /tmp/fetched.ceph.conf.4967]
=== osd.0 ===
Starting Ceph osd0 on ceph01...
 ** WARNING: Ceph is still under heavy development, and is only suitable for **
 **          testing and review.  Do not trust it with important data.       **
starting osd0 at 0.0.0.0:6800/4864 osd_data /mnt/ceph/osd0/data
/mnt/ceph/osd0/data/journal
…

The 3 osds started and ran normally, but the local ceph client was down.
Does it have anything to do with the osd restart? The local client can
remount ceph after a reboot and work normally. The remote client can
remount ceph and work normally too, but a few days later it can't
access or unmount ceph.

# umount /mnt/ceph
umount: /mnt/ceph: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

There was no response from the lsof or fuser commands; the only thing
we could do was kill the process and reboot the system. We use ceph
v0.21.2 for the cluster and the client, on Ubuntu 10.04 LTS (server),
kernel version 2.6.32-21-generic-pae.
What confuses me is why the client can't access ceph. Even if some osds
were down, that shouldn't affect the client. What is the reason the
client can't access or unmount ceph?
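
Next time it hangs I will try to capture where the client is stuck
before rebooting, for example by dumping the blocked tasks to the
kernel log (just a generic debugging idea, not something we have tried yet):

# echo w > /proc/sysrq-trigger
# dmesg | tail -n 200
        (the 'w' sysrq dumps tasks in uninterruptible sleep with their
         kernel stack traces, so we can see what the hung mount/umount
         is waiting on)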
>
> > When I followed the instructions at http://ceph.newdream.net/wiki/Monitor_cluster_expansion to add a monitor on ceph02, the following error occurred:
> >
> > root@ceph02:~# /etc/init.d/ceph start mon1
> > [/etc/ceph/fetch_config /tmp/fetched.ceph.conf.14210] ceph.conf 100%  2565  2.5KB/s  00:00
> > === mon.1 ===
> > Starting Ceph mon1 on ceph02...
> >  ** WARNING: Ceph is still under heavy development, and is only suitable for **
> >  ** testing and review.  Do not trust it with important data.  **
> > terminate called after throwing an instance of 'std::logic_error'
> >  what():  basic_string::_S_construct NULL not valid
> > Aborted (core dumped)
> > failed: ' /usr/bin/cmon -i 1 -c /tmp/fetched.ceph.conf.14210 '
>
> I haven't seen that crash, but it looks like a std::string constructor is
> being passed a NULL pointer.  Do you have a core dump (to get a
> backtrace)?  Which version are you running (`cmon -v`)?

The cmon version was v0.21.1 when the crash happened; it has since been
updated to v0.21.2. The following backtrace is from v0.21.2:

# gdb cmon core
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/cmon...(no debugging symbols found)...done.

warning: exec file is newer than core file.
[New Thread 17644]
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/tls/i686/cmov/libpthread.so.0...(no
debugging symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libpthread.so.0
Reading symbols from /lib/i686/cmov/libcrypto.so.0.9.8...(no debugging
symbols found)...done.
Loaded symbols for /lib/i686/cmov/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/tls/i686/cmov/libm.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cmon -i 1 -c /tmp/fetched.ceph.conf.17598'.
Program terminated with signal 6, Aborted.
#0  0x001be422 in __kernel_vsyscall ()

(gdb) bt
#0  0x001be422 in __kernel_vsyscall ()
#1  0x00c2d651 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0x00c30a82 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0x0050a52f in __gnu_cxx::__verbose_terminate_handler() () from
/usr/lib/libstdc++.so.6
#4  0x00508465 in ?? () from /usr/lib/libstdc++.so.6
#5  0x005084a2 in std::terminate() () from /usr/lib/libstdc++.so.6
#6  0x005085e1 in __cxa_throw () from /usr/lib/libstdc++.so.6
#7  0x0049f57f in std::__throw_logic_error(char const*) () from
/usr/lib/libstdc++.so.6
#8  0x004e3b82 in ?? () from /usr/lib/libstdc++.so.6
#9  0x004e3da2 in std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(char const*, unsigned int,
std::allocator<char> const&) () from /usr/lib/libstdc++.so.6
#10 0x08088744 in main ()
(gdb)
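
Since this /usr/bin/cmon has no debugging symbols, most of the frames
above are opaque. If a fuller trace would help, I can rebuild cmon with
symbols and rerun gdb on the core, roughly like this (the build commands
and the path to the built binary are my assumption from the source tree):

# ./autogen.sh && ./configure CXXFLAGS="-g -O0" && make
# gdb ./src/cmon core
(gdb) bt full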
>
> > Q2:
> > How to expand a monitor to a running ceph system?
>
> The process in that wiki article can expand the monitor cluster while it
> is online.  Note that the monitor identification changed slightly between
> v0.21 and the current unstable branch (will be v0.22), and the
> instructions still need to be updated for that.
>
> > Q3:
> > Is it possible to add an mds while the ceph system is running? How?
>
> Yes.  Add the new mds to ceph.conf, start the daemon.  You should see it
> as up:standby in the 'ceph -s' or 'ceph mds dump -o -' output.  Then
>
>  ceph mds setmaxmds 2
>
> to change the size of the 'active' cluster to 2.
>
> Please keep in mind the clustered MDS still has some bugs; we expect v0.22
> to be stable.

Thanks, I will wait for v0.22 and try to add an mds then, but I want to
know whether my mds config is right.

I set up 2 mds entries in ceph.conf:

[mds]
        keyring = /etc/ceph/keyring.$name
        debug ms = 1
[mds.ceph01]
        host = ceph01
[mds.ceph02]
        host = ceph02

The result of 'ceph -s' was:

  10.09.01_17:56:19.337895   mds e17: 1/1/1 up {0=up:active}, 1 up:standby

But now the result of 'ceph -s' is:

  10.09.19_17:01:50.398809   mds e27: 1/1/1 up {0=up:active}

The result of 'ceph mds dump -o -' is:

  10.09.19_17:05:10.263142 mon <- [mds,dump]
  10.09.19_17:05:10.264095 mon0 -> 'dumped mdsmap epoch 27' (0)
  epoch 27
  client_epoch 0
  created 10.08.26_03:27:01.753124
  modified 10.09.11_00:42:41.691011
  tableserver 0
  root 0
  session_timeout 60
  session_autoclose 300
  compat  compat={},rocompat={},incompat={1=base v0.20}
  max_mds 1
  in      0
  up      {0=4298}
  failed
  stopped
  4298:   192.168.2.51:6800/3780 'ceph02' mds0.6 up:active seq 260551
  10.09.19_17:05:10.264231 wrote 321 byte payload to -

I don't quite understand what this information means. Does it mean one
mds went down? How should I handle it?
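
Once v0.22 is out, my plan is roughly the following, based on your
instructions above (the init-script argument 'mds.ceph02' is my guess
from the section name in ceph.conf):

# /etc/init.d/ceph start mds.ceph02
        (check that 'ceph -s' shows the new mds as up:standby)
# ceph mds setmaxmds 2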
>
> >
> > I fdisked a HDD into two partitions, one for the journal and the other for data, like this:
> > /dev/sdc1   180GB   as data
> > /dev/sdc2   10GB    as journal
> >
> > /dev/sdc1 formatted as btrfs, mounted at /mnt/osd0/data
> > /dev/sdc2 formatted as btrfs, mounted at /mnt/osd0/journal
> >
> > ceph.conf:
> >
> > [osd]
> >         osd data = /mnt/ceph/osd$id/data
> >         osd journal = /mnt/ceph/osd$id/journal
> >         ; osd journal size = 100
> >
> > When I use mkcephfs command, I can't build osd until I edited ceph.conf like this:
> >
> > [osd]
> >         osd data = /mnt/ceph/osd$id/data
> >         osd journal = /mnt/ceph/osd$id/data/journal
> >         osd journal size = 100
>
> If the journal is a file, the system won't create it for you unless you
> specify a size.  If it already exists (e.g., you created it via 'dd', or
> it's a block device) the journal size isn't needed.
>
> > Q4.
> > How to set the journal path to a device or patition?
>
>        osd journal = /dev/sdc1  ; or whatever

How do I know which journal belongs to which osd?
Does the following config do that?

[osd]
        sudo = true
        osd data = /mnt/ceph/osd$id/data
[osd0]
        host = ceph01
        osd journal = /dev/sdc1
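
Or, spelled out for several osds, is the idea something like this (the
second partitions /dev/sdc2 and /dev/sdd2 are just example device names)?

[osd]
        sudo = true
        osd data = /mnt/ceph/osd$id/data
[osd0]
        host = ceph01
        osd journal = /dev/sdc2    ; journal partition for osd0
[osd1]
        host = ceph01
        osd journal = /dev/sdd2    ; journal partition for osd1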

If I make a journal partition on a 500GB HDD, what is the proper size
for the partition?

Thanks.
Lin
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

