On Wed, 22 Sep 2010, cang lin wrote:
> We mount ceph not only on a client in the same subnet but also on a
> remote client over the internet. In the first week everything worked
> fine; the load is about 100 GB of writes and around 10 reads per day.
> The files are almost read-only and range from dozens of MB to a few GB,
> so it is not a very heavy load. But in the second week the client in the
> same subnet as the ceph cluster could no longer access ceph and could
> not unmount it; the remote client could still access and unmount ceph.
>
> Using 'ceph -s' and 'ceph osd dump -0' on ceph01 we found that 3 of the
> 4 osds were down (osd0, osd2, osd4). Using 'df -h' we found that
> /dev/sde1 (for osd0), /dev/sdd1 (for osd2) and /dev/sdc1 (for osd4) were
> still at their mount points.
>
> We used the following command to restart the osds:
>
> # /etc/init.d/ceph start osd0
> [/etc/ceph/fetch_config /tmp/fetched.ceph.conf.4967]
> === osd.0 ===
> Starting Ceph osd0 on ceph01...
> ** WARNING: Ceph is still under heavy development, and is only suitable for **
> ** testing and review. Do not trust it with important data. **
> starting osd0 at 0.0.0.0:6800/4864 osd_data /mnt/ceph/osd0/data /mnt/ceph/osd0/data/journal
> ...
>
> The 3 osds started and ran normally, but the local ceph client was down.
> Does it have anything to do with the osd restart? The local client could
> remount ceph after a reboot and work normally. The remote client could
> remount ceph and work normally too, but a few days later it could not
> access or unmount ceph.
>
> # umount /mnt/ceph
> umount: /mnt/ceph: device is busy.
>         (In some cases useful info about processes that use
>          the device is found by lsof(8) or fuser(1))
>
> There was no response from the lsof or fuser commands; the only thing we
> could do was kill the process and reboot the system. We use ceph v0.21.2
> for both the cluster and the client, on Ubuntu 10.04 LTS (server),
> kernel 2.6.32-21-generic-pae.
>
> What confuses me is why the client can't access ceph. Even if the osds
> were down, that shouldn't affect the client. What is the reason the
> client can't access or unmount ceph?

It could be a number of things.  The output from

    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/osdc

will tell you if it's waiting for a server request to respond.  Also, if
you know the hung pid, you can

    cat /proc/$pid/stack

and see where it is blocked.  Also, dmesg | tail may have some relevant
console messages.
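A minimal sketch that strings those checks together (assuming the kernel
has debugfs and /proc/<pid>/stack support; the optional pid argument is
illustrative, not part of any ceph tool):

    #!/bin/sh
    # Make sure debugfs is mounted so the ceph client debug files are visible.
    grep -q debugfs /proc/mounts || mount -t debugfs none /sys/kernel/debug

    # Dump in-flight MDS and OSD requests; non-empty output means the
    # client is still waiting on a reply from a server.
    for f in /sys/kernel/debug/ceph/*/mdsc /sys/kernel/debug/ceph/*/osdc; do
        echo "== $f =="
        cat "$f"
    done

    # If a pid was given, show where that process is blocked in the kernel.
    if [ -n "$1" ]; then
        cat /proc/"$1"/stack
    fi

    # Recent kernel messages often mention ceph socket errors or timeouts.
    dmesg | tail -n 30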
> > When I followed the instructions at
> > http://ceph.newdream.net/wiki/Monitor_cluster_expansion to expand a
> > monitor to ceph02, the following error occurred:
> >
> > > root@ceph02:~# /etc/init.d/ceph start mon1
> > > [/etc/ceph/fetch_config /tmp/fetched.ceph.conf.14210] ceph.conf 100% 2565 2.5KB/s 00:00
> > > === mon.1 ===
> > > Starting Ceph mon1 on ceph02...
> > > ** WARNING: Ceph is still under heavy development, and is only suitable for **
> > > ** testing and review. Do not trust it with important data. **
> > > terminate called after throwing an instance of 'std::logic_error'
> > >   what():  basic_string::_S_construct NULL not valid
> > > Aborted (core dumped)
> > > failed: ' /usr/bin/cmon -i 1 -c /tmp/fetched.ceph.conf.14210 '
> >
> > I haven't seen that crash, but it looks like a std::string constructor
> > is being passed a NULL pointer.  Do you have a core dump (to get a
> > backtrace)?  Which version are you running (`cmon -v`)?
>
> The cmon version was v0.21.1 when the crash happened; it has since been
> updated to v0.21.2.
>
> The following backtrace is from v0.21.2:

Thanks, we'll see if we can reproduce and fix this one!

> [...]
> Thanks, I will wait for v0.22 and try to add the mds then, but I want to
> know if my config for the mds is right.
> >
> > I set 2 mds in ceph.conf:
> > [mds]
> > keyring = /etc/ceph/keyring.$name
> > debug ms = 1
> > [mds.ceph01]
> > host = ceph01
> > [mds.ceph02]
> > host = ceph02

Looks right.

> The result for 'ceph -s' was:
>
> 10.09.01_17:56:19.337895 mds e17: 1/1/1 up {0=up:active}, 1 up:standby
>
> But now the result for 'ceph -s' is:
>
> 10.09.19_17:01:50.398809 mds e27: 1/1/1 up {0=up:active}

It looks like the second 'standby' cmds went away.  Is the daemon still
running?

> > > Q4.
> > > How to set the journal path to a device or partition?
> >
> > 	osd journal = /dev/sdc1   ; or whatever
> >
> > How to know which journal is for a certain osd?
> > Can the following config do that?
> >
> > [osd]
> > sudo = true
> > osd data = /mnt/ceph/osd$id/data
> > [osd0]
> > host = ceph01
> > osd journal = /dev/sdc1
> >
> If I make a partition for the journal on a 500GB hdd, what is the proper
> size for the partition?

1 GB should be sufficient.

sage
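For illustration only (the journal partition names below are made up, not
from the thread above): giving each [osdN] section its own 'osd journal'
line is what ties a particular journal device to a particular osd, e.g.:

    [osd]
    sudo = true
    osd data = /mnt/ceph/osd$id/data
    [osd0]
    host = ceph01
    osd journal = /dev/sdc2   ; ~1 GB partition reserved for osd0's journal
    [osd2]
    host = ceph01
    osd journal = /dev/sdd2   ; ~1 GB partition reserved for osd2's journal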