Hello everyone,

A Ceph problem has been troubling me for a long time. I built a cluster with 3 hosts, and each host has three OSDs. I then ran

    rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup

to test the write performance of the cluster. When the cluster was near full, no more data could be written to it. Unfortunately, one host then hung, and a lot of PGs started migrating to the other OSDs. After a while, many OSDs were marked down and out, and my cluster could not work any more.

The following is the output of "ceph -s":

    cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
     health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1
     monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1
     osdmap e173: 9 osds: 2 up, 2 in
            flags full
      pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
            37541 MB used, 3398 MB / 40940 MB avail
            945/29649 objects degraded (3.187%)
                  34 stale+active+degraded+remapped
                 176 stale+incomplete
                 320 stale+down+peering
                  53 active+degraded+remapped
                 408 incomplete
                   1 active+recovering+degraded
                 673 down+peering
                   1 stale+active+degraded
                  15 remapped+peering
                   3 stale+active+recovering+degraded+remapped
                   3 active+degraded
                  33 remapped+incomplete
                   8 active+recovering+degraded+remapped

The following is the output of "ceph osd tree":

    # id    weight  type name       up/down reweight
    -1      9       root default
    -3      9               rack unknownrack
    -2      3                       host 10.0.0.97
    0       1                               osd.0   down    0
    1       1                               osd.1   down    0
    2       1                               osd.2   down    0
    -4      3                       host 10.0.0.98
    3       1                               osd.3   down    0
    4       1                               osd.4   down    0
    5       1                               osd.5   down    0
    -5      3                       host 10.0.0.70
    6       1                               osd.6   up      1
    7       1                               osd.7   up      1
    8       1                               osd.8   down    0

The following is part of the output of osd.0.log:

    -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700  0 filestore(/data/osd/osd.0) error (28) No space left on device not handled on operation 10 (15804.0.13, or op 13, counting from 0)
    -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700  0 filestore(/data/osd/osd.0) ENOSPC handling not implemented
    -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700  0 filestore(/data/osd/osd.0) transaction dump:
    ...
    ...
     0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time 2014-11-14 17:33:02.251570
    os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")

     ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x17f8675]
     2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
     3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x101) [0x152d67d]
     4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
     5: (FileStore::OpWQ::_process(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
     6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*, ThreadPool::TPHandle&)+0x37) [0x15625e7]
     7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
     8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
     9: (Thread::_entry_func(void*)+0x23) [0x1998117]
     10: (()+0x79d1) [0x7fd9e92bf9d1]
     11: (clone()+0x6d) [0x7fd9e78ca9dd]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It seems the error code was ENOSPC (no space left on device). Why did the OSD daemon exit with an assert at that point? And if there was no space left, why did the cluster choose to migrate data at all? Only osd.6 and osd.7 were still alive. I tried to restart the other OSDs, but after a while those OSDs crashed again, and now I cannot read my data any more.

Is this a bug? Can anyone help me?
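In case it helps, here is what I am planning to try next, based on my (possibly wrong) understanding of the noout flag and the full-ratio settings. Please correct me if this is not the right approach:

    # stop the cluster from marking more OSDs out and triggering further migration
    ceph osd set noout

    # temporarily raise the full ratio a little above the default 0.95, so that the
    # full OSD has some headroom to start and peer again (as far as I understand,
    # this only buys room and is not a fix by itself)
    ceph pg set_full_ratio 0.97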
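After the OSDs are back up, I intend to free space by deleting the objects left behind by "rados bench ... --no-cleanup". I am assuming here that the benchmark objects keep the usual "benchmark_data" prefix in their names:

    # remove the benchmark objects from the data pool to free space
    # (assumes the default benchmark_data_* object name prefix)
    rados -p data ls | grep '^benchmark_data' | while read obj; do
        rados -p data rm "$obj"
    done

    # then restore the default full ratio and clear the noout flag
    ceph pg set_full_ratio 0.95
    ceph osd unset noout

Does this sound reasonable, or is there a better way to recover from a full cluster in this state?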