Hmm, the problem is that I had not modified any config; everything is at the defaults. As you said, all the IO should be stopped by the "mon_osd_full_ratio" or "osd_failsafe_full_ratio" settings. In my test, when the OSDs were near full, the IO from "rest bench" stopped, but the backfill IO did not stop. Each OSD had 20G of space, which I think is big enough.

2014-11-19 3:18 GMT+08:00 Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>:
> You shouldn't let the cluster get so full that losing a few OSDs will
> make you go toofull. Letting the cluster get to 100% full is such a bad
> idea that you should make sure it doesn't happen.
>
> Ceph is supposed to stop moving data to an OSD once that OSD hits
> osd_backfill_full_ratio, which defaults to 0.85. Any disk at 86% full
> will stop backfilling.
>
> I have verified this works when the disks fill up while the cluster is
> healthy, but I haven't failed a disk once I'm in the toofull state. Even
> so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default
> 0.97) should stop all IO until a human gets involved.
>
> The only gotcha I can find is that the values are percentages, and the
> test is a "greater than" done with two significant digits. I.e., if
> osd_backfill_full_ratio is 0.85, it will continue backfilling until the
> disk is 86% full. So values of 0.99 and 1.00 will cause problems.
>
> On Mon, Nov 17, 2014 at 6:50 PM, han vincent <hangzws@xxxxxxxxx> wrote:
>>
>> hi, craig:
>>
>> Your solution did work very well. But if the data is very important,
>> a small mistake while removing the PG directories from the OSDs will
>> result in loss of data. And if the cluster is very large, don't you
>> think deleting data by hand until the disks drop from 100% to 95% is a
>> tedious and error-prone job, with so many OSDs, large disks, and so on?
>>
>> So my key question is: if there is no space left in the cluster while
>> some OSDs have crashed, why does the cluster still choose to migrate?
>> And during that migration the other OSDs crash one by one, until the
>> cluster cannot work at all.
>>
>> 2014-11-18 5:28 GMT+08:00 Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>:
>> > At this point, it's probably best to delete the pool. I'm assuming
>> > the pool only contains benchmark data, and nothing important.
>> >
>> > Assuming you can delete the pool:
>> > First, figure out the ID of the data pool. You can get that from
>> > ceph osd dump | grep '^pool'
>> >
>> > Once you have the number, delete the data pool:
>> > rados rmpool data data --yes-i-really-really-mean-it
>> >
>> > That will only free up space on OSDs that are up. You'll need to
>> > manually delete some PGs on the OSDs that are 100% full. Go to
>> > /var/lib/ceph/osd/ceph-<OSDID>/current, and delete a few directories
>> > that start with your data pool ID. You don't need to delete all of
>> > them. Once the disk is below 95% full, you should be able to start
>> > that OSD. Once it's up, it will finish deleting the pool.
>> >
>> > If you can't delete the pool, it is possible, but it's more work, and
>> > you still run the risk of losing data if you make a mistake. You need
>> > to disable backfilling, then delete some PGs on each OSD that's full.
>> > Try to only delete one copy of each PG. If you delete every copy of a
>> > PG on all OSDs, then you have lost the data that was in that PG. As
>> > before, once you delete enough that the disk is less than 95% full,
>> > you can start the OSD. Once you start it, start deleting your
>> > benchmark data out of the data pool.
>> > Once that's done, you can re-enable backfilling. You may need to
>> > scrub or deep-scrub the OSDs you deleted data from to get everything
>> > back to normal.
>> >
>> > So how did you get the disks 100% full anyway? Ceph normally won't
>> > let you do that. Did you increase mon_osd_full_ratio,
>> > osd_backfill_full_ratio, or osd_failsafe_full_ratio?
>> >
>> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent <hangzws@xxxxxxxxx> wrote:
>> >>
>> >> hello, everyone:
>> >>
>> >> These days a problem with ceph has troubled me for a long time.
>> >>
>> >> I built a cluster with 3 hosts, each with three OSDs in it.
>> >> After that I used the command
>> >> "rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup"
>> >> to test the write performance of the cluster.
>> >>
>> >> When the cluster was near full, no more data could be written to it.
>> >> Unfortunately, one host then hung up, and a lot of PGs started to
>> >> migrate to other OSDs. After a while, a lot of OSDs were marked down
>> >> and out, and my cluster couldn't work any more.
>> >>
>> >> The following is the output of "ceph -s":
>> >>
>> >> cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
>> >> health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
>> >> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
>> >> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
>> >> recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
>> >> down, quorum 0,2 2,1
>> >> monmap e1: 3 mons at
>> >> {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
>> >> epoch 40, quorum 0,2 2,1
>> >> osdmap e173: 9 osds: 2 up, 2 in
>> >> flags full
>> >> pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
>> >> 37541 MB used, 3398 MB / 40940 MB avail
>> >> 945/29649 objects degraded (3.187%)
>> >> 34 stale+active+degraded+remapped
>> >> 176 stale+incomplete
>> >> 320 stale+down+peering
>> >> 53 active+degraded+remapped
>> >> 408 incomplete
>> >> 1 active+recovering+degraded
>> >> 673 down+peering
>> >> 1 stale+active+degraded
>> >> 15 remapped+peering
>> >> 3 stale+active+recovering+degraded+remapped
>> >> 3 active+degraded
>> >> 33 remapped+incomplete
>> >> 8 active+recovering+degraded+remapped
>> >>
>> >> The following is the output of "ceph osd tree":
>> >>
>> >> # id  weight  type name               up/down reweight
>> >> -1    9       root default
>> >> -3    9         rack unknownrack
>> >> -2    3           host 10.0.0.97
>> >> 0     1             osd.0             down    0
>> >> 1     1             osd.1             down    0
>> >> 2     1             osd.2             down    0
>> >> -4    3           host 10.0.0.98
>> >> 3     1             osd.3             down    0
>> >> 4     1             osd.4             down    0
>> >> 5     1             osd.5             down    0
>> >> -5    3           host 10.0.0.70
>> >> 6     1             osd.6             up      1
>> >> 7     1             osd.7             up      1
>> >> 8     1             osd.8             down    0
>> >>
>> >> The following is part of the output of osd.0.log:
>> >>
>> >> -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) error (28) No space left on device not
>> >> handled on operation 10 (15804.0.13, or op 13, counting from 0)
>> >> -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) ENOSPC handling not implemented
>> >> -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) transaction dump:
>> >> ...
>> >> ...
>> >> 0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
>> >> function 'unsigned int
>> >> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>> >> ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time
>> >> 2014-11-14 17:33:02.251570
>> >> os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
>> >>
>> >> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x85) [0x17f8675]
>> >> 2: (FileStore::_do_transaction(ObjectStore::Transaction&,
>> >> unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
>> >> 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> >> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>> >> ThreadPool::TPHandle*)+0x101) [0x152d67d]
>> >> 4: (FileStore::_do_op(FileStore::OpSequencer*,
>> >> ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
>> >> 5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
>> >> ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
>> >> 6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*,
>> >> ThreadPool::TPHandle&)+0x37) [0x15625e7]
>> >> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
>> >> 8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
>> >> 9: (Thread::_entry_func(void*)+0x23) [0x1998117]
>> >> 10: (()+0x79d1) [0x7fd9e92bf9d1]
>> >> 11: (clone()+0x6d) [0x7fd9e78ca9dd]
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> needed to interpret this.
>> >>
>> >> It seems the error code was ENOSPC (no space left), so why did the
>> >> OSD process exit with an assert at this point? And if there was no
>> >> space left, why did the cluster still choose to migrate? Only osd.6
>> >> and osd.7 were alive. I tried to restart the other OSDs, but after a
>> >> while they crashed again, and now I can't read the data any more.
>> >> Is it a bug? Can anyone help me?
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
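
For reference, the full-ratio settings and the backfill pause discussed in
this thread can be inspected and adjusted on a running Firefly-era (0.80.x)
cluster roughly as follows. This is only a sketch, not a verified procedure
for this exact cluster: "osd.0" is just an example id, and values set with
injectargs do not persist across an OSD restart.

    # Ask a running OSD for its current thresholds via the admin socket:
    ceph daemon osd.0 config get osd_backfill_full_ratio
    ceph daemon osd.0 config get osd_failsafe_full_ratio

    # The monitor-side full/nearfull ratios appear in the PG map header:
    ceph pg dump | grep -E 'full_ratio|nearfull_ratio'

    # Temporarily lower the backfill threshold on all OSDs, no restart needed:
    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.80'

    # Pause backfill/recovery while cleaning up full OSDs ...
    ceph osd set nobackfill
    ceph osd set norecover
    # ... and re-enable them once the disks are back below the full ratio:
    ceph osd unset nobackfill
    ceph osd unset norecover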