Re: osd crashed while there was no space

You shouldn't let the cluster get so full that losing a few OSDs pushes the remaining ones into the toofull state.  Letting the cluster reach 100% full is bad enough that you should plan capacity so it can never happen.


Ceph is supposed to stop moving data to an OSD once that OSD hits osd_backfill_full_ratio, which defaults to 0.85.  Any disk at 86% full will stop backfilling.

I have verified this works when the disks fill up while the cluster is healthy, but I haven't tested failing a disk once the cluster is already in the toofull state.  Even so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default 0.97) should stop all IO until a human gets involved.

The only gotcha I can find is that the values are compared as percentages, and the test is a "greater than" done with two significant digits.  I.e., if osd_backfill_full_ratio is 0.85, an OSD will continue backfilling until its disk is 86% full.  So values of 0.99 and 1.00 will cause problems.
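
For reference, those thresholds live in ceph.conf; a minimal sketch with the defaults mentioned above looks like this (lower them if you want more headroom before a human has to step in):

    [global]
        mon_osd_full_ratio = .95          # cluster is flagged full and writes stop
        osd_backfill_full_ratio = .85     # OSDs stop accepting backfill above this
        osd_failsafe_full_ratio = .97     # OSDs stop all IO as a last resort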


On Mon, Nov 17, 2014 at 6:50 PM, han vincent <hangzws@xxxxxxxxx> wrote:
hi, craig:

    Your solution did work very well. But if the data is important, a
small mistake while removing PG directories from the OSDs can result
in loss of data. And if the cluster is very large, don't you think
deleting data by hand to bring each disk from 100% down to 95% is
tedious and error-prone, with so many OSDs, large disks, and so on?

     So my key question is: if there is no space left in the cluster
and some OSDs crash, why does the cluster still choose to migrate
data? During that migration the other OSDs crash one by one until the
cluster cannot work at all.

2014-11-18 5:28 GMT+08:00 Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>:
> At this point, it's probably best to delete the pool.  I'm assuming the pool
> only contains benchmark data, and nothing important.
>
> Assuming you can delete the pool:
> First, figure out the ID of the data pool.  You can get that from ceph osd
> dump | grep '^pool'
>
> Once you have the number, delete the data pool: rados rmpool data data
> --yes-i-really-really-mean-it
>
> That will only free up space on OSDs that are up.  You'll need to manually
> delete some PGs on the OSDs that are 100% full.  Go to
> /var/lib/ceph/osd/ceph-<OSDID>/current, and delete a few directories that
> start with your data pool ID.  You don't need to delete all of them.  Once
> the disk is below 95% full, you should be able to start that OSD.  Once it's
> up, it will finish deleting the pool.
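
Putting those steps together, the whole sequence is roughly the following; <POOLID>, <PGID>, and <OSDID> are placeholders, and the exact start command depends on your distro:

    ceph osd dump | grep '^pool'        # note the ID of the "data" pool
    rados rmpool data data --yes-i-really-really-mean-it
    # then, on each OSD that is 100% full and refuses to start:
    cd /var/lib/ceph/osd/ceph-<OSDID>/current
    ls -d <POOLID>.*_head               # PG directories belonging to the deleted pool
    rm -rf <POOLID>.<PGID>_head         # remove a few of these, not all of them
    df -h .                             # repeat until usage is below 95%
    service ceph start osd.<OSDID>      # or: start ceph-osd id=<OSDID>
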
>
> If you can't delete the pool, recovery is still possible, but it's more work,
> and you still run the risk of losing data if you make a mistake.  You need to
> disable backfilling, then delete some PGs on each OSD that's full. Try to
> only delete one copy of each PG.  If you delete every copy of a PG on all
> OSDs, then you lost the data that was in that PG.  As before, once you
> delete enough that the disk is less than 95% full, you can start the OSD.
> Once you start it, start deleting your benchmark data out of the data pool.
> Once that's done, you can re-enable backfilling.  You may need to scrub or
> deep-scrub the OSDs you deleted data from to get everything back to normal.
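
If you do have to keep the pool, a rough sketch of that more careful variant (again with placeholder IDs, and using the nobackfill flag to disable backfilling) might look like:

    ceph osd set nobackfill                  # stop backfill while you work
    # on each full OSD, remove ONE copy of a few PGs -- never every copy of the same PG:
    cd /var/lib/ceph/osd/ceph-<OSDID>/current
    rm -rf <POOLID>.<PGID>_head
    service ceph start osd.<OSDID>           # once the disk is below 95% full
    # delete the benchmark objects from the data pool, then:
    ceph osd unset nobackfill
    ceph osd deep-scrub <OSDID>              # re-check the OSDs you touched
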
>
>
> So how did you get the disks 100% full anyway?  Ceph normally won't let you
> do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or
> osd_failsafe_full_ratio?
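
You can check what the daemons are actually running with through the admin socket (default socket path shown; adjust to your install):

    ceph --admin-daemon /var/run/ceph/ceph-osd.<OSDID>.asok config show | grep full_ratio
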
>
>
> On Mon, Nov 17, 2014 at 7:00 AM, han vincent <hangzws@xxxxxxxxx> wrote:
>>
>> hello, every one:
>>
>>     For the past few days a Ceph problem has been troubling me.
>>
>>     I built a cluster with 3 hosts, each with three OSDs. After that
>> I used the command "rados bench 360 -p data -b 4194304 -t 300 write
>> --no-cleanup"
>> to test the write performance of the cluster.
>>
>>     When the cluster was nearly full, no more data could be written
>> to it. Unfortunately, one host then hung, and a lot of PGs started to
>> migrate to other OSDs. After a while, a lot of OSDs were marked down
>> and out, and my cluster couldn't work any more.
>>
>>     The following is the output of "ceph -s":
>>
>>     cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
>>     health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
>> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
>> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
>> recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
>> down, quorum 0,2 2,1
>>      monmap e1: 3 mons at
>> {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
>> epoch 40, quorum 0,2 2,1
>>      osdmap e173: 9 osds: 2 up, 2 in
>>             flags full
>>       pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
>>             37541 MB used, 3398 MB / 40940 MB avail
>>             945/29649 objects degraded (3.187%)
>>                   34 stale+active+degraded+remapped
>>                  176 stale+incomplete
>>                  320 stale+down+peering
>>                   53 active+degraded+remapped
>>                  408 incomplete
>>                    1 active+recovering+degraded
>>                  673 down+peering
>>                    1 stale+active+degraded
>>                   15 remapped+peering
>>                    3 stale+active+recovering+degraded+remapped
>>                    3 active+degraded
>>                   33 remapped+incomplete
>>                    8 active+recovering+degraded+remapped
>>
>>     The following is the output of "ceph osd tree":
>>     # id    weight  type name       up/down reweight
>>     -1      9       root default
>>     -3      9               rack unknownrack
>>     -2      3                       host 10.0.0.97
>>      0       1                               osd.0   down    0
>>      1       1                               osd.1   down    0
>>      2       1                               osd.2   down    0
>>      -4      3                       host 10.0.0.98
>>      3       1                               osd.3   down    0
>>      4       1                               osd.4   down    0
>>      5       1                               osd.5   down    0
>>      -5      3                       host 10.0.0.70
>>      6       1                               osd.6   up      1
>>      7       1                               osd.7   up      1
>>      8       1                               osd.8   down    0
>>
>> The following is part of the output of osd.0.log:
>>
>>     -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
>> filestore(/data/osd/osd.0)  error (28) No space left on device not
>> handled on operation 10 (15804.0.13, or op 13, counting from 0)
>>     -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
>> filestore(/data/osd/osd.0) ENOSPC handling not implemented
>>     -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
>> filestore(/data/osd/osd.0)  transaction dump:
>>     ...
>>     ...
>>     0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
>> function 'unsigned int
>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>> ThreadPool::TPHandle*)' thread 7fd9dd1ab700             time
>> 2014-11-14 17:33:02.251570
>>       os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
>>
>>       ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0x17f8675]
>>      2: (FileStore::_do_transaction(ObjectStore::Transaction&,
>> unsigned long, int, ThreadPool::TPHandle*)+0x4855)         [0x1534c21]
>>      3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> std::allocator<ObjectStore::Transaction*> >&,      unsigned long,
>> ThreadPool::TPHandle*)+0x101) [0x152d67d]
>>      4: (FileStore::_do_op(FileStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
>>      5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
>>      6:
>> (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*,
>> ThreadPool::TPHandle&)+0x37)      [0x15625e7]
>>      7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
>>      8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
>>      9: (Thread::_entry_func(void*)+0x23) [0x1998117]
>>     10: (()+0x79d1) [0x7fd9e92bf9d1]
>>     11: (clone()+0x6d) [0x7fd9e78ca9dd]
>>     NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>>     It seems the error code was ENOSPC (no space left on device), so
>> why did the OSD process exit with an assert at this point? If there
>> was no space left, why did the cluster still choose to migrate data?
>> Only osd.6 and osd.7 were alive. I tried to restart the other OSDs,
>> but after a while those OSDs crashed again. And now I can't read the
>> data any more.
>>     Is this a bug? Can anyone help me?
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
