Hmm, the problem is that I had not modified any config; everything is at the defaults. As you said, all the IO should be stopped by the "mon_osd_full_ratio" or "osd_failsafe_full_ratio" settings. In my test, when the OSDs were near full, the IO from "rest bench" stopped, but the backfill IO did not stop. Each OSD had 20G of space, which I think is big enough.

2014-11-19 3:18 GMT+08:00 Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>:
> You shouldn't let the cluster get so full that losing a few OSDs will
> make you go toofull. Letting the cluster get to 100% full is such a bad
> idea that you should make sure it doesn't happen.
>
> Ceph is supposed to stop moving data to an OSD once that OSD hits
> osd_backfill_full_ratio, which defaults to 0.85. Any disk at 86% full
> will stop backfilling.
>
> I have verified this works when the disks fill up while the cluster is
> healthy, but I haven't failed a disk once I'm in the toofull state. Even
> so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default
> 0.97) should stop all IO until a human gets involved.
>
> The only gotcha I can find is that the values are percentages, and the
> test is a "greater than" done with two significant digits. I.e., if
> osd_backfill_full_ratio is 0.85, it will continue backfilling until the
> disk is 86% full. So values of 0.99 and 1.00 will cause problems.
>
> On Mon, Nov 17, 2014 at 6:50 PM, han vincent <hangzws@xxxxxxxxx> wrote:
>>
>> hi, craig:
>>
>> Your solution did work very well. But if the data is very important,
>> a small mistake while removing the PG directories from the OSDs will
>> result in loss of data. And if the cluster is very large, don't you
>> think deleting data by hand until the disks drop from 100% to 95% is a
>> tedious and error-prone job, with so many OSDs, large disks, and so on?
>>
>> So my key question is: if there is no space left in the cluster while
>> some OSDs have crashed, why does the cluster still choose to migrate?
>> And during that migration the other OSDs crash one by one, until the
>> cluster cannot work at all.
>>
>> 2014-11-18 5:28 GMT+08:00 Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>:
>> > At this point, it's probably best to delete the pool. I'm assuming
>> > the pool only contains benchmark data, and nothing important.
>> >
>> > Assuming you can delete the pool:
>> > First, figure out the ID of the data pool. You can get that from
>> > ceph osd dump | grep '^pool'
>> >
>> > Once you have the number, delete the data pool:
>> > rados rmpool data data --yes-i-really-really-mean-it
>> >
>> > That will only free up space on OSDs that are up. You'll need to
>> > manually delete some PGs on the OSDs that are 100% full. Go to
>> > /var/lib/ceph/osd/ceph-<OSDID>/current, and delete a few directories
>> > that start with your data pool ID. You don't need to delete all of
>> > them. Once the disk is below 95% full, you should be able to start
>> > that OSD. Once it's up, it will finish deleting the pool.
>> >
>> > If you can't delete the pool, it is possible, but it's more work, and
>> > you still run the risk of losing data if you make a mistake. You need
>> > to disable backfilling, then delete some PGs on each OSD that's full.
>> > Try to only delete one copy of each PG. If you delete every copy of a
>> > PG on all OSDs, then you have lost the data that was in that PG. As
>> > before, once you delete enough that the disk is less than 95% full,
>> > you can start the OSD. Once you start it, start deleting your
>> > benchmark data out of the data pool.
>> > Once that's done, you can re-enable backfilling. You may need to
>> > scrub or deep-scrub the OSDs you deleted data from to get everything
>> > back to normal.
>> >
>> > So how did you get the disks 100% full anyway? Ceph normally won't
>> > let you do that. Did you increase mon_osd_full_ratio,
>> > osd_backfill_full_ratio, or osd_failsafe_full_ratio?
>> >
>> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent <hangzws@xxxxxxxxx> wrote:
>> >>
>> >> hello, everyone:
>> >>
>> >> These days a problem with ceph has troubled me for a long time.
>> >>
>> >> I built a cluster with 3 hosts, each with three OSDs in it.
>> >> After that I used the command
>> >> "rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup"
>> >> to test the write performance of the cluster.
>> >>
>> >> When the cluster was near full, no more data could be written to it.
>> >> Unfortunately, one host then hung up, and a lot of PGs started to
>> >> migrate to other OSDs. After a while, a lot of OSDs were marked down
>> >> and out, and my cluster couldn't work any more.
>> >>
>> >> The following is the output of "ceph -s":
>> >>
>> >> cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
>> >> health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
>> >> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
>> >> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
>> >> recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
>> >> down, quorum 0,2 2,1
>> >> monmap e1: 3 mons at
>> >> {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
>> >> epoch 40, quorum 0,2 2,1
>> >> osdmap e173: 9 osds: 2 up, 2 in
>> >> flags full
>> >> pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
>> >> 37541 MB used, 3398 MB / 40940 MB avail
>> >> 945/29649 objects degraded (3.187%)
>> >> 34 stale+active+degraded+remapped
>> >> 176 stale+incomplete
>> >> 320 stale+down+peering
>> >> 53 active+degraded+remapped
>> >> 408 incomplete
>> >> 1 active+recovering+degraded
>> >> 673 down+peering
>> >> 1 stale+active+degraded
>> >> 15 remapped+peering
>> >> 3 stale+active+recovering+degraded+remapped
>> >> 3 active+degraded
>> >> 33 remapped+incomplete
>> >> 8 active+recovering+degraded+remapped
>> >>
>> >> The following is the output of "ceph osd tree":
>> >>
>> >> # id  weight  type name               up/down reweight
>> >> -1    9       root default
>> >> -3    9         rack unknownrack
>> >> -2    3           host 10.0.0.97
>> >> 0     1             osd.0             down    0
>> >> 1     1             osd.1             down    0
>> >> 2     1             osd.2             down    0
>> >> -4    3           host 10.0.0.98
>> >> 3     1             osd.3             down    0
>> >> 4     1             osd.4             down    0
>> >> 5     1             osd.5             down    0
>> >> -5    3           host 10.0.0.70
>> >> 6     1             osd.6             up      1
>> >> 7     1             osd.7             up      1
>> >> 8     1             osd.8             down    0
>> >>
>> >> The following is part of the output of osd.0.log:
>> >>
>> >> -3> 2014-11-14 17:33:02.166022 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) error (28) No space left on device not
>> >> handled on operation 10 (15804.0.13, or op 13, counting from 0)
>> >> -2> 2014-11-14 17:33:02.216768 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) ENOSPC handling not implemented
>> >> -1> 2014-11-14 17:33:02.216783 7fd9dd1ab700 0
>> >> filestore(/data/osd/osd.0) transaction dump:
>> >> ...
>> >> ...
>> >> 0> 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 os/FileStore.cc: In
>> >> function 'unsigned int
>> >> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>> >> ThreadPool::TPHandle*)' thread 7fd9dd1ab700 time
>> >> 2014-11-14 17:33:02.251570
>> >> os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
>> >>
>> >> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x85) [0x17f8675]
>> >> 2: (FileStore::_do_transaction(ObjectStore::Transaction&,
>> >> unsigned long, int, ThreadPool::TPHandle*)+0x4855) [0x1534c21]
>> >> 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> >> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>> >> ThreadPool::TPHandle*)+0x101) [0x152d67d]
>> >> 4: (FileStore::_do_op(FileStore::OpSequencer*,
>> >> ThreadPool::TPHandle&)+0x57b) [0x152bdc3]
>> >> 5: (FileStore::OpWQ::_process(FileStore::OpSequencer*,
>> >> ThreadPool::TPHandle&)+0x2f) [0x1553c6f]
>> >> 6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*,
>> >> ThreadPool::TPHandle&)+0x37) [0x15625e7]
>> >> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x7a4) [0x18801de]
>> >> 8: (ThreadPool::WorkThread::entry()+0x23) [0x1881f2d]
>> >> 9: (Thread::_entry_func(void*)+0x23) [0x1998117]
>> >> 10: (()+0x79d1) [0x7fd9e92bf9d1]
>> >> 11: (clone()+0x6d) [0x7fd9e78ca9dd]
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> needed to interpret this.
>> >>
>> >> It seems the error code was ENOSPC (no space left), so why did the
>> >> OSD process exit with an assert at this point? And if there was no
>> >> space left, why did the cluster still choose to migrate? Only osd.6
>> >> and osd.7 were alive. I tried to restart the other OSDs, but after a
>> >> while they crashed again, and now I can't read the data any more.
>> >> Is it a bug? Can anyone help me?
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
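
For reference, the full-ratio settings and the backfill pause discussed in
this thread can be inspected and adjusted on a running Firefly-era (0.80.x)
cluster roughly as follows. This is only a sketch, not a verified procedure
for this exact cluster: "osd.0" is just an example id, and values set with
injectargs do not persist across an OSD restart.

    # Ask a running OSD for its current thresholds via the admin socket:
    ceph daemon osd.0 config get osd_backfill_full_ratio
    ceph daemon osd.0 config get osd_failsafe_full_ratio

    # The monitor-side full/nearfull ratios appear in the PG map header:
    ceph pg dump | grep -E 'full_ratio|nearfull_ratio'

    # Temporarily lower the backfill threshold on all OSDs, no restart needed:
    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.80'

    # Pause backfill/recovery while cleaning up full OSDs ...
    ceph osd set nobackfill
    ceph osd set norecover
    # ... and re-enable them once the disks are back below the full ratio:
    ceph osd unset nobackfill
    ceph osd unset norecover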