On Mon, Oct 23, 2017 at 4:51 PM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:
Hello,
On 23/10/2017 at 02:05, Brad Hubbard wrote:
How is it possible? How can I fix it? I am sure that if I run a lot of reads, other objects like this will crash other OSDs.

2017-10-22 17:32:56.031086 7f3acaff5700 1 osd.14 pg_epoch: 72024 pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13 ec=7037 les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0 lpr=72022 crt=71593'41657 lcod 0'0 mlcod 0'0 active+clean] hit_set_trim 37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

It appears to be looking for (and failing to find) a hitset object with a timestamp from August? Does that sound right to you? Of course, it appears an object for that timestamp does not exist.
(Cluster is OK now, I will probably destroy OSD 14 and recreate it).
How can I find this object?

You should be able to do a find on the OSD's filestore and grep the output for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs responsible for pg 37.1c and then move on to the others if it's feasible.
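For example, something along these lines should work (a sketch only; the default filestore data path /var/lib/ceph/osd/ceph-<id> and the OSD id are assumptions to adapt to your deployment):
# find /var/lib/ceph/osd/ceph-14/current -type f 2>/dev/null | grep 'hit_set_37.1c_archive_2017-08-31'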
Let us know the results.
For information: all Ceph servers are NTP time-synchronized.
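For reference, one quick way to double-check the time zone and NTP status on every node (a sketch reusing the pdsh invocation shown later in this thread; assumes systemd's timedatectl is available):
# pdsh -R exec -w ceph-osd-01,ceph-osd-02,ceph-osd-03,ceph-osd-04 ssh -x %h 'date -u; timedatectl | grep -E "Time zone|NTP"'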
What are the settings for this cache tier?
Just a cache tier in "writeback" mode on an erasure-coded pool (2+1).
# ceph osd pool get cache-nvme-data all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 512
pgp_num: 512
crush_ruleset: 10
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
hit_set_type: bloom
hit_set_period: 14400
hit_set_count: 12
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 1000000
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.4
cache_target_dirty_high_ratio: 0.6
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800
min_read_recency_for_promote: 1
min_write_recency_for_promote: 1
fast_read: 0
hit_set_grade_decay_rate: 0
hit_set_search_last_n: 0
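As a side note (a rough calculation, not from the original thread): with hit_set_period 14400 s and hit_set_count 12, the cache tier should only keep roughly the last 12 × 14400 s = 48 hours of hit set archives, so an OSD still referencing a hit_set_37.1c_archive object dated 2017-08-31 in late October looks anomalous, which fits the failed lookup in hit_set_trim.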
# ceph osd pool get raid-2-1-data all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 1024
pgp_num: 1024
crush_ruleset: 8
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
erasure_code_profile: raid-2-1
min_write_recency_for_promote: 0
fast_read: 0
# ceph osd erasure-code-profile get raid-2-1
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
ruleset-failure-domain=host
ruleset-root=default
technique=reed_sol_van
w=8
Could you check your logs for any errors from the 'agent_load_hit_sets' function?
Attached log:
# pdsh -R exec -w ceph-osd-01,ceph-osd-02,ceph-osd-03,ceph-osd-04 ssh -x %h 'zgrep -B10 -A10 agent_load_hit_sets /var/log/ceph/ceph-osd.*gz' | less > log_agent_load_hit_sets.log
On the morning of 19 October, I restarted OSD 14.
thanks for your help.
regards,
On Mon, Oct 23, 2017 at 2:41 AM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:
Hello,
Today I ran a lot of read IO with a simple rsync... and again, an OSD crashed:
But as before, I can't restart the OSD; it keeps crashing. So the OSD is out and the cluster is recovering.
I only had time to increase the OSD log level:
# ceph tell osd.14 injectargs --debug-osd 5/5
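If useful, the same command can raise the level to what was requested earlier (10 or greater, per http://tracker.ceph.com/issues/19185):
# ceph tell osd.14 injectargs '--debug-osd 10/10'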
Attached log:
# grep -B100 -100 objdump /var/log/ceph/ceph-osd.14.log
If I run another read, another OSD will probably crash.
Any idea?
I will probably plan to move the data from the erasure-coded pool to a 3x replicated pool. It has become unstable without any change on our side.
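One possible way to do that per image, once things are stable enough (a sketch only; 'replicated-pool' and 'myimage' are placeholder names, and the image should be unmapped/quiesced during the copy):
# rbd cp raid-2-1-data/myimage replicated-pool/myimage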
Regards,
PS: Last Sunday, I lost an RBD header while removing the cache tier... many thanks to http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/ for helping me recreate it and resurrect the RBD disk :)
On 19/10/2017 at 00:19, Brad Hubbard wrote:
On Wed, Oct 18, 2017 at 11:16 PM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:

hello,

For 2 weeks, I have been losing OSDs from time to time. Here is the trace:

0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

Can you try to capture a log with debug_osd set to 10 or greater as per http://tracker.ceph.com/issues/19185 ? This will allow us to see the output from the PrimaryLogPG::get_object_context() function which may help identify the problem. Please also check your machines all have the same time zone set and their clocks are in sync.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55eec15a09e5]
2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x6dd) [0x55eec107a52d]
3: (ReplicatedPG::hit_set_persist()+0xd7c) [0x55eec107d1bc]
4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1a92) [0x55eec109bbe2]
5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x747) [0x55eec10588a7]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x55eec0f0bbad]
7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x55eec0f0bdfd]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x77b) [0x55eec0f0f7db]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x55eec1590987]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55eec15928f0]
11: (()+0x7e25) [0x7f7c4fd52e25]
12: (clone()+0x6d) [0x7f7c4e3dc34d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I am using Jewel 10.2.10, with an erasure-coded pool (2+1) plus an NVMe cache tier (writeback, 3 replicas) holding simple RBD disks (12 SATA OSDs per node on 4 nodes + 1 NVMe per node = 48 SATA OSDs + 8 NVMe OSDs, since I split each NVMe in 2).

Last week, it was only NVMe OSDs which crashed, so I unmapped all disks, destroyed the cache tier and recreated it. From that day it worked fine. Today an OSD crashed again, but this time it was not an NVMe OSD, it was a normal (SATA) OSD.

Any idea? What about this 'void ReplicatedPG::hit_set_trim'?

thanks for your help,
Regards,
--
Cheers,
Brad
--
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com