Re: [Jewel] Crash Osd with void Hit_set_trim

On Mon, Oct 23, 2017 at 4:51 PM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:

Hello,

On 23/10/2017 at 02:05, Brad Hubbard wrote:
2017-10-22 17:32:56.031086 7f3acaff5700  1 osd.14 pg_epoch: 72024 pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13 ec=7037 les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0 lpr=72022 crt=71593'41657 lcod 0'0 mlcod 0'0 active+clean] hit_set_trim 37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

It appears to be looking for (and failing to find) a hit set archive object with a timestamp from August. Does that sound right to you? Evidently no object exists for that timestamp, which is what trips the assert.

How is that possible, and how can I fix it? I am sure that if I run a lot of reads, other objects like this will crash other OSDs.
(The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
How can I find this object?

You should be able to do a find on the OSDs' filestores and grep the output for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs responsible for pg 37.1c and then move on to the others if feasible.
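A minimal sketch (assuming the default /var/lib/ceph/osd mount points and a filestore backend; note that filestore escapes some characters such as '_' in on-disk object names, so matching on the PG id and the date is more forgiving than grepping for the full object name):

# find /var/lib/ceph/osd/ceph-*/current -type f 2>/dev/null | grep 37.1c | grep 2017-08-31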

Let us know the results.


For information: all Ceph servers are time-synchronized with NTP.

What are the settings for this cache tier?

Just a cache tier in "writeback" mode on top of an erasure-coded 2+1 pool.

# ceph osd pool get cache-nvme-data all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 512
pgp_num: 512
crush_ruleset: 10
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
hit_set_type: bloom
hit_set_period: 14400
hit_set_count: 12
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 1000000
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.4
cache_target_dirty_high_ratio: 0.6
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800
min_read_recency_for_promote: 1
min_write_recency_for_promote: 1
fast_read: 0
hit_set_grade_decay_rate: 0
hit_set_search_last_n: 0
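(For reference, the hit set parameters above are the ones driven by 'ceph osd pool set'; a sketch of the equivalent commands, not necessarily the exact ones used on this cluster:)

# ceph osd pool set cache-nvme-data hit_set_type bloom
# ceph osd pool set cache-nvme-data hit_set_period 14400
# ceph osd pool set cache-nvme-data hit_set_count 12
# ceph osd pool set cache-nvme-data hit_set_fpp 0.05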

#  ceph osd pool get raid-2-1-data all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 1024
pgp_num: 1024
crush_ruleset: 8
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
erasure_code_profile: raid-2-1
min_write_recency_for_promote: 0
fast_read: 0

# ceph osd erasure-code-profile get raid-2-1
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
ruleset-failure-domain=host
ruleset-root=default
technique=reed_sol_van
w=8

Could you check your logs for any errors from the 'agent_load_hit_sets' function?

Attached log: # pdsh -R exec -w ceph-osd-01,ceph-osd-02,ceph-osd-03,ceph-osd-04 ssh -x %h 'zgrep -B10 -A10 agent_load_hit_sets /var/log/ceph/ceph-osd.*gz' | less > log_agent_load_hit_sets.log

On 19 October, I restarted OSD 14 in the morning.

thanks for your help.

regards,


On Mon, Oct 23, 2017 at 2:41 AM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:

Hello,

Today I ran a lot of read IO with a simple rsync... and again, an OSD crashed:

But as before, I can't restart the OSD; it keeps crashing. So the OSD is out and the cluster is recovering.

I just had time to increase the OSD debug level:

# ceph tell osd.14 injectargs '--debug-osd 5/5'

Attached log:

# grep -B100 -A100 objdump /var/log/ceph/ceph-osd.14.log

If I run another big read, another OSD will probably crash.

Any idea?

I will probably plan to move the data from the erasure-coded pool to a 3x replicated pool. It's becoming unstable without any change on our side.

Regards,

PS: Last Sunday, I lost an RBD header while removing the cache tier... many thanks to http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/ for helping me recreate it and resurrect the RBD disk :)

On 19/10/2017 at 00:19, Brad Hubbard wrote:
On Wed, Oct 18, 2017 at 11:16 PM, pascal.pucci@xxxxxxxxxxxxxxx
<pascal.pucci@xxxxxxxxxxxxxxx> wrote:
hello,

For two weeks now, I have occasionally been losing OSDs. Here is the trace:

    0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
Can you try to capture a log with debug_osd set to 10 or greater as
per http://tracker.ceph.com/issues/19185 ?

This will allow us to see the output from the
PrimaryLogPG::get_object_context() function which may help identify
the problem.
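For example (a sketch; assumes osd.14 is the next OSD to hit the assert and that it stays up long enough to accept the command):

# ceph tell osd.14 injectargs '--debug-osd 20'

or, to have the setting survive restarts, add "debug osd = 20" under [osd] in ceph.conf before starting the daemon again.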

Please also check your machines all have the same time zone set and
their clocks are in sync.
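A quick way to check on each node (assuming systemd and ntpd; with chrony, 'chronyc sources' is the equivalent):

# timedatectl | grep -i 'time zone'
# ntpq -p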

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55eec15a09e5]
 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x6dd) [0x55eec107a52d]
 3: (ReplicatedPG::hit_set_persist()+0xd7c) [0x55eec107d1bc]
 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1a92) [0x55eec109bbe2]
 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x747) [0x55eec10588a7]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x55eec0f0bbad]
 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x55eec0f0bdfd]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x77b) [0x55eec0f0f7db]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x55eec1590987]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55eec15928f0]
 11: (()+0x7e25) [0x7f7c4fd52e25]
 12: (clone()+0x6d) [0x7f7c4e3dc34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I am using Jewel 10.2.10

I am using an erasure-coded pool (2+1) + an NVMe cache tier (writeback) with 3 replicas, serving simple RBD disks.
(12 SATA OSDs per node on 4 nodes + 1 NVMe per node = 48 SATA OSDs + 8 NVMe OSDs; I split each NVMe in 2.)
Last week it was only NVMe OSDs that crashed, so I unmapped all the disks, destroyed the cache and recreated it.
Since then it had worked fine. Today an OSD crashed again, but this time it was not an NVMe OSD, just a normal SATA OSD.

Any idea? What about this 'void ReplicatedPG::hit_set_trim'?
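(For reference, a writeback tier on an erasure-coded pool like the one described above is normally wired up along these lines; a sketch using the pool and profile names from this thread, not the exact commands that were run here:)

# ceph osd erasure-code-profile set raid-2-1 k=2 m=1 ruleset-failure-domain=host
# ceph osd pool create raid-2-1-data 1024 1024 erasure raid-2-1
# ceph osd tier add raid-2-1-data cache-nvme-data
# ceph osd tier cache-mode cache-nvme-data writeback
# ceph osd tier set-overlay raid-2-1-data cache-nvme-data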

Thanks for your help,

Regards,





--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
