On Sat, 26 Aug 2017, Linux Chips wrote:
> On 08/24/2017 06:58 AM, Sage Weil wrote:
> >
> > Okay, so I think the combination of (1) removing empty PGs and (2)
> > pruning past_intervals will help.  (1) can be scripted by looking in
> > current/$pg_HEAD directories and picking out the ones with 0 or 1
> > objects in them, doing a ceph-objectstore-tool export to make a backup
> > (just in case), and then removing them (with ceph-objectstore-tool).
> > Be careful of PGs for empty pools since those will be naturally empty
> > (and you want to keep them).
> >
> > For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git;
> > if that is applied to the kraken branch it ought to work (although it's
> > untested).  Alternatively, you can just upgrade to luminous, as it
> > implements a more sophisticated version of the same thing.  You need to
> > upgrade mons, mark all osds down, upgrade osds and start at least one
> > of them, and then set 'ceph osd require-osd-release luminous' before
> > it'll switch to the new past intervals representation.  Definitely test
> > it on your test cluster to ensure it reduces the memory usage!
> >
> > If that doesn't sort things out we'll need to see a heap profile for an
> > OOMing OSD to make sure we know what is using all of the RAM...
> >
> > sage
>
> Well, a big thank you, Sage. We tested the upgrade on the test cluster
> and it worked in an awesome way; we did not even need to remove any PG
> with the tool.

OMG what a relief!  This is great news for a Friday night!  :)

> I read about the pruning in the release notes, so when our attempts to
> start the cluster failed, we tried upgrading, but it did not help. It
> turned out that we had missed the 'ceph osd require-osd-release luminous'
> step. We had looked at that command in the upgrade section of the release
> notes and said to each other "it does not matter, it would only restrict
> old OSDs from joining", and moved on. Damn, we would have been up a week
> ago.
> Having said that, I think the release notes should highlight this in the
> future.

Good point--I'll update the wording to make it clear that a lot of new
behavior does not kick in until the switch is flipped.

> Now we have upgraded the production cluster and it is up and running;
> the memory footprint is down to about a tenth of what it was. The
> largest RAM usage I saw on any OSD was about 6.5GB.
> But we hit some issues, particularly OSDs crashing with "FAILED
> assert(interval.last > last)".

Just to clarify: this production cluster never ran the hacky kraken patch
I posted that prunes past intervals, right?  If so, then yes, please open
a ticket with any/all osd.299 bugs you still have.  If it ran the kraken
patch, then let's not bother--I don't want to confuse the situation with
logs from that weird code.

Anyway, I'm delighted you're back up (and I'm sorry the release notes
wording didn't help lead you there a week ago)!  FWIW the 6.5GB still
sounds very high; I'd confirm that after the cluster stabilizes a
restarted OSD shrinks back down to a more typical size (I suspect the
allocator isn't releasing memory back to the OS due to fragmentation
etc).
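If you want to poke at that yourself, and assuming your OSDs are the
usual tcmalloc builds (the stock packages are), the heap commands are the
quickest way to see whether the allocator is just sitting on freed pages
-- and, if it isn't, to get me that heap profile. Roughly:

    ceph tell osd.299 heap stats            # mapped vs. actually-in-use bytes
    ceph tell osd.299 heap release          # ask tcmalloc to return free pages to the OS
    ceph tell osd.299 heap start_profiler
    # ...let it run under load for a while, then...
    ceph tell osd.299 heap dump
    ceph tell osd.299 heap stop_profiler

The dump files land next to the OSD log and can be read with the
gperftools pprof tool.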
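Also, since the require-osd-release step bit you once already: a quick
sanity check that the switch really is flipped (and the compact
past_intervals representation is in use) is to look at the osdmap. If I'm
remembering the luminous output right,

    ceph osd dump | grep require_osd_release

should report 'require_osd_release luminous' once the command has taken
effect.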
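And for the archives, in case anyone does end up doing the empty-PG
cleanup from (1) by hand after all: the export-then-remove step I meant is
roughly the following. This is only a sketch, reusing the osd.299/19.f1e
example from your log below, with a made-up backup path; stop the OSD
first, and note that newer ceph-objectstore-tool builds may insist on
--force for the remove:

    systemctl stop ceph-osd@299
    # keep a backup of the PG before deleting it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ \
        --pgid 19.f1e --op export --file /root/pg-19.f1e.export
    # then drop it from the OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ \
        --pgid 19.f1e --op remove        # add --force if your build requires it

And again: leave PGs that belong to genuinely empty pools alone -- those
are supposed to be empty.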
Thanks, and enjoy the weekend!

sage

>
> logs:
>
>    -34> 2017-08-26 00:38:00.505114 7f14556b4d00 0 osd.299 1085665 load_pgs opened 455 pgs
>    -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e needs 1050342-1085230
>    -32> 2017-08-26 00:38:00.505814 7f14556b4d00 1 osd.299 1085665 build_past_intervals_parallel over 1050342-1085230
>    -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342
>    -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 - loading and decoding 0x7f14b3dfb0c0
>    -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 1050342 780781 bytes
>    -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80] up [80], same_interval_since = 1050342
>    -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050343
>    -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 - loading and decoding 0x7f14b3dfad80
>    -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 1050343 781371 bytes
>    -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050344
>    -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 - loading and decoding 0x7f14b3dfb740
>    -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 1050344 782446 bytes
>    -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050345
>    -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 - loading and decoding 0x7f14b3dfba80
>    -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 1050345 782071 bytes
>    -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050346
>    -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 - loading and decoding 0x7f14b79c4000
>    -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 1050346 781955 bytes
>    -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050347
>    -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 - loading and decoding 0x7f14b79c4340
>    -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 1050347 781930 bytes
>    -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050348
>    -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 - loading and decoding 0x7f14b79c4680
>    -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 1050348 784883 bytes
>     -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050349
>     -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 - loading and decoding 0x7f14b79c49c0
>     -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 1050349 785081 bytes
>     -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050350
>     -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 - loading and decoding 0x7f14b79c4d00
>     -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 1050350 785948 bytes
>     -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050351
>     -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 - loading and decoding 0x7f14b79c5040
>     -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 1050351 788650 bytes
>      0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 00:38:00.532119
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f145612f420]
>  2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  3: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  6: (OSD::init()+0x2179) [0x7f1455bd7779]
>  7: (main()+0x2def) [0x7f1455add56f]
>  8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  9: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>  in thread 7f14556b4d00 thread_name:ceph-osd
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>  in thread 7f14556b4d00 thread_name:ceph-osd
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
>
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid 19.f1e
> {
>     "pgid": "19.f1e",
>     "last_update": "0'0",
>     "last_complete": "0'0",
>     "log_tail": "0'0",
>     "last_user_version": 0,
>     "last_backfill": "MAX",
>     "last_backfill_bitwise": 0,
>     "purged_snaps": [],
>     "history": {
>         "epoch_created": 1084817,
>         "epoch_pool_created": 1084817,
>         "last_epoch_started": 1085232,
>         "last_interval_started": 1085230,
>         "last_epoch_clean": 1050342,
>         "last_interval_clean": 1050342,
>         "last_epoch_split": 0,
>         "last_epoch_marked_full": 1061015,
>         "same_up_since": 1085230,
>         "same_interval_since": 1085230,
>         "same_primary_since": 1085230,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
>     },
>     "stats": {
>         "version": "0'0",
>         "reported_seq": "424",
>         "reported_epoch": "1085650",
>         "state": "active+undersized+degraded",
>         "last_fresh": "2017-08-25 18:52:46.520078",
>         "last_change": "2017-08-25 18:38:16.356266",
>         "last_active": "2017-08-25 18:52:46.520078",
>         "last_peered": "2017-08-25 18:52:46.520078",
>         "last_clean": "2017-08-25 17:32:06.181006",
>         "last_became_active": "2017-08-25 18:38:16.356266",
>         "last_became_peered": "2017-08-25 18:38:16.356266",
>         "last_unstale": "2017-08-25 18:52:46.520078",
>         "last_undegraded": "2017-08-25 18:38:16.304877",
>         "last_fullsized": "2017-08-25 18:38:16.304877",
>         "mapping_epoch": 1085230,
>         "log_start": "0'0",
>         "ondisk_log_start": "0'0",
>         "created": 1084817,
>         "last_epoch_clean": 1050342,
>         "parent": "0.0",
>         "parent_split_bits": 0,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "log_size": 0,
>         "ondisk_log_size": 0,
>         "stats_invalid": false,
>         "dirty_stats_invalid": false,
>         "omap_stats_invalid": false,
>         "hitset_stats_invalid": false,
>         "hitset_bytes_stats_invalid": false,
>         "pin_stats_invalid": false,
>         "stat_sum": {
>             "num_bytes": 0,
>             "num_objects": 0,
>             "num_object_clones": 0,
>             "num_object_copies": 0,
>             "num_objects_missing_on_primary": 0,
>             "num_objects_missing": 0,
>             "num_objects_degraded": 0,
>             "num_objects_misplaced": 0,
>             "num_objects_unfound": 0,
>             "num_objects_dirty": 0,
>             "num_whiteouts": 0,
>             "num_read": 0,
>             "num_read_kb": 0,
>             "num_write": 0,
>             "num_write_kb": 0,
>             "num_scrub_errors": 0,
>             "num_shallow_scrub_errors": 0,
>             "num_deep_scrub_errors": 0,
>             "num_objects_recovered": 0,
>             "num_bytes_recovered": 0,
>             "num_keys_recovered": 0,
>             "num_objects_omap": 0,
>             "num_objects_hit_set_archive": 0,
>             "num_bytes_hit_set_archive": 0,
>             "num_flush": 0,
>             "num_flush_kb": 0,
>             "num_evict": 0,
>             "num_evict_kb": 0,
>             "num_promote": 0,
>             "num_flush_mode_high": 0,
>             "num_flush_mode_low": 0,
>             "num_evict_mode_some": 0,
>             "num_evict_mode_full": 0,
>             "num_objects_pinned": 0,
>             "num_legacy_snapsets": 0
>         },
>         "up": [
>             299
>         ],
>         "acting": [
>             299
>         ],
>         "blocked_by": [],
>         "up_primary": 299,
>         "acting_primary": 299
>     },
>     "empty": 1,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 1085232,
>     "hit_set_history": {
>         "current_last_update": "0'0",
>         "history": []
>     }
> }
>
>
> ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
> total 0
> -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13
>
> ceph pg 19.f1e query
> .
> .
>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 299
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 299,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us proceed"
>                 }
>             ]
> .
> .
> .
>
>
> Removing the pg with ceph-objectstore-tool did the trick, but I am not sure
> whether the same thing would happen to a pg with real data in it.
> Should I report this in the bug tracker?
>
> thanks

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html