On Sat, 26 Aug 2017, Linux Chips wrote:
> On 08/24/2017 06:58 AM, Sage Weil wrote:
> >
> > Okay, so I think the combination of (1) removing empty PGs and (2)
> > pruning past_intervals will help.  (1) can be scripted by looking in
> > current/$pg_HEAD directories and picking out the ones with 0 or 1
> > objects in them, doing a ceph-objectstore-tool export to make a backup
> > (just in case), and then removing them (with ceph-objectstore-tool).
> > Be careful of PGs for empty pools since those will be naturally empty
> > (and you want to keep them).
> >
> > For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git;
> > if that is applied to the kraken branch it ought to work (although it's
> > untested).  Alternatively, you can just upgrade to luminous, as it
> > implements a more sophisticated version of the same thing.  You need to
> > upgrade mons, mark all osds down, upgrade osds and start at least one
> > of them, and then set 'ceph osd require-osd-release luminous' before
> > it'll switch to the new past intervals representation.  Definitely test
> > it on your test cluster to ensure it reduces the memory usage!
> >
> > If that doesn't sort things out we'll need to see a heap profile for an
> > OOMing OSD to make sure we know what is using all of the RAM...
> >
> > sage
>
> Well, a big thank you, Sage. We tested the upgrade on the test cluster
> and it worked in an awesome way; we did not even need to remove any PG
> with the tool.

OMG what a relief!  This is great news for a Friday night!  :)

> I read about the pruning in the release notes, so when our attempts to
> start the cluster failed, we tried upgrading, but it did not help. It
> turned out that we had missed the 'ceph osd require-osd-release luminous'
> step. We had looked at that command in the upgrade section of the release
> notes and said to each other "it does not matter, it would only restrict
> old OSDs from joining", and moved on. Damn, we would have been up a week
> ago.
> Having said that, I think the release notes should highlight this in the
> future.

Good point--I'll update the wording to make it clear that a lot of new
behavior does not kick in until the switch is flipped.

> Now we have upgraded the production cluster and it is up and running;
> the memory footprint is down to about a tenth of what it was. The
> largest RAM usage I saw on any OSD was about 6.5GB.
> But we hit some issues, particularly OSDs crashing with "FAILED
> assert(interval.last > last)".

Just to clarify: this production cluster never ran the hacky kraken patch
I posted that prunes past intervals, right?  If so, then yes, please open
a ticket with any/all osd.299 bugs you still have.  If it ran the kraken
patch, then let's not bother--I don't want to confuse the situation with
logs from that weird code.

Anyway, I'm delighted you're back up (and I'm sorry the release notes
wording didn't help lead you there a week ago)!  FWIW the 6.5GB still
sounds very high; I'd confirm that after the cluster stabilizes a
restarted OSD shrinks back down to a more typical size (I suspect the
allocator isn't releasing memory back to the OS due to fragmentation
etc).
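If you want to poke at that yourself, and assuming your OSDs are the
usual tcmalloc builds (the stock packages are), the heap commands are the
quickest way to see whether the allocator is just sitting on freed pages
-- and, if it isn't, to get me that heap profile. Roughly:

    ceph tell osd.299 heap stats            # mapped vs. actually-in-use bytes
    ceph tell osd.299 heap release          # ask tcmalloc to return free pages to the OS
    ceph tell osd.299 heap start_profiler
    # ...let it run under load for a while, then...
    ceph tell osd.299 heap dump
    ceph tell osd.299 heap stop_profiler

The dump files land next to the OSD log and can be read with the
gperftools pprof tool.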
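Also, since the require-osd-release step bit you once already: a quick
sanity check that the switch really is flipped (and the compact
past_intervals representation is in use) is to look at the osdmap. If I'm
remembering the luminous output right,

    ceph osd dump | grep require_osd_release

should report 'require_osd_release luminous' once the command has taken
effect.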
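And for the archives, in case anyone does end up doing the empty-PG
cleanup from (1) by hand after all: the export-then-remove step I meant is
roughly the following. This is only a sketch, reusing the osd.299/19.f1e
example from your log below, with a made-up backup path; stop the OSD
first, and note that newer ceph-objectstore-tool builds may insist on
--force for the remove:

    systemctl stop ceph-osd@299
    # keep a backup of the PG before deleting it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ \
        --pgid 19.f1e --op export --file /root/pg-19.f1e.export
    # then drop it from the OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ \
        --pgid 19.f1e --op remove        # add --force if your build requires it

And again: leave PGs that belong to genuinely empty pools alone -- those
are supposed to be empty.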
Thanks, and enjoy the weekend!

sage

>
> logs:
>
>    -34> 2017-08-26 00:38:00.505114 7f14556b4d00 0 osd.299 1085665 load_pgs opened 455 pgs
>    -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e needs 1050342-1085230
>    -32> 2017-08-26 00:38:00.505814 7f14556b4d00 1 osd.299 1085665 build_past_intervals_parallel over 1050342-1085230
>    -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342
>    -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 - loading and decoding 0x7f14b3dfb0c0
>    -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 1050342 780781 bytes
>    -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80] up [80], same_interval_since = 1050342
>    -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050343
>    -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 - loading and decoding 0x7f14b3dfad80
>    -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 1050343 781371 bytes
>    -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050344
>    -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 - loading and decoding 0x7f14b3dfb740
>    -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 1050344 782446 bytes
>    -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050345
>    -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 - loading and decoding 0x7f14b3dfba80
>    -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 1050345 782071 bytes
>    -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050346
>    -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 - loading and decoding 0x7f14b79c4000
>    -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 1050346 781955 bytes
>    -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050347
>    -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 - loading and decoding 0x7f14b79c4340
>    -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 1050347 781930 bytes
>    -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050348
>    -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 - loading and decoding 0x7f14b79c4680
>    -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 1050348 784883 bytes
>     -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050349
>     -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 - loading and decoding 0x7f14b79c49c0
>     -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 1050349 785081 bytes
>     -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050350
>     -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 - loading and decoding 0x7f14b79c4d00
>     -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 1050350 785948 bytes
>     -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050351
>     -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 - loading and decoding 0x7f14b79c5040
>     -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 1050351 788650 bytes
>      0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 00:38:00.532119
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f145612f420]
>  2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  3: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  6: (OSD::init()+0x2179) [0x7f1455bd7779]
>  7: (main()+0x2def) [0x7f1455add56f]
>  8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  9: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>  in thread 7f14556b4d00 thread_name:ceph-osd
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>  in thread 7f14556b4d00 thread_name:ceph-osd
>
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
>
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid 19.f1e
> {
>     "pgid": "19.f1e",
>     "last_update": "0'0",
>     "last_complete": "0'0",
>     "log_tail": "0'0",
>     "last_user_version": 0,
>     "last_backfill": "MAX",
>     "last_backfill_bitwise": 0,
>     "purged_snaps": [],
>     "history": {
>         "epoch_created": 1084817,
>         "epoch_pool_created": 1084817,
>         "last_epoch_started": 1085232,
>         "last_interval_started": 1085230,
>         "last_epoch_clean": 1050342,
>         "last_interval_clean": 1050342,
>         "last_epoch_split": 0,
>         "last_epoch_marked_full": 1061015,
>         "same_up_since": 1085230,
>         "same_interval_since": 1085230,
>         "same_primary_since": 1085230,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
>     },
>     "stats": {
>         "version": "0'0",
>         "reported_seq": "424",
>         "reported_epoch": "1085650",
>         "state": "active+undersized+degraded",
>         "last_fresh": "2017-08-25 18:52:46.520078",
>         "last_change": "2017-08-25 18:38:16.356266",
>         "last_active": "2017-08-25 18:52:46.520078",
>         "last_peered": "2017-08-25 18:52:46.520078",
>         "last_clean": "2017-08-25 17:32:06.181006",
>         "last_became_active": "2017-08-25 18:38:16.356266",
>         "last_became_peered": "2017-08-25 18:38:16.356266",
>         "last_unstale": "2017-08-25 18:52:46.520078",
>         "last_undegraded": "2017-08-25 18:38:16.304877",
>         "last_fullsized": "2017-08-25 18:38:16.304877",
>         "mapping_epoch": 1085230,
>         "log_start": "0'0",
>         "ondisk_log_start": "0'0",
>         "created": 1084817,
>         "last_epoch_clean": 1050342,
>         "parent": "0.0",
>         "parent_split_bits": 0,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "log_size": 0,
>         "ondisk_log_size": 0,
>         "stats_invalid": false,
>         "dirty_stats_invalid": false,
>         "omap_stats_invalid": false,
>         "hitset_stats_invalid": false,
>         "hitset_bytes_stats_invalid": false,
>         "pin_stats_invalid": false,
>         "stat_sum": {
>             "num_bytes": 0,
>             "num_objects": 0,
>             "num_object_clones": 0,
>             "num_object_copies": 0,
>             "num_objects_missing_on_primary": 0,
>             "num_objects_missing": 0,
>             "num_objects_degraded": 0,
>             "num_objects_misplaced": 0,
>             "num_objects_unfound": 0,
>             "num_objects_dirty": 0,
>             "num_whiteouts": 0,
>             "num_read": 0,
>             "num_read_kb": 0,
>             "num_write": 0,
>             "num_write_kb": 0,
>             "num_scrub_errors": 0,
>             "num_shallow_scrub_errors": 0,
>             "num_deep_scrub_errors": 0,
>             "num_objects_recovered": 0,
>             "num_bytes_recovered": 0,
>             "num_keys_recovered": 0,
>             "num_objects_omap": 0,
>             "num_objects_hit_set_archive": 0,
>             "num_bytes_hit_set_archive": 0,
>             "num_flush": 0,
>             "num_flush_kb": 0,
>             "num_evict": 0,
>             "num_evict_kb": 0,
>             "num_promote": 0,
>             "num_flush_mode_high": 0,
>             "num_flush_mode_low": 0,
>             "num_evict_mode_some": 0,
>             "num_evict_mode_full": 0,
>             "num_objects_pinned": 0,
>             "num_legacy_snapsets": 0
>         },
>         "up": [
>             299
>         ],
>         "acting": [
>             299
>         ],
>         "blocked_by": [],
>         "up_primary": 299,
>         "acting_primary": 299
>     },
>     "empty": 1,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 1085232,
>     "hit_set_history": {
>         "current_last_update": "0'0",
>         "history": []
>     }
> }
>
>
> ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
> total 0
> -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13
>
> ceph pg 19.f1e query
> .
> .
>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 299
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 299,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us proceed"
>                 }
>             ]
> .
> .
> .
>
>
> Removing the pg with ceph-objectstore-tool did the trick, but I am not sure
> whether the same thing would happen to a pg with real data in it.
> Should I report this in the bug tracker?
>
> thanks

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html