Excuse the top-posting. When looking at the logs it helps to filter by the thread that actually crashed.

$ grep 7f08af3b6700 ceph-osd.27.log.last.error.txt|tail -15
-1001> 2019-10-30 12:55:41.498823 7f08af3b6700 1 -- 129.20.199.93:6803/977508 --> 129.20.199.7:0/2975967502 -- osd_op_reply(283046730 rbd_data.384d296b8b4567.0000000000000f99 [set-alloc-hint object_size 4194304 write_size 4194304,write 3145728~4096] v194345'6696469 uv6696469 ondisk = 0) v8 -- 0x5598ed521440 con 0
-651> 2019-10-30 12:55:42.211634 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-565> 2019-10-30 12:55:42.775786 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6823/3002168 -- MOSDScrubReserve(5.2d8 REJECT e194345) v1 -- 0x5598ed7e4000 con 0
-457> 2019-10-30 12:55:43.390134 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'4406723, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-435> 2019-10-30 12:55:43.850768 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'1735861, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-335> 2019-10-30 12:55:44.637635 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'7602452, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-325> 2019-10-30 12:55:44.682357 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.1:6802/3802 -- osd_repop(client.108792126.1:283046901 6.369 e194345/194339 6:96f81e66:::rbd_data.384d296b8b4567.0000000000000f99:head v 194345'6696470) v2 -- 0x5598ee591600 con 0
-324> 2019-10-30 12:55:44.682450 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6821/6004637 -- osd_repop(client.108792126.1:283046901 6.369 e194345/194339 6:96f81e66:::rbd_data.384d296b8b4567.0000000000000f99:head v 194345'6696470) v2 -- 0x5598cf2ad600 con 0
-323> 2019-10-30 12:55:44.682510 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'6696470, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-20> 2019-10-30 12:55:46.366704 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6806/1848108 -- pg_scan(digest 2.1d9 2:9b97b661:::rb.0.a7bb39.238e1f29.000000107c9b:head-MAX e 194345/194345) v2 -- 0x5598efc0bb80 con 0
0> 2019-10-30 12:55:46.496423 7f08af3b6700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f08af3b6700 time 2019-10-30 12:55:46.487842
2019-10-30 12:55:46.557930 7f08af3b6700 -1 *** Caught signal (Aborted) ** in thread 7f08af3b6700 thread_name:tp_osd_tp
0> 2019-10-30 12:55:46.557930 7f08af3b6700 -1 *** Caught signal (Aborted) ** in thread 7f08af3b6700 thread_name:tp_osd_tp

Since PrimaryLogPG::on_local_recover() prints the object id at debug level 10 when the function is entered, I'd suggest gathering a log at a higher 'debug_osd' level (I'd suggest 20) to be sure which object is causing the issue.
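One way to bump the level at runtime is sketched below (osd.27 is only used as the example id here; adjust it for each affected daemon, and revert afterwards since 20/20 is very chatty):

$ ceph tell osd.27 injectargs '--debug_osd 20/20'
  (reproduce the crash, save the log, then put the level back)
$ ceph tell osd.27 injectargs '--debug_osd 1/5'

Keep in mind that injected settings do not survive a daemon restart, so if the OSD keeps aborting it may be easier to temporarily set debug_osd = 20/20 under [osd] in ceph.conf on that host. For reference, the dout(10) in question is right at the top of the function: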
334 void PrimaryLogPG::on_local_recover(
335   const hobject_t &hoid,
336   const ObjectRecoveryInfo &_recovery_info,
337   ObjectContextRef obc,
338   bool is_delete,
339   ObjectStore::Transaction *t
340   )
341 {
342   dout(10) << __func__ << ": " << hoid << dendl;

On Wed, Oct 30, 2019 at 11:43 PM Jérémy Gardais <jeremy.gardais@xxxxxxxxxxxxxxx> wrote:
>
> The "best" health I was able to get was:
> HEALTH_ERR norecover flag(s) set; 1733/37482459 objects misplaced (0.005%); 5 scrub errors; Possible data damage: 2 pgs inconsistent; Degraded data redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs undersized
> OSDMAP_FLAGS norecover flag(s) set
> OBJECT_MISPLACED 1733/37482459 objects misplaced (0.005%)
> OSD_SCRUB_ERRORS 5 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
> pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
> PG_DEGRADED Degraded data redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs undersized
> pg 2.3e is active+recovery_wait+degraded, acting [27,31,5]
> pg 2.9d is active+recovery_wait+degraded, acting [27,22,37]
> pg 2.a3 is active+recovery_wait+degraded, acting [27,30,35]
> pg 2.136 is active+recovery_wait+degraded, acting [27,18,22]
> pg 2.150 is active+recovery_wait+degraded, acting [27,19,35]
> pg 2.15e is active+recovery_wait+degraded, acting [27,11,36]
> pg 2.1d9 is stuck undersized for 14023.243179, current state active+undersized+degraded+remapped+backfill_wait, last acting [25,30]
> pg 2.20f is active+recovery_wait+degraded, acting [27,30,2]
> pg 2.2a1 is active+recovery_wait+degraded, acting [27,18,35]
> pg 2.2b7 is active+recovery_wait+degraded, acting [27,18,36]
> pg 2.386 is active+recovery_wait+degraded, acting [27,42,17]
> pg 2.391 is active+recovery_wait+degraded, acting [27,15,36]
> pg 2.448 is stuck undersized for 51520.798900, current state active+recovery_wait+undersized+degraded+remapped, last acting [27,38]
> pg 2.456 is active+recovery_wait+degraded, acting [27,5,43]
> pg 2.45a is active+recovery_wait+degraded, acting [27,43,36]
> pg 2.45f is active+recovery_wait+degraded, acting [27,16,36]
> pg 2.46c is active+recovery_wait+degraded, acting [27,30,38]
> pg 2.4bf is active+recovery_wait+degraded, acting [27,39,18]
> pg 2.522 is active+recovery_wait+degraded, acting [27,17,3]
> pg 2.535 is active+recovery_wait+degraded, acting [27,29,36]
> pg 2.55a is active+recovery_wait+degraded, acting [27,29,18]
> pg 5.23f is active+recovery_wait+degraded, acting [27,39,18]
> pg 5.356 is active+recovery_wait+degraded, acting [27,36,15]
> pg 5.4a6 is active+recovery_wait+degraded, acting [29,40,30]
>
>
> After that, the flapping started again:
> 2019-10-30 12:55:46.772593 mon.r730xd1 [INF] osd.38 failed (root=default,datacenter=IPR,room=11B,rack=baie2,host=r740xd1) (connection refused reported by osd.22)
> 2019-10-30 12:55:46.850239 mon.r730xd1 [INF] osd.27 failed (root=default,datacenter=IPR,room=11B,rack=baie2,host=r730xd3) (connection refused reported by osd.19)
> 2019-10-30 12:55:56.714029 mon.r730xd1 [WRN] Health check update: 2 osds down (OSD_DOWN)
>
>
> Setting the "norecover" flag allows these 2 OSDs to get back to the "up" state and limits the flapping and the many backfills.
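For anyone following along: "norecover" is the standard cluster-wide flag that pauses recovery ops, toggled with the commands below (shown only as context, not as output from this cluster). It needs to be unset again once the crashing OSD is sorted out:

$ ceph osd set norecover
$ ceph osd unset norecover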
>
>
> In both osd.27 and osd.38 logs, I still find these lines just before the FAILED assert:
> -2> 2019-10-30 12:52:31.999571 7fb5c125b700 5 -- 129.20.177.3:6802/870834 >> 129.20.177.3:6808/810999 conn(0x564c31d31000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=97 cs=7 l=0). rx osd.25 seq 2600 0x564c3a02e1c0 MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
> -1> 2019-10-30 12:52:31.999633 7fb5c125b700 1 -- 129.20.177.3:6802/870834 <== osd.25 129.20.177.3:6808/810999 2600 ==== MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 ==== 909+0+0 (4137937376 0 0) 0x564c3a02e1c0 con 0x564c31d31000
> 0> 2019-10-30 12:52:32.008605 7fb59b03d700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fb59b03d700 time 2019-10-30 12:52:32.000033
> /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354: FAILED assert(recovery_info.oi.legacy_snaps.size())
>
> -2> 2019-10-30 12:52:31.998835 7fce6189b700 5 -- 129.20.177.4:6810/2125155 >> 129.20.177.3:6808/810999 conn(0x5648d74a0800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=76 cs=1 l=0).
> rx osd.25 seq 24804 0x5648d708eac0 MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
> -1> 2019-10-30 12:52:31.998899 7fce6189b700 1 -- 129.20.177.4:6810/2125155 <== osd.25 129.20.177.3:6808/810999 24804 ==== MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 ==== 925+0+0 (542499734 0 0) 0x5648d708eac0 con 0x5648d74a0800
> 0> 2019-10-30 12:52:32.003339 7fce4803e700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fce4803e700 time 2019-10-30 12:52:31.999086
> /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354: FAILED assert(recovery_info.oi.legacy_snaps.size())
>
>
> More information about this object (0c16b76b8b4567.000000000001426e) from PG 2.1d9:
> ceph osd map rbd rbd_data.0c16b76b8b4567.000000000001426e
> osdmap e194356 pool 'rbd' (2) object 'rbd_data.0c16b76b8b4567.000000000001426e' -> pg 2.181de9d9 (2.1d9) -> up ([27,30,38], p27) acting ([30,25], p30)
>
> I also checked the logs of the OSDs that already went down and found the same messages about this object:
> * osd.4, last time: 2019-10-10 16:15:20
> * osd.32, last time: 2019-10-14 01:54:56
> * osd.33, last time: 2019-10-11 06:24:01
> * osd.34, last time: 2019-10-18 06:24:26
> * osd.20, last time: 2019-10-27 18:12:31
> * osd.28, last time: 2019-10-28 12:57:47
>
> Whether the data comes from osd.25 or osd.30, I get the same error. It seems this PG/object keeps trying to recover to a healthy state but shuts down my OSDs one by one…
>
> Thus spake Jérémy Gardais (jeremy.gardais@xxxxxxxxxxxxxxx) on Wednesday, 30 October 2019 at 11:09:36:
> > Thus spake Brad Hubbard (bhubbard@xxxxxxxxxx) on Wednesday, 30 October 2019 at 12:50:50:
> > > You should probably try and work out what caused the issue and take
> > > steps to minimise the likelihood of a recurrence. This is not expected
> > > behaviour in a correctly configured and stable environment.
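Coming back to the FAILED assert at PrimaryLogPG.cc:354 quoted above: it sits in the snapshot-handling branch of on_local_recover(). What follows is only a paraphrase from memory of the 12.2 logic, not the verbatim source, but it shows why an empty snapset (the "snapset: 0=[]:[]" visible in the MOSDPGPush entries) is fatal when a clone is being recovered:

  // Paraphrased sketch of the clone path in on_local_recover(), not the exact 12.2.12 code.
  if (hoid.is_snap()) {                     // recovering a clone such as ...:5926
    set<snapid_t> snaps;
    auto p = recovery_info.ss.clone_snaps.find(hoid.snap);
    if (p != recovery_info.ss.clone_snaps.end()) {
      // luminous-style snapset records which snaps this clone belongs to
      snaps.insert(p->second.begin(), p->second.end());
    } else {
      // fall back to the pre-luminous per-object snap list;
      // this is the assert that fires when that list is empty as well
      assert(recovery_info.oi.legacy_snaps.size());
      snaps.insert(recovery_info.oi.legacy_snaps.begin(),
                   recovery_info.oi.legacy_snaps.end());
    }
    // the collected snaps are then fed to the snap mapper for this clone
  }

In other words, the clone being pushed (rbd_data.0c16b76b8b4567.000000000001426e:5926) appears to carry neither clone_snaps in its snapset nor legacy_snaps in its object_info, so every OSD that tries to finish that recovery hits the same abort.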
>
> This PG 2.1d9 is "only" marked as "active+undersized+degraded+remapped+backfill_wait", not inconsistent…
>
> Everything I got from PG 2.1d9 (query, list_missing, ceph-objectstore-tool list,…) is available here:
> https://cloud.ipr.univ-rennes1.fr/index.php/s/BYtuAURnC7YOAQG?path=%2Fpg.2.1d9
> But nothing looks suspicious to me…
>
> I also separated the logs from the last error on osd.27 and its reboot ("only" ~22k lines ^^):
> https://cloud.ipr.univ-rennes1.fr/index.php/s/BYtuAURnC7YOAQG/download?path=%2F&files=ceph-osd.27.log.last.error.txt
>
> Does anybody understand these logs, or do I have to live with this damned object? ^^
>
> --
> Gardais Jérémy
> Institut de Physique de Rennes
> Université Rennes 1
> Phone: 02-23-23-68-60
> Mail & good practices: http://fr.wikipedia.org/wiki/Nétiquette
> -------------------------------

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com