On Thu, 13 Jun 2019, Paul Emmerich wrote:
> Something I had suggested off-list (repeated here if anyone else finds
> themselves in a similar scenario):
>
> since only one PG is dead and the OSD now seems to be alive enough to
> start/mount: consider taking a backup of the affected PG with
>
>   ceph-objectstore-tool --op export --pgid X.YY
>
> (That might also take a loong time)
>
> That export can later be imported into any other OSD if these three dead
> OSDs turn out to be a lost cause.

Yes--this is a great suggestion!  There may also be other PGs that are
stale because all 3 copies land on these 3 OSDs... and for those
less-problematic PGs, importing them elsewhere is comparatively safe.  But
doing those imports on fresh OSD(s) is always a good practice!

sage
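For anyone reconstructing this later, the export/import flow being discussed
might look roughly like the sketch below. It is only an illustration: the osd
ids (NNN, MMM), the pg id X.YY and the dump path are placeholders, and the
OSD daemon in question has to be stopped while ceph-objectstore-tool runs
against its data path.

  # on the ailing OSD host, with ceph-osd@NNN stopped: dump the PG to a file
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
      --op export --pgid X.YY --file /some/big/volume/X.YY.export

  # later, if the original OSDs turn out to be a lost cause: import the dump
  # into a (preferably fresh) OSD MMM, again with that daemon stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-MMM \
      --op import --file /some/big/volume/X.YY.export

Writing the dump to a separate volume with plenty of free space is worth the
trouble, since a bucket-index PG with an oversized omap can be very large and
the export can take a long time.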
> (Risk: importing the PG somewhere else might kill that OSD as well,
> depending on the nature of the problem; I suggested new OSDs as import
> target)
>
> Paul
>
> On Thu, Jun 13, 2019 at 3:52 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> > > Idea received from Wido den Hollander:
> > >   bluestore rocksdb options = "compaction_readahead_size=0"
> > >
> > > With this option, I just tried to start 1 of the 3 crashing OSDs, and
> > > it came up! I did with "ceph osd set noin" for now.
> >
> > Yay!
> >
> > > Later it aborted:
> > >
> > > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
> > > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
> > > 2019-06-13 13:11:11.862 7f2a37982700  0 --1-
> > > v1:[2001:620:5ca1:201::119]:6809/3426631 >>
> > > v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000
> > > 0x564f26d6d800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1
> > > l=0).handle_connect_reply_2 connect got RESETSESSION
> > > 2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
> > > in thread 7f2a19f5f700 thread_name:tp_osd_tp
> > >
> > > ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
> > > 1: (()+0x12890) [0x7f2a3a818890]
> > > 2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
> > > 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x24b) [0x564d732ca2bb]
> > > 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long)+0x255) [0x564d732ca895]
> > > 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0) [0x564d732eb560]
> > > 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
> > > 7: (()+0x76db) [0x7f2a3a80d6db]
> > > 8: (clone()+0x3f) [0x7f2a395ad88f]
> > >
> > > I guess that this is because of pending backfilling and the noin flag?
> > > Afterwards it restarted by itself and came up. I stopped it again for now.
> >
> > I think that increasing the various suicide timeout options will allow
> > it to stay up long enough to clean up the ginormous objects:
> >
> >   ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> >
> > > It looks healthy so far:
> > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > >   fsck success
> > >
> > > Now we have to choose how to continue, trying to reduce the risk of
> > > losing data (most bucket indexes are intact currently). My guess would
> > > be to let this OSD (which was not the primary) go in and hope that it
> > > recovers. In case of a problem, maybe we could still use the other OSDs
> > > "somehow"? In case of success, we would bring back the other OSDs as well?
> > >
> > > OTOH we could try to continue with the key dump from earlier today.
> >
> > I would start all three osds the same way, with 'noout' set on the
> > cluster.  You should try to avoid triggering recovery because it will have
> > a hard time getting through the big index object on that bucket (i.e., it
> > will take a long time, and might trigger some blocked ios and so forth).
> >
> > (Side note that since you started the OSD read-write using the internal
> > copy of rocksdb, don't forget that the external copy you extracted
> > (/mnt/ceph/db?) is now stale!)
> >
> > sage
> >
> > > Any opinions?
> > >
> > > Thanks!
> > >   Harry
> > >
> > > On 13.06.19 09:32, Harald Staub wrote:
> > > > On 13.06.19 00:33, Sage Weil wrote:
> > > > [...]
> > > > > One other thing to try before taking any drastic steps (as described
> > > > > below):
> > > > >
> > > > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck
> > > >
> > > > This gives: fsck success
> > > >
> > > > and the large alloc warnings:
> > > >
> > > > tcmalloc: large alloc 2145263616 bytes == 0x562412e10000 @ 0x7fed890d6887
> > > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > > tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @ 0x7fed890d6887
> > > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > > tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @ 0x7fed890d6887
> > > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > > tcmalloc: large alloc 17162051584 bytes == 0x562792fea000 @ 0x7fed890d6887
> > > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > > tcmalloc: large alloc 13559291904 bytes == 0x562b92eec000 @ 0x7fed890d6887
> > > > 0x562385370229 0x56238537181b 0x562385723a99 0x56238566dd25 0x56238566fa05
> > > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418 0x562385420ae1
> > > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > >
> > > > Thanks!
> > > >   Harry
> > > >
> > > > [...]
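For anyone retracing this thread later, the advice quoted above (keep 'noout'
set, raise the suicide timeout, then start all three OSDs the same way and
revert once the oversized index objects are cleaned up) might look roughly
like the following. This is a sketch only: the osd ids NN1/NN2/NN3 are
placeholders and a systemd-managed deployment is assumed.

  # prevent the restarted OSDs from being marked out while they recover
  ceph osd set noout

  for id in NN1 NN2 NN3; do
      # give the op threads enough headroom to chew through the huge index object
      ceph config set osd.$id osd_op_thread_suicide_timeout 2h
      systemctl start ceph-osd@$id
  done

  # once the OSDs are stable and the big index objects have been dealt with:
  for id in NN1 NN2 NN3; do
      ceph config rm osd.$id osd_op_thread_suicide_timeout
  done
  ceph osd unset noout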