On Thu, 13 Jun 2019, Harald Staub wrote:
> On 13.06.19 15:52, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> [...]
> > I think that increasing the various suicide timeout options will allow
> > it to stay up long enough to clean up the ginormous objects:
> >
> >   ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
>
> ok
>
> > > It looks healthy so far:
> > > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > > fsck success
> > >
> > > Now we have to choose how to continue, trying to reduce the risk of
> > > losing data (most bucket indexes are intact currently). My guess would
> > > be to let this OSD (which was not the primary) go in and hope that it
> > > recovers. In case of a problem, maybe we could still use the other
> > > OSDs "somehow"? In case of success, we would bring back the other OSDs
> > > as well?
> > >
> > > OTOH we could try to continue with the key dump from earlier today.
> >
> > I would start all three osds the same way, with 'noout' set on the
> > cluster. You should try to avoid triggering recovery because it will have
> > a hard time getting through the big index object on that bucket (i.e., it
> > will take a long time, and might trigger some blocked ios and so forth).
>
> This I do not understand, how would I avoid recovery?

Well, simply doing 'ceph osd set noout' is sufficient to avoid recovery, I
suppose. But in any case, getting at least 2 of the existing copies/OSDs
online (assuming your pool's min_size=2) will mean you can finish the
reshard process and clean up the big object without copying the PG
anywhere. I think you may as well do all 3 OSDs this way, then clean up
the big object--that way in the end no data will have to move.

This is Nautilus, right? If you scrub the PGs in question, that will also
now raise the health alert if there are any remaining big omap objects...
if that warning goes away you'll know you're done cleaning up.
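A rough sketch of that sequence, written as a dry-run script that only
prints the commands rather than running them. The OSD ids and the PG id are
placeholders (not values from this thread), and using deep-scrub rather than
a plain scrub is an assumption: as far as I know the large omap check runs
during deep scrub.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands, executes nothing.
# OSD_IDS and PGID are hypothetical placeholders; substitute real values.
OSD_IDS="${OSD_IDS:-NNN}"
PGID="${PGID:-X.YY}"

plan() {
    # Keep the cluster from marking the OSDs out while they are worked on.
    echo "ceph osd set noout"
    # Give each OSD time to get through the huge index object.
    for id in $OSD_IDS; do
        echo "ceph config set osd.$id osd_op_thread_suicide_timeout 2h"
    done
    # After starting the OSDs and cleaning up, re-check the PG.
    # (deep-scrub is an assumption, see the note above.)
    echo "ceph pg deep-scrub $PGID"
    # Look for a remaining large omap object warning in the output.
    echo "ceph health detail"
}

plan
```

Review the printed commands before running any of them by hand; once the
warning is gone, 'ceph osd unset noout' clears the flag again.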
A final rocksdb compaction should then be enough to remove any remaining
weirdness from rocksdb's internal layout.

> > (Side note that since you started the OSD read-write using the internal
> > copy of rocksdb, don't forget that the external copy you extracted
> > (/mnt/ceph/db?) is now stale!)
>
> As suggested by Paul Emmerich (see next E-mail in this thread), I exported
> this PG. It took not that long (20 minutes).

Great :)

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com