Re: rocksdb corruption, stale pg, rebuild bucket index

On Wed, 12 Jun 2019, Harald Staub wrote:
> Also opened an issue about the rocksdb problem:
> https://tracker.ceph.com/issues/40300

Thanks!

The 'rocksdb: Corruption: file is too short' error is the root of the 
problem here. Can you try starting the OSD with 'debug_bluestore=20' and 
'debug_bluefs=20'?  (And attach the logs to the ticket, or use 
ceph-post-file and put the uuid in the ticket.)
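
For example, something along these lines should work (osd.49 is taken 
from the log snippet below, and the log path is just a placeholder; 
adjust both for the affected OSDs):

  # run the OSD in the foreground with verbose bluestore/bluefs logging
  ceph-osd -f -i 49 --debug_bluestore=20 --debug_bluefs=20 \
      --log-file=/var/log/ceph/ceph-osd.49.debug.log

  # upload the resulting log and put the returned uuid in the ticket
  ceph-post-file /var/log/ceph/ceph-osd.49.debug.log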

Thanks!
sage

> 
> On 12.06.19 16:06, Harald Staub wrote:
> > We ended up in a bad situation with our RadosGW (cluster is Nautilus 
> > 14.2.1, 350 OSDs with BlueStore):
> > 
> > 1. There is a bucket with about 60 million objects, without shards.
> > 
> > 2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024
> > 
> > 3. Resharding looked fine at first; it counted up to the number of 
> > objects, but then it hung.
> > 
> > 4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"
> > 
> > 5. Trying to start the OSDs manually led to the same segfaults.
> > 
> > 6. ceph-bluestore-tool repair ...
> > 
> > 7. The repairs all aborted, with the same rocksdb error as above.
> > 
> > 8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and 
> > it contained the index of this big bucket.
> > 
> > Is there any hope of getting these rocksdbs up again?
> > 
> > Otherwise: how would we fix the bucket index pool? Our ideas:
> > 
> > 1. ceph pg $BAD_PG mark_unfound_lost delete
> > 2. rados -p .rgw.buckets ls, search for $BAD_BUCKET_ID and remove these 
> > objects (a rough sketch follows the list). The hope with this step is to 
> > make the following step faster and to avoid another similar problem.
> > 3. radosgw-admin bucket check --check-objects
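> > 
> > The sketch of step 2 (untested; it assumes the data pool object names 
> > start with the bucket marker, that $BAD_BUCKET_ID holds that marker, 
> > and bad_objects.txt is just a scratch file):
> > 
> >   rados -p .rgw.buckets ls | grep "^${BAD_BUCKET_ID}" > bad_objects.txt
> >   while read -r obj; do
> >       rados -p .rgw.buckets rm "$obj"
> >   done < bad_objects.txt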
> > 
> > Will step 3 really rebuild the bucket index? Is it ok to leave the 
> > existing bucket indexes in place? Is it ok to run it for all buckets at 
> > once, or does it have to be run bucket by bucket? Is there a risk that 
> > the indexes not affected by the BAD_PG will be broken afterwards?
> > 
> > Some more details that may be of interest.
> > 
> > ceph-bluestore-tool repair says:
> > 
> > 2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too 
> > short (6139497190 bytes) to be an sstabledb/079728.sst
> > 2019-06-12 11:15:38.345 7f56269670c0 -1 
> > bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:
> > error from fsck: (5) Input/output error
> > 
> > The repairs also showed several warnings like:
> > 
> > tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 
> > 0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 
> > 0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 
> > 0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 
> > 0x56126a1a73a1 0x7f561b228b97 0x56126a28077a
> > 
> > The processes were using around 45 GB of RAM. Fortunately, there was 
> > no out-of-memory kill.
> > 
> >   Harry
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
