Hi Harald,
If the bucket reshard didn't complete, it's most likely one of the new
bucket index shards that got corrupted here, and the original index shard
should still be intact. Does $BAD_BUCKET_ID correspond to the
new/resharded instance id? If so, once the rocksdb/osd issues are
resolved, you should still be able to access and write to the bucket.
The 'radosgw-admin reshard stale-instances list/rm' commands should be
able to detect and clean up after the failed reshard. Without knowing
more about the rocksdb problem, it's hard to tell whether it's safe to
re-reshard.
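For reference, that cleanup is a short command sequence (these are the standard radosgw-admin subcommands; run them only once the OSDs are healthy again):

```shell
# List bucket index instances left behind by stale/failed reshards
radosgw-admin reshard stale-instances list

# Remove the stale instances that the list command found
radosgw-admin reshard stale-instances rm
```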
Casey
On 6/12/19 10:31 AM, Harald Staub wrote:
Also opened an issue about the rocksdb problem:
https://tracker.ceph.com/issues/40300
On 12.06.19 16:06, Harald Staub wrote:
We ended up in a bad situation with our RadosGW (cluster is Nautilus
14.2.1, 350 OSDs with BlueStore):
1. There is a bucket with about 60 million objects, without shards.
2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024
3. Resharding looked fine at first; it counted up to the number of
objects, but then it hung.
4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too
short"
5. Trying to start the OSDs manually led to the same segfaults.
6. ceph-bluestore-tool repair ...
7. The repairs all aborted, with the same rocksdb error as above.
8. Now 1 PG is stale. It belongs to the radosgw bucket index pool,
and it contained the index of this big bucket.
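To see which OSDs host the stale PG before attempting recovery, something like the following could help (standard ceph CLI; $BAD_PG stands for the stale placement group's id):

```shell
# Show unhealthy PGs, including the stale one from the bucket index pool
ceph health detail | grep stale

# Map the bad PG to its acting OSD set
ceph pg map $BAD_PG
```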
Is there any hope of getting these rocksdbs up again?
Otherwise: how would we fix the bucket index pool? Our ideas:
1. ceph pg $BAD_PG mark_unfound_lost delete
2. rados -p .rgw.buckets ls, search for $BAD_BUCKET_ID and remove the
matching objects. The hope with this step is to make the following step
faster and to avoid another similar problem.
3. radosgw-admin bucket check --check-objects
Will this really rebuild the bucket index? Is it ok to leave the
existing bucket indexes in place? Is it ok to run it for all buckets at
once, or does it have to be run bucket by bucket? Is there a risk that
indexes not affected by the BAD_PG will be broken afterwards?
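Step 2 of the ideas above could be sketched as below. The instance id and object names here are made up purely for illustration; on a real cluster the listing would come from `rados -p .rgw.buckets ls`:

```shell
# Hypothetical bucket instance id, for illustration only
BAD_BUCKET_ID="c0ffee.4567.1"

# Simulated listing; on a real cluster this would come from:
#   rados -p .rgw.buckets ls
listing="c0ffee.4567.1_photo.jpg
deadbeef.9.1_other.dat
c0ffee.4567.1_doc.pdf"

# Keep only objects whose names start with the bad bucket's instance id
matches=$(printf '%s\n' "$listing" | grep "^${BAD_BUCKET_ID}_")
printf '%s\n' "$matches"
```

On the real cluster, each matching object would then be removed with `rados -p .rgw.buckets rm <object>`.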
Some more details that may be of interest.
ceph-bluestore-repair says:
2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is
too short (6139497190 bytes) to be an sstable db/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1
bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:
error from fsck: (5) Input/output error
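For context, the repair in step 6 was invoked roughly like this (the OSD path is taken from the log line above; ceph-bluestore-tool is the standard offline BlueStore tool, and the OSD daemon must be stopped first):

```shell
# Offline repair of the failed OSD's BlueStore instance
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-49
```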
The repairs also showed several warnings like:
tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @
0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec
0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344
0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1
0x56126a1a73a1 0x7f561b228b97 0x56126a28077a
The processes showed up with around 45 GB of RAM used. Fortunately,
there was no out-of-memory kill.
Harry
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com