Re: Bucket resharding: "radosgw-admin bi list" ERROR


 



Hi Orit,

We're running on jewel, version 10.2.7.

I've run the bi list command with the debug flags and this is the end of the output:

2017-07-05 08:50:19.705673 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2297 ==== osd_op_reply(2571 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 _ondisk_ = 0) v7 ==== 145+0+385625 (3432176685 0 1775102993) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.724193 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2572 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.767758 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2298 ==== osd_op_reply(2572 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 _ondisk_ = 0) v7 ==== 145+0+385625 (3432176685 0 2330398289) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.786309 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2573 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.827960 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2299 ==== osd_op_reply(2573 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 _ondisk_ = 0) v7 ==== 145+0+385625 (3432176685 0 1724305540) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.846588 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2574 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.870830 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2300 ==== osd_op_reply(2574 .dir.be-east.5582981.76.0 [call] v0'0 uv0 _ondisk_ = -4 ((4) Interrupted system call)) v7 ==== 145+0+0 (798610401 0 0) 0x7ff3b00041a0 con 0x7ff4272f48f0
ERROR: bi_list(): (4) Interrupted system call
2017-07-05 08:50:19.872489 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.112:6822/2795125 -- osd_op(client.5971646.0:2575 24.4322fa9f notify.0 [watch unwatch cookie 140686606221264] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427302b10
2017-07-05 08:50:19.878128 7ff3bf0f7700  1 -- 10.21.4.1:0/3313807338 <== osd.23 10.21.4.112:6822/2795125 63 ==== osd_op_reply(2575 notify.0 [watch unwatch cookie 140686606221264] v31545'6808 uv6416 _ondisk_ = 0) v7 ==== 128+0+0 (3462997515 0 0) 0x7ff3980014f0 con 0x7ff427302b10
2017-07-05 08:50:19.878221 7ff4250219c0 20 remove_watcher() i=0
2017-07-05 08:50:19.878229 7ff4250219c0  2 removed watcher, disabling cache
2017-07-05 08:50:19.878278 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.113:6807/2176843 -- osd_op(client.5971646.0:2576 24.16dafda0 notify.1 [watch unwatch cookie 140686606235888] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427304ae0
2017-07-05 08:50:19.880843 7ff3beef5700  1 -- 10.21.4.1:0/3313807338 <== osd.27 10.21.4.113:6807/2176843 63 ==== osd_op_reply(2576 notify.1 [watch unwatch cookie 140686606235888] v31545'6706 uv6304 _ondisk_ = 0) v7 ==== 128+0+0 (4086455760 0 0) 0x7ff3900014f0 con 0x7ff427304ae0
2017-07-05 08:50:19.880910 7ff4250219c0 20 remove_watcher() i=1
2017-07-05 08:50:19.880940 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6802/3632911 -- osd_op(client.5971646.0:2577 24.88aa5c95 notify.2 [watch unwatch cookie 140686606250416] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff4273083d0
2017-07-05 08:50:19.886387 7ff3becf3700  1 -- 10.21.4.1:0/3313807338 <== osd.1 10.21.4.111:6802/3632911 94 ==== osd_op_reply(2577 notify.2 [watch unwatch cookie 140686606250416] v31545'10057 uv9497 _ondisk_ = 0) v7 ==== 128+0+0 (2583541993 0 0) 0x7ff388001630 con 0x7ff4273083d0
2017-07-05 08:50:19.886476 7ff4250219c0 20 remove_watcher() i=2
2017-07-05 08:50:19.886513 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6814/3633408 -- osd_op(client.5971646.0:2578 24.f8c99aee notify.3 [watch unwatch cookie 140686606264944] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff42730bca0
2017-07-05 08:50:19.888815 7ff3beaf1700  1 -- 10.21.4.1:0/3313807338 <== osd.5 10.21.4.111:6814/3633408 32 ==== osd_op_reply(2578 notify.3 [watch unwatch cookie 140686606264944] v31545'3419 uv3231 _ondisk_ = 0) v7 ==== 128+0+0 (1994465853 0 0) 0x7ff380000a20 con 0x7ff42730bca0
2017-07-05 08:50:19.888893 7ff4250219c0 20 remove_watcher() i=3
2017-07-05 08:50:19.888940 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6802/3632911 -- osd_op(client.5971646.0:2579 24.a204812d notify.4 [watch unwatch cookie 140686606267200] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff4273083d0
2017-07-05 08:50:19.891441 7ff3becf3700  1 -- 10.21.4.1:0/3313807338 <== osd.1 10.21.4.111:6802/3632911 95 ==== osd_op_reply(2579 notify.4 [watch unwatch cookie 140686606267200] v31545'10058 uv9499 _ondisk_ = 0) v7 ==== 128+0+0 (840319076 0 0) 0x7ff388001630 con 0x7ff4273083d0
2017-07-05 08:50:19.891511 7ff4250219c0 20 remove_watcher() i=4
2017-07-05 08:50:19.891545 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.112:6822/2795125 -- osd_op(client.5971646.0:2580 24.31099063 notify.5 [watch unwatch cookie 140686606269328] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427302b10
2017-07-05 08:50:19.893535 7ff3bf0f7700  1 -- 10.21.4.1:0/3313807338 <== osd.23 10.21.4.112:6822/2795125 64 ==== osd_op_reply(2580 notify.5 [watch unwatch cookie 140686606269328] v31545'6809 uv6418 _ondisk_ = 0) v7 ==== 128+0+0 (3042799304 0 0) 0x7ff3980014f0 con 0x7ff427302b10
2017-07-05 08:50:19.893592 7ff4250219c0 20 remove_watcher() i=5
2017-07-05 08:50:19.893624 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.113:6807/2176843 -- osd_op(client.5971646.0:2581 24.97c520d4 notify.6 [watch unwatch cookie 140686606271968] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427304ae0
2017-07-05 08:50:19.895393 7ff3beef5700  1 -- 10.21.4.1:0/3313807338 <== osd.27 10.21.4.113:6807/2176843 64 ==== osd_op_reply(2581 notify.6 [watch unwatch cookie 140686606271968] v31545'6707 uv6306 _ondisk_ = 0) v7 ==== 128+0+0 (1081188691 0 0) 0x7ff3900014f0 con 0x7ff427304ae0
2017-07-05 08:50:19.895454 7ff4250219c0 20 remove_watcher() i=6
2017-07-05 08:50:19.895485 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6802/3632911 -- osd_op(client.5971646.0:2582 24.84ada7c9 notify.7 [watch unwatch cookie 140686606274544] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff427300db0 con 0x7ff4273083d0
2017-07-05 08:50:19.897426 7ff3becf3700  1 -- 10.21.4.1:0/3313807338 <== osd.1 10.21.4.111:6802/3632911 96 ==== osd_op_reply(2582 notify.7 [watch unwatch cookie 140686606274544] v31545'10059 uv9501 _ondisk_ = 0) v7 ==== 128+0+0 (4260323257 0 0) 0x7ff388001630 con 0x7ff4273083d0
2017-07-05 08:50:19.897492 7ff4250219c0 20 remove_watcher() i=7
2017-07-05 08:50:19.898051 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4273083d0 -- 0x7ff427308c50
2017-07-05 08:50:19.898108 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272f48f0 -- 0x7ff4272f5370
2017-07-05 08:50:19.898150 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff42730bca0 -- 0x7ff42730c510
2017-07-05 08:50:19.898203 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272eb9f0 -- 0x7ff4272ea730
2017-07-05 08:50:19.898257 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff42732aee0 -- 0x7ff427329c20
2017-07-05 08:50:19.898296 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272e8480 -- 0x7ff4272e71c0
2017-07-05 08:50:19.898343 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272fd830 -- 0x7ff4272fc570
2017-07-05 08:50:19.898469 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff427302b10 -- 0x7ff427301800
2017-07-05 08:50:19.898518 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272efa00 -- 0x7ff4272ee6f0
2017-07-05 08:50:19.898642 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272f1120 -- 0x7ff4272f20f0
2017-07-05 08:50:19.898690 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff427304ae0 -- 0x7ff427305390
2017-07-05 08:50:19.898739 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272f9cb0 -- 0x7ff4272f89f0
2017-07-05 08:50:19.898780 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272e0600 -- 0x7ff4272df2f0
2017-07-05 08:50:19.898820 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272e1d40 -- 0x7ff4272e2da0
2017-07-05 08:50:19.898938 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down 0x7ff4272d8160 -- 0x7ff4272d6160
2017-07-05 08:50:19.899192 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 mark_down_all
2017-07-05 08:50:19.899568 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 shutdown complete.

I'll have a look at the tracker and open an issue there.
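In case it helps anyone hitting the same EINTR: since each index shard lives in the omap of its own `.dir.<marker>.<shard>` object (visible in the log above), one possible workaround is to dump the keys shard by shard with plain rados instead of one large bi_list call. This is only a sketch: the pool name below is a guess and must be replaced with your actual index pool; the marker and shard count are taken from the log and bucket stats in this thread.

```shell
# Sketch of a shard-by-shard index key dump, sidestepping the single
# large bi_list call that fails with EINTR.
# POOL is a guess -- substitute your real index pool name.
# MARKER and SHARDS come from the log above / bucket stats.
POOL=".be-east.rgw.buckets.index"
MARKER="be-east.5582981.76"
SHARDS=16

# print one "rados listomapkeys" command per shard object .dir.<marker>.<N>
gen_backup_cmds() {
    i=0
    while [ "$i" -lt "$SHARDS" ]; do
        echo "rados -p $POOL listomapkeys .dir.${MARKER}.${i} > shard.${i}.keys"
        i=$((i + 1))
    done
}

gen_backup_cmds        # review the commands first, then: gen_backup_cmds | sh
```

Printing the commands first (rather than running rados directly) makes it easy to sanity-check the object names against `rados -p <pool> ls` before touching anything.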

Regards,
Maarten

On Wed, Jul 5, 2017 at 8:26 AM, Orit Wasserman <owasserm@xxxxxxxxxx> wrote:
Hi Maarten,

On Tue, Jul 4, 2017 at 9:46 PM, Maarten De Quick <mdequick85@xxxxxxxxx> wrote:
Hi,

Background: We're having issues with our index pool (slow requests / timeouts cause an OSD to crash, and the resulting recovery causes application issues). We know we have very big buckets (e.g. a bucket of 77 million objects with only 16 shards) that need resharding, so we were looking at the resharding process.

The first thing we'd like to do is make a backup of the bucket index, but this failed with:

# radosgw-admin -n client.radosgw.be-west-3 bi list --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
ERROR: bi_list(): (4) Interrupted system call


What version are you using?
Can you rerun the command with --debug-rgw=20 --debug-ms=1?
Also please open a tracker issue (for rgw) with all the information.

Thanks,
Orit 

When I grep for "idx" in the backup file and count the matches:
 # grep idx priv-prod-up-alex.list.backup | wc -l
2294942
When I do a bucket stats for that bucket I get:
# radosgw-admin -n client.radosgw.be-west-3 bucket stats --bucket=priv-prod-up-alex | grep num_objects
2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
            "num_objects": 20148575

It looks like roughly 18 million entries are missing, so the backup appears to be incomplete (not sure if that's a correct assumption?). We're also afraid that the resharding command will hit the same issue.
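For the record, the shortfall works out as follows (a trivial sketch; the two numbers are the num_objects from bucket stats and the grep count above):

```shell
# missing_entries NUM_OBJECTS LISTED -> prints how many index entries
# the backup is short of the bucket's reported object count
missing_entries() {
    echo $(( $1 - $2 ))
}

# values from this thread: 20148575 (bucket stats) vs 2294942 (grep -c idx)
missing_entries 20148575 2294942   # prints 17853633
```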
Has anyone seen this behaviour before or any thoughts on how to fix it?

We were also wondering if we really need the backup. As the resharding process creates a completely new index and keeps the old one, would it be possible to relink the bucket to the old index in case of problems? Or am I missing something important here?

Any help would be greatly appreciated, thanks!

Regards,
Maarten

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



