Re: Problem unusable after deleting pool with billion objects

Hi Igor,

Thank you, I also think it is the problem you described.

I have now recreated the OSDs and also noticed strange warnings:

HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded (14766.667%)

Maybe there are some "phantom", zero-sized objects (OMAPs?) that the cluster is recovering, but I don't need them (they are not listed in ceph df).
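For reference, the percentage in that warning is just the ratio of the two counts, so anything above 100% means the first number (degraded) is larger than the second (total). A minimal sketch of that arithmetic, assuming Ceph computes it as a plain ratio (the printed figures are consistent with that); the helper name is illustrative only:

```python
def degraded_percent(degraded: int, total: int) -> float:
    """Percentage as shown in the HEALTH_WARN line: degraded / total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * degraded / total, 3)


pct = degraded_percent(106763, 723)
print(f"{pct}%")   # matches the 14766.667% in the warning
print(pct > 100)   # more degraded entries than the reported total
```

A result over 100% would fit the suspicion above that the cluster is tracking objects it no longer reports as existing.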

You mentioned the DB vs. main device ratio (1:11), but I'm not separating the DB from the data device: each device has its own RocksDB on it.

With regards
Jan Pekar

On 11/09/2020 14.36, Igor Fedotov wrote:

Hi Jan,

Most likely this is a known issue with the slow and inefficient pool removal procedure in Ceph.

I gave a presentation on the topic at yesterday's weekly performance meeting; a recording will presumably be available in a couple of days.

An additional, related issue not covered during the meeting is RocksDB's misbehavior after (or during) such massive removals. At some point it starts to slow down the handling of read operations (e.g. collection listing), which results in OSD suicide timeouts. That is exactly what is observed in your case. There have been multiple discussions of this issue on this mailing list too. In short, the current workaround is to perform a manual DB compaction using ceph-kvstore-tool. Pool removal will most likely continue after the restart, so one might hit similar assertions again after a while. Hence there might be a need for multiple "compaction-restart" iterations until the pool is finally removed.
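One "compaction-restart" cycle for a single OSD can be sketched as a small Python wrapper that prints the plan by default and only runs it with dry_run=False. This is a sketch under assumptions not stated in the thread: a package-based install with systemd units named ceph-osd@<id>, the default data path seen in the log (/var/lib/ceph/osd/ceph-42), and setting noout for the duration. The helper names compaction_commands and compact_osd are hypothetical; the ceph-kvstore-tool invocation uses its bluestore-kv store type and compact command, which requires the OSD to be stopped.

```python
import subprocess
from typing import List


def compaction_commands(osd_id: int) -> List[List[str]]:
    """Hypothetical helper: one offline RocksDB compaction cycle for a BlueStore OSD.

    Assumes systemd units named ceph-osd@<id> and the default data path
    /var/lib/ceph/osd/ceph-<id>.
    """
    osd_path = f"/var/lib/ceph/osd/ceph-{osd_id}"
    return [
        ["ceph", "osd", "set", "noout"],              # assumption: avoid data movement while the OSD is down
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],  # the DB must not be open by a running ceph-osd
        ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"],
        ["systemctl", "start", f"ceph-osd@{osd_id}"],
        ["ceph", "osd", "unset", "noout"],
    ]


def compact_osd(osd_id: int, dry_run: bool = True) -> List[str]:
    """Return the commands as strings; execute them only when dry_run is False."""
    planned = []
    for cmd in compaction_commands(osd_id):
        planned.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return planned


if __name__ == "__main__":
    for line in compact_osd(42):  # dry run: only prints the plan
        print(line)
```

As described above, the cycle may have to be repeated for an OSD each time it hits the timeout again, until the pool is fully removed.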


And yet another potential issue (or at least an additional factor) with your setup is a pretty high DB vs. main device ratio (1:11). Deletion running on multiple OSDs at once puts a pretty high load on the DB volume, which becomes overburdened...


Thanks,

Igor

On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:
Hi all,

I have built a testing cluster with 4 hosts, with 1 SSD and 11 HDDs on each host.
Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.

Because we want to store small objects, I set bluestore_min_alloc_size to 8192 (this may be important in this case).
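A simplified model of why a smaller allocation unit helps with small objects: if each object's data occupies a whole number of min_alloc_size units, a 1 KiB object costs 8 KiB at 8192 but 64 KiB at 65536 (to my knowledge the latter is the Nautilus default for HDD OSDs, bluestore_min_alloc_size_hdd, shown here only for comparison). The sketch ignores replication, metadata (onodes, omap), compression and deferred writes, and the 1 KiB object size is a hypothetical workload, not a figure from this thread.

```python
KiB = 1024
TiB = 1024 ** 4


def allocated_bytes(object_size: int, min_alloc_size: int) -> int:
    """Simplified model: data is allocated in whole min_alloc_size units.

    Ignores metadata, compression, replication and deferred writes.
    """
    if object_size <= 0:
        return 0
    units = -(-object_size // min_alloc_size)  # ceiling division on integers
    return units * min_alloc_size


objects = 1_000_000_000   # "approx billion" objects, from the thread
object_size = 1 * KiB     # hypothetical object size, not from the thread
for alloc in (8 * KiB, 64 * KiB):
    total = objects * allocated_bytes(object_size, alloc)
    print(f"min_alloc_size={alloc // KiB} KiB: {total / TiB:.1f} TiB allocated")
```

Under these assumptions the loop reports about 7.5 TiB at 8 KiB versus 59.6 TiB at 64 KiB of raw allocation per copy.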

I filled it through RADOS Gateway with approximately a billion small objects. After the tests I changed min_alloc_size back, deleted the RADOS pools (to empty the whole cluster) and waited for the cluster to delete the data from the OSDs, but that destabilized the cluster. It never reached HEALTH_OK. OSDs were killed in random order. I can start them again, but they drop out of the cluster again with:

```
   -18> 2020-09-05 22:11:19.430 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -17> 2020-09-05 22:11:19.430 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
   -16> 2020-09-05 22:11:20.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -15> 2020-09-05 22:11:21.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -14> 2020-09-05 22:11:22.258 7f7a2b81f700  5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce18290000/0x2d08c0000/0x1d180000000, data 0x23143355/0x974a0000, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
   -13> 2020-09-05 22:11:22.438 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -12> 2020-09-05 22:11:23.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -11> 2020-09-05 22:11:24.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -10> 2020-09-05 22:11:24.442 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
    -9> 2020-09-05 22:11:24.442 7f7a2e024700  0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000000::::0# end #MAX# max 2147483647
    -8> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
    -7> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150
    -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0
    -5> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00
    -4> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980
    -3> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80
    -2> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset con 0x555b15960480 session 0x555a9f9d6f80
    -1> 2020-09-05 22:11:24.446 7f7a3c494700  3 osd.42 103257 handle_osd_map epochs [103258,103259], i have 103257, src has [83902,103259]
     0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) **
```
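The dump shows the chain described in the reply: _collection_list takes 151.113 s, longer than the 150 s suicide timeout of the osd_op_tp thread, and the OSD then aborts. (The 15 and 150 values appear to correspond to the osd_op_thread_timeout and osd_op_thread_suicide_timeout options, though that mapping is an assumption here.) A minimal, hypothetical helper that pulls those two numbers out of such a dump, with regexes written only against the line formats shown above:

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the BlueStore slow-operation line, e.g.
# "... slow operation observed for _collection_list, latency = 151.113s, ..."
LATENCY_RE = re.compile(r"slow operation observed for (\w+), latency = ([\d.]+)s")
# Matches the heartbeat_map line reporting the suicide timeout, e.g.
# "... had suicide timed out after 150"
SUICIDE_RE = re.compile(r"had suicide timed out after (\d+)")


def first_match(pattern: re.Pattern, lines: List[str]) -> Optional[re.Match]:
    """Return the first match of pattern in lines, or None."""
    for line in lines:
        m = pattern.search(line)
        if m:
            return m
    return None


def explain_abort(lines: Iterable[str]) -> Optional[Tuple[str, float, int, bool]]:
    """Return (operation, latency_s, suicide_timeout_s, exceeded) or None."""
    buffered = list(lines)  # scanned twice below
    lat = first_match(LATENCY_RE, buffered)
    sui = first_match(SUICIDE_RE, buffered)
    if lat is None or sui is None:
        return None
    op, latency = lat.group(1), float(lat.group(2))
    timeout = int(sui.group(1))
    return op, latency, timeout, latency > timeout


# Two lines taken from the dump above (prefixes abbreviated).
SAMPLE = [
    "bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed "
    "for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head",
    "heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' "
    "had suicide timed out after 150",
]

print(explain_abort(SAMPLE))
```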

I have approximately 12 OSDs down with this error.

I decided to wipe the problematic OSDs, so I can no longer debug it, but I'm curious what I did wrong (deleting a pool with many small objects?) or what to do next time.

I have done this before, but not with a billion objects and without changing bluestore_min_alloc_size, and it worked without problems.

With regards
Jan Pekar

--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326
============
--

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx