Jan,
please see inline
On 9/11/2020 4:13 PM, Jan Pekař - Imatic wrote:
Hi Igor,
thank you, I also think that it is the problem you described.
I recreated the OSDs now and also noticed strange warnings -
HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded
(14766.667%)
Maybe there are some "phantom", zero-sized objects (OMAPs?) that the
cluster is recovering, but I don't need them (they are not listed in ceph df).
The above looks pretty weird, but I don't know what's happening here...
You mentioned the DB vs. main devices ratio (1:11) - I'm not separating the DB
from the device - each device has its own RocksDB on it.
Are you saying that the DB is colocated with the main data and resides on HDD?
If so, this is another significant (or maybe the major) trigger for the
issue. RocksDB + HDD is a bad pair for handling high-load DB operations,
which is exactly what bulk pool removal is.
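For reference, one way to check where the DB actually lives (a sketch, using osd.42 from the log below; these are the metadata fields Nautilus OSDs report):
```
# shows whether the RocksDB/BlueFS DB shares the (rotational) main device
ceph osd metadata 42 | grep -E 'bluefs_dedicated_db|bluefs_single_shared_device|bluestore_bdev_rotational'
```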
With regards
Jan Pekar
On 11/09/2020 14.36, Igor Fedotov wrote:
Hi Jan,
most likely this is a known issue with the slow and ineffective pool
removal procedure in Ceph.
I gave a presentation on the topic at yesterday's weekly performance
meeting; presumably a recording will be available in a couple of days.
An additional accompanying issue, not covered during this meeting, is
RocksDB's misbehavior after (or during) such massive removals. At some
point it starts to slow down the handling of read operations (e.g.
collection listing), which results in OSD suicide timeouts - exactly
what is observed in your case. There have been multiple discussions on
this issue in this mailing list too. In short, the current workaround is
to perform a manual DB compaction using ceph-kvstore-tool. Pool removal
will most likely proceed afterwards, so one might face similar assertions
again after a while. Hence multiple "compaction-restart" iterations might
be needed until the pool is finally removed.
And yet another potential issue (or at least an additional factor) with
your setup is a pretty high DB vs. main devices ratio (1:11). Removal
procedures running on multiple OSDs result in a pretty high load on the
DB volume, which becomes overburdened...
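One rough way to see whether the DB volume is the bottleneck while the removal is running is to watch device utilization on the node (a sketch; which device backs the DB volume depends on your layout):
```
# look at the %util / await columns for the device hosting the DB volume
iostat -x 5
```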
Thanks,
Igor
On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:
Hi all,
I have built a testing cluster with 4 hosts, 1 SSD and 11 HDDs on each
host, running ceph version 14.2.10
(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.
Because we want to store small objects, I set
bluestore_min_alloc_size to 8192 (it is maybe important in this case).
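For reference, a sketch of how such an override is typically applied via the config database (it could equally be set in ceph.conf); note that bluestore_min_alloc_size only takes effect when an OSD is created (mkfs), not on already existing OSDs:
```
ceph config set osd bluestore_min_alloc_size 8192
```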
I filled it through the rados gateway with approximately a billion small
objects. After the tests I changed min_alloc_size back and deleted the
rados pools (to empty the whole cluster), and I was waiting until the
cluster deleted the data from the OSDs, but that destabilized the cluster.
I never reached HEALTH_OK. OSDs were killed in random order. I can start
them again, but they drop out of the cluster again with:
```
-18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce18290000/0x2d08c0000/0x1d180000000, data 0x23143355/0x974a0000, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
-13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000000::::0# end #MAX# max 2147483647
-8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
-7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150
-6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0
-5> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00
-4> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980
-3> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80
-2> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15960480 session 0x555a9f9d6f80
-1> 2020-09-05 22:11:24.446 7f7a3c494700 3 osd.42 103257 handle_osd_map epochs [103258,103259], i have 103257, src has [83902,103259]
0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) **
```
I have approx 12 OSDs down with this error.
I decided to wipe the problematic OSDs, so I cannot debug it further, but
I'm curious what I did wrong (deleting a pool with many small objects?)
and what to do next time.
I did that before, but not with a billion objects and without the
bluestore_min_alloc_size change, and it worked without problems.
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326
============
--
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx