Jan,
please see inline
On 9/11/2020 4:13 PM, Jan Pekař - Imatic wrote:
Hi Igor,
thank you, I also think that it is the problem you described.
I recreated the OSDs now and also noticed strange warnings -
HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded
(14766.667%)
Maybe there are some "phantom", zero-sized objects (OMAPs?) that the
cluster is recovering, but I don't need them (they are not listed in ceph df).
The above looks pretty weird, but I don't know what's happening here...
You mentioned the DB vs. main devices ratio (1:11) - I'm not separating the DB
from the device - each device has its own RocksDB on it.
Are you saying that the DB is colocated with the main data and resides on HDD?
If so, this is another significant (or maybe the major) trigger for the
issue. RocksDB + HDD is a bad pair for handling high-load DB operations,
which is exactly what bulk pool removal is.
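For reference, one way to check where the DB actually lives (a sketch, using osd.42 from the log below; these are the metadata fields Nautilus OSDs report):
```
# shows whether the RocksDB/BlueFS DB shares the (rotational) main device
ceph osd metadata 42 | grep -E 'bluefs_dedicated_db|bluefs_single_shared_device|bluestore_bdev_rotational'
```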
With regards
Jan Pekar
On 11/09/2020 14.36, Igor Fedotov wrote:
Hi Jan,
most likely this is a known issue with the slow and ineffective pool
removal procedure in Ceph.
I gave a presentation on the topic at yesterday's weekly performance
meeting; presumably a recording will be available in a couple of days.
An additional accompanying issue, not covered during this meeting, is
RocksDB's misbehavior after (or during) such massive removals. At some
point it starts to slow down the handling of read operations (e.g.
collection listing), which results in OSD suicide timeouts - exactly
what is observed in your case. There have been multiple discussions on
this issue in this mailing list too. In short, the current workaround is
to perform a manual DB compaction using ceph-kvstore-tool. Pool removal
will most likely proceed afterwards, so one might face similar assertions
again after a while. Hence multiple "compaction-restart" iterations might
be needed until the pool is finally removed.
And yet another potential issue (or at least an additional factor) with
your setup is a pretty high DB vs. main devices ratio (1:11). Removal
procedures running on multiple OSDs result in a pretty high load on the
DB volume, which becomes overburdened...
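One rough way to see whether the DB volume is the bottleneck while the removal is running is to watch device utilization on the node (a sketch; which device backs the DB volume depends on your layout):
```
# look at the %util / await columns for the device hosting the DB volume
iostat -x 5
```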
Thanks,
Igor
On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:
Hi all,
I have built a testing cluster with 4 hosts, 1 SSD and 11 HDDs on each
host, running ceph version 14.2.10
(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.
Because we want to store small objects, I set
bluestore_min_alloc_size to 8192 (it is maybe important in this case).
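For reference, a sketch of how such an override is typically applied via the config database (it could equally be set in ceph.conf); note that bluestore_min_alloc_size only takes effect when an OSD is created (mkfs), not on already existing OSDs:
```
ceph config set osd bluestore_min_alloc_size 8192
```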
I filled it through the rados gateway with approximately a billion small
objects. After the tests I changed min_alloc_size back and deleted the
rados pools (to empty the whole cluster), and I was waiting until the
cluster deleted the data from the OSDs, but that destabilized the cluster.
I never reached HEALTH_OK. OSDs were killed in random order. I can start
them again, but they drop out of the cluster again with:
```
-18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce18290000/0x2d08c0000/0x1d180000000, data 0x23143355/0x974a0000, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
-13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000000::::0# end #MAX# max 2147483647
-8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
-7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150
-6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0
-5> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00
-4> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980
-3> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80
-2> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15960480 session 0x555a9f9d6f80
-1> 2020-09-05 22:11:24.446 7f7a3c494700 3 osd.42 103257 handle_osd_map epochs [103258,103259], i have 103257, src has [83902,103259]
0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) **
```
I have approx 12 OSDs down with this error.
I decided to wipe the problematic OSDs, so I cannot debug it further, but
I'm curious what I did wrong (deleting a pool with many small objects?)
and what to do next time.
I did that before, but not with a billion objects and without the
bluestore_min_alloc_size change, and it worked without problems.
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326
============
--
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx