Hi all,
I have built a testing cluster with 4 hosts, with 1 SSD and 11 HDDs on each host.
Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.
Because we want to store small objects, I set bluestore_min_alloc_size to 8192 (this may be important in this case).
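If it matters, I applied it roughly like this before creating the OSDs (a sketch from memory; as far as I know the value is baked in at OSD mkfs time, so it only affects OSDs created afterwards):
```
# set before OSD creation; bluestore_min_alloc_size only takes effect at OSD mkfs time
ceph config set osd bluestore_min_alloc_size 8192
```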
I filled it through the rados gateway with approximately a billion small objects. After the tests I changed min_alloc_size back and deleted the rados pools
(to empty the whole cluster), then waited for the cluster to delete the data from the OSDs, but that destabilized the cluster. I never reached
HEALTH_OK. OSDs were being killed in random order. I can start them again, but they drop out of the cluster again with:
```
-18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce18290000/0x2d08c0000/0x1d180000000, data 0x23143355/0x974a0000, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
-13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282
-10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
-9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000000::::0# end #MAX# max 2147483647
-8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
-7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150
-6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0
-5> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00
-4> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980
-3> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80
-2> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15960480 session 0x555a9f9d6f80
-1> 2020-09-05 22:11:24.446 7f7a3c494700 3 osd.42 103257 handle_osd_map epochs [103258,103259], i have 103257, src has [83902,103259]
0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) **
```
I have approximately 12 OSDs down with this error.
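From the log it looks like _collection_list on PG 5.47 took 151 s, just over the op thread's 150 s suicide timeout, so the OSD aborted itself. If that is the whole story, perhaps raising the timeouts would have let the OSDs survive the mass delete; something like this (an untested idea, the values are guesses):
```
# untested: give slow collection listings more headroom during a mass delete
ceph config set osd osd_op_thread_timeout 60
ceph config set osd osd_op_thread_suicide_timeout 600
```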
I decided to wipe the problematic OSDs, so I can no longer debug them, but I'm curious what I did wrong (deleting a pool with many small objects?) and what I
should do differently next time.
I have done this before, though without a billion objects and without changing bluestore_min_alloc_size, and it worked without problems.
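For completeness, the deletion itself was just the standard pool removal, roughly like this (the pool names here are the default RGW ones, for illustration only):
```
# the mon must be told to allow pool deletion first
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete default.rgw.buckets.data default.rgw.buckets.data --yes-i-really-really-mean-it
```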
With regards
Jan Pekar
--
============
Ing. Jan Pekař
jan.pekar@xxxxxxxxx
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326
============
--
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx