Hello,
During the night the osd.16 crashed after hitting a suicide timout. Thus
this morning I did a ceph-kvstore-tool compact and restarted the osd.
I thus compared the results of ceph daemon osd.16 perf dump I had before
(i.e. yesterday) and now (after compaction). I noticed a interresting
difference in msgr_active_connections. Before the compaction it was, for
all AsyncMessenger::Worker-0, 1 and 2 at a crasy value
(18446744073709550998) and get back to something comparable to what I
have for other osds (72).
Does this helps you to identify the problem ?
F.
Le 31/08/2020 à 15:59, Wido den Hollander a écrit :
On 31/08/2020 15:44, Francois Legrand wrote:
Thanks Igor for your answer,
We could try do a compaction of RocksDB manually, but it's not clear
to me if we have to compact on the mon with something like
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/mon01/store.db/ compact
or on the concerned osd with
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-16/ compact
(or for all osd with a script like in
https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )
You would compact the OSDs, not the MONs. So the last command or my
script which you linked there.
For my culture, how does compaction works ? Is it done automatically
in background, regularly, at startup ?
Usually it's done by the OSD in the background, but sometimes an
offline compact works best.
Because in the logs of the osd we have every 10mn some reports about
compaction (which suggests that compaction occurs regularly), like :
Yes, that is normal. But the offline compaction is sometimes more
effective than the online ones are.
2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:777]
------- DUMPING STATS -------
2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 449404.8 total, 600.0 interval
Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0
writes per commit group, ingest: 0.28 GB, 0.00 MB/s
Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync,
written: 0.28 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes
per commit group, ingest: 0.22 MB, 0.00 MB/s
Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written:
0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB)
Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 60.48 MB 0.2 0.0 0.0 0.0 0.1
0.1 0.0 1.0 0.0 163.7 0.52 0.40 2
0.258 0 0
L1 0/0 0.00 KB 0.0 0.1 0.1 0.0 0.1
0.1 0.0 0.5 48.2 26.1 2.32 0.64 1
2.319 920K 197K
L2 17/0 1.00 GB 0.8 1.1 0.1 1.1 1.1
0.0 0.0 18.3 69.8 67.5 16.38 4.97 1
16.380 4747K 82K
L3 81/0 4.50 GB 0.9 0.6 0.1 0.5 0.3
-0.2 0.0 4.3 66.9 36.6 9.23 4.95 2
4.617 9544K 802K
L4 285/0 16.64 GB 0.1 2.4 0.3 2.0 0.2
-1.8 0.0 0.8 110.3 11.7 21.92 4.37 5
4.384 12M 12M
Sum 384/0 22.20 GB 0.0 4.2 0.6 3.6 1.8
-1.8 0.0 21.8 85.2 36.6 50.37 15.32 11
4.579 28M 13M
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.00 0.00 0
0.000 0 0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB)
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Low 0/0 0.00 KB 0.0 4.2 0.6 3.6 1.7
-1.9 0.0 0.0 86.0 35.3 49.86 14.92 9
5.540 28M 13M
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.1
0.1 0.0 0.0 0.0 150.2 0.40 0.40 1
0.403 0 0
User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 211.7 0.11 0.00 1
0.114 0 0
Uptime(secs): 449404.8 total, 600.0 interval
Flush(GB): cumulative 0.083, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read,
0.01 MB/s read, 50.4 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read,
0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction,
0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for
pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0
memtable_compaction, 0 memtable_slowdown, interval 0 total count
Concerning the data removal, I don't know if this could be the
trigger. We had some osd marked down before starting the removal, but
at this epoch the situation was so confuse that I cannot be sure that
the origin of the problem was the same. Indeed, the large data
removal concerned datas in an old pool which have been destroyed
since (thus all the pg of this ancient pool no longer exists). And it
seems that now the cluster is rather inactive due to holidays time
(only a few KB/s of io and no recover).
We have no standalone fast drive for DB/WAL and nothing in the osds
(nor mons) logs suggesting any problem (apart the heartbeat_map
is_healthy timeout).
Can you also check if during compaction the %util of the disks spike
to 100%? I think this is the case. Your DB isn't offloaded to an SSD
and this probably causes the HDD to become overloaded with IOps.
Wido
Thanks
F.
Le 31/08/2020 à 12:15, Igor Fedotov a écrit :
Hi Francois,
given that slow operations are observed for collection listings you
might want to manually compact RocksDB using ceph-kvstore-tool.
The observed slowdown tends to happen after massive data removals.
I've seen multiple compains about this issue including some post in
this mailing list. BTW I can see your post from Jun 24 about slow
pool removal - couldn't this be a trigger?
Also wondering whether you have standalone fast(SSD/NVMe) drive for
DB/WAL? Aren't there any BlueFS spillovers which might be relevant?
Thanks,
Igor
On 8/28/2020 11:33 AM, Francois Legrand wrote:
Hi all,
We have a ceph cluster in production with 6 osds servers (with
16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster
networks are in 10GB and works well.
After a major crash in april, we turned the option
bluefs_buffered_io to false to workaround the large write bug when
bluefs_buffered_io was true (we were in version 14.2.8 and the
default value at this time was true).
Since that time, we regularly have some osds wrongly marked down by
the cluster after heartbeat timeout (heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).
Generally the osd restart and the cluster is back healthy, but
several time, after many of these kick-off the osd reach the
osd_op_thread_suicide_timeout and goes down definitely.
We increased the osd_op_thread_timeout and
osd_op_thread_suicide_timeout... The problems still occurs (but
less frequently).
Few days ago, we upgraded to 14.2.11 and revert the timeout to
their default value, hoping that it will solve the problem (we
thought that it should be related to this bug
https://tracker.ceph.com/issues/45943), but it didn't. We still
have some osds wrongly marked down.
Can somebody help us to fix this problem ?
Thanks.
Here is an extract of an osd log at failure time:
---------------------------------
2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log
[DBG] : 44.7d scrub starts
2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
this last line is repeated more than 1000 times
...
2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:20:17.551 7f03f1384700 0
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
observed for _collection_list, latency = 67.3532s, lat = 67s cid
=44.7d_head start GHMAX end GHMAX max 25
...
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.774 7f03f1384700 0
bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
observed for _collection_list, latency = 63.223s, lat = 63s cid
=44.7d_head start
#44:beffc78d:::rbd_data.1e48e8ab988992.00000000000011bd:0# end
#MAX# max 2147483647
2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout
'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log
[DBG] : 44.7d scrub ok
2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log
[WRN] : Monitor daemon marked osd.16 down, but it is still running
2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log
[DBG] : map e609411 wrongly marked me down at e609410
2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411
start_waiting_for_healthy
2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot
2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410
pg[36.3d0( v 609409'481293 (449368'478292,609409'481293]
local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72]
r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198
crt=609409'481293 lcod 609409'481292 active mbc={}]
start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16]
-> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 ->
-1, features acting 4611087854031667199 upacting 4611087854031667199
...
2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411
pg[36.56( v 609409'480511 (449368'477424,609409'480511]
local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609410/609410/609410)
[103,102] r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511
lcod 609409'480510 unknown NOTIFY mbc={}] state<Start>:
transitioning to Stray
2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
set_numa_affinity public network em1 numa node 0
2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
set_numa_affinity cluster network em2 numa node 0
2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
set_numa_affinity objectstore and network numa nodes do not match
2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
set_numa_affinity not setting numa affinity
2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking
mon for new map
2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state:
booting -> active
2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414
pg[36.20( v 609409'483167 (449368'480117,609409'483167]
local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609414/609414/609361)
[97,16,72] r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167
lcod 609409'483166 unknown NOTIFY mbc={}] start_peering_interval up
[97,72] -> [97,16,72], acting [97,72] -> [97,16,72], acting_primary
97 -> 97, up_primary 97 -> 97, role -1 -> 1, features acting
4611087854031667199 upacting 4611087854031667199
...
2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609414/609414/609414)
[16,34,21] r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796
lcod 609409'479795 mlcod 0'0 unknown NOTIFY mbc={}]
start_peering_interval up [34,21] -> [16,34,21], acting [34,21] ->
[16,34,21], acting_primary 34 -> 16, up_primary 34 -> 16, role -1
-> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
609403/609403 les/c/f 609404/609404/0 609414/609414/609414)
[16,34,21] r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796
lcod 609409'479795 mlcod 0'0 unknown mbc={}] state<Start>:
transitioning to Primary
2020-08-28 02:21:24.738 7f03f1384700 0 log_channel(cluster) log
[DBG] : 36.2f3 scrub starts
2020-08-28 02:22:18.857 7f03f1384700 0 log_channel(cluster) log
[DBG] : 36.2f3 scrub ok
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx