Ok, I compacted one OSD.
The utilization is back to normal, so that's good... Thumbs up to
you guys!
Though, there's one thing I want to get out of the way before doing
the same on the other OSDs:
when I now pull the RocksDB stats, my L1, L2 and L3 are gone:
db_statistics {
"rocksdb_compaction_statistics": "",
"": "",
"": "** Compaction Stats [default] **",
"": "Level   Files   Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop",
"": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
"": "  L0     1/0    968.45 KB   0.2      0.0      0.0      0.0       0.0       0.0       0.0     1.0      0.0     105.1      0.01            0.00            1      0.009      0      0",
"": "  L4  1557/0     98.10 GB   0.4      0.0      0.0      0.0       0.0       0.0       0.0     0.0      0.0       0.0      0.00            0.00            0      0.000      0      0",
"": " Sum  1558/0     98.10 GB   0.0      0.0      0.0      0.0       0.0       0.0       0.0     1.0      0.0     105.1      0.01            0.00            1      0.009      0      0",
"": " Int     0/0      0.00 KB   0.0      0.0      0.0      0.0       0.0       0.0       0.0     1.0      0.0     105.1      0.01            0.00            1      0.009      0      0",
"": "",
"": "** Compaction Stats [default] **",
"": "Priority  Files   Size     Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop",
"": "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
"": "User      0/0    0.00 KB    0.0      0.0      0.0      0.0       0.0       0.0       0.0     0.0      0.0     105.1      0.01            0.00            1      0.009      0      0",
"": "Uptime(secs): 0.3 total, 0.3 interval",
"": "Flush(GB): cumulative 0.001, interval 0.001",
"": "AddFile(GB): cumulative 0.000, interval 0.000",
"": "AddFile(Total Files): cumulative 0, interval 0",
"": "AddFile(L0 Files): cumulative 0, interval 0",
"": "AddFile(Keys): cumulative 0, interval 0",
"": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
"": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
"": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
"": "",
"": "** File Read Latency Histogram By Level [default] **",
"": "** Level 0 read latency histogram (micros):",
"": "Count: 5 Average: 69.2000 StdDev: 85.92",
"": "Min: 0 Median: 1.5000 Max: 201",
"": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00 P99.99: 201.00",
"": "------------------------------------------------------",
"": "[   0,   1 ]    2  40.000%  40.000% ########",
"": "(   1,   2 ]    1  20.000%  60.000% ####",
"": "( 110, 170 ]    1  20.000%  80.000% ####",
"": "( 170, 250 ]    1  20.000% 100.000% ####",
"": "",
"": "** Level 4 read latency histogram (micros):",
"": "Count: 4664 Average: 0.6895 StdDev: 0.82",
"": "Min: 0 Median: 0.5258 Max: 27",
"": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99: 13.83",
"": "------------------------------------------------------",
"": "[   0,   1 ]  4435  95.090%  95.090% ###################",
"": "(   1,   2 ]   149   3.195%  98.285% #",
"": "(   2,   3 ]    55   1.179%  99.464% ",
"": "(   3,   4 ]    12   0.257%  99.721% ",
"": "(   4,   6 ]     8   0.172%  99.893% ",
"": "(   6,  10 ]     3   0.064%  99.957% ",
"": "(  10,  15 ]     2   0.043% 100.000% ",
"": "(  22,  34 ]     1   0.021% 100.021% ",
"": "",
"": "",
"": "** DB Stats **",
"": "Uptime(secs): 0.3 total, 0.3 interval",
"": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
"": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s",
"": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
"": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s",
"": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s",
"": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
}
We use the NVMes to store RocksDB, with spillover to the
spinning drives.
L4 is intended to end up on the spinning drives...
Will the other levels be created automatically again?
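For reference, this is roughly how we check per OSD how much of the DB
sits on the NVMe versus what has spilled over to the spinner (osd.12 is
just an example id, and jq is only there for readability; adjust to your
own OSDs):

ceph health detail | grep -i spillover
ceph daemon osd.12 perf dump bluefs | jq '.bluefs | {db_used_bytes, slow_used_bytes, db_total_bytes, slow_total_bytes}'

db_used_bytes is what lives on the fast device, slow_used_bytes is what
has spilled over to the spinning drive.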
On Tue 6 Oct 2020 at 13:18, Stefan Kooman <stefan@xxxxxx> wrote:
On 2020-10-06 13:05, Igor Fedotov wrote:
>
> On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>> Another strange thing is going on:
>>
>> No client software is using the system any longer, so we would expect
>> that all IOs are related to the recovery (fixing of the degraded PG).
>> However, the disks that are showing high IO are not members of the
>> PGs that are being fixed.
>>
>> So, something is heavily using the disks, but I can't immediately find
>> the process. I've read that there can be old client processes that keep
>> connecting to an OSD to retrieve data for a specific PG while that PG
>> is no longer available on that disk.
>>
>>
> I bet it's rather PG removal happening in the background....
^^ This, and probably the accompanying RocksDB housekeeping that
goes with it, as removing PGs by itself shouldn't be too big a deal.
Especially with very small files (and a lot of them) you probably
have a lot of OMAP / META data (ceph osd df will tell you).
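Something like this will show it (osd id 12 below is just an example):

ceph osd df tree                       # check the OMAP and META columns per OSD
ceph osd df | awk 'NR==1 || $1==12'    # header plus a single OSD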
If that's indeed the case then there is a (way) quicker option to get
out of this situation: offline compaction of the OSDs. This process
is orders of magnitude faster than when the OSDs are still online.
To check if this hypothesis is true: are the OSD servers under CPU
stress where the PGs were located previously (and not the new
hosts)?
Offline compaction per host:

systemctl stop ceph-osd.target
for osd in `ls /var/lib/ceph/osd/`; do (ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/$osd compact &); done
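And if you'd rather do it one OSD at a time instead of a whole host,
something along these lines should work as well (12 is just an example
id; setting noout is optional but avoids data movement while the OSD is
down):

ceph osd set noout
systemctl stop ceph-osd@12
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
systemctl start ceph-osd@12
ceph osd unset noout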
Gr. Stefan