Hi Igor and Stefan, Everything seems okay, so we'll now create a script to automate this on all the nodes and we will also review the monitoring possibilities. Thanks for your help, it was a time saver. Does anyone know if this issue is better handled in the newer versions or if this is planned in an upcoming release? My best regards, Kristof Op di 6 okt. 2020 om 14:36 schreef Igor Fedotov <ifedotov@xxxxxxx>: > I've seen similar reports after manual compactions as well. But it looks > like a presentation bug in RocksDB to me. > > You can check if all the data is spilled over (as it ought to be for L4) > in bluefs section of OSD perf counters dump... > > > On 10/6/2020 3:18 PM, Kristof Coucke wrote: > > Ok, I did the compact on 1 osd. > The utilization is back to normal, so that's good... Thumbs up to you guys! > Though, one thing I want to get out of the way before adapting the other > OSDs: > When I now get the RocksDb stats, my L1, L2 and L3 are gone: > > db_statistics { > "rocksdb_compaction_statistics": "", > "": "", > "": "** Compaction Stats [default] **", > "": "Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) > Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) > CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop", > "": > "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------", > "": " L0 1/0 968.45 KB 0.2 0.0 0.0 0.0 > 0.0 0.0 0.0 1.0 0.0 105.1 0.01 0.00 > 1 0.009 0 0", > "": " L4 1557/0 98.10 GB 0.4 0.0 0.0 0.0 > 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 > 0 0.000 0 0", > "": " Sum 1558/0 98.10 GB 0.0 0.0 0.0 0.0 > 0.0 0.0 0.0 1.0 0.0 105.1 0.01 0.00 > 1 0.009 0 0", > "": " Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 > 0.0 0.0 0.0 1.0 0.0 105.1 0.01 0.00 > 1 0.009 0 0", > "": "", > "": "** Compaction Stats [default] **", > "": "Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) > Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) > CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop", > "": > "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------", > "": "User 0/0 0.00 KB 0.0 0.0 0.0 0.0 > 0.0 0.0 0.0 0.0 0.0 105.1 0.01 0.00 > 1 0.009 0 0", > "": "Uptime(secs): 0.3 total, 0.3 interval", > "": "Flush(GB): cumulative 0.001, interval 0.001", > "": "AddFile(GB): cumulative 0.000, interval 0.000", > "": "AddFile(Total Files): cumulative 0, interval 0", > "": "AddFile(L0 Files): cumulative 0, interval 0", > "": "AddFile(Keys): cumulative 0, interval 0", > "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB > read, 0.00 MB/s read, 0.0 seconds", > "": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB > read, 0.00 MB/s read, 0.0 seconds", > "": "Stalls(count): 0 level0_slowdown, 0 > level0_slowdown_with_compaction, 0 level0_numfiles, 0 > level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 > slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 > memtable_slowdown, interval 0 total count", > "": "", > "": "** File Read Latency Histogram By Level [default] **", > "": "** Level 0 read latency histogram (micros):", > "": "Count: 5 Average: 69.2000 StdDev: 85.92", > "": "Min: 0 Median: 1.5000 Max: 201", > "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00 > P99.99: 201.00", > "": "------------------------------------------------------", > "": "[ 0, 1 ] 2 40.000% 40.000% ########", > "": "( 1, 2 ] 1 20.000% 60.000% ####", > "": "( 110, 170 ] 1 20.000% 80.000% ####", > "": "( 170, 250 ] 1 20.000% 100.000% ####", > "": "", > "": "** Level 4 read latency histogram (micros):", > "": "Count: 4664 Average: 0.6895 StdDev: 0.82", > "": "Min: 0 Median: 0.5258 Max: 27", > "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99: > 13.83", > "": "------------------------------------------------------", > "": "[ 0, 1 ] 4435 95.090% 95.090% > ###################", > "": "( 1, 2 ] 149 3.195% 98.285% #", > "": "( 2, 3 ] 55 1.179% 99.464% ", > "": "( 3, 4 ] 12 0.257% 99.721% ", > "": "( 4, 6 ] 8 0.172% 99.893% ", > "": "( 6, 10 ] 3 0.064% 99.957% ", > "": "( 10, 15 ] 2 0.043% 100.000% ", > "": "( 22, 34 ] 1 0.021% 100.021% ", > "": "", > "": "", > "": "** DB Stats **", > "": "Uptime(secs): 0.3 total, 0.3 interval", > "": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes > per commit group, ingest: 0.00 GB, 0.00 MB/s", > "": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: > 0.00 GB, 0.00 MB/s", > "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent", > "": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes > per commit group, ingest: 0.00 MB, 0.00 MB/s", > "": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: > 0.00 MB, 0.00 MB/s", > "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent" > } > > We use the NVMe's to store the RocksDb, but with the spillover towards the > spinning drives. > L4 is intended to be stored on the spinning drives... > Will the other levels be created automatically? > > > Op di 6 okt. 2020 om 13:18 schreef Stefan Kooman <stefan@xxxxxx>: > >> On 2020-10-06 13:05, Igor Fedotov wrote: >> > >> > On 10/6/2020 1:04 PM, Kristof Coucke wrote: >> >> Another strange thing is going on: >> >> >> >> No client software is using the system any longer, so we would expect >> >> that all IOs are related to the recovery (fixing of the degraded PG). >> >> However, the disks that are reaching high IO are not a member of the >> >> PGs that are being fixed. >> >> >> >> So, something is heavily using the disk, but I can't find the process >> >> immediately. I've read something that there can be old client >> >> processes that keep on connecting to an OSD for retrieving data for a >> >> specific PG while that PG is no longer available on that disk. >> >> >> >> >> > I bet it's rather PG removal happening in background.... >> >> ^^ This, and probably the accompanying RocksDB housekeeping that goes >> with it. As only removing PGs shouldn't be a too big a deal at all. >> Especially with very small files (and a lot of them) you probably have a >> lot of OMAP / META data, (ceph osd df will tell you). >> >> If that's indeed the case than there is a (way) quicker option to get >> out of this situation: offline compacting of the OSDs. This process >> happens orders of magnitude faster than when the OSDs are still online. >> >> To check if this hypothesis is true: are the OSD servers under CPU >> stress where the PGs were located previously (and not the new hosts)? >> >> Offline compaction per host: >> >> systemctl stop ceph-osd.target >> >> for osd in `ls /var/lib/ceph/osd/`; do (ceph-kvstore-tool bluestore-kv >> /var/lib/ceph/osd/$osd compact &);done >> >> Gr. Stefan >> > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx