Re: Slow ops on OSDs

Hi Igor and Stefan,

Everything seems okay, so we'll now create a script to automate this on all
the nodes, and we'll also review our monitoring options.
Thanks for your help; it was a real time saver.
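
Roughly what I have in mind for the per-node script, based on Stefan's
commands further down in this thread (just a sketch; it assumes the default
/var/lib/ceph/osd layout and the ceph-kvstore-tool invocation he suggested):

#!/usr/bin/env bash
# Sketch: offline-compact every OSD on this node, then bring them back up.
set -euo pipefail

# stop all OSDs on this host before touching their RocksDB
systemctl stop ceph-osd.target

# compact each OSD's key/value store offline
for osd in /var/lib/ceph/osd/*; do
    echo "Compacting ${osd} ..."
    ceph-kvstore-tool bluestore-kv "${osd}" compact
done

# bring the OSDs back online once compaction has finished
systemctl start ceph-osd.target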

Does anyone know whether this issue is handled better in newer versions, or
whether improvements are planned for an upcoming release?

My best regards,

Kristof

On Tue, 6 Oct 2020 at 14:36, Igor Fedotov <ifedotov@xxxxxxx> wrote:

> I've seen similar reports after manual compactions as well. But it looks
> like a presentation bug in RocksDB to me.
>
> You can check whether all the data has spilled over (as it ought to have
> for L4) in the bluefs section of the OSD perf counters dump...
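>
> For example, something along these lines (just a sketch; it assumes jq is
> installed and the usual bluefs counter names, which may differ slightly
> between releases):
>
> # dump the bluefs counters for one OSD and pick out the spillover-related ones
> ceph daemon osd.<id> perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'
>
> A non-zero slow_used_bytes would indicate that part of the DB lives on the
> slow (spinning) device.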
>
>
> On 10/6/2020 3:18 PM, Kristof Coucke wrote:
>
> Ok, I ran the compaction on 1 OSD.
> The utilization is back to normal, so that's good... Thumbs up to you guys!
> Though, one thing I want to get out of the way before adapting the other
> OSDs: when I now get the RocksDB stats, my L1, L2 and L3 are gone:
>
> db_statistics {
>     "rocksdb_compaction_statistics": "",
>     "": "",
>     "": "** Compaction Stats [default] **",
>     "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
>     "": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
>     "": "  L0      1/0   968.45 KB   0.2      0.0     0.0      0.0      0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009       0      0",
>     "": "  L4   1557/0   98.10 GB   0.4      0.0     0.0      0.0      0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0",
>     "": " Sum   1558/0   98.10 GB   0.0      0.0     0.0      0.0      0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009       0      0",
>     "": " Int      0/0    0.00 KB   0.0      0.0     0.0      0.0      0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009       0      0",
>     "": "",
>     "": "** Compaction Stats [default] **",
>     "": "Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
>     "": "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
>     "": "User      0/0    0.00 KB   0.0      0.0     0.0      0.0      0.0      0.0       0.0   0.0      0.0    105.1      0.01              0.00         1    0.009       0      0",
>     "": "Uptime(secs): 0.3 total, 0.3 interval",
>     "": "Flush(GB): cumulative 0.001, interval 0.001",
>     "": "AddFile(GB): cumulative 0.000, interval 0.000",
>     "": "AddFile(Total Files): cumulative 0, interval 0",
>     "": "AddFile(L0 Files): cumulative 0, interval 0",
>     "": "AddFile(Keys): cumulative 0, interval 0",
>     "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
>     "": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
>     "": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
>     "": "",
>     "": "** File Read Latency Histogram By Level [default] **",
>     "": "** Level 0 read latency histogram (micros):",
>     "": "Count: 5 Average: 69.2000  StdDev: 85.92",
>     "": "Min: 0  Median: 1.5000  Max: 201",
>     "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00 P99.99: 201.00",
>     "": "------------------------------------------------------",
>     "": "[       0,       1 ]        2  40.000%  40.000% ########",
>     "": "(       1,       2 ]        1  20.000%  60.000% ####",
>     "": "(     110,     170 ]        1  20.000%  80.000% ####",
>     "": "(     170,     250 ]        1  20.000% 100.000% ####",
>     "": "",
>     "": "** Level 4 read latency histogram (micros):",
>     "": "Count: 4664 Average: 0.6895  StdDev: 0.82",
>     "": "Min: 0  Median: 0.5258  Max: 27",
>     "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99: 13.83",
>     "": "------------------------------------------------------",
>     "": "[       0,       1 ]     4435  95.090%  95.090% ###################",
>     "": "(       1,       2 ]      149   3.195%  98.285% #",
>     "": "(       2,       3 ]       55   1.179%  99.464% ",
>     "": "(       3,       4 ]       12   0.257%  99.721% ",
>     "": "(       4,       6 ]        8   0.172%  99.893% ",
>     "": "(       6,      10 ]        3   0.064%  99.957% ",
>     "": "(      10,      15 ]        2   0.043% 100.000% ",
>     "": "(      22,      34 ]        1   0.021% 100.021% ",
>     "": "",
>     "": "",
>     "": "** DB Stats **",
>     "": "Uptime(secs): 0.3 total, 0.3 interval",
>     "": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
>     "": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s",
>     "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
>     "": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s",
>     "": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s",
>     "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
> }
>
> We use the NVMes to store the RocksDB, but with spillover towards the
> spinning drives.
> L4 is intended to be stored on the spinning drives...
> Will the other levels be created automatically?
>
>
> On Tue, 6 Oct 2020 at 13:18, Stefan Kooman <stefan@xxxxxx> wrote:
>
>> On 2020-10-06 13:05, Igor Fedotov wrote:
>> >
>> > On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>> >> Another strange thing is going on:
>> >>
>> >> No client software is using the system any longer, so we would expect
>> >> all IO to be related to the recovery (fixing of the degraded PG).
>> >> However, the disks that are showing high IO are not members of the
>> >> PGs that are being fixed.
>> >>
>> >> So something is using those disks heavily, but I can't immediately
>> >> find the process. I've read that there can be old client processes
>> >> that keep connecting to an OSD to retrieve data for a specific PG
>> >> even though that PG is no longer available on that disk.
>> >>
>> >>
>> > I bet it's rather PG removal happening in background....
>>
>> ^^ This, and probably the accompanying RocksDB housekeeping that goes
>> with it, as removing the PGs by itself shouldn't be too big a deal at
>> all. Especially with very small files (and a lot of them) you probably
>> have a lot of OMAP / META data (ceph osd df will tell you).
>>
>> If that's indeed the case then there is a (way) quicker option to get
>> out of this situation: offline compaction of the OSDs. This process
>> is orders of magnitude faster than when the OSDs are still online.
>>
>> To check if this hypothesis is true: are the OSD servers under CPU
>> stress where the PGs were located previously (and not the new hosts)?
>>
>> Offline compaction per host:
>>
>> systemctl stop ceph-osd.target
>>
>> for osd in `ls /var/lib/ceph/osd/`; do (ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/$osd compact &); done
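>>
>> Note that each ceph-kvstore-tool run is put in the background by the
>> subshell, so wait for those processes to finish before starting the
>> OSDs again (e.g. systemctl start ceph-osd.target).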
>>
>> Gr. Stefan
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


