Re: Slow ops on OSDs

I'm working on improving PG removal in master, see: https://github.com/ceph/ceph/pull/37496

Hopefully this will help with the "cleanup after rebalancing" issue you presumably had.


On 10/6/2020 4:24 PM, Kristof Coucke wrote:
Hi Igor and Stefan,

Everything seems okay, so we'll now create a script to automate this on all the nodes, and we'll also look into the monitoring options.
Thanks for your help, it was a time saver.
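For what it's worth, a rough sketch of what such a per-node script could look like (untested; it assumes the default /var/lib/ceph/osd layout and the offline compaction procedure Stefan posted below):

#!/bin/bash
# Stop all OSDs on this host, compact each one offline, then start them again.
set -e

systemctl stop ceph-osd.target

for osd in /var/lib/ceph/osd/*; do
    echo "Compacting $osd ..."
    ceph-kvstore-tool bluestore-kv "$osd" compact
done

systemctl start ceph-osd.target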

Does anyone know if this issue is handled better in newer versions, or if an improvement is planned for an upcoming release?

My best regards,

Kristof

On Tue, 6 Oct 2020 at 14:36, Igor Fedotov <ifedotov@xxxxxxx> wrote:

    I've seen similar reports after manual compactions as well, but it
    looks like a presentation bug in RocksDB to me.

    You can check whether all the data has spilled over (as it ought to
    have for L4) in the bluefs section of the OSD perf counters dump...
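    For example, something like this (assuming jq is available; the
    bluefs counter names can differ slightly between releases):

    ceph daemon osd.<id> perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'

    If slow_used_bytes accounts for most of the DB size, the data is
    indeed sitting on the slow (spinning) device.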


    On 10/6/2020 3:18 PM, Kristof Coucke wrote:
    Ok, I ran the compaction on 1 OSD.
    The utilization is back to normal, so that's good... Thumbs up to
    you guys!
    Though, there is one thing I want to get out of the way before doing
    the same on the other OSDs:
    when I now get the RocksDB stats, my L1, L2 and L3 are gone:

    db_statistics {
        "rocksdb_compaction_statistics": "",
        "": "",
        "": "** Compaction Stats [default] **",
        "": "Level    Files   Size      Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
        "": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
        "": "  L0      1/0   968.45 KB   0.2      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009     0       0",
        "": "  L4   1557/0    98.10 GB   0.4      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000     0       0",
        "": " Sum   1558/0    98.10 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009     0       0",
        "": " Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0    105.1      0.01              0.00         1    0.009     0       0",
        "": "",
        "": "** Compaction Stats [default] **",
        "": "Priority    Files   Size    Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
        "": "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
        "": "User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0    105.1      0.01              0.00         1    0.009     0       0",
        "": "Uptime(secs): 0.3 total, 0.3 interval",
        "": "Flush(GB): cumulative 0.001, interval 0.001",
        "": "AddFile(GB): cumulative 0.000, interval 0.000",
        "": "AddFile(Total Files): cumulative 0, interval 0",
        "": "AddFile(L0 Files): cumulative 0, interval 0",
        "": "AddFile(Keys): cumulative 0, interval 0",
        "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
        "": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds",
        "": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
        "": "",
        "": "** File Read Latency Histogram By Level [default] **",
        "": "** Level 0 read latency histogram (micros):",
        "": "Count: 5 Average: 69.2000  StdDev: 85.92",
        "": "Min: 0  Median: 1.5000  Max: 201",
        "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00 P99.99: 201.00",
        "": "------------------------------------------------------",
        "": "[       0,       1 ]        2  40.000%  40.000% ########",
        "": "(       1,       2 ]        1  20.000%  60.000% ####",
        "": "(     110,     170 ]        1  20.000%  80.000% ####",
        "": "(     170,     250 ]        1  20.000% 100.000% ####",
        "": "",
        "": "** Level 4 read latency histogram (micros):",
        "": "Count: 4664 Average: 0.6895  StdDev: 0.82",
        "": "Min: 0  Median: 0.5258  Max: 27",
        "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99: 13.83",
        "": "------------------------------------------------------",
        "": "[       0,       1 ]     4435  95.090%  95.090% ###################",
        "": "(       1,       2 ]      149   3.195%  98.285% #",
        "": "(       2,       3 ]       55   1.179%  99.464%",
        "": "(       3,       4 ]       12   0.257%  99.721%",
        "": "(       4,       6 ]        8   0.172%  99.893%",
        "": "(       6,      10 ]        3   0.064%  99.957%",
        "": "(      10,      15 ]        2   0.043% 100.000%",
        "": "(      22,      34 ]        1   0.021% 100.021%",
        "": "",
        "": "",
        "": "** DB Stats **",
        "": "Uptime(secs): 0.3 total, 0.3 interval",
        "": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
        "": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s",
        "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
        "": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s",
        "": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s",
        "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
    }

    We use the NVMes to store RocksDB, but with spillover to the
    spinning drives.
    L4 is intended to be stored on the spinning drives...
    Will the other levels be created again automatically?


    On Tue, 6 Oct 2020 at 13:18, Stefan Kooman <stefan@xxxxxx> wrote:

        On 2020-10-06 13:05, Igor Fedotov wrote:
        > On 10/6/2020 1:04 PM, Kristof Coucke wrote:
        >> Another strange thing is going on:
        >>
        >> No client software is using the system any longer, so we would
        >> expect that all IOs are related to the recovery (fixing of the
        >> degraded PG). However, the disks that are reaching high IO are
        >> not members of the PGs that are being fixed.
        >>
        >> So, something is heavily using the disks, but I can't find the
        >> process immediately. I've read that there can be old client
        >> processes that keep on connecting to an OSD to retrieve data for
        >> a specific PG while that PG is no longer available on that disk.
        >>
        >>
        > I bet it's rather PG removal happening in the background....

        ^^ This, and probably the accompanying RocksDB housekeeping that
        goes with it; removing the PGs by itself shouldn't be too big a
        deal at all. Especially with very small files (and a lot of them)
        you probably have a lot of OMAP / META data (ceph osd df will tell
        you).
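
        For example (the OMAP and META columns are the interesting ones
        here):

        ceph osd df tree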

        If that's indeed the case, then there is a (way) quicker option to
        get out of this situation: offline compaction of the OSDs. This
        process is orders of magnitude faster than when the OSDs are still
        online.

        To check whether this hypothesis is true: are the OSD servers where
        the PGs were located previously (and not the new hosts) under CPU
        stress?

        Offline compaction per host:

        systemctl stop ceph-osd.target

        for osd in `ls /var/lib/ceph/osd/`; do
            (ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/$osd compact &)
        done
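
        And once the compactions have finished, start the OSDs again:

        systemctl start ceph-osd.target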

        Gr. Stefan




