Re: adding block.db to OSD

Hi Stefan,

I don't have specific preferences - any public storage you prefer will do.

Just one note - I presume you collected the logs for the full set of 10 runs. That's redundant; could you please collect detailed logs (one per OSD) for single-shot runs instead?
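Something like this per OSD would be enough (one shot only, then grab that OSD's log - assuming logs land in the default /var/log/ceph/ceph-osd.<id>.log):

# ceph tell osd.38 bench -f plain 12288000 4096
# cp /var/log/ceph/ceph-osd.38.log osd.38-single-run.log

and the same for osd.0.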

Sorry for the unclear previous inquiry.

Additionally I realized that it's the faster OSD (osd.38) that shows the higher flush/sync latency values. This holds for both attempts.

This seems pretty odd to be honest. Is that indeed correct - nothing got mixed up along the way?


Thanks,

Igor


On 5/11/2020 9:44 AM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

where to post the logs?

Am 06.05.20 um 09:23 schrieb Stefan Priebe - Profihost AG:
Hi Igor,

Am 05.05.20 um 16:10 schrieb Igor Fedotov:
Hi Stefan,

so (surprise!) some DB access counters show a significant difference, e.g.

         "kv_flush_lat": {
             "avgcount": 1423,
             "sum": 0.000906419,
             "avgtime": 0.000000636
         },
         "kv_sync_lat": {
             "avgcount": 1423,
             "sum": 0.712888091,
             "avgtime": 0.000500975
         },
vs.

      "kv_flush_lat": {
             "avgcount": 1146,
             "sum": 3.346228802,
             "avgtime": 0.002919920
         },
       "kv_sync_lat": {
             "avgcount": 1146,
             "sum": 3.754915016,
             "avgtime": 0.003276540
         },

Also for bluefs:
"bytes_written_sst": 0,
vs.
  "bytes_written_sst": 59785361,

Could you please rerun these benchmark/perf-counter gathering steps a couple more times and check whether the difference persists?
I reset all perf counters and ran the bench 10 times on each osd.
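(The per-OSD sums below were pulled out of the perf dump, roughly with something like this on the OSD node - counter paths may need adjusting:)

# ceph daemon osd.38 perf dump | jq '.bluestore.kv_flush_lat.sum, .bluestore.kv_sync_lat.sum, .bluefs.bytes_written_sst'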

OSD 38:
bench: wrote 12 MiB in blocks of 4 KiB in 1.22796 sec at 9.5 MiB/sec
2.44k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.26407 sec at 9.3 MiB/sec
2.37k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.24987 sec at 9.4 MiB/sec
2.40k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.37125 sec at 8.5 MiB/sec
2.19k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.25549 sec at 9.3 MiB/sec
2.39k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.24358 sec at 9.4 MiB/sec
2.41k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.24208 sec at 9.4 MiB/sec
2.42k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.2433 sec at 9.4 MiB/sec
2.41k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.26548 sec at 9.3 MiB/sec
2.37k IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 1.31509 sec at 8.9 MiB/sec
2.28k IOPS

kv_flush_lat.sum: 8.955978864
kv_sync_lat.sum: 10.869536503
bytes_written_sst: 0


OSD 0:
bench: wrote 12 MiB in blocks of 4 KiB in 5.71447 sec at 2.1 MiB/sec
524 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 6.18679 sec at 1.9 MiB/sec
484 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 6.69068 sec at 1.8 MiB/sec
448 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 7.06413 sec at 1.7 MiB/sec
424 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 7.50321 sec at 1.6 MiB/sec
399 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 6.86882 sec at 1.7 MiB/sec
436 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 7.11702 sec at 1.6 MiB/sec
421 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 7.10497 sec at 1.6 MiB/sec
422 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 6.69801 sec at 1.7 MiB/sec
447 IOPS
bench: wrote 12 MiB in blocks of 4 KiB in 7.13588 sec at 1.6 MiB/sec
420 IOPS
kv_flush_lat.sum: 0.003866224
kv_sync_lat.sum: 2.667407139
bytes_written_sst: 34904457

If that's true in particular for the "kv_flush_lat" counter - please rerun with debug-bluefs set to 20 and collect OSD logs for both cases.
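Something along these lines on the OSD node should do, setting it back to the default (1/5, if I remember right) afterwards:

# ceph daemon osd.38 config set debug_bluefs 20
(run the single bench as before)
# ceph daemon osd.38 config set debug_bluefs 1/5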
Yes, it's still true for kv_flush_lat - see above. Where should I upload / put
those logs?

greets,
Stefan

Thanks,
Igor

On 5/5/2020 11:46 AM, Stefan Priebe - Profihost AG wrote:
Hello Igor,

Am 30.04.20 um 15:52 schrieb Igor Fedotov:
1) reset perf counters for the specific OSD

2) run bench

3) dump perf counters (see the sketch below).
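Taken together, roughly something like this per OSD (run on the node hosting it; bench size and block size mirror the ones used below):

# ceph daemon osd.0 perf reset all
# ceph tell osd.0 bench -f plain 12288000 4096
# ceph daemon osd.0 perf dump > osd.0-perf.json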
This is OSD 0:

# ceph tell osd.0 bench -f plain 12288000 4096
bench: wrote 12 MiB in blocks of 4 KiB in 6.70482 sec at 1.7 MiB/sec
447 IOPS

https://pastebin.com/raw/hbKcU07g

This is OSD 38:

# ceph tell osd.38 bench -f plain 12288000 4096
bench: wrote 12 MiB in blocks of 4 KiB in 2.01763 sec at 5.8 MiB/sec
1.49k IOPS

https://pastebin.com/raw/Tx2ckVm1

Collecting the disks' (both main and db) activity with iostat would be nice
too. But please either increase the benchmark duration or reduce the iostat
probe period to 0.1 or 0.05 seconds.
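For example (device names are placeholders for the OSD's main HDD and its block.db SSD; needs a sysstat build that accepts sub-second intervals), started just before the bench and stopped right after it:

# iostat -dmx 0.1 /dev/sdX /dev/sdY > iostat-osd0.log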
This gives me:

# ceph tell osd.38 bench -f plain 122880000 4096
Error EINVAL: 'count' values greater than 12288000 for a block size of 4
KiB, assuming 100 IOPS, for 30 seconds, can cause ill effects on osd.
Please adjust 'osd_bench_small_size_max_iops' with a higher value if you
wish to use a higher 'count'.
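Presumably that limit can be raised as the message suggests, e.g.

# ceph config set osd.38 osd_bench_small_size_max_iops 5000

and the longer bench retried afterwards.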

Stefan

Thanks,

Igor

On 4/28/2020 8:42 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

but the performance issue is still present even on the recreated OSD.

# ceph tell osd.38 bench -f plain 12288000 4096
bench: wrote 12 MiB in blocks of 4 KiB in 1.63389 sec at 7.2 MiB/sec
1.84k IOPS

vs.

# ceph tell osd.10 bench -f plain 12288000 4096
bench: wrote 12 MiB in blocks of 4 KiB in 10.7454 sec at 1.1 MiB/sec
279 IOPS

both backed by the same SAMSUNG SSD as block.db.

Greets,
Stefan

Am 28.04.20 um 19:12 schrieb Stefan Priebe - Profihost AG:
Hi Igor,
Am 27.04.20 um 15:03 schrieb Igor Fedotov:
Just left a comment at https://tracker.ceph.com/issues/44509

Generally bdev-new-db performs no migration; RocksDB might eventually do
that, but there's no guarantee it moves everything.

One should use bluefs-bdev-migrate to do actual migration.

And I think that's the root cause for the above ticket.
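From memory the invocation is roughly the following, run with the OSD stopped and paths adjusted per OSD:

# systemctl stop ceph-osd@0
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 \
    --devs-source /var/lib/ceph/osd/ceph-0/block \
    --dev-target /var/lib/ceph/osd/ceph-0/block.db \
    bluefs-bdev-migrate
# systemctl start ceph-osd@0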
perfect - this removed all spillover in seconds.

Greets,
Stefan


Thanks,

Igor

On 4/24/2020 2:37 PM, Stefan Priebe - Profihost AG wrote:
No, not a standalone WAL. I wanted to ask whether bdev-new-db migrated
the DB and WAL from HDD to SSD.

Stefan

Am 24.04.2020 um 13:01 schrieb Igor Fedotov <ifedotov@xxxxxxx>:



Unless you have 3 different types of disks behind an OSD (e.g. HDD, SSD,
NVMe), a standalone WAL makes no sense.


On 4/24/2020 1:58 PM, Stefan Priebe - Profihost AG wrote:
Is a WAL device missing? Do I need to run bluefs-bdev-new-db and WAL?

Greets,
Stefan

Am 24.04.2020 um 11:32 schrieb Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx>:

Hi Igor,

there must be a difference. I purged osd.0 and recreated it.

Now it gives:
ceph tell osd.0 bench
{
     "bytes_written": 1073741824,
     "blocksize": 4194304,
     "elapsed_sec": 8.1554735639999993,
     "bytes_per_sec": 131659040.46819863,
     "iops": 31.389961354303033
}

What's wrong with adding a block.db device later?

Stefan

Am 23.04.20 um 20:34 schrieb Stefan Priebe - Profihost AG:
Hi,
if the OSDs are idle, the difference is even worse:
# ceph tell osd.0 bench
{
      "bytes_written": 1073741824,
      "blocksize": 4194304,
      "elapsed_sec": 15.396707875000001,
      "bytes_per_sec": 69738403.346825853,
      "iops": 16.626931034761871
}
# ceph tell osd.38 bench
{
      "bytes_written": 1073741824,
      "blocksize": 4194304,
      "elapsed_sec": 6.8903985170000004,
      "bytes_per_sec": 155831599.77624846,
      "iops": 37.153148597776521
}
Stefan
Am 23.04.20 um 14:39 schrieb Stefan Priebe - Profihost AG:
Hi,
Am 23.04.20 um 14:06 schrieb Igor Fedotov:
I don't recall any additional tuning to be applied to the new DB
volume. And I assume the hardware is pretty much the same...

Do you still have any significant amount of data spilled over
for these updated OSDs? If not, I don't have any valid
explanation for the phenomenon.
just the 64k from here:
https://tracker.ceph.com/issues/44509
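(For reference, per-OSD spillover can be eyeballed with something like the following - counter names from memory:)

# ceph daemon osd.0 perf dump | grep -E '"db_used_bytes"|"slow_used_bytes"'

or cluster-wide via "ceph health detail", which lists the BLUEFS_SPILLOVER warning.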

You might want to try "ceph osd bench" to compare OSDs under
pretty much the same load. Any difference observed?
Servers are the same HW. OSD Bench is:
# ceph tell osd.0 bench
{
       "bytes_written": 1073741824,
       "blocksize": 4194304,
       "elapsed_sec": 16.091414781000001,
       "bytes_per_sec": 66727620.822242722,
       "iops": 15.909104543266945
}

# ceph tell osd.36 bench
{
       "bytes_written": 1073741824,
       "blocksize": 4194304,
       "elapsed_sec": 10.023828538,
       "bytes_per_sec": 107118933.6419194,
       "iops": 25.539143953780986
}


OSD 0 is a Toshiba MG07SCA12TA SAS 12G
OSD 36 is a Seagate ST12000NM0008-2H SATA 6G

The SSDs are all the same, like the rest of the HW, and both drives
should give the same performance according to their specs. The only other
difference is that OSD 36 was created directly with the block.db
device (Nautilus 14.2.7) while OSD 0 (14.2.8) was not.

Stefan

On 4/23/2020 8:35 AM, Stefan Priebe - Profihost AG wrote:
Hello,

is there anything else needed besides running:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD}
bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1
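If I recall correctly, ceph-bluestore-tool's show-label can be used afterwards to double-check that block.db actually got attached:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-${OSD}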

I did so some weeks ago and currently I'm seeing that all OSDs
originally deployed with --block-db show 10-20% I/O wait, while
all those converted using ceph-bluestore-tool show 80-100%
I/O wait.

Also, is there some tuning available to make more use of the SSD? The
SSD (block.db) is only saturated at 0-2%.

Greets,
Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



