Re: osd true blocksize vs bluestore_min_alloc_size

Hi Igor



Many thanks, it worked!


ewceph1-osd001-prod:~ # egrep -a --color=always "min_alloc_size" /var/log/ceph/ceph-osd.0.log | tail -111
2022-02-10 18:12:53.918 7f3a1dd4bd00 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta min_alloc_size 0x10000
2022-02-10 18:12:53.926 7f3a1dd4bd00 10 bluestore(/var/lib/ceph/osd/ceph-0) _set_alloc_sizes min_alloc_size 0x10000 order 16 max_alloc_size 0x0 prefer_deferred_size 0x8000 deferred_batch_ops 64
ewceph1-osd001-prod:~ # echo $((16#10000))
65536

So I get 64K for hdd and 16K for nvme.
I will recreate the nvme osd's with 4K to avoid any allocation overhead issue with EC8+2.
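
For reference, a rough sketch of the recreation steps I have in mind (the osd id and device path below are just placeholders; the setting has to be in place before the new OSDs are created, since min_alloc_size is baked in at OSD creation time):

    # lower the default for newly created ssd/nvme OSDs
    ceph config set osd bluestore_min_alloc_size_ssd 4096

    # then drain each nvme OSD, stop it, and redeploy it, e.g.
    ceph osd out 12                                   # placeholder osd id
    systemctl stop ceph-osd@12                        # after rebalancing has finished
    ceph osd destroy 12 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/nvme0n1 --destroy        # placeholder device path
    ceph-volume lvm create --osd-id 12 --data /dev/nvme0n1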


Cheers
Francois



--


EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheurer@xxxxxxxxxxxx
web: http://www.everyware.ch


________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Thursday, February 10, 2022 6:06 PM
To: Scheurer François; Dan van der Ster
Cc: Ceph Users
Subject: Re:  Re: osd true blocksize vs bluestore_min_alloc_size

Hi Francois,

you should set debug_bluestore = 10 instead.

And then grep for bluestore or min_alloc_size, not bluefs. Here is how
this is printed:

  dout(10) << __func__ << " min_alloc_size 0x" << std::hex << min_alloc_size
            << std::dec << " order " << (int)min_alloc_size_order
            << " max_alloc_size 0x" << std::hex << max_alloc_size
            << " prefer_deferred_size 0x" << prefer_deferred_size
            << std::dec
            << " deferred_batch_ops " << deferred_batch_ops
            << dendl;
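
A minimal sketch of one way to enable that (assuming ceph.conf based configuration and a systemd deployment; the OSD has to be restarted, since these lines are printed when the OSD mounts the store at startup):

    [osd]
        debug_bluestore = 10

    # restart the osd and grep its log
    systemctl restart ceph-osd@0
    egrep "min_alloc_size" /var/log/ceph/ceph-osd.0.log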

On 2/10/2022 7:39 PM, Scheurer François wrote:
> Dear Dan
>
>
> Thank you for your help.
>
> After putting debug_osd = 10/5 in ceph.conf under [osd], I still do not get min_alloc_size logged.
>
> Probably it is not logged on 14.2.5.
>
> But this comes up:
>
> ewceph1-osd001-prod:~ # egrep -a --color=always bluefs /var/log/ceph/ceph-osd.0.log | tail -111
> 2022-02-10 17:26:59.512 7f6026737d00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 80 GiB
>
> 2022-02-10 17:26:59.512 7f6026737d00  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 7.3 TiB
> 2022-02-10 17:26:59.512 7f6026737d00  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-0/block.wal size 2 GiB
> 2022-02-10 17:27:00.896 7f6026737d00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 80 GiB
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 7.3 TiB
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-0/block.wal size 2 GiB
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs mount
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs _init_alloc id 0 alloc_size 0x100000 size 0x80000000
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs _init_alloc id 1 alloc_size 0x100000 size 0x1400000000
> 2022-02-10 17:27:00.900 7f6026737d00  1 bluefs _init_alloc id 2 alloc_size 0x10000 size 0x746fc051000
> 2022-02-10 17:27:04.516 7f6026737d00  1 bluefs umount
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 80 GiB
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 7.3 TiB
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-0/block.wal size 2 GiB
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs mount
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs _init_alloc id 0 alloc_size 0x100000 size 0x80000000
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs _init_alloc id 1 alloc_size 0x100000 size 0x1400000000
> 2022-02-10 17:27:05.200 7f6026737d00  1 bluefs _init_alloc id 2 alloc_size 0x10000 size 0x746fc051000
>
> So alloc_size for block is 1MiB.

These are alloc sizes for bluefs, not for user data. So bluefs data at
the main device (id=2) uses a 64K allocation unit (0x10000). But it is the
user data allocation size (= min_alloc_size) that mostly matters for main
devices, as bluefs uses this device only in case of data spillover (i.e.
lack of free space at the DB volume).


And please do not confuse the allocation unit with the device block size. The
latter is almost always 4K and determines the minimal block size
read from / written to the disk. The allocation unit (= min_alloc_size)
determines the allocated/tracked block size, i.e. the minimal addressable
block which BlueStore uses.
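
As a quick illustration of why min_alloc_size (and not the 4K device block size) is what matters for small EC chunks, here is a rough example with made-up numbers: a 32 KiB chunk still occupies one full allocation unit with a 64K min_alloc_size, but only what it actually needs with 4K:

    # occupied space = chunk size rounded up to the allocation unit
    echo $(( (32768 + 65536 - 1) / 65536 * 65536 ))   # 65536 -> ~100% overhead with 64K
    echo $(( (32768 + 4096 - 1) / 4096 * 4096 ))      # 32768 -> no overhead with 4K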

>
> Any other way to get min_alloc_size ?
>
>
>
> Cheers
>
> Francois
>
> --
>
>
> EveryWare AG
> François Scheurer
> Senior Systems Engineer
> Zurlindenstrasse 52a
> CH-8003 Zürich
>
> tel: +41 44 466 60 00
> fax: +41 44 466 60 10
> mail: francois.scheurer@xxxxxxxxxxxx
> web: http://www.everyware.ch
>
>
> ________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: Thursday, February 10, 2022 4:33 PM
> To: Scheurer François
> Cc: Ceph Users
> Subject: Re:  osd true blocksize vs bluestore_min_alloc_size
>
> Hi,
>
> When an osd starts it should log at level 1 the min_alloc_size, see
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12260
>
> grep "min_alloc_size 0x" ceph-osd.*.log
>
> Cheers, Dan
>
>
> On Thu, Feb 10, 2022 at 3:50 PM Scheurer François
> <francois.scheurer@xxxxxxxxxxxx> wrote:
>> Hi everyone
>>
>>
>> How can we display the true osd block size?
>>
>>
>> I get 64K for a hdd osd:
>>
>>          ceph daemon osd.0 config show | egrep --color=always "alloc_size|bdev_block_size"
>>              "bdev_block_size": "4096",
>>              "bluefs_alloc_size": "1048576",
>>              "bluefs_shared_alloc_size": "65536",
>>              "bluestore_extent_map_inline_shard_prealloc_size": "256",
>>              "bluestore_max_alloc_size": "0",
>>              "bluestore_min_alloc_size": "0",
>>              "bluestore_min_alloc_size_hdd": "65536",
>>              "bluestore_min_alloc_size_ssd": "16384",
>>
>> But it was explained that bluestore_min_alloc_size_hdd is only affecting newly created osd's.
>> So to check the current block size I can check the osd metadata and find 4K:
>>          ceph osd metadata osd.0 | jq '.bluestore_bdev_block_size'
>>
>>              "bluestore_bdev_block_size": "4096",
>>
>> Checking an object block size directly also shows 4K:
>>          ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 6.5s4 "cb1594b3-a782-49d0-a19f-68cd48870a63.95398870.14_DriveE/Predator/Doc/2021/03/01101038/1523111.pdf.zip" dump | jq '.stat'
>>              {
>>                "size": 32768,
>>                "blksize": 4096,
>>                "blocks": 8,
>>                "nlink": 1
>>              }
>>
>> So these hdd osd's were created with 4K block size without honoring bluestore_min_alloc_size_hdd?
>> The osd's are running on nautilus 14.2.5 and were created on luminous.
>>
>> Newer nvme osd's created on nautilus were also created with 4K without honoring bluestore_min_alloc_size_ssd (16K).
>>
>> This is confusing... Actually I would be happy with 4K as it is recommended to avoid over-allocation issue with EC pools.
>> But I would like to understand how to show the true block size of an existing osd...
>>
>> Many thanks for your help! ;-)
>>
>>
>> Cheers
>> Francois Scheurer
>>
>>
>>
>>
>>
>>
>> --
>>
>>
>> EveryWare AG
>> François Scheurer
>> Senior Systems Engineer
>> Zurlindenstrasse 52a
>> CH-8003 Zürich
>>
>> tel: +41 44 466 60 00
>> fax: +41 44 466 60 10
>> mail: francois.scheurer@xxxxxxxxxxxx
>> web: http://www.everyware.ch
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
