Re: question about block sizes, rados objects and file striping (and maybe more)

Jason Dillaman <jdillama@xxxxxxxxxx> · Mon, 20 Mar 2017 20:36:20 -0400

On Mon, Mar 20, 2017 at 6:49 PM, Alejandro Comisario
<alejandro@xxxxxxxxxxx> wrote:
> Jason, thanks for the reply, you really got my question right.
> So, some doubts that might show that I lack some general knowledge.
>
> When I read that someone is testing a Ceph cluster with sequential 4K
> block writes, could that happen inside a VM that is using an
> RBD-backed OS?

You can use some benchmarks directly against librbd (e.g. see fio's
rbd engine), some within a VM against an RBD-backed block device, and
some within a VM against a filesystem backed by an RBD-backed block
device.

> In that case, should the VM's FS be formatted to allow 4K writes
> so that the block layer of the VM writes 4K down to the hypervisor?
>
> In that case, assuming that I have a 9K MTU between the compute node
> and the Ceph cluster,
> what is the default RADOS block size into which objects are divided,
> relative to the amount of data?

MTU size (network maximum packet size) and the RBD block object size
are not interrelated.
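
As a rough illustration of why the two are independent, here is a
back-of-the-envelope sketch in Python. It assumes plain TCP over IPv4
with no header options (40 bytes of headers per packet) and ignores
Ceph messenger overhead, ACKs and replication traffic; the function
name tcp_segments is made up for this example. Note that the RBD
object size never appears in the calculation: the MTU only changes how
many packets a given payload is cut into.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options (e.g. no timestamps)


def tcp_segments(payload_bytes, mtu):
    """Estimate how many TCP segments are needed to carry payload_bytes.

    Hypothetical helper for illustration only: real traffic also carries
    Ceph message headers and may use TCP options, so treat the result as
    a lower bound.
    """
    mss = mtu - IPV4_HEADER - TCP_HEADER  # maximum segment size
    return -(-payload_bytes // mss)       # ceiling division


if __name__ == "__main__":
    payload = 2 * 1024 * 1024  # e.g. one 2 MiB sub-operation
    for mtu in (1500, 9000):
        print(mtu, tcp_segments(payload, mtu))
```

A larger MTU means fewer packets for the same write, but the objects
that the write lands on stay exactly the same.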

>
> On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>> It's a very broad question -- are you trying to determine something
>> more specific?
>>
>> Notionally, your DB engine will safely journal the changes to disk,
>> commit the changes to the backing table structures, and prune the
>> journal. Your mileage may vary depending on the specific DB engine and
>> its configuration settings.
>>
>> The VM's OS will send write requests addressed by block offset and
>> block count (e.g. 512-byte blocks) through the block device hardware
>> (either a slower emulated block device or a faster paravirtualized
>> block device like virtio-blk/virtio-scsi). Within the internals of
>> QEMU, these block-addressed write requests will be delivered to librbd
>> in byte-addressed format (the blocks are converted to absolute byte
>> ranges).
>>
>> librbd will take the provided byte offset and length and quickly
>> calculate which backing RADOS objects are associated with the provided
>> range [1]. If the extent intersects multiple backing objects, the
>> sub-operation is sent to each affected object in parallel. These
>> operations will be sent to the OSDs responsible for handling the
>> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
>> maximum size of each IP packet -- larger MTUs allow you to send more
>> data within a single packet [2].
>>
>> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
>> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
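
The mapping described above can be sketched in a few lines of Python.
This is only an illustration of the striping arithmetic from the
data-striping documentation [1], not librbd's actual code. The names
sectors_to_bytes and map_extent are made up for this example, and the
4 MiB object size is an assumption (the RBD default, since the
exercise says "everything else is default"). The stripe unit of
2048 kB and stripe count of 4 come from the exercise.

```python
from collections import defaultdict

MiB = 1024 * 1024


def sectors_to_bytes(first_sector, sector_count, sector_size=512):
    """Convert a block-addressed request into (byte_offset, byte_length)."""
    return first_sector * sector_size, sector_count * sector_size


def map_extent(offset, length, object_size, stripe_unit, stripe_count):
    """Map an image byte range to {object_number: [(object_offset, length)]}.

    Hypothetical helper following the striping layout in the Ceph docs.
    """
    assert object_size % stripe_unit == 0
    stripes_per_object = object_size // stripe_unit
    extents = defaultdict(list)
    end = offset + length
    while offset < end:
        blockno = offset // stripe_unit      # which stripe unit overall
        stripeno = blockno // stripe_count   # which stripe (row)
        stripepos = blockno % stripe_count   # which object within the set
        objectsetno = stripeno // stripes_per_object
        objectno = objectsetno * stripe_count + stripepos
        block_off = offset % stripe_unit
        object_off = (stripeno % stripes_per_object) * stripe_unit + block_off
        chunk = min(stripe_unit - block_off, end - offset)
        extents[objectno].append((object_off, chunk))
        offset += chunk
    return dict(extents)


if __name__ == "__main__":
    # A 20 MiB write at the start of the image, with the exercise's settings.
    for objectno, parts in sorted(map_extent(0, 20 * MiB, 4 * MiB, 2 * MiB, 4).items()):
        print(objectno, parts)
    # A single 8K DB write starting at sector 16.
    print(map_extent(*sectors_to_bytes(16, 16), 4 * MiB, 2 * MiB, 4))
```

Under these assumptions the 20 MiB write touches six objects: objects
0 to 3 are filled completely, and objects 4 and 5 each receive their
first 2 MiB. A single 8K write stays inside one object.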
>>
>>
>>
>> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>> <alejandro@xxxxxxxxxxx> wrote:
>>> Anyone?
>>>
>>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>> <alejandro@xxxxxxxxxxx> wrote:
>>>> Hi, I've been using Ceph for a while, and I'm still a little
>>>> ashamed that when certain situations happen, I don't have the knowledge
>>>> to explain or plan things.
>>>>
>>>> Basically, here is what I don't know, posed as an exercise.
>>>>
>>>> EXERCISE:
>>>> A virtual machine running on KVM has an extra block device where the
>>>> datafiles of a database live (this block device is exposed to the VM
>>>> using libvirt).
>>>>
>>>> Facts:
>>>> * the DB writes to disk in 8K blocks
>>>> * the connection between the physical compute node and Ceph has an MTU of 1500
>>>> * the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4
>>>> * everything else is default
>>>>
>>>> So conceptually, can someone explain to me what happens from the
>>>> moment the DB inside the VM commits a 20 MByte query to disk:
>>>> what happens on the compute node, what happens with the
>>>> client's file striping, what happens on the network (regarding
>>>> packets, other than creating 1500-byte packets), and what happens
>>>> with RADOS objects, block sizes, etc.?
>>>>
>>>> I would love to read this from the best, mainly because, as I said, I
>>>> don't understand the whole workflow of blocks, objects, etc.
>>>>
>>>> Thanks to everyone!
>>>>
>>>> --
>>>> Alejandrito
>>>
>>>
>>>
>>> --
>>> Alejandro Comisario
>>> CTO | NUBELIU
>>> E-mail: alejandro@nubeliu.com
>>> Cell: +54 9 11 3770 1857
>>> _
>>> www.nubeliu.com
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason

-- 
Jason