Re: Basic Ceph questions

Thanks:)

Just curious, what kind of applications use RBD? It can't be
applications that need high-speed SAN storage performance
characteristics, can it?

For VMs, I am trying to visualize how the RBD device would be exposed.
Where does the driver live exactly? If it's exposed via libvirt and
QEMU, does the kernel driver run in the host OS and communicate with
a backend Ceph cluster? If yes, does librbd provide a target (SCSI?)
interface which the kernel driver connects to? Trying to visualize
what the stack looks like, and the flow of I/Os for block devices.
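
For what it's worth, here is roughly how I picture the userspace path:
a minimal sketch using the python-rbd bindings rather than QEMU itself
(the pool and image names below are made up, and the image would have
to exist already):

import rados, rbd

# Sketch only: pure userspace access through librbd, which is (as I
# understand it) the same library QEMU's rbd driver links against, so
# no kernel block driver is involved on this path.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")        # pool name is hypothetical
    image = rbd.Image(ioctx, "vm-disk")      # image name is hypothetical
    image.write(b"hello rbd", 0)             # write 9 bytes at offset 0
    print(image.read(0, 9))                  # read them back
    image.close()
    ioctx.close()
finally:
    cluster.shutdown()

Is that roughly right, or does the VM path go through the kernel rbd
module and a /dev device instead?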

FUSE is probably for the Ceph file system...

MW





On Wed, Oct 8, 2014 at 6:37 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
> Comments inline.
>
> On Tue, Oct 7, 2014 at 5:51 PM, Marcus White <roastedseaweed.k@xxxxxxxxx>
> wrote:
>>
>> Hello,
>> Some basic Ceph questions, would appreciate your help:) Sorry about
>> the number and detail in advance!
>>
>> a. Ceph RADOS is strongly consistent, unlike a typical object store.
>> Does that mean all metadata (container, account, etc.) is also
>> consistent, and that everything is updated in the path of the client
>> operation itself, for a single site?
>
>
> Yes.  In a single site, it's CP out of CAP.


>
>>
>> b. If it is strongly consistent, is that the case across sites also?
>> How can it be performant across geo sites if that is the case, if it's
>> choosing consistency over partition tolerance and availability? For
>> object, I read somewhere that it is now eventually consistent (locally
>> CP, remotely AP) via DR. It gets a bit confusing with all the literature
>> out there. If it is DR, isn't that slightly different from the Swift case?
>
>
> If you're referring to RadosGW Federation, no.  That replication is async.
> The replication has several delays built in, so the fastest you could see
> your data show up in the secondary is about a minute.  Longer if the file
> takes a while to transfer, or you have a lot of activity to replicate.
>
> Each site is still CP.  There is just delay getting data from the primary to
> the secondary.
In that case it is like Swift, just implemented differently. The async
replication makes it eventually consistent across sites, no?

>
>
> If you want CP in multiple locations, that's doable by creating one cluster
> that spans both locations, and tuning the CRUSH rules to make sure the
> object is written to both locations. You really want a low latency
> connection between the two sites.
>
> I tested one cluster in two colos with 20ms of latency between them.  It
> worked, but it was noticeably slow.  I went with two clusters and async
> replication.
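
Just to check my understanding: is something like this hypothetical CRUSH
rule what you mean, i.e. pick two datacenters, then one host (and OSD)
under each? The bucket types, names, and numbers below are made up for
illustration:

rule two_site_replicated {
        ruleset 1
        type replicated
        min_size 2
        max_size 2
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 1 type host
        step emit
}

So with a pool size of 2 you would get one replica per site, and every
write would pay the cross-site latency before it is acknowledged, which I
guess is the slowness you saw.
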
>
>
>>
>>
>> c. For block, is it CP on a single site and then usual DR to another
>> site using snapshotting?
>
>
> Yes.
>
>
>>
>>
>> d. For block, is it just a Linux block device, or is it SCSI? Is it a
>> custom device driver running within Linux that hooks into the block
>> layer? Trying to understand the layering diagram.
>
>
> I'm a bit out of my element here, but there is a kernel module and a FUSE
> module.  The kernel module connects RBD images to a /dev/rbd/... block
> device.  It can then be used however you would use a block device.  Most
> people put a filesystem on it, but it's not required.  I'm really unfamiliar
> with the FUSE module.
>
> Several people are exporting RBD images via iSCSI and Fibre Channel.
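
That helps. So once the kernel module has mapped an image, it really is
just an ordinary block device? Something like this quick sketch (the
/dev/rbd0 path is hypothetical, whatever device the module actually
created) should then work for raw I/O, with no filesystem on top:

import os

# Sketch only: treat an already-mapped RBD image as a plain block device.
DEV = "/dev/rbd0"   # hypothetical; might also appear as /dev/rbd/<pool>/<image>

fd = os.open(DEV, os.O_RDWR)
try:
    os.write(fd, b"raw bytes, no filesystem")   # write at offset 0
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 24))                      # read the same bytes back
finally:
    os.close(fd)
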
>
>>
>> e. Do the snapshot and compression features come from the underlying file
>> system?
>
>
> It depends on the filesystem.  Ceph will emulate any required features that
> the FS doesn't support.  For example, ext4 and XFS have no snapshots, so
> Ceph has to track them itself.  On BtrFS, Ceph uses the native snapshots,
> and it's much quicker because of it.
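
Good to know Ceph tracks the snapshots itself when the FS cannot. From the
client side I assume the interface looks the same either way; via the
python-rbd bindings, something like this sketch (pool, image, and snapshot
names are made up):

import rados, rbd

# Sketch only: RBD-level snapshots, independent of the OSDs' backing filesystem.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")          # pool name is hypothetical
    image = rbd.Image(ioctx, "vm-disk")        # image name is hypothetical
    image.create_snap("before-upgrade")        # point-in-time snapshot
    print([s["name"] for s in image.list_snaps()])
    image.remove_snap("before-upgrade")
    image.close()
    ioctx.close()
finally:
    cluster.shutdown()
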
>
>>
>>
>> f. What is the plan for deduplication? If that comes from the local
>> file system, how would it deduplicate across nodes to achieve the best
>> dedup ratio?
>>
>
> I don't believe Ceph does anything with de-dup.  If the FS underneath has it
> turned on, it can de-dup the stuff it sees, but there's no cluster-wide
> de-dup.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



