Re: some newbie questions...

On Aug 20, 2013, at 15:18 , Johannes Klarenbeek <Johannes.Klarenbeek@xxxxxxx> wrote:

>  
>  
> Van: Wolfgang Hennerbichler [mailto:wogri@xxxxxxxxx] 
> Verzonden: dinsdag 20 augustus 2013 10:51
> Aan: Johannes Klarenbeek
> CC: ceph-users@xxxxxxxxxxxxxx
> Onderwerp: Re:  some newbie questions...
>  
> On Aug 20, 2013, at 09:54 , Johannes Klarenbeek <Johannes.Klarenbeek@xxxxxxx> wrote:
> 
> > dear ceph-users,
> > 
> > Although heavily active in the past, I didn't touch Linux for years, so I'm pretty new to Ceph and I have a few questions which I hope someone could answer for me.
> > 
> > 1) I read somewhere that it is recommended to have one OSD per disk in a production environment.
> >    Is that also the maximum (one disk per OSD), or could I use multiple disks per OSD? And why?
> 
> You could use multiple disks for one OSD if you used some striping layer to abstract the disks (like LVM, MD RAID, etc.), but it wouldn't make sense. One OSD writes into one filesystem, which in a production environment is usually one disk. Using RAID underneath it wouldn't drastically increase either reliability or performance.
> 
> Ok, that's cleared up! Are you also saying that on a pure Ceph machine I should not install LVM either, since that is unnecessary overhead?

yep. that's what I'm saying. 
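
In practice that just means one raw data disk per OSD, for example with ceph-deploy; something along these lines should do it (hostname and device names are placeholders, adjust to your hardware):

    ceph-deploy osd create node1:sdb     # one OSD on /dev/sdb
    ceph-deploy osd create node1:sdc     # a second OSD on /dev/sdc
    ceph-deploy osd create node1:sdd     # and so on, one OSD per disk

ceph-deploy partitions the disk, creates the filesystem and registers the OSD, so there is no LVM or RAID layer in between.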

> 
> > 2) I've read about some use cases where the cluster consisted of a monitor and some OSDs but no MDS.
> >    Is that possible? I believe the MDS is used as a floating file system for inode storage on the OSD cluster nodes.
> >    So without it, could the cluster be used for object storage of some kind?
> 
> RADOS is an object store, which needs only MONs and OSDs to work. If you want to 'connect' those objects so they behave like a disk drive, you would use RBD; you still only need MONs and OSDs.
> If you want to expose the object store through HTTP, you additionally need radosgw.
> If you want a distributed filesystem on top of the object store, you would configure and run the MDS, too.
> 
> Yes, I understand, but if I would like to run only CephFS, do I need radosgw? My guess is no.

no, you don't need radosgw for CephFS.
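
To make the block-device case concrete: everything there can be done with just MONs and OSDs, roughly like this (pool and image names are placeholders, and the PG count is only an example):

    ceph osd pool create mypool 128        # 128 placement groups, pick a value that fits your cluster
    rbd create mypool/mydisk --size 10240  # a 10 GB image in that pool
    rbd map mypool/mydisk                  # shows up as /dev/rbd/mypool/mydisk

No MDS or radosgw is involved anywhere in that path.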

> 
> > 3) In a SAN configuration where I only expose iSCSI targets to the "outside" world, do I need radosgw, MDS or CephFS?
> 
> no. Great!
> 
> >    I'd prefer some sort of plug-in that exposes iSCSI targets directly on top of the RBD interface. But then again, how would you
> >    manage these virtual disks without CephFS…
> 
> FS = filesystem.
> RBD = block device. You manage these virtual disks either with iscsitgtd (I hope the name is correct), which has RBD support built in, or you map the RBD to expose it to the system like a disk drive (e.g. /dev/rbd/mypool/mydisk).
> 
> Will look it up, thank you :) Is that load balanced by any chance? I mean, if I set up 2 client machines, both with iscsitgtd and nothing else, can I configure them both to expose the same target (for shared storage purposes)?

you could use keepalived for that. But I read on this mailing list a couple of months ago that one really needs to understand iSCSI in order to build a working failover scenario. I haven't tried that myself.
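
If the target daemon turns out to be tgt (stgt) built with the rbd backing store, exposing an RBD image looks roughly like this (target name, pool and image are placeholders; check that your tgt build actually includes bs_rbd before relying on this sketch):

    tgtadm --lld iscsi --mode target --op new --tid 1 \
           --targetname iqn.2013-08.com.example:rbd-store
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
           --bstype rbd --backing-store mypool/mydisk
    tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

The alternative is to rbd map the image on the gateway host and export the resulting /dev/rbd device with the normal block-device backing store.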

> > 4) Since we like to think green too, is it possible to shut down nodes,
> >    or at least put them into a sort of standby mode after office hours?
> 
> You could turn off the whole cluster. Turning off single nodes would result in ceph rebalancing the data, and this would not be wise.
> 
> Hmmm, so RBD doesn't have any sleep command or something, in order to let the cluster know it is bedtime.

no. 

> Thinking about that, I'm not so sure it's possible to turn off a whole cluster. In the fraction of a second that the other nodes are still up, the cluster could already trigger Ceph rebalancing.

not true. Ceph will wait for a user-definable amount of time before it starts rebalancing, so you would have enough time to shut the cluster down. You could also shut down half of your nodes and configure Ceph in a way that it won't rebalance; it will still be accessible and working, but if you modify data in the "green" time and a disk fails, you won't have a replicated dataset of that data. I wouldn't recommend doing this. Either shut down the whole cluster or leave it as is.
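
The knobs for this are roughly the following (values are examples, not recommendations): before a planned full shutdown you can tell the cluster not to start re-replication at all,

    ceph osd set noout      # OSDs that go down are not marked "out", so no rebalancing starts
    # ... power the nodes off and back on ...
    ceph osd unset noout    # restore normal recovery behaviour afterwards

and the user-definable waiting time before a down OSD is marked out is "mon osd down out interval" (in seconds) in ceph.conf.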

> I guess there is no plausible way (yet) for a green mode then.

I think it's better to buy power-efficient machines and hard drives, and to put some effort into a useful but minimalistic setup that can be easily expanded later (that's what Ceph is best at).

> > 5) Many of the production deployments out there use XFS as their base file system; however, since XFS lacks copy-on-write, most of these
> >    systems use extra SSDs for the journal. So from this I gather that it's somehow possible to assign this
> >    journaling role to some dedicated machines. How?
> 
> No. You can assign this journaling to dedicated disks, not dedicated machines. So if you have 12 OSDs in a machine, you could use 3 additional SSDs to hold the journals for 4 OSDs each.
> 
> What is the journal/storage ratio for this?

the rule of thumb is one SSD for 4 OSDs, but in fact this really depends on your hardware. 
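
With ceph-deploy that mapping is just the optional third field of the OSD spec, something like this (hostname and device names are placeholders; the SSD is partitioned into one journal partition per OSD):

    ceph-deploy osd create node1:sdb:/dev/sdn1   # data on sdb, journal on SSD partition sdn1
    ceph-deploy osd create node1:sdc:/dev/sdn2   # data on sdc, journal on sdn2
    ceph-deploy osd create node1:sdd:/dev/sdn3   # and so on

The ratio mainly has to stay within what the SSD can sustain in write bandwidth, which is why it really depends on the hardware.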

> > 6) In a picture-perfect world I would use btrfs, but I hear some complaints about it not being stable enough. How is that in the current version
> >    of Ubuntu 13.04, for example? I'm on a very short time schedule and have to come up with a solution. For example, I could as well
> >    install the first few nodes with XFS and then later on add some other machines with btrfs, but that is not my preferred scenario.
> >    If a node with btrfs corrupts, does that mean all the other nodes are likely to corrupt as well?
> 
> I would not recommend using btrfs for now. You could migrate any time later (bring an OSD down, reformat with btrfs, bring it up again, the data will be moved 'back' automatically). I use XFS in a production cluster and I am happy with it.
> 
> However, you need additional SSDs for journaling, as you mentioned before, in order to use all the other disks for storage.

you could put the journal on the OSD disks (that's what I do). That means data you write is written twice (journal and XFS); in our use case (virtualisation for machines that are not I/O hungry) this doesn't matter. Controller caches (and qemu caching) still make this very quick.
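
That colocated layout is also what Ceph does by default: the journal is simply a file inside the OSD's data directory. In ceph.conf terms it looks roughly like this (the size is an example value):

    [osd]
        osd journal = /var/lib/ceph/osd/$cluster-$id/journal   ; journal file on the OSD's own disk
        osd journal size = 5120                                ; journal size in MB

Pointing "osd journal" at an SSD partition instead is what moves it off the data disk.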

> 
> > 7) I followed some example setup from your website, and when installing Ceph I see a lot of packages being installed ending with -dev.
> >    These are probably only needed when you want to build from git yourself. However, is that really necessary, or can I just grab the latest binaries somewhere?
> 
> I don't know how you installed Ceph and on which distribution. The -dev packages usually contain the header files, so that you can compile, for example, qemu, which depends on rbd.
> 
> Ubuntu 13.04. I picked the newest one in the hope that btrfs was working. I read something on the ceph.com website claiming that many bugs have been fixed in the last release. However, since I subscribed to the ceph-users mailing list, I understand that there are still a lot of problems with dumpling. Should I focus on cuttlefish first? But then again, if I want to upgrade, I probably need the header files as well?!

upgrading Ceph releases is really easy and can be done with no downtime (this is actually better than anything I've seen before). If I were you, I would start with dumpling. Sage and his team are developing at such a fast pace that most bugs will be fixed by the time your cluster is production-ready.
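
A rolling upgrade is basically "install the new packages, then restart the daemons one type at a time": monitors first, then OSDs, then MDS/radosgw. Roughly (this sketch assumes the sysvinit-style ceph init script and uses placeholder daemon IDs):

    sudo apt-get update && sudo apt-get install ceph    # pull the new release on each node
    sudo service ceph restart mon.a                     # restart each monitor in turn
    sudo service ceph restart osd.0                     # then each OSD, waiting for HEALTH_OK in between

You don't need any of the -dev packages for this unless you really are rebuilding qemu or other software against the Ceph libraries.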

Wolfgang
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




