Hi Pavel,
Will try and answer some of your questions:
> My first question is about the monitor data directory. How much space do I need to reserve for it? Can the monitor fs be corrupted if the monitor runs out of storage space?
We have about 20GB partitions for our monitors - they really don't use much space, but the headroom is nice to have in case you need to do some extra logging (ceph at max debug levels consumes scary amounts of space).
Also, if you look in the monitor log, the monitors constantly check their free space. I don't know exactly what happens if a monitor runs full (or close to full), but I'm guessing the monitor will simply be marked down or stopped somehow. You can also change some of the mon settings for how much data to keep before trimming, etc.
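For reference, these are the settings I have in mind - I'm quoting option names and defaults from memory, so double-check them against the docs for your release:

    # ceph.conf - monitor disk space thresholds (percent free on the mon disk)
    [mon]
        mon data avail warn = 30    ; warn in 'ceph health' below 30% free
        mon data avail crit = 5     ; critical below 5% free
        mon compact on start = true ; compact the mon store on daemon start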
> I also have questions about the ceph auto-recovery process.
> For example, I have two nodes with 8 drives each, and each drive is presented as a separate osd. The number of replicas = 2. I have written a crush ruleset which picks two nodes and one osd on each to store the replicas. What will happen in the following scenarios:
> 1. One drive in one node fails. Will ceph automatically re-replicate the affected objects? Where will the replicas be stored?
Yes. As long as you have available space on the node that lost the OSD, the data that was on that disk will be distributed across the remaining 7 OSDs on that node (according to your CRUSH rules).
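You can try this out safely by marking an OSD out and watching the cluster rebalance - a rough sketch (osd.3 is just an example id):

    # mark one OSD out; its data gets remapped per the CRUSH rules
    ceph osd out osd.3
    # watch recovery/backfill progress and the degraded object count
    ceph -w
    # or poll the summary instead
    ceph -s
    ceph health detail
    # when done testing, bring it back in and the data migrates back
    ceph osd in osd.3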
> 1.1 The failed osd appears online again with all of its data. How will the ceph cluster deal with it?
This is just how I _think_ it works; please correct me if I'm wrong. Every OSD has an internal map (the pg map), which is constantly updated throughout the cluster. When an OSD goes offline/down and is later started back up, that OSD's last pgmap is 'diffed' against the latest map from the cluster. The cluster can then work out what the returning OSD has, what is missing or has been updated, and generate a new map with the objects the newly started OSD should have. The OSD then starts to replicate, fetching only the changed/new objects.
Bottom line, this really just works and works very well.
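If you want to poke at this map machinery yourself, the cluster exposes it - for example (the pool and object names here are made up):

    # current osdmap epoch plus the up/in state of every OSD
    ceph osd dump | head
    # per-pg state: active+clean, degraded, recovering, etc.
    ceph pg dump | less
    # which OSDs a given object maps to, according to CRUSH
    ceph osd map rbd myobject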
> 2. One node (with 8 osds) goes offline. Will ceph automatically replicate all objects on the remaining node to maintain the number of replicas = 2?
No, because it can no longer satisfy your CRUSH rule. Your rule says one copy per node, and it will keep it that way. The cluster will go into a degraded state until you can bring up another node (i.e. all your data is now very vulnerable). It is often suggested to run with 3x replicas if possible - or at the very least nr_nodes = replicas + 1. If you had to make it replicate on the remaining node, you'd have to change your CRUSH rule to replicate based on OSD rather than node. But then you'll most likely have problems when a node dies, because both copies of an object could easily end up on 2 OSDs in the failed node. See the sketch below.
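For illustration, the two variants would look roughly like this in a decompiled crushmap (rule names and ruleset ids are made up, syntax from memory):

    # one replica per host - your current setup; goes degraded if a host dies
    rule rep_per_host {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

    # one replica per osd - recovers on the surviving node, but both copies
    # of an object can then land on the same host
    rule rep_per_osd {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type osd
            step emit
    }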
> 2.1 The failed node comes online again with all its data. How will the ceph cluster deal with it?
Same as the above with the OSD.
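One related tip for planned downtime: you can tell the cluster not to mark OSDs out (and so not start re-replicating) while you take a node down for maintenance:

    # before taking the node down
    ceph osd set noout
    # ... reboot/repair the node; its OSDs rejoin and just catch up ...
    # afterwards, let the cluster handle failures normally again
    ceph osd unset noout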
Cheers,
Martin
> Thanks in advance,
> Pavel.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com