[ Please stay on the list. :) ]

On Tue, Jun 18, 2013 at 12:54 PM, Edward Huyer <erhvks@xxxxxxx> wrote:
>> > First questions: Are there obvious flaws or concerns with the
>> > following configuration I should be aware of? Does it even make sense
>> > to try to use ceph here? Anything else I should know, think about, or
>> > do instead of the above?
>>
>> It looks basically fine to me. I'm a little surprised you want to pay the cost of
>> networking and replication for 44TB of storage (instead of just setting up
>> some arrays and exporting them), but I don't know how to admin anything
>> except for my little desktop. ;) The big thing to watch out for is that if you're
>> planning on using the kernel RBD driver, you want a fairly new kernel. (For
>> VMs this obviously isn't a problem when the hypervisor is managing the
>> storage.)
>
> I already have a large chunk of the 10G infrastructure in place, a big pile of storage, and several small piles of storage. The problem I'm running into with the "allocate and export arrays" arrangement is that they're kind of annoying to manage even at this scale, particularly when a portion of their unused storage needs to be reallocated. It will be really ugly when I get stuck doing a forklift upgrade of that MD3200 system in a few years. While I don't expect to get even close to petabyte usage any time soon, the use of network-attached storage here is growing, and the basic rationales of easy growth, easy replacement, self-redistribution, etc., would seem to hold at this smaller scale as well. If you have specific comments or suggestions on that, by all means share. Even if it's something seemingly obvious, I'd rather have it pointed out now than stumble over it later. :)
>
> I'm not too worried about the kernel driver. If all else fails, I'm already running Gentoo on a few of my systems. I'm perfectly happy to add a few more if need be. :)
>
>> > My more specific question relates to the two RAID controllers in the
>> > MD3200, and my intended 2 or 3 copy replication (also striping): What
>> > happens if all OSDs with copies of a piece of data go down for a
>> > period of time, but then the OSDs come back "intact" (e.g. by moving
>> > them to a different controller)?
>>
>> The specifics of what data will migrate where will depend on how you've set
>> up your CRUSH map, when you're updating the CRUSH locations, etc., but if
>> you move an OSD then it will fully participate in recovery and can be used as
>> the authoritative source for data.
>
> OK, so if data chunk "bar" lives only on OSDs 3, 4, and 5, and OSDs 3, 4, and 5 suddenly vanish for some reason but then come back later (with their data intact), the cluster will recover more-or-less gracefully? That is, it *won't* go "sorry, your RBD 'foobarbaz' lost 'bar' for a while, all that data is gone"? I would *assume* it has a way to recover more-or-less gracefully, but it's also not something I want to discover the answer to later. :)

Well, if the data goes away and you try to read it, the request will just hang, and presumably the kernel/hypervisor/block device (whatever is on top) will eventually time out and throw an error. At that point you have a choice between marking the data lost (at which point requests for it will return ENOENT, which RBD will turn into zero blocks) or getting it back online. When you do bring it back online, it will peer and then be accessible again without much fuss (if you only bring back one copy it might kick off a bunch of network and disk traffic while it re-replicates).
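
To make the kernel RBD bit above concrete, here's a rough sketch of creating and mapping an image with the in-kernel client. The pool name "rbd" and the image name "foobarbaz" are just placeholders, and the size argument is in MB:

    # create a 10 GB image in the default "rbd" pool (names are illustrative)
    rbd create foobarbaz --pool rbd --size 10240

    # map it through the kernel client; this is the path that wants a recent kernel
    rbd map rbd/foobarbaz

    # the image shows up as a /dev/rbd* block device and can be formatted/mounted as usual

For VMs this doesn't apply, since qemu/librbd talks to the cluster from userspace and the guest never needs the kernel module.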
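
Similarly, "moving" an OSD is really just updating its CRUSH location once the disk is physically reattached. A sketch of what that looks like (exact syntax varies a bit between releases, and the weight and bucket names here are placeholders):

    # see where osd.3 currently lives in the CRUSH hierarchy
    ceph osd tree

    # after physically relocating the disk, record its new location
    ceph osd crush create-or-move osd.3 1.0 root=default host=nodeB

Once the map reflects reality, the OSD peers and participates in recovery as described above.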
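
And for the worst case where every OSD holding a copy is down, a sketch of the "mark it lost vs. bring it back" choice. The pg id 2.5 is made up, and mark_unfound_lost really does give up on the data, so it's strictly a last resort:

    # see which PGs have unfound objects and which OSDs they are waiting to hear from
    ceph health detail
    ceph pg 2.5 query

    # preferred: bring the down OSDs back online and let the PG peer on its own

    # last resort, if those OSDs are never coming back: give up on the missing objects
    # (this is the ENOENT / zero-block path described above)
    ceph pg 2.5 mark_unfound_lost revert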
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com