Hi,

I've been slowly getting some insight into this, but I haven't yet found any compromise that works well.

I'm currently testing Ceph using the librados C bindings directly. There are two components that access the storage cluster: one that only writes, and another that only reads. Between them, all that happens is stat(), write(), and read() on a pool - for efficiency we're using the AIO variants (a rough sketch of the reader path is further down, for reference). On the reader side, it's stat(), then read() if the object exists, with an nginx proxy cache in front; a fair amount of accesses just stat() and return 304 if the IMS headers match the object's mtime. On the writer side, it's only stat(), then write() if the object doesn't already exist in the pool. Each object is independent; there is no relationship between it and any other object stored.

If this makes it to production, the pool will hold around 1.8 billion objects, anywhere between 4 and 30KB in size. The store itself is pretty much 90% write. Of what is read, there is zero locality of reference: a given object could be read once, then not again for many days. Even so, the seek times are mission critical, so we can't have a situation where it takes longer than a normal disk's seek time to stat() an object. On that front, Ceph has been working very well for us, with most requests averaging around 6 milliseconds - though there are problems related to deep-scrubbing running in the background that I may come back to at a later date.

With that brief description out of the way, what probably makes our usage somewhat unique is that it's actually not a big deal if the data we write just goes missing. In fact, we've even had a situation where 10% of our data was being deleted on a daily basis, and no one noticed for months. This is because our writers guarantee that whatever isn't present on disk will be regenerated in the next loop of their workload. Probably the one thing we do care about is that our clients continue working, no matter what state the cluster is in.

Unfortunately, this is where a nasty (feature? bug?) side-effect of Ceph's internals comes in: when something happens(tm) that causes PGs to go missing, all client operations become effectively blocked, and indefinitely so. That is 100% downtime for losing something even as small as 4-6% of the data held, which is outrageously undesirable!

When playing around with a test instance, I managed to get normal operations to resume using something to the effect of the following commands, though I'm not sure which were required or not - probably all:

    ceph osd down osd.0
    ceph osd out osd.0
    ceph osd lost 0 --yes-i-really-mean-it
    for pg in $(get list of missing pgs); do ceph pg force_create_pg "$pg" & done
    ceph osd crush rm osd.0

Only some 20 minutes after being stuck and stale - probably after repeating some steps in a different order, or doing something else that I didn't make a note of - did the cluster finally decide to recreate the lost PGs on the OSDs still standing, and normal operations were unblocked. Some time later, the lost osd.0 came back up, though I didn't look too much into whether the objects it held were merged back or just wiped. It wouldn't really make a difference either way.

So, the short question I have: is there a way to keep Ceph running following a small data loss that would be completely catastrophic in probably any situation except my specific use case?
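For reference, this is roughly what the reader path described above looks like. It's a minimal sketch against the librados C API: the function name, the HTTP-style return codes, and the inline waits on the completions are illustrative rather than a copy of our actual code.

/* Reader path: stat() the object first; answer "not found" immediately if it
 * doesn't exist, 304 if the If-Modified-Since time matches the mtime, and
 * only then read() the whole object (they're all 4-30KB). */
#include <rados/librados.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

static int read_object(rados_ioctx_t io, const char *oid, time_t if_modified_since)
{
    uint64_t size = 0;
    time_t mtime = 0;
    rados_completion_t c;
    int r;

    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_stat(io, oid, c, &size, &mtime);
    rados_aio_wait_for_complete(c);
    r = rados_aio_get_return_value(c);
    rados_aio_release(c);

    if (r == -ENOENT)
        return 404;                /* what we want: an instant "not found" */
    if (r < 0)
        return 500;                /* anything else from the cluster */
    if (if_modified_since && mtime <= if_modified_since)
        return 304;                /* IMS matches, no read needed */

    char *buf = malloc(size ? size : 1);
    if (!buf)
        return 500;

    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_read(io, oid, c, buf, size, 0);
    rados_aio_wait_for_complete(c);
    r = rados_aio_get_return_value(c); /* bytes read, or -errno */
    rados_aio_release(c);

    /* ... hand buf off to the HTTP layer here ... */
    free(buf);
    return r < 0 ? 500 : 200;
}

The writer side is the mirror image: the same stat(), followed by a write() only when the stat comes back -ENOENT.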
Increasing the replication count isn't a solution I can afford. In fact, given the relationship this application has with the storage layer, it would actually be better off without any sort of replication whatsoever. The desired behaviour for me would be for the client to get an instant "not found" response from stat() operations, for write() to recreate unfound objects, and for missing placement groups to be recreated on an OSD that isn't overloaded. Halting the entire cluster when 96% of it can still be accessed is just not workable, I'm afraid.

Thanks ahead of time for any suggestions.

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';