> On 16 August 2016 at 15:59, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
>
> Hi,
>
> I've been slowly getting some insight into this, but I haven't yet
> found any compromise that works well.
>
> I'm currently testing Ceph using the librados C bindings directly. There
> are two components that access the storage cluster: one that only
> writes, and another that only reads. Between them, all that happens
> is stat(), write(), and read() on a pool - for efficiency we're using
> the AIO variants.
>
> On the reader side, it's stat(), and read() if an object exists, with
> an nginx proxy cache in front; a fair amount of accesses just stat()
> and return 304 if the IMS headers match the object's mtime. On the
> writer side, it's only stat(), and write() if an object doesn't exist
> in the pool. Each object is independent; there is no relationship
> between it and any other object stored.
>
> If this makes it to production, this pool will have around 1.8 billion
> objects, anywhere between 4 and 30 KB in size. The store itself is
> pretty much 90% write. Of what is read, there is zero locality of
> reference; a given object could be read once, then not again for many
> days. Even then, seek times are mission critical, so we can't have a
> situation where it takes longer than a normal disk's seek time to
> stat() a file. On that front, Ceph has been working very well for us,
> with most requests taking an average of around 6 milliseconds -
> though there are problems related to deep-scrubbing running in the
> background that I may come back to at a later date.
>
> With that brief description out of the way, what probably makes our
> usage unique is that, for the data we do write, it's actually not a
> big deal if it just goes missing. In fact, we've even had a situation
> where 10% of our data was being deleted on a daily basis, and no one
> noticed for months. This is because our writers guarantee that
> whatever isn't present on disk will be regenerated in the next loop
> of their workload.
>
> Probably the one thing we do care about is that our clients continue
> working, no matter what state the cluster is in. Unfortunately, this
> is where a nasty (feature? bug?) side-effect of Ceph's internals comes
> in: when something happens(tm) to cause PGs to go missing, all client
> operations become effectively blocked, indefinitely. That is 100%
> downtime for losing something even as small as 4-6% of the data held,
> which is outrageously undesirable!
>
> When playing around with a test instance, I managed to get normal
> operations to resume using something to the effect of the following
> commands, though I'm not sure which were required or not, probably
> all.
>
> ceph osd down osd.0
> ceph osd out osd.0
> ceph osd lost osd.0
> for pg in $(get list of missing pgs); do ceph pg force_create_pg $pg; done
> ceph osd crush rm osd.0
>
> Only 20 minutes later, after being stuck and stale, and probably
> repeating some steps in a different order (or doing something else
> that I didn't make a note of), did the cluster finally decide to
> recreate the lost PGs on the OSDs still standing, and normal
> operations were unblocked. Some time later, the lost osd.0 came back
> up, though I didn't look too much into whether the objects it held
> were merged back, or just wiped. It wouldn't really make a difference
> either way.
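
For reference, the stat()-then-read() path described above maps onto the
librados C API roughly as follows. This is only a minimal sketch under
assumptions (a pool named "objects", a single hard-coded object name, and
blocking on each completion rather than a real event loop) - not the
poster's actual code:

/* Minimal sketch: AIO stat() then read() of one object via librados.
 * Assumptions: ceph.conf in the default search path, pool "objects",
 * object name "some-object", and we block on each completion. */
#include <rados/librados.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t c;
    uint64_t size = 0;
    time_t mtime = 0;
    const char *oid = "some-object";        /* hypothetical object name */

    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, NULL);    /* default ceph.conf locations */
    if (rados_connect(cluster) < 0) return 1;
    if (rados_ioctx_create(cluster, "objects", &io) < 0) return 1;

    /* Asynchronous stat(): does the object exist, and what is its mtime? */
    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_stat(io, oid, c, &size, &mtime);
    rados_aio_wait_for_complete(c);
    int ret = rados_aio_get_return_value(c);
    rados_aio_release(c);

    if (ret == -ENOENT) {
        printf("%s: not found\n", oid);     /* reader answers 404 / regenerates */
    } else if (ret == 0) {
        /* Object exists: read it in one shot (objects are only 4-30 KB). */
        char *buf = malloc(size);
        rados_aio_create_completion(NULL, NULL, NULL, &c);
        rados_aio_read(io, oid, c, buf, size, 0);
        rados_aio_wait_for_complete(c);
        printf("%s: read %d bytes, mtime %ld\n", oid,
               rados_aio_get_return_value(c), (long)mtime);
        rados_aio_release(c);
        free(buf);
    }

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}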
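
And for anyone trying to reproduce the unblocking sequence above, here is
the same thing spelled out as a script, under the assumption that the
affected PGs are the ones "ceph pg dump_stuck stale" reports. The command
names are the Jewel-era CLI (force_create_pg was later replaced); this is
a sketch of the sequence quoted above, not a recommendation:

#!/bin/sh
# Sketch: give up on a single dead OSD (osd.0) and recreate its PGs.
# Assumption: the PGs to recreate are exactly the ones stuck "stale".

ceph osd down osd.0
ceph osd out osd.0
ceph osd lost 0 --yes-i-really-mean-it

# Recreate every PG that is stuck stale (their data is gone for good).
for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '/^[0-9]+\./ {print $1}')
do
    ceph pg force_create_pg "$pg"
done

ceph osd crush rm osd.0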
> So, the short question I have: is there a way to keep Ceph running
> following a small data loss that would be completely catastrophic in
> probably any situation except my specific use case? Increasing the
> replication count isn't a solution I can afford. In fact, given the
> relationship this application has with the storage layer, it would
> actually be better off without any sort of replication whatsoever.
>
> The desired behaviour for me would be for the client to get an instant
> "not found" response from stat() operations, for write() to recreate
> unfound objects, and for missing placement groups to be recreated on
> an OSD that isn't overloaded. Halting the entire cluster when 96% of
> it can still be accessed is just not workable, I'm afraid.
>

Well, you can't make Ceph do that, but you can make librados do such a
thing. I'm using the OSD and MON timeout settings in libvirt, for example:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157

You can set these options:

- client_mount_timeout
- rados_mon_op_timeout
- rados_osd_op_timeout

I think only the last two should be sufficient in your case. You will get
ETIMEDOUT back as the error when an operation times out. (A small sketch
of setting these through rados_conf_set() is at the bottom of this mail.)

Wido

> Thanks ahead of time for any suggestions.
>
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
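
For completeness, here is roughly what those timeouts look like from the
librados C API. Only a sketch under assumptions: the 30-second values are
arbitrary, and the program simply bails out instead of retrying. The
option names are the ones listed above, passed as strings to
rados_conf_set() before rados_connect():

/* Sketch: configure librados so operations fail with -ETIMEDOUT instead
 * of blocking forever when PGs/OSDs are unavailable. The 30-second
 * values are arbitrary; pick whatever your latency budget allows. */
#include <rados/librados.h>
#include <stdio.h>
#include <errno.h>

static int connect_with_timeouts(rados_t *cluster)
{
    int r = rados_create(cluster, NULL);
    if (r < 0)
        return r;

    rados_conf_read_file(*cluster, NULL);

    /* Values are strings; units are seconds. */
    rados_conf_set(*cluster, "client_mount_timeout", "30");
    rados_conf_set(*cluster, "rados_mon_op_timeout", "30");
    rados_conf_set(*cluster, "rados_osd_op_timeout", "30");

    return rados_connect(*cluster);
}

int main(void)
{
    rados_t cluster;
    int r = connect_with_timeouts(&cluster);
    if (r < 0) {
        fprintf(stderr, "connect failed: %d\n", r);
        return 1;
    }
    /* ... create an ioctx and issue AIO ops as usual; an operation that
     * would previously hang on a missing PG should now come back with
     * -ETIMEDOUT, which the reader can treat as "not found" and the
     * writer as "regenerate on the next pass". */
    rados_shutdown(cluster);
    return 0;
}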