Hi,

I've been slowly getting some insight into this, but I haven't yet found any compromise that works well.

I'm currently testing Ceph using the librados C bindings directly. There are two components that access the storage cluster: one that only writes, and another that only reads. Between them, all that happens is stat(), write(), and read() on a pool - for efficiency we're using the AIO variants (a rough sketch of the reader path is further down, for reference). On the reader side, it's stat(), then read() if the object exists, with an nginx proxy cache in front; a fair amount of accesses just stat() and return 304 if the IMS headers match the object's mtime. On the writer side, it's only stat(), then write() if the object doesn't already exist in the pool. Each object is independent; there is no relationship between it and any other object stored.

If this makes it to production, the pool will hold around 1.8 billion objects, anywhere between 4 and 30KB in size. The store itself is pretty much 90% write. Of what is read, there is zero locality of reference: a given object could be read once, then not again for many days. Even so, the seek times are mission critical, so we can't have a situation where it takes longer than a normal disk's seek time to stat() an object. On that front, Ceph has been working very well for us, with most requests averaging around 6 milliseconds - though there are problems related to deep-scrubbing running in the background that I may come back to at a later date.

With that brief description out of the way, what probably makes our usage somewhat unique is that it's actually not a big deal if the data we write just goes missing. In fact, we've even had a situation where 10% of our data was being deleted on a daily basis, and no one noticed for months. This is because our writers guarantee that whatever isn't present on disk will be regenerated in the next loop of their workload. Probably the one thing we do care about is that our clients continue working, no matter what state the cluster is in.

Unfortunately, this is where a nasty (feature? bug?) side-effect of Ceph's internals comes in: when something happens(tm) that causes PGs to go missing, all client operations become effectively blocked, and indefinitely so. That is 100% downtime for losing something even as small as 4-6% of the data held, which is outrageously undesirable!

When playing around with a test instance, I managed to get normal operations to resume using something to the effect of the following commands, though I'm not sure which were required or not - probably all:

    ceph osd down osd.0
    ceph osd out osd.0
    ceph osd lost 0 --yes-i-really-mean-it
    for pg in $(get list of missing pgs); do ceph pg force_create_pg "$pg" & done
    ceph osd crush rm osd.0

Only some 20 minutes after being stuck and stale - probably after repeating some steps in a different order, or doing something else that I didn't make a note of - did the cluster finally decide to recreate the lost PGs on the OSDs still standing, and normal operations were unblocked. Some time later, the lost osd.0 came back up, though I didn't look too much into whether the objects it held were merged back or just wiped. It wouldn't really make a difference either way.

So, the short question I have: is there a way to keep Ceph running following a small data loss that would be completely catastrophic in probably any situation except my specific use case?
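For reference, this is roughly what the reader path described above looks like. It's a minimal sketch against the librados C API: the function name, the HTTP-style return codes, and the inline waits on the completions are illustrative rather than a copy of our actual code.

/* Reader path: stat() the object first; answer "not found" immediately if it
 * doesn't exist, 304 if the If-Modified-Since time matches the mtime, and
 * only then read() the whole object (they're all 4-30KB). */
#include <rados/librados.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

static int read_object(rados_ioctx_t io, const char *oid, time_t if_modified_since)
{
    uint64_t size = 0;
    time_t mtime = 0;
    rados_completion_t c;
    int r;

    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_stat(io, oid, c, &size, &mtime);
    rados_aio_wait_for_complete(c);
    r = rados_aio_get_return_value(c);
    rados_aio_release(c);

    if (r == -ENOENT)
        return 404;                /* what we want: an instant "not found" */
    if (r < 0)
        return 500;                /* anything else from the cluster */
    if (if_modified_since && mtime <= if_modified_since)
        return 304;                /* IMS matches, no read needed */

    char *buf = malloc(size ? size : 1);
    if (!buf)
        return 500;

    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_read(io, oid, c, buf, size, 0);
    rados_aio_wait_for_complete(c);
    r = rados_aio_get_return_value(c); /* bytes read, or -errno */
    rados_aio_release(c);

    /* ... hand buf off to the HTTP layer here ... */
    free(buf);
    return r < 0 ? 500 : 200;
}

The writer side is the mirror image: the same stat(), followed by a write() only when the stat comes back -ENOENT.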
Increasing the replication count isn't a solution I can afford. In fact, given the relationship this application has with the storage layer, it would actually be better off without any sort of replication whatsoever. The desired behaviour for me would be for the client to get an instant "not found" response from stat() operations, for write() to recreate unfound objects, and for missing placement groups to be recreated on an OSD that isn't overloaded. Halting the entire cluster when 96% of it can still be accessed is just not workable, I'm afraid.

Thanks ahead of time for any suggestions.

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';