> On 16 August 2016 at 15:59, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
>
> Hi,
>
> I've been slowly getting some insight into this, but I haven't yet
> found any compromise that works well.
>
> I'm currently testing Ceph using the librados C bindings directly. There
> are two components that access the storage cluster: one that only
> writes, and another that only reads. Between them, all that happens
> is stat(), write(), and read() on a pool - for efficiency we're using
> the AIO variants.
>
> On the reader side, it's stat(), and read() if an object exists, with
> an nginx proxy cache in front; a fair amount of accesses just stat()
> and return 304 if the IMS headers match the object's mtime. On the
> writer side, it's only stat(), and write() if an object doesn't exist
> in the pool. Each object is independent; there is no relationship
> between it and any other object stored.
>
> If this makes it to production, this pool will have around 1.8 billion
> objects, anywhere between 4 and 30 KB in size. The store itself is
> pretty much 90% write. Of what is read, there is zero locality of
> reference; a given object could be read once, then not again for many
> days. Even then, seek times are mission critical, so we can't have a
> situation where it takes longer than a normal disk's seek time to
> stat() a file. On that front, Ceph has been working very well for us,
> with most requests taking an average of around 6 milliseconds -
> though there are problems related to deep-scrubbing running in the
> background that I may come back to at a later date.
>
> With that brief description out of the way, what probably makes our
> usage unique is that, for the data we do write, it's actually not a
> big deal if it just goes missing. In fact, we've even had a situation
> where 10% of our data was being deleted on a daily basis, and no one
> noticed for months. This is because our writers guarantee that
> whatever isn't present on disk will be regenerated in the next loop
> of their workload.
>
> Probably the one thing we do care about is that our clients continue
> working, no matter what state the cluster is in. Unfortunately, this
> is where a nasty (feature? bug?) side-effect of Ceph's internals comes
> in: when something happens(tm) to cause PGs to go missing, all client
> operations become effectively blocked, indefinitely. That is 100%
> downtime for losing something even as small as 4-6% of the data held,
> which is outrageously undesirable!
>
> When playing around with a test instance, I managed to get normal
> operations to resume using something to the effect of the following
> commands, though I'm not sure which were required or not, probably
> all.
>
> ceph osd down osd.0
> ceph osd out osd.0
> ceph osd lost osd.0
> for pg in $(get list of missing pgs); do ceph pg force_create_pg $pg; done
> ceph osd crush rm osd.0
>
> Only 20 minutes later, after being stuck and stale, and probably
> repeating some steps in a different order (or doing something else
> that I didn't make a note of), did the cluster finally decide to
> recreate the lost PGs on the OSDs still standing, and normal
> operations were unblocked. Some time later, the lost osd.0 came back
> up, though I didn't look too much into whether the objects it held
> were merged back, or just wiped. It wouldn't really make a difference
> either way.
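
For reference, the stat()-then-read() path described above maps onto the
librados C API roughly as follows. This is only a minimal sketch under
assumptions (a pool named "objects", a single hard-coded object name, and
blocking on each completion rather than a real event loop) - not the
poster's actual code:

/* Minimal sketch: AIO stat() then read() of one object via librados.
 * Assumptions: ceph.conf in the default search path, pool "objects",
 * object name "some-object", and we block on each completion. */
#include <rados/librados.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t c;
    uint64_t size = 0;
    time_t mtime = 0;
    const char *oid = "some-object";        /* hypothetical object name */

    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, NULL);    /* default ceph.conf locations */
    if (rados_connect(cluster) < 0) return 1;
    if (rados_ioctx_create(cluster, "objects", &io) < 0) return 1;

    /* Asynchronous stat(): does the object exist, and what is its mtime? */
    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_stat(io, oid, c, &size, &mtime);
    rados_aio_wait_for_complete(c);
    int ret = rados_aio_get_return_value(c);
    rados_aio_release(c);

    if (ret == -ENOENT) {
        printf("%s: not found\n", oid);     /* reader answers 404 / regenerates */
    } else if (ret == 0) {
        /* Object exists: read it in one shot (objects are only 4-30 KB). */
        char *buf = malloc(size);
        rados_aio_create_completion(NULL, NULL, NULL, &c);
        rados_aio_read(io, oid, c, buf, size, 0);
        rados_aio_wait_for_complete(c);
        printf("%s: read %d bytes, mtime %ld\n", oid,
               rados_aio_get_return_value(c), (long)mtime);
        rados_aio_release(c);
        free(buf);
    }

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}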
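
And for anyone trying to reproduce the unblocking sequence above, here is
the same thing spelled out as a script, under the assumption that the
affected PGs are the ones "ceph pg dump_stuck stale" reports. The command
names are the Jewel-era CLI (force_create_pg was later replaced); this is
a sketch of the sequence quoted above, not a recommendation:

#!/bin/sh
# Sketch: give up on a single dead OSD (osd.0) and recreate its PGs.
# Assumption: the PGs to recreate are exactly the ones stuck "stale".

ceph osd down osd.0
ceph osd out osd.0
ceph osd lost 0 --yes-i-really-mean-it

# Recreate every PG that is stuck stale (their data is gone for good).
for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '/^[0-9]+\./ {print $1}')
do
    ceph pg force_create_pg "$pg"
done

ceph osd crush rm osd.0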
> So, the short question I have: is there a way to keep Ceph running
> following a small data loss that would be completely catastrophic in
> probably any situation except my specific use case? Increasing the
> replication count isn't a solution I can afford. In fact, given the
> relationship this application has with the storage layer, it would
> actually be better off without any sort of replication whatsoever.
>
> The desired behaviour for me would be for the client to get an instant
> "not found" response from stat() operations, for write() to recreate
> unfound objects, and for missing placement groups to be recreated on
> an OSD that isn't overloaded. Halting the entire cluster when 96% of
> it can still be accessed is just not workable, I'm afraid.
>

Well, you can't make Ceph do that, but you can make librados do such a
thing. I'm using the OSD and MON timeout settings in libvirt, for example:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157

You can set these options:

- client_mount_timeout
- rados_mon_op_timeout
- rados_osd_op_timeout

I think only the last two should be sufficient in your case. You will get
ETIMEDOUT back as the error when an operation times out. (A small sketch
of setting these through rados_conf_set() is at the bottom of this mail.)

Wido

> Thanks ahead of time for any suggestions.
>
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
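
For completeness, here is roughly what those timeouts look like from the
librados C API. Only a sketch under assumptions: the 30-second values are
arbitrary, and the program simply bails out instead of retrying. The
option names are the ones listed above, passed as strings to
rados_conf_set() before rados_connect():

/* Sketch: configure librados so operations fail with -ETIMEDOUT instead
 * of blocking forever when PGs/OSDs are unavailable. The 30-second
 * values are arbitrary; pick whatever your latency budget allows. */
#include <rados/librados.h>
#include <stdio.h>
#include <errno.h>

static int connect_with_timeouts(rados_t *cluster)
{
    int r = rados_create(cluster, NULL);
    if (r < 0)
        return r;

    rados_conf_read_file(*cluster, NULL);

    /* Values are strings; units are seconds. */
    rados_conf_set(*cluster, "client_mount_timeout", "30");
    rados_conf_set(*cluster, "rados_mon_op_timeout", "30");
    rados_conf_set(*cluster, "rados_osd_op_timeout", "30");

    return rados_connect(*cluster);
}

int main(void)
{
    rados_t cluster;
    int r = connect_with_timeouts(&cluster);
    if (r < 0) {
        fprintf(stderr, "connect failed: %d\n", r);
        return 1;
    }
    /* ... create an ioctx and issue AIO ops as usual; an operation that
     * would previously hang on a missing PG should now come back with
     * -ETIMEDOUT, which the reader can treat as "not found" and the
     * writer as "regenerate on the next pass". */
    rados_shutdown(cluster);
    return 0;
}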