If you have an unfortunate series of OSD failures, you can get into a situation where you know that objects were modified, but you weren't able to copy the data off of the OSDs that contained them before they failed. Or, you only have what are (at least potentially) stale copies, but the OSDs with the most recent copies are down and declared (by an administrator) "lost" and irretrievable.

The current strategy is/was to go through and log LOST events for any object for which we have no copy. Essentially it is treated like a delete: the object is gone. For objects where we have an old version of the data, a LOST_REVERT event would be logged and we'd revert back to the old content (this second case isn't implemented yet).

I wonder if a better strategy would be to _not_ delete the objects, but to create a placeholder and mark it such that any attempt to read it returns EIO or ESTALE or something along those lines. That would let an application know when data is gone, instead of 'silently' (well, at the behest of a desperate administrator) losing it. Operations like remove and replace would succeed, but reads would not. Stale objects could then always be removed on a per-object basis.

The other nice thing about this is that currently the peering phase stalls until it locates all lost objects. But we know exactly which objects those are, so we could go active (allowing IO to the rest of the PG) and stall only requests for those objects, until they are located or declared lost. (Actually, we can make this change regardless of whether we decide to mark or silently delete/revert.)

Any thoughts here?

sage
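
A rough sketch of the proposed marker semantics, as a toy in-memory model (the names `ObjectStore`, `mark_lost`, and `LOST_MARKER` are hypothetical illustrations, not actual Ceph code): reads of a lost object fail with EIO, while remove and full overwrite succeed and clear the marker.

```python
import errno

# Sentinel standing in for the proposed "lost object" placeholder.
LOST_MARKER = object()

class ObjectStore:
    """Toy model of the proposed semantics; not real Ceph code."""

    def __init__(self):
        self.objects = {}

    def mark_lost(self, name):
        # Instead of deleting, replace the content with a placeholder.
        self.objects[name] = LOST_MARKER

    def read(self, name):
        data = self.objects[name]
        if data is LOST_MARKER:
            # Reads fail loudly so the application knows the data is gone.
            raise OSError(errno.EIO, "object data was lost")
        return data

    def write(self, name, data):
        # A full overwrite (replace) succeeds and clears the marker.
        self.objects[name] = data

    def remove(self, name):
        # Removal succeeds too, on a per-object basis.
        del self.objects[name]
```

In this model an application that tries to read a lost object gets an immediate EIO rather than silently reading nothing, but can still delete or rewrite it to move on.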