On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
> On Jun 11, 2014, at 6:33 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>
>> On Tue, May 20, 2014 at 6:44 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
>>> Hi ceph-devel,
>>> Like some other Ceph users, we are using Ceph for a latency-sensitive
>>> project, and scrubbing (especially deep-scrubbing) impacts the SLA in a
>>> non-trivial way. Since commodity hardware can fail in one way or
>>> another, I think it is essential to keep scrubbing enabled to preserve
>>> data durability.
>>>
>>> Inspired by how the erasure coding backend implements scrubbing [1], I am
>>> wondering whether the following changes are a valid way to reduce the
>>> performance impact of scrubbing:
>>> 1. Store the CRC checksum along with each physical copy of the object
>>>    on the filesystem (via xattr or omap?).
>>> 2. For a read request, check the CRC locally and, if it mismatches,
>>>    redirect the request to a replica and mark the PG as inconsistent.
>>
>> The problem with this is that you need to maintain the CRC across
>> partial overwrites of the object. And the real cost of scrubbing isn't
>> in the network traffic, it's in the disk reads, which you would have
>> to do anyway with this method. :)
> Thanks Greg for the response!
> Partial update is the right concern if that happens frequently. However,
> the major benefit of this proposal is to postpone the CRC check to the
> READ request instead of doing it from within a background job (although
> we may still need a background check such as deep-scrubbing, we can
> reduce its frequency dramatically). By checking the CRC at read time,
> inconsistent objects are detected and their PG is marked inconsistent,
> and we can then trigger a repair for the PG.

Oh, I see. Still, partial update is in fact the major concern. We have
a debug mechanism called "sloppy crc" or similar that keeps track of
CRCs for full (or sufficiently large?)
writes, but it's not something you can use on a production cluster
because it turns every write into a read-modify-write cycle, and that's
just prohibitively expensive (in addition to issues with things like OSD
restart, I think). This sort of thing would make sense for the
erasure-coded pools; maybe that would be a better place to start?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
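
For illustration only, here is a minimal sketch of the trade-off discussed in this thread. It assumes a toy in-memory object with a single whole-object CRC32 and is not how Ceph or its "sloppy crc" mechanism is implemented; `ChecksummedObject` and `bytes_reread` are hypothetical names. It shows that a full overwrite can refresh the CRC from the incoming buffer alone, a partial overwrite forces the whole object to be read back, and a read can detect corruption locally.

```python
import zlib


class ChecksummedObject:
    """Toy object that keeps one whole-object CRC32 next to its data.

    Hypothetical illustration; not Ceph code.
    """

    def __init__(self, data: bytes = b""):
        self.data = bytearray(data)
        self.crc = zlib.crc32(self.data)
        self.bytes_reread = 0  # bytes read back only to refresh the CRC

    def write(self, offset: int, buf: bytes) -> None:
        # A write starting at 0 that covers the whole object replaces it.
        full = offset == 0 and len(buf) >= len(self.data)
        end = offset + len(buf)
        if end > len(self.data):
            # Pad with zero bytes so the slice assignment below fits.
            self.data.extend(bytes(end - len(self.data)))
        self.data[offset:end] = buf
        if full:
            # Full overwrite: the CRC comes from the new buffer alone.
            self.crc = zlib.crc32(buf)
        else:
            # Partial overwrite: a whole-object CRC cannot be patched from
            # the new bytes, so the object is read back and re-hashed
            # (the read-modify-write cost).
            self.bytes_reread += len(self.data)
            self.crc = zlib.crc32(self.data)

    def read(self) -> bytes:
        # Verify at read time instead of waiting for a background scrub.
        if zlib.crc32(self.data) != self.crc:
            raise IOError("CRC mismatch: serve from a replica, flag the PG")
        return bytes(self.data)
```

In this sketch, a 3-byte write into a 4096-byte object adds 4096 to `bytes_reread`, which is the cost being objected to above; per-chunk checksums would shrink that cost but not remove it.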