Logs have been attached to the issue: http://tracker.ceph.com/issues/15745
On Thu, May 5, 2016 at 11:23 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
Can you reproduce with
debug ms = 1
debug objecter = 20
on the radosgw side?
-Sam
On Thu, May 5, 2016 at 8:28 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Greetings,
>
> We are running a number of Ceph clusters in production to provide object
> storage services. We have stumbled upon an issue where objects of certain
> sizes are irretrievable. The symptoms closely resemble those of the issue
> addressed by the fix referenced here:
> https://www.redhat.com/archives/rhsa-announce/2015-November/msg00060.html.
> We can put objects into the cluster via s3/radosgw, but we cannot retrieve
> them (the cluster closes the connection without delivering all bytes).
> Unfortunately, that fix does not apply to us, as we are and have always been
> running Hammer, so we appear to have hit a brand-new edge case.
>
> We have produced this issue on the 0.94.3, 0.94.4, and 0.94.6 releases of
> Hammer.
>
> We have produced this issue using three different storage hardware
> configurations -- five clusters each running 648 6TB OSDs across nine
> physical nodes, one cluster running 30 10GB OSDs across ten VM nodes, and
> one cluster running 288 6TB OSDs across four physical nodes.
>
> We have determined that this issue only occurs when using erasure coding
> (we've only tested plugin=jerasure technique=reed_sol_van
> ruleset-failure-domain=host).
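>
> For anyone trying to reproduce, here is a minimal sketch of how a matching
> profile and pool can be created (driving the ceph CLI from Python; the
> profile name, pool name, and PG counts below are placeholders, not our
> production values):
>
>     import subprocess
>
>     def make_ec_pool(k, m, pool, pg_num=128):
>         """Create a jerasure/reed_sol_van profile and an EC pool using it."""
>         profile = 'profile-k%d-m%d' % (k, m)
>         subprocess.check_call([
>             'ceph', 'osd', 'erasure-code-profile', 'set', profile,
>             'k=%d' % k, 'm=%d' % m,
>             'plugin=jerasure', 'technique=reed_sol_van',
>             'ruleset-failure-domain=host'])
>         subprocess.check_call([
>             'ceph', 'osd', 'pool', 'create', pool,
>             str(pg_num), str(pg_num), 'erasure', profile])
>
>     make_ec_pool(3, 1, 'test-ec-k3m1')   # one of the failing combinations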
>
> Objects of exactly 4.5MiB (4718592 bytes) can be placed into the cluster but
> not retrieved. At every interval of `rgw object stripe size` thereafter (in
> our case, 4 MiB), objects are similarly irretrievable. We have tested this
> from 4.5 to 24.5 MiB and have spot-checked much larger values to confirm the
> pattern holds. Just below each of these boundaries there is a small range of
> sizes that is also irretrievable. After much testing, we have found the size
> of this range to be strongly correlated with the k value of our erasure
> coded pool; we have observed that the m value has no effect on the window
> size. We have tested erasure coding profiles with k from 2 to 9, and we've
> observed the following ranges:
>
> k = 2, m = 1 -> No error
> k = 3, m = 1 -> 32 bytes (i.e. errors when objects are inclusively between
> 4718561 - 4718592 bytes)
> k = 3, m = 2 -> 32 bytes
> k = 4, m = 2 -> No error
> k = 4, m = 1 -> No error
> k = 5, m = 4 -> 128 bytes
> k = 6, m = 3 -> 512 bytes
> k = 6, m = 2 -> 512 bytes
> k = 7, m = 1 -> 800 bytes
> k = 7, m = 2 -> 800 bytes
> k = 8, m = 1 -> No error
> k = 9, m = 1 -> 800 bytes
>
> The "bytes" represent a 'dead zone' object size range wherein objects can be
> put into the cluster but not retrieved. The range of bytes is 4.5MiB -
> (4.5MiB - buffer - 1) bytes. Up until k = 9, the error occurs for values of
> k that are not powers of two, at which point the "dead zone" window is
> (k-2)^2 * 32 bytes. My team has not been able to determine why we plateau
> at 800 bytes (we expected a range of 1568 bytes here).
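>
> To make the pattern concrete, here is a small sketch of the sizes involved
> (pure arithmetic; BOUNDARY and STRIPE come from the numbers above, and the
> window values are the observed ones from the table rather than the formula):
>
>     BOUNDARY = 4718592          # 4.5 MiB, the first failing size
>     STRIPE = 4 * 1024 * 1024    # our rgw object stripe size
>
>     # k -> observed dead zone width in bytes, from the table above
>     OBSERVED_WINDOW = {3: 32, 5: 128, 6: 512, 7: 800, 9: 800}
>
>     def dead_zone(k, stripe_index=0):
>         """Inclusive range of object sizes that cannot be retrieved."""
>         boundary = BOUNDARY + stripe_index * STRIPE
>         window = OBSERVED_WINDOW[k]
>         return (boundary - window + 1, boundary)
>
>     print(dead_zone(3))     # (4718561, 4718592), the k=3/m=1 case above
>     print(dead_zone(3, 1))  # same window at the next stripe boundary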
>
> This issue cannot be reproduced by using rados to place objects directly
> into EC pools. The issue has only been observed when using RadosGW's S3
> interface.
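>
> A direct-to-rados check along these lines does not trip the bug for us (a
> sketch using the python-rados bindings; the pool and object names are
> placeholders):
>
>     import rados
>
>     data = b'\0' * 4718592    # 4.5 MiB, a size that fails via radosgw
>
>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>     cluster.connect()
>     ioctx = cluster.open_ioctx('test-ec-k3m1')   # placeholder EC pool
>     try:
>         ioctx.write_full('deadzone-test', data)
>         readback = ioctx.read('deadzone-test', length=len(data))
>         assert readback == data                  # succeeds when going direct
>     finally:
>         ioctx.close()
>         cluster.shutdown()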
>
> The issue can be reproduced with any S3 client (s3cmd, s3curl, CyberDuck,
> CloudBerry Backup, and many others have been tested).
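>
> As a concrete example, a minimal reproduction with boto looks something like
> the following (any S3 client behaves the same for us; the endpoint,
> credentials, and bucket name are placeholders):
>
>     import boto
>     import boto.s3.connection
>
>     conn = boto.connect_s3(
>         aws_access_key_id='ACCESS_KEY',           # placeholder
>         aws_secret_access_key='SECRET_KEY',       # placeholder
>         host='rgw.example.com',                   # placeholder endpoint
>         is_secure=False,
>         calling_format=boto.s3.connection.OrdinaryCallingFormat())
>
>     bucket = conn.create_bucket('deadzone-test')
>     data = b'\0' * 4718592                        # exactly 4.5 MiB
>
>     key = bucket.new_key('obj-4718592')
>     key.set_contents_from_string(data)            # the PUT succeeds
>
>     # On affected sizes the GET is cut short before all bytes arrive.
>     readback = key.get_contents_as_string()
>     assert len(readback) == len(data)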
>
> At this point, we are evaluating the Ceph codebase in an attempt to patch
> the issue. As this affects data retrievability (and possibly integrity), we
> wanted to bring it to the attention of the community as soon as we could
> reproduce it. We are hoping both that others can independently verify the
> problem and that someone with a more intimate understanding of the codebase
> can investigate and propose a fix. We have observed this issue in our
> production clusters, so it is a very high priority for my team.
>
> Furthermore, we believe the objects are corrupted at the point they are
> placed into the cluster. We have tested copying the .rgw.buckets pool to a
> non-erasure coded pool and then swapping the pool names, and objects copied
> from the EC pool to the non-EC pool remain irretrievable once RGW is pointed
> at the non-EC pool. If we overwrite such an object in the non-EC pool with
> the original, it becomes retrievable again. This has not been tested as
> exhaustively, but we felt it important enough to mention.
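>
> The copy-and-swap test was along these lines (sketched with Python driving
> the rados and ceph CLIs; the replicated pool name and PG counts are
> placeholders):
>
>     import subprocess
>
>     def run(*cmd):
>         subprocess.check_call(list(cmd))
>
>     # Copy the EC bucket data pool into a fresh replicated pool, then swap
>     # the names so RGW serves from the copy.
>     run('ceph', 'osd', 'pool', 'create', '.rgw.buckets.repl', '128', '128',
>         'replicated')
>     run('rados', 'cppool', '.rgw.buckets', '.rgw.buckets.repl')
>     run('ceph', 'osd', 'pool', 'rename', '.rgw.buckets', '.rgw.buckets.ec')
>     run('ceph', 'osd', 'pool', 'rename', '.rgw.buckets.repl', '.rgw.buckets')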
>
> I'm sure I've omitted some details here that would aid in an investigation,
> so please let me know what other information I can provide. My team will be
> filing an issue shortly.
>
> Many thanks,
>
> Brian Felton
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com