issues on CT + EC pool

Hi,

I have a big-ish cluster that, amongst other things, has a radosgw
configured with an erasure-coded data pool (k=12, m=4). The cluster is
currently running Jewel (10.2.7).

That pool spans 244 HDDs and has 2048 PGs.
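
For context, the pools were set up the usual way; roughly along these
lines (the profile name, the cache pool's PG count and the failure
domain below are placeholders, not my exact values):

    # EC profile and base pool
    ceph osd erasure-code-profile set rgw_ec k=12 m=4 ruleset-failure-domain=host
    ceph osd pool create .rgw.buckets.ec 2048 2048 erasure rgw_ec

    # replicated cache pool layered on top
    ceph osd pool create ct-radosgw 512 512 replicated
    ceph osd tier add .rgw.buckets.ec ct-radosgw
    ceph osd tier cache-mode ct-radosgw writeback
    ceph osd tier set-overlay .rgw.buckets.ec ct-radosgw
    ceph osd pool set ct-radosgw hit_set_type bloom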

From the df detail output:

    NAME             ID  CATEGORY  QUOTA OBJECTS  QUOTA BYTES    USED  %USED  MAX AVAIL   OBJECTS   DIRTY    READ   WRITE  RAW USED
    .rgw.buckets.ec  26         -            N/A          N/A  76360G  28.66       185T  97908947  95614k  73271k    185M   101813G
    ct-radosgw       37         -            N/A          N/A   4708G  70.69      1952G   5226185   2071k    591M   1518M     9416G

The ct-radosgw pool should be size 3, but is currently size 2 due to an
unrelated issue (a PDU failure).
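
For reference, I confirm the replica counts with the usual pool queries:

    ceph osd pool get ct-radosgw size
    ceph osd pool get ct-radosgw min_size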

Whenever I flush data from the cache tier to the base tier, the OSDs
start updating their local LevelDB databases, using 100% of the disk IO,
until they either a) are marked down for not responding, and/or b) hit
the suicide timeout.
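
The flush itself is driven through the normal cache-tiering knobs,
roughly like this (the ratio and byte values here are illustrative, not
my production settings):

    # let the cache agent flush dirty objects down to the EC pool
    ceph osd pool set ct-radosgw cache_target_dirty_ratio 0.4
    ceph osd pool set ct-radosgw target_max_bytes 5497558138880   # ~5 TB, example value

    # or force everything out in one go
    rados -p ct-radosgw cache-flush-evict-all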

I have other pools targeting those same OSDs, but so far nothing like
this has happened when IO goes to the other pools.
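
For what it's worth, this is roughly how I'm watching it happen during
a flush:

    ceph osd perf        # per-OSD commit/apply latency
    ceph health detail   # blocked/slow request warnings
    iostat -x 5          # disk utilisation on the OSD hosts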

Any ideas on how to proceed?

thanks,


