ceph pg set_full_ratio 0.98
Thanks for the response; we increased our pg count to something more reasonable (512 for now) and things are rebalancing.
Cheers,
-Bryan
From: Andreas Calminder [mailto:andreas.calminder@klarna.com]
Sent: Tuesday, October 17, 2017 3:48 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Help with full osd and RGW not responsive
Hi,
You should most definitely look over the number of pgs; there's a pg calculator available here: http://ceph.com/pgcalc/
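If you want the inputs for the calculator straight from the cluster, something like this should do (the pool name is taken from the health output in your mail below):

    ceph osd stat                                     # total / up / in OSD count
    ceph osd pool get carf01.rgw.buckets.data pg_num  # current pg count for the pool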
You can increase pgs but not the other way around (http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/).
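For example, something along these lines (512 is just an illustrative target, pick yours from the calculator; pgp_num has to be raised as well or the data won't actually rebalance into the new pgs):

    ceph osd pool set carf01.rgw.buckets.data pg_num 512
    ceph osd pool set carf01.rgw.buckets.data pgp_num 512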
To solve the immediate problem with your cluster being full you can reweight your osds: giving the full osd a lower weight will cause writes to go to other osds and data on that osd to be migrated to other osds in the cluster. The command is ceph osd reweight $OSDNUM $WEIGHT, described here: http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem
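For instance, to push some data off osd.5 from your health output (0.85 is just an example weight, adjust as needed):

    ceph osd reweight 5 0.85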
When the osd isn't above the full threshold (default is 95%), the cluster will clear its full flag and your radosgw should start accepting write operations again, at least until another osd gets full; the main problem here is probably the low pg count.
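You can watch the utilisation drop and the flag clear with e.g.:

    ceph osd df         # per-osd size, use % and pg count
    ceph health detail  # OSDMAP_FLAGS / OSD_FULL disappear once below the threshold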
Regards,
Andreas
On 17 Oct 2017 19:08, "Bryan Banister" <bbanister@xxxxxxxxxxxxxxx> wrote:
Hi all,
Still a real novice here and we didn’t set up our initial RGW cluster very well. We have 134 osds and set up our RGW pool with only 64 PGs, thus not all of our OSDs got data and now we have one that is 95% full.
This apparently has put the cluster into a HEALTH_ERR condition:
[root@carf-ceph-osd01 ~]# ceph health detail
HEALTH_ERR full flag(s) set; 1 full osd(s); 1 pools have many more objects per pg than average; application not enabled on 6 pool(s); too few PGs per OSD (26 < min 30)
OSDMAP_FLAGS full flag(s) set
OSD_FULL 1 full osd(s)
osd.5 is full
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool carf01.rgw.buckets.data objects per pg (602762) is more than 18.3752 times cluster average (32803)
There is plenty of space on most of the OSDs and we don't know how to go about fixing this situation. If we update the pg_num and pgp_num settings for this pool, can we rebalance the data across the OSDs?
Also, seems like this is causing a problem with the RGWs, which was reporting this error in the logs:
2017-10-16 16:36:47.534461 7fffe6c5c700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdc447700' had timed out after 600
After trying to restart the RGW, we see this now:
2017-10-17 10:40:38.517002 7fffe6c5c700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffddc4a700' had timed out after 600
2017-10-17 10:40:42.124046 7ffff7fd4e00 0 deferred set uid:gid to 167:167 (ceph:ceph)
2017-10-17 10:40:42.124162 7ffff7fd4e00 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 65313
2017-10-17 10:40:42.245259 7ffff7fd4e00 0 client.769905.objecter FULL, paused modify 0x55555662fb00 tid 0
2017-10-17 10:45:42.124283 7fffe7bcf700 -1 Initialization timeout, failed to initialize
2017-10-17 10:45:42.353496 7ffff7fd4e00 0 deferred set uid:gid to 167:167 (ceph:ceph)
2017-10-17 10:45:42.353618 7ffff7fd4e00 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 71842
2017-10-17 10:45:42.388621 7ffff7fd4e00 0 client.769986.objecter FULL, paused modify 0x55555662fb00 tid 0
2017-10-17 10:50:42.353731 7fffe7bcf700 -1 Initialization timeout, failed to initialize
Seems pretty evident that the “FULL, paused” is the problem. So if I fix the full OSD first, the RGW should be OK afterwards?
Thanks in advance,
-Bryan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com