Fwd: RadosGW not responding if ceph cluster in state health_error

Sorry to bring this up again - any ideas? Or should I try the IRC channel?

Cheers,
Thomas

-------- Original Message --------
Subject: RadosGW not responding if ceph cluster in state health_error
Date: Mon, 21 Nov 2016 17:22:20 +1300
From: Thomas <thomas@xxxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx


Hi All,

I have a cluster setup with 16 OSDs on 4 nodes, standard RGW install with standard rgw pools, replication on those pools is set to 2 (size 2, min_size 1).
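
For context, this is roughly how replication on those pools is checked and set; the pool name below is just one of the standard Jewel rgw pools as an example, adjust to your setup:

ceph osd pool get default.rgw.buckets.data size
ceph osd pool get default.rgw.buckets.data min_size
ceph osd pool set default.rgw.buckets.data size 2
ceph osd pool set default.rgw.buckets.data min_size 1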

We've had a situation before where one node dropped out completely (so 4 OSDs), the cluster health went to HEALTH_WARN, and RGW as well as the other pools kept working fine.

I now had a problem after we added a test pool with replication 1 (size 1, min_size 1): the node died again, its 4 OSDs dropped out, the cluster went to HEALTH_ERR, and RGW stopped responding entirely. I'm not sure why that would be the case.
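
While the cluster was in that state, the stuck placement groups should show up with the standard status commands, something like:

ceph -s
ceph health detail
ceph pg dump_stuck inactive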

I understand that with a pool that uses size 1, one OSD dropping out (unrecoverably) means you'll lose pretty much all of that pool's data, and the pool was only set up for some benchmarking. What I didn't expect is that it would affect the entire cluster. Restarting the radosgw service would work, but it then wouldn't listen to requests and showed errors like this in the logs:

2016-11-18 11:13:47.231827 7f0aaadb2a00 10  cannot find current period zonegroup using local zonegroup
2016-11-18 11:13:47.231860 7f0aaadb2a00 20 get_system_obj_state: rctx=0x7fffb14242c0 obj=.rgw.root:default.realm state=0x564c3fa99858 s->prefetch_data=0
2016-11-18 11:13:47.232754 7f0aaadb2a00 10 could not read realm id: (2) No such file or directory
2016-11-18 11:13:47.232772 7f0aaadb2a00 10 Creating default zonegroup
2016-11-18 11:13:47.233376 7f0aaadb2a00 10 couldn't find old data placement pools config, setting up new ones for the zone

...

2016-11-18 11:13:47.251629 7f0aaadb2a00 10 ERROR: name default already in use for obj id 712c74f9-baf4-4d74-956b-022c67e4a5bb
2016-11-18 11:13:47.251631 7f0aaadb2a00 10 create_default() returned -EEXIST, we raced with another zonegroup creation

Full log here: http://pastebin.com/iYpiF9wP
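
For what it's worth, the realm/zonegroup configuration that RGW is complaining about can be inspected with radosgw-admin, e.g. ("default" is the zonegroup name from the log above):

radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zonegroup get --rgw-zonegroup=default
radosgw-admin period get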

Once we removed the pool with size = 1 via 'rados rmpool', the cluster started recovering and RGW served requests again!
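
The removal was along these lines ("benchtest" here is just a placeholder for the actual test pool name; rados rmpool wants the pool name twice plus the confirmation flag):

rados rmpool benchtest benchtest --yes-i-really-really-mean-it

or equivalently:

ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it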

Any ideas?

Cheers,
Thomas


--

Thomas Gross
TGMEDIA Ltd.
p. +64 211 569080 | info@xxxxxxxxxxxxx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
