Re: RGW Replication

On Tue, Feb 4, 2014 at 2:21 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis@xxxxxxxxxxxxxxxxxx
>
>
> On 2/4/14 11:36 , Yehuda Sadeh wrote:
>
> Also, verify whether any objects are missing. Start with just counting
> the total number of objects in the buckets (radosgw-admin bucket stats
> can give you that info).
>
> Yehuda
>
>
> Thanks, I didn't know about bucket stats.
>
> bucket stats reports that the slave has fewer objects and kB than the
> master.
>
> Now that I know objects are missing on the slave, how do I fix it?
> radosgw-agent --sync-scope=full ?
>

That would do it, yes.
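
For reference, a rough sketch of what I'd run. The bucket name comes
from your logs; the agent invocation is just however you normally start
radosgw-agent, with the scope flag added:

    # compare per-bucket object counts between the master and slave zones
    radosgw-admin bucket stats --bucket=live-2

    # then run a full (rather than incremental) sync pass
    radosgw-agent -c <your agent conf> --sync-scope=full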

>
>
> I figured out why replication went so quickly after the restart.  I missed
> an error in the radosgw-agent logs:
> 2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking
> shard 36 log,  skipping for now. Traceback:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/worker.py", line 58,
> in lock_shard
>     self.lock.acquire()
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/lock.py", line 65, in
> acquire
>     self.zone_id, self.timeout, self.locker_id)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 241,
> in lock_shard
>     expect_json=False)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 155,
> in request
>     check_result_status(result)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 116,
> in check_result_status
>     HttpError)(result.status_code, result.content)
> HttpError: Http error code 423 content {"Code":"Locked"}
> 2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing shard
> 36

A shard was locked by the agent, but the agent never unlocked it
(maybe because you took it down?).  The lock itself has a timeout, so
it's supposed to be released after a while, and then processing
should resume as usual. However, when this happens you can try playing
with the rados lock commands (rados lock list, rados lock info, rados
lock break) to release it, as sketched below, as long as there's no
running agent that still holds the lock on that shard.
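
Roughly, with the agent stopped, something along these lines. The pool
and object names below are guesses based on default naming (the data log
shards usually live as data_log.<n> objects in the zone's log pool), so
check with rados ls first, and check rados --help for the exact argument
order on the break subcommand:

    # find the shard 36 log object in the slave zone's log pool
    rados -p <log pool> ls | grep '\.36$'

    # see which lock and locker are holding it
    rados -p <log pool> lock list data_log.36
    rados -p <log pool> lock info data_log.36 <lock name>

    # break the stale lock (only while no agent is running)
    rados -p <log pool> lock break data_log.36 <lock name> <locker id>
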
>
> Full radosgw-agent.log, starting at restart:
> https://cd.centraldesktop.com/p/eAAAAAAAC60_AAAAAAia_J0
>
>
>
> I shut down radosgw-agent, and restarted all radosgw daemons in the slave
> cluster.  Replication is proceeding again on shard 36, but I'm seeing the
> same behavior.  The slave is catching up much too quickly.
>
> Before the stall:
> root@ceph1c:/var/log/ceph# zegrep '(live-2:us-west-1|shard 36)'
> radosgw-agent.us-west-1.us-central-1.log.1.gz | grep -v
> 'WARNING:radosgw_agent.sync:shard 36 log has fallen behind' | tail
> 2014-02-03T23:19:11.434 11783:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000115883.315938.2"
> 2014-02-03T23:24:51.246 11783:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:25:30.185 6419:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:25:46.826 6468:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000116882.316964.3"
> 2014-02-03T23:30:13.648 6468:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:30:50.132 29240:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:31:06.808 29390:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000117881.317984.2"
> 2014-02-03T23:38:56.830 29390:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:39:58.408 3744:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:40:15.049 3837:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000118880.319057.3"
>
> After the radosgw and radosgw-agent restart (contained in the full logs
> linked above):
> root@ceph1c:/var/log/ceph# egrep '(live-2:us-west-1|shard 36)'
> radosgw-agent.us-west-1.us-central-1.log | grep -v
> 'WARNING:radosgw_agent.sync:shard 36 log has fallen behind'
> 2014-02-04T08:15:58.966 14045:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking
> shard 36 log,  skipping for now. Traceback:
> 2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing shard
> 36
> 2014-02-04T08:23:50.318 15231:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:24:05.970 15288:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000118880.319057.3"
> 2014-02-04T08:42:20.351 15288:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:48:36.509 24250:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:48:53.145 24280:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000119879.320127.2"
> 2014-02-04T08:57:22.429 24280:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:03:35.292 23586:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:03:53.561 23744:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000120878.321183.3"
> 2014-02-04T09:14:36.249 23744:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:20:15.250 30093:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:20:31.925 30330:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000121877.322255.2"
> 2014-02-04T09:26:46.652 30330:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:32:57.308 20145:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:33:13.897 20215:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000122876.323275.3"
> 2014-02-04T09:43:05.327 20215:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:49:20.255 25443:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T09:49:35.869 25479:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000123875.324352.2"
> 2014-02-04T09:57:12.177 25479:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T10:03:55.676 23373:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T10:04:11.318 23450:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000124874.325371.3"
> 2014-02-04T10:10:00.548 23450:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:29:05.528 28131:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:29:36.329 28219:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000125873.326393.2"
> 2014-02-04T13:35:25.659 28219:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:40:56.360 14609:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:41:12.087 14679:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000126872.327440.3"
> 2014-02-04T13:48:23.826 14679:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:56:18.406 15364:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T13:56:34.125 15578:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "00000127871.328492.2"
> 2014-02-04T14:05:30.358 15578:INFO:radosgw_agent.worker:finished processing
> shard 36
>
>
Does it ever catch up? You mentioned before that most of the writes
went to the same two buckets, so that's probably one of them. Note
that writes to the same bucket are handled in order by the
agent.

Yehuda



