This seems to have resolved the issue. The cluster completed recovery while I was strace'ing osd.4, and hasn't had any issues since then. I restarted radosgw-agent, and it's running fine. I don't think the snapshots are related, but I don't know. The snapshots I deleted were taken over a 2-week period, and spanned a 40% increase in the cluster's data size. The snapshot cron is still active, so I guess I'll repeat the experiment. If the issue comes back in a couple of weeks, I'll try the strace without removing the snapshots.