On Mon, Nov 6, 2017 at 7:29 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> On a Ceph Luminous (12.2.1) environment I'm seeing RGWs stall, and at about the same time I see these errors in the RGW logs:
>
> 2017-11-06 15:50:24.859919 7f8f5fa1a700 0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.20
> 2017-11-06 15:50:41.768881 7f8f7824b700 0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:XXXXX
> 2017-11-06 15:55:15.781739 7f8f7824b700 0 ERROR: failed to distribute cache for gn1-pf.rgw.meta:.meta:bucket.instance:XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.32:_XK5LExyXXXXX6EEIXxCD5Cws:1
> 2017-11-06 15:55:25.784404 7f8f7824b700 0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.32
>
> I see one message from a year ago: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-June/010531.html
>
> The setup has two RGWs running:
>
> - ceph-rgw1
> - ceph-rgw2
>
> While trying to figure this out I noticed that a "radosgw-admin period pull" hangs forever.
>
> I don't know if that is related, but it's something I've noticed.
>
> Mainly I see that at random times the RGW stalls for about 30 seconds, and while that happens these messages show up in the RGW's log.
>

Do you happen to know if there's dynamic resharding happening? Dynamic resharding should only affect writes to the specific bucket, though, and should not affect cache distribution.

Originally I thought it could be a HUP-signal-related issue, but that seems to be fixed in 12.2.1.

Yehuda

> Is anybody else running into this issue?
>
> Wido
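For anyone wanting to rule dynamic resharding in or out here: a rough way to check, assuming a Luminous (12.2.x) cluster with dynamic resharding enabled (the bucket name below is just a placeholder), is the reshard admin commands:

  # List buckets currently queued for / undergoing resharding
  radosgw-admin reshard list

  # Check resharding status for a specific bucket (placeholder name)
  radosgw-admin reshard status --bucket=<bucket-name>

  # To take dynamic resharding out of the picture entirely, it can be
  # disabled in ceph.conf under the RGW client sections and the RGWs
  # restarted:
  #   rgw_dynamic_resharding = false

If "reshard list" is empty around the times of the stalls, resharding is unlikely to be the trigger and the cache-distribution errors would need to be chased elsewhere.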