Re: RGW: ERROR: failed to distribute cache

> On 6 November 2017 at 20:17, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> 
> 
> On Mon, Nov 6, 2017 at 7:29 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > Hi,
> >
> > On a Ceph Luminous (12.2.1) environment I'm seeing the RGWs stall, and at about the same time I see these errors in the RGW logs:
> >
> > 2017-11-06 15:50:24.859919 7f8f5fa1a700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.20
> > 2017-11-06 15:50:41.768881 7f8f7824b700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:XXXXX
> > 2017-11-06 15:55:15.781739 7f8f7824b700  0 ERROR: failed to distribute cache for gn1-pf.rgw.meta:.meta:bucket.instance:XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.32:_XK5LExyXXXXX6EEIXxCD5Cws:1
> > 2017-11-06 15:55:25.784404 7f8f7824b700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXXXX:eb32b1ca-807a-4867-aea5-ff43ef7647c6.20755572.32
> >
> > I see one message from a year ago: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-June/010531.html
> >
> > The setup has two RGWs running:
> >
> > - ceph-rgw1
> > - ceph-rgw2
> >
> > While trying to figure this out I noticed that a "radosgw-admin period pull" hangs forever.
> >
> > I don't know if that is related, but it's something I've noticed.
> >
> > Mainly I see that at random times the RGW stalls for about 30 seconds, and while that happens these messages show up in the RGW's log.
> >
> 
> Do you happen to know if dynamic resharding is happening? Dynamic
> resharding should only affect writes to the specific bucket, though,
> and should not affect cache distribution. Originally I thought it
> could be a HUP-signal-related issue, but that seems to be fixed in
> 12.2.1.
> 

No, that doesn't seem to be the case:

$ radosgw-admin reshard list

That's empty.
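For good measure, the reshard status of an individual bucket can be checked as well; a quick sketch, with the bucket name below just a placeholder for one of the buckets from the logs:

$ radosgw-admin reshard status --bucket=XXX-mon-bucket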

Looking at the logs I see this happening:


2017-11-07 09:45:12.147335 7f985b34f700 10 cache put: name=gn1-pf.rgw.data.root++.bucket.meta.XXX-mon-bucket:eb32b1ca-807a-4867-aea5-ff43ef7647c6.14977556.9 info.flags=0x17
2017-11-07 09:45:12.147357 7f985b34f700 10 adding gn1-pf.rgw.data.root++.bucket.meta.XXX-mon-bucket:eb32b1ca-807a-4867-aea5-ff43ef7647c6.14977556.9 to cache LRU end
2017-11-07 09:45:12.147364 7f985b34f700 10 updating xattr: name=user.rgw.acl bl.length()=155
2017-11-07 09:45:12.147376 7f985b34f700 10 distributing notification oid=notify.6 bl.length()=708
2017-11-07 09:45:22.148361 7f985b34f700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXX-mon-bucket:eb32b1ca-807a-4867-aea5-ff43ef7647c6.14977556.9

2017-11-07 09:45:22.150273 7f985b34f700 10 cache put: name=gn1-pf.rgw.meta++.meta:bucket:XXX-mon-bucket:_iaUdq4vufCpgnMlapZCm169:1 info.flags=0x17
2017-11-07 09:45:22.150283 7f985b34f700 10 adding gn1-pf.rgw.meta++.meta:bucket:XXX-mon-bucket:_iaUdq4vufCpgnMlapZCm169:1 to cache LRU end
2017-11-07 09:45:22.150291 7f985b34f700 10 distributing notification oid=notify.1 bl.length()=407
2017-11-07 09:45:31.881703 7f985b34f700 10 cache put: name=gn1-pf.rgw.data.root++XXX-mon-bucket info.flags=0x17
2017-11-07 09:45:31.881720 7f985b34f700 10 moving gn1-pf.rgw.data.root++XXX-mon-bucket to cache LRU end
2017-11-07 09:45:31.881733 7f985b34f700 10 distributing notification oid=notify.1 bl.length()=372

As you can see, the cache notify failed for OID 'gn1-pf.rgw.data.root++.bucket.meta.XXX-mon-bucket:eb32b1ca-807a-4867-aea5-ff43ef7647c6.14977556.9', but went through just fine for 'gn1-pf.rgw.data.root++XXX-mon-bucket'.
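As far as I understand it, RGW distributes these cache invalidations via watch/notify on the notify.N objects in the control pool, and the ERROR is logged when a notify isn't acked by all watchers in time (note the ~10 second gap between 'distributing notification' and the error above). A rough sketch of how to check who is actually watching a failing object, assuming the control pool here is named gn1-pf.rgw.control:

$ rados -p gn1-pf.rgw.control listwatchers notify.6

Each running RGW should show up there as a watcher; a missing or stale entry for one of the two gateways would explain a notify that never gets acked.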

Skimming through the logs I see that notifies fail when one of these objects is used:

- notify.4
- notify.6

In total there are 8 notify objects in the 'control' pool:

- notify.0
- notify.1
- notify.2
- notify.3
- notify.4
- notify.5
- notify.6
- notify.7

I don't know if that might be related.
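As far as I can tell, RGW picks the notify object by hashing the cache key over those 8 control objects (rgw_num_control_oids), so the same keys always land on the same notify.N. Since only notify.4 and notify.6 fail, one thing worth checking (a sketch, assuming the pool name from this zone) is whether those two map to the same OSDs while the healthy ones don't:

$ ceph osd map gn1-pf.rgw.control notify.4
$ ceph osd map gn1-pf.rgw.control notify.6
$ ceph osd map gn1-pf.rgw.control notify.1

If the failing objects share an acting set that the working ones don't, that would point at an OSD or network problem rather than at the RGW cache code itself.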

I created this issue in the tracker: http://tracker.ceph.com/issues/22060

Wido

> Yehuda
> 
> > Is anybody else running into this issue?
> >
> > Wido
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


