RGWs offline after upgrade to Nautilus

"Ben.Zieglmeier" <Ben.Zieglmeier@xxxxxxxxxx> · Thu, 20 Jul 2023 14:41:08 +0000

Hello,

We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The upgrade went mostly fine, though now several of our RGWs will not start. One RGW is working fine, the rest will not initialize. They are on a crash loop. This is part of a multisite configuration, and is currently not the master zone. Current master zone is running 14.2.22. These are the only two zones in the zonegroup. After turning debug up to 20, these are the log snippets between each crash:
```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names. <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone <redacted>
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone <redacted> id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```

I’ve checked all file permissions, filesystem free space, disabled selinux and firewalld, tried turning up the initialization timeout to 600, and tried removing all non-essential config from ceph.conf. All produce the same results. I would greatly appreciate any other ideas or insight.

Thanks,
Ben
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx