I upgraded to Luminous yesterday, and before the upgrade was complete we had SSD OSDs flapping up and down and scrub errors in the RGW index pools. I consistently made sure that all OSDs were back up and the cluster healthy before continuing, and I never reduced min_size below 2 for the pools on the NVMes. The RGW daemons for our 2 multi-site realms restarted themselves (due to a long-standing memory leak, supposedly fixed in 12.2.2) and so upgraded themselves prematurely, before all of the OSDs had been upgraded. I thought that was the reason for the scrub errors and inconsistent PGs. However, this morning I had a scrub error in our local-only realm, which does not use multi-site and had not restarted any of its RGW daemons until after all of the OSDs had been upgraded.
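In case it's relevant, this is how I've been identifying the inconsistent PGs and looking at what disagrees (the PG ID below is just a placeholder, not one of ours):

    ceph health detail                                     # lists the PGs flagged inconsistent and the scrub error count
    rados list-inconsistent-obj 5.1c --format=json-pretty  # shows which objects/shards on that PG disagree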
Is there anything we should be looking at for this? Any idea what could be causing these scrub errors? I can issue a repair on the PG and the scrub errors go away, but they keep coming back later on the same PGs. I can also issue a deep-scrub on every PG in these pools and they all come back clean, only for the scrub errors and inconsistent PGs to reappear later on the same PGs.
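For reference, the repair and deep-scrub cycle I've been running looks like this (again with a placeholder PG ID):

    ceph pg repair 5.1c        # clears the scrub error on the PG, until it reappears
    ceph pg deep-scrub 5.1c    # deep-scrub comes back clean at first, then the same PG goes inconsistent again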