On Sat, Mar 11, 2017 at 12:21 PM, <cephmailinglist@xxxxxxxxx> wrote:
>
> The next and biggest problem we encountered had to do with the CRC errors on the OSD map. On every map update, the OSDs that had not been upgraded yet got that CRC error and asked the monitor for a full OSD map instead of just a delta update. At first we did not understand exactly what was happening. We ran the upgrade per node using a script; the script watches the state of the cluster and, once the cluster is healthy again, upgrades the next host. Every time we started the script (skipping the already upgraded hosts), the first host(s) upgraded without issues and then we got blocked I/O on the cluster. The blocked I/O went away within a minute or two (not measured). After investigation we found that the blocked I/O happened when nodes were asking the monitor for a (full) OSD map, which for a short time fully saturated the network link on our monitor.

Thanks for the detailed upgrade report. I wanted to zoom in on this CRC/fullmap issue because it could be quite disruptive for us when we upgrade from hammer to jewel.

I've read various reports that the foolproof way to avoid the full map DoS would be to upgrade all OSDs to jewel before the mons. Did anyone have success with that workaround? I'm cc'ing Bryan because he knows this issue very well.

Cheers, Dan