So, like many, I thought I had done my research and understood what would happen when rebooting a brick/node, only to find out I was wrong.
In my mind I had a 1x3 replica, so I figured I could do rolling reboots and the bricks would heal up afterwards. However, looking at the oVirt logs, shortly after the rebooted brick came back up all the VMs started pausing/going unresponsive. At the time I was puzzled and freaked out. The next morning on my run I think I found the error in my logic and in my reading comprehension of my research.

Once the 3rd brick came up it had to heal all the changes to the VM images. The healing is file based, not block based, so it saw multi-GB files that it had to recopy in full. It had to halt all writes to those files while that happened, or it would have been a never-ending cycle of re-copying the large images. So the fact that most VMs went haywire isn't that odd. Based on the timing of the alerts, it does look like the 2 bricks that stayed up kept serving images until the 3rd brick came back, and it did heal all the images just fine.
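For anyone following along, this is roughly how I watched the heal progress from the CLI. "myvol" is just a placeholder volume name, swap in your own:

    # Placeholder volume name "myvol"
    # List the files/gfids each brick still needs to heal
    gluster volume heal myvol info

    # Just the per-brick counts of pending heal entries
    gluster volume heal myvol statistics heal-count

    # Confirm all bricks and self-heal daemons are back online
    gluster volume status myvol

A multi-GB image stays listed in the heal info output until that whole file has been copied over, which lines up with the timing of the VM pauses I saw.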
So, knowing what I believe I now know, you can't really do what I had hoped: just reboot one brick and have the VMs stay up the whole time. To achieve something like that I'd need a second set of bricks I could live storage migrate to (rough sketch below).
Am I understanding correctly how that works?
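To make the question concrete, what I have in mind is something like this (host names and brick paths are made up, adjust for your setup): build a second replica 3 volume, add it to oVirt as another storage domain, and live storage migrate the disks onto it before touching the original bricks.

    # Hypothetical hosts and brick paths for a second replica 3 volume
    gluster volume create vmstore2 replica 3 \
        host1:/gluster/brick2/vmstore2 \
        host2:/gluster/brick2/vmstore2 \
        host3:/gluster/brick2/vmstore2
    gluster volume start vmstore2
    # Then add vmstore2 as a new storage domain in oVirt and live storage
    # migrate the VM disks onto it before rebooting the original bricks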
I could also look at minimizing downtime by moving to sharding; that way a heal would only need to copy the smaller shards rather than whole images. However, I could still end up with paused VMs unless those heals were pretty quick. It's probably safest to plan downtime for the VMs, or work out a storage migration plan, if I had a real need for a high number of nines of uptime.
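If I went the sharding route, my understanding is it's just a couple of volume options, something like the below (the block size is only an example value, and as I understand it sharding only applies to files written after it's enabled, so existing images would have to be copied or migrated to pick it up, and it should never be turned back off on a volume that already has data):

    # Example only; pick a shard size appropriate for VM images
    gluster volume set myvol features.shard on
    gluster volume set myvol features.shard-block-size 64MB
    # Only files created after enabling sharding get sharded; existing VM
    # images keep their current layout until they are copied/migrated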