This is a post that I have been meaning to write for some time now; in fact, I carried out the testing for it back in July.
I have had some great conversations with customers of late around migrations and the use of VMware HCX to carry them out. I am a big advocate of HCX and all that it offers. One feature of HCX that has always intrigued me is Replication Assisted vMotion, or RAV. RAV makes use of vSphere Replication and vMotion technology to migrate workloads live from source to destination environments, giving a zero downtime migration.
This post isn’t intended to go into detail on HCX migration methods, but rather to explore a question I had on RAV that I wasn’t able to find much of an answer to.
Once you have defined your mobility group/migration wave within HCX and chosen RAV as your migration method, HCX will invoke a base sync. Assuming that you have defined a switchover window some time in the future, once the base sync has completed HCX will move the VM into a continuous replication state. In this state HCX will sync changes generated on the VM at source to the destination environment every 2 hours.
This got me thinking, though: what if our mobility group contained a number of VMs that are all different sizes? At what point does the continuous replication interval start? Is it on a per-VM basis? Is it controlled at the mobility group level? I had to find out.
I created a mobility group containing only 2 VMs and set off a RAV migration task in HCX. One VM was 16GB in size and the other 48GB; not huge, but enough for me to test what I wanted. As the below image shows, the smaller of the VMs, “photon-hcx-vlan-100”, completed the base sync within 6 minutes and moved into the continuous replication state. The larger of the VMs, “Win-10-VM”, was only at 24% of its base sync at that point.
Just over 30 minutes later, the larger of the 2 VMs completed its base sync.
Now to wait for around 2 hours and see what happens. Would only the smaller of the VMs start its replication cycle, or would both VMs start their replication cycles?
The reason this question came to mind was all around switchover planning. When using RAV, a final delta sync takes place once the switchover window is reached. If that switchover window is, for argument’s sake, 90 minutes after the last replication (RPO) cycle, there is potentially a lot of data to be synced before switchover can take place. As RAV switchover is a serial process, I was interested to understand what would happen with the continuous replication cycles and whether switchover events could be unpredictable.
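To put some rough numbers on that concern, here is a minimal back-of-the-envelope sketch in Python. It is not anything HCX exposes: the function names, the change rates and the link throughput are all hypothetical values I have made up for illustration, and it assumes the final delta syncs run one after another.

```python
def delta_to_sync_gb(change_rate_gb_per_hour: float, minutes_since_last_sync: float) -> float:
    """Rough volume of changed data built up since the last replication cycle."""
    return change_rate_gb_per_hour * minutes_since_last_sync / 60.0


def serial_final_sync_minutes(deltas_gb: list[float], throughput_gb_per_min: float) -> float:
    """Time to drain every VM's delta if they are synced one after another (assumption)."""
    return sum(deltas_gb) / throughput_gb_per_min


# Hypothetical inputs: switchover lands 90 minutes after the last cycle.
busy_vm_delta = delta_to_sync_gb(change_rate_gb_per_hour=4.0, minutes_since_last_sync=90)
quiet_vm_delta = delta_to_sync_gb(change_rate_gb_per_hour=1.0, minutes_since_last_sync=90)

# Hypothetical link throughput of 0.5 GB per minute.
final_sync_minutes = serial_final_sync_minutes([busy_vm_delta, quiet_vm_delta], 0.5)

print(f"Deltas: {busy_vm_delta} GB and {quiet_vm_delta} GB")
print(f"Estimated final sync time: {final_sync_minutes} minutes")
```

The point is only that the delta grows with the gap between the last cycle and the switchover window, so knowing when the cycles actually fire matters for planning.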
The events were enough to answer the initial question: the continuous replication sync cycle appears to be set at the mobility group level, not per VM. As the screenshot below shows, it was actually the larger of the 2 VMs that initiated a continuous replication sync ahead of the smaller VM, with the smaller VM following shortly after.
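The reasoning behind that conclusion can be sketched in a few lines of Python. If each VM ran its own 2 hour timer from the end of its own base sync, the first continuous syncs should be spread as far apart as the base sync completions were. The timings below are hypothetical, only loosely shaped like my test and not the values from the screenshots, and the "less than half the predicted spread" threshold is an arbitrary choice of mine.

```python
CYCLE_MINUTES = 120  # the 2 hour interval mentioned earlier in the post


def per_vm_prediction(base_sync_done: dict[str, int]) -> dict[str, int]:
    """First sync time for each VM if every VM ran its own timer (hypothesis)."""
    return {vm: done + CYCLE_MINUTES for vm, done in base_sync_done.items()}


def spread(minutes) -> int:
    """Gap between the earliest and latest time in a collection."""
    values = list(minutes)
    return max(values) - min(values)


# Hypothetical timings, in minutes since the migration task was started.
base_sync_done = {"photon-hcx-vlan-100": 6, "Win-10-VM": 40}
observed_first_sync = {"Win-10-VM": 125, "photon-hcx-vlan-100": 127}

predicted_spread = spread(per_vm_prediction(base_sync_done).values())
observed_spread = spread(observed_first_sync.values())

# Arbitrary threshold: syncs bunched much tighter than the per-VM
# prediction point towards a shared, group-level schedule.
looks_group_level = observed_spread < predicted_spread / 2

print(f"Predicted spread with per-VM timers: {predicted_spread} minutes")
print(f"Observed spread: {observed_spread} minutes")
print(f"Looks like a group-level schedule: {looks_group_level}")
```

This says nothing about what the group-level cycle is anchored to; it only checks whether the observed timings fit a per-VM timer.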
Just for my own future reference, I have also added the below screenshot, which shows all of the events on each of the VMs and their timings.
This answered my initial questions, but it has left me with some more regarding RAV and switchover windows/events. Sadly, that is not something I have the capacity in my lab, or the time, to test. It made me think: if a mobility group contains a larger number of VMs, let’s say the max of 200, and they are heavy storage consumers, is it possible that continuous replication sync tasks are not in sync with switchover, leading to switchover events running past continuous replication sync triggers? Perhaps another way of putting it: if switchover events are still running after 2 hours, will continuous replication sync tasks still take place? I’d be interested to know for sure if anyone happens to know the answer.