I recently volunteered to produce a presentation for an internal working group session, covering a “refresher” of VMware’s Site Recovery Manager (SRM). It had been a few years since I had worked with the product in anger, so I had to run through something of a refresher myself first. In the spirit of the vExpert program, I wanted some of that content on my blog as well.
As part of the presentation, I put together a 5-minute demo where I cover the process of performing a controlled failover within SRM from the production site to the recovery site.
Before we head into the demonstration though, I want to cover, at a high level, the topics discussed in the presentation. Firstly, what is SRM? To quote VMware: “SRM is the industry-leading disaster recovery (DR) management solution, designed to minimize downtime in case of a disaster. It provides policy-based management, automated orchestration, and non-disruptive testing of centralized recovery plans. It is designed for virtual machines and scalable to manage all applications in a vSphere environment.”
Since I last worked with SRM, there have been some major changes, specifically around the operating platform. Historically, SRM was an application that had to be installed on a Microsoft Windows Server, with a reliance on an external SQL database. This changed back in 2019, and SRM is now packaged as an appliance from VMware, running on Photon OS and utilising an embedded deployment of vPostgres for the database.
When we start to look at the logical architecture of SRM, there are several different components that come into the scope of a deployment.
- Site Recovery Manager Server – This integrates with an underlying replication technology to provide policy-based management, non-disruptive testing and automated orchestration of recovery plans.
- vSphere Replication Appliance – vSphere Replication is a proprietary, host-based replication engine that replicates VMware virtual machines to the recovery site.
- Storage Replication Adapters – Integrate with third-party storage array-based replication products for data replication.
- vCenter Server – vCenter Server is a centralized platform for managing your VMware vSphere environment.
- Platform Services Controller – Provides infrastructure services for the environment, including Single Sign-On, Licensing, and Certificates.
For the purpose of this article/demo, I am utilising vSphere Replication as I don’t have the hardware available within my lab environment to facilitate an array-based replication. I will aim to cover an overview of vSphere Replication in another article, but for the moment we just need to know that this is the replication technology used for the demo.
There are a number of key concepts that I want to mention with regard to SRM, specifically around resource mappings. Resource mappings are established following the creation of a site pairing. We have 3 different resources that we map to ensure that, when a failover is initiated, the virtual machine is brought up in the recovery site correctly. The mappings consist of:
- Virtual networks – ensuring that the recovered virtual machine is connected to the correct network in the recovery site.
- Folders – ensuring that recovered virtual machines are located in the correct folder structure in the recovery site. This could be particularly key for organisations that have a permission structure in place within vCenter based on the virtual machine folder structure.
- Compute resources – ensuring that the recovered virtual machine is powered up on the correct compute at the recovery site. This could be a top-level vSphere Cluster or a Resource Pool contained within a cluster.
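To make the idea of the three mappings a little more concrete, here is a minimal Python sketch of how a recovery placement could be looked up from them. This is purely a conceptual model and not SRM's API: the class names, the `resolve_placement` function and all of the inventory names are hypothetical and only exist for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedVM:
    """Where a VM lives at the protected site (hypothetical model)."""
    name: str
    network: str
    folder: str
    compute: str


@dataclass(frozen=True)
class Placement:
    """Where the VM should land at the recovery site."""
    network: str
    folder: str
    compute: str


def resolve_placement(vm, network_map, folder_map, compute_map):
    """Translate protected-site objects into recovery-site objects.

    Each map is a plain dict of protected-site name -> recovery-site name.
    A missing entry raises LookupError, because a VM without all three
    mappings cannot be placed in this toy model.
    """
    missing = [
        kind
        for kind, key, mapping in (
            ("network", vm.network, network_map),
            ("folder", vm.folder, folder_map),
            ("compute", vm.compute, compute_map),
        )
        if key not in mapping
    ]
    if missing:
        raise LookupError(f"{vm.name}: no mapping for {', '.join(missing)}")
    return Placement(
        network=network_map[vm.network],
        folder=folder_map[vm.folder],
        compute=compute_map[vm.compute],
    )


# Illustrative inventory names only; none of these come from a real environment.
web01 = ProtectedVM("web01", "Prod-Web", "/Prod/Web", "Prod-Cluster")
placement = resolve_placement(
    web01,
    network_map={"Prod-Web": "DR-Web"},
    folder_map={"/Prod/Web": "/DR/Web"},
    compute_map={"Prod-Cluster": "DR-Cluster"},
)
print(placement)
```

The point of the sketch is simply that a virtual machine needs an answer for all three lookups before it can be brought up in the right place at the recovery site.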
Additionally, we have to define a “placeholder datastore”. A placeholder in the SRM context is a small subset of virtual machine files; it does not represent a full copy of the virtual machine that you are protecting. There are no VMDKs attached to the placeholder virtual machine object, which serves as a reservation, if you will, for the compute on the recovery site. You may have noticed that we haven’t discussed a mapping for the datastore where the virtual machine will be located on the recovery site. That’s because this isn’t a mapping that is configured in SRM. In the case of this demo it is handled by vSphere Replication, and the target datastore is configured within the replication task, along with the replication RPO requirements. If you are using array-based replication and have an SRA installed into SRM, then this is configured at the array level between the LUNs/devices.
Finally, within SRM we have protection groups and recovery plans. Protection groups serve as a grouping of virtual machines that you want to fail over together. For example, if we use the classic 3-tier app example of a web server, an application server and a database server, failing these 3 virtual machines over together ensures no elements of the application structure are missed. A recovery plan builds on protection groups: it allows you to pull multiple protection groups together and fail them over within the same task. This is also where we define and build our orchestration. Within the recovery plan we can set virtual machine power-up priorities and power-up dependencies, as we wouldn’t want the web and application tiers of our application being powered up before the database is ready. We can also define any pre- or post-power-on steps, or any IP customisation that may need to take place within the guest operating system if we are not fortunate enough to have the underlying networking capabilities to stretch the layer 2 networks over multiple sites.
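As a rough illustration of how power-up priorities and dependencies combine into a start order, the Python sketch below sorts virtual machines by priority group first and then by dependency within each group. It is a toy model and not how SRM is implemented: the `power_on_order` function and the VM names are hypothetical, and the sketch assumes that dependencies are only evaluated between virtual machines in the same priority group.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def power_on_order(vms):
    """Return VM names in the order they would be powered on.

    vms maps a VM name to (priority, dependencies), where a lower priority
    number starts earlier and dependencies is a set of VM names that must be
    powered on first. Assumption for this sketch: a dependency on a VM in a
    different priority group is ignored, since the priority groups already
    fix the order between groups.
    """
    order = []
    for priority in sorted({prio for prio, _ in vms.values()}):
        group = {name for name, (prio, _) in vms.items() if prio == priority}
        # Build a graph of name -> predecessors, limited to this group.
        graph = {name: sorted(vms[name][1] & group) for name in sorted(group)}
        # static_order() raises graphlib.CycleError on circular dependencies.
        order.extend(TopologicalSorter(graph).static_order())
    return order


# Hypothetical 3-tier application: the database starts first, then the
# application server, then the web server that depends on it.
plan = {
    "db01": (1, set()),
    "app01": (2, set()),
    "web01": (2, {"app01"}),
}
print(power_on_order(plan))  # ['db01', 'app01', 'web01']
```

Here the database sits in an earlier priority group, so it is always started before the other two tiers, while the dependency decides the order of the two machines that share a group.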
On to the demo. The 5-minute video below shows the steps taken to initiate a planned failover from the protected site to the recovery site via SRM.