SC 2012 SP1 – DPM: Leveraging DPM ScaleOut feature to protect VMs deployed on a big cluster

Windows 2012 improved CSV clustering with increased scale and performance.  Windows 2012 clusters can be as big as 64 nodes and 8000 VMs.  Their storage can be on a Cluster Shared Volume (CSV) or on a remote SMB share, which in turn can be a scale-out cluster.  As part of DPM 2012 SP1’s mission of enabling efficient backup for Windows 2012 private cloud deployments, DPM leveraged Windows 2012 CSV capabilities and improved backup performance by 900% (assuming the VM had 10% churn per day).  DPM 2012 SP1 can now protect 64-node clusters, as opposed to 16-node clusters.

Even though DPM 2012 SP1’s scale numbers have increased to 800 VMs per DPM server (with 100GB average VM size, 3% churn per day and once-a-day backup), one DPM server cannot protect a 64-node cluster, since such a cluster can host up to 8000 VMs.  Until DPM 2012 SP1, VMs on a cluster could only be protected by a single DPM server, which meant that DPM 2012 could not protect a 64-node cluster.  With DPM 2012 SP1, VMs on a single cluster can be protected by multiple DPM servers.  This is achieved by an architectural breakthrough in DPM: the DPM ScaleOut feature.  By using this feature, agents running on a Hyper-V cluster can be “attached” to multiple DPM servers, and each DPM server can be configured to protect “some” of the VMs on the cluster.

The strong affinity between a VM and the DPM server backing it up is still maintained.  Even if the VM is migrated within a cluster, across clusters, or to other Hyper-V deployments, the same DPM server continues to protect it.  For details, refer to the blog article “VM Mobility: Uninterrupted data protection”.

Enabling the DPM ScaleOut feature is basically a matter of installing agents on the node(s) and attaching them to multiple DPM servers.  This can be achieved by performing the steps below.

  • Calculate the number of DPM servers required for protecting a cluster using the DPM 2012 SP1 Hyper-V Calculator
    • For example, assume that this particular deployment has a 64-node cluster with 4000 VMs
    • 100GB average VM size
    • Once-a-day backup
    • 3% data churn per day
    • Per the DPM 2012 SP1 Hyper-V Calculator, 5 DPM servers are required to protect this 64-node cluster.  Let us name them DPM1, … DPM5
  • Deploy the DPM agent on all nodes via one of the DPM servers, say DPM1, as mentioned here, or manually as described here
  • Now the agents on all 64 nodes can communicate with DPM1, and DPM1 can protect any VM on the 64-node cluster
  • The next step is to connect all 64 nodes to DPM2, … DPM5 with the following steps
    • On each node of the 64-node cluster, run the following commands to connect to all required DPM servers
    • SetDpmServer.exe -Add -dpmServerName DPM2
    • SetDpmServer.exe -Add -dpmServerName DPM3
    • SetDpmServer.exe -Add -dpmServerName DPM4
    • SetDpmServer.exe -Add -dpmServerName DPM5
  • On each DPM server (DPM2 to DPM5), attach the agent of each cluster node in one of the following ways
    • Run Attach-ProductionServer.ps1 DPMServerName nodeN <user name> <password> <domain> as described in step 5 here
    • Or, in DPM’s administrator console, go to Management, then Agents, click “Install” and select “attach agents” as shown below

clip_image002

  • Perform agent attach operation for all DPM servers and all 64 nodes of the cluster
  • By following these steps, all DPM servers are connected to all 64 nodes on the cluster
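With 64 nodes and 4 additional DPM servers, typing the commands above by hand is error-prone. The following is a minimal Python sketch that only generates the command lines listed in the steps above, so they can be reviewed before being run. The node names (node1 … node64) and the function name are hypothetical; the credential placeholders are left exactly as in the steps.

```python
def scale_out_commands(nodes, dpm_servers, primary):
    """Build the command lines from the steps above.

    nodes       -- cluster node names (hypothetical names in this sketch)
    dpm_servers -- all DPM servers, including the one that installed the agents
    primary     -- the DPM server that deployed the agents (DPM1 in the example)
    """
    additional = [s for s in dpm_servers if s != primary]

    # To run on every cluster node: add each additional DPM server.
    per_node = {
        node: [f"SetDpmServer.exe -Add -dpmServerName {s}" for s in additional]
        for node in nodes
    }

    # To run on each additional DPM server: attach every node's agent.
    per_server = {
        s: [
            f"Attach-ProductionServer.ps1 {s} {node} <user name> <password> <domain>"
            for node in nodes
        ]
        for s in additional
    }
    return per_node, per_server


nodes = [f"node{i}" for i in range(1, 65)]    # hypothetical node names
servers = [f"DPM{i}" for i in range(1, 6)]    # DPM1 ... DPM5
per_node, per_server = scale_out_commands(nodes, servers, primary="DPM1")

print("\n".join(per_node["node1"]))
print(per_server["DPM2"][0])
```

For the example deployment this yields 4 commands per node and 64 attach commands per additional DPM server.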

To configure VM backup, create protection group(s) (PGs), with or without colocation depending on various parameters, and protect VMs on each DPM server.  As mentioned above, once a VM is protected by a DPM server, that VM is managed by this particular DPM server for the rest of the VM life cycle.  By following this procedure, 5 DPM servers are needed to protect 4000 VMs on a 64-node cluster (actual DPM requirements vary depending on the VM size, number of VMs, VM churn etc).

There are multiple ways to distribute the VMs across DPM servers.  If this is an organically growing environment, one can start protecting with one DPM server and, when it reaches its limits, deploy another DPM server to protect the rest of the VMs, and so on.  If the deployment is already big and customers are opting in for backup on a regular basis, one can deploy multiple DPM servers at the same time and load balance the VM protection across them.  The backup admin can protect VMs on a DPM server by performing the following steps.
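As a rough cross-check of the numbers in this example (not a replacement for the DPM 2012 SP1 Hyper-V Calculator), the server count follows from the 800 VMs per DPM server figure quoted earlier for 100GB VMs, 3% churn and once-a-day backup. The sketch below also shows one simple way to load balance VMs across servers, a round-robin split; that choice and the VM names are assumptions made purely for illustration.

```python
import math


def servers_needed(total_vms, vms_per_server=800):
    """Rough estimate: VMs divided by the per-server scale figure, rounded up."""
    return math.ceil(total_vms / vms_per_server)


def assign_round_robin(vm_names, dpm_servers):
    """Spread VMs evenly across DPM servers (illustrative policy only)."""
    assignment = {server: [] for server in dpm_servers}
    for index, vm in enumerate(vm_names):
        server = dpm_servers[index % len(dpm_servers)]
        assignment[server].append(vm)
    return assignment


count = servers_needed(4000)                        # example from this article
servers = [f"DPM{i}" for i in range(1, count + 1)]
vms = [f"vm{i:04d}" for i in range(4000)]           # hypothetical VM names
plan = assign_round_robin(vms, servers)

print(count)
print({server: len(assigned) for server, assigned in plan.items()})
```

Because a VM stays with its DPM server for the rest of its life cycle, a plan like this is only useful for VMs that are not yet protected.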

1. Opt-in for new protection and expand cluster

2. Select VM to be protected

3. Configure VM protection parameters

Due to the increased cluster sizes, the first step, which scans the cluster, can take a long time.  Considering this, DPM 2012 SP1 dramatically reduced the time this step takes for clusters.  DPM improved this in two ways: by expanding the caching of cluster parameters, and by improving VM enquiry performance by enquiring all VMs of a node in a single step.  To get optimal enquiry performance when enquiring VMs in a CSV cluster, perform the following steps the first time DPM is used to protect VMs of a cluster.

a) Click and expand each node of the cluster as shown below. Overall, node-level resource population should take about 3 to 5 minutes, depending on factors such as the Hyper-V load, so scanning the whole cluster would take roughly 5 to 7 minutes.

clip_image004

b) Once the node-level enquiry is complete, expand the cluster, which will show all the VMs that can be selected for protection, as shown below.

clip_image006

c) Follow usual protection steps after this.

Once a PG is created on the cluster, DPM caches the cluster, node and VM information and updates the cache as part of nightly jobs.  At times, refreshed data is necessary for protecting newly deployed VMs.  In that case, the backup admin can either wait for the next day or force a fresh enquiry on the cluster manually, as described in steps a) and b) above.

Other points to consider:

Secondary DPM protection is not supported for scale-out DPM server deployments.

Each DPM server connected to this cluster shows all VMs that are present in the cluster, even though a particular VM might already be protected by another DPM server.  The backup admin is advised not to protect a VM on multiple DPM servers.  If a VM is protected initially by DPM1 and then protected by DPM2, DPM2 will take ownership of the VM and protect it from then on, and backups on DPM1 will fail.  If this is done by mistake, the DPM admin can follow the steps mentioned below.

· Stop protection on DPM2 (with the retain data option, to keep the current recovery points that are on this server)

· Go to DPM1 and force a consistency check (CC) on the VM; backups will resume on DPM1, and DPM1 will now be the owner of this VM
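Since every DPM server lists every VM, accidental double protection is easy to miss. The following is a small Python sketch of a cross-check, assuming the admin has already collected, by whatever means, the list of protected VMs from each DPM server. The input dictionary and all names in it are hypothetical; the sketch does not query DPM itself.

```python
from collections import defaultdict


def find_double_protection(protected_by_server):
    """Return {vm: [servers]} for every VM protected by more than one server."""
    owners = defaultdict(list)
    for server, vms in protected_by_server.items():
        for vm in vms:
            owners[vm].append(server)
    return {vm: sorted(servers) for vm, servers in owners.items() if len(servers) > 1}


# Hypothetical inventory collected from two DPM servers.
inventory = {
    "DPM1": {"vm-a", "vm-b"},
    "DPM2": {"vm-b", "vm-c"},
}
conflicts = find_double_protection(inventory)
print(conflicts)
```

Any VM reported here is a candidate for the stop-protection and consistency-check steps above.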

As the scale out option enables the protection of any VM on the cluster by any DPM server, AutoProtectInstance.ps1, which enables auto protection of all VMs on a cluster to a single DPM server, does not work in this environment.

As the scale out feature could potentially run an unlimited number of backups from one node and overload the production server, we limited the number of concurrent backups to 8 per node.

If the number of concurrent backups is exceeded, the recovery point job will fail with the following details and error code.

Type: Recovery point

Status: Failed

Description: DPM could not run the backup job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.

Data source: \Backup Using Child Partition Snapshot\Server name

Production Server: XXXXXX (ID 3185 Details: Internal error code: 0x809909E5)

The backup admin will see the following alert messages as the reason for the backup failure.  As this affects production deployments and varies from node to node, this is a node-specific parameter.

  • DPM could not run the backup job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.  Data source: %DatasourceName;  Production Server: %ServerName;  Reduce the number of backup/recovery jobs running on this production server, or wait for some backup and recovery jobs to complete and retry the operation.
  • DPM could not run the recovery job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.  Data source: %DatasourceName;  Production Server: %ServerName;  Reduce the number of backup/recovery jobs running on this production server, or wait for some backup and recovery jobs to complete and retry the operation.
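To make the per-node limit concrete, here is a toy Python model of the behavior described above: each node accepts up to 8 concurrent jobs, and a further job is rejected until a running one completes. This is an illustration only, not how DPM implements the limit, and the class and method names are hypothetical.

```python
class NodeJobLimiter:
    """Toy model of a per-node cap on concurrent backup/recovery jobs."""

    def __init__(self, max_jobs=8):
        self.max_jobs = max_jobs
        self.running = {}  # node name -> set of running job ids

    def try_start(self, node, job_id):
        """Start the job if the node is below its limit; otherwise reject it."""
        jobs = self.running.setdefault(node, set())
        if len(jobs) >= self.max_jobs:
            return False  # corresponds to the failed job / alert above
        jobs.add(job_id)
        return True

    def finish(self, node, job_id):
        """Mark a job as completed, freeing a slot on that node."""
        self.running.get(node, set()).discard(job_id)


limiter = NodeJobLimiter()
results = [limiter.try_start("node1", f"job{i}") for i in range(9)]
print(results)  # the ninth job on node1 is rejected

limiter.finish("node1", "job0")
retried = limiter.try_start("node1", "job8")
print(retried)  # succeeds once a slot is free
```

The limit is tracked per node, so jobs on other nodes are unaffected, which matches the advice in the alert to wait for running jobs to complete and then retry.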

The number of concurrent backups is a configurable parameter, shown below, in DataSoureResourceLimit.xml, located in the DPM agent installation folder on the production server.  For a change to take effect, DPMRA needs to be restarted: go to Services and restart the DPMRA service.

<DatasourceLimits>
    <Writer writerId="66841cd4-6ded-4f4b-8f17-fd23f8ddc3de" version="0" isParallelRecoveryAllowed="true">
        <MaxLimit value="8" type="1"/>
        <!-- This represents the number of backups allowed for Hyper-V workloads.  Fine tune this number based on the number of VMs, parallel backup load that can be allowed on production server etc. -->
        <MaxLimit value="8" type="2"/>
        <MaxLimit value="8" type="3"/>
        <MaxLimit value="8" type="4"/>
        <!-- This represents the number of recoveries allowed for Hyper-V workloads.  Fine tune this number based on the number of VMs, parallel recovery load that can be allowed on the production server etc. -->
    </Writer>
</DatasourceLimits>
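If the limit has to be changed on many nodes, editing the file by hand gets tedious. The following is a minimal Python sketch, using only the standard library, that changes one MaxLimit value in XML text of the shape shown above. It works on an in-memory sample rather than the real file; the function name is hypothetical, and the assumption that the real file contains nothing beyond this structure is untested, so back up the file before adapting this.

```python
import xml.etree.ElementTree as ET

# Sample with the same structure as the snippet above (not read from disk).
SAMPLE = """<DatasourceLimits>
    <Writer writerId="66841cd4-6ded-4f4b-8f17-fd23f8ddc3de" version="0" isParallelRecoveryAllowed="true">
        <MaxLimit value="8" type="1"/>
        <MaxLimit value="8" type="2"/>
        <MaxLimit value="8" type="3"/>
        <MaxLimit value="8" type="4"/>
    </Writer>
</DatasourceLimits>"""

# Writer id used for Hyper-V workloads in the snippet above.
HYPERV_WRITER_ID = "66841cd4-6ded-4f4b-8f17-fd23f8ddc3de"


def set_max_limit(xml_text, writer_id, limit_type, new_value):
    """Return xml_text with the matching MaxLimit value replaced."""
    root = ET.fromstring(xml_text)
    for writer in root.iter("Writer"):
        if writer.get("writerId") != writer_id:
            continue
        for limit in writer.iter("MaxLimit"):
            if limit.get("type") == str(limit_type):
                limit.set("value", str(new_value))
    return ET.tostring(root, encoding="unicode")


# Lower the backup limit (type 1 in the snippet above) from 8 to 4.
updated = set_max_limit(SAMPLE, HYPERV_WRITER_ID, 1, 4)
print(updated)
```

As noted above, the DPMRA service still has to be restarted on the node before a changed value takes effect.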

DPM Scale out feature is supported only for Windows 2012 Hyper-V CSV workloads.
