5 Best Practices for Ensuring Your IBM i High Availability Solution is Switch-Ready
Read this eBook, 5 Best Practices for Ensuring IBM i High Availability is Switch-Ready, to discover:
- Best practices in HA management that ensure smooth planned and unplanned switches
- Why it is critical to determine the data & applications you need to protect before installing HA
- The simplicity of regular monitoring and optimization of your HA solution
- How HA audits can prove invaluable in helping to spot potential issues before a switch
- The importance of ensuring your IT staff has the necessary training in monitoring and managing your HA solution
High Availability (HA) is a significant IT investment for many companies running IBM i, and it can pay off in spades in the event of a significant hardware failure or site disaster. But that payoff depends on being confident that you can efficiently and successfully switch your business applications and other critical processes from your production server to a fully synchronized backup. Executing smooth switchovers and failovers (and thus minimizing downtime and data loss) depends on following a number of best practices, which are outlined in this e-book.
Simply installing HA software doesn’t mean you have a complete solution!
High Availability software is a critical piece of a complete availability solution, but simply buying and installing the software doesn’t mean the solution is complete. Although our advanced HA solutions include many automation and self-healing features, it’s still critical to monitor, optimize, and test the solution regularly.
Best Practice 1: Ensure Proper HA Configuration
Before installing HA, it is critical to determine the data and applications you need to protect by performing a data inventory as well as setting replication priorities. From this, your HA can then be properly configured to your environment. But don’t lose sight of the fact that IT environments are dynamic; therefore, the configuration of your HA may need to be fine-tuned from time to time. For instance, when applications are upgraded, it is important to make sure your replication configuration is reviewed and, if needed, updated to reflect the upgraded applications.
Verify that a replication process is in place for all objects that are needed on the backup system in order to fully run your essential business functions.
- User profiles
- Authorization lists
- Object authorities
- IFS directories
- Spooled files
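One way to make this verification concrete is to compare the object categories your business functions need against the categories your replication configuration actually covers. The following is a minimal Python sketch under stated assumptions: it assumes you can export both lists as plain names, and `REQUIRED_CATEGORIES` and `find_unreplicated` are hypothetical names used for illustration only, not part of any HA product.

```python
# Hypothetical inventory check; names are illustrative, not tied to any HA product.
REQUIRED_CATEGORIES = {
    "user profiles",
    "authorization lists",
    "object authorities",
    "IFS directories",
    "spooled files",
}


def find_unreplicated(required, replicated):
    """Return the required categories that have no replication process, sorted."""
    # Compare case-insensitively so "User profiles" matches "user profiles".
    normalized = {name.strip().lower() for name in replicated}
    return sorted(name for name in required if name.lower() not in normalized)


if __name__ == "__main__":
    # Assumed export of what the replication configuration currently covers.
    configured = ["User profiles", "Authorization lists", "IFS directories"]
    for gap in find_unreplicated(REQUIRED_CATEGORIES, configured):
        print(f"No replication process found for: {gap}")
```

Rerunning a comparison like this after each application upgrade is one way to catch objects that were added to production but never added to the replication configuration.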
Best Practice 2: Regularly Monitor and Optimize HA Health
Keeping your HA optimized and switch-ready requires regularly checking the health of your replication environments to ensure essential functions are running smoothly. Depending on your HA software, you may be notified automatically should an issue arise with one of the processes listed below, and in many cases the software will resolve the issue automatically.
Regular health checks include:
- Verify replication is active.
- Validate that audits have run, and make sure any discovered differences have been automatically repaired.
- Verify communication is active between source and target.
- Verify remote journaling from the source to the target is active.
- Verify there is little or no latency in replication.
- Look for any replication errors not automatically corrected.
- Respond quickly to alerts generated by HA software.
- Review errors from the week to see if there are any patterns (e.g., after a nightly batch run, etc.).
- Send a report to management on the status of the solution to provide reassurance that the solution is protecting the business as expected.
- Check for and install new product fixes or service packs.
- Check for and install any recommended/required OS PTFs.
- Review errors from the month to find and correct any larger patterns.
- Check that bandwidth is kept optimized for your replication workload so as not to waste system resources.
- Review configuration settings to ensure that changes to your production system, and particularly your business applications, are properly replicated by your HA.
- Perform a switch test, or better yet, switch-and-stay (run on each system for a quarter and then switch). This ensures you can reliably run your business processes on each server.
- Update your HA runbook as needed based on the results of switch-tests and changes to your HA environment (more about the value of an HA runbook later in this e-book).
- Check for and install new version releases of your HA software to ensure you are always on a supported release. If you are on an unsupported release and you run into issues with your HA during a switchover, you may need to first update your software before you are able to switch, which can significantly delay your recovery.
- Have a certified HA consultant perform an audit of your HA environment. More about this later in the e-book.
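The routine checks in the list above (replication active, audits run, communication up, remote journaling active, low latency, no uncorrected errors) can be sketched as a single health evaluation. This is only an illustration in Python: the `ReplicationStatus` fields, the `health_issues` function, and the 60-second latency threshold are all assumptions chosen for the example, and you would populate the snapshot from whatever your own HA software reports.

```python
from dataclasses import dataclass


@dataclass
class ReplicationStatus:
    """Hypothetical status snapshot; populate it from your own HA tooling."""
    replication_active: bool
    audits_ran: bool
    unrepaired_differences: int
    link_active: bool
    remote_journal_active: bool
    latency_seconds: float
    uncorrected_errors: int


def health_issues(status, max_latency_seconds=60.0):
    """Return human-readable findings; an empty list means no issue was found.

    The default latency threshold is an assumed value, not a recommendation.
    """
    issues = []
    if not status.replication_active:
        issues.append("replication is not active")
    if not status.audits_ran:
        issues.append("audits have not run")
    if status.unrepaired_differences > 0:
        issues.append(f"{status.unrepaired_differences} audit differences not repaired")
    if not status.link_active:
        issues.append("source-to-target communication is down")
    if not status.remote_journal_active:
        issues.append("remote journaling is not active")
    if status.latency_seconds > max_latency_seconds:
        issues.append(f"replication latency is {status.latency_seconds:.0f}s")
    if status.uncorrected_errors > 0:
        issues.append(f"{status.uncorrected_errors} replication errors not corrected")
    return issues
```

Keeping each day's findings also gives you the raw material for the weekly and monthly pattern reviews and for the status report to management.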
Document all configuration, monitoring, maintenance, and switch-testing of your HA solution. In addition to making it easier for you to conduct your periodic maintenance tasks (because you can easily refer to specific processes and procedures), it provides important documentation for your IT manager and any other IT staff who may need to cover for you.
Best Practice 3: Regularly Test the Switch Process
It’s one thing to be diligent in keeping up with the monitoring and maintenance of your HA, but if you don’t regularly test the switch process, you are simply rolling the dice should you need to do a failover after a hardware failure or site disaster. A switch test does more than confirm that all needed data is replicated to the backup: it verifies that all objects needed to run business processes exist in the backup environment and that each functions properly. Keep in mind that testing the switch process invariably reveals a number of issues that need to be addressed in order for the process to complete successfully. And that’s the point. What you learn and fine-tune during these tests is indispensable to ensuring your HA is truly switch-ready.
Keep an HA “Runbook”
An HA runbook documents your switch process and guides you step by step through either a switchover or a failover so you don’t miss anything that could delay the process. Your HA vendor should be able to help you create this when the software is installed. The runbook should be updated as needed after each switch test and will prove invaluable during the stress of an actual hardware failure or site disaster.
Tips for conducting your switch test:
- Do your initial switch tests in a “switch-while-active” mode, which allows users to continue work on the production server while the HA software emulates the switch process. Once this is working to your satisfaction, then it’s time to do a full test in which all work is stopped on the production server and all business processes are started and tested on the backup.
- Before starting your test, check to see if there is any latency between your source (production) machine and your target (backup) machine. It is important that you resolve any issues that are causing extended latency before conducting your tests. In fact, it is critical that you work to minimize latency on an ongoing basis. When the pressure is on after a hardware failure or site disaster, any latency will likely delay the successful completion of your failover and could also result in lost data.
- Once the switch has been completed, you will need to verify that essential jobs have started and essential applications are available.
- Verify that HA replication between the new source system (formerly the target) and the new target system (formerly the source) is functioning properly.
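The tips above follow a fixed order, which is what a runbook captures. As a rough Python sketch, a switch test can be modeled as an ordered list of named checks that stops at the first failure so the problem can be fixed and recorded. The function `run_runbook` and the placeholder checks are hypothetical; each lambda stands in for a real verification you would perform with your own tools.

```python
def run_runbook(steps):
    """Run (name, check) pairs in order; stop at the first check that fails.

    Returns (completed_names, failed_name); failed_name is None on success.
    """
    completed = []
    for name, check in steps:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder checks: replace each lambda with a real verification.
    steps = [
        ("latency between source and target is acceptable", lambda: True),
        ("essential jobs have started on the backup", lambda: True),
        ("essential applications are available", lambda: False),
        ("replication from new source to new target is working", lambda: True),
    ]
    done, failed = run_runbook(steps)
    print("Completed:", done)
    print("Stopped at:", failed)
```

Recording which step a test stopped at, and why, is the kind of detail worth writing back into the runbook after each switch test.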
Best Practice 4: Perform an Annual Audit with a Certified HA Consultant
Having a certified Precisely HA consultant perform an annual audit of your HA environment can prove invaluable in helping you to spot any potential issues before they create surprises during a planned or unplanned switch. During this audit, your production environment and all replication processes are thoroughly reviewed for any gaps or other issues, and a report of recommendations is provided. If needed, your HA consultant can help you implement these recommendations and train you on configuration best practices. And if you haven’t done so already, your HA consultant can apply any needed PTFs and fixes or even update your HA software to the latest release and help you test the new release on your environment while also walking you through the latest features.
Even organizations with on-staff expertise know the value of having a periodic external audit of their HA environment, which often brings to light potential issues while also providing recommendations for improvement.
Best Practice 5: Ensure IT Staff Has the Necessary Training and Time to Fulfill All Other Best Practices
Of course, it’s critical that someone on your IT team is properly trained in the monitoring and management of your HA solution. As mentioned at the beginning of this e-book, simply buying and installing HA software doesn’t provide a complete solution. You need someone on staff who is properly trained and who reliably has the bandwidth to perform regular maintenance and switch tests. In addition, in the same way you need a synchronized backup system in case something happens to your production system, you also need a properly trained backup person who can look after your HA when your primary HA manager is away or is otherwise unavailable.
Many time and personnel factors conflict with regular HA monitoring and switch-readiness!
- Companies are facing increasing pressure to trim IT budgets, which means existing staff gets spread thin and HA management suffers.
- An increasing number of IBM i professionals are retiring and taking their HA expertise with them.
- Smaller IT departments often have just one person who knows how to manage HA, which creates a problem when that person is unavailable or suddenly leaves the company.
- Even if your IT department has sufficient staff to manage HA, it’s not uncommon for short-term urgencies to cause HA management to be neglected.
Troubling Trends: HA Expectations Aren’t Meeting Reality
* Precisely HA/DR Survey 2017 – IBM Power Systems
** Precisely 2016 State of Resilience report. Respondents here represent various platforms, not just IBM Power Systems.
In recent surveys conducted with IT professionals, we discovered that despite investing in HA, many companies aren’t fully reaping its benefits. The reason appears to point to IT being stretched thin. As a result, HA is not regularly monitored, optimized, and tested.
IBM Power shops aren’t meeting their RPO and RTO requirements
- RPO: 33% expect zero data loss after failure/disaster, yet few if any achieve this.*
- RTO: 29% expect to recover within 30 minutes or less after failure/disaster, yet only 20% achieve this.*
IT shops aren’t properly managing their HA
- 44% say they’re not current on HA upgrades, audits, or role-swap tests (or don’t know if they are).**
- 30% say they perform an HA test only once each year.
- 18% say they’ve never done a switch test.**
IT personnel are spread thin in IBM Power shops
- 25% need more internal staff for HA management.*
- 40% are looking to outsource HA management.*
- 10% need to reallocate HA staff to other critical projects.*
- 10% are losing HA staff to retirement.*
Managed HA/DR Services: The Cost-Effective Alternative for Ensuring Switch-Readiness
A growing number of companies with HA software are choosing to engage a managed services contract with Precisely in an effort to counter the issues of staffing and conflicting IT priorities, and to assure ongoing switch-readiness. With Managed HA/DR Services:
- You have dedicated HA experts regularly managing your environment so you don’t have to worry about maintaining HA skills in-house.
- You and your team are freed up to focus on other, more strategic IT priorities.
- Your HA environment is regularly audited to proactively find and resolve errors and to ensure you have an optimal configuration that is properly tuned to maximize your system and network resources.
- Your HA environment stays up to date with the latest version release and service packs so you are never behind or unsupported.
- You receive regular, timely reports on the status of your environment and any corrective actions taken.
- You are contacted immediately if any critical issues are found.
- You benefit from leading experts in HA/DR and our products, with hundreds of years of collective experience.
- You can more readily do maintenance on your production server with little to no downtime because the system is always switch-ready.
- You can choose from various levels of managed services available based on need and budget.
Best of all:
You have confidence your HA is switch-ready should a hardware failure or site disaster occur.