Overcoming Three Major Challenges of Moving Data from the Mainframe to Hadoop

Ron Franklin | January 24, 2020

When it comes to digging value out of the wealth of data many companies have at their disposal, Hadoop is the platform of choice. It is built on a number of capable, sophisticated, and well-supported open source tools that are specifically designed to support big data analytics. But while Hadoop’s popularity continues to climb, there is one glaring gap in its capabilities—it doesn’t provide native support for mainframes. Importing mainframe data into a Hadoop environment and processing it to extract value can be difficult, time-consuming, and costly.

Connect is specifically designed to bridge the gap between the mainframe and Hadoop. It drastically simplifies the process of transferring data from mainframes to Hadoop clusters, overcoming several difficult challenges in the process. Let’s take a look at some of those challenges, and how Connect addresses them.

Challenges with mainframe data

It might seem that transferring data from a mainframe into a Hadoop data lake should be a simple process of uploading it via FTP or Connect:Direct. But it’s a lot more complicated than that. Mainframe datasets exist in many different forms and formats, including VSAM files, fixed and variable length files, Db2 and IMS databases, and COBOL Copybooks. They use EBCDIC or packed decimal encoding rather than ASCII, and may be compressed.
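To make the encoding issue concrete, here is a minimal sketch, independent of Connect, using Python's standard `cp037` codec (US EBCDIC) to show how differently a mainframe text field is encoded:

```python
# "Hello" as stored on the mainframe in EBCDIC (code page 037).
# Note the byte values bear no resemblance to ASCII.
ebcdic_field = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])

# Python ships the US EBCDIC codec in its standard library.
text = ebcdic_field.decode("cp037")
print(text)  # -> Hello

# The ASCII encoding of the same string is completely different bytes.
assert text.encode("ascii") != ebcdic_field
```

A naive FTP of such a file into HDFS, without a conversion step like this, would leave every tool in the cluster reading gibberish.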

COBOL Copybooks are a particular problem. They are metadata blocks that define the physical layout of data but are stored separately from that data. They can be quite complex, containing not just formatting information but also logic, such as nested Occurs Depending On clauses. Hadoop knows nothing of Copybooks, yet without that knowledge it has no way to understand the structure of mainframe data.
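To illustrate why a Hadoop cluster cannot interpret such records on its own, here is a hypothetical Python sketch of decoding one fixed-layout record as a copybook might describe it, including a packed-decimal (COMP-3) field. The record layout is invented for illustration; this is not Connect's implementation:

```python
def unpack_comp3(raw: bytes) -> int:
    """Decode a COBOL packed-decimal (COMP-3) field: two BCD digits
    per byte, with the final nibble holding the sign (0xD = negative)."""
    digits = "".join(f"{b >> 4}{b & 0xF}" for b in raw[:-1])
    digits += str(raw[-1] >> 4)              # last byte: one digit + sign nibble
    sign = -1 if (raw[-1] & 0xF) == 0xD else 1
    return sign * int(digits)

# Hypothetical layout, as a copybook might define it:
#   05 CUST-NAME  PIC X(5).          (EBCDIC text, 5 bytes)
#   05 CUST-BAL   PIC S9(5) COMP-3.  (packed decimal, 3 bytes)
record = bytes([0xC8, 0x85, 0x93, 0x93, 0x96, 0x12, 0x34, 0x5C])

name = record[:5].decode("cp037")    # EBCDIC text field
balance = unpack_comp3(record[5:])   # 0x12 0x34 0x5C -> +12345
print(name, balance)
```

Without the copybook, those eight bytes are just opaque binary; and this sketch covers only the simplest case, with no variable-length arrays or redefined fields.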

Because of all these unfamiliar structural variations, feeding mainframe data in its native form into a Hadoop cluster would cause a bad case of digital indigestion!

Connect is specifically designed to handle these data formatting variations. It supports all mainframe data types and formats, and can ingest that data into a Hadoop cluster and process it there without changing its format. In many cases, the ability to preserve the original form of the data is necessary to comply with governance, compliance, and auditing mandates.


A Data Integrator's Guide to Successful Big Data Projects

Learn how to successfully tackle big data. This eBook will guide you through the ins and outs of building successful big data projects on a solid foundation of data integration.

Challenges with security

In today’s world, maintaining the highest level of data security, whether the data is in transit or at rest, is an absolute necessity. Mainframes are noted for their extremely high levels of data security, but what happens when that data is exported to Hadoop?

Connect protects data security on both the mainframe and Hadoop ends in several ways. First, because it does not require the installation of any additional applications on the mainframe, there’s no chance of it compromising security at that point. Then, through its support for FTPS, Connect:Direct, and encrypted datasets, Connect strongly protects data during the transfer process. Finally, with its native support of Kerberos and LDAP, and its close integration with Apache Sentry, Connect maintains that high level of security when the data is stored and processed within the Hadoop cluster.
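For a sense of what encrypted transfer looks like at the protocol level, here is a generic sketch using Python's standard `ftplib.FTP_TLS` (explicit FTPS). This illustrates FTPS in general, not Connect's internal mechanism; the host, credentials, and dataset name are placeholders:

```python
from ftplib import FTP_TLS

def fetch_dataset(host: str, user: str, password: str,
                  dataset: str, dest: str) -> None:
    """Pull a file over FTPS so both credentials and data stay
    encrypted in transit. All parameters are placeholders."""
    ftp = FTP_TLS(host)
    ftp.login(user, password)
    ftp.prot_p()                     # encrypt the data channel, not just the control channel
    with open(dest, "wb") as out:
        ftp.retrbinary(f"RETR {dataset}", out.write)
    ftp.quit()
```

Plain FTP sends both passwords and data in cleartext, which is why FTPS (or Connect:Direct with encryption) matters for mainframe extracts.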

Challenges with staff skills

Mainframers who also have Hadoop skills, or Hadoop mavens who also understand mainframes, are extremely rare creatures. In other words, the chances of finding IT staff members who can handle the technical complexities of the entire mainframe-to-Hadoop process are pretty low. That’s why Connect is designed to automate much of the process of integrating mainframe data into a Hadoop environment.

Connect sports a simple, highly intuitive graphical user interface (GUI) that allows application developers with little or no mainframe experience to point and click their way through building and maintaining any desired data integration workflow. Connect handles complex Hadoop tasks, such as creating mappers and reducers for MapReduce, entirely on its own.

Connect helps companies get the most out of their mainframe data

The potential value of the data stored on corporate mainframes is immense. But until recently, fully leveraging that data was difficult and expensive, because the best data analytics tools, such as Hadoop, were not available for the mainframe. That’s no longer the case. Connect makes adding the richness of mainframe-based data to Hadoop data lakes not only practical but relatively easy.

Learn how to unleash the power of data – Read our eBook: A Data Integrator’s Guide to Successful Big Data Projects