What Is a Data Pipeline and How Does It Work?
Data integration technology and practices have come a long way in the past two decades. The world has gradually moved away from batch integration between siloed systems toward a real-time, analytics-ready approach in which information flows seamlessly from one system to another the moment a transaction is committed to the database.
There has been a shift in the value proposition for integration as well. In the old days, the focus was almost exclusively on transactional and operational needs. Today, as the value-creating potential of big data analytics continues to grow, the emphasis around integration has shifted in that direction as well. Business leaders are realizing the opportunity in front of them, to establish strategic competitive advantage by unlocking the power of their data.
In this respect, data integration is no longer just an IT-driven undertaking; it is a top-down strategic imperative. It has ceased to be about sharing information between silos; now it is about a holistic understanding of the enterprise, its customers, its suppliers, and the world in which it operates.
The transition to real-time
In the old world, integration was often built around a stepwise process of extracting data from a source, transforming it, and loading that data into its target destination. That ETL process (extract, transform, load) continues to be a common approach today, especially for companies that work with a traditional data warehouse architecture. It has been a common practice to run ETL processes overnight, when computing resources are otherwise underutilized, and when the volume of transactions is relatively light.
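The three ETL steps can be sketched as three small functions. This is a minimal illustration only: the CSV source, the `orders` field names, and SQLite standing in for the warehouse are all hypothetical, chosen to keep the example self-contained.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV export of the source system."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize fields and drop incomplete records."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip records missing a primary key
        cleaned.append((row["order_id"].strip(),
                        float(row["amount"]),
                        row["region"].strip().upper()))
    return cleaned

def load(records, db_path):
    """Load: bulk-insert the cleaned records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id TEXT, amount REAL, region TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    con.commit()
    con.close()
```

Run as a nightly batch, the whole job is simply `load(transform(extract(src)), db)`; the point of streaming pipelines, discussed below, is to replace that scheduled sweep with continuous movement of individual changes.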
Twenty years ago, that was a reasonable approach to a problem that existed at the time, namely, that the computationally intensive nature of business analytics made it necessary to pre-process and stage information in a secondary data store that was wholly dedicated to business intelligence.
Today, it still makes sense to aggregate data for analytics, but for a different reason. Businesses are working with a multitude of different systems and applications. They are sharing business data and platforms with their trading partners. Many are consuming software as a service. Often, they are integrating data from mobile devices, clickstream analysis, or real-time data feeds from other sources. Smart companies are enriching their data with information from external sources, adding value and context to the data they already have.
When a company can effectively bring all that information together, they can dramatically increase the value of the business insights that result. Today’s integration challenges arise from this fundamental need for a holistic view of the business.
What is a streaming data pipeline?
There is another important factor that impacts the need for better integration, namely, that business intelligence is generally most impactful when the underlying data is fully up to date. In other words, real-time data has considerably more value than information that is a day old or older. This is especially true with certain business processes such as fraud detection in the credit card industry, or intrusion detection in IT services, where rapid detection of anomalies can prevent potential concerns from developing into real-world problems.
With the increased volume of data, there is also increased velocity. That combination of forces makes it difficult to manage integration the old-fashioned way, with point-to-point connections between source applications and destination data stores. This is where streaming data pipelines come into the picture.
Streaming data pipelines offer a highly coordinated, manageable system for capturing data changes across a myriad of different systems, transforming and harmonizing that information, and delivering it to one or more target systems at scale. This provides business leaders with the real-time insights that drive informed decision-making and competitive advantage. It also breaks down information silos and enables next-generation innovation that helps businesses to leapfrog the competition, leveraging artificial intelligence and machine learning, for example. Real-time streaming data helps leaders to understand their customers better, identify patterns in buying behavior and create memorable customer experiences.
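The capture-transform-deliver flow just described can be sketched in a few lines. This is a conceptual sketch, not a real CDC implementation: `ChangeEvent`, the in-memory "targets," and the enrichment lambda are all invented for illustration, standing in for the change feed, harmonization logic, and target systems a production pipeline would use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ChangeEvent:
    """One captured change: which table, what kind of change, the new row."""
    table: str
    op: str    # "insert", "update", or "delete"
    row: dict

def run_pipeline(source: Iterable[ChangeEvent],
                 transform: Callable[[ChangeEvent], ChangeEvent],
                 sinks: list) -> None:
    """Move each change event from the source through a transform,
    then fan it out to every target system."""
    for event in source:
        enriched = transform(event)
        for sink in sinks:  # one capture, many destinations
            sink(enriched)

# Example wiring: harmonize a field, deliver to two in-memory "targets".
warehouse, audit_log = [], []
events = [ChangeEvent("customers", "insert", {"id": 1, "name": "ada"})]
run_pipeline(
    events,
    lambda e: ChangeEvent(e.table, e.op,
                          {**e.row, "name": e.row["name"].upper()}),
    [warehouse.append, audit_log.append])
```

The key structural point is the fan-out: a single captured change is transformed once and delivered to every target, rather than each source-target pair maintaining its own point-to-point connection.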
Key considerations for a streaming data pipeline
Here are some important considerations that IT leadership should keep in mind as they design a strategy for integration with streaming pipelines:
Take a holistic approach. Most organizations fail to establish a complete picture of the enterprise, because they omit certain systems or processes from their data pipelines. Mainframe systems, for example, are commonly left out of the picture, due in part to the complexities of integrating mainframe data sources with modern relational databases and web services APIs. Modern data platforms lack native connectivity and processing capabilities for mainframe data, making it challenging to integrate much of the data stored in an organization’s most critical business systems.
Most integration tools are not equipped to easily handle mainframe data formats, including variable-length records, COBOL copybooks, and other idiosyncrasies of mainframe systems. Mainframe data is simply not compatible with most data analysis tools without first being prepared for use in a modern analytics environment. The mere process of capturing changes on the mainframe and feeding them to the data pipeline is beyond the scope of most data integration tools.
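To make the preparation problem concrete, here is a toy example of decoding one fixed-width mainframe record. The copybook layout, field names, and values are hypothetical; the only real details are that mainframes typically store text in EBCDIC (cp037 is one common code page, and Python ships a codec for it) and that COBOL `PIC 9(n)V99` fields carry an implied decimal point that the consuming system must restore.

```python
# Hypothetical layout mirroring a simple COBOL copybook:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC X(6).
#      05 CUST-NAME  PIC X(10).
#      05 BALANCE    PIC 9(5)V99.   <- implied decimal point, no sign
LAYOUT = [("cust_id", 0, 6), ("cust_name", 6, 16), ("balance", 16, 23)]

def decode_record(raw: bytes) -> dict:
    """Decode one fixed-width EBCDIC record into a Python dict."""
    text = raw.decode("cp037")  # cp037: a common EBCDIC code page
    rec = {name: text[start:end].strip() for name, start, end in LAYOUT}
    # PIC 9(5)V99 has an implied decimal point: last two digits are cents.
    rec["balance"] = int(rec["balance"]) / 100
    return rec
```

Real copybooks add packed-decimal (COMP-3) fields, REDEFINES clauses, and variable-length segments, which is precisely why generic integration tools struggle with this data.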
Garbage in, garbage out. Data quality matters. If business leaders are to rely upon advanced analytics for strategic insights, they need to be confident that the underlying data is accurate and complete. As businesses increasingly turn to AI and machine learning technologies, the risk of getting it wrong looms much larger than ever before.
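One common way to keep garbage out is to validate records at the pipeline's edge and quarantine the failures rather than letting them reach the warehouse. The checks below are a minimal sketch with hypothetical field names (`customer_id`, `amount`, `email`); real pipelines would draw their rules from a governance framework.

```python
def validate(record: dict) -> list:
    """Return the data-quality problems found in one record (empty = clean)."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    if "@" not in record.get("email", ""):
        problems.append("email looks malformed")
    return problems

def partition(records):
    """Split records into clean ones and quarantined (record, issues) pairs."""
    clean, quarantined = [], []
    for rec in records:
        issues = validate(rec)
        if issues:
            quarantined.append((rec, issues))  # hold for review, don't load
        else:
            clean.append(rec)
    return clean, quarantined
```

Quarantining, rather than silently dropping, matters for the AI and machine learning use cases mentioned above: the rejected records are themselves a signal about upstream systems.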
Consider the big picture. As business leaders seek to bring order to the data stored in different systems across the enterprise, data governance is becoming increasingly important. A sound strategy for a streaming data pipeline should fit within an overall governance framework that incorporates data quality (as already noted), enrichment, location intelligence, and more.
Scalability matters. Finally, it’s important to use enterprise-grade tools capable of handling thousands or tens of thousands of records per second so that both your data pipeline and your business can scale as the volume of available data increases, which it inevitably will in coming years.
With Connect CDC from Precisely, businesses have the power to build streaming pipelines, create critical links between legacy and target systems, share application data across the enterprise, and integrate easily with modern data platforms such as Snowflake and Hadoop. Learn more about how Connect CDC can help your organization build a holistic approach to data integration; download our e-book Streaming Legacy Data for Real-Time Insights.