Managing Big Data Quality in a Big Data World
An informative eBook that looks at the challenges and solutions of managing big data quality in data environments.
Should I go it alone?
A recommended path for the citizen data scientist
If you’re looking to obtain data quickly, learn whether the path of the citizen data scientist is right for you.
Managers today are under incredible pressure to deliver revenue growth and cost reduction through business transformation. Between meeting revenue goals, hitting profit targets and helping to meet corporate objectives, the ability to use data analytics to improve results is critical to a business’s success – but are we falling short of analyzing data at the speed of business?
While managers understand the importance of data analytics, most are frustrated and feel powerless that they can’t just pull the data and insights themselves. Instead, they often must either rely upon analytical experts with advanced knowledge to use complex tools in order to pull reports or wait in line to get IT and analytics experts assigned to work on their projects. By the time IT is done, critical business decisions may have been made with obsolete or no data.
The problem is that these teams need data quickly in order to make business judgments based on the analytical insights hidden in that data. To help optimize the business, they need proof that their business hypotheses are backed up by data. But not being able to pull their own data using business-friendly self-service tools slows down decision-making. If new ideas or insights emerge, there is no quick way to test them without pulling the data themselves. And when things get out of hand fast, teams are forced into crisis mode, making business decisions that can’t wait on the basis of insufficient analysis.
And while these teams believe success is inhibited by their inability to pull the data they need, the rest of the organization is confused because a variety of tools have been purchased over the years at great cost. Often it’s not that existing tools can’t solve the problem, but that they are too complex and require specialized knowledge to use, resulting in delays or, worse yet, pushing business users back to relying on gut instinct. Even where progress has been made, basic measurements drawn from different sources can yield different results, creating confusion and wasting time on reconciliation.
So, what’s a manager to do?
- One option is to put pressure on IT to do more. But with IT already stressed, overworked and lacking sufficient bandwidth, getting them to reprioritize often means sacrificing other high priority projects. Often, requests for data analytics take a backseat because IT is overwhelmed satisfying commitments for others across the organization.
- Another option is to escalate the situation to a higher level. The problem with this technique is that it may not be productive. Addressing the problem may take a long time, especially if it requires hiring and training new staff.
- A third option is to consider the role of a citizen data scientist. A recent Gartner report defines a citizen data scientist as “a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics.” In layperson’s terms: business users who use an organization’s tools to pull data analytics on their own, even though they do not have a background in data analysis, business intelligence (BI) or data mining.
Planning And Preparing to Become a Citizen Data Scientist
While becoming a citizen data scientist sounds like a win-win solution for everyone, there are several essential considerations to ensure the role is right for you and your organization.
Have clearly defined analytical goals: Depending on whether you want to explore, measure or prove a point in a business case, you need to formulate measurable goals. If your goals are too vague, you’ll likely make little progress. If they are too narrow, you may limit your ability to extract value from the data in the future.
Conduct a data quality assessment: Once you’ve clearly defined your goals, there are a multitude of questions to answer that will help you decide if becoming a citizen data scientist is possible in your organization. At the end of this assessment, you want to be able to know the extent of your ability to continue based on the availability and quality of the data you can access. Questions include:
- Where is the data relevant to what I want to do? Is it located in one place, in multiple places? Who can tell me where I can find the good quality data?
- If the data is in multiple places, how can that data be integrated?
- Is the data complete? Is it fresh?
- Who can speak about the quality of the data?
- When it comes to privacy, security and confidentiality, are there issues that I need to consider before accessing the data?
While IT or the analytics team will likely support you at the beginning, it’s important to find out to what extent and for how long.
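As a rough illustration, a couple of these questions (completeness, freshness) can be answered with a quick profiling pass. The sketch below uses Python with pandas; the column names and sample records are purely hypothetical.

```python
import pandas as pd

# Hypothetical customer extract; "customer_id", "email", "updated_at"
# are illustrative assumptions, not a real schema.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "c@x.com", None],
    "updated_at": pd.to_datetime(
        ["2024-01-10", "2023-06-01", "2024-01-12", "2024-01-11"]
    ),
})

# Completeness: share of non-missing values per column
completeness = df.notna().mean()

# Freshness: how recent is the newest record?
latest = df["updated_at"].max()

print(completeness)
print("most recent record:", latest.date())
```

A pass like this won’t answer the privacy or ownership questions, but it quickly tells you whether the data is even worth pursuing.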
Conduct a data prep assessment: Once you know that the data is reliable, secure and available, it is time to assess your ability to access the information and transform it into insights that can help achieve your goals.
Questions during this phase may include:
- What does the data mean?
- How do I know what data to keep and what to throw away?
- How can I join these multiple data sources into one data store for analysis?
- What tools do I need to manipulate the data without writing SQL queries?
At the end of this assessment, you should have a plan that identifies how you are going to rein in the data to meet your goals. You may want to enlist a trusted adviser to guide you through this phase.
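For instance, integrating two sources without SQL often comes down to a join followed by an aggregation. A minimal pandas sketch, using made-up tables and column names:

```python
import pandas as pd

# Hypothetical departmental extracts; tables and columns are assumptions.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [120.0, 50.0, 75.0, 200.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "East"],
})

# A left join keeps every order, attaching a region where a match exists
combined = orders.merge(customers, on="customer_id", how="left")

# Aggregate into an analysis-ready result: revenue per region
revenue = combined.groupby("region")["amount"].sum()
print(revenue)
```

Self-service tools typically hide this behind drag-and-drop steps, but the underlying operations are the same.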
Analytical assessment: For this assessment, you want to figure out what you should do with the data and how you can process it in a way that tells you what you need to know.
- What analytical models should you apply?
- What report(s) do you need to create?
- How do you operationalize the data?
- How do you socialize the insight and reap the benefit of thought leadership?
At the end of this phase, you’ll have a clearer picture of what you can do with the data, as well as what potential insights can be gained from your effort.
There is more than one way for managers to obtain the analytical data and insights they need. Doing it on your own, however, requires answering some important questions and completing internal assessments before proceeding. But the opportunities it opens up are vast, and being able to flexibly pull the data necessary to make better, smarter and faster business decisions can help propel the organization.
There’s more than one way to skin big data
There’s the hard way and the easy way – you decide.
Leveraging data analytics to accomplish business goals is crucial in today’s business environment. Data analytics can tell you many things about a customer and it plays an important role throughout the customer lifecycle. But many business managers are frustrated because they cannot properly use their organizational data to further their business goals.
Challenges To Liberating Data
Business managers want to liberate their data into analytical insights, but are stymied by:
- Conflicting priorities: IT’s primary role is to maintain and enhance the data architecture. IT has its own agenda and in many instances can only accommodate business requests if and when cycles free up.
- Change requests: The constant back and forth requests from the business to IT are often due to iterative changes not envisioned until after the fact. In many cases, you only know what you want after you see a first or second draft of a report. No matter the complexity of the change, it results in adding days, if not weeks, to timelines. During that time the data gets old.
- Access: There are many internal solutions available to acquire, prepare, analyze, act and operationalize data, but many are too complex for the average business layperson.
In essence, the business user wants to do something that in their mind is simple, but is stymied by complex tools, kludgy integrations between tools and a lack of technical expertise. By analogy, business users feel they’ve been handed a powerful tool like Photoshop when all they want is something that lets them do basic editing without a long learning curve.
The three challenges above could be alleviated if only the business user were given control to liberate their data. It’s not for lack of initiative: the current approach of giving the business control of its data isn’t translating into results because fundamental, and sometimes obvious, impediments go unaddressed. After all, the business should only need to engage the IT department once to find and integrate the proper data. After that, it should be free to use that data any way it likes.
Let’s explore the three big challenges that business users are struggling with after they gain access to the data:
Three Main Challenges After Gaining Access to Data
- Point Solutions Aren’t Cutting It: Current data infrastructures built from vertical tools (e.g., MDM, BI, data quality solutions), while meeting operational needs, do little to address the analytical needs of business users because those tools are fragmented. They typically focus on discrete tasks that follow a logical sequence – acquire data, prepare data for analysis, apply analytics, act and operationalize analytical insights. While these tools solve operational issues, they do not resolve a specific business issue from beginning to end. As a result, companies struggle with multiple point solutions that aren’t tightly integrated, which forces reliance on a small set of experts with knowledge and experience in one or more of them.
- Organization Fragmentation Makes Things Difficult: From an organizational standpoint, groups are fragmented across the company, each with a specific goal. Business leaders want to set up quick analytical processes to further their goals. IT managers want to meet the needs of all internal constituents while reducing the impact on existing infrastructure. But each group has a different need: data scientists need access to more clean data; the business intelligence department wants to create and maintain sound, consistent, agreed-upon reports; the data governance department needs to show measurable results to the CIO or CEO to justify existing and future budgets; and analysts need a faster way to create reports that answer business questions on the first try. This leaves business leaders and IT managers looking for outside help to ensure everyone’s needs are met.
- Boiling it Down for the Business Owner: From the standpoint of the business owners, they need quick results to meet their goals, but they cannot afford to align all parties and solutions to meet their business goals. Remember, they need clean, trusted data to make better business decisions, but reaching across organizational data silos to find that data is both cumbersome and time consuming.
So the question becomes: how do we solve these challenges?
Organizations need a self-service, business-friendly tool that streamlines the process of acquiring, preparing, analyzing, acting on and operationalizing data, accelerating the time it takes to turn raw data into insights on which confident decisions can be made. In addition, cloud-based tools speed deployment without the need to buy new hardware and can scale easily with less dependence on already burdened IT resources.
By putting data analytics into the hands of business users, the organization relieves the pressure on IT managers, enabling business users to prepare, visualize and analyze more data and to generate insights that drive faster, smarter business decisions and processes.
As business decisions are becoming more analytical and time to market is crucial, business and IT need to think differently in order to meet all of their company’s goals.
How to successfully implement a big data/data lake project
Big Data’s Batting Average
We’ve all heard it before: “This was an extremely successful pilot, but we still need a solid business case to help our cause when we pitch this idea to the executive team.” Gartner says that, as of 2016, only 14% of big data programs had hit the “production” stage, while 70% of big data initiatives had not moved past the “pilot” phase. Chances are most big data budget requests for 2017 were turned down by the CIO/CEO, or put on hold, due to the inability to deliver a compelling business use case or direct sponsorship from business teams.
One of the most successful big data use cases in recent years centered on a big data platform driven by a data lake. The idea was to store raw data and open up decentralized data access to business teams, democratizing data so that all levels – from the CEO to the shop floor – could access the analytics power needed for effective decision making. Naturally there was tremendous support from business teams who had been deprived of data access for years. Yet despite successful use cases and general acceptance, IT executives continued to struggle to justify further investment in data lake initiatives or to attract sponsorship from C-level executives.
Learn How Implementing a Big Data/Data Lake Can Help your Organization Draw Good Data and Make Better Decisions.
Driver for Change and Innovation
As technology teams continue to be influenced by the hype and disruption of big data, most fail to step back and understand where and how it can deliver maximum business value. Radically disruptive new business processes can’t be implemented without first gathering knowledge and understanding how big data technology can become a catalyst for organizational and cultural change. Change has proven most effective when instituted as smaller, incremental steps that demonstrate value along the way, rather than a big-bang, boil-the-ocean, multi-year project that is bound to meet organizational resistance.
Looking back at historical projects – ERP, CRM and enterprise data warehouse (EDW) programs weren’t identified, created and executed overnight, and neither can big data be. Big data/data lake programs have much to learn from their predecessors. Success can come from capitalizing on existing successful programs – their processes, timelines, technology, etc. – and identifying small improvements that can quickly bring business value and build momentum toward the next milestone.
The goal of such an approach is to find areas of innovation across existing programs, defining small wins that build new, innovative thinking across business teams. Small wins provide excellent platforms for leaders and small groups to learn from, spark more creativity and have a cultural impact on the organization’s data usage. Such changes drive quick acceptance of data programs like big data/data lakes.
A big data analytics platform with self-service capabilities allows you to draw on the data inside the data lake to make better decisions. Rather than waiting for IT or a data scientist to pull the data you need, you’re able to do it yourself and not lose the opportunity at hand because you were waiting days or weeks for the data. Below are a few considerations to keep in mind once you’ve disrupted your organization with a data lake:
- Data Quality: We said earlier that the idea of a data lake has been around longer than the term itself. Assuming that’s true, your organization likely already has a data lake. But is it a data lake or a swamp? Have you done anything to ensure the quality of the data before it’s transferred for analysis? You’re not alone. Most organizations have used their data lake as a dumping ground for the past few years on the assumption that they’ll eventually need the data. The reality is that they do need the data, but the information has to be clean. Applying big data quality before the data is transformed during the ELT process means you’re actually analyzing data you can trust. Novel concept, I know.
- Machine Learning: After you’ve cleaned up your data, apply machine learning analytics to improve the quality of your analytical outcomes. That’s where the real value of the data lake comes from. Look at how business problems can be addressed by machine learning. Remember you no longer need to be an expert to receive help from machine learning.
- High Volume Processing: At an earlier place of employment, I worked in a corporate complex that housed three office buildings with various tech companies. There was a shared cafeteria, and I remember overhearing a lunchtime conversation about how many Teradatas’ worth of storage one gentleman needed for all the data his company was storing. The reality is that what matters is not how much data you store, but how much data you can process. Storing data just drives up your storage bill. But if you can process that data and draw conclusions from it, now you’re in business. And the only way to do that is to make sure the data you’re drawing conclusions from is good-quality data at scale.
- Business User: Who are the frustrated business users in your company and what are they looking for? You need to put them on your side to strengthen your business case. Their ability to extract data from the data lake will inevitably help push the project through.
- One Step at a Time: Finally, the obvious one, work step by step, one use case at a time. Look for the low hanging fruit and prove to management that you can execute.
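To make the data quality consideration above concrete, here is a minimal “quality gate” of the kind you might run before analysis. It’s a pandas sketch with invented sensor data and an assumed sentinel value of -999 for bad readings:

```python
import pandas as pd

# Hypothetical files landing in a data lake; the schema and the -999
# sentinel value are illustrative assumptions.
raw = pd.DataFrame({
    "sensor_id": [101, 101, 102, 103, 103],
    "reading":   [21.5, 21.5, -999.0, 19.8, 19.8],
})

clean = (
    raw.drop_duplicates()          # remove exact duplicate records
       .query("reading > -100")    # drop sentinel/garbage readings
       .reset_index(drop=True)
)

# Only analyze what survived the quality gate
print(len(raw), "raw rows ->", len(clean), "clean rows")
```

Real pipelines run checks like these continuously as data lands, rather than once at analysis time, but the principle is the same: gate first, analyze second.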
If you follow these steps, you will improve your chances of a successful data lake implementation. History repeats itself, and we can learn from the data warehouse and cloud implementations of the recent past to avoid the mistakes that were made.
Data mutations in your data lake? Explore these data quality strategies
Data Cleanliness Dilemma
If the DNA of your data is mutated, then it’s dirty data, and one can predict that any analysis of your big data is going to yield inaccurate results.
When you’re working with big data that hasn’t been validated for data quality prior to applying analytics, it’s a safe bet to conclude that no one is going to vouch with high confidence that the analytical outcomes have a high degree of accuracy.
When it comes to using big data, we marvel at the possibilities of groundbreaking predictions and game-changing insights that could transform the fortunes of the organization and leapfrog the competition. But none of that will come to fruition if we lack confidence that the data is reliable.
Costs of Dark Data
Improved insight comes from combining different data sets to make connections that were previously impossible to see. While that’s logical, in practice much big data remains underutilized due to insufficient quality. This happens when tools aren’t readily available to quickly profile data and rate its reliability, and when big data quality tools aren’t at one’s fingertips to expedite cleanup. What we need is less dark data and higher confidence and trust in the big data we are using – data quality for big data, applied before analytics, so we can make high-impact predictions with conviction.
The Old World: Data Warehouses
Not too long ago, we saw a run on the traditional data warehouse. It was the perfect solution for storing all of that data. But as more and more data was produced, bottlenecks were exposed, and many now question whether data warehouses are obsolete. While data warehouses support reporting on historical data, they fail to harness the power of predictive analytics that can be unleashed by leveraging big data.
To combat these issues many organizations have jumped to data lakes. Data lakes can store endless amounts of data, from multiple systems, in any format or structure, all in one place.
For these reasons, and others, many organizations have been moving their data from various warehouses into data lakes. The benefits include the ability to derive value from unlimited types of data, to store structured and unstructured data alike, to query data in unlimited ways, and overall flexibility.
However, while the storage and access to data is improved, running analytics to gain insight into what the data means continues to be a major challenge.
Data Lake vs. Data Swamp
Many organizations across industries have invested millions of dollars in data lakes. They save money because their data is no longer siloed and is preserved in its native form, but their data lake has turned into a data dumping ground, or what some fondly call a data swamp. Organizations have dumped data from various sources into the data lake, hoping to run analytics on it down the road. While IT applauds the move because they no longer have to spend time understanding how information is used, getting value out of the data is often a problem, especially without determining data quality.
To run a successful analytics program in a data lake, organizations need to ensure their data is of the highest quality. Unlike food or medicine, data carries no “born on” date and no expiration date to determine its value. It is important to define where the data came from, how it will be used and how it will be consumed.
Enterprises can do this by sending in a team of IT professionals to manually reconcile the data in a very specific way. But reconciling hundreds of sources within a data lake and correcting the errors is tedious and can take a tremendous amount of time.
In addition, data lakes make certain assumptions about the users of information. It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources and that they understand the incomplete nature of structured and unstructured data sets.
With no restrictions on the cleanliness of the data, errors can still slip through, making the data unreliable and not trustworthy, eventually hurting business intelligence and an organization’s reputation.
Getting data quality right may take a significant effort, but it does not have to be a manual process. It can be operationalized, saving an organization immense time and money.
Operationalizing Data Quality
To ensure data quality within your data lake you need a self-service, big data analytics platform designed to handle not one, but rather multiple steps from data acquisition and preparation, to data analysis and operationalization. The platform should enable users to source data from multiple data platforms and applications, including vendor products, external databases and data lakes.
The platform should sit on top of your data lake and monitor data quality. It should enable users to create automated notifications, manage exception workflows and develop automated data processing pipelines to integrate the results of that analysis back into operational applications and business processes.
In addition, the platform should enable users to apply statistical and process controls, as well as machine-learning algorithms for segmentation, classification, recommendation, regression and forecasting. Users should be able to create reports and dashboards to visualize the results and collaborate with other users.
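As one illustration of the statistical process controls mentioned above, a simple three-sigma limit check could drive an automated notification. This is a hedged sketch in plain Python; the error-rate figures are invented:

```python
import statistics

# Hypothetical daily data-error rates; the last value looks anomalous.
daily_error_rates = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 4.5]

mean = statistics.mean(daily_error_rates[:-1])    # baseline from history
stdev = statistics.stdev(daily_error_rates[:-1])
upper_limit = mean + 3 * stdev                    # classic 3-sigma control limit

latest = daily_error_rates[-1]
alert = latest > upper_limit  # would trigger an automated notification

print(f"baseline mean={mean:.2f}, 3-sigma limit={upper_limit:.2f}, alert={alert}")
```

A platform would wire a rule like this into an exception workflow; the point is that the control itself is simple once the data feeding it is trustworthy.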
Always remember how the old saying goes: “Garbage In, Garbage Out.”
Data quality in the age of big data to achieve actionable insights
Less is More
While the growth of big data and cloud has multiplied our access to data, we now face a new set of challenges in how many different applications and people must be involved in turning raw data into actionable insights. Take, for example, the multiple applications and handoffs required between capturing data, preparing it, analyzing it, visualizing it and operationalizing it. The central issue with specialized solutions that perform only specific functions, such as preparing data or analyzing data, is the complexity of integration and the time required to turn raw data into actionable insights.
The name of the game is speed coupled with quality insights. Let’s talk speed first. To get from A to Z, would it be faster to use one integrated platform or to take a best-of-breed approach? Every company I’ve talked with is struggling to get more done with fewer resources, and every analytics project that sits on the shelf behind other competing priorities puts an organization at a competitive disadvantage. While speed is crucial, let’s not forget about a new challenge in the era of big data – quality.
Data quality is suspect when it comes to big data. Big data gathers dust because we don’t know whether it’s of sufficient quality to use in the first place; the other issue is knowing where to find it. I constantly hear that we need a data quality checkpoint for big data in the process, so the analysis yields high confidence in the results. An integrated tool that adds data quality in parallel with data prep, while keeping track of data lineage, reduces risk and supplies concrete proof that the data is reliable. What we don’t want is executives in the boardroom constantly questioning the viability of the analysis because no one can answer the age-old question, “Is the data being used for analysis accurate, or garbage?”
Let’s continue to explore why less is more.
A “Non” User-Friendly Platform
IT is on a mission to extract insights from the massive amounts of data an organization collects. With so much focus on data and the insights to be gained, IT has likely implemented a slew of manual point solutions to ensure every bit of insight is garnered from the data. When IT extracts that data from storage, they need to combine it with other data sets, which inadvertently opens an opportunity for the data to be changed and calls its quality into question. The process is inefficient, drains resources and leaves executives dissatisfied with the results. A company’s ability to quickly retrieve and confirm quality data can be a significant factor in determining the ultimate success or failure of a business.
Proactive Versus Reactive
It can take days, and in some cases weeks, when IT is forced to extract and process data. But what if you need timely customer data to make time-sensitive business decisions? By the time the data is extracted, it can already be too late to win or keep a customer, or to put the right information into their hands. Generally speaking, if data isn’t orchestrated properly, organizations end up reacting to problems after they have already occurred, rather than staying in front of customer concerns and proactively preventing them.
Additionally, IT teams must constantly watch for new data, and when they find it, load and prepare it for analytics. Managing this process manually is lengthy, and in the age of big data, real-time analytics is expected; anything less means slower decisions, which can affect revenue. Executive teams are used to immediate gratification. Staying current with business change means dashboards that reflect real-time information. It also means resolving data discrepancies in minutes, rather than waiting on an IT department already bogged down with other responsibilities.
The Power of Empowerment
IT teams want to satisfy the business users within their organization, but they simply do not have the time. In some organizations, business users are constantly on top of the IT team, asking them to extract and process data for better business decisions; but an already overloaded IT team can’t jump every time a business user needs them. How nice would it be to hand the business user a data extract that the end user can modify on their own?
Let’s face it: separate solutions for each of these functions are not effective, which is also why adding a data quality step is seen as counterproductive in a process that is already too siloed and takes too long to complete. However, an end-to-end platform that can take the mountains of available data and integrate, transform and handle these challenges can free up IT’s time with a user-friendly solution built on real-time information. Talk about a win-win: IT now has time to manage other priorities, and no one has to juggle multiple applications to pull the data the business needs.
No Trade-Offs Necessary to Achieve Actionable Insights
A self-service big data analytics platform designed to replace not one but all of the point solutions IT uses independently enables a business to automate and streamline its processes and to add an important missing step: data quality for big data. Rather than taking multiple steps to handle data acquisition and preparation, data analysis and operationalization, or visualization, one platform can pull reports, detect new data and provide insights. With access to specification logs, dashboard preparation and visual workflows, the platform empowers business users to aggregate and control data, accelerating and improving the subsequent analysis.
A point-and-click solution will not only save time but also increase productivity within budget.
Break through these business challenges preventing the adoption of big data
Big data holds a lot of promise with vast quantities of different types of data made available to mine for deep insights. From creating a central data repository to data discovery and analytics, there is no shortage of possibilities with big data. While these possibilities excite data enthusiasts, we continue to see a hesitation to fully embrace big data by business users.
Challenges Associated with Embracing a Big Data Environment
We have found that this hesitation is partially due to four obstacles.
Let’s dive into each of them to understand the issues and address the elephant in the room: how does one turn the problem into an opportunity? We need to focus on incremental improvements that lead a greater percentage of business users to embrace their big data environments and realize value from multimillion-dollar big data investments.
50-80% of a user’s time is spent scrubbing data. The questions are: why, and what can be done?
- Data Quality: One of the most prevalent issues with big data is its quality. Early in the evolution of data lakes, big data environments were treated as dumping grounds for different types of data, with no requirement for quality. The idea was to siphon off as much data as possible with the intent of using it in the future. But those days are long gone: businesses are now trying to mine big data for insights, and spending 50-80% of a user’s time scrubbing data is counterproductive. We all know by now that “garbage in, garbage out” also applies to big data.
To get the most out of big data, you need to find ways to improve the quality of the data without spending exorbitant amounts of time fixing it. As big data environments receive different types of data from both internal and external third-party sources, more often and at higher volumes, quality becomes even more important. Traditional data quality tools, and other tools built on dated technology, are not well equipped to provide data quality in big data environments – it’s like trying to fit a square peg in a round hole, and you can’t solve today’s problems with yesterday’s technology. What is required is a next-generation data quality tool built specifically for big data environments. A tool that automates the big data quality process, offering turnkey self-service options with simple drag-and-drop functionality, opens the door for more resources who aren’t coders to clean up the data, improving the use and adoption of big data.
- One Stop Shop: The need for multiple tools is another major issue that hampers the adoption of big data. A suite of tools, on different platforms, with different standards and learning curves, is required for ingesting, preparing, analyzing and operationalizing insights from both traditional databases and big data sources. This has long been the status quo for how data is accessed and analyzed. One solution some pursue is to use complementary tools from a single vendor. With this choice, integration across those tools is often fraught with unnecessary limitations and doesn’t solve the underlying problem. A better option is a platform tool that runs natively in a big data environment and capitalizes on the power and performance of Spark to shorten the path from data ingestion, through preparation, analysis and visualization, to operationalization. The major benefit of a unified tool is that it provides a simple way to go back and seamlessly make changes at any point in the process without requiring cumbersome integration steps.
- Business User Accessibility: We often hear that the data in big data environments is simply not usable by business users because the data sources are not easily accessible to non-technical users. In addition, the data tends to be in unreadable formats and requires significant data preparation. The tools available for these functions are not designed for the average business user because they demand advanced technical skills. With the rise of the “citizen data scientist,” we know that business users are looking for a more active role in the data-to-insights process. The right tool that empowers business users and makes it easy for them to own the process will substantially increase the adoption of big data.
- No SQL Here: Closely linked to the point above, advanced skills are required to successfully access, prepare and analyze data in big data environments. An efficient solution is to equip current employees with a tool that is intuitive and easy to use without the need for any R or SQL experience. This might seem like searching for a needle in a haystack, but technology exists that does not require programming experience.
Solving Big Data Environment Adoption Problems
While these obstacles may seem insurmountable, this is certainly not the case. Business users can be empowered with a next-generation tool that easily aggregates data from various sources, pinpoints data of interest, performs aggregations and transformations, evaluates and reviews data quality and combines and correlates data from different sources using a visual data prep process. Every single step in the process can be accomplished without SQL queries, creating repeatable, automated analytics that eliminate errors and accelerate time to insight.
Why a data supply chain is required in the age of big data
There is No Greater Fraud Than a Promise Not Kept
The promise of big data rests on customer impact and value creation at remarkable speed. However, the more data an organization collects, the more difficult it becomes to manage and analyze that data and realize those values. It’s not for lack of interest, value or investment: many organizations realize that winning requires harnessing the power of big data to become an industry juggernaut, using data-driven decision making to understand customers’ needs before the customers do. Yet many organizations are coming up short on their big data initiatives. In fact, Gartner predicted that “through 2017, 60% of Big Data projects will fail to go beyond piloting and experimentation and will be abandoned.” So what should an organization do if it wants to harness the power of big data without becoming such a statistic? To make better business decisions, organizations are repurposing the supply chain management discipline in a new way: the data supply chain.
Ready, Fire, Aim. Do It! Make it happen! Action Counts. No one ever sat their way to success.
What is a Data Supply Chain?
To understand a data supply chain, start by picturing a traditional supply chain: the sequence of processes that transforms raw materials into finished goods and then distributes those goods. Supply chain management tracks how goods and services flow through the chain effectively and efficiently. The challenge is that most organizations have a data supply chain but zero visibility into how it works. This can quickly become the Achilles’ heel that undermines your big data legacy.
Now correlate this approach to organizational data. Data is the raw material that enters an organization. That data is then stored, processed and distributed for analysis – akin to the transition of the raw material to the distribution of a finished product. Or in our case, from raw data into insights. The last leg of a data supply chain involves an easily searchable data portal that allows the business user to discover and order the data to their particular environment.
To break it down, the data supply chain consists of three parts. First, on the supply side, data is created, captured and collected. In the middle stage, management and exchange, the data is enriched, curated, controlled and improved. Finally, on the demand side, data is used, consumed and leveraged. Those who master the process will become the industry leaders that laggards try to emulate.
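The three stages above can be sketched as a simple pipeline. This is a minimal illustration of the concept, not any particular product’s API; the function names and the sample records are hypothetical.

```python
# Minimal sketch of the three data supply chain stages:
# supply (create/capture/collect), management and exchange
# (enrich/curate/control/improve), and demand (use/consume).

def supply(raw_records):
    """Supply side: capture raw data as it enters the organization."""
    return [dict(r) for r in raw_records]  # collect as-is, in copies

def manage(records):
    """Middle stage: enrich, curate and improve the data."""
    curated = []
    for r in records:
        r["name"] = r.get("name", "").strip().title()  # standardize names
        if r.get("amount") is not None:                # drop incomplete rows
            curated.append(r)
    return curated

def demand(records):
    """Demand side: consume the curated data for analysis."""
    return sum(r["amount"] for r in records)

raw = [{"name": " acme corp ", "amount": 1200},
       {"name": "globex", "amount": None}]
total = demand(manage(supply(raw)))
print(total)  # 1200 (the incomplete Globex row was filtered out)
```

The point of the sketch is that each stage has a distinct responsibility, so problems (a malformed name, a missing value) are caught in the management stage rather than at analysis time.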
It sounds straightforward in principle, but as Peter Drucker observed, “Strategy is a commodity, execution is an art.” An ideal place to start is by clearly understanding the challenges that lie ahead in order to anticipate how to create a highly efficient data supply chain.
Data Storage Challenges
Organizations feel overwhelmed by the volume, variety and velocity of data entering their systems. If their infrastructure is dated, the challenge is even bigger because there is nowhere to store the data. As a result, IT ends up tossing out the data or storing it only briefly before deleting it, even though the business knows data has a shelf life after which it is no longer of significant value to analyze. Any gaps in the supply chain prevent organizations from running predictive analytics because the historical data is missing.
Such scenarios have led organizations to adopt data lakes and other big data storage strategies which, while flexible and cost-effective, have brought other difficulties to the forefront. One of those challenges is data discovery: an organized way of finding data once it is stored. Organizations often just load data into a repository with no attempt to clearly define what the data is or put it into a relational or query structure. This clouds up the data lake and makes it nearly impossible to find or deliver the data to the business user.
The Benefits of a Data Supply Chain
Every business that deals with data has a data supply chain, but most businesses are focused on taking in data and/or analyzing the data that they have. What they are missing are the processes that enable meaningful analytics and, ultimately, insight. The benefits of an optimized data supply chain are like the benefits of an organized kitchen. The best chef in the world will struggle to make a good meal in a messy kitchen with unlabeled, or worse, mislabeled ingredients on the shelves. If you think of data as the ingredients in that messy kitchen, you have described the data situation at many firms. Even if the chef has an assistant, the assistant won’t know what ingredients need to be bought from the grocery store because they don’t know what is in the cupboard. The determined chef will persevere and sort through the cupboards to find out what’s there, but it could take all day to make a simple lunch, and there is always the danger of putting paprika in the soup instead of salt. Furthermore, you don’t want to pay a world-class chef to clean the pots and pans, yet this is essentially what some businesses are doing when they hire high-priced data scientists who spend their hours sifting through or “cleansing” unreliable data.
Creating a Data Supply Chain
To create a successful data supply chain, organizations need to know where their data is and how to find it. If they don’t, the process is undoubtedly strenuous and time-consuming. Requesting and receiving data through a single, easily searchable portal can improve searchability, promote higher productivity, enable compliance and streamline data supply chain management. Luckily, new solutions automate the process. But for meaningful analysis of data to be successful, three types of initiatives need to be in place:
- Track Conceptual Metadata: Conceptual metadata is the meaning and purpose of a data set from a business standpoint. For example, a salesperson needs an address to send material to a current customer. The problem is that a customer might have several addresses: the address of the factory, the billing address, the general customer service address and the address of the executive suite. To know which address is right for sending sales material, a seemingly simple data field like “address” must be appropriately labeled and appended with meaningful metadata.
- Track Data Lineage: Data lineage helps organizations track where their data came from, what systems and processes it went through, how it was formatted and how it was transferred. With all of that information, organizations know exactly what they are dealing with when it comes to their data.
- Ensure Data Quality: Knowing that organizational data is complete, accurate and consistent is paramount, because business managers often cannot tell whether their data is trustworthy. If they select low-quality data, unaware of the quality issue, it can lead to flawed business decisions and, ultimately, organizational disaster.
When organizations track conceptual metadata, track data lineage and ensure data quality, searchability becomes possible, leading to end-to-end data supply chain success.
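As a rough sketch, the three initiatives can be pictured as a data set that carries its conceptual metadata and lineage alongside the data itself, with a simple quality check run before the data is used. The field names and the completeness check below are hypothetical, chosen only to illustrate the idea, not a prescribed schema.

```python
# A hypothetical data set record that keeps conceptual metadata
# (business meaning), lineage (provenance), and the values together.
dataset = {
    "values": [
        {"customer": "Acme", "address": "12 Mill Rd", "kind": "billing"},
    ],
    "metadata": {  # conceptual metadata: what this data means to the business
        "description": "Customer billing addresses used for invoicing",
        "owner": "finance",
    },
    "lineage": [   # where the data came from and what processes touched it
        {"step": "ingest", "source": "crm_export"},
        {"step": "dedupe", "tool": "address_cleaner"},
    ],
}

def quality_check(ds):
    """Basic completeness check: every row must carry the required fields."""
    required = {"customer", "address", "kind"}
    return all(required <= row.keys() for row in ds["values"])

print(quality_check(dataset))  # True
```

With metadata answering “what is this?”, lineage answering “where did it come from?”, and a quality gate answering “can I trust it?”, a business user searching a data portal can judge a data set before ordering it into their environment.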