Unifying the Data Warehouse, Data Lake, and Data Marketplace
There was a time when developing a data warehouse was sufficient to quench the thirst for data, reporting, and analytics of most business users. Not anymore. Organizations have discovered that data can be a valuable business asset. It has taken some time, but finally they realize they can do more with all the data that’s available than just produce simple reports. With the right data they can distinguish themselves from the competition, reduce costs by optimizing business processes, and create new business opportunities.
Data science, investigative analytics, self-service BI, embedded BI, and streaming analytics are just a few of the many new ways in which data can be used and exploited. To support all these new forms of data usage, organizations are currently developing new systems, such as data lakes, data marketplaces, and data streaming systems. Unfortunately, most of these new systems are developed as stand-alone systems with almost no relationship with the existing data warehouse system.
In other words, organizations are developing systems that all deliver data to business users. Developing all these data delivery systems independently has two severe drawbacks:
- Potentially, these data delivery systems share the same data sources. For example, traditional business users, data scientists, and business users who access the data marketplace for ad-hoc data analysis may all be interested in sales data. If several independently developed data delivery systems share the same data sources, similar solutions will be developed to deliver the requested data to the business users. It will be like reinventing the wheel over and over again, which negatively influences productivity and maintenance. Many comparable solutions have to be developed that deal with integrating, aggregating, transforming, filtering, governing, cleansing, auditing, and securing the data. For example, if the zip codes belonging to customer addresses have to be cleansed before they can be used, each data delivery system needs a solution. Or, if two different systems contain customer address data, each data delivery system needs a solution to integrate them.
- Potentially, these data delivery systems share the same users. For example, a specific user may want to combine results coming from a streaming analytics system with data coming from a data warehouse system to compare what’s currently happening with what’s “normal.” If these systems are developed independently of each other, it’s hard to guarantee that the two results are consistent.
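The zip code example from the first drawback can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the record layout, and the US-style five-digit zip format are all hypothetical and not taken from any particular product. The point is that the cleansing rule is written once and reused by every data delivery system, instead of being reimplemented in each.

```python
import re
from typing import Dict, List, Optional


def cleanse_zip(raw: Optional[str]) -> Optional[str]:
    """Shared cleansing rule (assumed US-style 5-digit zip codes).

    Returns the normalized zip code, or None if it cannot be repaired.
    """
    digits = re.sub(r"\D", "", raw or "")  # drop spaces, hyphens, letters
    if len(digits) == 9:                   # ZIP+4 form: keep the first five digits
        digits = digits[:5]
    return digits if len(digits) == 5 else None


def warehouse_rows(customers: List[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Hypothetical data warehouse feed: full records with a cleansed zip."""
    return [{**c, "zip": cleanse_zip(c["zip"])} for c in customers]


def marketplace_rows(customers: List[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Hypothetical data marketplace feed: a narrower view, same cleansing rule."""
    return [{"customer_id": c["customer_id"], "zip": cleanse_zip(c["zip"])}
            for c in customers]


customers = [
    {"customer_id": "c1", "name": "Ann", "zip": " 02139 "},
    {"customer_id": "c2", "name": "Bob", "zip": "12345-6789"},
    {"customer_id": "c3", "name": "Cy", "zip": "1234"},
]

wh = warehouse_rows(customers)
mp = marketplace_rows(customers)
```

Because both feeds call the same `cleanse_zip`, a user who compares the warehouse result with the marketplace result sees the same zip code for the same customer, which also addresses the consistency concern in the second drawback.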
It’s crucial that organizations, somehow, bring these data delivery systems together, to create one all-encompassing architecture. This unified architecture is responsible for delivering any type of data in any form to any business user.
This unified data delivery platform is probably not an extension of the well-known data warehouse system. It’s an architecture in which the data warehouse system operates as a module that delivers data to an umbrella architecture that deploys other technologies and systems to deliver data, such as a streaming system and a data lake. This data delivery platform unifies the concepts of data warehouse, data lake, data marketplace, streaming data, and any other data delivery system.
The foundation of this new data delivery platform must be abstraction. It must be able to hide from business users how and where data is stored, how it is copied, which technologies are used, whether data is integrated on demand or in batch, and so on. In addition, it must be transparent enough for business users to determine how source data has been manipulated. A data delivery platform must be able to support a wide range of business users, ranging from users requiring governable and auditable reports, to users demanding a highly agile marketplace, and to data scientists who analyze raw data.
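A minimal Python sketch may help make the two requirements concrete: abstraction (the user does not see where the data lives) and transparency (the user can still ask how the data was manipulated). Everything here is a hypothetical illustration; the class names, the `query` and `explain` methods, and the sample dataset are assumptions, not a description of an existing platform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Dataset:
    name: str
    # The fetch function hides storage and technology: it could read a
    # warehouse table, a data lake file, or a stream (all assumed here).
    fetch: Callable[[], List[Row]]
    # Human-readable record of how the source data was manipulated.
    lineage: List[str] = field(default_factory=list)


class DeliveryPlatform:
    """Hypothetical single entry point for all data delivery systems."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}

    def register(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def query(self, name: str) -> List[Row]:
        # Abstraction: callers ask by name only.
        return self._datasets[name].fetch()

    def explain(self, name: str) -> List[str]:
        # Transparency: callers can inspect the transformation steps.
        return list(self._datasets[name].lineage)


platform = DeliveryPlatform()
platform.register(Dataset(
    name="sales_by_region",
    fetch=lambda: [{"region": "EMEA", "revenue": 1200}],
    lineage=["read from warehouse sales table (assumed)", "aggregated by region"],
))

rows = platform.query("sales_by_region")
steps = platform.explain("sales_by_region")
```

In this sketch a report user, a marketplace user, and a data scientist would all call `query` in the same way, while `explain` gives those who need governable and auditable results a way to see what happened to the source data.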
For the coming years, architecting an integrated data delivery platform will be the challenge for many organizations. If they fail to do so, their multitude of data delivery systems can turn into a labyrinth that won’t allow them to get the most out of their data assets. Not everyone’s data thirst will be quenched.
Blog written by Rick Van der Lans, originally published here.