DataOps: A fast and scalable way to turn data into insights
The amount of data that businesses collect is already beyond what we could ever imagine, and it continues to grow exponentially in volume and complexity. As it becomes increasingly difficult to work efficiently with all this data, we need modern approaches that enable data teams to deliver insights at speed and scale. Could DataOps be the solution to this challenge?
by Claude Zwicker
In recent years, data has increasingly become the center of businesses’ (economic) growth and development. Among other things, data is being used to improve customer experiences, optimize supply chains and reduce production waste. What makes all this possible is the fact that, in every industry and every business, we have huge amounts of data that we can work with. And these volumes are set to grow further: according to the market research group IDC, the volume of data will grow at a compound annual growth rate of 32% to reach 180 zettabytes (10²¹ bytes) by 2025.
In line with this enormous growth, managing data is becoming increasingly difficult. Businesses in particular face the following challenges when it comes to managing data projects:
- Data projects are complex: They involve many different components that have to be integrated and enabled to work with each other, such as fast evolving ecosystems, open source and vendor specific technologies as well as tools-oriented development approaches.
- They have a long time to market: 12-18 months is the average time to market for data initiatives.
- They lack data ownership: Providing data for extensive testing, ensuring data quality and adhering to data sensitivity requirements add to the complexity and uniqueness of data supply chain engagements.
- The business speed they have to meet is high: Only 17% of IT teams deliver successful data projects to match the speed of business need.
Until now, many businesses have managed their data projects with centralized, waterfall-like approaches and monolithic applications. However, to solve the aforementioned challenges, we need a different method for data management.
One approach increasingly gaining traction in this regard is DataOps. As Data Science Central says, the objective of DataOps is to “bring together the conflicting goals of the different data tribes in the organization (data science, BI, line of business, operations, and IT)”. By doing so, DataOps aims at improving the quality and collaboration within data projects while reducing their time-to-market.
But how can DataOps achieve that? And does it really work or is it just a hype? Let’s find out by looking at what DataOps consists of and how it can be applied in a real use case.
What does DataOps consist of?
In order to improve the quality of data projects, make them more collaborative and bring them to market more quickly, DataOps uses a combination of processes and technologies from 3 well-proven frameworks and applies them to data:
- Agile: The aim of the Agile methodology is that data teams and users work together more efficiently and effectively. To this end, the data team regularly releases new or updated analytics in short iterations, so-called “sprints” – well aware that these releases might not yet be fully developed. By thus getting continuous feedback from users, the team can react and adapt to changing conditions and requirements in the market. This makes the whole data project far more responsive than if the traditional waterfall methodology were applied: with waterfall, the whole data project happens behind closed doors and only the “end result” is shared with the users, often leading to the data project having long lost touch with what users actually want and need.
- DevOps: When it comes to DevOps (software development meets IT operations), the key word is “automation”. By automating processes such as the integration, testing and deployment of code, DevOps reduces time to deployment, time to market, as well as the time required to resolve issues.
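A minimal sketch of what this automation can look like in a data context: an automated test for a pipeline transformation that a CI server would run on every commit, so a broken transformation never reaches production. The function and column names here are illustrative, not taken from any specific platform.

```python
# A hypothetical transformation step from a data pipeline.
def deduplicate_orders(rows):
    """Keep only the latest record per order_id (rows assumed sorted by time)."""
    latest = {}
    for row in rows:
        latest[row["order_id"]] = row  # later rows overwrite earlier ones
    return list(latest.values())

# An automated check a CI system (e.g., via pytest) would execute on each
# commit; with DevOps practices, deployment is blocked if it fails.
def test_deduplicate_orders_keeps_latest():
    rows = [
        {"order_id": 1, "status": "created"},
        {"order_id": 1, "status": "shipped"},
        {"order_id": 2, "status": "created"},
    ]
    result = deduplicate_orders(rows)
    assert len(result) == 2
    assert {r["status"] for r in result} == {"shipped", "created"}
```

The point is not the specific check but that it runs automatically: once such tests are wired into the deployment pipeline, every change is verified without manual effort.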
- Lean manufacturing: Lean manufacturing is a methodology that aims at minimizing waste within a system while maintaining productivity. This methodology refers to the “operations” side of a data project; more particularly, to the management of the data pipeline: data enters the pipeline, goes through several different steps and exits in the form of reports, models and views. DataOps orchestrates, monitors and manages this constant flow of data and e.g., informs the data analytics team about anomalies in the process through an automated alert. This improves efficiency, quality as well as transparency.
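The automated anomaly alert mentioned above can be sketched in a few lines: a simple statistical check that flags a pipeline batch whose row count deviates sharply from recent history. The threshold and the alerting mechanism are illustrative assumptions; a real deployment would notify the data analytics team via email, chat or a ticketing system.

```python
import statistics

def check_row_count(history, current, threshold=3.0):
    """Flag the current batch if its row count deviates more than
    `threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# Row counts of the last five daily loads; today's load is suspiciously small.
history = [10_000, 10_250, 9_900, 10_100, 10_050]
if check_row_count(history, current=2_000):
    print("ALERT: row count anomaly detected in today's load")
```

Monitoring metrics like row counts, null rates and load durations at every pipeline step is what gives the “lean” flow its transparency: problems surface immediately instead of being discovered in a month-end report.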
In order to implement this set of processes and technologies to better manage their own data projects, businesses can follow a so-called “DataOps journey”. At its end, they reach the state of “full DataOps” – an end-to-end DataOps lifecycle.
The DataOps journey
Currently, most organizations are still at the beginning of their DataOps journey. From a data & analytics technology perspective, this journey consists of 4 steps, the 4th step being the state that we want to achieve (see Figure 1):
Figure 1: Timeline of a DataOps journey.
- DevOps for data: In the first step of the DataOps journey, organizations use standardized ways for code deployments and testing for test and production environments. In contrast, there is no standardization yet around data quality & observability, ETL pipelines etc.
- ETL simplification: In Step 2, organizations introduce user managed configurations for the flexible and automated orchestration and management of ETL pipelines.
- DQ automation: As a third step, we establish metadata-driven automated suggestions and applications of data quality & observability rules using machine learning algorithms.
- Full DataOps: Finally, the goal is to do all data-related activities via the DataOps lifecycle, including the automated registration of data assets in the data catalog, the application of federated security principles and the API creation for programmatic exchange of data between projects.
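To make Step 3 of the journey more concrete, here is a minimal sketch of the metadata-driven idea behind DQ automation: data quality rules live as configuration rather than hand-written code, and a generic engine applies them to every batch. In the full vision such rules would be suggested automatically by machine learning from observed column profiles; the rule vocabulary and column names below are purely illustrative.

```python
# Data quality rules expressed as metadata, not code. In Step 3 of the
# journey, ML would suggest such rules from column profiles; here they
# are declared by hand for illustration.
RULES = [
    {"column": "patient_id", "check": "not_null"},
    {"column": "dosage_mg", "check": "range", "min": 0, "max": 500},
]

def apply_rules(rows, rules):
    """Return a list of (row_index, column, check) violations."""
    violations = []
    for i, row in enumerate(rows):
        for rule in rules:
            value = row.get(rule["column"])
            if rule["check"] == "not_null" and value is None:
                violations.append((i, rule["column"], "not_null"))
            elif rule["check"] == "range" and value is not None:
                if not (rule["min"] <= value <= rule["max"]):
                    violations.append((i, rule["column"], "range"))
    return violations

batch = [
    {"patient_id": "P1", "dosage_mg": 100},
    {"patient_id": None, "dosage_mg": 9000},
]
print(apply_rules(batch, RULES))
# → [(1, 'patient_id', 'not_null'), (1, 'dosage_mg', 'range')]
```

Because the rules are just data, the same engine can validate any dataset, and new rules can be added (or machine-suggested) without touching pipeline code.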
So, this is the theoretical framework for the implementation of DataOps with a focus on data & analytics technology. Now let’s look at how DataOps is established in a real use case and what advantages it really brings.
Use case and best practices: DataOps for a global pharmaceutical company
A global pharmaceutical leader with headquarters in Switzerland approached us with the following problem statement regarding their data landscape:
- Their data landscape consisted of centralized, monolithic on-prem applications, which led to a lead time of 3 months to scale computing resources (for example, getting a new server). As a result, reliability satisfaction was medium and performance satisfaction was low.
- Their data projects entailed a long time to market – the release cycle usually took 3 to 4 months.
- Due to low confidence in the IT organization, some business functions had started building and managing IT systems on their own, resulting in shadow IT.
Confronted with these challenges, it was clear to us that they would need to move into a scalable and efficient delivery approach to satisfy business functions’ requirements. By implementing DataOps, our specific goals were to:
- Drive automation for the 15+ tools and technologies integrated in the client’s new cloud-based data platform (replacing the old centralized monolithic landscape)
- Ensure a great user and developer experience to allow data teams to provide and consume data in a self-service manner, eliminating bottlenecks and improving speed to market
- Develop smart and reusable assets such as out-of-the-box testing frameworks, data observability metrics and publishing data assets for the internal data marketplace to deliver at scale
- Build a DataOps community to ensure trust and collaboration across the company’s many data & analytics teams and drive self-service enablement
By using DataOps, the pharmaceutical company improved their ability to deliver value from data significantly:
- The number of releases increased to 120 per month (vs. 1 release in 3 months previously).
- On average, teams were able to launch the first version of a new data project in only 4-6 weeks (MVP time).
- Moreover, they achieved a high return on investment due to inventory reduction, cost avoidance and resource optimization.
Whilst introducing DataOps at our client, a number of best-practice lessons emerged. These are my top 4 tips for your own DataOps journey:
- Team set-up is key: Differentiate between DataOps engineers and data engineers and make sure that every team has a strong DataOps engineer. The DataOps engineer should be the process owner for building, testing, deploying and maintaining data pipelines.
- Try. Fail. Learn. Repeat.: Start rolling out your data project as soon as it is “good enough” instead of waiting until it is (in your opinion) “perfect”. Support the project development with strong documentation.
- Enablement breeds excellence: Enable your data teams through a center of excellence approach with weekly cadence calls allowing them to learn and grow together.
- Pay attention to product stack harmony: Choose software products & tools that naturally work in harmony with each other to leverage existing integrations between tools for automation rather than having to build them yourself.
DataOps is more than just hype – the combination of Agile, DevOps and lean manufacturing methodologies and their application to data projects has the potential to significantly improve quality as well as collaboration and time-to-market. Thus, DataOps allows businesses to tap into the full potential of the insights still waiting to be discovered in the ever-growing masses of data.