Dataiku and H2O Driverless AI & MLOps on the test bench
To find out which machine learning platform is best suited to get your ML solution from the lab into production, you can use an evaluation framework. But how do different ML platforms on the market actually perform when this framework is applied? Let’s put two of them – Dataiku and H2O Driverless AI & MLOps – to the test.
by Marcel Moldenhauer
As explained in this article, we can evaluate machine learning platforms in terms of coverage and maturity across each functional area and component. To give you a glimpse of the insights this generates, we will take a look at two ML platform contenders: Dataiku and H2O Driverless AI + MLOps. In the case of H2O, we consider two separate products, H2O Driverless AI and H2O MLOps, which can be used independently but in combination cover the end-to-end ML lifecycle.
The assessment shows that the two platforms differ considerably in their approach to the data science workflow. While H2O Driverless AI aims to automate the data science lifecycle and thereby opens up model training to citizen data scientists as well (see box: What is a citizen data scientist?), Dataiku aims to support the day-to-day work of data scientists, from data exploration and cleaning to model training and deployment.
Figure 1: Dataiku and H2O Driverless AI & MLOps in direct comparison.
1. Data ingestion and storage
Dataiku provides out-of-the-box connectors for batch and streaming data from over 25 leading data sources, both cloud and on-premises. These include the prominent hyperscaler services Amazon S3, Azure Blob Storage and Google Cloud Storage as well as Snowflake, SQL and NoSQL databases and HDFS. The Dataiku Visual Flow enables coders and non-coders to collaborate on the same project by seamlessly integrating no-code and code-based building blocks and by making it easy to build and monitor data pipelines. The platform allows the use of built-in or customizable recipes to clean, prepare and analyze data. Additionally, built-in data transformers perform common data manipulation tasks such as finding and replacing values or normalizing data. For more sophisticated tasks, users can extend the built-in functionality with custom code in Python, R or Scala. Monitoring of data pipelines, however, is limited to a log of all actions, as Dataiku does not provide metrics for dashboarding.
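To make the kind of work these built-in transformers do concrete, here is a minimal plain-Python sketch of a find-and-replace step followed by a normalization step. This is an illustration of the concept only, not the Dataiku API; the function and column names are hypothetical.

```python
# Illustrative only: the kind of cleaning steps a visual prepare recipe performs
# (find-and-replace, min-max normalization). Not Dataiku API code.

def find_and_replace(rows, column, mapping):
    """Replace values in `column` according to `mapping`, leaving others untouched."""
    return [{**row, column: mapping.get(row[column], row[column])} for row in rows]

def min_max_normalize(rows, column):
    """Rescale a numeric column to the [0, 1] range."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - lo) / span} for row in rows]

records = [
    {"country": "DE", "revenue": 100.0},
    {"country": "Deutschland", "revenue": 300.0},
    {"country": "FR", "revenue": 200.0},
]
cleaned = find_and_replace(records, "country", {"Deutschland": "DE"})
scaled = min_max_normalize(cleaned, "revenue")
```

In Dataiku, steps like these are configured visually and chained in the Visual Flow rather than written by hand, which is what allows non-coders to participate.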
In comparison, H2O Driverless AI works on existing big data infrastructure, on bare metal or on top of existing Hadoop, Spark or Kubernetes clusters. Data is ingested directly from Hadoop HDFS, Spark, Amazon S3, Azure Data Lake or other data sources out-of-the-box. Manual and customized transformation capabilities are restricted because the platform executes modification functions as code in a simplistic text box with basic syntax highlighting. H2O Driverless AI’s strength is automatic feature generation and transformation, which automates the engineering of new, high-value features for a given dataset. Additionally, H2O Driverless AI helps users with automated visualizations to gain a quick understanding of their data before they start the model building process. Monitoring of data ingestion processes is not possible with the H2O Driverless AI platform and needs to be implemented via separate tools.
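To give a flavor of what automatic feature generation means, the following sketch derives simple pairwise interaction features from numeric columns. Driverless AI searches a far larger space of transformers with evolutionary techniques; this toy example, with made-up names, only illustrates the basic idea of mechanically deriving candidate features.

```python
# Hedged sketch of automatic feature generation: mechanically derive product
# and ratio features for every pair of numeric columns. Names are hypothetical.
from itertools import combinations

def generate_numeric_features(row, columns):
    """Return the row extended with product and ratio features for column pairs."""
    features = dict(row)
    for a, b in combinations(columns, 2):
        features[f"{a}_x_{b}"] = row[a] * row[b]
        if row[b] != 0:
            features[f"{a}_div_{b}"] = row[a] / row[b]
    return features

sample = {"age": 40, "income": 80000}
enriched = generate_numeric_features(sample, ["age", "income"])
```

The point of an AutoML platform is that such candidate features are generated, evaluated, and pruned automatically instead of being hand-crafted one by one.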
2. Experimentation zone
Dataiku’s AutoML capabilities provide automated solutions for feature engineering and model training. For code-based experimentation, Dataiku supports a variety of Jupyter-based notebooks using Python, R and Scala. For deep learning models, data scientists can draw on Keras and TensorFlow modules and libraries and thus utilize the additional performance of GPUs for training and deployment. Furthermore, Dataiku allows solid management of models and offers a variety of visualizations to understand model outputs and behavior. The Dataiku Visual Flow operates and governs the whole experimentation process in one unifying view, from data ingestion to deployment.
H2O Driverless AI is first and foremost an AutoML platform. It fully focuses on automated ML development, making it easier and faster to build, train and evaluate ML models for all sorts of analytics personas, including people with no coding background. Additionally, one can use custom code snippets (provided in external code repositories like Git) to expand the given AutoML capabilities. Manual development of ML models via code is not possible within H2O Driverless AI. Robust techniques and customizable visualizations are provided for experiment tracking, which helps interpret and explain the results of ML models. One point worth mentioning is that H2O Driverless AI puts an emphasis on model explainability and provides a large suite of visualizations to tackle this increasingly important topic in AI.
3. Continuous integration
Dataiku provides integration with Git, including version control of projects, importing Python and R code, developing and importing reusable plugins, and more. Datasets created via the Dataiku Visual Flow are automatically versioned in case data pipelines are executed multiple times. Models developed via the building blocks provided by Dataiku are versioned by default, together with their metadata. Dataiku does not provide a comprehensive feature store; however, one can build a set of recipes acting as a functionally limited feature store.
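The idea of recipes standing in for a limited feature store can be sketched in plain Python: feature-computation functions are registered under stable names so every team applies the same definition. This is a conceptual illustration, not Dataiku code; all names are hypothetical.

```python
# Minimal sketch of shared, named feature definitions acting as a very limited
# "feature store". Plain Python, hypothetical names -- not the Dataiku API.

FEATURE_RECIPES = {}

def feature_recipe(name):
    """Register a feature-computation function under a stable name."""
    def register(fn):
        FEATURE_RECIPES[name] = fn
        return fn
    return register

@feature_recipe("basket_value")
def basket_value(order):
    return sum(item["price"] * item["qty"] for item in order["items"])

def compute_features(order, names):
    """Look up and apply registered recipes so teams reuse one definition."""
    return {name: FEATURE_RECIPES[name](order) for name in names}

order = {"items": [{"price": 10.0, "qty": 2}, {"price": 5.0, "qty": 1}]}
features = compute_features(order, ["basket_value"])
```

A full feature store would add versioning, lineage, and online/offline serving on top of such a registry, which is exactly what this recipe-based workaround lacks.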
H2O Driverless AI delivers a comprehensive model store that persists and versions models developed on the platform. A basic dataset manager displays all usable and connected datasets, including metadata. H2O recently introduced a feature store, which, however, is not part of this assessment. H2O MLOps and H2O Driverless AI provide a shared production model repository: this enables teams to easily collaborate and deploy models onto test or production environments, and it creates a well-functioning link between experimentation and industrialization of models on the platform. ML models developed externally in code can be deployed by H2O MLOps with the necessary code wrappers.
4. Industrialization zone
Dataiku Data Science projects bundle developed ML models as a ready-to-deploy package, including all necessary environment variables to run it in a production environment. Containerization requires extra plugins, with the possibility of integrating with Kubernetes. The Dataiku unified deployer manages the movement of packaged projects between experimentation and production for both batch and real-time scoring. The Dataiku production environment can schedule routine project tasks such as monitoring, updating data and retraining models, triggered by a schedule or by alerts. Additionally, it is possible to integrate Dataiku into an existing CI/CD landscape for automated testing, retraining and deployment with the help of established DevOps tools like Jenkins and GitLab CI.
H2O MLOps makes it easy to package and deploy models into production environments, either as a single instance or on a Kubernetes cluster. MLOps teams can easily manage multiple environments for development, testing and production, all running in different locations, directly from H2O MLOps. H2O MLOps includes monitoring of different service levels as well as data drift, with real-time dashboards provided via Grafana. For model lifecycle management, H2O MLOps gives the operations team the tools to seamlessly update and promote models in production, troubleshoot models, and run deployment strategies like A/B tests on connected environments.
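To illustrate what data drift monitoring computes under the hood, here is a sketch of the Population Stability Index (PSI), a common drift metric that compares a live feature distribution against the training distribution. This is a generic illustration of the concept, not necessarily the exact metric H2O MLOps uses.

```python
# Hedged sketch of a data drift check: Population Stability Index (PSI) over
# pre-binned histogram fractions. A generic illustration, not H2O MLOps code.
import math

def psi(expected_fractions, actual_fractions, eps=1e-6):
    """PSI between two binned distributions; > 0.2 is a common drift alarm threshold."""
    total = 0.0
    for e, a in zip(expected_fractions, actual_fractions):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

train_bins = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
live_bins = [0.10, 0.20, 0.30, 0.40]   # histogram observed in production
drift = psi(train_bins, live_bins)
```

A monitoring dashboard would track such a score per feature over time and raise an alert, or trigger retraining, once it crosses a threshold.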
5. Data presentation
Dataiku offers effective visualizations to analyze outcomes and share data insights across the team or organization. Interactive, data-driven dashboards can be built, viewed and shared with stakeholders across the company in just a few clicks. Integration with existing BI platforms like Tableau, Qlik and Power BI is provided out-of-the-box. Additionally, models can be deployed as a REST API to be consumed by any interface. With Dataiku Apps, you can easily create AI apps and publish a project as a usable business application.
With H2O Driverless AI, on the other hand, models can be deployed automatically across several environments as a REST API endpoint to be used in any kind of application. Alternatively, you can run them automatically as a service in the cloud (using AWS Lambda) or export them as a highly optimized JAR file for edge devices. H2O Driverless AI also integrates with KNIME and Snowflake. H2O Wave provides a readily accessible, integrated web app platform which leverages ML models developed in H2O Driverless AI. This product was not part of our assessment, but it is worth mentioning.
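Consuming such a REST scoring endpoint, on either platform, boils down to POSTing feature rows as JSON. The sketch below builds a typical request body; the endpoint URL and JSON field names are hypothetical, as the exact schema differs per platform and is documented with each generated endpoint.

```python
# Sketch of calling a model deployed as a REST API. The URL and the "rows"
# payload shape are hypothetical; check the endpoint docs for the real schema.
import json

def build_scoring_request(rows):
    """Serialize feature rows into a JSON body for a scoring POST request."""
    return json.dumps({"rows": rows})

body = build_scoring_request([{"age": 40, "income": 80000}])

# The actual call would look roughly like this (not executed here):
# import urllib.request
# req = urllib.request.Request(
#     "https://scoring.example.com/v1/predict",  # hypothetical endpoint
#     data=body.encode(), headers={"Content-Type": "application/json"})
# predictions = urllib.request.urlopen(req).read()
```

This endpoint style is what lets BI tools, web apps, or edge services consume model predictions without knowing anything about the training platform.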
Dataiku shines as a standalone platform with a focus on ease of use, visual pipelines and minimal coding requirements. Unlike H2O Driverless AI, Dataiku is a fully fledged data science platform which not only covers ML model training but also data preparation, data exploration and the necessary augmentation via code to adapt to the most sophisticated data science use cases. While parts of Dataiku are also usable by citizen data scientists, unlocking its full power requires more expertise. Moreover, Dataiku provides great coverage across all functional areas and their components, but sometimes lacks the necessary maturity, for example in model serving.
In contrast to Dataiku’s more sophisticated approach, H2O Driverless AI helps citizen data scientists with its intuitive UI not only to build models but also to analyze them and successfully bring them to production via H2O MLOps. Given clean data, H2O Driverless AI enables users to quickly build ML models, e.g. for classification, and provides the necessary metrics and plots. This simplified package enables all kinds of people to participate, but it comes at a cost: it is much less flexible and hardly usable for more sophisticated data science use cases. Additionally, H2O Driverless AI + MLOps lacks coverage in Data Ingestion and Storage as well as in the Industrialization Zone, e.g. it does not support components for data pipeline monitoring or model retraining.
There are many more nuances to discuss when evaluating these two ML platforms. Hopefully, this article served as a first appetizer, making you curious to take a closer look at them and at other platform offerings!