In its open source version, Apache Spark can be installed in your own on-premises data centre or in the cloud, on virtual machines or in containers. The disadvantage of this setup is the complex installation and maintenance. Configuring such a Spark cluster also requires a lot of expertise.
A more pleasant solution is to obtain Spark as a service from a cloud provider. This saves a lot of effort and cost, scales better, and lets you create new Spark clusters within minutes.
There are various cloud providers from which you can obtain Spark as a service. These are PaaS, i.e. Platform-as-a-Service. The big players here are AWS, Google Cloud Platform and Microsoft Azure. In my professional work, I mainly use the Azure Cloud, which is why I will highlight the two most important Spark services in the Azure Cloud in this article: Azure Databricks and Azure Synapse Analytics.
Azure Databricks is a fast, simple and collaborative Apache Spark-based Big Data analytics service designed for data science and data engineering. Databricks was founded in 2013 by the developers of Apache Spark and offers its services on AWS and Google Cloud Platform in addition to Azure.
Azure Synapse Analytics is a limitless analytics service that combines data integration, data processing and Big Data analytics. On 4 November 2019, Microsoft announced "Azure Synapse Analytics" at Ignite, Microsoft's annual developer conference. In addition to the Spark engine, Azure Synapse Analytics contains other components, including the former Azure SQL Data Warehouse, a Massively Parallel Processing data warehouse. Azure Synapse Analytics is therefore a rather recent project.
A question I am frequently asked is which service to use, and it is not an easy one to answer. There is also a question on Microsoft's documentation page about the comparison between Azure Databricks and the Spark pool of Synapse Analytics.
The response from a Microsoft employee dated 22/04/2021 reads, "We are currently working with the content team to publish an article describing the differences between Azure Databricks and Azure Synapse Spark Pool. I will update you as soon as it is available."
To date, I have not found such a document, so I set out to find answers myself.
In this article, I describe the most obvious differences between these two Spark services in the Azure Cloud.
Databricks has its own proprietary runtime, the Databricks Runtime. The latest features are integrated into this runtime. With Databricks, you always have the latest functions and software versions. In addition, there are also functions that are not included in the open source version of Apache Spark.
The runtime of Synapse Analytics Spark is based on the vanilla Spark runtime, the open source version of Spark. The Spark version lags a bit behind: Spark 3.0 is currently available in preview, whereas Databricks is already at version 3.1.2.
Both services are provisioned within minutes. The simplest provisioning is done via the Azure Portal.
Both services can also be provisioned via the CLI, PowerShell or an Infrastructure-as-Code (IaC) solution.
There are already striking differences in the setup of the Spark clusters. With Synapse Analytics, one speaks of "Apache Spark Pool":
At Databricks, the term "cluster" is used:
In Databricks' Cluster Mode, you can choose between Standard, High Concurrency and Single Node.
"High Concurrency" is optimised for running concurrent SQL, Python and R workloads. Scala is not supported.
"Standard" is recommended for single-user clusters and can run SQL, Python, R and Scala workloads. Spark.NET (C#/F#) must be enabled with additional libraries.
"Single Node" is a cluster without workers. This means that the Databricks runtime runs on a single VM that acts only as a driver node. This mode is recommended when working with small amounts of data.
With Synapse, there is (currently) no option to choose any node type other than "Memory Optimized".
Presumably, this will look very different in the near future.
With Databricks, there is a wide choice of images that can be installed on the nodes, each with different Scala and Spark versions. Several versions are available for standard use cases, and an equally large selection is aimed specifically at machine learning projects.
With Synapse, the Spark version can be selected under the advanced settings. You can choose between Spark 2.4 and Spark 3.0, which is still in the preview phase.
Depending on the Spark version selected, you will then receive information on the versions of the associated software used.
With Databricks, there is a very large selection of different worker types.
In addition, you can make use of so-called "spot instances". Spot VMs draw on unused compute capacity and therefore cost less.
The size of the nodes at Synapse is given in "T-shirt sizes", from Small with 4 vCores and 32 GB RAM, to XXXLarge with 80 vCores and 504 GB RAM.
So you can choose from 6 different sizes.
For both services, pricing is based on the number of virtual machines and their sizing, i.e. the underlying specifications such as CPU, RAM and the disks used (HDD or SSD).
With Databricks, there is an additional fee on top of the costs for the VMs: the Databricks Unit (DBU). A DBU is a unit of processing capacity that is billed per second of usage. DBU consumption depends on the size and type of the instance running Azure Databricks.
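The way the two cost components add up can be sketched in a few lines of Python. This is only an illustration of the billing logic described above: the class, the function and all prices are hypothetical placeholders, not real Azure or Databricks rates.

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    """Hypothetical price sheet entry for one VM type (placeholder values)."""
    vm_price_per_hour: float  # infrastructure cost of the VM
    dbu_per_hour: float       # DBUs this instance type consumes per hour
    dbu_price: float          # price of one DBU for the chosen tier


def databricks_cost(node: NodeType, nodes: int, seconds: float) -> float:
    """VM cost plus DBU surcharge for `nodes` machines running for `seconds`."""
    hours = seconds / 3600.0  # usage is metered per second
    vm_cost = node.vm_price_per_hour * hours * nodes
    dbu_cost = node.dbu_per_hour * node.dbu_price * hours * nodes
    return vm_cost + dbu_cost


# Placeholder numbers purely for illustration.
example = NodeType(vm_price_per_hour=0.50, dbu_per_hour=1.0, dbu_price=0.40)

# Three workers plus one driver running for 30 minutes.
print(round(databricks_cost(example, nodes=4, seconds=1800), 2))
```

For a Synapse Spark pool, the same calculation would simply omit the DBU term.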
Databricks has its own implementation of notebooks. Co-authoring takes place in real time, so all authors see each other's changes immediately. With Databricks, the notebooks are also versioned automatically.
Synapse uses nteract notebooks. Several people can work on the same notebook, but one person has to save the notebook before another person sees the change. Furthermore, there is no automatic versioning of the notebooks.
With Databricks, a data lake must be mounted before it can be used.
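As a rough sketch of what mounting involves, the helper below assembles the arguments for `dbutils.fs.mount`. It assumes an Azure Data Lake Storage Gen2 account and authentication via a service principal (OAuth), which is only one of several possible setups. The helper name and all parameter values are hypothetical.

```python
def adls_mount_args(account: str, container: str, tenant_id: str,
                    client_id: str, client_secret: str,
                    mount_point: str) -> dict:
    """Build keyword arguments for dbutils.fs.mount (ADLS Gen2, OAuth)."""
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {
        "source": f"abfss://{container}@{account}.dfs.core.windows.net/",
        "mount_point": mount_point,
        "extra_configs": configs,
    }


# In a Databricks notebook, where `dbutils` is predefined, the call would be:
# dbutils.fs.mount(**adls_mount_args(...))
```

In practice, the client secret would come from a secret scope rather than being written into the notebook.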
When creating the Synapse Workspace, an existing or a new data lake can be specified, which serves as the primary data lake. This allows it to be accessed directly from the scripts and notebooks. Additional Data Lakes can be added as "Linked Services".
Synapse Studio also offers an integrated file explorer from which you can right-click to open a Spark Notebook that loads the selected file into a data frame.
Both services now offer Git integration.
To get a feel for how the two technologies compare in terms of speed and performance, I conducted a test. As my data source, I chose the yellow cab records from the City of New York's TLC Trip Record Data.
The yellow and green taxi trip data contains fields for recording pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, unit fares, fare types, payment types and passenger numbers reported by the driver.
Only yellow taxi trip data was used for the test. In this dataset, all records from January 2009 to December 2020 were downloaded in CSV format and stored in a data lake. The data size of the CSV files is 233 GB. The dataset contains a total of 1,620,781,620 rows (1.62 billion rows).
As a test environment, I created one Spark environment with Databricks and one with Azure Synapse Analytics. Since Azure Synapse Analytics offers fewer choices for configuring the nodes/workers, I created this one first. I created the Spark pool with 3 nodes, each with 32 GB RAM and 4 vCPUs.
To create approximately the same conditions for both technologies, I created a similarly sized cluster in Databricks. It also has three worker nodes with 32 GB RAM and 4 vCPUs each. A driver node had to be specified as well, and it was provisioned with the same size as the worker nodes.
I switched off autoscaling; its effect would have to be tested in a separate step.
Whether the comparison is really 1:1 cannot be said with certainty for this setup, because there is too little information about the virtual machines provided and what exactly their architecture looks like. With Databricks you have more information about the VMs used, whereas with the Synapse Spark pools you can only choose between the T-shirt sizes. The different Spark versions can also have an impact on performance.
I started with Databricks. The Data Lake was mounted in the Databricks environment and then I created a dataframe from the CSV files:
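A minimal sketch of such a read, assuming the CSV files have a header row and that `spark` is the session object both notebook environments predefine. The function name, the read options and the paths are illustrative assumptions, not the exact code used in the test.

```python
def read_trip_csv(spark, path: str):
    """Create a DataFrame from the CSV files under `path`.

    Without schema inference, Spark only reads the header, so every
    column is typed as string and the call returns quickly.
    """
    return (
        spark.read
        .option("header", "true")  # first line holds the column names
        .csv(path)
    )


# Hypothetical locations: a mount point for Databricks, a direct URI for Synapse.
DATABRICKS_PATH = "/mnt/datalake/yellow_taxi/*.csv"
SYNAPSE_PATH = "abfss://<container>@<account>.dfs.core.windows.net/yellow_taxi/*.csv"

# In a notebook: df = read_trip_csv(spark, DATABRICKS_PATH)
```

Adding `.option("inferSchema", "true")` would give typed columns, at the price of an extra pass over the data.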
The creation of the data frame took 7.37 seconds.
The same operation was then performed with Synapse, which was completed in 4.2 seconds:
Below I have displayed the schema of the dataset:
I then created a temporary view from the dataframe, below you can see the operation in Databricks:
So, now comes the first endurance test: I will count all entries in the respective TempView.
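Registering the view and counting its rows might look roughly like this. The view name `yellow_trips` and the helper function are assumptions made for illustration, and `spark` is again the session predefined in the notebook.

```python
def count_rows(spark, df, view_name: str = "yellow_trips") -> int:
    """Register `df` as a temporary view and count its rows with Spark SQL."""
    df.createOrReplaceTempView(view_name)
    result = spark.sql(f"SELECT COUNT(*) AS cnt FROM {view_name}")
    # collect() returns a list with a single Row; read the count from it
    return result.collect()[0]["cnt"]


# In a notebook: total = count_rows(spark, df)
```

On CSV input, this count forces Spark to scan every file, which is why it serves as a first load test.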
With Databricks, it took 8.87 minutes, i.e. around 532 seconds.
At Synapse, the same operation took 8 minutes and 51 seconds, a total of 531 seconds.
To test the systems properly, I performed an aggregation query. With Databricks, I got a result of 11.88 minutes, i.e. 713 seconds.
The same operation took Synapse 18 minutes and 38 seconds, which is a total of 1118 seconds.
Now I want to compare the two environments by putting the data into a different data format. Firstly, I will save the data as Parquet in a new location, in a 2nd step as Delta.
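Both conversions can be expressed with the same DataFrame writer call, which only differs in the format name. The helper below is a sketch under assumptions: the function name, the target paths and the `overwrite` mode are illustrative, and the Delta Lake library must be available on the cluster.

```python
def write_as(df, fmt: str, path: str) -> None:
    """Write `df` to `path` in the given columnar format."""
    if fmt not in ("parquet", "delta"):
        raise ValueError(f"unsupported format: {fmt}")
    # "overwrite" replaces any data already present at the target path
    df.write.format(fmt).mode("overwrite").save(path)


# In a notebook, with hypothetical target paths:
# write_as(df, "parquet", "/mnt/datalake/yellow_taxi_parquet")
# write_as(df, "delta", "/mnt/datalake/yellow_taxi_delta")
```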
Using Databricks, I put the dataframe into Parquet format in a new location. This operation took 31.05 minutes, that is 1865 seconds.
The same operation with Synapse took one hour and 5 seconds, or 3605 seconds:
That is about twice as long.
Now I count the number of entries again and make an aggregation query on the Parquet data. As with the CSV dataframe, I do this with Spark SQL and create a temp view from the dataset beforehand.
The query took 38.29 seconds with Databricks:
The same operation at Synapse took 10.279 seconds:
The aggregation query on the Parquet data took Databricks 2.72 minutes, or about 163 seconds:
The same operation at Synapse took 4 minutes and one second, or 241 seconds.
Finally, I do the tests with the Delta format.
With Databricks I write the CSV data frame to a new location in the Delta format, which took 42.99 minutes, or about 2579 seconds:
The same operation in Synapse wrote the data in Delta format in 1 hour, 5 minutes and 31.85 seconds, or about 3932 seconds:
Counting all entries in Delta format took 1.16 seconds with Databricks:
Synapse counted the entries in Delta format in 10.3 seconds:
The finale once again comes in the form of the aggregation query.
Databricks entered the race with 2.62 minutes (157 seconds):
Synapse Analytics took a little more time with 4 minutes (240 seconds):
Below is a summary of the results:
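The timings quoted in the text can be tabulated with a short script. This is a sketch built only from the numbers stated above; the operation labels and helper names are my own, and the Synapse count on Parquet is read as 10.279 seconds, which is an assumption about the decimal separator. A ratio above 1 means Databricks was faster.

```python
# (operation, Databricks seconds, Synapse seconds), as quoted in the text
RESULTS = [
    ("CSV: create DataFrame", 7.37, 4.2),
    ("CSV: count", 532, 531),
    ("CSV: aggregation", 713, 1118),
    ("Write Parquet", 1865, 3605),
    ("Parquet: count", 38.29, 10.279),  # assumed decimal reading
    ("Parquet: aggregation", 163, 241),
    ("Write Delta", 2579, 3932),
    ("Delta: count", 1.16, 10.3),
    ("Delta: aggregation", 157, 240),
]


def summary_rows(results):
    """Add the Synapse/Databricks runtime ratio to every measurement."""
    return [(op, db, syn, round(syn / db, 2)) for op, db, syn in results]


print(f"{'Operation':<24}{'Databricks':>12}{'Synapse':>12}{'Ratio':>8}")
for op, db, syn, ratio in summary_rows(RESULTS):
    print(f"{op:<24}{db:>12.2f}{syn:>12.2f}{ratio:>8.2f}")
```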
The results are partly due to the fact that different Delta and/or Spark versions were used.
And what do these results say? Well, I wouldn't choose these numbers as the sole basis for deciding on one technology or another.
For example, I am not 100% sure that the nodes (VMs) used are comparable. They may have the same amount of RAM and the same number of vCPUs, but everything else is uncertain.
For the test results, I took the result of the first run of each query. If you run the same operation several times, either one after the other or at different times of the day, you will get different results.
Furthermore, the software versions were not the same: Synapse ran Spark 3.0 and Databricks ran Spark 3.1.
The underlying Spark engines also differ.
It would be interesting to repeat the test at a later date when Synapse also uses Spark version 3.1.
Addendum (June 2022): As of April 2022, Synapse Analytics offers Spark version 3.1. Version 3.2 is currently in the beta phase.
As mentioned at the beginning of the article, both technologies are under active development. Both Databricks and Microsoft are implementing new features and optimisations at a fast pace. In half a year, the comparison will already look a bit different. In a year's time, the landscape will certainly have changed considerably. In any case, I am very curious to see what cool developments we can still expect on both sides.
Personally, I don't have a favourite yet; depending on the application and project, I use one or the other service.