The noob version is quicker to implement because it uses a pipeline, but offers less flexibility during operation, as it is a rather rigid setup. The Pro version needs a bit more configuration, but is more flexible during operation, as the data is read in directly via a Spark pool.
Via Integrate (point 1), I add a new resource and create a "Copy Data" pipeline.
The Excel file is stored in a data lake, and the aim is to save the data in Parquet format so that it can be made available as an integration dataset and processed further.
The pipeline is put together with a few mouse clicks and looks like this:
The pipeline is started via Debug, and after a few moments a success message should appear.
The data is then available in the Synapse Workspace as an Integration Dataset, here with the name "DestinationDataset_z8q":
This dataset can now be further processed in the Synapse Analytics platform.
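For example, the Parquet output written by the copy pipeline can be loaded straight into a Spark DataFrame in a Synapse notebook. A minimal sketch, assuming the notebook's built-in `spark` session and a placeholder ADLS path:

```python
def read_integration_dataset(spark, path):
    """Load the Parquet files produced by the copy pipeline into a DataFrame."""
    return spark.read.parquet(path)

# In a Synapse notebook, `spark` is predefined; the path below is a placeholder.
# df = read_integration_dataset(
#     spark, "abfss://container@account.dfs.core.windows.net/output/"
# )
# df.printSchema()
```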
In order to extract data from Excel files with a Spark cluster in Azure Synapse Analytics, the required libraries must first be installed. A well-known Excel library for Spark is crealytics' spark-excel: https://github.com/crealytics/spark-excel
Unfortunately, this library alone is not enough; several dependency libraries have to be installed as well. The software versions installed on the Spark cluster also play a role. In this example, a Spark cluster with the following configuration was used:
First, the required JAR files must be downloaded. Below is a list of download links for all of them:
In Synapse Studio, the JAR packages can be uploaded to the workspace via "Manage" (point 1):
Afterwards, the packages still have to be assigned to the corresponding Spark cluster:
Now the workspace packages can be selected:
Installing the packages in the cluster takes a few minutes. Once the installation is complete, Excel files can be read directly with Spark:
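As a sketch of what such a read can look like in a Synapse notebook: the `com.crealytics.spark.excel` format and the `header`, `dataAddress`, and `inferSchema` options come from the crealytics connector's documentation; the file path and sheet name below are placeholders, and `spark` is the session predefined in the notebook.

```python
def read_excel(spark, path, sheet="Sheet1", header=True):
    """Read an Excel file with the crealytics spark-excel connector.

    `sheet` and `path` are placeholders to adapt to your own data lake.
    """
    return (
        spark.read
        .format("com.crealytics.spark.excel")
        .option("header", str(header).lower())       # first row holds column names
        .option("dataAddress", f"'{sheet}'!A1")      # sheet and start cell to read from
        .option("inferSchema", "true")               # derive column types from the data
        .load(path)
    )

# Example call in a Synapse notebook:
# df = read_excel(
#     spark, "abfss://container@account.dfs.core.windows.net/input/sample.xlsx"
# )
# df.show()
```

The resulting DataFrame can then be written back out as Parquet (`df.write.parquet(...)`), mirroring what the copy pipeline does in the noob version.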
If a Spark cluster with Spark version 3.0 is used, different versions of the following three packages must be installed; the rest remain the same:
The configuration of the Spark cluster is as follows: