Back to overview

Online Magazine

Import Excel files with Azure Synapse Analytics

Key Visual Hack of the week

There are different ways to read data into Azure Synapse Analytics. Below I show two different methods how Excel files can be read in.

 

By Stefan Koch

The noob version is quicker to implement because a pipeline is used, but offers less flexibility during operation because it is a "rigid setup". The Pro version needs a bit more configuration, but is more flexible during operation, as the data is read in directly via a spark pool.

Noob Hack

Via Integrate (point 1) I select a new resource and create a "Copy Data Pipeline".


The Excel is stored in a data lake and the aim is to store the data in Parquet format so that it can be made available as an integration data set and processed further.

The pipeline is put together with a few mouse clicks and looks like this:


Via debug, the pipeline is started and after a few moments, ideally, a success message appears.


The data is then available in the Synapse Workspace as an Integration Dataset, here with the name "DestinationDataset_z8q":



This dataset can now be further processed in the Synapse Analytics platform.

Pro Hack

In order to extract data from Excel files with a Spark cluster in Azure Synapse Analytics, the required libraries must first be installed. A well-known Excel library for Spark is crealytics: https://github.com/crealytics/spark-excel



Unfortunately, this library is not the end of the story; additional libraries have to be installed. Of course, the software versions installed on the Spark cluster also play a role. In this example, a Spark cluster with the following configuration was used:

 

 

First, the corresponding jar files must be downloaded. Below is a list of the corresponding links of all the required jar files:

 

 

In Synapse Studio, the jar packages can be uploaded to the workspace via "Manage" (point 1):

 

 

Afterwards, the packages still have to be assigned to the corresponding Spark Cluster:

 

 

Now the workspace packages can be selected:

 

 

 

It takes a few minutes to finish installing the packages in the cluster. Once the installation is complete, you can read Excel files directly with Spark:

 

 

If a Spark Cluster with Spark version 3.0 is used, other versions must be installed for the following 3 packages, the rest remain the same:

The configuration of the Spark cluster is as follows:

 

Your contact

SUCCEEDED? THERE'S MORE TO DISCOVER ...

Q&A - the Game
Key Visual Q&A - The Game
Cloud AI

Q&A - The Game Launch Edition
In conversation with
In conversation with Jolanda Spiess Hegglin
AI Analytics

... Jolanda Spiess-Hegglin
TechTalk
Key Visual Tech Talk
Google Analytics

Dschingis Khan & Knowledge Graphs