Building our big data platform (part 6)

In this almost final part, we will build the last component that will ingest the transformed data into Azure Data Explorer and create exports to our data lake.

Image adapted from pch.vector and Freepik

Series breakdown

In these series of blogs I will be covering building an ETL data pipeline using Python. The data is coming from field devices that are installed on mobile equipment. Each blog focusses on a specific part of the pipeline to move the data from the field device to the cloud.

Our big data platform

We are now nearing the end of our data pipeline. We started off by getting raw JSON data from field devices installed on mining trucks, transforming the data, and uploading it to the cloud. In the previous part, we sent data to our external end-user, our vendor. For this final part in the series, we need to make data available to our internal end-users.

For this task, we decided to use Azure Data Explorer (ADX) and as per the website, it is a “fast and highly scalable data exploration service”. It is optimised for large data volumes, and you can ingest data from several sources, e.g., IoT Hub and Function Apps (via the API). You can also create external tables which point to files in storage accounts and SQL databases, as well as link to BI tools.

Utilising the power of the ADX cluster, we can compute things like 10-minute averages and summarise the data automatically using materialised views. With the continuous export functions, we can export data in several formats to the required locations. As we plan to add more data sources in future and the power and flexibility available in ADX, this is a good big data platform for our project.

Microsoft offers free access to three Azure Data Explorer courses to get up to speed on this fast, fully managed data analytics service. Azure Data Explorer (pluralsight.com) (not an affiliate link)

Ingesting our transformed data

The first step is to create our ADX cluster. For the compute specification we will use the Dev/test workload and is more than adequate for our current project. As the platform grows, we can change our compute SKUs and scale with multiple instances.

Create ADX cluster.

Once the cluster is created, we create our dev database for our site. The use the default retention period of 365 days for use in ADX as our data is stored in our storage account if historic data is needed later. The default cache period is also good as most of the analysis would be on recent data. This may vary depending on your requirements.

Create the development database.

ADX uses the Kusto Query Language (KQL) and a cheat sheet is available for those familiar with SQL queries. For our project, we interact with the cluster using the built in Web UI which provides a quick way to start ingesting data using the Ingest from blob option. Below are some of the important screen shots.

Link to our ADX cluster and create a new table for our transformed data (compressed CSV files).
Select our source type as blob container and link to our storage account. It is important to define our folder path and file extension to avoid conflicts in future.

From our compressed CSV files, ADX will try and define the schema for us. Always check data types, especially when working with schemeless files such as CSV.

Schema detected in ADX.

The ingestion will happen in the background and after a few minutes we can start querying our data. Ingestion will take some time depending on the amount of data and this can be monitoring through the various tools in Azure. Once ingestion has completed, we see we have 52 field devices sending data and about 38.6 million data points already from end February to end April.

Summarize query for the device serial numbers.
Count query to see all the rows available in the table.

Automate data ingestion

We want to ensure that our data is automatically ingested into ADX when we upload data from our web server into our storage account. To achieve this, we add a data connection to ADX using Event Grid for Blob Storage. We link to our storage account and choose the event as Blob created. In the ingestion properties, we select our ADX database and our table we created above.

Create Event Grid data connection for automatic ingestion into ADX.
Link to our database and table we created previously.

Now as soon as we upload a file to our storage account, Event Grid will pick it up and ingest our new data into our ADX database. We can now leverage the full power of ADX and make data available to our internal end-users. By default, the maximum batching ingestion values are 5 minutes, 1 000 items, or a total size of 1 GB.

Another benefit here is that our end-users access data from the cloud, without interfering or adding additional load on the on-prem networks and source systems. This ensures that on-prem systems do what they need do and there is only one data pipeline going out. No more multiple connections to these systems and networks.

Creating a data model

Our job is not done just yet, we need to add context to our transformed data for our end-users. We have the tag names, timestamps, and values in our raw data table, but we also have meta data relating to the site, equipment, and tags (such as engineering units, limits, etc).

From our raw_oil_data table we can create a factOilData table and parse the TagName to create additional keys by splitting it on the ".". With the additional keys we can create two new dimension tables which we manually populate and load, called meta_site_eqmt and meta_system_tag. Remember that ADX is append-only, thus we need to recreate the dimension tables to update them.

Data model for data lake exports.

With our data model created, we can setup exports to our data lake for our end-users. The first step is to create the three external tables pointing to the data lake storage account. During the creation we define the data format (parquet in our case), the partitions and path as well as the storage account.

The last step is to export the data. As our dimension tables don’t change often, we manually export them when needed. Our data on the other hand changes all the time and for this we setup a continuous export in one-hour intervals. In the export, this is where we add the parse operation to create our additional keys that are used in the dimension tables.

Conclusion

We set up our big data platform, Azure Data Explorer, and created ingestion events to continually ingest our transformed data. Next, we created our data model and exported our dimension tables. As the last step we created a continuous export of our raw data to the data lake for our end-users.

Next up

Yes, there will be a bonus post! I am creating a blog post that will look at improvements in our pipeline to make it more robust and maintainable.

Originally published at https://h3xagn.com on May 7, 2022.

--

--

--

design | build | code. https://h3xagn.com

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Coenraad Pretorius

Coenraad Pretorius

design | build | code. https://h3xagn.com

More from Medium

Sending data with Azure Functions (part 5)

How should we use data integration frameworks to break the dilemma of full lifecycle implementation…

Apache Hive Installation on Ubuntu

Databricks: CI/CD Python Library to Share Functionality (2/3)