Data engineering is one of the most sought-after skills in the job market. According to a 2019 Dice.com report, there was an 88% year-over-year growth in job postings for data engineers, which was the highest growth rate among all technology jobs.
If you want to become a data engineer, then you’ll have to decide which technologies to learn because it’s impossible to be an expert in everything in such a broad field. Microsoft has been a data technology leader for many years, but is it still a top contender? Absolutely. Microsoft has moved very aggressively into the cloud with its Azure services. It has the second-highest market share among cloud providers, and it is growing at nearly twice the rate of Amazon Web Services.
Furthermore, Microsoft is so focused on Azure and its other cloud offerings that it is discontinuing all of its certification exams for Windows Server and SQL Server on June 30, 2020. This is a clear sign that the importance of on-premises technology is rapidly declining.
So what does an Azure data engineer do? Here’s what Microsoft says:
“Azure data engineers are responsible for data-related implementation tasks that include provisioning data storage services, ingesting streaming and batch data, transforming data, implementing security requirements, implementing data retention policies, identifying performance bottlenecks, and accessing external data sources.”
Are you convinced that Azure data engineering is a hot field worth pursuing? Then you can jump right into one of Cloud Academy’s two learning paths: Implementing an Azure Data Solution and Designing an Azure Data Solution. These learning paths combine the theory, technical knowledge, and hands-on practice that you’ll need to earn that certification and feel confident working in a live production environment.
If you still need some additional convincing, then let’s dive right into the specifics of how to become a Microsoft Certified Azure Data Engineer.
The Exams
To obtain this certification, you need to pass two exams, DP-200 and DP-201. The DP-200 exam focuses on implementation and configuration, while the DP-201 exam focuses on design.
DP-200 Exam
Here are the topics covered in the DP-200 exam and the relative weight of each section:
- Implement data storage solutions (40-45%)
- Manage and develop data processing (25-30%)
- Monitor and optimize data solutions (30-35%)
I’m not going to talk about every item in the exam guide, but I’ll go over some of the highlights of what you’ll need to know.
The first, and biggest, section of the exam guide is about implementing data storage solutions. These solutions are divided into non-relational and relational datastores. For many years, Microsoft’s primary relational data solution was SQL Server. If you wanted to migrate from an on-premises SQL Server to Azure, you could just run SQL Server in a virtual machine on Azure, but in most cases, you’d be better off using Azure SQL Database instead.
The advantage is that it’s a managed service with lots of built-in features that make it easy to scale and provide high availability, disaster recovery, and global distribution. And you need to know how to configure all of those features. SQL Database is not exactly the same as SQL Server, but it’s close enough that it shouldn’t be too much trouble migrating to it. If you really need full SQL Server compatibility, then you can use SQL Database Managed Instance.
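To see how similar it is in practice, here’s a minimal sketch of connecting to Azure SQL Database from Python with pyodbc. The server, database, and login are placeholders; the point is that the client code is the same as what you’d write against SQL Server:

```python
# Minimal connection sketch; the server, database, and login are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"  # hypothetical server
    "Database=mydb;"
    "Uid=dbadmin;"
    "Pwd=<password>;"
    "Encrypt=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")  # the same T-SQL you'd run on SQL Server
print(cursor.fetchone()[0])
conn.close()
```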
Another relational data storage service is Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse. As you can tell from its name, it’s meant for analytics rather than transaction processing. It allows you to store and analyze huge amounts of data. The fastest way to get data into Synapse Analytics is by using PolyBase, so it’s important to learn the details of how to use it. To make queries as fast and efficient as possible, you need to partition the datastore into multiple shards and choose the right distribution method.
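As a concrete (and hypothetical) illustration, the sketch below uses PolyBase to expose files in external storage as a table and then loads them with a CREATE TABLE AS SELECT statement that hash-distributes the rows. All object names are made up, and it assumes an external data source and file format have already been created:

```python
# Sketch of a PolyBase load into Synapse Analytics; every object name here is
# hypothetical, and the data source and file format must already exist.
import pyodbc

conn = pyodbc.connect("<synapse-connection-string>", autocommit=True)
cur = conn.cursor()

# PolyBase: an external table over CSV files sitting in the data lake.
cur.execute("""
CREATE EXTERNAL TABLE ext.Sales (
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(10, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = CsvFormat
)
""")

# CTAS: load the data into a hash-distributed columnstore table so rows for
# the same customer land on the same distribution, minimizing data movement.
cur.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Sales
""")
```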
Naturally, security is important for both SQL Database and Synapse Analytics, not just for restricting access to data but also for things like applying data masking to credit card numbers or encrypting an entire database.
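For instance, here’s a hedged sketch of both features in T-SQL, run from Python against a hypothetical Customers table. Dynamic data masking hides all but the last four digits of a card number from non-privileged users, and Transparent Data Encryption encrypts the database at rest:

```python
# Sketch: dynamic data masking and TDE; the table, column, and database
# names are hypothetical.
import pyodbc

conn = pyodbc.connect("<sql-connection-string>", autocommit=True)
cur = conn.cursor()

# Show only the last four digits of the card number to non-privileged users.
cur.execute("""
ALTER TABLE dbo.Customers
ALTER COLUMN CreditCardNumber ADD MASKED
WITH (FUNCTION = 'partial(0, "XXXX-XXXX-XXXX-", 4)')
""")

# Encrypt the entire database at rest with Transparent Data Encryption.
cur.execute("ALTER DATABASE mydb SET ENCRYPTION ON")
```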
That covers relational database services, but how about non-relational datastores? These are services that can store unstructured data, such as documents or videos. The most mature Azure service in this category is Blob storage, which is a highly available, highly durable place to put digital objects of any type. Unlike a filesystem, Blob storage has a flat structure. That is, the objects aren’t stored in a hierarchy of folders. You can make it look that way through clever naming conventions, but that’s really just faking a tree structure.
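The sketch below shows the flat namespace in action with the azure-storage-blob package (the connection string and container name are placeholders). The slash in the blob name looks like a folder, but “listing a folder” is really just a prefix filter:

```python
# Sketch: "folders" in Blob storage are just name prefixes; the connection
# string and container name are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("media")

# The slash is part of the blob's name, not a directory in a hierarchy.
container.upload_blob("videos/2020/intro.mp4", b"<video bytes>")

# "Listing the folder" is a prefix filter over the flat namespace.
for blob in container.list_blobs(name_starts_with="videos/2020/"):
    print(blob.name)
```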
For a true hierarchical structure, you can use Azure Data Lake Storage Gen2, which is actually built on top of Blob storage. It’s especially useful for big data processing systems like Azure Databricks.
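By contrast, here’s a hedged sketch using the azure-storage-file-datalake package, where a directory is a real object that can be renamed or secured with POSIX-style ACLs (the account and path names are placeholders):

```python
# Sketch: real directories in Data Lake Storage Gen2; names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<storage-connection-string>")
fs = service.get_file_system_client("datalake")

# This creates an actual directory object, not just a name prefix.
directory = fs.create_directory("raw/sales/2020")

data = b"id,amount\n1,9.99\n"
file = directory.create_file("day1.csv")
file.append_data(data, offset=0)
file.flush_data(len(data))
```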
The final non-relational datastore you need to know for the exam is Cosmos DB. This is a pretty amazing database system because it can scale globally without sacrificing performance or flexibility. It can even support multiple types of data models, including document, key-value, graph, and wide column. Another surprising feature is the ability to support five different consistency levels ranging from strong to eventual consistency.
As with SQL Database and Synapse Analytics, you need to know how to configure partitioning, security, high availability, disaster recovery, and global distribution for Cosmos DB.
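Here’s a brief, hypothetical sketch with the azure-cosmos package that touches two of those concerns: every container needs a partition key, and the client can override the account’s default consistency level per connection:

```python
# Sketch: a Cosmos DB container with a partition key and a session-level
# consistency override; the endpoint, key, and names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("<account-endpoint>", credential="<account-key>",
                      consistency_level="Session")
db = client.create_database_if_not_exists("retail")

# Items are hashed on /customerId; a good partition key has high cardinality
# and spreads both storage and request load evenly.
orders = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"))

orders.upsert_item({"id": "order-1", "customerId": "c42", "total": 19.99})
```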
The next section of the exam guide is about managing and developing data processing solutions. It’s divided into two subsections: batch processing and stream processing. The two most important batch processing services are Azure Data Factory and Azure Databricks.
Data Factory makes it easy to copy data from one datastore to another, such as from Blob storage to SQL Database. It also makes it easy to transform data, which it accomplishes by using services like Databricks behind the scenes. You can even create complex automated processing pipelines by linking together a series of transformation activities that are kicked off by a trigger that responds to an event.
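Here’s a hedged sketch of that Blob-to-SQL copy using the azure-mgmt-datafactory SDK. The resource group, factory, and dataset names are placeholders, and the datasets and their linked services are assumed to already exist:

```python
# Sketch: a one-activity copy pipeline; all names are placeholders, and the
# referenced datasets (and their linked services) must already exist.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSource, CopyActivity, DatasetReference, PipelineResource, SqlSink)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy = CopyActivity(
    name="CopyBlobToSql",
    inputs=[DatasetReference(reference_name="BlobDataset")],
    outputs=[DatasetReference(reference_name="SqlDataset")],
    source=BlobSource(),   # read from Blob storage
    sink=SqlSink())        # write to SQL Database

adf.pipelines.create_or_update(
    "my-rg", "my-factory", "CopyPipeline",
    PipelineResource(activities=[copy]))

# Start a run on demand; in production, a trigger would kick this off.
adf.pipelines.create_run("my-rg", "my-factory", "CopyPipeline")
```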
Azure Databricks is a managed data analytics service. It’s based on Apache Spark, which is a very popular open-source analytics and machine learning framework. You can also run Spark jobs on Azure HDInsight, but Databricks is the preferred solution, so it’s the one you’ll need to be most familiar with for the exam. Some of the Databricks topics covered are data ingestion, clusters, notebooks, jobs, and autoscaling.
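To give you a feel for it, here’s a minimal PySpark sketch of a Databricks-style batch job. The storage paths are hypothetical mount points, and the `spark` session comes predefined in a Databricks notebook:

```python
# Sketch of a batch transformation in PySpark; the paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined in Databricks notebooks

# Ingest raw CSV files from the data lake.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/datalake/raw/sales/"))

# Transform: total sales per customer.
summary = (raw.groupBy("CustomerId")
              .sum("Amount")
              .withColumnRenamed("sum(Amount)", "TotalAmount"))

# Write curated Parquet output for downstream queries.
summary.write.mode("overwrite").parquet("/mnt/datalake/curated/sales_summary/")
```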
The most important stream processing service is Azure Stream Analytics. You need to know how to get data into it from other services, how to process data streams using different windowing functions, and how to output the results to another service.
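Stream Analytics jobs are written in a SQL-like query language. The hypothetical query below (shown as a Python string for reference) counts events per device over 30-second tumbling windows, reading from an input alias and writing to an output alias that you define on the job:

```python
# A hypothetical Stream Analytics query; [hubinput] and [sqloutput] are
# input/output aliases that would be configured on the job itself.
STREAMING_QUERY = """
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [sqloutput]
FROM [hubinput] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
"""
```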
The final section of the exam guide is about monitoring and optimizing data solutions. The most important service for this section is Azure Monitor, which you can use to monitor and configure alerts for almost every other Azure service. One of the key components of Azure Monitor is Log Analytics, which you can use to implement auditing.
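As a hedged illustration, the sketch below runs a Kusto (KQL) query against a Log Analytics workspace with the azure-monitor-query package. The workspace ID is a placeholder, and the query assumes diagnostics are flowing into the AzureDiagnostics table:

```python
# Sketch: querying Log Analytics; the workspace ID is a placeholder, and the
# query assumes the AzureDiagnostics table is populated.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

KQL = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| summarize Events = count() by bin(TimeGenerated, 1h)
"""

response = client.query_workspace("<workspace-id>", KQL,
                                  timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```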
The optimization subsection doesn’t include new services. Instead, you need to know how to optimize the performance of services like Stream Analytics, SQL Database, and Synapse Analytics. Using the right partitioning method is one of the most important optimization techniques.
Finally, I should mention that since the DP-200 exam is all about implementation and configuration, you need to know how to actually configure data services in the Azure portal, so the exam includes tasks that you have to perform in a live lab! If you’re worried about how you’ll get the required level of hands-on practice, see the Preparing for the Exams section below.
DP-201 Exam
Here are the topics covered in the DP-201 exam and the relative weight of each section:
- Design Azure data storage solutions (40-45%)
- Design data processing solutions (25-30%)
- Design for data security and compliance (25-30%)
While the DP-200 exam is all about implementation, the DP-201 exam is about design, so it focuses more on planning and concepts than on getting everything set up.
The first, and biggest, section of the exam guide is about designing data storage solutions. You need to know which Azure services to recommend to meet business requirements. As with DP-200, these solutions are divided into relational datastores, including Azure SQL Database and Azure Synapse Analytics, and non-relational datastores, including Cosmos DB, Data Lake Storage Gen2, and Blob storage.
For all of the above services, you need to know how to design:
- Data distribution and partitions
- High scalability, taking into account multiple regions, latency, and throughput
- Disaster recovery
- High availability
The next section of the exam guide is about designing data processing solutions. It’s divided into batch processing and stream processing. For batch processing, you need to know how to design solutions using Azure Data Factory and Azure Databricks. For stream processing, you need to know how to design solutions using Stream Analytics and Azure Databricks. As you can tell, Azure Databricks is a very important service for data processing since it’s used for both batch and stream processing. You also need to know how to ingest data from other Azure services and how to output the results to other services.
The final section of the exam guide is about data security and compliance. First, you need to know how to secure your datastores. The most important decision is which authentication method to use for various use cases. For example, it’s usually better to rely on Azure Active Directory authentication than to embed an access key in your application code. Role-based access control (RBAC) and access control lists (ACLs) are also important.
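The contrast looks something like the sketch below (the account URL is a placeholder). With Azure AD, no secret lives in the code, and what the identity can do is governed by RBAC role assignments such as Storage Blob Data Reader:

```python
# Sketch: Azure AD authentication instead of an embedded access key; the
# account URL is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Avoid: a shared key embedded in code grants full access to the account.
# service = BlobServiceClient.from_connection_string("<key-based-string>")

# Prefer: an Azure AD token, scoped by RBAC role assignments.
service = BlobServiceClient(
    "https://myaccount.blob.core.windows.net",
    credential=DefaultAzureCredential())
```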
The second part of this section deals with designing security for data policies and standards. Some of the topics include the following (with a lifecycle policy sketch after the list):
- Encryption, such as Transparent Data Encryption
- Data auditing
- Data masking, such as obscuring credit card numbers
- Data privacy and data classification
- Data retention
- Archiving
- Purging
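To make the last three items concrete, here’s a hypothetical Blob storage lifecycle management policy, expressed as the JSON the service accepts (shown as a Python dict). It archives data after 90 days of inactivity and purges it after seven years; the prefix and day counts are placeholders:

```python
# Sketch: a lifecycle management policy for Blob storage; the prefix and day
# counts are placeholders.
lifecycle_policy = {
    "rules": [{
        "name": "retire-raw-data",
        "enabled": True,
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": ["raw/"],
            },
            "actions": {
                "baseBlob": {
                    # Archive blobs untouched for 90 days...
                    "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                    # ...and purge them entirely after seven years.
                    "delete": {"daysAfterModificationGreaterThan": 2555},
                },
            },
        },
    }]
}
```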
Preparing for the Exams
Even if you already have a lot of experience with Azure data services, I recommend you spend a significant amount of time studying for the exams because DP-200 and DP-201 will thoroughly test your knowledge and skills.
To fill in the gaps in your knowledge and to review all of the topics, I recommend taking self-paced courses, getting hands-on experience, and taking practice exams. The easiest way to do that is to go through Cloud Academy’s DP-200 and DP-201 Exam Preparation learning paths. Both of them include video-based courses, hands-on labs, and a practice exam.
Good luck on the exams!