Ingest Data from Microsoft Graph API using Azure Data Factory
If you’ve ever wanted to ingest data from the Microsoft Graph API using Azure Data Factory, this article will show you how. As a working example, we’ll ingest user data from the Graph API into an Azure Storage Account.
Important Note: Always follow your organization’s best practices when it comes to anything in Microsoft and/or Azure. This article is just designed to give a demonstration for how to ingest data from the Graph API.
Step 1: Create an App Registration
- Navigate to App Registrations in the Azure Portal
- Select New Registration
- Enter a name for your App Registration
- Select the option for who should have access to this API. (Most likely it will be Accounts in this organizational directory only, but choose whichever is correct for your needs)
- Click Register.
Step 2: Assign the needed Graph API permission(s) to your new App Registration
- In your newly created App Registration, select API Permissions under Manage in the menu on the left
- Select Add a permission
- Choose Microsoft Graph
- Choose Application permissions
- Scroll down to User permissions
- Select the User.Read.All permission
- Click Add permissions.
- Depending on your organization, you may have to get an admin to approve the permission
Step 3: Create a client secret for your App Registration
- In your newly created App Registration, select Certificates & Secrets under Manage in the menu on the left
- Select Client secrets tab
- Select New client secret
- Enter a Description
- For Expires, choose the option that corresponds to your organization requirements. (The default recommendation from Azure at the time of this article is 6 months.)
- Select Add
- Copy the secret from the Value column and store it securely, as this will be the only time you are able to see the secret value. (Note: we store these in an Azure Key Vault and retrieve them with a Web Activity in the Azure Data Factory pipeline. For documentation on how to do this, see "Use Azure Key Vault secrets in pipeline activities" in the Azure Data Factory docs.)
Step 4: Store your Client ID and Tenant ID securely
- Both the Client ID and Tenant ID can be found by navigating to your App Registration and selecting Overview from the left menu.
- As mentioned in the previous step, we store these in an Azure Key Vault and retrieve them with a Web Activity in Azure Data Factory.
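The Key Vault retrieval pattern above boils down to a Web activity issuing a GET against the Key Vault REST API, authenticated with the Data Factory's managed identity. Here is a minimal Python sketch of the URL that Web activity calls; the vault and secret names are hypothetical placeholders, and the API version shown is an assumption you should match to your environment:

```python
def key_vault_secret_url(vault_name: str, secret_name: str,
                         api_version: str = "7.4") -> str:
    """Build the REST URL a Web activity would call to fetch a secret.

    In ADF, the Web activity's authentication would be set to a managed
    identity with resource https://vault.azure.net (per the linked docs).
    """
    return (f"https://{vault_name}.vault.azure.net/secrets/"
            f"{secret_name}?api-version={api_version}")

# Example with made-up names; the secret value then appears at
# activity('...').output.value in later pipeline expressions.
print(key_vault_secret_url("my-vault", "GraphClientId"))
```

One Web activity per secret (client ID, tenant ID, client secret) keeps the later expressions simple.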
Step 5: Create a Linked Service in Azure Data Factory
- Navigate to your Data Factory instance
- Choose Manage on the left menu (the toolbox icon)
- Under Connections on the left menu, select Linked services.
- Select New in the middle of the page to add a new Linked service.
- Select Data store > Services and apps > REST
- Fill out the settings for your new linked service
- Name: [Whatever you want it to be]
- Description: [Whatever you want it to be]
- Connect via integration runtime: [Depends on your organization]
- Base URL: https://graph.microsoft.com/
- Authentication type: Anonymous
- Server Certificate Validation: [Depends on your organization]
- Click Create.
Step 6: Create a Dataset in Azure Data Factory
- Select Author on the left menu (the pencil icon), hover over Datasets, click the three dots, and select New dataset
- Select Services and apps (similar to creating your linked service) and select REST (same as your linked service).
- Enter a name for your dataset (e.g. MicrosoftGraph_API)
- Choose the linked service that you just created
- Click OK.
- With your newly created dataset selected, navigate to the Parameters tab. Create a new parameter that will be used for the remaining part of the URL for your API request. (I typically name it something like endpoint/end_point)
- Now navigate to the Connection tab of your dataset
- Select the Relative URL field, and select Add dynamic content
- Inside the dynamic content, you can enter the ADF language to reference the parameter, or just click the parameter field to populate the dynamic expression
- Click Save on your dataset.
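To make the base/relative URL split concrete: at runtime, ADF concatenates the linked service's Base URL with whatever you pass in the dataset's endpoint parameter. A minimal Python sketch of that concatenation (the parameter name `endpoint` follows the naming suggested above):

```python
BASE_URL = "https://graph.microsoft.com/"  # from the linked service

def full_request_url(endpoint: str) -> str:
    # ADF appends the dataset's Relative URL (supplied here via the
    # 'endpoint' parameter) to the linked service Base URL.
    return BASE_URL + endpoint.lstrip("/")

print(full_request_url("v1.0/users"))  # https://graph.microsoft.com/v1.0/users
```

This is why one dataset can serve every Graph endpoint: only the parameter value changes between pipelines.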
Step 7: Create your pipeline in Azure Data Factory
- Select Author on the left menu (the pencil icon), hover over Pipelines, click the three dots, and select New pipeline
- Give your pipeline a name (and description if desired) by selecting Properties > General of your pipeline
- Select Save.
Step 8: Create the Web activity to retrieve your bearer token for the Graph API
- Under activities on the left menu for options to add to your pipeline, select General > Web and drag it onto the pipeline. This activity is what we will use to get the bearer token for authenticating to the Graph API.
- Select the Web activity in your pipeline to configure it.
- Under the General tab, update the following:
- Name: [Whatever you want]
- Description: [Whatever you want]
- Timeout: [Whatever you want] (Note: this currently defaults to 7 days, which is almost always way too long. I typically make the timeout for my web activities 1 or 2 minutes)
- Retry: [Whatever you want, in case you want to retry on failure]
- Retry interval: [Whatever you want, in case you want to retry on failure]
- Secure output: Most likely you will want this checked. It prevents the output from being shown in the logs.
- Secure input: Most likely you will want this checked. It prevents the input from being shown in the logs.
- With your Web activity still selected, choose the Settings tab.
- Under the Settings tab, update the following:
- URL: https://login.microsoftonline.com/YourTenantID/oauth2/v2.0/token
Important Note: The value for YourTenantID was found in Step 4. I would not recommend hard-coding that value into this activity, but rather storing it in an Azure Key Vault and retrieving it with a Web activity. If you did that, as described in the article linked above, this URL would look something like this:
https://login.microsoftonline.com/@{activity('Get Tenant ID').output.value}/oauth2/v2.0/token
- Method: POST
- Headers: Select New to create a new header. For Name, enter Content-Type. For Value, enter application/x-www-form-urlencoded.
- Body: grant_type=client_credentials&client_id=YourClientID&client_secret=YourClientSecret&scope=https://graph.microsoft.com/.default
Important Note: The value for YourClientID was found in Step 4, and the value for YourClientSecret was found in Step 3. I would not recommend hard-coding these values into this activity, but rather storing them in an Azure Key Vault and retrieving them with Web activities. If you did that, you will want to select Add dynamic content, and your value will look something like this:
grant_type=client_credentials&client_id=@{activity('Get Client ID').output.value}&client_secret=@{activity('Get Client Secret').output.value}&scope=https://graph.microsoft.com/.default
- Integration Runtime: [Depends on your organization]
- HTTP request timeout: [Whatever you want — you will probably want it to be 1 minute or less]
- You now have an activity that will retrieve your bearer token to authenticate to the Graph API. This is important because these tokens typically expire after about an hour, so you will want to retrieve a fresh token each time you run this pipeline to make sure your token is active.
- There isn’t a need to store the token that is generated here since it expires so quickly. It will be used in the Copy activity that we set up in the next step.
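For reference, here is a minimal Python sketch of the same client-credentials token request the Web activity sends. It only builds the URL, headers, and form body (no network call); the tenant, client, and secret values are placeholders:

```python
from urllib.parse import urlencode

def token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the POST the 'Get Token' Web activity sends (client
    credentials flow against the Microsoft identity platform)."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, headers, body

# Placeholder values; the JSON response's access_token field is what
# later activities reference as activity('Get Token').output.access_token.
url, headers, body = token_request("my-tenant-id", "my-client-id", "my-secret")
```

Note that `urlencode` percent-encodes the scope URL, which is the correctly encoded form of the body shown above.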
Step 9: Create the Copy activity to ingest the data from the Graph API.
- Under activities on the left menu for options to add to your pipeline, select Move & transform > Copy data and drag it onto the pipeline.
- Click the plus sign in the bottom right of the Web activity, and select Success (Green box). Drag the green arrow from the Web activity to the Copy data activity to indicate that we want the Copy data activity to run only when the Get Token Web activity succeeds.
- Now select the Copy data activity to configure it.
- Under the General tab, update the following:
- Name: [Whatever you want]
- Description: [Whatever you want]
- Timeout: [Whatever you want] (Note: this currently defaults to 7 days, which is almost always way too long. It depends on how much data you will be consuming with your request for what you want to put here.)
- Retry: [Whatever you want, in case you want to retry on failure]
- Retry interval: [Whatever you want, in case you want to retry on failure]
- Secure output: You may not want to check this on the Copy data activity so you can see the metadata about the copy operation.
- Secure input: Most likely you will want this checked. It prevents the input from being shown in the logs. Since we will be passing the bearer token from the Web activity as input, it would be wise to secure the input.
- With your Copy data activity still selected, choose the Source tab.
- Under the Source tab, update the following:
- Source dataset: The Microsoft Graph API dataset you created in step 6.
- Dataset properties > endpoint: v1.0/users
Note: This is why we created the endpoint parameter for the dataset in step 6. We can now use this same dataset for other Graph API requests, not just users.
- Request method: GET
- Request timeout: [Whatever you want — it is going to depend on the size of your data]
- Request interval (ms): [Whatever you want — this comes into play if there’s pagination with an API and you have call limits]
- Additional headers: Select New to create a new header. For Name, enter Authorization. For Value, select Add dynamic content, and enter: Bearer @{activity('Get Token').output.access_token}
Note: This assumes you named your token-retrieval Web activity 'Get Token'. If you named it something different, use that name instead.
- Pagination rules: Select New to create a new pagination rule. For Name, choose AbsoluteUrl. For Value, choose Body, and then enter: ['@odata.nextLink']
Note: This pagination rule is saying that for each page of the response, the Graph API is including the next page’s URL in the body of the response. This is extremely helpful because now Data Factory and the Copy data activity will handle the pagination for us rather than us building out the pagination logic ourselves.
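The loop that pagination rule saves you from writing looks roughly like this. A minimal Python sketch with a mocked two-page Graph response (the user names and skiptoken are made up), purely to illustrate following @odata.nextLink:

```python
def fetch_all(get_page, first_url):
    """Follow @odata.nextLink until it disappears — conceptually what the
    AbsoluteUrl pagination rule does inside the Copy activity."""
    items, url = [], first_url
    while url:
        page = get_page(url)
        items.extend(page.get("value", []))
        url = page.get("@odata.nextLink")  # absent on the final page
    return items

# Mocked two-page response standing in for the real Graph API:
pages = {
    "https://graph.microsoft.com/v1.0/users": {
        "value": [{"displayName": "Ada"}],
        "@odata.nextLink": "https://graph.microsoft.com/v1.0/users?$skiptoken=x",
    },
    "https://graph.microsoft.com/v1.0/users?$skiptoken=x": {
        "value": [{"displayName": "Grace"}],
    },
}
users = fetch_all(pages.__getitem__, "https://graph.microsoft.com/v1.0/users")
print([u["displayName"] for u in users])  # ['Ada', 'Grace']
```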
- With your Copy data activity still selected, choose the Sink tab.
- Under the Sink tab, update the following:
- Sink Dataset: [Wherever you want the file to be written]
Note: It’s outside the scope of this article (as if this article needed to be any longer) but I’ve assumed you already have some Azure Storage Account configured to accept JSON files because that’s what our example will show. But this should be able to write to the file system of your choice.
- Dataset properties: I parameterized the container, file path, and file name for my Data Lake JSON dataset, so those parameter values get supplied here.
Note: As an organization, you will have to decide how you want these saved in terms of containers, file paths, and file names. Mine is just one example.
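If you're looking for a starting point for those sink parameters, here is one hypothetical date-partitioned convention sketched in Python. The container and entity names are made up, and your organization's standards should override this entirely:

```python
from datetime import date

def sink_path(container: str, entity: str, run_date: date):
    """One possible convention: container / entity/yyyy/MM/dd / entity.json.

    Returns the (container, folder path, file name) triple you would pass
    to a parameterized sink dataset.
    """
    return (container,
            f"{entity}/{run_date:%Y/%m/%d}",
            f"{entity}.json")

print(sink_path("raw", "users", date(2024, 1, 15)))
```

Date-partitioned folders make it easy to keep a history of daily snapshots and to wire the path to the pipeline's trigger time later.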
That’s it! You have now successfully built a pipeline that can ingest user data from the Microsoft Graph API.
This same methodology can be used to ingest all kinds of different things from the Graph API. Hopefully this gave you a good starting point in how to build out pipelines for your organization.
Thanks for reading!