Introduction
One very nice Azure Sphere feature that's not discussed very often is Error Reporting. Azure Sphere devices automatically generate and send error reports to the Azure Sphere Security Service (AS3). Each device sends zero or one error report each day, and a single report can include multiple events. Error reports are generated for application crashes, exits and other types of application failures. When a high-level application encounters an unexpected error, it can exit and return an application-defined exit code to the OS.
Error reports allow product teams to monitor the health of deployed devices. You can have tens of thousands of Azure Sphere devices deployed in a tenant and pull an error report that covers them all. Error reports can be extracted from your Azure Sphere tenant by using the Azure Sphere CLI or using the public API.
There are some constraints with error reporting. Error reports contain a maximum of 1,000 events or 14 days of data, whichever is reached first. If you're going to implement a device health monitoring system, it needs to pull the reports often enough so that error reports are not missed.
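As a rough back-of-the-envelope check on how often to pull, here is a minimal Python sketch. It uses only the two limits stated above and assumes events arrive at a steady rate across the tenant, which a real fleet won't do. The function name and the `events_per_day` parameter are hypothetical.

```python
MAX_EVENTS_PER_REPORT = 1000  # limit stated above
MAX_DAYS_PER_REPORT = 14      # limit stated above


def max_pull_interval_days(events_per_day: float) -> float:
    """Longest gap between pulls before a report window could overflow.

    Assumes a steady event rate, which is a simplification.
    """
    if events_per_day <= 0:
        # No events at all: only the 14-day window limits the interval.
        return float(MAX_DAYS_PER_REPORT)
    days_until_event_cap = MAX_EVENTS_PER_REPORT / events_per_day
    return min(float(MAX_DAYS_PER_REPORT), days_until_event_cap)
```

For example, at an assumed 250 events per day the event cap is reached after 4 days, so a weekly pull would likely miss data. In practice you would want a healthy margin below whatever number this gives.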
This blog documents an experiment that I conducted to help me learn about and exercise the error reporting feature.
Documentation
As usual, Microsoft has done a fantastic job of documenting the feature and how to interpret error reports here.
Example Application
To generate error reports, I wrote an Azure Sphere application that uses a direct method to trigger application crashes and application exits with passed-in exit codes. This lets me deliberately execute code that generates error reports. Next, I'll pull the error report and verify that it contains the expected information.
I developed the application using a new Azure Sphere development library called AzureSphereDevX created by Microsoft's Dave Glover. Expect to see more on this development library in the future. I'll include a link to my application so you can use it to play with the feature if you want to. By the way, my application development time was ~30 minutes using the AzureSphereDevX direct method example!
Application Features
Since I want to drive my errors/exits using a direct method from the cloud, my application will connect to an IoTHub in Azure. The application implements a direct method called "ErrorReport," and I added logic to drive the Starter Kit RGB LED to show IoTHub connection status.
Direct Method details
Name: "ErrorReport"
Expected payload: {"ErrorCode": <integer>}
ErrorCode == 0, the application exits with the SIGKILL signal
ErrorCode == 1, the application exits with the SIGSEGV signal
ErrorCode > 1, the application exits and returns the passed in ErrorCode value to the OS
The direct method code is straightforward. First turn off both the red and blue LEDs, then use the passed-in payload in the switch statement.
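When checking the error report later, it helps to write down what each payload should produce. The sketch below is a Python helper for the test side, not the device code: it maps a payload to the outcome listed in the direct method details. The function name and the returned labels are hypothetical and are not taken from the application or from the report format.

```python
import json


def expected_outcome(payload: str) -> dict:
    """Map an "ErrorReport" payload to the outcome described above."""
    error_code = json.loads(payload)["ErrorCode"]
    if error_code == 0:
        return {"kind": "signal", "signal": "SIGKILL"}
    if error_code == 1:
        return {"kind": "signal", "signal": "SIGSEGV"}
    if error_code > 1:
        return {"kind": "exit", "exit_code": error_code}
    # Negative values are not covered by the direct method details.
    raise ValueError(f"ErrorCode {error_code} is not defined for this method")
```

Calling `expected_outcome('{"ErrorCode": 100}')` gives an exit with code 100, which is what I later look for in the report.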
If you want to learn more about implementing and using direct methods you can reference this blog.
IoTHub Connection Status LED
Since my application will be exiting or crashing, and I want to run a few different tests, I added code to drive the Avnet Starter Kit RGB LED based on the connection status. This way, after I crash the application, I just need to wait for the blue LED before running my next test. Note that the Azure Sphere OS monitors the status of the application and will automatically restart the application if/when it exits for any reason.
RED == Not connected to the IoTHub
BLUE == Connected to the IoTHub
The LED logic runs every 2 seconds
Experiment Overview
Here are the high-level steps for my experiment . . .
- Build and test my application in development mode
- Create a new product so I can deploy my application to my device OTA
- Upload my application to my Azure Sphere tenant
- Create a deployment with my application and put it into my new product in the Production device group
- Learn how to set up OTA deployments using Avnet's OTA lab
- Move my device into the device group
- Verify that the device is connected to my IoTHub
- Using the AzureIoTExplorer tool
- Call my direct method with payload {"ErrorCode": 100} // Application exits and returns errorCode 100 to the OS. Why 100? Avnet turned 100 years old in 2021!
- Wait for the application to restart and connect to my IoTHub
- Call my direct method with payload {"ErrorCode": 0} // Application exits with the SIGKILL signal
- Wait for the application to restart and connect to my IoTHub
- Call my direct method with payload {"ErrorCode": 1} // Application exits with the SIGSEGV signal
- Pull an error report and verify that the events are included
Experiment Details
Set up the OTA Deployment
- Upload my image to my Azure Sphere tenant
> azsphere image add --image C:/source/repos/ActiveProjects/errorReporting/examples/avt_error_reporting/out/ARM-Debug/direct_methods_example.imagepackage
Uploading image from file 'C:/source/repos/ActiveProjects/errorReporting/examples/avt_error_reporting/out/ARM-Debug/direct_methods_example.imagepackage':
--> Image ID: d2b1fa53-0f65-47e7-a1af-2157530c785f
--> Component ID: b217fd65-ac8d-4d7c-b62c-0d07b5df6357
--> Component name: 'direct_methods_example'
Removing temporary state for uploaded image.
Successfully uploaded image with ID 'd2b1fa53-0f65-47e7-a1af-2157530c785f' and name 'direct_methods_example' to component with ID 'b217fd65-ac8d-4d7c-b62c-0d07b5df6357'.
- Create a new product in my Azure Sphere tenant
> azsphere product create --name errorReportingTest
Default device groups will be created for this product, use the 'azsphere product device-group list' command to see them.
------------------------------------ ------------------------------------ ------------------
Id TenantId Name
============================================================================================
99d007a8-7f6f-4b5d-8f95-f7ee3bc7584c 8d34f65c-532e-4dcf-a1d6-3e811c1e5c68 errorReportingTest
------------------------------------ ------------------------------------ ------------------
- Create a new deployment in the errorReportingTest product and Production device group using the application I uploaded
> azsphere device-group deployment create --device-group "errorReportingTest/Production" --images d2b1fa53-0f65-47e7-a1af-2157530c785f
------------------------------------ ------------------------------------ ------------------------------------ --------------------------------
Id                                   TenantId                             DeployedImages                       DeploymentDateUtc
===============================================================================================================================================
b1eb12b0-1d1d-48fc-9891-a66c9043247c 8d34f65c-532e-4dcf-a1d6-3e811c1e5c68 d2b1fa53-0f65-47e7-a1af-2157530c785f 2021-07-14T13:19:59.804054+00:00
------------------------------------ ------------------------------------ ------------------------------------ --------------------------------
- Add the device connected to my PC to the Production device group
> azsphere device enable-cloud-test --device-group "errorReportingTest/Production"
The device group with ID 'c0d6d4f9-bd0f-4ba3-b778-3ea2946fa05a' is selected from parameter 'errorReportingTest/Production'
Getting device group by ID 'c0d6d4f9-bd0f-4ba3-b778-3ea2946fa05a'.
Removing applications from device.
Component 'b217fd65-ac8d-4d7c-b62c-0d07b5df6357' deleted.
Removing debugging server from device.
Component '8548b129-b16f-4f84-8dbe-d2c847862e78' deleted.
Successfully removed applications from device.
Locking device.
Device ID: '83d989df42d344ffaa21769819081ffd1a55fc547808279878e57925de657f10d02b9ae85618ddb020bd9b11593dcb53180f39be2a96231338742f3139234588'
Downloading device capability configuration.
Applying device capability configuration to device.
The device is rebooting.
Successfully locked device.
Setting device group to 'Production' with ID 'c0d6d4f9-bd0f-4ba3-b778-3ea2946fa05a')
Successfully updated device's device group.
Successfully set up device for application updates.
(Device ID: '83d989df42d344ffaa21769819081ffd1a55fc547808279878e57925de657f10d02b9ae85618ddb020bd9b11593dcb53180f39be2a96231338742f3139234588
- My application was successfully deployed to my device and connected to my IoTHub: the RGB LED is blue
Exercise the direct method
Next we need to crash the application using the direct method. I'm a big fan of the AzureIoTExplorer tool. I'll use this tool to call my direct method from the cloud.
- Pass in errorCode 100 to the direct method. When the application processes the direct method call, it exits and returns 100 to the OS as the exit code.
- Call the direct method again with payload: {"ErrorCode": 0} // generate a SIGKILL event like you ran out of memory!
- Call the direct method again with payload: {"ErrorCode": 1} // generate a SIGSEGV event
Pull and review the error logs from my Azure Sphere tenant
You can use the Azure Sphere CLI or the Azure Sphere Public API to pull the error logs. When I ran my experiment my error logs did not show up in my error report until the next day. Since error reports are only sent to the Azure Sphere tenant once a day, this makes sense.
Follow this link to see the details for downloading the error report using the Azure Sphere CLI. Here's an example of the CLI call:
> azsphere tenant download-error-report --destination c:\error-report.csv
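Once the CSV is on disk, a few lines of Python can total up the events by type. This is only a sketch: the column names `EventType` and `EventCount` are assumptions on my part, so check the header row of your own file and pass the real names in.

```python
import csv
from collections import Counter
from typing import TextIO


def count_events(report: TextIO,
                 type_column: str = "EventType",
                 count_column: str = "EventCount") -> Counter:
    """Sum event counts per event type from a CSV error report.

    The default column names are assumptions; adjust them to match
    the header row of the downloaded file.
    """
    totals: Counter = Counter()
    for row in csv.DictReader(report):
        totals[row[type_column]] += int(row[count_column])
    return totals
```

With the file from the CLI call above, usage would look like `with open(r"c:\error-report.csv", newline="") as f: print(count_events(f))`.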
I'll be using a tool called Azure Sphere Explorer, a new tool that I recently learned about. It's pretty basic, but it uses the Azure Sphere Public API to expose details about your Azure Sphere tenant, including pulling the error report. After running the tool and requesting my error report, I can see my events.
If we decode each of these events . . .
- This event was an application crash. The signal is reported as 11 (SIGSEGV). The program counter, stack pointer and component_id are all included in the report. The signal enumerations can be decoded by referencing the signal man page. See the graphic below from the man page.
- This event was an application update, and it happened twice: once for the first deployment, and once after I decided to add the LED feature following my first run of the example application. I loaded the new application to my OTA deployment, and this event was captured as a second planned application update.
- This event was a result of the application exiting and returning exit code 100. Looks like I exercised this 4 times! Note that the report includes not only the exit code the application returned, but also the component_id (from the app_manifest.json file that identifies my application) and the unique image_id from the OTA deployment. Using this information, support teams should be able to identify the specific build that was running and the exact line of code that encountered the unexpected condition (assuming the error code is only used from a single location in the application).
- The last event is an application crash with the SIGKILL signal. Again the component_id and image_id for the application are captured.
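The decoding above can be scripted. Below is a minimal Python sketch that turns a signal number into a name for the two signals used in this experiment (9 and 11 in Linux numbering). It assumes the event details arrive as `key=value` pairs separated by semicolons and that the signal number sits under a key named `signal_status`; both are assumptions, so adjust them to whatever your report actually contains.

```python
# Linux signal numbers for the two signals used in this experiment.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV"}


def parse_details(details: str) -> dict:
    """Split 'key=value; key=value' text into a dict (assumed layout)."""
    fields = {}
    for part in details.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that contain no '='
            fields[key.strip()] = value.strip()
    return fields


def describe_signal(details: str, key: str = "signal_status") -> str:
    """Return the signal name for the number stored under `key`."""
    number = int(parse_details(details)[key])
    return SIGNAL_NAMES.get(number, f"signal {number}")
```

Signals outside the small table fall back to a plain `signal N` label rather than guessing a name.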
Application details
If you want to reproduce my experiment you can leverage my application . . .
- Pull my repo: > git clone --recurse-submodules https://github.com/Avnet/AzureSphereDevX
- Open the /examples/avt_error_reporting example
- Update the app_manifest.json file with details for . . .
- Your DPS Scope ID
- Your IoTHub Hostname
- Your Azure Sphere Tenant ID
- Build and run the application
Conclusion
To get the full benefit of error reports, application developers should incorporate exit codes into their applications in a way that lets them identify the specific code sections where unexpected errors occur. Other than that, generating meaningful error reports doesn't require any effort from the engineering team.
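One way to picture this on the support side is a table that maps each exit code back to the single place in the application that returns it. The Python sketch below is purely illustrative: every name and value is hypothetical except 100, which is the exit code I used in this experiment.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit codes, each returned from exactly one code location."""
    INIT_GPIO_FAILED = 2
    INIT_TIMER_FAILED = 3
    DIRECT_METHOD_TEST = 100  # the value used in this experiment


def locate(exit_code: int) -> str:
    """Translate a reported exit code into the name of its failure site."""
    try:
        return ExitCode(exit_code).name
    except ValueError:
        return f"unknown exit code {exit_code}"
```

Here `locate(100)` returns `DIRECT_METHOD_TEST`, and a code that isn't in the table is flagged as unknown, which is itself a useful signal that the table and the deployed build have drifted apart.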
The Azure Sphere error reporting feature is very powerful, and it's free with any Azure Sphere MCU! Product support teams can monitor the health of all the Azure Sphere devices deployed to an Azure Sphere tenant, and when errors or crashes are reported, the team has all the details to locate the specific application and code that generated the error or exit. This is huge! I suspect that as the Azure Sphere ecosystem continues to expand we'll see projects that automatically pull and analyze the error reports.
If you're interested in seeing how to use the public API to pull error reports you can reference the Azure Sphere Explorer tool on GitHub. Be sure to visit the GitHub page and like the project. I would love to see this tool improve in the future.
Please post your questions, comments or experiences with this feature below. I enjoy learning about the different ways engineering teams leverage the Azure Sphere features.
Brian