Cloud computing has taken an important place in today's software development. Triaging software crashes, however, has become more challenging because of the cloud. Crashes never happen at a good time, so it is important to know beforehand where you can get all your logs. That way you can build an efficient triage strategy.
In this blog post I will share some tips I learnt from analyzing crashes in a cloud data solution that I recently developed. This data solution was a web service derived from Azure Machine Learning Studio (Azure ML) that consumes and produces blob storage files. Blob storage is the data storage solution from Azure. Although some of the tips relate only to Azure ML and blob storage, they can also help you build an efficient triage strategy for other cloud solutions.
In the rest of this blog post I will first describe the cloud architecture, then the different logs that are available and how you can access them, and finally I will discuss some bug triage strategies.
High level overview of the cloud architecture
Web service from Azure ML
In the picture above, you can see an overview of the architecture that I built around the web service derived from Azure ML. The only important thing you need to know about Azure ML is that it makes a lot of different data science and machine learning algorithms easily available for solving complex data problems. Such a solution is called an experiment in Azure ML. An experiment consists of different components that are connected to each other. When your experiment satisfies the requirements, you can deploy it as a web service and make it part of your cloud solution.
Web services can be consumed in Request-Response Service mode or in Batch Execution Service (BES) mode. In the first mode you provide a single data point to be evaluated by the web service; in the second you provide larger volumes of data to be consumed by the web service. In this example I use the BES mode.
Feeding the web service from blob storage
The web service is fed with files from Azure blob storage. Azure blob storage is a service for easily storing files in the cloud. In the current architecture, I change my files locally, and a script then automatically uploads them to blob storage. The web service consumes these blobs and produces new blobs. Finally, the new blobs are downloaded to my local machine.
Multiple instances of this web service can run at the same time. If the maximum number of simultaneous instances has been reached, new instances are queued until a running one has finished. In a perfect world, everything runs perfectly and my blog post would end here. However, bugs and crashes are part of real life. Therefore, I will now provide some tips for triaging bugs efficiently, so that you can deal with them when they appear in production.
Have access to all the evidence
Know which logs are available and how you can acquire them
One of the most important prerequisites for triage is having access to all available logs. First, the cloud components that you are using might generate client side logs that you can easily access from your cloud portal. These might not be turned on by default, however. Second, your cloud provider might generate server side logs that you can get access to when you need them. Finally, you must make sure that your own application generates useful logs and that you save them in a consistent way.
Turn on client side logging in Azure ML
To access your logs, you need an Azure account. The client side logs are not turned on automatically in Azure ML. You can turn them on through the Azure Classic Portal or through the Azure Machine Learning Web Services portal. I prefer to use the Azure Classic Portal. You will find these logs in the ml-diagnostics folder in your blob storage.
Get familiar with the log file structure
Your log files will be stored under ml-diagnostics in your blob storage, and each web service has its own unique identifier. Once your web service has been running, you will see one folder for each run of the web service.
If you look into the folder belonging to a run of a web service, you will find files named in the following way: ‘COMPONENT_TYPE’_‘NB’.stdout and ‘COMPONENT_TYPE’_‘NB’.stderr. The stdout file contains the normal output information and the stderr file contains the error information.
Examples of the ‘COMPONENT_TYPE’ are Apply%20SQL%20Transformation, Execute%20R%20Script and Join%20Data. These refer to the different components, such as SQL Transformations, R scripts and Join components, that you can add in Azure ML.
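To make this naming scheme easier to work with, here is a minimal sketch that splits a log file name into its parts. It assumes the names follow exactly the pattern described above (URL-encoded component type, underscore, number, then .stdout or .stderr); the function name parse_log_name is my own and not part of any Azure tooling.

```python
import re
from urllib.parse import unquote

# Assumed pattern: <URL-encoded component type>_<number>.<stdout|stderr>
LOG_NAME = re.compile(r"^(?P<component>.+)_(?P<nb>\d+)\.(?P<stream>stdout|stderr)$")


def parse_log_name(file_name):
    """Return (component, number, stream), or None if the name does not match."""
    match = LOG_NAME.match(file_name)
    if match is None:
        return None
    # unquote turns %20 back into a space, e.g. "Execute R Script"
    return unquote(match["component"]), int(match["nb"]), match["stream"]
```

With this helper you can, for example, list only the stderr files of a run and group them by component type before you start reading them.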
It is, however, still a challenge to know which ‘NB’ corresponds to which component in Azure ML. For the Python script and R script components you can solve this easily by adding an extra print command with a useful identifier, which makes it easier to determine which component you are looking at. For the other components you need to dig deeper into the file structure.
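As an illustration, a Python script component could print its own tag like this. This is only a sketch: it assumes the usual azureml_main entry point of the Execute Python Script component, and the tag text is a hypothetical example that you would replace with something unique per component.

```python
# Hypothetical tag; choose a string that is unique for each component.
COMPONENT_TAG = "[python-script: clean-input]"


def azureml_main(dataframe1=None, dataframe2=None):
    # Printed lines should show up in the .stdout file of this component,
    # so the tag tells you which 'NB' belongs to it.
    print(COMPONENT_TAG, "started")
    row_count = 0 if dataframe1 is None else len(dataframe1)
    print(COMPONENT_TAG, "rows received:", row_count)
    # The component passes its input through unchanged in this sketch.
    return dataframe1,
```

Printing the number of rows received is a cheap extra that pays off later, when you need to check whether a component ran on empty input.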
Save your local logs
Also, make sure that you keep the local logs of the code that calls the web service. This will help you later on when debugging issues. Useful information to record here includes time stamps, unique identifiers, the files that you are saving and the errors thrown by the web service. Also make sure that you keep the information you need to revert changes in case of a crash.
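A minimal sketch of such a local log, using only the Python standard library, could look like this. The function name make_run_logger and the log file name are my own choices; the point is only that every line carries a time stamp and a unique identifier for the run.

```python
import logging
import uuid


def make_run_logger(log_path):
    """Return (run_id, logger); every record gets a time stamp and the run id."""
    run_id = uuid.uuid4().hex  # unique identifier for this call of the web service
    logger = logging.getLogger(f"triage.{run_id}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter(f"%(asctime)s {run_id} %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return run_id, logger


if __name__ == "__main__":
    run_id, log = make_run_logger("webservice_calls.log")
    log.info("uploaded input file %s", "input.csv")  # hypothetical file name
    log.error("web service call failed: %s", "example error text")
```

Because the run id appears on every line, you can later match a local log entry with the corresponding run folder and with whatever you send to support.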
Ask for an example of a cloud side log in a normal case
On the cloud side there might also be extra log information that you can’t access directly yourself. If your web service is a critical component of your data solution, it might be a good idea to ask support for an example of these log files from a normal run. That way you know what information you can derive from them, and you can ask for these logs again when you run into a problem.
Catching the bugs
Is it a missing data issue, a data formatting issue or something else?
Telling a missing data issue apart from a formatting error is tricky. A formatting error means that there is an error in one of the input files, so the web service couldn’t run to the end. A missing data issue means that the data was uploaded to blob storage, but a component of the web service started running before the data was available. This also means that the web service couldn’t run to the end. Besides these two, you can also have a cloud side error, which likewise means that your web service didn’t run to the end.
After some trial and error, I have established the following steps to determine the root cause of the issue.
- 1. I go into Azure Machine Learning Studio and run the Azure Machine Learning experiment with the input blobs that caused the bug.
- 2. If the experiment runs smoothly to the end, I examine the log files of the last component and look for something that says 0 rows. If I find this, it means that a component started to run before the data was available. Otherwise there might be a cloud side issue, which I will discuss later.
- 3. If the experiment fails, I now know on which component it failed. During development of the R and Python components I made sure to add enough logging information, and this helps me to track down the issue and resolve it. The fix consists of correcting the local input file and uploading the corrected file to the cloud.
If neither of these checks explains the crash, you might have run into a cloud side error.
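The decision logic of the steps above can be sketched in a few lines. This is an illustration under assumptions, not a tool I ship: the function name and the three labels are my own, and it assumes the last component's stdout literally contains the text "0 rows" when it ran on empty input.

```python
import re

# Match "0 rows" as a whole number, so that "10 rows" does not count.
ZERO_ROWS = re.compile(r"\b0 rows\b")


def classify_failure(experiment_succeeded, last_component_stdout):
    """Map the result of re-running the experiment to a likely root cause."""
    if not experiment_succeeded:
        # Step 3: the experiment fails on a component, check its logs and input file.
        return "formatting"
    if ZERO_ROWS.search(last_component_stdout):
        # Step 2: the re-run works, but the original run saw no data.
        return "missing_data"
    # Neither check explains the crash.
    return "cloud_side"
```

Note that the second argument is the log of the original, crashed run, while the first argument is the outcome of the manual re-run in Azure Machine Learning Studio.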
Cloud side error
There are two types of cloud side errors: glitches and systematic errors. A cloud side error always starts out as a glitch, because there has to be a first time that something bad happens. Make sure that you store the information about this error, so that you can recognize it if it comes back. If you don’t see the error happening again, it was a glitch.
However, when you see the same error happening more than once, you might have run into a systematic error and it is time to call the Azure ML support team for help. In this case, make sure that you provide all your local logs and your client side logs, so that the support team can nail down the issue as fast as possible.
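If you store every cloud side error you encounter, separating glitches from systematic errors becomes a simple counting exercise. The sketch below assumes you record one short signature per error (for example the error message); the function name and the signatures in the usage example are hypothetical.

```python
from collections import Counter


def split_glitches(error_signatures):
    """Return (glitches, systematic): errors seen once versus more than once."""
    counts = Counter(error_signatures)
    glitches = sorted(sig for sig, n in counts.items() if n == 1)
    systematic = sorted(sig for sig, n in counts.items() if n > 1)
    return glitches, systematic


if __name__ == "__main__":
    seen = ["timeout", "internal error", "timeout"]  # hypothetical signatures
    glitches, systematic = split_glitches(seen)
    print("glitches:", glitches)
    print("systematic:", systematic)
```

Anything that ends up in the second list is a candidate for a support ticket.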
I hope these tips help you discover new pieces of logging information in your cloud architecture. Hopefully you will never need them, but they will help you out in the unfortunate case that your cloud application crashes.