The Data Instructors: data disk

Background note

This blog post is the second post in a series about how to get started to build a data architecture in the cloud. The architecture is described in the first blog post. This series features two people, Dana Simpsons the data scientist and Frank Steve Davidson the full stack developer, who are building their own SaaS company that will build games based on data science. In this blog post we will describe how you can create a Linux Virtual Machine (VM) in Azure. We also will investigate the different components of hosting a VM in the cloud. This is the next blog in the series.

Introduction

In this blog, Frank has written down his notes that he uses to deploy virtual machines. He has added in lots of screenshots so that it is easy for him to remember the different steps he needs to take in the Azure Portal. These screenshots will also make it easier in the future to hand the creation of VMs off to someone else. While Frank is figuring out how to create the VMs, Dana is investigating the costs and the durability of the VMs when they are not powered on.

Position of this blog in the architecture

Currently Frank will have two VMs on which he will host the different components like the Ruby layer and the Apache web server to host the websites for the different games.
In what follows he will explain first how he creates the first Debian VM in the Azure cloud that is used for Ruby development. Next he will explain which links he uses to install Ruby on this VM.
The second VM he will use as a web server. Therefore he will need to install apache on this VM.
Finally, Dana will give her analysis of the costs that are involved in hosting a VM in the cloud. She will look both at the fixed and the variable costs of hosting a virtual machine in the cloud.

Creating a Debian Linux VM for Ruby development

1. Log into the Azure Portal

Go to the Virtual Machine Section by selecting Virtual Machines in the left column.

2. Add a virtual machine

3. Select Debian Jessie

4. Configure the basic options.

You can setup your virtual machine so that you can login with your public private key pair. To generate this pair, have a look here. For the rest you can create or use an existing resource group. Next select the geographical location where you want to store your virtual machine.

5. Select the size of the virtual machine

Now you can select the requirements for your virtual machine. When you are creating a virtual Linux Machine you will have an Operating system disk (OS disk) and a temporary disk attached to it.

The Operating system disk is labeled /dev/sda and the temporary disk is typically /dev/sdb. The data of the OS disk is persistent in case you stop, deallocate, pause, reboot, shut-down or resize the VM. However, in case of a delete or a failure of the VM this disk is not persistent. In case that you want to have a case that your disk is persistent in the case of a failure or a deletion of your vm you need to attach an extra data disk.

So in case that you are not working with a data disk make sure that you are making backups of your system in a different way. A good strategy for this is two make sure your code is in github and to study the Automation script that has been generated to create this VM combined with your .bash_history when you installed the needed software packages.

For the development VM we will select the DS1_V2. However in case you need more computing power, you just can change the size of your VM. In cases that you need less computing power, you just can scale down your VM.

6. Creating an attached data disk

Because Frank wants to make sure that his data and his code is still safe in case there happens a failure of the virtual machine or if it would get accidentally deleted, he also attaches a data disk to his VM. Therefore he goes to his VM in the azure portal and selects Disks. Next he will click Add data disk.

Now he will select a storage container where he wants to locate his attached data disk. He also has the possibility to select an existing blob. Make sure that you carefully watch the size of the disk that you are making. To have an idea of the prices you can have a look here. We also will discuss this further in the cost analysis section.

Based on the information that you can find here, you will be able now to initialize the new data disk in your VM and mount it as a separate drive. Carefully execute the different steps.

7. Installing the needed software and packages

Now we need to install Ruby. If we just would install Ruby with sudo-apt-get install we wouldn’t get the newest version. Therefore you can follow the steps as described on this site to install Ruby. For this application ruby 2.4.0 was installed.

To be able to work with Ruby and Azure you still will need to install the Azure gem. To call the web services generated from Azure ML, you will need to install the httparty gem for Ruby. Therefore, you will need to type the following two commands.

sudo gem install azure
sudo gem install httparty

Creating a Debian Linux VM as a Web Server

1-4. Creating a VM from the portal

1 till 4 are the same as the previous. However in this case we will select a smaller VM and we will not attach a data disk. You can see the specifications below. But remember again in case you need more computer power, you easily can scale up your VM.

5. Defining a public IP address

Because this VM will be used to host the webserver, a public IP is needed. You can select a static when you are creating the VM. Because this is an extra feature that you are selecting, there is an extra cost added for this.

6. Defining a Network Security Group

Because your VM will be used to host your websites, it also will need to be accessible through http. Therefore an extra Network Security Group will need to be defined as you can see below.

7. Installing Apache2

Finally, to install apache2, you need to type the following command.

sudo apt-get install apache2

Cost Analysis
As the last part of this post, Dana will describe the cost of creating and using the virtual machines in Azure. This consists of a combination of fixed costs and variable costs.

The cost of the web server

The web server will need to be up all the time. For the rest this VM also needs a static IP and http access. Therefore you can consider all the costs of this VM as fixed. The cost for a public static IP is 0.11 CAD per day. Therefore the cost of the web server will be 48.3 CAD a month.

The cost of the Ruby development VM

The development machine doesn't need to be powered on all the time. Therefore we can make a distinction between fixed costs and variable costs for the development VM.

The fixed costs are the costs that you need to pay whether you machine is running or not. In this case these costs consist of the Operating System Disk of the VM and the attached data disk. For this VM, the cost of the disks is 0.06 CAD per day. Next Frank also decided to attach a standard unmanaged data disk of 128 GB that costs him 8.6 CAD a month.

Let's have a look now at the variable costs. In case the virtual machine is running the whole day, the machine costs 2.13 CAD per day. So in this way you see that you can save a lot by turning off your development VM in case you are not using it. To facilitate this, there exist auto-shutdown options that you don't need to remember each day to turn off your VM. It is good to know that your data stays on the operating disk in case your VM is in the deallocated state. Remember however that there is no automatic back up of this drive in case of failure. If you want to make daily backups of your system. You can check out this resource to understand the costs.
If you are using a standard unmanaged disk, you also will need to take into account the access price cost. This is a cost that you will be able to estimate further after your first months of development.

Combining everything together. We assume that Frank has 20 work days of eight hours in a month. This means that he will be using his VM 160 hours a month. Therefore the cost of running his VM for a month is 14.2 CAD during his working days. Next he still needs to pay for this fixed costs of the OS disk (1.2 CAD a month) and the attached data disk (8.6 CAD a month). In this way the cost of the total development VM is 24 CAD a month. By using the auto-shut down option, Dana has implemented cost savings and used some of these cost savings to add in extra piece of mind for Frank by adding the attached data disk.

Conclusion

In this blog post Dana and Frank have described all the different steps that you need to follow to create a Linux VM in Azure and to use Ruby on this VM. We also discussed further how to setup a public IP address and how you can allow http access so that the VM can be used as a web server. Finally Dana discussed the cost of hosting such a VM in Azure and we discussed the types of disk that are available on the VM. In the next blog post we will learn more about how to use this VM to access blob storage in Ruby.

Introduction

In 2017 about 35% of new applications will use cloud-enabled, continuous delivery and DevOps life cycles. The use of data is becoming more and more important in the current enterprise with terms like Data-Driven Decision Making and Data-Driven Decision Management.

But how do you actually start with building data applications in the cloud? Which known technologies can you use? What are new technologies? How do all these technologies work together? What will the cost be of such a solution?

We will try to answer these questions in this series of blog posts. We will show how two people, Dana and Frank (see picture below) will start building the next Unicorn using cloud based data technology. We will first present their product, next introduce Dana and Frank and finally explain how they started building their architecture.

The next blogs in the series can be found here

The Usbourne first book of the Computer, Usbourn Publishing Ltd, 1985

The game changing idea

Together they are building a SaaS company that makes online games with data science. Their first game that they are working on is called What's the title?. They show automatically generated data graphs of classic novels. The user needs to guess the title. A real game changer. Below you can see their first game and you can also check out the link to their website.

Meet the team

For now, the team consists of only two people, but they hope to grow fast in the near future. Therefore they both agreed that they will need to acquire extra skill sets besides their current skills to make the full application. Their focus initially is to get all the components connected to have a global overview and see how the different pieces are working together. Later on they can then elaborate further on the different components and move on to using more advanced technologies.

Dana Simpsons, the Data Scientist

Dana loves everything about mathematics and data and she has a decent background in computer science. Her favourite programming languages are R and Python. Finally she also has some SQL and NoSQL database skills. As part of her tasks she will also watch the cloud costs and look for potential savings. She will also be responsible for customer success and will analyze the usage patterns.

Frank Steve Davidson, the Full Stack Developer

Frank loves everything about programming and has a decent background in mathematics. His favorite programming languages are Java and Ruby. He will be responsible for the architecture design and will make sure that the solution gets deployed in the cloud.

The generic architecture
Dana and Frank did some brainstorming meetings and came up with the following architecture. With this architecture they want to achieve the three goals explained below.

Goal 1

Their game will be hosted in the cloud. This means that both their development and their production servers will live in the cloud together with the data storage. This means that they will need to look at how virtual machines are being set up by their chosen cloud provider and what kind of cloud storage solutions there are available. Because they are hosting a website, they also still will need to investigate this further.

Goal 2

Frank has chosen to work with Ruby as programming language to build out the back-end further. Some responsibilities of the back-end will be the storing of different novels in the cloud in a data lake and transforming these novels into a data format consumable by the web services that Dana is building. He will extract the images that Dana generates and will store them back in the cloud. For now they only will use a simple front-end because the data solution will already provide detailed images. In the future they will look further into more advanced front-end technology.

Goal 3

Dana will need data science related software to be able to make her analysis and generate the graphs. She also will need to have an easy way to access different data sources when she is performing her analysis. She prefers to have access to python and R. Finally she wants to have an easy way to expose her graphs and analysis to Frank. When she is happy with her data science solutions she wants to build an automation layer on top of this.

The architecture in Azure

Dana and Frank compared the different cloud providers and they came to the conclusion that Azure suited their needs best. The architecture in Azure is shown below. In what follows we will explain these components more. In the next blogs you will learn how to setup and use these components.

Goal 1: How to deploy and develop

Frank uses a Linux Virtual Machine in Azure on which he has installed Ruby. He next uses a virtual machine to host the web server Apache2. To save costs he will put his Ruby development machine on auto-shut down mode. To be able to host this virtual machine as a web server he will need to select a public ip address and allow http access on the server. We go into more detail in these steps in the next blog post of this series.

Goal 2: How to talk to and feed the data storage solution

Frank and Dana use blob storage as their cloud storage solution. You can imagine Blob storage as your own big drive in the cloud. When you are developing you can use several APIs to access blob storage. Frank will use the azure gem for Ruby. Dana will analyze the costs that are involved with using blob storage. We go into more detail in this in the third blog post of this series.

Goal 3: How to talk with and build the Data Science Solution

Dana used Azure Machine Learning Studio to build an experiment that generates the needed graphs. Next she deployed this experiment as a web service. This web service consumes the formatted text of a classic novel that is stored in Blob storage and then produces six different graphs. These graphs are then stored in Blob storage. She will describe her strategy in the fourth blog post. Next Frank will describe in the last blog post how he feeds and consumes the data from this web service.

Conclusion

In this blog post we have introduced two people, Dana and Frank, on their journey to build their SaaS company in the cloud. We briefly mentioned some of the technologies that they needed. In the next blog posts we will explain these technologies more in depth so that you will be able to use them for your own application.

The Data Instructors

Tuesday, 7 March 2017

Setting up Linux Virtual Machines in Azure For Ruby Development.

Monday, 6 March 2017

How to build an architecture for your cloud data application