Machine learning (ML) and artificial intelligence (AI) are getting a lot of attention these days, and an increasing number of people are interested in becoming data scientists. Many imagine ML/AI as a black box that does its magic and is somehow able to make predictions. Well, the magic is that an ML algorithm is designed in such a way that it can find patterns in data for which the correct answers are known and use those patterns to predict the next answer. The learning task may vary from predicting the next word in a sentence or the sentiment of a text to what else you would like to buy based on what you already have in your basket.
Many algorithms are inspired by the way we humans learn and make decisions. We learn from experience and from our ability to find patterns and make deductions. The same is true for ML algorithms, where the experience is many examples of data representing the circumstances at the point of making a decision as well as the correct answer. AI takes this learning capability a step further: it is the ability to make a new decision based on past experience and to discover new solutions to a problem without necessarily having seen an example of a correct answer in past observations. However, it can assess whether a decision was good or bad based on how close the learning algorithm gets to a set goal.
Although ML and AI are buzzwords in the emerging technology space, ML has a long history. The first computer program able to learn as it ran was written in 1952 by Arthur Samuel. This was shortly followed by the first artificial neural network (NN) in 1958, designed by Frank Rosenblatt. The computational power available back then, however, could not do justice to the predictive power of NNs, so they had to wait for their glory days. More on the history and evolution of ML can be found at https://www.doc.ic.ac.uk/~jc3717/history-machine-learning.html.
Even though it is exciting to keep up with the latest ML research and see how prediction accuracy increases as more and more complex deep NNs (DNNs) are created, industry applications go beyond designing an ML algorithm. The ability to design a production-ready ML pipeline is required for a company to use AI solutions.
Why ML pipelines?
There is a lot of enthusiasm around data science, with a large number of online ML courses that teach everything from natural language processing (NLP) to computer vision (CV). These courses use clean, labeled data that are easy to work with and require minimal processing before being ingested by an algorithm. In real-world applications, this is rarely the case. Unless the data have been collected for research purposes, the data available for an ML project will be incomplete, unstructured, and labeled sparsely, if at all.
Data processing becomes a crucial part of any ML project and, unless you have the luxury of spending a lot of time and money labeling data, you need to become creative about how to train the ML models with small, labeled data sets. This article aims to explain how ML is applied in real-world projects, covering data processing, working with small data sets, and concepts of continuous delivery (CD) and continuous integration (CI).
Extract, transform, and load
Businesses and companies venturing into the AI space want to gain insights from their data, often feeling that they have rich data sets available. However, these data have naturally not been collected with an AI project in mind. The data structure often changes over time, and data are stored in various formats. If you are lucky, it is just a set of databases and tables. More often than not, companies—even large enterprises—rely on Excel spreadsheets.
Because of that, learning tools for data ingestion and transformation will make your life as a data scientist much easier. In my experience, data extraction, transformation, and loading (ETL) forms 30–50% of project development. Not only do you need to process and transform the data to train an ML model, but you also need a script that does this every time the data set is updated and a new prediction is made.
Fig. 1 shows an expectations-versus-reality diagram of an ETL pipeline. Data are often collected from a variety of sources, such as a Structured Query Language (SQL) server with multiple databases and tables, Excel spreadsheets, and folders storing data files, such as images. Access to the data is often restricted, and the credentials to access the data need to be stored securely, away from the script itself. Once the data are processed, you want to store them somewhere so that, the next time a prediction is made, only new data have to be processed.
Once the ingestion part is done, the output also needs to be transformed and stored. Exceptions and logs recording any errors or warnings should be stored for better issue tracking. Processed output can be stored in an SQL database that is connected to an application. This application can be either a customer-facing web app or a company dashboard, such as Power BI. Sometimes, an action needs to be taken based on the prediction outcome, such as sending an email or text message.
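To make this concrete, the following is a minimal ETL sketch in Python using pandas and SQLAlchemy. The connection string, file names, and table and column names are illustrative assumptions rather than parts of any real project; the database credentials are read from an environment variable so they stay out of the script itself.

```python
# A minimal ETL sketch, assuming a SQL source, a legacy Excel source, and a
# results table; all names here are illustrative placeholders.
import os

import pandas as pd
from sqlalchemy import create_engine

# Credentials live in the environment, not hard-coded in the script.
engine = create_engine(os.environ["SALES_DB_CONNECTION_STRING"])

def extract() -> pd.DataFrame:
    """Pull raw data from the SQL server and a legacy spreadsheet."""
    orders = pd.read_sql("SELECT order_id, customer_id, amount FROM orders", engine)
    customers = pd.read_excel("customers.xlsx")
    return orders.merge(customers, on="customer_id", how="left")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean up incomplete records and enforce consistent types."""
    df = df.dropna(subset=["amount"])
    df["amount"] = df["amount"].astype(float)
    return df

def load(df: pd.DataFrame) -> None:
    """Store processed rows so only new data need processing next time."""
    df.to_sql("processed_orders", engine, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```

In a production pipeline, each of these steps would also log its exceptions and warnings so that failures can be traced later.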
Small data sets
Deep learning algorithms are the latest craze in the AI world. To train a DNN, a large, labeled data set is required. It is very seldom that this data set is available. More often, the data set you are going to work with has no labels, and the samples are being labeled as part of the project. Let’s assume that there is at least a small labeled set available. There are a few ways to overcome the need to go for unsupervised learning and make the most of the data you have.
Transfer learning
If the problem you are attempting to solve is well known, you might be lucky enough to reuse a pretrained DNN and apply transfer learning to retrain the network with your data set. This is often the case when using NLP and CV for tasks such as sentiment analysis, word embedding, image recognition, and object identification. However, there are a few caveats as to when transfer learning can be applied:
- The learned task needs to be similar enough to the original problem. For example, a network trained to recognize cats can be used to recognize tigers.
- The trained model must be able to learn general features with a good level of abstraction. Overfitted models are not suitable for transfer learning.
Transfer learning is popular and easy to implement in NNs. It can be as simple as cutting off the last two layers and retraining the model with a new data set on a shallow NN that is placed on top of the pretrained model.
Transfer learning can be applied to techniques other than DNN algorithms. Methods like expectation maximization are used to retrain ML model parameters for classic (non-NN) approaches. An illustration of transfer learning applied to a DNN is shown in Fig. 2, where a pretrained neural network with n hidden layers is retrained on a similar task by replacing the last two layers, the activation layer and the output layer, with a shallow neural network.
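As a hedged illustration of how simple this can be in practice, the following sketch uses the Keras API in TensorFlow to place a small trainable head on top of a frozen, pretrained image model. The choice of MobileNetV2, the input shape, and the number of target classes are assumptions made for the example, not requirements.

```python
# A minimal transfer-learning sketch with Keras; the pretrained base,
# input shape, and number of classes are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of target classes

# Pretrained convolutional base with its original classification head removed.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(160, 160, 3)
)
base.trainable = False  # freeze the general features learned on ImageNet

# Shallow network placed on top of the pretrained model.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_images, train_labels, epochs=5)  # train on your small labeled set
```

Only the new head is trained here; unfreezing a few of the top layers of the base for fine-tuning is a common next step once the head has converged.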
Classic approaches
Classic approaches often require far less data for training. The size of the data set required to train an ML model also depends on the number of features and target classes. The more features and classes there are, the larger the training set needs to be. This scales exponentially for DNNs. The number of features can be reduced by feature-extraction and feature-mapping methods, such as principal component analysis. However, a large number of classes with an often unbalanced class distribution—meaning there are more samples of some classes and fewer or no samples of other classes—makes the training of DNNs difficult.
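As a hedged example, the following sketch reduces the feature count with principal component analysis before fitting a classic classifier, using scikit-learn. The data are randomly generated stand-ins, and the shapes and parameter values are illustrative only.

```python
# A minimal sketch: PCA feature reduction followed by a classic classifier.
# The data, shapes, and parameters are placeholders, not a real data set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((500, 40))         # 500 samples, 40 raw features
y = rng.integers(0, 3, size=500)  # 3 target classes

# Keep enough principal components to explain 95% of the variance,
# then fit a classic (non-NN) classifier on the reduced features.
model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.named_steps["pca"].n_components_, "components retained")
```

On real data, of course, the labels are not random, and the retained components capture the structure that actually matters for the prediction task.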
In my experience, a typical data set contains hundreds to a few thousand samples. For data sets of this size, classic approaches and shallow NNs are a better fit than complex DNNs. Fig. 3 illustrates how data set size affects the performance of ML models.
Virtual environment and containers
Now that the data ETL pipeline is done and the ML model is built, how do we bring this project to production? The code needs to live in, and be executed from, a self-contained environment where you can control all of the variables, from the operating system to the versions of the libraries used in the code. Controlling library/package versions is important because some library versions might not be compatible with each other. This will save you from saying, “It works on my machine.”
To achieve this, build your code in a virtual environment or a container. Both cater to library version management. The main difference between a virtual environment and a container is the degree of control over the operating environment. A container provides virtualization at the operating system (OS) level and full control over library management. A virtual environment, not to be confused with a virtual machine, shares the OS of the host machine and creates an isolated environment in which to manage libraries. The best-known and most used container software is Docker. Grasping the basics of Docker takes only a few minutes and will certainly add value to your resume.
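Whichever you choose, it helps to make the expected library versions explicit and verifiable at run time. The sketch below does this with Python's standard importlib.metadata; the package names and pinned versions are purely illustrative assumptions.

```python
# A minimal sketch: check that installed library versions match the pins the
# code was developed against. Package names and versions are illustrative.
from importlib.metadata import PackageNotFoundError, version

PINNED = {"numpy": "1.24.4", "pandas": "2.0.3", "scikit-learn": "1.3.0"}

def check_environment(pinned: dict) -> list:
    """Return (package, expected, installed) tuples for every mismatch."""
    mismatches = []
    for package, expected in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches.append((package, expected, installed))
    return mismatches

if __name__ == "__main__":
    for package, expected, installed in check_environment(PINNED):
        print(f"{package}: expected {expected}, found {installed}")
```

In a container or virtual environment built from a pinned requirements file, this check should come back empty; if it does not, you are no longer running on the environment you tested.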
CD/CI pipelines
The self-contained environment has a number of advantages and lays foundations for CD and CI. The mantra of CD/CI pipelines is to develop in small cycles and deploy often. To be able to achieve this, a lot of deployment processes need to be automated. Let’s break down the parts contributing to a working CD/CI pipeline (Fig. 4).
Source control management
Keeping track of code changes and versions is a crucial part of the project development lifecycle. Especially when a team of developers is working on features individually, it is important that every member of the team has access to the latest working version of the code. The most well-known tool for source control management is Git. It is a free, open source tool adopted by many other software and cloud providers. Just a few Git commands are needed for basic version control operations, and knowing them is essential.
Automated testing
Writing test cases is not the most exciting part of development, but it is very important for reliable code execution. Although the testing of ML models themselves is limited to checking the convergence of the model and the shapes of passed tensors/vectors, there is a lot of code surrounding an ML model that can and should be tested. When transforming data, make sure the code does what you expect it to do: create a dummy array or data set and the expected output, transform the dummy data with the function you wrote, and compare the function output to the expected output. This is unit testing. Then, with each deployment, a set of unit tests is executed to make sure that the code performs as expected.
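As an example, a unit test for a data transformation might look like the following sketch, written in the pytest style; the transformation function and the expected values are hypothetical.

```python
# A minimal unit-test sketch for a hypothetical transformation function.
# Run with pytest, which discovers and executes functions named test_*.
import numpy as np

def scale_to_unit_range(x: np.ndarray) -> np.ndarray:
    """Scale values linearly into the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

def test_scale_to_unit_range():
    dummy = np.array([2.0, 4.0, 6.0])     # dummy input array
    expected = np.array([0.0, 0.5, 1.0])  # expected transformed output
    np.testing.assert_allclose(scale_to_unit_range(dummy), expected)
```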
Automated deployment
All of the preceding steps lay the foundation for this, the automated deployment. Because, in the CD/CI realm, the intention is to develop in small, iterative cycles, a project consists of many small deployments. Having this step automated will save a lot of time deploying and rolling back to the last stable version should the deployment fail. Nothing is perfect the first time, and every developer can make a mistake. Having these processes automated creates a safe way to fail, rebuild, and redeploy.
With the deployment of every new feature, the code has to pass an automated test phase. Should one of the tests fail, the code is not deployed. It is much better to find bugs this way than in production. There are a number of tools available, and, unlike with Docker or Git, there is no single tool that is far better than the others. If you are lucky, there will be a DevOps person in the company managing the automated deployment, which is a more advanced part of CD/CI pipelines. However, if you master Docker, Git, and testing, you will make the life of the DevOps person much easier.
The orchestration of ML pipelines
Once you have your code broken down into modules, where each module is executed inside a container, the container orchestrator defines how these containers are executed and what resources should be allocated to them. Kubernetes and Kubeflow, an ML toolkit that runs on top of it, are probably what come to mind first. Both are free and open source.
Another great open source platform for ML pipeline orchestration is MLflow, which allows for model registration, versioning, and performance tracking; it can be used for Databricks, AzureML, Amazon SageMaker, and Google Cloud ML deployments. Cloud platforms also have their own way to build, orchestrate, and host ML pipelines, and there are many great courses and certifications that teach how to use ML cloud services to build an ML project from start to finish.
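As a small taste of what MLflow's tracking looks like in code, the sketch below logs one parameter and one metric for a run; the names and values are illustrative, not results from a real model.

```python
# A minimal MLflow tracking sketch; parameter and metric values are
# illustrative placeholders, not real experiment results.
import mlflow

with mlflow.start_run(run_name="demo-run"):
    mlflow.log_param("n_estimators", 100)  # a hypothetical model parameter
    mlflow.log_metric("accuracy", 0.93)    # a hypothetical evaluation result
```

Each run, together with its parameters, metrics, and artifacts, then shows up in the MLflow tracking UI, which is what makes model versioning and performance comparison practical.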
Conclusion
This article aimed to show all of the aspects of an ML project beyond the scope of the ML model itself. I hope it helps students and graduates understand a bit more about the ML development lifecycle and the tools required to develop an ML project into a functional pipeline. Knowing these tools will help your resume stand out from others and speed up the learning curve when joining a team. I have also listed a few learning materials in the “Read More About It” section.
Read more about it
- “History of machine learning,” AI in Radiology, London. Accessed on: May 20, 2020. [Online]. Available: https://www.doc.ic.ac.uk/~jce317/history-machine-learning.html
- A. Ng, “Transfer learning,” Coursera, Mountain View, CA. Accessed on: May 18, 2020. [Online]. Available: https://www.coursera.org/lecture/machine-learning-projects/transfer-learning-WNPap
- P. Srivastav, “Docker for beginners,” Docker, Palo Alto, CA. Accessed on: May 18, 2020. [Online]. Available: https://docker-curriculum.com
- G. Venkatesan, “Learn the basics of Git in under 10 minutes,” freeCodeCamp. Accessed on: May 19, 2020. [Online]. Available: https://www.freecodecamp.org/news/learn-the-basics-of-git-in-under-10-minutes-da548267cc91/
- “Learn Kubernetes basics,” Kubernetes. Accessed on: May 19, 2020. [Online]. Available: https://kubernetes.io/docs/tutorials/kubernetes-basics/
- “Top 5 continuous integration (CI) tools in 2019,” BrowserStack, San Francisco, May 2019. [Online]. Available: https://www.browserstack.com/blog/best-ci-cd-tools-comparison/
About the author
Alexandra Posoldova (a.posoldova@ gmail.com) earned her Ph.D. degree in data science in 2017 and has worked in industry ever since. She has been an active volunteer for IEEE for five years. She currently holds the positions of vice chair of the IEEE Queensland Section as well as chair of the IEEE Computational Intelligence Chapter and vice chair of the Blockchain and Internet of Things group, both in the Queensland Section.