“People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.” – Pedro Domingos
In a previous article of our Machine Learning for Everybody series, we gained a general overview of Artificial Intelligence (AI).
We saw that AI is concerned with solving difficult problems in dynamic environments and, by examining some real-world problems, came to understand that intelligent systems use very specific techniques to solve very domain-specific problems.
In this article, we are now ready to look at one of these techniques in detail: Machine Learning.
What is Machine Learning?
If there is only one message that you take away from this entire series on Machine Learning, then let it be the following: machine learning is glorified statistics.
That is, machine learning looks to solve problems by identifying patterns in historical data.
Having internalized these patterns, the software can then classify new, unseen data. The patterns are identified using statistics and internalized (i.e. “learned”). Therefore, any problem that lends itself well to statistical analysis lends itself to machine learning.
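To make "identifying patterns in historical data" a little more tangible, here is a minimal sketch in Python. It assumes a made-up dataset of animals described by two numbers (height and weight), and it uses one very simple statistical pattern: the average of each group. The function names and the data are purely illustrative.

```python
from statistics import mean

def learn_centroids(examples):
    """Average the feature vectors of each label: this average is the learned 'pattern'."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def classify(centroids, features):
    """Assign the label whose average is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: distance(centroids[label]))

# Hypothetical historical data: (height in cm, weight in kg) with a label.
history = [
    ((30, 4), "cat"), ((25, 5), "cat"), ((28, 3), "cat"),
    ((60, 25), "dog"), ((55, 30), "dog"), ((65, 28), "dog"),
]

centroids = learn_centroids(history)        # "learning" from historical data
prediction = classify(centroids, (58, 27))  # classifying new, unseen data
print(prediction)
```

Nothing in this sketch knows what a cat or a dog is. It only knows that the new measurements sit closer to one group's average than to the other's.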
It is important to remember, however, that not all types of problems lend themselves well to pattern recognition. Furthermore, using pattern recognition to build up knowledge and construct a model of the world is only a very specific type of learning. There are many other forms of learning: forms at which humans excel, and that we do not yet fully understand. Nevertheless, using statistical analysis, we can build programs that learn and improve over time, without requiring manual intervention.
Usually, an initial set of examples (called a “training dataset”) is used to "infuse" the software with knowledge about a very specific domain. As the software is being used, any unseen data that the software classifies is automatically added back into the training dataset. Depending on the type of problem being solved, users sometimes provide feedback as to the accuracy of the software’s prediction or classification. This feedback is then incorporated into the training cycle, resulting in software that becomes smarter and smarter, the more that it is being used.
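The feedback cycle described above could be sketched roughly as follows. This is a toy illustration under stated assumptions: the `FeedbackLearner` class, its `use` method and the one-number "data" are all hypothetical, and the classifier simply copies the label of the closest known example.

```python
class FeedbackLearner:
    """Toy nearest-neighbour classifier whose training set grows with use."""

    def __init__(self, training_set):
        self.training_set = list(training_set)  # (value, label) pairs

    def predict(self, value):
        # Copy the label of the closest known example.
        nearest = min(self.training_set, key=lambda example: abs(example[0] - value))
        return nearest[1]

    def use(self, value, feedback=None):
        """Classify `value`, then store it with the user's correction if one is given."""
        predicted = self.predict(value)
        label = feedback if feedback is not None else predicted
        self.training_set.append((value, label))
        return predicted


learner = FeedbackLearner([(1.0, "low"), (9.0, "high")])
before = learner.predict(5.8)              # 9.0 is the closest known example
answer = learner.use(6.0, feedback="low")  # the user reports that the prediction was wrong
after = learner.predict(5.8)               # the corrected example is now the closest
print(before, answer, after)
```

One caveat worth keeping in mind: when no feedback is given, the sketch stores its own prediction as if it were correct, so an early mistake can reinforce itself. Real systems usually need some safeguard against this.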
Just like artificial intelligence in general, machine learning is inspired and influenced by a wide range of disciplines. The fusion of concepts from biology led to the development of neural networks (upcoming article) and genetic algorithms; the cognitive sciences inspired case-based reasoning (upcoming article) and advances in information theory resulted in decision trees.
At this point, these terms might sound alien. But don’t worry, we will cover them in detail in the upcoming articles. For now, just remember that machine learning draws from different concepts and builds on the understanding of the world which other fields of study brought to light.
How does machine learning differ from other problem-solving techniques?
So far, we have discovered that machine learning is a problem-solving technique that allows computer programs to learn patterns from data using methods borrowed from statistics. That is, the program developed using machine learning techniques looks at lots of data and then identifies patterns. It uses these patterns and tries to apply them to new, unseen data to determine "something" about this new data. But this all sounds very abstract. What exactly does this mean in practice?
Let’s look at a real-world use-case to answer this question: Patriot One Technologies Inc.
Patriot One Technologies Inc. is a Canadian defence company that provides covert weapons detection systems. Using a low-power impulse radar, the system creates a signal signature (think of this as simply a "digital representation" or “digital image”) of each person in range of the radar. Once created, these signal signatures are compared to the signatures in the database. The signatures in the database correspond to a wide range of signatures created of people carrying different types of concealed weapons – from knives and guns to bombs. When a new signature matches that of a signature found in the database, the system alerts a security guard or control centre, or sends a notification to law enforcement.
In classical programming, developers write computer code that describes the exact steps that the computer needs to take in order to detect each type of weapon. That is, the developers would first write code for translating the radar signals into some digital representation. For the sake of simplicity, let’s assume that this is simply an image created by translating the radar signals. They would then write a precise set of instructions that, using the image as an input, would allow the computer to detect different image features. For example, edges, straight lines, corners and certain textures.
Then, for each type of weapon, they would describe the steps involved in detecting them. For example, a rifle might consist of two long, straight lines (the barrel), followed by two smaller curved lines, ending in some lines shaped like a triangle (the stock).
Depending on the gun, the textures, positioning of edges or length of the lines may differ. The same is repeated for every single weapon type.
One doesn’t need to be a programmer to realize that this approach is very cumbersome and error-prone, if not almost impossible. There is a huge variety of knives, guns and bombs. Their shapes, sizes and textures differ immensely and depending on the angle of the weapon, or the person, the weapon may only be partially visible.
Writing a precise set of steps to take into account all of the possible variations may therefore not even be feasible! And even if it were, the developers would need to write new code every time a new weapon makes it onto the market or every time a new possible angle, shape or size becomes available.
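To get a feeling for why this is so cumbersome, consider a deliberately naive sketch of the classical approach in Python. It assumes that some earlier step has already extracted image features into a dictionary; the feature names and the rules themselves are invented for illustration.

```python
def looks_like_rifle(features):
    # Hand-written rule: two long lines (barrel), curved lines, triangular stock.
    return (
        features.get("long_straight_lines", 0) >= 2
        and features.get("short_curved_lines", 0) >= 2
        and features.get("triangular_outline", False)
    )

def looks_like_knife(features):
    # Hand-written rule: a short blade edge ending in a sharp corner.
    return (
        features.get("short_straight_lines", 0) >= 1
        and features.get("sharp_corners", 0) >= 1
    )

def detect_weapon(features):
    # Every weapon type needs its own explicit rule.
    if looks_like_rifle(features):
        return "rifle"
    if looks_like_knife(features):
        return "knife"
    return "no threat"

full_view = {"long_straight_lines": 2, "short_curved_lines": 2, "triangular_outline": True}
partial_view = {"long_straight_lines": 2, "short_curved_lines": 1}  # stock is hidden

print(detect_weapon(full_view))
print(detect_weapon(partial_view))
```

The same rifle is detected when fully visible but missed as soon as part of it is hidden, because the rule only fires when every condition is met. Covering each such variation means writing yet another rule.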
Using machine learning, the developers approach the implementation of such a system not by describing a set of steps that lead to classification, but by feeding a large set of “images” into the system, and helping the system identify the characteristics of specific threats.
That is, the developers first produce a large dataset that contains both digital signatures (or “images”) of people carrying concealed weapons, and of people not carrying concealed weapons. They would then label these images according to the threat presented in them (“pistol”, "rifle", "knife", "bomb" or “no threat”) before feeding them into the system. The system would then use machine learning techniques to determine the characteristics of both “threat” and “no threat” images, and store and model them in such a way that they can be easily accessed. Once the radar signals create new, unseen images, the system will then check these images to see whether they match the characteristics of a known “threat” or “no threat” image.
To do so, the software will not try to measure the length of different lines or follow a series of steps for comparing the shape of different objects. Instead, it will simply take the image as a whole, and see to what degree the arrangement of the data that composes the image, corresponds to the characteristics that constitute one of the identified threats.
The advantages of this approach are manifold.
Firstly, since the problem lends itself well to pattern recognition, this approach allows us to effectively deal with variations in image quality and account for distortions as well as variations in angle and distance.
Secondly, previously unseen images can be fed back into the training dataset once classified, allowing the system to "learn" and improve over time.
Thirdly, once new weapons reach the market, the system can be updated with minimal effort: one simply needs to record sample signatures of people carrying these weapons and then add them to the training dataset. The underlying code will not need to be updated or modified.
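The third advantage can be sketched in a few lines of Python. The example below is an assumption-laden toy, not a description of how Patriot One's system works: each "signature" is a made-up vector of three numbers, and classification is a simple majority vote among the three most similar training examples (a technique known as k-nearest neighbours).

```python
from collections import Counter

def knn_label(training_set, signature, k=3):
    """Majority vote among the k training examples closest to `signature`."""
    by_distance = sorted(
        training_set,
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], signature)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled signatures.
training_set = [
    ((0.9, 0.1, 0.0), "rifle"), ((0.8, 0.2, 0.1), "rifle"), ((0.7, 0.1, 0.1), "rifle"),
    ((0.1, 0.9, 0.0), "knife"), ((0.2, 0.8, 0.1), "knife"), ((0.1, 0.7, 0.2), "knife"),
    ((0.0, 0.1, 0.1), "no threat"), ((0.1, 0.0, 0.0), "no threat"), ((0.1, 0.1, 0.1), "no threat"),
]

new_weapon = (0.1, 0.1, 0.9)  # signature of a weapon type the system has never seen
before = knn_label(training_set, new_weapon)

# Updating the system: record sample signatures and add them to the training data.
training_set += [
    ((0.1, 0.2, 0.9), "crossbow"),
    ((0.0, 0.1, 0.8), "crossbow"),
    ((0.2, 0.1, 0.9), "crossbow"),
]
after = knn_label(training_set, new_weapon)
print(before, after)
```

Note that `knn_label` itself is never touched. Before the update, the unknown weapon is waved through as "no threat"; after three new examples are added, it is recognized.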
Defining machine learning
So far, we learned that machine learning is all about creating a model of the world using statistics and pattern recognition. By giving a real-world example of its application, we developed an understanding of how solving a problem using machine learning differs from solving a problem using traditional programming. What we do not yet have, however, is a precise definition of machine learning.
What exactly do we mean when we say that software "learns"? How can we define this process?
Luckily, defining what it means for a program to learn is a bit easier than defining intelligence or artificial intelligence. In his book "Machine Learning" (McGraw-Hill Science/Engineering/Math, March 1, 1997, ISBN 0070428077), Tom M. Mitchell gives the following definition:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Now that’s a precise definition if there ever was one. But what exactly does this mean? In essence, Mitchell’s definition is a fancy way of saying that a program qualifies as a machine learning program if, given some clear, predefined set of tasks, the program gradually becomes better at performing these tasks as it gains experience. To know whether someone or something is getting better at performing a task, we need some way of measuring or quantifying their performance. This is what the "performance measure P" refers to in his definition. The "experience E" refers to the examples and feedback that the program is exposed to while it is being trained and used. Stripping out the awkward letters, we can simplify his definition to read:
"A computer program performing a set of clearly defined tasks is said to learn from experience if, given an objective performance measure, the program improves at performing these tasks over time"
Designing a machine learning-powered program
Irrespective of what type of problem we are trying to solve using machine learning, there are several concrete design choices that developers need to make before writing even a single line of computer code. These design choices are closely linked to the definition of the term "machine learning" that we presented in the previous section. That is, one must:
1. Precisely define the objective that is to be accomplished.
2. Identify how to measure the software’s performance when accomplishing this objective.
3. Determine exactly what type of knowledge is to be learned.
4. Determine how this knowledge is to be learned.
Let’s look at these items in turn. Of utmost importance is the task definition.
This merely involves defining what it is that the program is meant to accomplish, in such a way that we can quantify (measure) it (a general description here won’t suffice). In the case of the threat detection system developed by Patriot One Technologies Inc, we must precisely define what it means to detect a threat. Simply saying that the software "detects a threat" is not practical and will not help us with the implementation of such a system. Instead, we must define what a threat is.
In this case, a threat is defined as a person carrying a weapon. So, the task of the threat detection system can be stated as classifying a digital signature (i.e. "image" created by radar) using 4 labels: "no threat / no weapon", "gun", "knife", "bomb". If the person in the image is carrying a bomb, the software should apply the label "bomb" to it; in the case of a concealed knife, the label "knife", and so on.
Next, we must determine how we can measure how good the program is at performing its classification task. Luckily, in this case, the performance measurement is fairly straightforward: we simply implement a mechanism for counting the number of false positives and false negatives. That is, how many times does the program falsely classify somebody as carrying a knife/gun/bomb whilst they aren’t; and how many times does the program classify somebody as not carrying a concealed weapon, whilst in fact they are. Over time, the number of incorrect classifications should diminish.
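Counting false positives and false negatives is simple enough to sketch directly. The snippet below assumes that we have two lists of equal length, one with the true labels and one with the program's predictions; the lists and the function name are made up for illustration.

```python
def count_errors(actual, predicted, negative_label="no threat"):
    """Return (false positives, false negatives) for a threat / no-threat decision."""
    # False positive: nobody carried a weapon, but the program raised an alarm.
    false_positives = sum(
        1 for a, p in zip(actual, predicted)
        if a == negative_label and p != negative_label
    )
    # False negative: a weapon was present, but the program saw no threat.
    false_negatives = sum(
        1 for a, p in zip(actual, predicted)
        if a != negative_label and p == negative_label
    )
    return false_positives, false_negatives

actual    = ["no threat", "gun", "knife",     "no threat", "bomb", "no threat"]
predicted = ["gun",       "gun", "no threat", "no threat", "bomb", "knife"]

fp, fn = count_errors(actual, predicted)
print(fp, fn)
```

This simple count only looks at "threat" versus "no threat". Mistaking a gun for a knife is a third kind of error that it does not capture, and whether that matters depends on how the task was defined.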
Third on our list is the task of determining what "type of knowledge" to learn. In the case of the threat detection system, the software must be able to identify shapes and then classify them.
When defining the type of knowledge to deal with, machine learning experts define a "target function". This is a fancy way of saying that they determine a method for accepting and specifying input data, and then mapping this input to an output which maximizes the performance measure.
What do we mean by that?
Given a digital signature or image as an input, the target function acts as the formula for predicting whether the person represented by this input image is carrying a weapon (and if so, predicting the type of weapon).
Having defined the objectives, performance measure and type of knowledge to learn, one must determine how the knowledge that is used to make predictions or classifications is to be learned or assimilated. Experts call this the “training experience”. Tom Mitchell summarizes this notion well in his book "Machine Learning" by using the example of learning how to play checkers:
“The type of training experience available can have a significant impact on success or failure of the learner. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system. For example, in learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each. Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.”
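The four design choices, and their link to Mitchell's definition, can be put together in one toy sketch. Everything here is an assumption made for illustration: the task T is to label a number as "small" or "big", the performance measure P is the share of correct answers on a fixed set of test inputs, the target function is a single threshold, and the training experience E consists of direct, labelled examples.

```python
def true_label(x):
    # The rule we want the program to discover (unknown to the learner).
    return "big" if x > 50 else "small"

def learn_threshold(experience):
    """Target function: one threshold halfway between the two labelled groups."""
    largest_small = max(x for x, label in experience if label == "small")
    smallest_big = min(x for x, label in experience if label == "big")
    return (largest_small + smallest_big) / 2

def performance(threshold, test_inputs):
    """Performance measure P: fraction of test inputs that are labelled correctly."""
    correct = sum(
        ("big" if x > threshold else "small") == true_label(x) for x in test_inputs
    )
    return correct / len(test_inputs)

# Training experience E: direct examples, each with the correct answer attached.
experience = [(5, "small"), (60, "big"), (20, "small"),
              (52, "big"), (40, "small"), (49, "small")]
test_inputs = range(0, 101)

# Measure P after the program has seen 2, 4 and 6 examples.
scores = [
    performance(learn_threshold(experience[:n]), test_inputs)
    for n in (2, 4, 6)
]
print(scores)
```

With this particular data, the score rises as more examples are seen, which is exactly what the definition asks for: performance at T, as measured by P, improves with E.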
Common problems in machine learning
So far, we defined machine learning, discussed how it differs from other problem-solving techniques, and talked about how one would go about designing a machine learning-powered program. We have looked at some real-world examples, and know that machine learning does well for problems that lend themselves to statistical analysis.
If machine learning is such a great solution, then why do many systems that claim to use it still perform badly? Why aren’t self-driving cars readily available yet, and why haven’t all repetitive manual tasks been replaced by robots?
Well, the short answer is that, aside from the real world being a very complex place, machine learning still faces many problems and challenges.
One such challenge is building a machine learning program that works too well with the training data. That is, one that fits the training data perfectly, and therefore cannot solve problems using new or unseen data. Experts call this overfitting, and it can be best understood as an over-calibration of the software. That is, the software is trained so well that it looks for the exact pattern exhibited by the training data, instead of allowing for the noise introduced by unseen data. The concept is best illustrated in the figure below: here we see the performance of a given program that classifies some data. The y-axis indicates the number of false classifications, whilst the x-axis indicates the time or number of training iterations run. We therefore see that, over time, the number of false classifications reduces when we use the training data (thick line). If the program were to behave correctly, then we would also see the number of errors on unseen data reduce with time. But instead, after a certain number of training iterations, the number of false classifications on unseen data increases (dotted line).
How can this be?
Well, consider a walk in the woods on a rainy autumn day. Everything is wet and the ground is muddy. You see two sets of footprints in front of you. The first footprint consists only of a rough outline and is all smudged. The print isn’t very deep or well-formed and could belong to any grown adult. The second footprint goes deep into the mud. The profile of the boot that made it is clearly visible, the imprint is deep and the outline well-preserved. As you try to replicate the footprints, you will quickly see that the first, smudged print is quite easy to replicate. You tread lightly, walk quickly and maybe wiggle your foot a bit from side to side in the process.
The second footprint, however, will be very difficult to replicate. Unless you have the exact same boot, the exact same weight and the exact same shoe size as the person that walked before you, you are unlikely to be able to replicate this footprint.
What does any of this have to do with overfitting? Well, the precise, clearly formed footprint is the imprint left on the system by the training data when overfitting took place. The imprint left by the training data is perfect, and can only be matched if new data precisely matches the training data. Since this is rarely the case, predictions or classifications made by the overfitted system are likely going to be incorrect.
Therefore, instead of a perfect imprint, we want the training data to generate a more vague, general record of the pattern that it detected whilst training. If the imprint is too precise, then it will reflect only the training data, but not the real world.
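The difference between a "perfect imprint" and a "smudged" one can be reproduced with a small Python experiment. The data below is invented: one number per example, where low values are mostly "safe" and high values mostly "threat", plus two deliberately mislabelled training examples that play the role of noise. The two learners are simple stand-ins chosen for illustration.

```python
def fit_memorizer(examples):
    """Perfect imprint: always copy the label of the closest training example."""
    def predict(x):
        return min(examples, key=lambda example: abs(example[0] - x))[1]
    return predict

def fit_centroids(examples):
    """Smudged imprint: remember only the average position of each label."""
    sums = {}
    for x, label in examples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}
    def predict(x):
        return min(centroids, key=lambda label: abs(centroids[label] - x))
    return predict

def error_count(predict, examples):
    return sum(predict(x) != label for x, label in examples)

# Hypothetical training data; the examples at 4 and 9 are mislabelled noise.
training = [(1, "safe"), (2, "safe"), (3, "safe"), (4, "threat"),
            (6, "threat"), (7, "threat"), (8, "threat"), (9, "safe")]
unseen = [(3.6, "safe"), (4.4, "safe"), (8.8, "threat"), (9.4, "threat")]

memorizer = fit_memorizer(training)
general = fit_centroids(training)

memorizer_errors = (error_count(memorizer, training), error_count(memorizer, unseen))
general_errors = (error_count(general, training), error_count(general, unseen))
print(memorizer_errors, general_errors)
```

The memorizer makes no mistakes on the training data but gets every unseen example wrong, because it has faithfully learned the noise. The vaguer model makes two mistakes on the training data (the two noisy examples) and none on the unseen data.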
Luckily, there exist several ways of preventing overfitting. The first is to pay close attention to your training data, and ensure that the examples that you use for training are an adequate representation of the "real world". Your examples should be adequately varied, to cover the possible scenarios that the program can encounter as best as possible.
In the case of the Patriot One Technologies threat detection system, this means training the system with examples that include as many different types of knives, guns and bombs as possible, as well as examples without any threats at all. Furthermore, the sample images should record these objects at different angles and locations to capture as much noise and variation in size, distance etc. as possible. By using only one size of weapon, one angle or one type, we will ensure a perfect "imprint" of this specific scenario, whilst preventing the system from recognizing the threats when one or more of the aforementioned factors changes.
Another method of preventing overfitting is to use statistical analysis to measure when to stop the training process. In other words: we split our examples into two different datasets. For example, we could use 80% of the examples to train the system, whilst using the remaining 20% to test how accurate the predictions are. If we reach a certain point in time during training after which the number of incorrect predictions on the test examples starts to increase, we know that it is time to stop training. This notion is illustrated in figure 4.4.
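Both halves of this idea, the split and the stopping rule, can be sketched in a few lines. The code below is a simplified illustration: the error counts are invented, and the `patience` parameter (how many iterations without improvement we tolerate before stopping) is an assumption added for the sketch.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples, then cut them into a training set and a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def best_stopping_point(test_errors, patience=2):
    """Return the training iteration with the fewest errors on the test set.

    Gives up once `patience` iterations have passed without a new best.
    """
    best_iteration, best_error = 0, test_errors[0]
    for iteration, error in enumerate(test_errors):
        if error < best_error:
            best_iteration, best_error = iteration, error
        elif iteration - best_iteration >= patience:
            break
    return best_iteration

train, test = split_examples(range(100))

# Hypothetical number of wrong predictions on the test set after each iteration.
test_errors = [40, 31, 25, 22, 21, 23, 26, 30]
stop_at = best_stopping_point(test_errors)
print(len(train), len(test), stop_at)
```

Here the errors on the test set fall until iteration 4 and then start climbing again, so that is where training should have stopped.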
Of course, there exist other ways, and even those described here are over-simplifications. Getting suitable training data, and thoroughly testing and training the system is a difficult task that should not be under-estimated. Many times, neither of the two is financially feasible or realistic. As we previously discussed, often the right examples may exist in the real world, but are difficult to extract, collect or formalize (we can’t stress this point enough). Data might be scattered across notes, emails, letters and different systems. Analyzing these different media, recording and organizing the required information into thousands and thousands of labelled training examples might simply not be possible.
The learning experience
The aforementioned issues are not the only problems encountered by machine learning experts. Many of the challenges around developing machine learning-powered systems are more abstract and require ways of looking at the world differently. For example, before arriving at the problem of overfitting, one common difficulty is defining the correct training experience to use in the first place.
Here, there exist three general categories, and all are fundamentally different.
The first type of learning is the one we have spoken about so far. It is called "supervised learning", which is a fancy name for just saying that your training data is labelled. We discussed the case of the threat detection system using supervised learning techniques, as the digital signatures/images that were fed into the system during training were labelled according to the threat which they represented.
Supervised learning has the obvious advantage that, because the data is labelled, it provides correct and concise examples to learn from. The obvious disadvantage, of course, is the fact that somebody needs to label the training data. In many cases, this involves people manually classifying, labelling or categorizing thousands upon thousands of data items, such as images or rows in Excel spreadsheets. Furthermore, by using labelling, we create a natural boundary as to the amount and type of knowledge that the system can learn. Whether this is an issue or not depends largely on the problem at hand.
When hearing machine learning experts talk about supervised learning, they might give exotic-sounding definitions, such as:
“The problem of supervised learning involves learning a function from examples of its inputs and outputs”
They might also formally define supervised learning as:
“Given a collection of examples of f, return a function h that approximates f”
Both of the above statements were taken from the book "Artificial Intelligence – A Modern Approach" by Stuart Russell and Peter Norvig. To make sense of these statements, let’s first explain what a function is. Technically speaking, a function is something that takes an input and maps it to an output.
Colloquially speaking, we could just call a function a "method" or a "formal way of doing something". This allows us to re-phrase "the problem of supervised learning involves learning a function from examples of its inputs and outputs" to: by knowing what the expected output for a given input is, supervised learning recognizes patterns in the input data and uses these patterns to map unseen data to the desired outputs. In other words, using labelled input data, a hypothesis about the data is formed (hence why the function is often labelled h in technical papers or books).
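The formal statement "given a collection of examples of f, return a function h that approximates f" translates almost word for word into code. The sketch below assumes that f happens to be a straight line and fits h using ordinary least squares; both the choice of f and the fitting method are assumptions made for illustration.

```python
def learn_line(examples):
    """Return a hypothesis h(x) = slope * x + intercept fitted by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in examples)
        / sum((x - mean_x) ** 2 for x, _ in examples)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def f(x):
    # The "unknown" function. The learner never sees this code, only examples of it.
    return 2 * x + 1

examples = [(x, f(x)) for x in range(5)]  # labelled data: inputs with expected outputs
h = learn_line(examples)                  # the learned hypothesis

print(h(10), f(10))
```

The learner is only ever handed pairs of inputs and outputs. From those, it produces h, which can then be applied to inputs such as 10 that were never part of the examples.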
The second "type" of learning is called "unsupervised learning". Geoffrey Hinton summarized this approach perfectly in just two sentences:
"When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. "
Unsupervised learning is much closer to how humans learn. Most of the time, we learn by either observation or trial and error. But outside of a school, university or other teaching environments, we rarely learn by being presented with a list of examples along with their meaning or classification. Instead, we learn through observation. That is why we can often compare supervised learning to “book learning”, and unsupervised learning to “the university of life”.
With supervised learning, you are essentially taking a class with a teacher (the data doing the actual teaching), whilst with unsupervised learning, you are on your own. A surfer, for example, can learn which waves are worth catching and which waves are too steep, too flat or not worth catching, without ever needing to be presented with exact labels for each. Likewise, a farmer can identify what a good day or a bad day for working on the fields is by stepping outside his house in the morning. He developed a feeling for the weather, without ever being explicitly given a classification for it.
To put the two strategies into the perspective of our accompanying example (the threat detection system) we can summarize the two learning methods as follows:
With supervised learning, we are given examples of digital signatures or images, along with information as to what they represent, and we want to know what any unseen data represents. With unsupervised learning, on the other hand, we are given examples of digital signatures, but don’t know what they represent. When running new data through the system, we simply want to categorize the data into one of 4 piles, each pile containing similar signatures.
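Sorting unlabelled data into piles can be sketched with a small clustering routine. The example below uses a simplified, one-dimensional version of a well-known technique called k-means; the numbers standing in for signatures are invented, and the way the starting centres are picked is a shortcut chosen to keep the sketch deterministic.

```python
def k_means_1d(values, k, iterations=10):
    """Group numbers into k piles of similar values. No labels are involved."""
    ordered = sorted(values)
    step = max(1, len(ordered) // k)
    centres = ordered[::step][:k]  # simple, deterministic starting guess
    for _ in range(iterations):
        piles = [[] for _ in centres]
        for value in values:
            # Put each value on the pile whose centre is closest.
            nearest = min(range(len(centres)), key=lambda i: abs(centres[i] - value))
            piles[nearest].append(value)
        # Move each centre to the average of its pile (keep it if the pile is empty).
        centres = [sum(p) / len(p) if p else c for p, c in zip(piles, centres)]
    return piles

# Hypothetical unlabelled "signatures", reduced to a single number each.
signatures = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0, 13.0, 12.7, 13.2]

piles = k_means_1d(signatures, k=4)
for pile in piles:
    print(sorted(pile))
```

The routine ends up with 4 piles of similar signatures, but it has no idea what any pile means. Deciding that one pile corresponds to "knife" and another to "no threat" would still require outside knowledge.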
The third type of learning strategy is called “reinforcement learning”. This form of learning uses the feedback produced by the environment itself as a teacher. This way of learning is not concerned with identifying a pattern from a sample dataset but builds up knowledge by pairing up actions with positive and negative feedback.
A robot – such as the Roomba – trying to learn the layout of an apartment is a perfect example of where this learning strategy is employed in the real world. The robot itself is equipped with a sensor that sends a signal to the robot’s "brain" as soon as it hits something hard. Furthermore, when the robot is first placed onto the apartment’s floor, it knows nothing about its environment. It has no map, and no indication as to the size and shape of the apartment or the objects in it. The robot is turned on, and simply starts driving, recording its movements. As soon as it hits something hard, its sensors send a signal. This signal is "feedback" that indicates something about the environment (in this case the fact that the robot can’t proceed). Upon receiving the feedback, the robot records it in its robot brain, and then turns left, right or reverses to find a new path. Over time, it thereby builds a dynamic map of its surroundings.
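The bump-and-record loop of such a robot could be sketched as follows. This is a heavily simplified illustration, not a full reinforcement learning algorithm and not how any real vacuum robot is programmed: the apartment is a tiny made-up grid, the obstacle positions are invented, and the only feedback is whether a move was blocked.

```python
# Hypothetical apartment: a 4 x 3 grid with three obstacles unknown to the robot.
WALLS = {(1, 0), (1, 1), (3, 2)}
WIDTH, HEIGHT = 4, 3

def bump(cell):
    """The environment's feedback: True if the robot hits something at `cell`."""
    x, y = cell
    return cell in WALLS or not (0 <= x < WIDTH and 0 <= y < HEIGHT)

def explore(start=(0, 0)):
    """Drive around, recording for every attempted move whether it was blocked."""
    known_free, known_blocked = {start}, set()
    to_visit = [start]
    while to_visit:
        x, y = to_visit.pop()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            target = (x + dx, y + dy)
            if target in known_free or target in known_blocked:
                continue                   # already learned about this cell
            if bump(target):
                known_blocked.add(target)  # negative feedback: cannot go there
            else:
                known_free.add(target)     # positive feedback: path is clear
                to_visit.append(target)
    return known_free, known_blocked

free_cells, blocked_cells = explore()
print(len(free_cells), sorted(WALLS & blocked_cells))
```

The robot starts with no map at all. By the end, it has learned every reachable cell and the position of every obstacle purely from the feedback that its own actions produced.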
The choice of which learning strategy to use can be difficult (sometimes mixed strategies are used), and primarily depends on three factors:
i) the nature of what is to be learned;
ii) what type of feedback and performance measurements can be applied for learning; and
iii) how the information learned can be modelled or represented.
At the beginning of this article, I wrote that “if there is only one message that you take away from this entire series on Machine Learning, then let it be the following: machine learning is glorified statistics”. As we come to the end of this article, I want to reiterate this statement, as it goes a long way towards understanding what machine learning really is, and what its limitations are.
Once you understand that machine learning solves problems primarily by identifying patterns in data, then you will quickly be able to see through any false claims or fishy marketing tactics when it comes to products that claim to use machine learning to make “smart decisions”. Furthermore, when you encounter a product or project that claims to use machine learning, ask yourself why it claims to use it. By using statistics and lots of data, we can build programs that learn and get better over time. Therefore, "learning" and improving over time should be an integral part of the given product’s objective.
Aside from understanding what machine learning is, and how it differs from other problem-solving techniques, we explained some (not all) of the common problems and challenges faced by experts building machine learning-powered systems. These problems include choosing the right training experience, and getting the actual training process correct. Specifically, experts must be careful not to "over-train" their software (a process that, in technical jargon, is called "over-fitting").
Equipped with a general knowledge of machine learning, we are now ready to examine individual machine learning techniques in more detail. In other words, the upcoming articles will become slightly more technical, as we will discover how exactly to "learn" from data.
This article has been editorially reviewed by Suprotim Agarwal.