A casual intro to Machine Learning
Unless you're a member of an isolated ancient tribe living under one of the six remaining trees in what used to be the Amazon rainforest, you have almost certainly heard the term "Machine Learning" floating past within the last few years.
This blog post will explain the main ideas behind machine learning and try to show you why you should care about any of them in the first place.
[update] The example problem used throughout this blog post has been changed in response to readers' feedback to which we couldn't turn a blind eye.
Solving hard problems
A computer is capable of remarkable feats. In fact, personally, I'm convinced that if humanity doesn't eradicate itself prematurely, there won't be anything left humans can do that can't be done much better, faster and cheaper by a suitably designed and programmed computer (or a network of them).
Even though a computer can do just about anything, making it do what you want it to do can be very hard indeed. Depending on the problem at hand a programmer needs to be rather ingenious and resourceful.
Machine learning is mainly concerned with clever ways of programming a computer - although for most (if not all) techniques, optimised hardware can significantly improve performance.
To highlight the value of machine learning properly, an example should help. I'm going with one any cat owner (slave) should be familiar with. So, let's imagine you've got a cat, called Ada.
One fine Sunday afternoon, you're reading a novel while Ada's sitting on a bookshelf in front of you, staring idly into the middle distance. She hasn't moved for at least ten minutes. Next to her on the shelf is a small stone you like and have put there as an ornament.
Now you notice that some time during the last few seconds, she has moved her paw to a new position. While it was previously resting delicately on the shelf next to its sibling, it's now suddenly hovering above the shelf, just behind the stone.
You've only got time to yell 'nooooo' before she gently pats the stone on the back a few times, shoving it from the shelf onto the ground. She then returns to her earlier peaceful posture - this time you hear an added soft, content purr while she had been completely silent before.
What does Ada try to tell you, exactly? What does that purr mean?
A multitude of possible answers present themselves to you:
- She's bored and wants some attention.
- She's hungry.
- She has a particular dislike of stones.
- She has a particular dislike of that particular stone, possibly because it said something rude to her.
- She considers the shelf to be her property and she feels she should be the only thing on it.
- She's trying to point out a major shortcoming in your attitude towards her.
- She's working on an equation that will turn out to be the next revolution in quantum mechanics and the whole episode is an experiment to check the validity of her theories. Her theory stood up to this particular test, hence the purr.
- A multitude of very nuanced variations on or combinations of the possibilities above.
- She enjoys seeing you try to work out why she did it.
- The correct answer, definitely not one of the above.
Now which is it? You just don't know, and I certainly don't either.
The 'Ada Interpretation Problem'
However, computers are doing all sorts of stuff you wouldn't know how to do yourself these days, like parsing PDF files and running Facebook. So can't anyone build software to solve this problem for you?
Might it be possible to write software that understands anything your cat is actually trying to have you come up with all by yourself based on the sounds she makes and her behaviour?
To state the question more clearly, as you wish she would:
Is it possible to create software that can interpret and decode anything your cat does, translating it to a message that actually tells us what she wants from us?
I'll call this the 'Ada Interpretation Problem' or 'AIP'. (Which might also mean 'Artificial Intelligence Problem'. This must, no doubt, have some sort of cosmic significance).
A classic solution
In order to write software that can solve the Ada Interpretation Problem, an old-school programmer would have to come up with all the rules (in perfect detail!) that will take the problem specification of any sound your cat makes, and what she was doing while she made it, as input and end up with what we need to know: a sentence communicating the actual message she is sending.
The task at hand
Solving the problem comes down to correctly reconciling a host of very complex aspects:
- Any sounds Ada produces: the category of those sounds (meow, purr, yowl, hiss, ...), sound intonation, volume, ...
- Facial expression. You might be tempted to think a cat would have just two facial expressions (mouth open, mouth closed) but nothing is further from the truth - Ada somehow pulls off a supremely condescending arrogance in the configuration of her whiskers alone.
- Tail motion and velocity.
- Time since last meal.
- Time since last attention paid.
- Time since last destructive act performed.
- Paw configuration.
These are some aspects of the problem - I will have overlooked some, cause who knows, really? - in which we hope lie the clues that should suffice to decode the actual message.
That means that these aspects should be included, in some way or other, as inputs to any software program (or app, as all those hip youngsters call it nowadays) that is tasked with doing the translation for you, in order to provide it with enough information to do its job.
Now, the programmer's job is to map any specific problem instance - an occurrence, a specific sound your cat made plus all the circumstantial details of tone, volume and body language, like the 'stone shove' example above - to a solution:
- First, the programmer would have to find a way to represent each of the problem's aspects inside a computer. For example, he has to decide how to represent any sound type Ada might make: possibly he will assign a number to each sound he can recognise, but for many he won't know which type of sound it is exactly - and how can he represent subtle concepts like 'intonation'?
- Then he would have to come up with a set of rules that can be applied to the representation of a specific problem instance and lead to a correct solution.
In short, the programmer would have to understand your cat perfectly in order to teach a computer to reach the same 'understanding' if he wants to succeed using a classic approach.
The rules for interpreting sound volume alone would have to be dauntingly complex, let alone those for correctly evaluating a series of actions like the one in our example in combination with purr volume and the exact total time the stone has been on that shelf.
Let's say our programmer assigned a number to each sound category he could recognise, and then built lots of rules based on that assignment. What if next time Ada makes a weird sound nobody's ever heard before (and she will, rest assured) and for which he consequently doesn't have a number yet?
Also, the programmer is bound to have failed to identify some aspect of Ada's behaviour which is of paramount importance in decoding her messages and not having incorporated this aspect as input to his algorithms will prevent them from ever producing correct answers to start with - they just won't have the information needed to arrive at a correct answer.
It should be obvious, then, that a classic approach to solving the Ada Interpretation Problem will fail miserably. If you don't know how to program you wouldn't know:
- how to begin
- how to proceed
- how to end
And if you do know how to program, you wouldn't know any of those things either. The rules involved are just too complex.
A programmer has no chance at all of capturing every hard rule, soft rule, nuance and exception in a formal playbook that can be formulated and specified using computer code resulting in a program that can figure out The Hidden Message for any Ada Interpretation Problem instance one cares to throw at it.
A New Hope: Machine Learning
However, there is a slight glimmer of hope on the horizon and its name is Machine Learning. A programmer has no chance at all of keeping track of all the rules and complexities involved in solving your problem, but a computer might.
What machine learning algorithms do
The crux of programming software to solve a problem is nothing more than mapping questions/problems (input) to answers/solutions (output). In figuring out the steps in between lies the challenge and for many problems a programmer can't be expected to rise to it.
In order to solve those tough problems, here's a clever bit:
"Why not have the computer figure out how that mapping should be done?"
In other words: instead of telling the computer exactly how to solve a problem, why don't we tell the computer to figure it out for us? Why don't we just tell the thing about our problems and have it deal with it while we go and complain about them to a bartender?
That is what Machine Learning is all about. In general, it comes down to giving a computer lots of data about the problem at hand. Feed it everything you know - this must be nuanced, see Feature selection and extraction - in a form computers can work with and adopt some algorithmthat goes to town on that data and figures out the mapping from problem instances to solutions for you.
Why does a computer have better chances of coming up with correct mapping rules? Because it is much faster than you, has a perfect memory, and doesn't get bored or a headache as easily.
The first step in having a computer solve your problems, is to offer said problems to it in a suitable form.
Problem instances can be listed as a table. Each row of the table is a single problem instance and each column contains one of the aspects of your problem expressed as numbers, so an algorithm can do calculations with it - it can't do that on a bit of text, for example. These columns are called features of the problem domain. This table (or matrix) representation is very convenient and efficient to work with for most machine learning algorithms.
For the Ada Interpretation Problem this might look something like the following:
An important point to notice here is that pretty much everything you know about how your cat behaved, within the context of each problem instance, as a human observer is also in the data since we don't make any attempt to categorise sound or facial expressions ourselves. We didn't just pick a couple of facial expressions we could recognise and represented those as numbers.
Instead we use a recording of the complete situation:
The sounds Ada made, and what she looked like or was doing at the time, are in this list in the form of pixels (photo or video) and spectrograms(basically a graphical representation of sound, a 'photo' of a sound fragment, which are pixels as well).
This way of describing a problem instance digitally makes sure we don't lose subtle information embedded in the sounds or posture of Ada - aspects we would have missed using a classic approach of choosing a few characteristics we could recognise, as a human being, and assign numbers to.
You might argue a 'classical' programmer might use these digital representations as well rather than trying to recognise facial expressions himself, but this would have been even harder to do. Could you list the exact steps you'd take - by looking at individual pixels! - to recognise a cat in a photo?
Feature selection and extraction
An important thing to watch out for is irrelevant data. An algorithm that has to spend a lot of time trying to take into account factors that don't have any bearing on the problem at all will just make progress slower and less efficient.
In the context of your Ada Interpretation Problem: after yet another destructive event Ada has made happen especially for you, you might wonder whether or not an airplane to Barbados will take off within the next few days. But including that information will not help anyone (human nor computer) figure out what your cat was actually trying to tell you. It's irrelevant data in the quest for understanding.
The more features you feed to an algorithm the more work it has. Feeding an algorithm too many features to chew might bring it down to its knees or flat on its belly, face in the sand on a beautiful beach in what will hopefully turn out to be Barbados (the curse of dimensionality).
So it is important to explore the data you have and make an effort to give only relevant information to your algorithm, in other words: only those features that have an actual contribution to make in determining the solution to your problem. Barbados: irrelevant. Stone shattered: hardly relevant. Third whisker from top right slightly elevated: almost certainly very relevant indeed.
Features can be created as well, for example: sometimes instead of including two features (= two numbers) you have gathered to describe your problem you might include their ratio (= one number) instead and leave the separate numbers out.
Often, some features carry the same - or about the same - information and including them all will not improve your algorithm's solution while discarding all but one will yield equivalent (or better, even!) results.
Identifying good features - finding a set of as little features as possible carrying enough relevant information to discriminate problem instances - is usually the main challenge of a successful machine learning project.
Many algorithms work better if all features are expressed as numbers within similar ranges. If one feature is a number between 0 and 1, and a second feature is a number between -1.000.000 and 1.000.000, many algorithms will work a lot slower (or not at all) because of the imbalance in the orders of magnitude of these features.
To mitigate this problem, feature scaling is applied to each feature that requires it, this is a procedure to map the range of values for a feature to a 'more reasonable' interval, like the interval [0 - 1], in order to have each feature always be a number within a similar range as all the others.
This ensures each feature throws about the same weight in a machine learning algorithm's scales and basically avoids one moderately relevant feature which is represented by comparatively large values drowning out much more relevant features with smaller ranges.
Types of Machine Learning
For every problem, there is a large set of machine learning algorithms that can be applied to it. Some algorithms might fit a particular problem well, and others might not be suited at all.
When solving a particular problem:
- a host of algorithms will be tried.
- the performance of the models they yield will be evaluated.
- the best model (or a combination of models) will then be used to solve the problem.
There are three major classes of machine learning algorithms that can be discerned based on what the computer knows about possible solutions beforehand:
A supervised learning algorithm is supervised in the sense that it is given examples of solved problem instances it should learn from, where a human expert has labelled a number of cases with a correct answer. The algorithm is learning in the sense that it is becoming increasingly better at performing a given task by looking at examples of what it is supposed to achieve.
What would a supervised Machine Learning algorithm do then, to make sense of your Ada Interpretation Problem's features? It would train a modelfor you. (Not the kind of model that earns money from having light bounce off of herself and through a lens, sometimes on a beach in Barbados.)
What is a model?
A model in this context refers to a rule orset of rules which takes a problem instance as input and comes up with an answer. This model can be arbitrarily complex and represents the very rules the programmer couldn't figure out himself.
The model is the solution for your problem in general, which gives answers to problem instances specifically.
In other words, the model answers specific cases from your problem domain:
- when you give a model a specific instance of your problem, expressed as a set of numeric values, one for each problem feature (e.g. the Stone Shove)
- it will come up with an answer for that specific problem instance (e.g. "Turn on some light jazz and pet me, human!"). The answer will be one or more numbers as well, to which we attach meaning and significance. A good and well-trained model usually comes up with the correct answer: very few non-trivial models are infallible.
Training a model
Each algorithm starts with a 'raw' model, a blueprint which yields 'solutions' that initially are not even close to what they should be. The model has a number of nuts and bolts (model parameters) that determine the answers the model gives.
It's these nuts and bolts the machine learning algorithm will tweak, molding the model until it gives reasonably accurate answers (what it deems reasonable is determined by the programmer).
It does this by looking at each example you have given it (called the training set) and trying it out on the model. It will then compare the answer the model gives to the correct answer, and tweak the nuts and bolts of the model in order to make it produce a slightly better answer than it did.
This is repeated for each example in the training set (sometimes multiple times over), and in small steps the algorithm will arrive at a model that gives good answers to all of the training examples.
The idea is now that the model that was trained on the training and gives good answers to the examples it was given will be generally applicable to the problem, meaning it will also give good answers to examples it was not trained on.
How is it used?
Supervised learning is mostly used for teaching a computer to perform tasks of which we have lots of examples to which we know the answers we're expecting, but we can't really (or at all) say which steps to take to reach those correct answers.
This type of algorithm is most prevalent and you have most likely made use of its applications on many occasions.
Some common applications include:
- Handwriting recognition (OCR - Optical Character Recognition) - used at the post office to recognise addresses from envelopes for example.
- Spam detection - keeps your inbox clean after seeing a great many examples of spam email.
- Object or behaviour recognition from images or video - detection and tracking of objects, behaviour or beaches of interest.
- Speech recognition - conversion of a recording of a voice to a representation in text form, used in commercial apps like Siri, Apple's voice assistant.
Unsupervised algorithms are given no answers at all, and are supposed to find structure in data all by itself. An algorithm like this is fed lots of examples without knowing what any answer is supposed to be.
How does it learn?
These types of algorithms usually come up with a new representation of the data they were given, and point out patterns, groups, similarities or deviations.
How is it used?
- Discovering groups - or clusters - in data. Typical example: identifying customer segments from order data.
- Building visual representations of data in a way that exposes patterns or structures, like this example which visualizes semantic clusters identifying concepts based on images it looked at:
- Anomaly detection - learning how to recognise examples that differ significantly from the bulk of data. This is used in internet security for example, where machine learning approaches enable the detection of abnormal network traffic in huge piles of data that far surpass the capabilities of human analysts.
- Finding similar features in a dataset in order to be able to arrive at a smaller set of features carrying the same amount of information - basically simplifying or compressing data - which could then be used to run a different machine learning algorithm more efficiently.
This is a bit of a special beast. Reinforcement learning is a method for training agents (think of robots, or software bots) to perform a task by having it modify its actions based on a reward signal coming from its environment.
How does it learn?
Basically, an agent will have an internal model (called a policy) selecting actions to perform based on what it can perceive from its environment. It starts acting according to that model, and will be rewarded if it does well at the task at hand and possibly punished if fails.
That reward (or lack of it) will be used to change the policy for selecting actions in a way that will optimise the reward signal. The policy itself can be extremely complex and involve an arbitrary number of machine learning sub-problems on its own.
How is it used?
This is a very popular research topic, it's been used to have software master a whole host of Atari games, which it figured out how to play by being fed the game screen's pixels (environment) and a number indicating how much its score was (reward signal).
Currently, attempts are made to crack the problem of playing StarCraft II - the famous RTS game - on a competitive level (and you can help).
Reinforcement learning has found its way in practical applications as well, for example:
- Robotics - Reinforcement learning is a natural fit to teach robots all sorts of tasks. A very cool example is robots teaching themselves to walk, even though they have no knowledge of their own morphology.
- Financial market trading - Successful trading is still somewhat of a black art, reinforcement learning is applied in software agents to figure out profitable strategies.
Machine learning is in its infancy - about to get its first tooth or so - and papers detailing new advances are published all the time. Many of these focus on a specific subfield of machine learning called Deep Learning, which I believe deserves special mention in this overview article.
To give an adequate overview of what deep learning entails would require a lengthy separate post. Suffice to say it involves building what's called neural networks and is taking center stage in most of machine learning.
What is a neural network?
Actual neurons found, hopefully, in abundance in your own brain and nervous system are rather complicated cells. Your cat's cells are no more complicated than yours, most differences between your brain and your cat's are caused by the number of cells and the connections between those cells.
An artificial neural network uses a thoroughly stripped down and stylised version of its organic counterpart as a building block to create networks in which these units are connected to each other.
Such an artificial neuron looks something like the following:
When combining multiple of these neurons by connecting their inputs to the outputs of others, and vice versa, the result is called a neural network.
How does a neural net work?
A typical (smallish) neural network may be structured like this:
- Each purple and red disc is an artificial neuron as described in the previous section, and the lines between them link the output of one neuron to the inputs of other neurons.
- The input of the network is represented by the green discs.
- Each column of discs is called a layer, a layer is called 'hidden' if the inputs of the neurons in that layer can't be changed directly (they must receive their input from the outputs of neurons in the previous layer).
- The input discs can be set to a values representing a machine learning problem instance (so the problem's features serve as input to the network!).
- Giving the green input discs a value causes the network to generate an output:
- each neuron in the first layer will look at its inputs and its weights and generate an output
- that output is the input of neurons in the next layer
- the neurons in that layer do the exact same thing
- this is repeated in each layer until every neuron in the network calculated its own output.
- The output of the red neurons at the end constitute the output of the network as a whole. The network in this example is supposed to distinguish two classes of inputs, it's output is a set of two numbers estimating the probabilities that its input belongs to class 1 and class 2 respectively.
How to make a neural net work for you?
Now, usually it's the output of such a network which is of interest to you when you want it to solve problems. What makes a network yield outputs that happen to be solutions to a problem you're interested in?
There are two important ingredients here:
- Each neuron's input weights - These allow to change the influence of each input of that neuron on the value of its output.
- The topology of the network - how many neurons are there, how do they calculate their outputs exactly, which neurons are connected to each other, and how? There are many different architectures being explored today, with new ones popping up continuously. Each architecture has its own special tricks and capabilities.
Machine learning algorithms can tweak and tune these properties of a network in order to have them yield solutions to a particular problem by changing input weights of the neurons in the network and/or the network's topology.
We can give a supervised learning algorithm a neural network to play with and some examples of what we want it to achieve, and the algorithm will manipulate the network until it does what we wanted it to do.
And so, neural networks can play the role of machine learning models (mentioned earlier in this post). They've turned out to be stunning oscar-worthy actors indeed and I hope they'll all get a part in the next Game of Thrones season.
Neural networks have been used in all areas of machine learning - supervised, unsupervised and reinforcement learning - with spectacular results and there is absolutely no indication of our having exhausted its possibilities. This field has been especially successful in having computers perform tasks that previously only humans could bring to a satisfactory conclusion, like recognising objects (including cancer cells!) in images, turning speech into text, ...
What makes these networks so powerful is that we actually have very little theoretical knowledge of how and why they work, but we do have the practical knowledge to train them in order to perform tasks previously considered impossible to achieve by computers. This is pretty much what makes machine learning so interesting in the first place, as I've mention in the first part of this article: we can make computers perform tasks without having to tell them how exactly.
As impressive as results in deep learning have been so far, they are just a few pretty snowflakes residing on the tip of a humongous iceberg. A tiny selection of what this remarkable field has achieved in recent years follows.
Facebook face recognition
You have no doubt noticed Facebook correctly suggesting tags for people in your photos. The same face is recognised across photos using an unsupervised learning approach that detects a cluster of occurrences of the same face, and once a single instance of that face is tagged by a user (supervised learning) Facebook can tag that face everywhere it encounters it (the whole cluster of occurrences).
So basically, with just one human intervention their algorithms are capable of recognising you in any photo any facebook user may upload. This is sometimes called semi-supervised learning because the amount of human input needed is so tiny.
DeepMind is an artificial intelligence research lab owned by Google, which has made some truly fascinating advances in deep learning.
One of their accomplishments has recently made worldwide headlines. Researchers at DeepMind trained an algorithm that beat the world champion at the game of Go, a task very much more difficult than winning at chess for example. This achievement is made all the more impressive because even the best Go players can't really explain many of the moves they make because they rely heavily on 'intuition'. This suggests that at some level, the algorithm must have acquired something akin to what we call intuition (which might be just a word for rules too complex for human consciousness to make sense of) of its own.
This achievement even made the cover of the most prestigious science journals on the planet (Nature).
Differentiable Neural Computers
Recently they've also published a new approach to combining neural networks with a memory, allowing it to remember what it has learned and apply it to related problems. Called Differentiable Neural Computers, they have been trained to navigate a number of transit systems and showed it to be capable of applying that knowledge to get around London using the underground.
Generative Adversarial Networks (GANs)
This is an incredibly exciting subfield of deep learning in which two types of neural networks interact, for example:
- one tries to create images that look photorealistic (generative network),
- the other tries to distinguish real images from fake ones (discriminative network).
These are pitched against each other, both networks getting better at their job - locked into very much the same sort of evolutionary arms race that yielded insects that look exactly like poisonous plants, for example.
This approach has yielded astonishing results in which textual descriptions of an object are converted to a photorealistic (but entirely fake!) image of that object:
Birds generated from text descriptions by StackGAN
Another mind-blowing application is able to modify images into related images after having learned a 'make my image look like this' type of rule. What on earth am I talking about? How about this:
This model has learned what winter and summer look like and apply that to unseen photos in order to translate winter images into summer images and vice versa.
A popular app that has GANs at its core is Prisma, which takes any photograph you care to throw at it and turns it into a Van Gogh or a pencil sketch. Basically, the app employs a Deep Neural Network that is trained to recognise a Van Gogh and modifies your image so that network can't distinguish the modified image from a Van Gogh anymore.
Nobody had to sit down and figure out the rules for converting any image, pixel by pixel, into a Van Gogh.
Recently, researchers associated with Adobe (from Photoshop fame) have developed a similar trick to apply the visual style of one photograph to another, allowing you to take, for example, a boring day-time photo of one city and a beautiful night-time photo of a different city, and turn your day-time photo into a similar-looking night-time photo. You have to see it to understand how spectacular this is:
I'm going to leave a host of extra exclamation marks right here to point out how incredible all this is: !!!!!!
Learning to understand your cat
Hopefully by now you have some idea of how machine learning might prevail over a classical approach on some types of problems and why it might help you figure out whether or not to open the door for Ada next time she seems to require this of you but you're sure she will definitely refuse to go through.
Sadly, writing software that understands your cat is even harder than you might have guessed. How might each of the three approaches to machine learning listed above be of any help in your quest?
This approach would require a list of solved problem instances. The bad news is embodied by the word 'solved', obviously. In order to label any example correctly we would have to ask your cat what she actually meant in that particular case, the answer to which would fail to present itself in any way, shape or form.
We might conceivably be able to discover associations between some body language and certain aspects of Ada's purring, or find hidden patterns in her communication, providing at least some insight in how she operates, but to generate a sentence from this type of information that communicates her actual message seems at the very least A Challenge and possibly impossible.
This approach might help you to find the correct response to anything your cat does. Each time she makes a sound or just goes ahead and lies there on her back for a good while, you pick one of the answers that occur to you and act on it, and have your 'environment' either reward or punish you in order to be able to make a better guess next time - hoping that eventually you'll start guessing correctly every time.
You won't understand why, but at least she will think you do. Which at least solves the problem from her perspective, and yours by extension. Also, this is the approach most people owned by a cat live by.
These insights might render you despondent. On top of that, there's the very real possibility that you can't figure out what Ada wants because she is generating insufficient information to make a successful interpretation in the first place, in which case writing any sort of software to do the job in your stead would be an extremely foolish fool's errand.
All this just to say: Don't despair! You're on your own.
Machine Learning is a very powerful trick which humanity's only just started to poke at with a stick. The ideas at its heart are not very new but only recently hardware has become powerful enough to convert theoretical potential into practical applications that can be sold to people - which is, for the foreseeable future, still the main requirement to make large-scale advances in any industry.
Now, it's here, and not going anywhere. We've only just learned to crawl. In the near future we will be tripping over the carpet, hitting our knees against tables and stumbling into closets, but in time we will be running towards the horizon at top speed with the wind of revolution in our backs.
Science fiction is turning into reality at a staggering rate and the sky will not be a limit whatsoever.