Computer Vision: Crash Course Computer Science #35

Hi, I'm Carrie Anne, and welcome to Crash
Course Computer Science! Today, let's start by thinking about how
important vision can be. Most people rely on it to prepare food, walk
around obstacles, read street signs, watch videos like this, and do hundreds of other
tasks. Vision is the highest bandwidth sense, and
it provides a firehose of information about the state of the world and how to act on it. For this reason, computer scientists have
been trying to give computers vision for half a century, birthing the sub-field of computer
vision.

Its goal is to give computers the ability
to extract high-level understanding from digital images and videos. As everyone with a digital camera or smartphone
knows, computers are already really good at capturing photos with incredible fidelity
and detail, much better than humans in fact. But as computer vision professor Fei-Fei Li
recently said, "Just like to hear is not the same as to listen. To take pictures is not the same as to see."

INTRO

As a refresher, images on computers are most
often stored as big grids of pixels. Each pixel is defined by a color, stored as
a combination of three additive primary colors: red, green and blue. By combining different intensities of these
three colors, what's called an RGB value, we can represent any color. Perhaps the simplest computer vision algorithm,
and a good place to start, is to track a colored object, like a bright pink ball.

The first thing we need to do is record the
ball's color. For that, we'll take the RGB value of the
centermost pixel. With that value saved, we can give a computer
program an image, and ask it to find the pixel with the closest color match. An algorithm like this might start in the
upper right corner, and check each pixel, one at a time, calculating the difference from
our target color.
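
Here's a rough sketch of that search in Python, assuming the image is a NumPy array of RGB values; the distance measure and the made-up frames are illustrative choices, not anything specified in the episode:

    import numpy as np

    def find_closest_pixel(image, target_rgb):
        # image is an H x W x 3 array of RGB values (0-255).
        # Compute each pixel's squared color difference from the target...
        diff = image.astype(np.int64) - np.array(target_rgb, dtype=np.int64)
        distance = (diff ** 2).sum(axis=2)
        # ...and return the (row, column) of the closest match.
        return np.unravel_index(distance.argmin(), distance.shape)

    # Track a bright pink ball across a few (here, randomly generated) frames.
    target = (255, 105, 180)   # RGB value sampled from the ball
    frames = [np.random.randint(0, 256, (480, 640, 3)) for _ in range(3)]
    print([find_closest_pixel(frame, target) for frame in frames])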

Now, having looked at every pixel, the best
match is very likely a pixel from our ball. We're not limited to running this algorithm
on a single photo; we can do it for every frame in a video, allowing us to track the
ball over time. Of course, due to variations in lighting,
shadows, and other effects, the ball on the field is almost certainly not going to be
the exact same RGB value as our target color, but merely the closest match. In more extreme cases, like at a game at night,
the tracking might be poor.

And if one of the team's jerseys used the
same color as the ball, our algorithm would get totally confused. For these reasons, color marker tracking and
similar algorithms are rarely used, unless the environment can be tightly controlled. This color tracking example was able to search
pixel-by-pixel, because colors are stored inside of single pixels. But this approach doesn't work for features
larger than a single pixel, like edges of objects, which are inherently made up of many
pixels.

To identify these types of features in images,
computer vision algorithms have to consider small regions of pixels, called patches. As an example, let's talk about an algorithm
that finds vertical edges in a scene, let's say to help a drone navigate safely through
a field of obstacles. To keep things simple, we're going to convert
our image into grayscale, although most algorithms can handle color. Now let's zoom into one of these poles to
see what an edge looks like up close.

We can easily see where the left edge of the
pole starts, because there's a change in color that persists across many pixels vertically. We can define this behavior more formally
by creating a rule that says the likelihood of a pixel being a vertical edge is the magnitude
of the difference in color between some pixels to its left and some pixels to its right. The bigger the color difference between these
two sets of pixels, the more likely the pixel is on an edge. If the color difference is small, it's probably
not an edge at all.

The mathematical notation for this operation
looks like this; it's called a kernel or filter. It contains the values for a pixel-wise multiplication, the sum of which is saved into the center pixel. Let's see how this works for our example
pixel. I've gone ahead and labeled all of the pixels
with their grayscale values.

Now, we take our kernel, and center it over
our pixel of interest. This specifies what each pixel value underneath
should be multiplied by. Then, we just add up all those numbers. In this example, that gives us 147.

That becomes our new pixel value. This operation, of applying a kernel to a
patch of pixels, is called a convolution. Now let's apply our kernel to another pixel. In this case, the result is 1.

Just 1. In other words, it's a very small color
difference, and not an edge. If we apply our kernel to every pixel in the
photo, the result looks like this, where the highest pixel values are where there are strong
vertical edges. Note that horizontal edges, like those platforms
in the background, are almost invisible.
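
Here's a minimal Python sketch of that whole-image convolution, assuming the grayscale image is a 2D NumPy array; the kernel is the vertical edge kernel described above, and border pixels are skipped just to keep the sketch short:

    import numpy as np

    # Kernel that responds to left-right color differences (vertical edges).
    vertical_edge_kernel = np.array([[-1, 0, 1],
                                     [-1, 0, 1],
                                     [-1, 0, 1]])

    def convolve(image, kernel):
        # Center the kernel over each pixel, multiply pixel-wise, and
        # save the sum into the corresponding output pixel.
        h, w = image.shape
        k = kernel.shape[0] // 2
        output = np.zeros_like(image, dtype=np.int64)
        for row in range(k, h - k):
            for col in range(k, w - k):
                patch = image[row - k:row + k + 1, col - k:col + k + 1]
                output[row, col] = (patch * kernel).sum()
        return output   # high absolute values mark strong vertical edges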

If we wanted to highlight those features,
we'd have to use a different kernel, one that's sensitive to horizontal edges. Both of these edge-enhancing kernels are called
Prewitt Operators, named after their inventor. These are just two examples of a huge variety
of kernels, able to perform many different image transformations. For example, here's a kernel that sharpens
images.

And here's a kernel that blurs them. Kernels can also be used like little image
cookie cutters that match only certain shapes. So, our edge kernels looked for image patches
with strong differences from right to left or up and down. But we could also make kernels that are good
at finding lines, with edges on both sides.

And even islands of pixels surrounded by contrasting
colors. These types of kernels can begin to characterize
simple shapes. For example, on faces, the bridge of the nose
tends to be brighter than the sides of the nose, resulting in higher values for line-sensitive
kernels. Eyes are also distinctive: a dark circle
surrounded by lighter pixels, a pattern other kernels are sensitive to.
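
For reference, here are textbook versions of a few of the kernels just mentioned, written as NumPy arrays; exact coefficients vary between implementations, so treat these as common defaults rather than the ones shown on screen:

    import numpy as np

    # Horizontal Prewitt operator: responds to top-bottom differences.
    prewitt_horizontal = np.array([[-1, -1, -1],
                                   [ 0,  0,  0],
                                   [ 1,  1,  1]])

    # A common sharpening kernel: boosts the center pixel
    # relative to its neighbors.
    sharpen = np.array([[ 0, -1,  0],
                        [-1,  5, -1],
                        [ 0, -1,  0]])

    # Box blur: averages each pixel with its eight neighbors.
    blur = np.ones((3, 3)) / 9.0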

When a computer scans through an image, most
often by sliding around a search window, it can look for combinations of features indicative
of a human face. Although each kernel is a weak face detector
by itself, combined, they can be quite accurate. It's unlikely that a bunch of face-like
features will cluster together if they're not a face. This was the basis of an early and influential
algorithm called Viola-Jones Face Detection.
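
OpenCV still ships a pre-trained detector descended from this approach; here's a minimal sketch of using it, assuming OpenCV is installed and "photo.jpg" is a stand-in for your own image:

    import cv2

    # Load the pre-trained Haar cascade face detector bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("photo.jpg")                  # stand-in input file
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # detector wants grayscale

    # Slide a search window across the image at several scales.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)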

Today, the hot new algorithms on the block
are Convolutional Neural Networks. We talked about neural nets last episode,
if you need a primer. In short, an artificial neuron, which is
the building block of a neural network, takes a series of inputs, and multiplies each by
a specified weight, and then sums those values all together. This should sound vaguely familiar, because
it's a lot like a convolution.

In fact, if we pass a neuron 2D pixel data,
rather than a one-dimensional list of inputs, it's exactly like a convolution. The input weights are equivalent to kernel
values, but unlike a predefined kernel, neural networks can learn their own useful kernels
that are able to recognize interesting features in images. Convolutional Neural Networks use banks of
these neurons to process image data, each outputting a new image, essentially digested
by different learned kernels. These outputs are then processed by subsequent
layers of neurons, allowing for convolutions on convolutions on convolutions.
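
To make the neuron-convolution equivalence concrete, here's a tiny Python illustration; the patch values and weights are made up:

    import numpy as np

    patch = np.array([[12, 40, 90],      # 3x3 grayscale patch (made-up values)
                      [10, 45, 95],
                      [11, 42, 88]])
    weights = np.array([[-1, 0, 1],      # the neuron's weights, read as a kernel
                        [-1, 0, 1],
                        [-1, 0, 1]])

    # A neuron's output: multiply each input by its weight, then sum.
    # That's exactly the convolution of this kernel with this patch.
    print((patch * weights).sum())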

The very first convolutional layer might find
things like edges, as that's what a single convolution can recognize, as we've already
discussed. The next layer might have neurons that convolve
on those edge features to recognize simple shapes composed of edges, like corners. A layer beyond that might convolve on those
corner features, and contain neurons that can recognize simple objects, like mouths
and eyebrows. And this keeps going, building up in complexity,
until there's a layer that does a convolution that puts it together: eyes, ears, mouth,
nose, the whole nine yards, and says "a-ha, it's a face!" Convolutional neural networks aren't required
to be many layers deep, but they usually are, in order to recognize complex objects and
scenes.
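
Here's a minimal sketch of that stacked-convolution idea in Python using PyTorch (an arbitrary framework choice; the layer sizes are illustrative, not from the episode):

    import torch
    import torch.nn as nn

    # Each Conv2d layer learns a bank of kernels; stacking them lets later
    # layers convolve over the features that earlier layers found.
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),    # layer 1: edges
        nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, padding=1),   # layer 2: simple shapes
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # layer 3: object parts
        nn.ReLU(),
    )

    # One fake grayscale image: batch of 1, 1 channel, 64x64 pixels.
    image = torch.randn(1, 1, 64, 64)
    features = model(image)   # 32 feature maps "digested" by learned kernels
    print(features.shape)     # torch.Size([1, 32, 64, 64])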

That's why the technique is considered deep
learning. Both Viola-Jones and Convolutional Neural
Networks can be applied to many image recognition problems, beyond faces, like recognizing handwritten
text, spotting tumors in CT scans, and monitoring traffic flow on roads. But we're going to stick with faces. Regardless of what algorithm was used, once
we've isolated a face in a photo, we can apply more specialized computer vision algorithms
to pinpoint facial landmarks, like the tip of the nose and corners of the mouth.

This data can be used for determining things
like if the eyes are open, which is pretty easy once you have the landmarks: it's
just the distance between points. We can also track the position of the eyebrows;
their relative position to the eyes can be an indicator of surprise, or delight. Smiles are also pretty straightforward to
detect based on the shape of mouth landmarks. All of this information can be interpreted
by emotion recognition algorithms, giving computers the ability to infer when you're
happy, sad, frustrated, confused and so on.
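
As a sketch of that landmark arithmetic in Python, assuming a landmark detector has already given us (x, y) coordinates (the points and threshold below are entirely made up):

    import math

    def distance(p, q):
        # Euclidean distance between two (x, y) landmark points.
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Hypothetical eyelid landmarks from a face-landmark detector.
    upper_eyelid = (120, 85)
    lower_eyelid = (121, 97)

    # A simple openness test: an eyelid gap above a threshold means "open".
    eye_open = distance(upper_eyelid, lower_eyelid) > 6   # made-up threshold
    print("Eye open?", eye_open)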

In turn, that could allow computers to intelligently
adapt their behavior... Maybe offer tips when you're confused, and not ask to install
updates when you're frustrated. This is just one example of how vision can
give computers the ability to be context sensitive, that is, aware of their surroundings. And not just the physical surroundings, like
if you're at work or on a train, but also your social surroundings, like if you're
in a formal business meeting versus a friend's birthday party.

You behave differently in those surroundings, and so should computing devices, if they're smart. Facial landmarks also capture the geometry
of your face, like the distance between your eyes and the height of your forehead. This is one form of biometric data, and it
allows computers with cameras to recognize you. Whether it's your smartphone automatically
unlocking itself when it sees you, or governments tracking people using CCTV cameras, the applications
of face recognition seem limitless.

There have also been recent breakthroughs
in landmark tracking for hands and whole bodies, giving computers the ability to interpret
a user's body language, and what hand gestures they're frantically waving at their internet-connected
microwave. As we've talked about many times in this
series, abstraction is the key to building complex systems, and the same is true in computer
vision. At the hardware level, you have engineers
building better and better cameras, giving computers improved sight with each passing
year, which I can't say for myself. Using that camera data, you have computer
vision algorithms crunching pixels to find things like faces and hands.

And then, using output from those algorithms,
you have even more specialized algorithms for interpreting things like user facial expression
and hand gestures. On top of that, there are people building
novel interactive experiences, like smart TVs and intelligent tutoring systems, that
respond to hand gestures and emotion. Each of these levels is an active area of research,
with breakthroughs happening every year. And that's just the tip of the iceberg.

Today, computer vision is everywhere, whether
it's barcodes being scanned at stores, self-driving cars waiting at red lights, or Snapchat filters
superimposing mustaches. And, the most exciting thing is that computer
scientists are really just getting started, enabled by recent advances in computing, like
super fast GPUs. Computers with a human-like ability to see are
going to totally change how we interact with them. Of course, it'd also be nice if they could
hear and speak, which we'll discuss next week.

I'll see you then.
