Abe Davis: New video technology that reveals an object’s hidden properties

Most of us think of motion as a very visual thing. If I walk across this stage or gesture with my hands while I speak, that motion is something that you can see. But there’s a world of important motion that’s too subtle for the human eye, and over the past few years, we’ve started to find that cameras can often see this motion even when humans can’t.

So let me show you what I mean. On the left here, you see video of a person’s wrist, and on the right, you see video of a sleeping infant, but if I didn’t tell you that these were videos, you might assume that you were looking at two regular images, because in both cases, these videos appear to be almost completely still. But there’s actually a lot of subtle motion going on here, and if you were to touch the wrist on the left, you would feel a pulse, and if you were to hold the infant on the right, you would feel the rise and fall of her chest as she took each breath. And these motions carry a lot of significance, but they’re usually too subtle for us to see, so instead, we have to observe them through direct contact, through touch.

But a few years ago, my colleagues at MIT developed what they call a motion microscope, which is software that finds these subtle motions in video and amplifies them so that they become large enough for us to see. And so, if we use their software on the left video, it lets us see the pulse in this wrist, and if we were to count that pulse, we could even figure out this person’s heart rate. And if we used the same software on the right video, it lets us see each breath that this infant takes, and we can use this as a contact-free way to monitor her breathing. And so this technology is really powerful because it takes these phenomena that we normally have to experience through touch and it lets us capture them visually and non-invasively.

So a couple years ago, I started working with the folks that created that software, and we decided to pursue a crazy idea. We thought, it’s cool that we can use software to visualize tiny motions like this, and you can almost think of it as a way to extend our sense of touch. But what if we could do the same thing with our ability to hear? What if we could use video to capture the vibrations of sound, which are just another kind of motion, and turn everything that we see into a microphone?

Now, this is a bit of a strange idea, so let me try to put it in perspective for you. Traditional microphones work by converting the motion of an internal diaphragm into an electrical signal, and that diaphragm is designed to move readily with sound so that its motion can be recorded and interpreted as audio. But sound causes all objects to vibrate. Those vibrations are just usually too subtle and too fast for us to see. So what if we record them with a high-speed camera and then use software to extract tiny motions from our high-speed video, and analyze those motions to figure out what sounds created them? This would let us turn visible objects into visual microphones from a distance.

And so we tried this out, and here’s one of our experiments, where we took this potted plant that you see on the right and we filmed it with a high-speed camera while a nearby loudspeaker played this sound. (Music: “Mary Had a Little Lamb”) And so here’s the video that we recorded, and we recorded it at thousands of frames per second, but even if you look very closely, all you’ll see are some leaves that are pretty much just sitting there doing nothing, because our sound only moved those leaves by about a micrometer. That’s one ten-thousandth of a centimeter, which spans somewhere between a hundredth and a thousandth of a pixel in this image. So you can squint all you want, but motion that small is pretty much perceptually invisible. But it turns out that something can be perceptually invisible and still be numerically significant, because with the right algorithms, we can take this silent, seemingly still video and we can recover this sound. (Music: “Mary Had a Little Lamb”) (Applause)

So how is this possible? How can we get so much information out of so little motion? Well, let’s say that those leaves move by just a single micrometer, and let’s say that that shifts our image by just a thousandth of a pixel. That may not seem like much, but a single frame of video may have hundreds of thousands of pixels in it, and so if we combine all of the tiny motions that we see from across that entire image, then suddenly a thousandth of a pixel can start to add up to something pretty significant. On a personal note, we were pretty psyched when we figured this out. (Laughter)

But even with the right algorithm, we were still missing a pretty important piece of the puzzle. You see, there are a lot of factors that affect when and how well this technique will work. There’s the object and how far away it is; there’s the camera and the lens that you use; how much light is shining on the object and how loud your sound is. And even with the right algorithm, we had to be very careful with our early experiments, because if we got any of these factors wrong, there was no way to tell what the problem was. We would just get noise back. And so a lot of our early experiments looked like this. And so here I am, and on the bottom left, you can kind of see our high-speed camera, which is pointed at a bag of chips, and the whole thing is lit by these bright lamps. And like I said, we had to be very careful in these early experiments, so this is how it went down.

(Video) Abe Davis: Three, two, one, go. Mary had a little lamb! Little lamb! Little lamb! (Laughter)

AD: So this experiment looks completely ridiculous. (Laughter) I mean, I’m screaming at a bag of chips. (Laughter) And we’re blasting it with so much light, we literally melted the first bag we tried this on. (Laughter) But ridiculous as this experiment looks, it was actually really important, because we were able to recover this sound.

(Audio) Mary had a little lamb! Little lamb! Little lamb! (Applause)

AD: And this was really significant, because it was the first time we recovered intelligible human speech from silent video of an object. And so it gave us this point of reference, and gradually we could start to modify the experiment, using different objects or moving the object further away, using less light or quieter sounds. And we analyzed all of these experiments until we really understood the limits of our technique, because once we understood those limits, we could figure out how to push them. And that led to experiments like this one, where again, I’m going to speak to a bag of chips, but this time we’ve moved our camera about 15 feet away, outside, behind a soundproof window, and the whole thing is lit by only natural sunlight. And so here’s the video that we captured. And this is what things sounded like from inside, next to the bag of chips.

(Audio) Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go.

AD: And here’s what we were able to recover from our silent video captured outside behind that window.

(Audio) Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go. (Applause)

AD: And there are other ways that we can push these limits as well. So here’s a quieter experiment where we filmed some earphones plugged into a laptop computer, and in this case, our goal was to recover the music that was playing on that laptop from just silent video of these two little plastic earphones, and we were able to do this so well that I could even Shazam our results. (Laughter) (Music: “Under Pressure” by Queen) (Applause)

And we can also push things by changing the hardware that we use. The experiments I’ve shown you so far were done with a high-speed camera that can record video about 100 times faster than most cell phones, but we’ve also found a way to use this technique with more regular cameras, and we do that by taking advantage of what’s called a rolling shutter. You see, most cameras record images one row at a time, and so if an object moves during the recording of a single image, there’s a slight time delay between each row, and this causes slight artifacts that get coded into each frame of a video. And so what we found is that by analyzing these artifacts, we can actually recover sound using a modified version of our algorithm.

So here’s an experiment we did where we filmed a bag of candy while a nearby loudspeaker played the same “Mary Had a Little Lamb” music from before, but this time, we used just a regular store-bought camera, and so in a second, I’ll play for you the sound that we recovered, and it’s going to sound distorted this time, but listen and see if you can still recognize the music. (Audio: “Mary Had a Little Lamb”) And so, again, that sounds distorted, but what’s really amazing here is that we were able to do this with something that you could literally run out and pick up at a Best Buy.

So at this point, a lot of people see this work, and they immediately think about surveillance. And to be fair, it’s not hard to imagine how you might use this technology to spy on someone. But keep in mind that there’s already a lot of very mature technology out there for surveillance. In fact, people have been using lasers to eavesdrop on objects from a distance for decades. But what’s really new here, what’s really different, is that now we have a way to picture the vibrations of an object, which gives us a new lens through which to look at the world, and we can use that lens to learn not just about forces like sound that cause an object to vibrate, but also about the object itself.

And so I want to take a step back and think about how that might change the ways that we use video, because we usually use video to look at things, and I’ve just shown you how we can use it to listen to things. But there’s another important way that we learn about the world: that’s by interacting with it. We push and pull and poke and prod things. We shake things and see what happens. And that’s something that video still won’t let us do, at least not traditionally.

So I want to show you some new work, and this is based on an idea I had just a few months ago, so this is actually the first time I’ve shown it to a public audience. And the basic idea is that we’re going to use the vibrations in a video to capture objects in a way that will let us interact with them and see how they react to us. So here’s an object, and in this case, it’s a wire figure in the shape of a human, and we’re going to film that object with just a regular camera. So there’s nothing special about this camera. In fact, I’ve actually done this with my cell phone before. But we do want to see the object vibrate, so to make that happen, we’re just going to bang a little bit on the surface where it’s resting while we record this video. So that’s it: just five seconds of regular video, while we bang on this surface, and we’re going to use the vibrations in that video to learn about the structural and material properties of our object, and we’re going to use that information to create something new and interactive.

And so here’s what we’ve created. And it looks like a regular image, but this isn’t an image, and it’s not a video, because now I can take my mouse and I can start interacting with the object. And so what you see here is a simulation of how this object would respond to new forces that we’ve never seen before, and we created it from just five seconds of regular video. (Applause)

And so this is a really powerful way to look at the world, because it lets us predict how objects will respond to new situations, and you could imagine, for instance, looking at an old bridge and wondering what would happen, how would that bridge hold up if I were to drive my car across it. And that’s a question that you probably want to answer before you start driving across that bridge. And of course, there are going to be limitations to this technique, just like there were with the visual microphone, but we found that it works in a lot of situations that you might not expect, especially if you give it longer videos. So for example, here’s a video that I captured of a bush outside of my apartment, and I didn’t do anything to this bush, but by capturing a minute-long video, a gentle breeze caused enough vibrations that we could learn enough about this bush to create this simulation. (Applause)

And so you could imagine giving this to a film director, and letting him control, say, the strength and direction of wind in a shot after it’s been recorded. Or, in this case, we pointed our camera at a hanging curtain, and you can’t even see any motion in this video, but by recording a two-minute-long video, natural air currents in this room created enough subtle, imperceptible motions and vibrations that we could learn enough to create this simulation. And ironically, we’re kind of used to having this kind of interactivity when it comes to virtual objects, when it comes to video games and 3D models, but to be able to capture this information from real objects in the real world using just simple, regular video is something new that has a lot of potential.

So here are the amazing people who worked with me on these projects. (Applause) And what I’ve shown you today is only the beginning. We’ve just started to scratch the surface of what you can do with this kind of imaging, because it gives us a new way to capture our surroundings with common, accessible technology. And so looking to the future, it’s going to be really exciting to explore what this can tell us about the world. Thank you. (Applause)
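
Davis’s explanation of why a thousandth of a pixel can “add up” is essentially an averaging argument: if many pixels each carry the same tiny motion plus their own independent noise, averaging N of them shrinks the noise by roughly a factor of √N while the shared motion survives. The sketch below is a toy simulation of that argument only, not the algorithm used in the work described in the talk. The frame rate, pixel count, 440 Hz tone, and Gaussian noise level are assumed values chosen for illustration, and `average_motion` is a hypothetical helper.

```python
import numpy as np

def average_motion(per_pixel_motion: np.ndarray) -> np.ndarray:
    """Collapse a (frames, pixels) array of motion estimates into one signal."""
    return per_pixel_motion.mean(axis=1)

rng = np.random.default_rng(seed=0)

frame_rate = 2000        # assumed high-speed frame rate, frames per second
n_frames = 200           # 0.1 s of video
n_pixels = 20_000        # far fewer than a real frame, to keep memory small
t = np.arange(n_frames) / frame_rate

# Hypothetical ground truth: a 440 Hz vibration only 1/1000 of a pixel in size.
true_motion = 1e-3 * np.sin(2 * np.pi * 440 * t)

# Each pixel's estimate = the shared tiny motion + its own independent noise,
# assumed Gaussian with a standard deviation of 0.02 px (20x the signal).
noise = rng.normal(0.0, 0.02, size=(n_frames, n_pixels))
per_pixel = true_motion[:, None] + noise   # broadcast the signal to every pixel

single_pixel = per_pixel[:, 0]             # what one pixel alone would give us
recovered = average_motion(per_pixel)      # combine motion across all pixels

corr_single = np.corrcoef(single_pixel, true_motion)[0, 1]
corr_recovered = np.corrcoef(recovered, true_motion)[0, 1]
print(f"one pixel:           r = {corr_single:.3f}")
print(f"{n_pixels} pixels averaged: r = {corr_recovered:.3f}")
```

With these assumed numbers, the noise left after averaging is about 0.02/√20000 ≈ 0.00014 px, well below the 0.001 px signal, so the averaged trace tracks the true vibration closely while a single pixel’s trace barely correlates with it at all.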
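
The rolling-shutter idea rests on timing: if each row of a frame is read out slightly later than the row above it, a single frame already contains many samples of the scene taken at different instants. The sketch below only computes those sample times under a simple assumed model (row r of frame f is captured at f/frame_rate + r·line_delay). The 60 fps, 1,080 rows, and line delay are hypothetical numbers rather than the specs of the camera in the talk, and the code does not reproduce the modified recovery algorithm Davis mentions.

```python
import numpy as np

def row_sample_times(frame_rate: float, n_rows: int, line_delay: float,
                     n_frames: int) -> np.ndarray:
    """Return the capture time (seconds) of every row of every frame.

    Simple rolling-shutter model (an assumption): row r of frame f is read
    at f / frame_rate + r * line_delay, rows ordered top to bottom.
    """
    frame_starts = np.arange(n_frames) / frame_rate      # one entry per frame
    row_offsets = np.arange(n_rows) * line_delay         # one entry per row
    # Broadcast to (n_frames, n_rows), then flatten frame by frame.
    return (frame_starts[:, None] + row_offsets[None, :]).ravel()

# Hypothetical camera parameters, not taken from the talk.
frame_rate = 60.0                          # frames per second
n_rows = 1080                              # rows per frame
line_delay = 1.0 / (frame_rate * 1125)     # seconds between consecutive rows
n_frames = 60                              # one second of video

times = row_sample_times(frame_rate, n_rows, line_delay, n_frames)
gaps = np.diff(times)
duration = n_frames / frame_rate

print(f"whole-frame samples per second: {n_frames / duration:.0f}")
print(f"row samples per second:         {len(times) / duration:.0f}")
print(f"gap between rows:               {gaps[0] * 1e6:.1f} us")
print(f"gap between frames:             {gaps[n_rows - 1] * 1e6:.1f} us")
```

Under these assumptions the rows provide 64,800 time samples per second instead of 60 whole frames, but they are not evenly spaced: after the last row of each frame there is a longer pause before the next frame begins, so part of every frame period goes unsampled.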

Danny Hutson
