It’s been almost six years since the first consumer 3d camera, the Kinect, was released as an accessory for the Xbox 360. Although it was a huge leap forward and introduced the technology to millions of people, it was still a long way from being a must-have technology. Microsoft wasn’t even friendly to developers initially. It took million-view YouTube demos running on reverse engineered drivers to convince Microsoft to release its own SDK (Software Development Kit).

Things have changed significantly with the cameras released since then. Most of the bulk is gone, with devices shrinking by two orders of magnitude, and there is no more need for a power brick. Image quality, resolution, and latency have all improved dramatically. They all have developer support too. The technology certainly doesn’t need to be perfect to gain adoption, but even with all of these improvements there are still some major hurdles. Most of the remaining ones are in software.

Common features could use a common interface

Every 3d camera I know of shares several common features. They produce an image where pixels are distances instead of colors. There is a factory calibration to convert depth pixels to 3d vectors and to paint them with pixels from a separate color camera. Finally, they need some way to match up images taken at the same time, because the color and depth images usually arrive separately. Despite all this, there isn’t a common API for any of these features. There is constant innovation, so a common interface won’t cover every camera, but as it stands software written for the original Kinect cannot be used with any newer camera without being ported to the new interface.
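
As a sketch of what one piece of such a common interface could look like, here is the frame matching step: pairing each depth frame with the nearest-in-time color frame. The types and names here are hypothetical, not from any real SDK, and the 16 ms skew tolerance is just an assumed value (roughly one 60 fps frame period).

```python
from dataclasses import dataclass

# Hypothetical vendor-neutral frame type; field names are illustrative.
@dataclass
class Frame:
    timestamp: float      # seconds since stream start
    kind: str             # "depth" or "color"
    data: bytes = b""     # raw pixel buffer

def pair_frames(depth_frames, color_frames, max_skew=0.016):
    """Match each depth frame with the nearest-in-time color frame.

    Frames whose timestamps differ by more than max_skew seconds
    are left unpaired, since blending them would smear motion.
    """
    pairs = []
    for d in depth_frames:
        best = min(color_frames,
                   key=lambda c: abs(c.timestamp - d.timestamp),
                   default=None)
        if best is not None and abs(best.timestamp - d.timestamp) <= max_skew:
            pairs.append((d, best))
    return pairs
```

Every vendor SDK does something equivalent internally; the point is that nothing this simple is exposed through a shared API.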

Beyond the hardware interface there isn’t any well defined file format for recording the data. The closest thing to what is needed is the “bag” format that Robot Operating System (ROS) uses. It can store everything, but not efficiently or portably. This is a pretty challenging task: any file format needs to store time-synchronized images in multiple formats and framerates at once. Since the data is commonly wanted for 3d reconstruction, it should also store the calibration, along with any sensor data that might help align it, like accelerometer and GPS readings. Probably the best place to start is the Matroska media format. Raw audio is a good template for sensor data, and subtitles are a good template for NMEA GPS data.
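
To make the requirements concrete, here is a sketch of the information inventory such a container has to hold. This is not a real Matroska API, just illustrative Python: per-track metadata (so depth, color, and sensor streams can run at different rates and formats), a global calibration block, and individually timestamped samples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical recording layout; names are illustrative only.
@dataclass
class Track:
    name: str         # e.g. "depth", "color", "accel", "gps"
    codec: str        # pixel or sample format identifier
    rate_hz: float    # nominal frame/sample rate

@dataclass
class Recording:
    calibration: dict                       # factory intrinsics/extrinsics
    tracks: List[Track] = field(default_factory=list)
    # Each sample carries its own timestamp so tracks at different
    # rates can be interleaved: (track_name, timestamp_s, payload).
    samples: List[Tuple[str, float, bytes]] = field(default_factory=list)

    def append(self, track_name: str, timestamp: float, payload: bytes):
        self.samples.append((track_name, timestamp, payload))
```

In Matroska terms each Track would map to a track element and each sample to a timestamped block, which is exactly how it already handles audio and subtitles.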

Computer vision

This gets you to the general purpose computer vision (CV) layer, which has mature open source libraries available. OpenCV has advanced features like human face detection built in, and GPU versions of many functions. It also has tools for using a calibration to turn a depth image into a point cloud (a giant array of vectors). Point Cloud Library (PCL) has tools for turning point clouds into 3d surfaces.
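
The depth-to-point-cloud step is simple enough to sketch directly. This is a minimal back-projection using the standard pinhole camera model with numpy; the intrinsics (fx, fy, cx, cy) are assumed to come from the camera’s factory calibration, and a real pipeline (e.g. OpenCV’s) would also correct lens distortion.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) to an N x 3 point cloud.

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,
    where (u, v) are pixel column/row and Z is the depth reading.
    """
    v, u = np.indices(depth.shape)           # per-pixel row/column grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                # drop zero-depth (invalid) pixels
```

Feeding the result into PCL (or any mesh reconstruction) is then just a matter of packing the array into its point cloud type.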


Initially 3d cameras were marketed for the most amazing things they could possibly do, but it turns out there are much easier killer apps to build. Simply filtering an image by distance is pretty useful. A green-screen-like tool can be made to work with any background, so long as the background is further away than the subject. Face detection can be used to automatically set the cutoff range. Another simple tool is the Windows 10 “Hello” feature. It can identify a user with facial recognition, and verify it isn’t just a photo by checking the shape of the face.
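
The distance-filtering trick really is this small. Here is a minimal sketch with numpy, assuming an aligned color image and depth image (which is what the factory calibration gives you) and an assumed cutoff distance; anything beyond the cutoff, or with no depth reading at all, is blanked out.

```python
import numpy as np

def cut_out_subject(color, depth, max_distance):
    """Keep color pixels closer than max_distance; blank out the rest.

    color: H x W x 3 image aligned to depth, depth: H x W meters.
    A depth of 0 usually means "no reading", so treat it as background.
    """
    mask = (depth > 0) & (depth < max_distance)
    out = np.zeros_like(color)
    out[mask] = color[mask]       # copy only the foreground pixels
    return out, mask
```

Compositing onto a new background instead of black is one more masked assignment, and face detection can pick max_distance automatically from the depth at the detected face.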

The launch feature of the original Kinect was skeletal tracking. The Leap Motion and RealSense cameras both feature similar technology in the form of gesture recognition. This is proprietary technology implemented in closed source software. This is how you get your Minority Report UI. Every vendor seems to have a unique API, which might not be stable. For example, Intel’s SDK had an emotion detection feature licensed from a third party until Apple bought that company and it became unavailable. I think Intel has since started aggressively buying companies to prevent that from happening again, but that is no guarantee. There is one open source project in this space, called Skeltrack. The last time I checked, its original author was too busy with his day job discovering the secrets of the universe at CERN to put much time into it.

The most obvious use for a 3d camera is 3d scanning. This could mean scanning an object or mapping a 3d space. The algorithms usually have SLAM (Simultaneous Localization And Mapping) in their names. Fortunately there are a bunch of open source implementations, though they are usually tricky to get working. SLAM is possible with just cameras, but it works tremendously better with motion sensors (an IMU). Many vendors have some kind of solution, but it isn’t a universal feature.

Hardware availability

When I started working on the software part of this problem a year and a half ago, the new Intel RealSense cameras weren’t widely available yet, which made it pretty difficult to find collaborators. The development kits haven’t been consistently available. There have also been a few computers sold with built-in cameras, but buying a whole computer to get a camera is a pretty expensive option. I ended up buying a computer with a camera, but it was time for an upgrade anyway. Fortunately the shortage is finally over, as the first stand-alone consumer RealSense camera, the Razer Stargazer, started shipping today. Its specifications are pretty close to those of the SR300 development kit, but it is lighter and has a detachable cable.

About a year after my first blog post on 3d cameras, Intel released a cross platform library for their cameras. I’m not sure how much this blog influenced them, but it does implement most of the features I described in posts last year. Their Linux support includes a kernel patch with some of my code, and their OS X support uses libuvc. Robot Operating System (ROS) has its own driver. Unlike my own code, it has a test framework, and bugs are getting fixed rapidly. A bunch of things I was working on late last year and early this year never got released because Intel surprised me with a better implementation. Since then I’ve been focusing on the Linux kernel, which has been its own steep learning curve.

While there are still a few rough spots, I think 3d cameras have already reached the tipping point where the technology starts to be adopted at an exponential rate. It is also the easiest it has ever been to start developing applications.

Part 1
Part 2
Part 3