08.24.08

The Future of Television

Posted in Media, Tech at 12 pm

A brief ramble through some thoughts on television and video transmission, and how they will evolve in the coming decades:

1) The end goal? The Star Trek Holodeck: a 3-D representation of a scene that can be viewed from any angle. Putting aside the hokiness, this is what TV is heading towards: a reproduction of an environment in all physical dimensions.

2) In order for this to be feasible, flat 2-D capturing is useless. Video today amounts to capturing a series of bitmap images. The next gen of video will just be stereo 2-D: two images of the same scene at the same time. Great, so we’ve replicated the depth of a scene, but we’re still stuck with the single perspective of the original pair of cameras.

3) If stereo images for ‘faux-3D’ aren’t enough, then what we need are more cameras, right? Well, where does that end? Do you build a giant sphere of cameras, all pointed towards the center of the action? That might work okay for a movie like Cube, but for, let’s say, filming a climb of Mount Everest, it isn’t the way to go.

4) There are two basic ways of representing images in digital formats: bitmaps or vectors. Bitmaps are grids of pixels: perfect for paintings, documents and flat video. Bitmaps scale down well, but they are useless for scaling up. If you take a 100 pixel by 100 pixel image and blow it up to 1 mile by 1 mile, you’re going to get individual pixels that are roughly 53 feet on a side. The same image described as vectors, however, could be rendered at any resolution, down to 1-nanometer pixels, and still be an accurate representation of the image.
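To make that scaling difference concrete, here’s a toy sketch (the function names and numbers are mine, purely for illustration): a bitmap stretched to a mile has enormous pixels, while a vector description can be re-sampled at any density you like.

```python
# Toy illustration: why bitmaps break when enlarged but vectors don't.
# All names here are hypothetical, invented for this example.

MILE_IN_FEET = 5280

def bitmap_pixel_size(source_pixels, target_feet):
    """Each source pixel becomes a block this many feet wide."""
    return target_feet / source_pixels

def rasterize_vector_line(x0, y0, x1, y1, resolution):
    """A vector line can be sampled at any resolution without loss:
    return `resolution` evenly spaced points along the segment."""
    return [
        (x0 + (x1 - x0) * i / (resolution - 1),
         y0 + (y1 - y0) * i / (resolution - 1))
        for i in range(resolution)
    ]

# Blowing a 100-pixel-wide bitmap up to a mile yields ~53-foot pixels...
print(bitmap_pixel_size(100, MILE_IN_FEET))  # 52.8
# ...while the same line as a vector re-rasterizes cleanly at any density.
print(len(rasterize_vector_line(0, 0, 1, 1, 1000)))  # 1000
```

The vector version never stores pixels at all, just endpoints; the pixels only appear at render time, at whatever density the display needs.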

5) If we want the ability to view a scene in all of its physical dimensions, we will need to capture the points in space (x,y,z coordinates/vectors) of as many elements as we need in order to re-create the scene. Take track events as portrayed in a movie like Chariots of Fire. In order to truly capture the event, we’ll need to track the spatial locations of every significant element. I would guess these to be the track, the starting line, the finish line and the runners.

6) This should be subdivided further, however. Not just the runners, but the various body parts of the runners: legs, arms, heads. Maybe fingers? How about the starter’s gun? The trigger on the starter’s gun? The finish line tape?
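What would one sample of such a capture even look like? Here’s a minimal sketch, assuming nothing more than an element name, a timestamp and a position; every name and number is hypothetical, not from any real capture format.

```python
# A hypothetical record for one tracked point at one moment in time.
from dataclasses import dataclass

@dataclass
class TrackedPoint:
    element: str   # e.g. "runner-1/hip" or "finish-tape/center"
    t: float       # seconds since the starting gun
    x: float       # spatial coordinates, in meters
    y: float
    z: float

# One frame of the race would be a list of these samples:
frame = [
    TrackedPoint("runner-1/hip", 0.0, 12.0, 0.90, 0.00),
    TrackedPoint("runner-1/toe", 0.0, 12.3, 0.02, 0.10),
    TrackedPoint("finish-tape/center", 0.0, 100.0, 1.00, 0.00),
]
```

The subdivision question in the paragraph above is just a question of how fine-grained the `element` names get: whole runners, or limbs, or fingers.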

7) We need to decide what’s truly important to capture: The runners, yes. The starting line and the finish line, yes. The crowd? Mmmm, maybe. Certainly films for decades have been using ‘standard crowd noise’ in place of recording actual crowds on the set of the film. Movies have been adding crowds to stadiums using mannequins, inflatables, or digital post-production. Maybe the specifics of the crowd are unnecessary for the scene.

8) We want to capture as much as possible, but we could extrapolate a number of other points from a small set. Perhaps we know where the starter’s gun is, but instead of keeping track of the official that is pulling the trigger, we simply estimate the height of a person that would be holding a gun at a certain angle and height and make an approximation of the official. We know how the ribbon at the finish line would move and float given the motions of the runners, the wind and the tautness of the tape. Do we need to know the exact location of a runner’s knee if we already know where their hips and toes are? Maybe, but we probably don’t have to know where the ankle is if we know where the heel and the knee are.
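The ankle example above can be sketched in a few lines. This is deliberately naive: a real skeleton model would enforce bone lengths and joint limits, and the 0.15 blend weight is a number I made up for illustration.

```python
# Toy extrapolation: guess an unmeasured joint from two measured ones.
# A real system would use bone-length constraints; this is a linear blend.

def estimate_ankle(heel, knee, weight=0.15):
    """Place the ankle a fixed fraction of the way from heel to knee.
    `heel` and `knee` are (x, y, z) tuples; `weight` is an assumed ratio."""
    return tuple(h + (k - h) * weight for h, k in zip(heel, knee))

heel = (12.30, 0.02, 0.10)
knee = (12.10, 0.48, 0.05)
print(estimate_ankle(heel, knee))
```

The payoff is bandwidth: every joint you can estimate this way is a point you don’t have to capture or transmit.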

9) Once we have those points in space, we can recreate the locations, but short of capturing the location of every thread of the clothing being worn or each lace of each shoe, we’re probably going to want to capture a ‘skin’ or a ‘texture map’ that would be wrapped around the skeletons (vectors) of the runners. The skin could be captured ahead of time, or could be extrapolated from a video feed. We’ve already seen projects that take varied photographs and collect them into a multi-faceted view of a single object. In much the same way, a set of stills taken over time could create a texture map.

10) That same capture of the texture maps could be used to extrapolate the x/y/z of the original skeletons. Today’s motion capture techniques rely on ping-pong balls taped to actors in green body suits and similar setups. Those configurations are simply work-arounds that allow us to capture the models easily with today’s technology, and they are ultimately unnecessary. Once we have the necessary visual processing tools, we can forgo the artificial setups and special configurations and rely on the original video captures.

11) This sort of capturing and transmission becomes possible once we move from thinking of capturing a flat plane of pixels to capturing the coordinates and texture maps of a scene. This information can still be captured by a single video camera, given enough processing power. But when we add a second camera, we can collect better textures and more accurate coordinates. Add a third and the quality of the capture increases again. Add a dozen and you’re capturing every detail needed to analyze an event in everyday scenarios.
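Why does the second camera help so much? With two rectified views, depth falls out of the disparity between matched points; this is the standard stereo formula, though the camera parameters below are invented for the example.

```python
# Standard rectified-stereo relation: depth = focal * baseline / disparity.
# The specific camera numbers are made up for illustration.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (meters) of a point seen by two horizontally offset cameras."""
    if disparity_px <= 0:
        raise ValueError("point not visible in both cameras")
    return focal_px * baseline_m / disparity_px

# Two cameras 0.5 m apart, 800-pixel focal length, 40-pixel disparity:
print(depth_from_disparity(800, 0.5, 40))  # 10.0 meters
```

Each additional camera adds more matched pairs like this, which is exactly why the coordinate estimates keep improving as cameras are added.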

12) What does all this offer? Imagine watching Chariots of Fire from the actual point of view of one of the runners. Or of an official. Or of the finish line tape, or of a runner’s shoe. Or from directly overhead. The number of possible perspectives is immense. Imagine changing the scene by adding a 100 mph wind to it. Or altering the track so it goes in a loop-de-loop.

13) And talk about scalability: if you want to transmit this scene to someone, you have the option of A) sending a fully rendered image, as you would to a current television; B) sending a pair of images to a stereoscopic video display (yes, that’s by my employer); C) sending a small set of the captured data to a cell phone or personal media device for a low-res, animation-style rendering; or D) sending a full feed of all the details to a computer-enabled display that could use a mouse or 3-D mouse to navigate around the scene.
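The four options above are really one captured scene served at different fidelities. A sketch of that tiering, with entirely hypothetical names and payload descriptions:

```python
# Hypothetical level-of-detail table: same capture, different payloads.

RENDER_TIERS = {
    "television":     {"streams": 1, "payload": "rendered frame"},
    "stereo_display": {"streams": 2, "payload": "rendered frame pair"},
    "mobile":         {"streams": 1, "payload": "reduced point set"},
    "full_3d_client": {"streams": 1, "payload": "all points + textures"},
}

def payload_for(device):
    """Pick the transmission format for a given display type."""
    return RENDER_TIERS[device]["payload"]

print(payload_for("mobile"))  # reduced point set
```

Only the last tier ships the raw coordinates and textures; every other tier renders or reduces on the server side, which is what makes the format scale from phones to full 3-D displays.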

14) Today we are capturing the equivalent of a single, low-quality texture map. Soon we will be capturing higher-quality single texture maps, but this is just a baby step forward. We need to build tools that will take those bitmaps and break them down into their component parts: skeleton vectors plus texture maps. We can blend in approximations of the missing texture, enhance the scene with up-close photos, and extrapolate to fill in the additional x,y,z coordinate points we’re missing. None of these techniques is outside of our reach.
