Image Occlusion Testing with Quartz/UIKit

Note 12/29/2017: This article was written several years ago and is in draft form; it was never published due to concerns over IP protection. Since those concerns are no longer relevant and the topic no longer fresh in my mind, I’m publishing it as-is for archival purposes.

Here at The Tap Lab we’re currently putting the finishing touches on our second game, Bigfoot Hunter. While you should definitely check it out, the basic mechanic in the game involves looking around an environment and snapping a photo of bigfoot when he runs out. Your score is based on how “good” the picture is, and one of the metrics by which you’re judged is whether the bigfoot in your photo is obstructed by trees, bushes, or other pieces of the environment. The environment consists of several “layers” at different depths, each composed of a variety of the aforementioned pieces. We currently use the standard iOS UI libraries (UIKit) to render the environment, so these layers are really just UIViews with UIImageViews as subviews. When a snapshot is taken, these layers are composited on top of each other to generate the image.

Determining how centered the bigfoot is and how fast he is moving (in order to score “blurriness”) is relatively straightforward and simply involves positional state data about the bigfoot. The final metric by which the player is scored–how much of the bigfoot is occluded by the environment–is a bit more difficult. The images themselves are rectangular, so bounding boxes are an obvious, simple, and performant solution. However, the images, especially things like trees with long branches, have large areas of transparency that make this approach far too inaccurate. The other solution that comes to mind immediately is looking at the image data itself and determining what percentage of bigfoot pixels are still visible after the composite image is rendered.
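
As a rough illustration, a bounding-box check boils down to intersecting two rectangles and comparing areas, which is exactly why large transparent regions fool it (the frame variables below are hypothetical, not from our actual code):

// Hypothetical sketch of the bounding-box approach; it ignores transparency entirely
CGRect overlap = CGRectIntersection(bigfootFrame, treeFrame);
CGFloat overlapArea = CGRectIsNull(overlap) ? 0 : overlap.size.width * overlap.size.height;
CGFloat bigfootArea = bigfootFrame.size.width * bigfootFrame.size.height;
CGFloat occludedFraction = (bigfootArea > 0) ? overlapArea / bigfootArea : 0;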

This approach is super-accurate but also resource-intensive. Images take up a lot of memory on mobile devices, and anything that involves looping through pixel data on the CPU is incredibly slow. We were hoping to allow the player to take many rapid-fire shots and score them instantly in the game; spending a full second calculating occlusion was not an acceptable user experience. The details of the pixel-based algorithm are also somewhat unclear at first glance. I would imagine that most engineers, when presented with this problem, would believe it is likely doable and relatively straightforward, but may not be able to articulate an implementation right away. Given the original bigfoot image and the composite image, you could compare each pixel in the bigfoot’s bounding box to see whether it has changed; but what if it was alpha-blended? What if the environment is the same or a similar color?

We had an existing pixel-based solution in the game, but it was resource-intensive and used a somewhat roundabout method to determine which pixels remained visible. My task was to come up with a more performant algorithm. I have done pixel manipulation in the past (on Windows) but was not very familiar with iOS/Objective-C and even less familiar with Quartz and UIKit. So my implementation below may not be the best, but it seems to work. If you are using an actual game engine for your game, which I recommend you do, it probably already contains a much faster implementation all ready to go. But if you’re stuck using UIKit for whatever reason, I hope this will be of help.

First off, a lightning-speed refresher on computer graphics if you are rusty or unfamiliar. A traditional image (JPEG, PNG, etc.; collectively known as a “bitmap image”) is made up of pixels, each with a particular color. There are a variety of ways of representing color, but the most common is 32-bit RGBA: one byte each for red, green, blue, and alpha (each with a value between 0 and 255), totaling 4 bytes per pixel. A bitmap image is then represented in memory as a long array of bytes, with every 4 bytes representing a pixel, starting from the top left and going left to right, top to bottom.
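
As a quick concrete example, the bytes for the pixel at column x, row y of an image that is width pixels wide sit at the following offsets (pixels, width, x, and y are hypothetical variables):

// 32-bit RGBA: 4 bytes per pixel, rows stored left to right, top to bottom
// (pixels, width, x, and y are hypothetical)
int byteIndex = (y * width + x) * 4;
UInt8 red   = pixels[byteIndex];
UInt8 green = pixels[byteIndex + 1];
UInt8 blue  = pixels[byteIndex + 2];
UInt8 alpha = pixels[byteIndex + 3];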

The alpha channel represents how transparent the pixel is, with 255 being opaque and 0 being fully transparent. When drawing images on top of each other, their colors are blended together based on the alpha value. This is how you get images in games (say, of trees) that are still rectangular but don’t have a white background; their opaque pixels are drawn over the background and obscure it, while their transparent areas just leave the existing background pixel as is. Parts of the image that are somewhere in the middle, like translucent glass, blend the background pixel with the image pixel into a new color representing both.
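
If it helps to see it as arithmetic, the usual “source over” blend for a single color channel looks roughly like the sketch below. This is a simplification with alpha normalized to the 0–1 range, not the exact premultiplied math Core Graphics performs internally (srcColor, dstColor, and srcAlpha are hypothetical values):

// Simplified "source over" blend for one color channel; alpha normalized to 0-1
float blended = srcColor * srcAlpha + dstColor * (1.0f - srcAlpha);
// srcAlpha of 1.0 means the foreground pixel completely covers the background;
// srcAlpha of 0.0 leaves the background pixel untouched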

In order to do any drawing on most operating systems, you need a graphics context, which represents a drawing surface stored in a chunk of memory and the various state that accompanies it (dimensions, color format, etc). This context often but not always represents a part of the screen being displayed to the user; in our case, we will simply be performing drawing operations in memory and the user will never see it. In order to access the actual pixel data in that memory, we will be using a particular kind of context called a Bitmap Context.

The abstract version of the occlusion algorithm is as follows:

1. Create a context the size of the bigfoot image and draw the bigfoot image into it. Loop through each pixel and add up the alpha channel values. This represents the total area of the original bigfoot image that is visible when unobstructed. There are other ways to do this (for example, normalizing each pixel to a 1 or 0) but this is easiest for now.

2. Set the context’s blend mode to kCGBlendModeDestinationOut. This is one of many ways to blend one image on top of another based on their alpha values. This particular blend function has the effect of “image subtraction”: anything rendered afterward that would normally occlude an existing pixel instead “erases” it by setting it to transparent. The result is that only unoccluded pixels remain in the context.

3. Render any foreground images into the context using the aforementioned blend mode, making sure to offset them to mirror their position relative to the bigfoot image within the game. This will leave only the visible bigfoot pixels in the context.

4. Loop through the pixels in the context and add up the remaining visible alpha values as in Step 1. If the bigfoot image was only partially visible in the snapshot (and thus was “occluded” by the camera bounds), make sure to only count pixels in the visible region. We now have a visible vs total count which we can use to calculate the percentage of the image that is visible.

The above algorithm is accurate but slow, the main culprit being the looping through pixels in steps 1 and 4. Modern images are quite high-resolution, and even the tightest loop with that many iterations on the CPU is going to take some time (versus the massively parallel capabilities of the GPU). While there probably is a way to utilize the GPU for such an operation, it likely involves using OpenGL and possibly shaders, and if you’re using UIKit for your graphics you’re probably hoping to avoid refactoring to integrate OpenGL.

So how do we speed this up? Reducing the number of loop iterations, and thus the number of pixels, is a fairly obvious solution, and happens to be quite effective. If we simply scale the image down, we can reduce the number of pixel tests we have to do by orders of magnitude without significantly reducing the accuracy. In fact, since this particular graphics context will never be displayed to the user, we can actually scale the image down pretty dramatically without having to worry what it looks like.

Scaling could be accomplished by, say, only testing every 4th or 8th pixel, or even some sort of random sampling. But memory usage is also a concern, so instead of rendering the full-res images into the graphics context, we can kill two birds with one stone and draw only a scaled-down version of each image, saving space. Now you might ask: why use any extra memory at all, instead of just sampling the source images side by side and comparing their alpha values? While there may be a way to accomplish this, from what I found it appears that the only way to access an image’s underlying pixel data is to draw it into a separate bitmap context.
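
For comparison, the sampling route might look something like this sketch, which tests every 4th pixel in each direction; bitmapData, width, and height are hypothetical, and this isn’t the version we shipped:

// Hypothetical sketch: sample every 4th pixel instead of scaling the images down
int stride = 4;
long sampledAlpha = 0;
for (int y = 0; y < height; y += stride) {
    for (int x = 0; x < width; x += stride) {
        sampledAlpha += bitmapData[(y * width + x) * 4 + 3];
    }
}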

Side note: My original implementation copied only the alpha channel into the context (effectively grayscale), saving even more space since the color channels were irrelevant. However, this color-space conversion seems to be a rather intensive operation under the hood and ended up degrading performance significantly. It also doesn’t lend itself well to outputting PNGs for debugging purposes, so I kept the color channels.
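
For reference, an alpha-only context can be created along these lines. This is just a rough sketch of that earlier approach (width and height are hypothetical), and as noted it turned out slower for us than plain RGBA:

// Alpha-only bitmap context: 1 byte per pixel, no color space required
// (width and height are hypothetical)
UInt8 *alphaData = calloc(width * height, 1);
CGContextRef alphaContext = CGBitmapContextCreate(alphaData, width, height, 8, width, NULL, (CGBitmapInfo)kCGImageAlphaOnly);
// ...draw images and sum the single-byte alpha values as before...
CGContextRelease(alphaContext);
free(alphaData);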

So let’s get to the gritty details. Assume we have a method that takes as input two images, one representing the background (i.e., the bigfoot being occluded) and one representing the combined foreground (the things occluding the bigfoot). In practice the foreground could actually be an array of images drawn into the context as described in step 3; for the purposes of this article we’ll just use one. (We’re using CGImageRefs because most of the low-level graphics functions we’ll be using are in Quartz; to get a CGImageRef from a UIImage, simply call uiimage.CGImage.)

We also pass in CGRects representing the position and size of the two images so that we can determine where they are relative to each other. This may not be necessary for every use case, but if you are working on a game with elements moving around the screen, remember that images themselves do not encode any information about their location (they do, however, know their height and width, which you can get from CGImageGetWidth/CGImageGetHeight). We also take in the bounds of the snapshot (the visible screen area) in case the background image is partially off-screen. Finally, we pass in scaleFactor, which specifies how much we want to reduce the resolution of the images to increase performance, with 1 being no reduction. The higher this number, the faster but less accurate the algorithm.

- (double)occlusionPercentageForBackground:(CGImageRef)background foreground:(CGImageRef)foreground snapshotFrame:(CGRect)snapshotFrame backgroundFrame:(CGRect)backgroundFrame foregroundFrame:(CGRect)foregroundFrame scaleFactor:(float)scaleFactor {

First, we apply a scaling transform to all of the geometry involved, essentially transforming our entire world space into a lower resolution to increase performance. Unlike some graphics libraries, Core Graphics applies the transform to the origin as well as the size of the frame, so all of the geometry maintains its position relative to everything else; no manual adjustment is necessary.

// Scale down
CGAffineTransform bitmapScale = CGAffineTransformMakeScale(1.0 / scaleFactor, 1.0 / scaleFactor);

snapshotFrame = CGRectApplyAffineTransform(snapshotFrame, bitmapScale);
foregroundFrame = CGRectApplyAffineTransform(foregroundFrame, bitmapScale);
backgroundFrame = CGRectApplyAffineTransform(backgroundFrame, bitmapScale);

In order to determine what percentage of the bigfoot (background) is visible, we have to know how much space he occupies unoccluded. Therefore we must render the entire bigfoot, even if not all of him is in the snapshot frame. Additionally, nothing outside of the bigfoot’s image dimensions matters; a tree may cover the entire length of the game world, but we only care about the branch that passes through the bigfoot’s frame. This essentially puts the bigfoot image at the center of our working space, where “working space” refers to the chunk of memory (context) we will be performing our bitmap operations in. It also means that our context needs to be no larger than the size of the bigfoot image.

With that in mind, we position the origin of the bigfoot frame at (0,0) so that when we draw it, it will fit perfectly into the context. We must also offset the foreground and snapshot frames, though, so that when we draw them into the context later, they will still be positioned properly relative to the bigfoot. Most of the foreground, as we mentioned, won’t even fit into the context since it doesn’t pass through the bigfoot’s frame; that is okay, as it will simply be clipped and won’t take up space or CPU cycles.

// Make everything relative to bigfoot, set bigfoot to 0, 0
snapshotFrame = CGRectOffset(snapshotFrame, -backgroundFrame.origin.x, -backgroundFrame.origin.y);
foregroundFrame = CGRectOffset(foregroundFrame, -backgroundFrame.origin.x, -backgroundFrame.origin.y);
backgroundFrame = CGRectMake(0, 0, (int)backgroundFrame.size.width, (int)backgroundFrame.size.height);

We now calculate the size of the bitmap context we’ll be drawing into and create it. As we just discussed, the context needs to be big enough to fit the bigfoot’s image but no more than that. Since each pixel takes up 4 bytes (red, green, blue, alpha), the size in bytes of each row of the context will be the width of the bigfoot image times 4, and the total size will be the size of a row times the height of the image.

int bytesPerRow = backgroundFrame.size.width * 4;
int totalBytes = bytesPerRow * backgroundFrame.size.height; 

// Create context
UInt8 *bitmapData = calloc(totalBytes, 1);
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();

// For more information, see docs for CGBitmapContextCreate()
CGContextRef bitmapContext = CGBitmapContextCreate(bitmapData, backgroundFrame.size.width, backgroundFrame.size.height, 8, bytesPerRow, colorSpace, (CGBitmapInfo)kCGImageAlphaPremultipliedLast);

CGColorSpaceRelease(colorSpace);

Next we draw the bigfoot image into the context. With the origin at 0,0 and the context sized to be the same dimensions, the bigfoot should fit perfectly. Remember that we’ve scaled down the frame representing the bigfoot (backgroundFrame); it is now smaller than the actual image contained in the background variable. This will cause Core Graphics to downsample the image as it draws it into the context.

Once the bigfoot is in the context, we need to add up the total amount that is “visible” (non-transparent) before the bigfoot is obstructed by foreground elements. There are a number of ways one could do this; the simplest is to just add up the alpha value of each pixel (remember, 0 alpha is transparent and thus not visible, 255 is fully opaque/visible). Recall also that the alpha channel is the fourth byte in each pixel, so when we loop through the bytes in the bitmap context, we increment by 4 each time.

// Draw background and count total alpha
CGContextDrawImage(bitmapContext, backgroundFrame, background);

int totalAlpha = 0;
int visibleAlpha = 0;  

for(int i = 0; i < totalBytes; i+=4) {
    totalAlpha += bitmapData[i+3];
}

Now we need to draw the foreground elements into the context, on top of the bigfoot, positioned relative to the bigfoot as they are in the game. But first we need to set the blend mode so that an opaque pixel from a foreground image, rather than overwriting the underlying bigfoot pixel, simply “erases” it. This lets us be sure that any remaining pixels represent visible portions of the bigfoot, and not some foreground element. The blend mode is named kCGBlendModeDestinationOut, which is really just a constant corresponding to the standard OpenGL blend factor pair {GL_ZERO, GL_ONE_MINUS_SRC_ALPHA}: the source (foreground) pixel is ignored, and the destination (bigfoot) pixel is multiplied by one minus the source pixel’s alpha, i.e. D*(1-Sa).

// Draw foreground over background with special blend mode
CGContextSetBlendMode(bitmapContext, kCGBlendModeDestinationOut);
CGContextDrawImage(bitmapContext, foregroundFrame, foreground);

All that remains now is to count up how much of the bigfoot remains visible in the context, using the same method as before (add up the alpha values). The one caveat here is that, while we needed to render the entire bigfoot to get the total potentially visible area, it is possible the entire bigfoot was not in the viewport (snapshotFrame) when the picture was taken. Thus we do not want to count pixels that lie outside the snapshot, even if they are not obstructed by the environment. This may not be an issue for all games, in which case you could skip this part. Keep in mind though that this is also a performance improvement, as we only need to loop through a subset of pixels in the context.

There are many ways to do this. We could perform another graphics operation on the context in which we shift the pixels relative to the snapshot, pushing the non-visible pixels out of the buffer. We could also render a mask or use a similar cropping operation. All of these solutions, though, require another graphics op and, unless you create a new, smaller context (yet another op), leave you looping through extra, empty pixels that fall outside the snapshot frame.

The better solution, in my opinion, is simply to restrict your pixel testing to only the portion of the context that represents the overlap between the bigfoot frame and the snapshot frame, which can be calculated like so:

// Count up visible background only in snapshot area
CGRect overlapFrame = CGRectIntersection(snapshotFrame, backgroundFrame);

// At this point, all of the CGRects have origins in the bottom-left, but the image bytes start from the top-left. So
// we flip the Y-axis of the overlap frame (no need to do the background frame because it defines the origin)
overlapFrame.origin.y = backgroundFrame.size.height - overlapFrame.origin.y - overlapFrame.size.height;

As the comments describe, we also need to flip the Y-axis of our overlap frame due to the different origins of the context and CGRects.

Next, we loop through each pixel location in the overlap frame. The tricky part here is translating the (x,y) coords in the space of the overlap rect to an array index in the memory of the bitmap context. Normally, the equation is index = y * width + x, which makes sense if the memory represents the image from left to right, top to bottom. If you look at the offset calculation, you can see that we add the x- and y-offset of the overlap’s origin to the equation to account for the fact that the overlap frame does not necessarily align with the origin of the context’s frame. We then take this offset and multiply it by 4 (since each pixel is 4 bytes) and add 3 (since the alpha value is the fourth byte) to get the alpha value.

Note that because we scaled the frames earlier, we have already taken into account the fact that the context is at a lower resolution than the images originally passed in. It also means that, depending on our scale factor, we are looping through considerably fewer pixels than we would be otherwise, without a significant loss of accuracy.

for(int y = 0; y < overlapFrame.size.height; y++) {
    for(int x = 0; x < overlapFrame.size.width; x++) {
        int offset = (y + overlapFrame.origin.y) * backgroundFrame.size.width + overlapFrame.origin.x + x;
        visibleAlpha += bitmapData[offset*4+3];
    }
}

At this point you might find it useful to save out the context to an image file for debugging purposes. The following snippet will save a copy to your photo album if you’re running on an iOS device. There are similar means of saving to a local file in OS X.

// DEBUG
BOOL TEST_IMAGE = YES;

if (TEST_IMAGE) {
    CGImageRef cgimage = CGBitmapContextCreateImage(bitmapContext);
    UIImage *image = [UIImage imageWithCGImage:cgimage];
    UIImageWriteToSavedPhotosAlbum(image, nil, nil, nil);
    CGImageRelease(cgimage); 
}

Now that we have the before and after alpha sums, we can clean up our allocated memory and return the ratio of visible to total alpha, i.e. the fraction of the bigfoot that is visible.

CGContextRelease(bitmapContext);
free(bitmapData);

return (double)visibleAlpha / (double)totalAlpha;
}
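
For completeness, a call site might look roughly like this; the views, images, and scale factor here are hypothetical:

// Hypothetical usage: score one snapshot at 1/4 resolution
double visibleFraction = [self occlusionPercentageForBackground:bigfootImage.CGImage
                                                      foreground:treeImage.CGImage
                                                   snapshotFrame:cameraView.frame
                                                 backgroundFrame:bigfootImageView.frame
                                                 foregroundFrame:treeImageView.frame
                                                     scaleFactor:4.0f];
// The method returns the fraction of the bigfoot still visible; invert it if you want "how occluded"
double occludedPercentage = (1.0 - visibleFraction) * 100.0;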

The result is a reasonably fast means of occlusion testing that can be optimized depending on the accuracy needed versus the speed.

 
