This is the first post of the year 2017.

## Recent work introduction

Recently, I have been working on a project to boost the speed of SURF (Speeded-Up Robust Features). The project focuses mainly on image stitching and demands real-time performance.

At the very beginning, with the help of the source code of clsurf, I merged it into my project and easily got about a 100x speedup over the previous version through GPU acceleration.

Here is the initial time breakdown of my image-stitching scheme (**since the project is, to some extent, confidential and not yet released, I can't show the exact times**):

Procedure | Time percentage per frame | Device or Host
---|---|---
Detection of keypoints | 3% | Device
Generating descriptors | 3% | Device
IO (transferring keypoints and descriptors) | 10% | Device to Host
Feature matching | 2% | Host
Calculating the perspective transform matrix | 2% | Host

## Problem

However, after some tests, I found that the pipeline still couldn't meet the requirement. The detection of keypoints and the generation of descriptors were already pushed nearly to their limits, while transferring data from device to host (the IO step) took about 10% of the time per frame, making it the most dominant part of all.

## Solution

After several rounds of profiling and reviewing the source code, I noticed that the descriptors were transferred back to the host (my computer here) and matched there to find pairs between the descriptors of consecutive frames. That inspired me to wonder: why not do the matching on the device and transfer only the matching pairs back, which are far fewer bytes? This actually seems plausible, because each matching pair is calculated independently and there is no exclusive memory access between work-items, which is the key to implementing it on a GPU device.

Before showing my kernel functions, let's see how the matching of feature descriptors is implemented in the regular way, in pseudocode:

```
# Step 1: find the best and 2nd-best (better) feature descriptors
for each descriptor d1 in the current frame:
    for each descriptor d2 in the last frame:
        compute the Euclidean distance between d1 and d2
        keep the best (smallest) and 2nd-best distances for d1
# Step 2: compute the ratio best / 2nd-best and reject ambiguous matches
```
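The regular procedure can be sketched as runnable Python for reference. The threshold value and the two-dimensional toy descriptors are illustrative assumptions, not the project's actual values:

```python
import math

# Typical ratio-test threshold; the project's actual value is not shown here.
THRESHOLD = 0.8

def euclidean_distance(d1, d2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_descriptors(current, last):
    """Return (i, j) index pairs of descriptors passing the ratio test."""
    pairs = []
    for i, d1 in enumerate(current):
        # Step 1: find the best and 2nd-best (better) distances for d1
        best_j, best, better = -1, float("inf"), float("inf")
        for j, d2 in enumerate(last):
            dist = euclidean_distance(d1, d2)
            if dist < best:
                best_j, better, best = j, best, dist
            elif dist < better:
                better = dist
        # Step 2: reject ambiguous matches via the ratio test
        if better > 0 and best / better < THRESHOLD:
            pairs.append((i, best_j))
    return pairs
```

Note the inner loop over all descriptors of the last frame: it is this serial double loop that the kernels below break into independent work-items.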

### Kernel functions of OpenCL

As my initial attempt to enhance its performance, I divided the algorithm above into several kernels, so that it could be executed as a large number of work-items on the GPU. Here is what these kernel functions do exactly:

- **Kernel for calculating distances**: mainly calculates the Euclidean distance between descriptors of consecutive frames (almost the same as before).
- **Kernel for ranking**: instead of finding the best and 2nd-best (better) distances directly, I use counters that record, for each distance, how many distances are smaller than it. For example, if the distances are d(0)=2, d(1)=4, d(2)=3, then the counters are c(0)=0, c(1)=2, c(2)=1. This ranks the distances indirectly. Since the GPU is not good at branch-heavy logical comparison, this formulation suits it well: the GPU excels at doing simple jobs on a large scale thanks to its many computing units. As soon as one work-item (an instance of the kernel) finds its counter equal to 0, it must hold the smallest, i.e. best, distance; likewise the item whose counter equals 1 holds the 2nd-best. These items are saved for the next kernel.
- **Kernel for calculating the ratio and rejecting bad features**: an easy one, same as before.
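The ranking kernel's counting trick can be simulated in Python, with one loop iteration standing in for one GPU work-item (the real implementation is an OpenCL kernel, which I can't show here):

```python
def rank_by_counting(distances):
    """For each distance, count how many other distances are smaller.
    Counter 0 marks the best (smallest) distance, counter 1 the 2nd-best."""
    n = len(distances)
    counters = [0] * n
    for i in range(n):          # each i would be one GPU work-item
        for j in range(n):
            if distances[j] < distances[i]:
                counters[i] += 1
    return counters

# Example from the text: d(0)=2, d(1)=4, d(2)=3
counters = rank_by_counting([2.0, 4.0, 3.0])   # -> [0, 2, 1]
best = counters.index(0)      # index of the smallest distance
second = counters.index(1)    # index of the 2nd-smallest distance
```

Each counter is computed with nothing but independent comparisons and additions, which is exactly the kind of uniform, branch-light work the GPU handles well.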

### Comparison

Compared with the regular way, my implementation improves two procedures:

1. The distances between descriptors are calculated independently in parallel instead of serially.
2. Finding the best and 2nd-best distances is the key step of the algorithm. On a CPU, finding the k-th smallest element of a set of numbers can be done in expected O(N) time, for example with quickselect, a partition-exchange algorithm related to quicksort. On a GPU, however, such branch-heavy, data-dependent algorithms are a poor fit. Splitting the work into simple uniform tasks, counting the rank of each distance, turns out to be a novel way to solve it.

### Reduction of the data transfer

After executing these kernels, we transfer only the matching pairs from the device to the host. Previously, we transferred about (m + n) x 64 x 4 bytes (m: number of the current frame's descriptors, n: number of the last frame's descriptors; 64 is the length of a descriptor and 4 is the size of a float). Now we need only m x 2 x 2 bytes (m is the number of the current frame's features; 2 x 2 means each pair contains two int16 indices, since the number of features per frame normally fits in an int16). The ratio is (256 x (m + n)) / (4 x m) = 64 x (m + n) / m, and for consecutive frames of a video stream m is nearly equal to n, so it simplifies to (64 x 2 x m) / m = 128. In other words, we transfer roughly 1/128 of the data.
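The arithmetic above can be checked numerically. The feature counts below are arbitrary illustrative values, and the function name is my own:

```python
def transfer_ratio(m, n, desc_len=64, float_bytes=4, pair_bytes=4):
    """Bytes transferred before (full descriptors of both frames)
    divided by bytes transferred after (one int16 pair per feature)."""
    before = (m + n) * desc_len * float_bytes   # descriptors of both frames
    after = m * pair_bytes                      # 2 x int16 = 4 bytes per pair
    return before / after

# With m == n the ratio collapses to 64 * 2 = 128
ratio = transfer_ratio(1000, 1000)   # -> 128.0
```

Even when m and n differ somewhat between frames, the ratio stays close to two orders of magnitude, which is why the IO step shrinks so dramatically.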

### Result

I have profiled the functions again. The matching part is now almost ten times faster, taking less than 1 ms on the desktop. We also see a great decline in IO time, roughly a factor of ten (the relation between data size and transfer time is apparently not linear).

## Conclusion and more info

The matching of features is very common in applications that stitch or transform images. Usually the matching pairs are found on the host (CPU) and rarely on the GPU. In this case, however, I happened to find a novel way that works well when the requirements on time consumption, especially IO time, are demanding. Since the project is not released yet, I can't show the source code of the implementation, but I may publish it on my GitHub in the future. Finally, as I am still a rookie in OpenCL, I haven't yet learned much about optimizations such as local work-groups or synchronization. If you have a better solution, or if my solution is unclear, feel free to contact me.