I recently learned CUDA and was looking for a project to practice on. Profiling some code, I found the main performance bottleneck was `cv::pyrMeanShiftFiltering`. I originally planned to hack together a GPU version myself, but then discovered someone had already written `cuda::meanShiftSegmentation`, so I changed plans and read the source code instead.

### mean shift

Mean shift, simply put: draw a circle, compute the mean of the points inside it, then move the circle toward that mean.

(Figure from the OpenCV official documentation)

The idea is simple and intuitive, but when applying it you can adapt it in all sorts of creative ways.

I once used mean shift to automatically bucket data after clustering. For example, the registration times of buyers of the same product can be split into a few groups of new and old users. Since each product's distribution is different, a fixed threshold can't be hard-coded, and the number of clusters isn't fixed either. Mean shift neatly answers both how many buckets there are and where to split them.

Back to images: a picture is rows × cols pixels, which looks nothing like the scatter plot in the OpenCV example. How do you mean-shift an image? Each pixel has 5 dimensions: color space (R, G, B) and position space (x, y). I used to think mean shift was applied to the color space alone, but it is not.

Let's see how `cv::pyrMeanShiftFiltering` handles it.

### cv::pyrMeanShiftFiltering source code

Pyr refers to image pyramids. How does mean shift work on a pyramid?

```cpp
/****************************************************************************************\
*                                       Meanshift                                        *
\****************************************************************************************/

void cv::pyrMeanShiftFiltering( InputArray _src, OutputArray _dst,
                                double sp0, double sr, int max_level,
                                TermCriteria termcrit )
{
    CV_INSTRUMENT_REGION()

    Mat src0 = _src.getMat();

    if( src0.empty() )
        return;

    _dst.create( src0.size(), src0.type() );
    Mat dst0 = _dst.getMat();

    // cn = channel count
    const int cn = 3;
    const int MAX_LEVELS = 8;

    if( (unsigned)max_level > (unsigned)MAX_LEVELS )
        CV_Error( CV_StsOutOfRange, "The number of pyramid levels is too large or negative" );

    std::vector<cv::Mat> src_pyramid(max_level+1);
    std::vector<cv::Mat> dst_pyramid(max_level+1);
    cv::Mat mask0;
    int i, j, level;
    //uchar* submask = 0;

    // cdiff: is the squared Euclidean color distance between (c0,c1,c2) and the
    // pixel at dptr[ofs0] at least isr22? Kept squared, no square root taken.
#define cdiff(ofs0) (tab[c0-dptr[ofs0]+255] + \
    tab[c1-dptr[(ofs0)+1]+255] + tab[c2-dptr[(ofs0)+2]+255] >= isr22)

    double sr2 = sr * sr;
    // why a floor of 16? presumably because sr needs some minimum effective value here
    int isr2 = cvRound(sr2), isr22 = MAX(isr2,16);
    int tab[768];

    if( src0.type() != CV_8UC3 )
        CV_Error( CV_StsUnsupportedFormat, "Only 8-bit, 3-channel images are supported" );

    if( src0.type() != dst0.type() )
        CV_Error( CV_StsUnmatchedFormats, "The input and output images must have the same type" );

    if( src0.size() != dst0.size() )
        CV_Error( CV_StsUnmatchedSizes, "The input and output images must have the same size" );

    if( !(termcrit.type & CV_TERMCRIT_ITER) )
        termcrit.maxCount = 5;
    termcrit.maxCount = MAX(termcrit.maxCount,1);
    termcrit.maxCount = MIN(termcrit.maxCount,100);
    if( !(termcrit.type & CV_TERMCRIT_EPS) )
        termcrit.epsilon = 1.f;
    termcrit.epsilon = MAX(termcrit.epsilon, 0.f);

    // precomputed table of squared differences: tab[0] is (-255)^2, tab[255] is 0^2
    // since colors are 0-255, 512 entries would seem enough -- why 768?
    for( i = 0; i < 768; i++ )
        tab[i] = (i - 255)*(i - 255);

    // 1. construct pyramid
    src_pyramid[0] = src0;
    dst_pyramid[0] = dst0;
    for( level = 1; level <= max_level; level++ )
    {
        src_pyramid[level].create( (src_pyramid[level-1].rows+1)/2,
                                   (src_pyramid[level-1].cols+1)/2, src_pyramid[level-1].type() );
        dst_pyramid[level].create( src_pyramid[level].rows,
                                   src_pyramid[level].cols, src_pyramid[level].type() );
        cv::pyrDown( src_pyramid[level-1], src_pyramid[level], src_pyramid[level].size() );
        //CV_CALL( cvResize( src_pyramid[level-1], src_pyramid[level], CV_INTER_AREA ));
    }

    // allocate the mask once; it is reused across levels
    mask0.create(src0.rows, src0.cols, CV_8UC1);
    //CV_CALL( submask = (uchar*)cvAlloc( (sp+2)*(sp+2) ));

    // 2. apply meanshift, starting from the pyramid top (i.e. the smallest layer)
    for( level = max_level; level >= 0; level-- )
    {
        cv::Mat src = src_pyramid[level];
        cv::Size size = src.size();
        const uchar* sptr = src.ptr();
        int sstep = (int)src.step;
        uchar* mask = 0;
        int mstep = 0;
        uchar* dptr;
        int dstep;
        float sp = (float)(sp0 / (1 << level));
        sp = MAX( sp, 1 );

        if( level < max_level )
        {
            cv::Size size1 = dst_pyramid[level+1].size();
            cv::Mat m( size.height, size.width, CV_8UC1, mask0.ptr() );
            // step is the number of bytes to skip to reach the next image row
            dstep = (int)dst_pyramid[level+1].step;
            // the 8-neighbour distances computed below would read past the
            // border, so the one-pixel edge is skipped
            dptr = dst_pyramid[level+1].ptr() + dstep + cn;
            mstep = (int)m.step;
            mask = m.ptr() + mstep;
            //cvResize( dst_pyramid[level+1], dst_pyramid[level], CV_INTER_CUBIC );
            cv::pyrUp( dst_pyramid[level+1], dst_pyramid[level], dst_pyramid[level].size() );
            m.setTo(cv::Scalar::all(0));

            // the mask is built at half the size of the current level's dptr, so
            // it has to be doubled in both x and y: rows advance by mstep*2, and
            // the column index below is [j*2 - 1]. The final dilate removes the
            // artifacts of this scaling.
            // Why double the small mask directly instead of upscaling with linear
            // interpolation? Interpolating a binary edge would produce gray
            // values like 0.8, while doubling plus dilate keeps the mask binary,
            // which is what is required here.
            for( i = 1; i < size1.height-1; i++, dptr += dstep - (size1.width-2)*3, mask += mstep*2 )
            {
                for( j = 1; j < size1.width-1; j++, dptr += cn )
                {
                    int c0 = dptr[0], c1 = dptr[1], c2 = dptr[2];
                    // is the color distance to any of the 8 surrounding pixels large?
                    // * * *
                    // * 0 *
                    // * * *
                    mask[j*2 - 1] = cdiff(-3) || cdiff(3) || cdiff(-dstep-3) || cdiff(-dstep) ||
                        cdiff(-dstep+3) || cdiff(dstep-3) || cdiff(dstep) || cdiff(dstep+3);
                }
            }

            // an empty Mat as the kernel means the default 3x3
            cv::dilate( m, m, cv::Mat() );
            mask = m.ptr();
        }

        // the pyramid's only role is to generate this mask
        dptr = dst_pyramid[level].ptr();
        dstep = (int)dst_pyramid[level].step;

        for( i = 0; i < size.height; i++, sptr += sstep - size.width*3,
                                          dptr += dstep - size.width*3,
                                          mask += mstep )
        {
            for( j = 0; j < size.width; j++, sptr += 3, dptr += 3 )
            {
                int x0 = j, y0 = i, x1, y1, iter;
                int c0, c1, c2;

                // at the top level mask is NULL, so nothing is skipped;
                // stopping where the mask is 0 preserves the image edges
                if( mask && !mask[j] )
                    continue;

                c0 = sptr[0], c1 = sptr[1], c2 = sptr[2];

                // iterate meanshift procedure
                for( iter = 0; iter < termcrit.maxCount; iter++ )
                {
                    const uchar* ptr;
                    int x, y, count = 0;
                    int minx, miny, maxx, maxy;
                    int s0 = 0, s1 = 0, s2 = 0, sx = 0, sy = 0;
                    double icount;
                    int stop_flag;

                    // Note the detail of how mean shift is done here: a square
                    // window (not a circle) is taken in position space, and only
                    // the pixels inside it with a similar enough color contribute
                    // (after this filtering it starts to resemble the figure in
                    // the OpenCV example). Then shift toward the mean. Simpler,
                    // or perhaps just easier to implement. Either way the idea is
                    // direct: follow each point home to its mode.

                    //mean shift: process pixels in window (p-sigmaSp)x(p+sigmaSp)
                    minx = cvRound(x0 - sp); minx = MAX(minx, 0);
                    miny = cvRound(y0 - sp); miny = MAX(miny, 0);
                    maxx = cvRound(x0 + sp); maxx = MIN(maxx, size.width-1);
                    maxy = cvRound(y0 + sp); maxy = MIN(maxy, size.height-1);
                    ptr = sptr + (miny - i)*sstep + (minx - j)*3;

                    for( y = miny; y <= maxy; y++, ptr += sstep - (maxx-minx+1)*3 )
                    {
                        int row_count = 0;
                        x = minx;
                        #if CV_ENABLE_UNROLLED
                        // the unrolling is a CPU optimization: four pixels of
                        // contiguous memory are handled per loop iteration
                        for( ; x + 3 <= maxx; x += 4, ptr += 12 )
                        {
                            int t0 = ptr[0], t1 = ptr[1], t2 = ptr[2];
                            if( tab[t0-c0+255] + tab[t1-c1+255] + tab[t2-c2+255] <= isr2 )
                            {
                                s0 += t0; s1 += t1; s2 += t2;
                                sx += x; row_count++;
                            }
                            t0 = ptr[3], t1 = ptr[4], t2 = ptr[5];
                            if( tab[t0-c0+255] + tab[t1-c1+255] + tab[t2-c2+255] <= isr2 )
                            {
                                s0 += t0; s1 += t1; s2 += t2;
                                sx += x+1; row_count++;
                            }
                            t0 = ptr[6], t1 = ptr[7], t2 = ptr[8];
                            if( tab[t0-c0+255] + tab[t1-c1+255] + tab[t2-c2+255] <= isr2 )
                            {
                                s0 += t0; s1 += t1; s2 += t2;
                                sx += x+2; row_count++;
                            }
                            t0 = ptr[9], t1 = ptr[10], t2 = ptr[11];
                            if( tab[t0-c0+255] + tab[t1-c1+255] + tab[t2-c2+255] <= isr2 )
                            {
                                s0 += t0; s1 += t1; s2 += t2;
                                sx += x+3; row_count++;
                            }
                        }
                        #endif
                        // this loop is not an else branch of the unrolled one:
                        // ptr has already advanced, so it only handles the tail
                        for( ; x <= maxx; x++, ptr += 3 )
                        {
                            int t0 = ptr[0], t1 = ptr[1], t2 = ptr[2];
                            if( tab[t0-c0+255] + tab[t1-c1+255] + tab[t2-c2+255] <= isr2 )
                            {
                                s0 += t0; s1 += t1; s2 += t2;
                                sx += x; row_count++;
                            }
                        }
                        count += row_count;
                        sy += y*row_count;
                    }

                    if( count == 0 )
                        break;

                    icount = 1./count;
                    x1 = cvRound(sx*icount);
                    y1 = cvRound(sy*icount);
                    s0 = cvRound(s0*icount);
                    s1 = cvRound(s1*icount);
                    s2 = cvRound(s2*icount);

                    stop_flag = (x0 == x1 && y0 == y1) || std::abs(x1-x0) + std::abs(y1-y0) +
                        tab[s0 - c0 + 255] + tab[s1 - c1 + 255] +
                        tab[s2 - c2 + 255] <= termcrit.epsilon;

                    x0 = x1; y0 = y1;
                    c0 = s0; c1 = s1; c2 = s2;

                    if( stop_flag )
                        break;
                }

                dptr[0] = (uchar)c0;
                dptr[1] = (uchar)c1;
                dptr[2] = (uchar)c2;
            }
        }
    }
}
```

After reading through the source, the concrete implementation of mean shift is much clearer. Back to the initial question: what is the pyramid for? Its only role is to generate a mask. The mask is computed from the smallest level upward: each mean shift pass consults the mask from the smaller level, which extracts the image edges, and stops when it reaches those edges. With the pyramid, edges at multiple scales are preserved, so the blocks produced by mean shift stay relatively small and the color levels look "richer".

But if the image's gradients are rich, the result can actually look muddled, as if blurred by a Gaussian.

The CUDA version does not implement this pyramid feature, but it does not affect the result much, so it is not a big problem.

### cuda::meanShiftSegmentation

Mean shift at each pixel does not depend on the results at other pixels, which makes it very well suited to GPU parallelization.

```cpp
void cv::cuda::meanShiftSegmentation(InputArray _src, OutputArray _dst, int sp, int sr, int minsize,
                                     TermCriteria criteria, Stream& stream)
{
    GpuMat src = _src.getGpuMat();

    CV_Assert( src.type() == CV_8UC4 );

    const int nrows = src.rows;
    const int ncols = src.cols;
    const int hr = sr;
    const int hsp = sp;

    // Perform mean shift procedure and obtain region and spatial maps.
    // Unlike the CPU version, besides the converged color (d_rmap) this also
    // stores the final (x, y) position of each pixel's mean shift (d_spmap).
    GpuMat d_rmap, d_spmap;
    cuda::meanShiftProc(src, d_rmap, d_spmap, sp, sr, criteria, stream);
    stream.waitForCompletion();

    Mat rmap(d_rmap);
    Mat spmap(d_spmap);

    Graph<SegmLinkVal> g(nrows * ncols, 4 * (nrows - 1) * (ncols - 1)
                                        + (nrows - 1) + (ncols - 1));

    // a large amount of code follows; worth reading in the source
```

The standard mean shift part is fairly easy to follow; the bulk of the work is in the follow-up. For example, DjSet is a disjoint set (union-find), which should be familiar from ACM-style contests. A graph is also built.

In fact, each pixel generates the following four edges: one to the right, one below, and both diagonals below. Then the rightmost column still needs its downward edges added, and the bottom row its rightward edges.

In essence, it is flood fill with a spatial constraint added, with the small components merged at the end. So while the GPU version drops the pyramid, it merges color blocks in return.

### Comparison

- CPU: 1540 msecs
- GPU: 83 msecs