Wenshuai Ye, Yuhao Zhu
CS205: Final Project
Instructor: Thouis "Ray" Jones, TF: Thouis "Ray" Jones
Image processing applies mathematical operations to image signals; it includes image sharpening, image smoothing, image segmentation, and more. Well-designed algorithms have been developed for these tasks; however, for high-resolution images they can take a long time to run, so it is important that images are processed and analyzed at minimum cost. By parallelizing the algorithms, we can increase the speed at which an image is processed. This project explores different parallel implementations of the anisotropic diffusion algorithm in image processing. These parallel implementations work with different numbers of threads and can take advantage of machines at various performance levels.
In image processing and computer vision, anisotropic diffusion, also called Perona–Malik diffusion, is a technique that aims to reduce image noise without removing significant parts of the image content. It is formulated as a partial differential equation (PDE) that replaces the constant diffusion coefficient of classical isotropic diffusion with an image-dependent one.
The function g ensures that diffusion is "stopped" across edges. Namely, it is an edge-stopping function, in our case a Gaussian. Other candidates include the exponential distribution, among others. Mathematically, each pixel can be updated iteratively by averaging the gradients of its four neighbors, as the figure below shows.
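As a concrete sketch, the Gaussian edge-stopping function can be written as follows. The scale parameter `sigma` is our own illustrative name (not taken from the project code); it controls how strong a gradient must be before diffusion is suppressed:

```python
import numpy as np

def g_gaussian(gradient, sigma=0.1):
    """Gaussian edge-stopping function: close to 1 in flat regions
    (small gradients), close to 0 across strong edges."""
    return np.exp(-(gradient / sigma) ** 2)
```

In flat regions the weight is near 1 and diffusion proceeds; across a strong edge the weight collapses toward 0, so the edge is preserved.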
Perona and Malik discretized their anisotropic diffusion equation as follows.[2]
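Written out (our transcription of the published discretization; the symbols match the explanation that follows):

```latex
I_s^{t+1} = I_s^{t} + \frac{\lambda}{|\eta_s|} \sum_{p \in \eta_s} g\!\left(\nabla I_{s,p}\right) \nabla I_{s,p}
```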
where λ is a scalar that determines the rate of diffusion, |η| represents the spatial neighbor pixels and equals a constant 4 in our case, s indicates the pixel to be updated, and p refers to its four neighbors in four directions. The discretization of gradients can be expressed as follows.
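In discrete form, each gradient is simply a finite difference toward one neighbor (our transcription, consistent with the discretization in [2]):

```latex
\nabla I_{s,p} = I_p^{t} - I_s^{t}, \qquad p \in \eta_s
```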
In traditional image processing, image filters are applied to achieve various effects on photos. The center of a selected filter matrix is multiplied with the current pixel, and the other elements of the filter matrix with the corresponding neighbor pixels. In other words, only one pixel's value is updated at a time.
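As an illustration (a minimal sketch of this filtering pattern, not the project's code), applying a 3×3 filter at a single interior pixel looks like this:

```python
import numpy as np

def filter_pixel(image, kernel, row, col):
    """Apply a 3x3 filter at (row, col): multiply the kernel with the
    3x3 neighborhood element-wise and sum, producing one updated pixel."""
    patch = image[row - 1:row + 2, col - 1:col + 2]
    return float(np.sum(patch * kernel))

# A 3x3 mean (box) filter applied at an interior pixel.
image = np.arange(25, dtype=np.float64).reshape(5, 5)
box = np.full((3, 3), 1.0 / 9.0)
print(filter_pixel(image, box, 2, 2))  # the mean of the 3x3 neighborhood
```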
To improve efficiency with parallel computing, we take advantage of the parallelism among pixels within one iteration and partition the pixels into different workgroups in OpenCL.
References
Michael J. Black and David H. Marimont, "Robust Anisotropic Diffusion," IEEE Transactions on Image Processing, vol. 7, no. 3, March 1998.
Our serial code is modified from code published by the University of Oxford and takes advantage of vectorization in NumPy to speed up the process. We primarily adapt the Oxford code to the version proposed in the "Robust Anisotropic Diffusion" paper discussed in the Introduction section.
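A minimal NumPy sketch of one such vectorized iteration (our own illustration under assumed parameter names `lam` and `sigma`, not the Oxford code) might look like:

```python
import numpy as np

def diffuse_step(img, lam=0.25, sigma=0.1):
    """One vectorized anisotropic diffusion iteration.
    Gradients toward the four neighbors are computed with np.roll
    (so borders wrap; a real implementation would pad instead),
    weighted by a Gaussian edge-stopping function, and averaged."""
    north = np.roll(img, -1, axis=0) - img
    south = np.roll(img, 1, axis=0) - img
    east = np.roll(img, -1, axis=1) - img
    west = np.roll(img, 1, axis=1) - img
    total = np.zeros_like(img)
    for grad in (north, south, east, west):
        total += np.exp(-(grad / sigma) ** 2) * grad
    return img + (lam / 4.0) * total
```

A constant image is a fixed point of this update: all four gradients are zero, so nothing changes, while small fluctuations are smoothed out.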
We partition the 2D image into multiple workgroups, each of size (x, 2), where x takes values in {8, 16, 32, 64, 128}. We choose a relatively small height for each workgroup because we read values into the local buffer only when the one-dimensional index is less than the width of the row, and we do not want too many threads idling in the process.
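For illustration (a sketch with assumed names, not the project's host code), the OpenCL global size must be rounded up to a multiple of the local (workgroup) size so that the workgroups tile the whole image:

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple (for OpenCL global sizes)."""
    return ((value + multiple - 1) // multiple) * multiple

# Example: a 500x333 image with a local (workgroup) size of (64, 2).
width, height = 500, 333
local_size = (64, 2)
global_size = (round_up(width, local_size[0]), round_up(height, local_size[1]))
print(global_size)  # -> (512, 334)
```

Threads whose global index falls outside the image simply skip the read, which is why a short workgroup height keeps fewer threads waiting.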
This method sacrifices some parallelism for a smaller total number of workgroups without changing the local size. Again, the local size is (x, 2), where x takes values in {8, 16, 32, 64, 128}, and the height of the global size equals that of the local size. Since the global size does not cover the entire image, we loop within each workgroup to update the buffer and the output image pixels. We introduce an improved version in method 3, which uses an index trick to avoid reading the same values multiple times.
Because the size of our halo is greater than 0, the buffers constructed and updated in the second method overlap between iterations of the loop. The index trick exploits this overlap and updates values only when necessary, as demonstrated in the image below (special thanks to Ray for drawing it).
The last method gets rid of the local buffer and instead reads pixel values directly from global memory. In theory, reading from global memory should be slow; however, on our system this method outperforms the other three.
It takes about 165 seconds to run on our machine for 40 iterations with the library image. We anticipate much higher performance using OpenCL.
The results from both images show that the method without the buffer works best on our computer. In the Harvard library image case, it is followed by the method with the buffer index trick, which does slightly better than the rest. These methods are over 300 times faster than the serial NumPy version.
However, when we use a much smaller image, the block-wise parallel method has the second-best performance.
We thank Thouis "Ray" Jones and all TFs for providing guidance and support.
We found the trickiest part of this project to be avoiding reading the same values into the buffer using the index trick. However, with sufficient guidance and patience, we successfully implemented it in OpenCL. It would also have been great to have an NVIDIA GPU available so that we could compare the performance against a CUDA implementation.
Wenshuai Ye: wenshuaiye@g.harvard.edu
Yuhao Zhu: yuhaozhu@g.harvard.edu