Comparing SSE2 and GPGPU C++ AMP
Last Updated on Tuesday, 21 May 2013 18:22
The temporal median algorithm compares the same pixel in consecutive images in a sequence and returns the median value of that pixel, i.e. the value that has as many lower values as higher ones (an easy way to visualize this is imagining all the pixel values in an array that gets sorted, then picking the value in the middle of the array). Depending on the time span of the sequence, and on the number of samples in the sequence, the temporal median filter can be useful for:
- filtering out noise, especially brief noise impulses that affect fewer than half of the samples in the sequence
- estimating the steady state of a scene, especially with a longer time span and a higher number of samples
When using the temporal median algorithm, every time a new image arrives, the oldest image in the queue is discarded, the new one is added to the queue, and then every single pixel is processed to obtain an output image that contains the median value across all images in the queue. The easy (and slow) method is copying all the values of the given pixel into a vector, sorting it, then picking the middle value. This implementation will be named "reference" and it will be the yardstick for other implementations, as they must exactly match its results. Fortunately, for specific vector lengths (3, 5, 7 and 9 samples) there are pre-computed optimal sequences of min/max operations that give the median value with the minimum number of computations. Even numbers of samples (4, 6 and 8 samples) can be transformed into the optimal 5, 7 and 9 sample cases by adding an empty 0 data element and adjusting the index of the median element.
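The two approaches described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names, the frame layout (equally sized 8-bit grayscale buffers) and the choice of `samples[size / 2]` as the middle element are assumptions. The `median3` helper shows the well-known min/max sequence for the 3-sample case only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Reference" style: for each pixel, copy its value from every frame into a
// vector, sort the vector, and pick the middle element.
// Assumption: all frames are 8-bit grayscale buffers of the same size.
std::vector<std::uint8_t> temporal_median_reference(
    const std::vector<std::vector<std::uint8_t>>& frames)
{
    if (frames.empty()) return {};
    const std::size_t pixel_count = frames.front().size();
    std::vector<std::uint8_t> output(pixel_count);
    std::vector<std::uint8_t> samples(frames.size());
    for (std::size_t p = 0; p < pixel_count; ++p) {
        for (std::size_t f = 0; f < frames.size(); ++f)
            samples[f] = frames[f][p];
        std::sort(samples.begin(), samples.end());
        output[p] = samples[samples.size() / 2];
    }
    return output;
}

// Median of exactly 3 samples using only min/max operations, no sorting.
inline std::uint8_t median3(std::uint8_t a, std::uint8_t b, std::uint8_t c)
{
    const std::uint8_t lo = std::min(a, b);
    const std::uint8_t hi = std::max(a, b);
    return std::max(lo, std::min(hi, c));
}
```

For three frames, both functions should agree on every pixel, which is exactly the kind of check the reference implementation is meant to support.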
Before analyzing how the basic and optimized algorithms are implemented, let's have a look at the contenders.
- SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD (Single Instruction, Multiple Data) supplementary instruction sets, first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully replace MMX. In this example, we will use the SSE2 C++ intrinsics, and let the Visual C++ compiler take care of SSE2 register allocation (due to the low register pressure, the generated assembly code is optimal).
- C++ Accelerated Massive Parallelism (C++ AMP) is a library implemented on DirectX 11 and an open specification from Microsoft for implementing data parallelism directly in C++. It is intended to make programming GPUs easy for the developer. Code that cannot be run on GPUs will fall back onto one or more CPUs instead and use SSE instructions. The Microsoft implementation is included in Visual C++ 2012, including debugger and profiler support.
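To give a flavor of the SSE2 intrinsics mentioned above, here is a minimal sketch that computes the per-pixel median of three 8-bit frames, 16 pixels at a time. It is an illustration under assumptions, not the article's code: the function name `median3_sse2` and the raw-pointer interface are hypothetical, and it targets x86 compilers that provide `<emmintrin.h>`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Per-pixel median of three 8-bit frames (hypothetical interface).
// The vector loop handles 16 pixels per iteration; a scalar loop
// finishes any remaining pixels.
void median3_sse2(const std::uint8_t* a, const std::uint8_t* b,
                  const std::uint8_t* c, std::uint8_t* out,
                  std::size_t pixel_count)
{
    std::size_t i = 0;
    for (; i + 16 <= pixel_count; i += 16) {
        // Unaligned loads, so the buffers need no special alignment.
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
        // median(a, b, c) = max(min(a, b), min(max(a, b), c)) on unsigned bytes
        const __m128i lo = _mm_min_epu8(va, vb);
        const __m128i hi = _mm_max_epu8(va, vb);
        const __m128i med = _mm_max_epu8(lo, _mm_min_epu8(hi, vc));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), med);
    }
    for (; i < pixel_count; ++i) {
        const std::uint8_t lo = std::min(a[i], b[i]);
        const std::uint8_t hi = std::max(a[i], b[i]);
        out[i] = std::max(lo, std::min(hi, c[i]));
    }
}
```

Only three values are live per iteration besides the temporaries, which is consistent with the low register pressure noted above.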
Spidering Facebook public profiles with C++ and Boost
Last Updated on Friday, 17 May 2013 22:59
Back in 2010, security researcher Ron Bowes wrote a Ruby script that downloaded information from Facebook's user directory, a searchable index of public profile pages. The directory did not expose a user's entire profile, only the information that the user had allowed Facebook to make public. Bowes got the idea of spidering the data so that he could collect statistics about the most common names.
Now, how hard can it be to write such software using C++ instead of Ruby? As Jeremy Clarkson was not interested in answering such a question, I decided to write a quick and simple spidering program in C++, and found out that while it is easy to build, these days it is not useful at all.
The main part of this project is parsing the HTML pages containing the Facebook directory. In a directory page, we can find both links to other directory pages and links to users' public profiles. Luckily, they are easy to distinguish using regular expressions, and the Boost C++ library supports Perl-style regular expressions, so given the HTML source of a page, the following functions use regular expressions to extract useful links:
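A minimal sketch of what such link-extraction functions could look like, under stated assumptions: it uses `std::regex` in place of Boost.Regex so that it is self-contained, and both the function names and the URL patterns are hypothetical, chosen only to show how the two kinds of link can be told apart.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the first capture group of every match of `pattern` in `html`.
std::vector<std::string> extract_links(const std::string& html,
                                       const std::regex& pattern)
{
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it)
        links.push_back((*it)[1].str());
    return links;
}

// Hypothetical URL shape: directory pages live under /directory/.
std::vector<std::string> extract_directory_links(const std::string& html)
{
    static const std::regex pattern(
        R"re(href="(https?://www\.facebook\.com/directory/[^"]*)")re");
    return extract_links(html, pattern);
}

// Hypothetical URL shape: a profile is a single path segment that is not
// under /directory/ (negative lookahead).
std::vector<std::string> extract_profile_links(const std::string& html)
{
    static const std::regex pattern(
        R"re(href="(https?://www\.facebook\.com/(?!directory/)[^"/]+)")re");
    return extract_links(html, pattern);
}
```

A spider built this way would queue the directory links for further crawling and record the profile links as results.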
GPGPU performance on switchable graphics notebooks
Many notebooks on the market feature switchable graphics, that is, they pair an Intel CPU with built-in HD Graphics technology with an additional AMD Radeon HD GPU or nVidia GeForce GPU. During normal usage, only the Intel HD Graphics is enabled, as it consumes less power, and the high-performance AMD or nVidia GPU is enabled only when 3D-intensive applications are started.
When developing a GPGPU application, in this example using C++ AMP with Microsoft Visual C++ 2012, you must check that the high-performance GPU is running your GPGPU kernels, or the resulting performance will be so poor that you will wonder what the hype about GPGPU is all about. To ensure that the high-performance AMD GPU is running your code, right click on the desktop and click on Setup switchable graphics
Unit testing with Visual C++ 2012
Last Updated on Thursday, 25 April 2013 13:59
The article related to multi-threading and SSE2 optimizations (you can find it here) uses a quick-and-dirty way to check if the optimized code is correct: it runs an iteration on a given input image with a reference serial code, stores the resulting output image, runs an iteration of the optimized code on the same input image, and checks if the output image matches the one obtained with the reference code. This is a valid approach, but the location is clearly wrong: even if we are writing a demo application, the testing code should be separated into unit tests that can be automated and repeated before check-ins and builds. It's time to modify that code to use the awesome support for unit testing contained in Microsoft Visual C++ 2012.
Different types of parallel loops with Intel TBB, SSE2, SSSE3 and Visual C++ 2012
Last Updated on Wednesday, 01 May 2013 14:06
This is not the first article on this site that discusses how to use the Intel Threading Building Blocks library to spread the computation of an image-processing kernel over multiple threads: the article named "Multi thread loops with Intel TBB" showed how to do it with Intel TBB 2.x. However, thanks to the enhancements of Intel TBB 4.x, and the availability of C++11-compliant compilers such as Microsoft Visual C++ 2012, it is now possible to write code that is more compact and easier to understand than before.
Further multi-thread processing with Delphi
Last Updated on Monday, 22 April 2013 15:45
In a previous article named "Easy multi-thread programming Delphi", the AsyncCalls library was used to process multiple images at the same time. However, the processing of every single image was still strictly serial, even though image processing kernels are quite easy to accelerate by spreading the load over multiple threads.
In this article we will see how the OmniThreadLibrary can be used to split a simple image processing kernel across multiple threads.
procedure TProcessedImage.EffectBlackWhite(Bitmap : TBitmap32);
var x, y : integer;
    Color : TColor32;
    Red, Green, Blue : Cardinal;
    Gray : Cardinal;
begin
  for y := 0 to Pred(Bitmap.Height) do
    for x := 0 to Pred(Bitmap.Width) do
    begin
      // split the 32-bit pixel into its 8-bit color channels
      Color := Bitmap.Pixel[x, y];
      Red := (Color and $00FF0000) shr 16;
      Green := (Color and $0000FF00) shr 8;
      Blue := Color and $000000FF;
      // weighted sum of the channels, scaled back to 0..255
      Gray := (Red * 299 + Green * 587 + Blue * 114) div 1000;
      Bitmap.Pixel[x, y] := Color32(Gray, Gray, Gray);
    end;
end;