Parallel C++ and TBB

Environment

Make sure you are using the correct environment!

Hello World

#include <thread>
#include <iostream>
int main()
{
   auto f = [](int i){
   std::cout << "hello world from thread " << i << std::endl;
  };
//Construct a thread which runs the function f
  std::thread t0(f,0);

//and then destroy it by joining it
  t0.join();
}

Compile with:

g++ std_threads.cpp -lpthread -o std_threads

Measuring time intervals

#include <chrono>
...
auto start = std::chrono::steady_clock::now();
  foo();
auto stop = std::chrono::steady_clock::now();
std::chrono::duration<double> dur= stop - start;
std::cout << dur.count() << " seconds" << std::endl;

Exercise 1. Reduction

#include <iostream>
#include <random>
#include <utility>
#include <vector>
#include <chrono>

int main(){

  const unsigned int numElements= 100000000;   

  std::vector<int> input;
  input.reserve(numElements);

  std::mt19937 engine;
  std::uniform_int_distribution<> uniformDist(-5,5);
  for ( unsigned int i=0 ; i< numElements ; ++i) input.emplace_back(uniformDist(engine));

  long long int sum= 0;

  auto f= [&](unsigned long long firstIndex, unsigned long long lastIndex){
    for (auto it= firstIndex; it < lastIndex; ++it){
        sum+= input[it];
    }
  };

  auto start = std::chrono::system_clock::now();
  f(0,numElements);
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time spent in reduction: " << dur.count() << " seconds" << std::endl;
  std::cout << "Sum result: " << sum << std::endl;
  return 0;
}

Quickly create threads

unsigned int n = std::thread::hardware_concurrency();
std::vector<std::thread> v;
for (int i = 0; i < n; ++i) {
     v.emplace_back(f,i);
}
for (auto& t : v) {
    t.join();
}

Exercise 2. Numerical Integration

#include <iostream>
#include <iomanip>
#include <chrono>

int main()
{
  double sum = 0.;
  constexpr unsigned int num_steps = 1 << 22;
  double pi = 0.0;
  constexpr double step = 1.0/(double) num_steps;
  auto start = std::chrono::system_clock::now();
  for (int i=0; i< num_steps; i++){
    auto  x = (i+0.5)/num_steps;
    sum = sum + 4.0/(1.0+x*x);
  }
  auto stop = std::chrono::system_clock::now();
  std::chrono::duration<double> dur= stop - start;
  std::cout << dur.count() << " seconds" << std::endl;
  pi = step * sum;

  std::cout << "result: " <<  std::setprecision (15) << pi << std::endl;
}

Exercise 3. pi with Montecarlo

.

The area of the circle is pi and the area of the square is 4.

Generate N random floats x and y between -1 and 1 https://en.cppreference.com/w/cpp/numeric/random/uniform_real_distribution.

Calculate the distance r of your point from the origin.

If r < 1: the point is inside the circle and increase Nin.

The ratio between Nin and N converges to the ratio between the areas.

Setting the environment for Intel oneTBB

Check your environment!

echo $TBBROOT

To compile and link:

g++ -O2 algo_par.cpp  -ltbb

Let's check that you can compile a simple tbb program:

#include <cstdint>
#include <oneapi/tbb.h>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>
#include <cassert>

int main() {
  // Get the default number of threads
  int num_threads = oneapi::tbb::info::default_concurrency();

  // Run the default parallelism
  oneapi::tbb::parallel_for(
      oneapi::tbb::blocked_range<size_t>(0, 20),
      [=](const oneapi::tbb::blocked_range<size_t> &r) {
        // Assert the maximum number of threads
        assert(num_threads == oneapi::tbb::this_task_arena::max_concurrency());
      });

  // Create the default task_arena
  oneapi::tbb::task_arena arena;
  arena.execute([=] {
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<size_t>(0, 20),
        [=](const oneapi::tbb::blocked_range<size_t> &r) {
          // Assert the maximum number of threads
          assert(num_threads ==
                 oneapi::tbb::this_task_arena::max_concurrency());
        });
  });

  return 0;
}

Compile with:

g++ your_first_tbb_program.cpp -ltbb

Your TBB Thread pool

// analogous to hardware_concurrency, number of hw threads:
int num_threads = oneapi::tbb::info::default_concurrency();

// or if you wish to force a number of threads:
auto t = 10; //running with 10 threads
oneapi::tbb::task_arena arena(t);

// And query an arena for the number of threads used:
auto max = oneapi::tbb::this_task_arena::max_concurrency();
// Limit the number of threads to two for all oneTBB parallel interfaces
oneapi::tbb::global_control global_limit(oneapi::tbb::global_control::max_allowed_parallelism, 2);

Task parallelism

A task is submitted to a task_group as in the following. The run method is asynchronous. In order to be sure that the task has completed, the wait method has to be launched. Alternatively, the run_and_wait method can be used.

#include <iostream>
#include <oneapi/tbb.h>
#include <oneapi/tbb/task_group.h>

using namespace oneapi::tbb;

int Fib(int n) {
  if (n < 2) {
    return n;
  } else {
    int x, y;
    task_group g;
    g.run([&] { x = Fib(n - 1); }); // spawn a task
    g.run([&] { y = Fib(n - 2); }); // spawn another task
    g.wait();                       // wait for both tasks to complete
    return x + y;
  }
}

int main() {
  std::cout << Fib(32) << std::endl;
  return 0;
}

Bonus: Graph Traversal

Generate a direct acyclic graph represented as a std::vector<Vertex> graph of 20 vertices:

struct Vertex {
  int N;
  std::vector<int> Neighbors;
}

If there is a connection from A to B, the index of the element B in graph needs to be pushed into A.Neighbors. Make sure that from the first element of graph you can visit the entire graph.

Once generated, when you visit a vertex X of the graph, you compute Fib(X.N). Generate Vertex.N uniformly between 30 and 40.

Remember to keep track of which vertex has already been visited.