Multicore Computing

created : 2021-06-02T08:11:08+00:00
modified : 2021-06-03T00:21:24+00:00

Introduction to Multicore Computing

Multicore Processor

Manycore processor (GPU)

What is Parallel Computing?

Parallelism vs Concurrency

Parallel Programming Techniques

Parallel Processing Systems

Parallel Computing vs. Distributed Computing

Cluster Computing vs. Grid Computing

Cloud Computing

Good Parallel Program

Moore’s Law

Computer Hardware Trend

Examples of Parallel Computers

Generic SMP


Principles of Parallel Computing

Overhead of Parallelism

Locality and Parallelism

Load Imbalance

Performance of Parallel Programs

Flynn’s Taxonomy of Parallel Computers

SISD (Single Instruction, Single Data)

SIMD (Single Instruction, Multiple Data)

MISD (Multiple Instruction, Single Data)

MIMD (Multiple Instruction, Multiple Data)

Creating a Parallel Program

  1. Decomposition
  2. Assignment
  3. Orchestration/Mapping


Domain Decomposition

Functional Decomposition




Performance of Parallel Programs

Coverage (Amdahl’s Law)
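
As a reminder of what this heading refers to, Amdahl's Law bounds the speedup of a program when only part of it can run in parallel. The symbols below are chosen for this sketch and do not come from the notes: p is the fraction of execution time that can be parallelized, n is the number of processors, and S(n) is the resulting speedup.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, with p = 0.9 and n = 8 the speedup is 1 / (0.1 + 0.1125) ≈ 4.7, and no number of processors can push it past 1 / 0.1 = 10.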

Performance Scalability


Fine vs Coarse Granularity

Load Balancing

General Load Balancing Problem

Load Balancing Problem

Static Load Balancing

Dynamic Load Balancing

Granularity and Performance Tradeoffs


Factors to consider for communication

MPI: Message Passing Interface


Synchronous vs Asynchronous Messages

Blocking vs. Non-Blocking Messages





Memory Access Latency in Shared Memory Architectures

Cache Coherence

Shared Memory Architecture

Distributed Memory Architecture

Hybrid Architecture

JAVA Thread Programming


Unix process


Multi-process vs Multi-thread

Programming JAVA Threads

Java Threading Models

Creating Threads: method 1

class MyThread extends Thread {
  public void run() {
    // work to do
  }
}

MyThread t = new MyThread();
t.start();  // runs run() on a new thread

Thread Names

Creating Threads: method 2
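
The notes give no code for this method. The usual second way to create a Java thread is to implement `Runnable` and pass the object to a `Thread`, so the following is a minimal sketch under that assumption. The names `MyTask`, `RunnableDemo`, `runTask`, and the thread name "worker-1" are hypothetical and chosen only for illustration.

```java
// Hypothetical task: implements Runnable instead of extending Thread.
class MyTask implements Runnable {
  private final int[] result;  // one-element slot the worker writes into

  MyTask(int[] result) {
    this.result = result;
  }

  @Override
  public void run() {
    // work to do: here the "work" is just recording a value
    result[0] = 42;
  }
}

class RunnableDemo {
  // Runs MyTask on a new thread and returns the value it wrote.
  static int runTask() {
    int[] result = new int[1];
    Thread t = new Thread(new MyTask(result), "worker-1");
    t.start();
    try {
      t.join();  // wait for the worker; join() also makes its write visible
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return result[0];
  }

  public static void main(String[] args) {
    System.out.println(runTask());
  }
}
```

Compared with extending `Thread`, this leaves the task class free to extend some other class and lets the same task object be handed to different threads.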

Thread Life-Cycle

Alive States

Thread Priority


Thread identity

Thread sleep, suspend, resume

Thread Waiting & Status check

Thread Synchronization

Synchronized JAVA methods
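
To illustrate this heading, here is a small sketch of a counter guarded by `synchronized` methods. The class names `SyncCounter` and `SyncCounterDemo` and the method `countWith` are hypothetical. The point is only that a `synchronized` instance method holds the object's monitor lock for the duration of the call, so concurrent increments are not lost.

```java
// Hypothetical shared counter; every access goes through a synchronized method.
class SyncCounter {
  private int count = 0;

  // Only one thread at a time may run a synchronized method on this object.
  public synchronized void increment() {
    count++;
  }

  public synchronized int get() {
    return count;
  }
}

class SyncCounterDemo {
  // Starts numThreads threads that each increment the counter perThread times.
  static int countWith(int numThreads, int perThread) {
    SyncCounter counter = new SyncCounter();
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      threads[i] = new Thread(() -> {
        for (int k = 0; k < perThread; k++) {
          counter.increment();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();  // wait for every worker before reading the total
      } catch (InterruptedException e) {
        throw new IllegalStateException(e);
      }
    }
    return counter.get();
  }

  public static void main(String[] args) {
    System.out.println(countWith(4, 10000));
  }
}
```

Without `synchronized` on `increment()`, the read-modify-write in `count++` could interleave between threads and the final total could come out lower than expected.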

Synchronized Lock Object

Condition Variables

wait() and notify()

Producer-Consumer Problem
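
The producer-consumer problem is commonly solved in Java with `wait()` and `notifyAll()` on a shared bounded buffer. The sketch below is one such solution, not necessarily the one used in the original lecture. `BoundedBuffer`, `ProducerConsumerDemo`, `transferSum`, and the capacity of 4 are assumptions made for illustration.

```java
// Hypothetical fixed-size ring buffer shared by one producer and one consumer.
class BoundedBuffer {
  private final int[] items;
  private int head = 0;   // index of the next item to take
  private int tail = 0;   // index of the next free slot
  private int count = 0;  // number of items currently stored

  BoundedBuffer(int capacity) {
    items = new int[capacity];
  }

  public synchronized void put(int x) throws InterruptedException {
    while (count == items.length) {
      wait();  // buffer full: release the lock and sleep until notified
    }
    items[tail] = x;
    tail = (tail + 1) % items.length;
    count++;
    notifyAll();  // wake any consumer waiting for data
  }

  public synchronized int take() throws InterruptedException {
    while (count == 0) {
      wait();  // buffer empty: release the lock and sleep until notified
    }
    int x = items[head];
    head = (head + 1) % items.length;
    count--;
    notifyAll();  // wake any producer waiting for space
    return x;
  }
}

class ProducerConsumerDemo {
  // The producer sends 1..n through the buffer; the consumer adds them up.
  static int transferSum(int n) {
    BoundedBuffer buf = new BoundedBuffer(4);
    int[] sum = new int[1];
    Thread producer = new Thread(() -> {
      try {
        for (int i = 1; i <= n; i++) buf.put(i);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Thread consumer = new Thread(() -> {
      try {
        for (int i = 0; i < n; i++) sum[0] += buf.take();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();
    consumer.start();
    try {
      producer.join();
      consumer.join();
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return sum[0];
  }

  public static void main(String[] args) {
    System.out.println(transferSum(100));
  }
}
```

The conditions are checked in `while` loops rather than `if` statements because a woken thread must re-test the condition after it reacquires the lock.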

Potential Concurrency Problems

Important Concepts in Concurrent Programming

Divide-and-Conquer Approach to Parallelization
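
As a sketch of this idea with plain Java threads, the example below sums an array by recursively splitting the range in half. One half goes to a new thread and the other stays on the current thread. `ParallelSum`, `sum`, and the `threshold` parameter are hypothetical names. A production version would more likely use a thread pool than one thread per split.

```java
class ParallelSum {
  // Sums a[lo..hi). Ranges no longer than threshold are summed sequentially.
  static long sum(int[] a, int lo, int hi, int threshold) {
    if (hi - lo <= Math.max(1, threshold)) {
      long s = 0;
      for (int i = lo; i < hi; i++) {
        s += a[i];
      }
      return s;
    }
    int mid = lo + (hi - lo) / 2;
    long[] left = new long[1];
    // Divide: the left half is summed on a new thread...
    Thread t = new Thread(() -> left[0] = sum(a, lo, mid, threshold));
    t.start();
    // ...while the current thread sums the right half.
    long right = sum(a, mid, hi, threshold);
    try {
      t.join();  // wait for the left half before combining
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    // Conquer: combine the two partial sums.
    return left[0] + right;
  }

  public static void main(String[] args) {
    int[] a = new int[1000];
    for (int i = 0; i < a.length; i++) {
      a[i] = i + 1;
    }
    System.out.println(sum(a, 0, a.length, 100));
  }
}
```

The threshold controls granularity: a smaller value creates more threads and more overhead, while a larger value leaves less parallelism.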

Pthread Programming

Thread Properties


Pthreads API

Thread Management

Thread Creation

Thread Termination

Thread Cancellation



Mutex Routines

Locking/Unlocking Mutexes

User’s Responsibility for Using Mutex

Condition Variables

Condition Variable Routines


Shared Memory Model

Example: Matrix times vector

#pragma omp parallel for default(none) \
            private(i, j, sum) shared(m, n, a, b, c)
for (i = 0; i < m; i++) {
  sum = 0.0;
  for (j = 0; j < n; j++)
    sum += b[i][j] * c[j];  // row i of b times vector c
  a[i] = sum;
}

When to consider using OpenMP

About OpenMP


Components of OpenMP

About OpenMP clauses

The if/private/shared clauses

About storage association

The firstprivate/lastprivate clauses

The default clause

The reduction clause

The nowait clause

The parallel region

Work-sharing constructs

The omp for/do directive

#pragma omp for [clause[[,] clause] ...]
  <original for-loop>

Load balancing

The schedule clause

The SECTIONS directive

#pragma omp sections [clause(s)]
{
#pragma omp section
  <code block1>
#pragma omp section
  <code block2>
#pragma omp section
  ...
}


Synchronization Controls


Critical region

#pragma omp critical [(name)]
{ <code-block> }
#pragma omp atomic

Single processor region

SINGLE and MASTER constructs

#pragma omp single [clause[[,] clause] ...]
{ <code-block> }
#pragma omp master
{ <code-block> }

More synchronization directives

OpenMP Environment Variables

OpenMP and Global data

The threadprivate construct

The copyin clause

OpenMP Runtime Functions

OpenMP runtime library

OpenMP locking routines

Nested locking

Manycore GPU Programming with CUDA

The Need for Multicore Architecture

Many-core GPUs

Processor: Multicore vs Many-core




GPU Architecture

GPU chip design

Popularity of GPUs

Why more parallelism?

CUDA (Compute Unified Device Architecture)

Compute Capability

CUDA - Main Features

CUDA device and threads

CUDA Hello World

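#include <stdio.h>

// Kernel: runs on the device, once per launched thread
__global__ void hello_world(void) {
  printf("Hello World\n");
}

int main(void) {
  hello_world<<<1, 5>>>();   // 1 block of 5 threads
  cudaDeviceSynchronize();   // wait for the kernel so its output is flushed
  return 0;
}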

C Language Extension

Simple Processing Flow

  1. Copy input data from CPU memory to GPU memory
  2. Load GPU program and execute, caching data on chip for performance
  3. Copy results from GPU memory to CPU memory

Hello World! with Device Code

Memory Management

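__global__ void add(int *a, int *b, int *c) {
  *c = *a + *b;
}

int main(void) {
  int a, b, c;              // host copies of a, b, c
  int *d_a, *d_b, *d_c;     // device copies of a, b, c
  int size = sizeof(int);

  // Allocate space for the device copies
  cudaMalloc((void **) &d_a, size);
  cudaMalloc((void **) &d_b, size);
  cudaMalloc((void **) &d_c, size);

  a = 2;
  b = 7;

  // Copy inputs to the device
  cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

  // Launch add() on the device with one block of one thread
  add<<<1, 1>>>(d_a, d_b, d_c);

  // Copy the result back to the host
  cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

  // Release device memory
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  return 0;
}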

Running in Parallel

Moving to Parallel

Vector Addition on the Device

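__global__ void add(int *a, int *b, int *c) {
  // each block handles one element, indexed by its block ID
  c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
// launch with N blocks of one thread each: add<<<N, 1>>>(...);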

CUDA Threads

Combining Blocks and Threads

1D Stencil

Implementing Within a block

Sharing Data Between Threads

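// Assumes in and out point past RADIUS halo elements of padded arrays
__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;

  // Read input elements into shared memory
  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    // The first RADIUS threads also load the left and right halos
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }

  // Make sure all loads into temp are done before any thread reads it
  __syncthreads();

  // Apply the stencil
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];

  // Store the result
  out[gindex] = result;
}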

Coordinating Host & Device

Reporting Errors

Device Management