The Need for High Performance (Part 1) : What I have Learnt About Data Parallelism and Thread Parallelism
Introduction
What is Parallelism in Computing ? Parallelism is simply the simultaneous execution of independent tasks.
When can a program be parallelized ? A program can be parallelized if the final output
Levels of Parallelism
parallelism, a parent process is subdivided into daughter - processes - independent tasks are assigned to processes which execute simultaneously. There are various levels of parallelism and they can be classified as either Implicit or Explicit based on the way they are expressed -
a. Implicit Parallelism - Parallelism is expressed automatically. This type of parallelism is exposed automatically by the hardware architecture (CPU, GPU, e.t.c) and compiler techniques such as automatic vectorization, automatic parallelization, pipelining, e.t.c.
b. Explicit Parallelism- Parallelism is expressed by the programmer. It is not automatic, the programmer exploits opportunities for parallel execution of tasks and exposes parallelism stating specific library constructs in the program.
Levels of Parallelism :
- 
Data Parallelism - This is form of explicit parallelism. 
- 
Thread-Level Parallelism - Multiple threads in a processor for the execution of a task at the same time. 
- 
Instruction-Level Parallelism (ILP) - Occurs when there is a parallel execution of a series of instructions in a computer program (this means that there are no dependencies in the series of instructions); ILP is extracted by the hardware architecture and compiler, therefore it is an implicit type of parallelism. ILP can be exposed using the following techniques - pipelining, loop-level parallelism (loop unrolling - statically by the compiler or dynamically by the hardware), vectorization (which also exploits loop-level parallelism), branch prediction, scheduling, compiler dependence analysis, e.t.c. 
R-A-I-S-E : The five, 5, Concerns of Data parallelism
Data Parallelism (DP) can be explained as "the mapping of data to available resources effectively for high performance".
Summarizing the concerns of data parallelism into five, 5:
- Data Reuse: "Adapting data - 1. for different use cases; 2. to specific algorithms and methods; ". The process of reusing often require :
a. Data Distribution Techniques - for maximizing memory on computational resources. An example is splitting data into chunk sizes
to map to available resource(s); 
b. *Data Reprocessing Techniques *- This includes actions such as inserting, deleting, updating, other actions to process data again with the aim of reusing it.
- Data Accessibility / Availability : "This is the placement of data so that it is at the right place at the right time". Some considerations :
a. *Data access pattern *- It is very important for data to be accessed with respect to the storage / placement configuration of a programming language. For example, in storing elements of an array : C, adapts a row-wise storage whereas Fortran adapts a column-wise storage.
b. Cache locality -  The cache belongs to the set of smaller and fastest memory level in the memory hierarchy of a computer architecture (the other members in this set are the Registers). The data communication line from this set of memory levels to the processor is of a shorter range, unlike that of the other memory levels (- main memory, local secondary storage and
external storage) to the processor. In simpler words, they are the closest to the processor, so data stored in them can be retrieved and processed faster. Faster code performance is guaranteed if data is in the cache while it is been accessed. 
Locality simply means placement. In, cache locality, we want to ensure that data is used as often as possible while it is still in the cache. Hence, the placement of data is critical to ordering data accesses. Here, we then see that it is almost impossible to talk about locality without first talking about data access pattern.
- 
Data Integrity: "Data that tells the story of the method used or explained" . For example, the end result of plotting a direction 
 field (mathematics concept) is a graph that tells us how solutions behave with respect to an equilibrium solution. Another example is getting the running time of an algorithm to examine performance.
 Data integrity also means "Correct data" , "Uncorrupted data", "Clean data"; it is somewhat similar to Data Accuracy.
- 
Data Size: This involves using tools, application software, algorithms, compute power sufficient to process intended size of data. For example: High Performance Computing (HPC) is field intended for simple and complex computations requiring longer computation time and needing larger compute power. In HPC, there are peculiar algorithms, tools, applications that have been developed to suit the needs of heavy computations. 
- 
Data Expression: Involves the ability to determine efficient memory and structure management techniques for allocating and processing data. One way of determining this is to juxtapose different memory management technique for small or 
 huge amounts of data. Also, in data expression we consider the ability to express data in such a way that a considerable amount of parallelism is exposed.
In DPC (Data Parallel C) for example, there are three styles of expressing data parallelism - Basic Data Parallel Kernel, ND-range (N-Dimensional range) Data Parallel Kernel, Hierarchical Data Parallel Kernel; and three styles of expressing memory and structure - Unified Shared Memory, Buffers and Images.
M-C-S-M : The four, 4, Concerns of Thread parallelism
First a short story to illustrate thread parallelism
Will too many cooks spoil the broth ? Do we really need all hands on deck ?
Why do we need many cooks ? Why do we need all hands ? - To speed up the work.
But we first have to properly assign tasks and ensure that each task is done at the right time and by the right cook / hand.
Using only the case of the broth, If it takes 5 minutes to  get ready the ingredients of a broth and another 5 minutes to mix and heat
up the ingredients of a broth, the only avenue for speeding up the broth preparation is in the 5 minutes to get the ingredients ready and this will depend on the number of cooks. So, we will have:
a. Time taken to prepare broth by one cook: (5 + 5 ) minutes = 10 minutes.
b. Time taken to prepare broth by, n, number of cooks: (((1/5) * 5)+ 5 ) minutes = 6 minutes.
From the above illustration, we can see that the speed of preparing the broth, is affected by the time taking to heat the broth and
the number of available cooks. 
Thread parallelism addresses this kinds of concepts - Process execution (for example, an addition computation), distribution of process of into task / work, distribution of task / work to n, number, of threads. Now, the main concerns of thread parallelism:
- Thread Management: Threads are like the cooks in the above illustration. They are:
a. Workers that are assigned to distributed tasks to help speed up computation time.
b. Streams of instructions scheduled  to be executed at a particular time whether in-order or out-of-order.
c. Sub-processes that are independent of the parent process.
There are some important concepts to look at when discussing thread management:
a. Task / work distribution : This is technically referred to as load balancing. It involves sharing computation across specified
number of threads. Task / work distribution is aided by thread scheduling techniques whether explicit of implicit, made available by respective parallel programming models. OpenMP for example offers static, dynamic, guided, auto, runtime-controlled (auto is an implicit scheduling type, while the rest are explicit scheduling types) scheduling for distributing work across threads.
b. System Oversubscription  : System oversubscription occurs when the workload on a system exceeds the number of available resources - processes. Taking inspiration from reference [3] (pages 143 - 144), we will now consider two system types to further explain system oversubscription. The names used are self-given - *Application Dedicated System and Integrated System:
*
- 
Application Dedicated System (ADS) - In this type of system, the application run does not share processes with other system processes (such as operating system daemons and services) as the system is dedicated to a single application; also, threads allocated to the application are for the running of the application alone. Other system processes run independently (on threads) of the threads allocated to the application. 
- 
Integrated System (IS)- In this type of system, the application run shares resources with other system processes. It must be ensured that the application is allocated a number of threads that would not interfere with other system processes. 
In both systems (even though the ADS is application specific), it is recommended to use a number of processes (threads in this case) less than the total number of threads possible on a system. This is done to prevent process context switches - switching a process with another process. After a process context switch, there are two possible scenarios, which degrade performance -
a. Cache Flush - this is cache invalidation of a previously running thread by new program scheduled to run.
b. Cache Reload - reload of data on previously running thread rescheduled onto the same process.
b. Thread Rescheduling - Thread rescheduling unto a different process.
- 
Thread Cooperation : This involves techniques to bridge the communication gaps between working threads - avoiding thread waits (idleness) as much as possible as this could affect application performance. Usually, this is done by using the right thread synchronization techniques as well as proper workload distribution. 
- 
Threading Style: By thread style I mean threading model adopted by a parallel programming model (- OpenMP, Intel TBB, Pthreads). Often, style adopted by different parallel programming models have a way of affecting application performance, therefore, a developer must consider application type and choose the best parallel programming model to use by a comparison 
 / squeeze out as much as possible performance using a particular model . Now, how many parallel programming model should one developer know ? I do not know, but best approach might be to have a team of individuals competent with any one of the models working on a project.
- 
Thread-Style Memory Management : Like in data parallelism, memory management is also an important concept of thread parallelism. Using the right data access pattern as well as defining a method to effectively manage memory and data are very important in obtaining good performance. 
References:
- 
J. Reinders, B. Ashbaugh, J. Brodman, M. Kinsner, J. Pennycook, and X. Tian, Data Parallel C - Mastering DPC for Programming of Heterogeneous Systems using C++ and SYCL. Apress, Berkeley, CA, 2021. [Online]. Available: here. 
- 
J.M.P. Cardoso, J.G.F. Coutinho, P.C. Diniz, Embedded Computing for High Performance. Morgan Kaufmann Publishers, Elsevier Inc, Cambridge, MA, 2017. 
- 
B. Chapman, G. Jost, And R. van der Pas and D.J. Kuck, Using OpenMP : portable shared memory parallel programming. The MIT Press, Cambridge, MA, 2008. 
Thank you for reading!
Oluwatosin Odubanjo.