
Welcome to a magical world...

A magical world where clever algorithms meet elegant multitasking models!

Suppose you were given the following open-ended problem:
Given an array of n bits, perform a negation operation on each of the bits.

The most obvious brute-force approach would be to go from index 0 to index n-1 and invert the ith bit along the way. This works cleanly, and it is the simplest formulation that delivers correct results.
Nice! Now suppose the array holds about a million bits; our algorithm directs us to go through each of the elements serially, one by one, giving each bit a feeling of self-importance! (The CPU dedicates a few clock cycles exclusively to accessing, processing, and finally writing out each result.)
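The serial pass above can be sketched in a few lines of Python (the function name is illustrative):

```python
def negate_bits(bits):
    """Serially invert each bit of the array, from index 0 to n-1."""
    for i in range(len(bits)):
        bits[i] = 1 - bits[i]  # flip 0 -> 1 and 1 -> 0
    return bits

negate_bits([0, 1, 1, 0])  # visits one element at a time, in order
```

Each iteration touches exactly one element, which is precisely the "one switch at a time" behaviour we are about to improve on.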

Can we make this faster? Can we use the fact that an operation on the ith bit is independent of the operation on the (i+1)th bit (or any other bit in general)? Of course we can!

Imagine a switchboard with a set of conventional switches which need to be toggled as quickly as possible. We don't have to toggle the switches one at a time. As a matter of fact, multiple volunteers, hands and fingers can be utilized...

Suppose the switchboard is an array upon which operation(s) have to be performed. The analogies we have are:
  1. Person is analogous to a processor
  2. Person's hand is analogous to a processor core
  3. Fingers of a hand are analogous to threads
  • Single finger, Single hand : Single processor, Single core, Single software thread
  • Multiple fingers, Single hand : Single processor, Single core, Multiple software threads
  • Multiple fingers, Both hands : Single processor, Dual core, Multiple software threads
  • Call in your friends for a switch-toggling party! Multiple processors, multiple cores, multiple threads.
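To make the "many hands" idea concrete, here is a minimal sketch that splits the array into chunks and flips each chunk on its own worker thread. The names (`flip_chunk`, `parallel_negate`) are illustrative, and in CPython the GIL limits the real speedup of pure-Python threads, so treat this as a structural sketch of the decomposition rather than a benchmark:

```python
from concurrent.futures import ThreadPoolExecutor

def flip_chunk(bits, start, end):
    # Each worker owns a disjoint [start, end) slice: no two
    # "fingers" ever touch the same switch, so no race can occur.
    for i in range(start, end):
        bits[i] = 1 - bits[i]

def parallel_negate(bits, workers=4):
    n = len(bits)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, n, chunk):
            pool.submit(flip_chunk, bits, start, min(start + chunk, n))
    return bits  # the "with" block waits for all workers to finish
```

The key design point is that the chunks are disjoint: independence of the per-bit operations is what lets us hand out slices without any coordination between workers.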
Some forward references for the curious readers:
  • Hyper-threading can be simulated by imagining an artificial partition within the five fingers of a single hand, treating each partition as a virtual core...
  • Suppose you invented a new gadget, attachable to each finger, which can toggle n (contiguous) switches at a time; that would be analogous to a SIMD-style operation (Single Instruction, Multiple Data). Note that you need specialized hardware for such operations...
  • In the last case, there might be conflicts over who toggles certain switches; this is what is referred to as a race condition/data race. Race conditions on writes generally lead to incorrect results (except when all the competing instructions/threads write the same value, as in this switchboard analogy).
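The race-condition bullet is easiest to see with a shared counter: a read-modify-write such as `counter += 1` is not atomic, so interleaved threads can lose updates. Guarding the critical section with a lock restores correctness by momentarily reducing the competing threads to serial execution. A minimal sketch (names are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:        # critical section: only one thread at a time
            counter += 1  # performs the read-modify-write

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, counter is exactly 40_000. Without it, updates from
# interleaved threads could be lost and the total could come up short.
```

Note that in the switchboard analogy the race is harmless only because every competitor writes the same value; a counter is the opposite case, where each write depends on the previous one.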
While we're working with this analogy, let me bring in some interesting concepts and gotchas in parallel computing at a very abstract level.

Suppose there is a predefined set of rules with which the switches have to be toggled, then parallelism will be constrained. For the time being, let's switch from the switchboard to a piano. Each keystroke of a given note can be viewed as a memory access of a particular address.

The concept of caching can be seen as:
  1. Repetition of the same note within a given time period: temporal locality
  2. Access of notes in the local neighbourhood of a note: spatial locality
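As a loose software-level analogy for temporal locality, a small cache pays off exactly when the same "note" recurs; Python's `functools.lru_cache` makes the hit/miss behaviour observable (the `play_note` function here is illustrative, and a hardware cache additionally exploits spatial locality by fetching whole cache lines at a time):

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def play_note(note):
    # Pretend this is an expensive memory access.
    return note.upper()

for note in ["c", "e", "g", "c", "e", "c"]:  # the same notes recur
    play_note(note)

info = play_note.cache_info()
# First occurrences of "c", "e", "g" miss; the repeats of "c" and "e"
# hit the cache: 3 hits, 3 misses.
```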
Pipelining is like the staged positioning of the fingers over the keys: seeking, settling, striking, and retreating correspond to the stages of a pipeline (the stages of an actual hardware pipeline differ, of course).

Branch prediction allows a predictive seek toward a particular key, hoping that it is the one to be struck next according to the score sheet/song being played. If not, the effort of moving to that key is wasted and a repositioning is required (analogous to a pipeline flush).
All these possibilities employ "parallelism" in slightly different degrees of multiplicity and complexity. The suitable choice depends on the context of usage and the availability of corresponding hardware resources. I hope this gave you a fundamental set of simple analogies for parallelism.
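A classic way to observe this from software is to count elements above a threshold in a sorted versus an unsorted array: the answer is identical, but on real hardware the sorted pass keeps the branch predictor almost always right, while the shuffled pass forces frequent mispredictions (pipeline flushes). Python's interpreter overhead hides most of the timing difference, so this sketch only sets up the experiment; porting the same loop to C or C++ is the usual way to see the gap:

```python
import random

data = [random.randrange(256) for _ in range(10_000)]

def count_ge(values, threshold=128):
    count = 0
    for v in values:
        if v >= threshold:  # the branch the predictor must guess
            count += 1
    return count

shuffled_count = count_ge(data)        # branch outcome looks random
sorted_count = count_ge(sorted(data))  # outcomes: all-False, then all-True
# Both passes compute the same answer; only the branch pattern differs.
```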


Bit (BInary digiT; a portmanteau, isn't it?)
