
A simple python program using multiprocessing... or is it?

I would like to show you a simple yet subtle example of how a program can appear to produce unreasonable results.

Recently, I was looking through some Python programs, searching for places to optimize the code and introduce parallelism. Threads came to mind immediately: independent contexts of computation that could speed things up.

Although I program frequently in Python, I hadn't used any kind of explicit parallelism in my code. So, drawing on my C background, I reached for Python's threading library.

Long story short, that was a mistake! It turns out that the default Python implementation (CPython), as well as PyPy, has a process-wide mutex called the Global Interpreter Lock (GIL). This lock exists mainly because CPython's memory management is not thread-safe. The GIL allows only one thread to execute Python bytecode at a time, which prevents race conditions and state corruption inside the interpreter, but it also eliminates true parallelism across threads for CPU-bound Python code.
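To see this in action, here is a minimal sketch (my own example, not from the repo): a CPU-bound task split across threads produces the right answer, but on CPython it typically runs no faster than the serial version, because the GIL serializes the bytecode.

```python
import threading

def burn(n):
    # CPU-bound busy loop executed purely as Python bytecode.
    total = 0
    for i in range(n):
        total += i
    return total

def run_serial(n):
    return burn(n)

def run_threaded(n, workers=4):
    # Split the work across threads. Each thread runs real Python bytecode,
    # so the GIL serializes them: wall-clock time is not reduced on CPython.
    results = []
    lock = threading.Lock()

    def worker(chunk):
        r = burn(chunk)
        with lock:
            results.append(r)

    threads = [threading.Thread(target=worker, args=(n // workers,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Wrap both calls in `time.perf_counter()` and you will usually see the threaded version take as long as (or longer than) the serial one for CPU-bound work.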

There is also a module called threading2 that can set thread affinity based on the system's CPU affinity. Although this can reduce processing time in some cases, I prefer to stay away from GIL workarounds...

Next, I explored the multiprocessing library. Here is where things took an interesting turn! The multiprocessing library creates separate OS processes (via fork() on Unix), each with its own interpreter and its own GIL, so it escapes the clutches of the GIL. There are a few catches with this approach, such as dependency on the OS's process-creation semantics, replication of the address space across parent and child, and the cost of communicating data between processes.
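A minimal sketch of this idea (again my own example, not the repo's code): the same CPU-bound task handed to a process pool really does run in parallel, since each child process has its own GIL.

```python
from multiprocessing import Pool

def burn(n):
    # CPU-bound work; in a child process it runs under its own GIL.
    total = 0
    for i in range(n):
        total += i
    return total

def run_parallel(chunks):
    # Each chunk is handed to a separate process created by fork() on Unix
    # (spawn on Windows), so the chunks can execute truly in parallel.
    with Pool(processes=len(chunks)) as pool:
        return sum(pool.map(burn, chunks))

if __name__ == "__main__":
    print(run_parallel([250] * 4))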

So, let us start from a C/C++ implementation of the thresholdSum program.

Here is a set of files that conceals a strange phenomenon. I've named the directory EnchantingProgram. Run the C and C++ programs first, both with and without the indicated sort routine:

https://github.com/varun-manjunath/ParallelComputing/tree/master/EnchantingProgram


The code runs through each row of a matrix and computes a conditional sum: if a given element passes a threshold test, it is added to the total. I do understand that the same result could be obtained more directly with simple arithmetic and subtraction. However, I have employed this technique to expose a special kind of "gotcha" in computing!
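The repo's programs are in C/C++, but the same logic can be sketched in Python (names and the threshold value here are mine, not the repo's):

```python
import random

THRESHOLD = 128

def threshold_sum(matrix):
    # Walk every row and add only the elements that pass the test.
    # In the compiled C/C++ version, this `if` is the branch whose
    # behaviour changes dramatically once each row is sorted.
    total = 0
    for row in matrix:
        for value in row:
            if value >= THRESHOLD:
                total += value
    return total

def make_matrix(rows, cols, seed=0):
    # Random bytes in [0, 255], so roughly half pass the threshold.
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(cols)] for _ in range(rows)]
```

Note that in CPython you are unlikely to reproduce the timing effect with this sketch, since interpreter overhead swamps it; the compiled C/C++ versions are where the surprise lives.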

Run the code and get ready to be surprised: sorting the matrix rows somehow makes the code faster. Can you work out why?

[Screenshot: execution times with and without the sort]


If you profile the code, you will certainly notice a spike in the for loop! (Well, that is obvious... thanks, profiler!) A visual inspection can't uncover the reason either... (unless you are an expert in computer architecture and pipelining).

So, did you identify the reason for the speedup? It took me quite some time to understand it! I had to revise pipelining and dig through many computer science blogs and forums to get there.

Do read through my program thoroughly, even the comments! There are hidden hints in there!  :)



I've split up the post into two, so that you get some time to think about it, not just scroll down to get the answer!

So, read this post for the answer:
http://magical-parallel-computing.blogspot.in/2017/04/enchantingprogram-spoiler-alert.html
