HPC & Cloud Features


Beyond the core IDE features, we hope to add several features to Python Tools for Visual Studio that address the growing need for large-scale compute and data. This release adds support for cluster computing, and during the next few months we'll add support for Cloud and Data-Intensive (Dryad) scenarios.

Python @ Scale


In the current version of PTVS, you can scale your computations in two general ways:

  • Batch mode: Via MPI, using the MPI4PY wrappers on a cluster
  • Interactive mode: Via the integrated IPython Shell on a cluster (or by using IPython by itself)

Writing MPI programs in Python on a Windows HPC cluster


MPI, which stands for Message Passing Interface, is an industry-standard specification for writing SPMD (Single Program Multiple Data) style programs on clusters. What you basically need is an MPI library that manages communication, a programming language that can call into the library (directly or via wrappers), a Cluster Manager of sorts, and ideally an actual cluster (though one isn't necessary just to try things out).

There are various MPI libraries available on Windows and Linux systems, and MPI programs are generally very portable across operating systems. Two popular libraries on Windows are MSMPI and Intel's MPI library. MSMPI is based on the open source MPICH library, with various optimizations and security enhancements for Windows clusters. You can interface with MSMPI from almost any language, including CPython and IronPython.
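
To get a feel for the SPMD model, here's a minimal MPI4PY sketch (an illustration, not from the original article) in which every process runs the same script and reports its identity:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the communicator
size = comm.Get_size()   # total number of processes launched

print("Hello from rank %d of %d" % (rank, size))

Saved as, say, hello.py and launched under mpiexec, each process prints its own rank; rank 0 is conventionally used for coordination and output.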



Do I really need to set up a dedicated cluster?

Note that you do not need a dedicated cluster to try things out. You can try writing/running MPI programs in several ways:

  • Locally on your single- or multi-core dev box (MPI knows how to make appropriate use of multiple cores)
  • By roping together PCs you have access to and quickly setting up a "Cluster of Workstations" (e.g. using spare cycles at night, with auto-release if a user touches the mouse/keyboard, etc.)



MPI Quick Start Guide

The following steps will enable you to install the necessary bits and start writing MPI programs.

Note: make sure the Python.exe/OS/MPI library combination you install is consistently 32-bit or consistently 64-bit.
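
One quick, generic way to check whether a given Python interpreter is 32- or 64-bit (this prints 32 or 64):

>python -c "import struct; print(struct.calcsize('P') * 8)"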

  • Download and install the Windows HPC SDK -- This basically provides you with the MSMPI library and necessary headers.
  • Download and install the Windows HPC Pack -- this gives you the Cluster Manager bits for setting up a dedicated cluster, a job scheduler, etc.
  • As mentioned before, you can experiment with MPI on your local machine or set up a cluster. A cluster can be a dedicated one or just a collection of PCs that are sitting around in your office (e.g. not used at night). Here's the Cluster setup guide.
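
Before involving Visual Studio or a cluster, you can sanity-check the installation from a command prompt by launching a script under the mpiexec launcher (the script name here is a placeholder, e.g. the hello-world sketch above):

>mpiexec -n 4 python.exe hello.py

If four copies of the output appear, MSMPI and the Python wrappers are wired up correctly.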


Once installed, you can fire up VS and type in your MPI "Hello World!" program to verify things locally, without a cluster, first. Start a New Project, choose MPI Project, and paste in this code, which calculates Pi:


from mpi4py import MPI
comm = MPI.COMM_WORLD

import random, array
rand = random.Random()
size = comm.Get_size()

def computePi(numSamples):
    rank = comm.Get_rank()

    oldPi = 0.0
    pi = 0.0
    mypi = 0.0

    sndBuf = array.array('d', [0.0])
    recvBuf = array.array('d', [0.0])
    while True:
        # Monte Carlo sampling: count random points that land inside the unit circle
        inside = 0
        for i in xrange(numSamples):
            x = rand.random()
            y = rand.random()
            if (x*x) + (y*y) < 1:
                inside += 1

        oldPi = pi
        mypi = (inside * 1.0) / numSamples

        # combine each rank's local estimate into a global sum, visible on every rank
        sndBuf[0] = mypi
        comm.Allreduce((sndBuf, MPI.DOUBLE), (recvBuf, MPI.DOUBLE), op=MPI.SUM)

        pi = (4.0 / size) * recvBuf[0]

        delta = abs(pi - oldPi)
        if rank == 0:
            print("pi: %f - delta: %f\r\n" % (pi, delta))

        # stop once successive estimates agree closely enough
        if delta < 0.00001:
            break
    return pi


if __name__ == '__main__':
    pi = computePi(10000)
    if comm.Get_rank() == 0:
        print("Computed value of pi on %d processors is %f" % (size, pi))

    MPI.Finalize()

Make sure you've installed your MPI library wrapper (MPI4PY or MPI.Net) and import it. Then open your Project Properties (Alt+Enter), and on the Debug tab, select the MPI Debugger. Click on the Run Environment field and set the Head Node of the "Cluster" to localhost. In this example the number of processors has been set to 4.

[Screenshot: i.debug.localhost.png]

Now hit F5 to run your code. You should see 4 cmd windows pop up (for output, which is also captured in VS). Press Ctrl+Shift+Esc to bring up the Task Manager and make sure all cores are being utilized.

[Screenshot: i.debug.run1.png]

Now that we know the setup is correct, let's run the sample on an actual cluster. From the command line enter something like:

>clusrun /scheduler:head-node-name /all hostname.exe


This will run the hostname command on all the nodes of the cluster fronted by the specified head node, in this case layton00:

[Screenshot: i.clusrun.png]

Next we need to make sure that the Python runtime environment is available on all the compute nodes on the cluster - or at least the nodes you'll be using. You can install Python with a command like:

>clusrun msiexec.exe /quiet /i "\\some\path\Python-Distro.msi"

You can even create a "node group" called Python-nodes and ask the job scheduler to always route your HPC jobs to that group of nodes, which you've specially prepared. This is useful when, say, you have some nodes with GPGPUs and want to use PyCUDA: the scheduler will automatically route your jobs to those compute nodes.
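
As a sketch (the group name, core count, and script name are placeholders), such a pinned submission might look like:

>job submit /scheduler:head-node-name /nodegroup:Python-nodes /numprocessors:8 mpiexec python.exe pi.py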

To run a quick test on a cluster, run the Pi program from the command line (or through WinHPC UI):
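
A submission along these lines (a sketch reconstructed from the options explained below; the scheduler name and paths are placeholders) would be:

>job submit /scheduler:head-node-name /numprocessors:24 /stdout:\\some\path\pi.out mpiexec python.exe pi.py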

[Screenshot: i.pi.out.png]

where:

  • job submit: submits a job request to the HPC scheduler
  • /scheduler: connect to this head-node/cluster
  • /numprocessors: request this many cores (if enough cores aren't available, the job will be queued)
  • /stdout: write program output here
  • mpiexec: start the Python program under the control of the mpiexec launcher


To initiate a debug session from inside Visual Studio, select the MPI Debugger and click the Run Environment field. This brings up a form that lets you set various attributes of the job. To start with, if your cluster is on Active Directory, its head node will show up in the list of clusters automatically. The form also shows which nodes are available, how many procs they have, how loaded they are, etc. From there you can specify the number of ranks, whether to schedule to nodes, sockets, or cores, and so on. In general, all you really need to set is the number of ranks (cores):

[Screenshot: i.pick.nodes.png]


Where:

  • Headnode: the name of your cluster head node (or localhost)
  • Number of procs: the number of procs you require for this job (which will be queued if necessary until they become available)
  • Schedule per: Core/Socket/Node - you can generally leave this as Core, unless you require one rank per machine, for example
  • Pick nodes from: there are different families of nodes on WinHPC - generally you want to leave this as ComputeNodes
  • Manually select nodes: if you occasionally need to run your code on specific nodes, you can specify them here

OK... with your environment configured, you're ready to start debugging on a cluster! This time when you hit F5, PTVS will submit a job to the cluster on your behalf, based on the parameters set above, and monitor its progress. You can see the job submit string in the VS Output (General) window:


If you switch the "Show output from:" field to "Debug", you can see the program's print statement output.



After hitting F5 you can switch to the HPC job manager and see that you have an "Active" job on the cluster:

[Screenshot: i.jobs.active.png]


Now let's set a breakpoint at the line "delta = abs(...)" :

This time when you hit F5, your code should break at the specified line, and in the VS Processes window you should see 24 processes (ranks 0..23). In the Locals window, notice the current process's "rank" value (14). This is how you can tell whether you're in the special rank zero or one of the slave ranks.



Clicking on different ranks in the Processes window changes focus to that particular rank. You can also "pin" the rank variable so it's always visible, as shown here (rank=14). When you hit F5 again, the program will pause whenever another rank reaches that breakpoint.

If you run on multi-core nodes, you can request scheduling each MPI rank to a Node, and then within each rank do further parallel processing at the node level - for example by calling into a C++/OpenMP routine, or from IronPython by calling into a TPL library! There's really no limit here to impressing your friends or increasing your job security...
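
The paragraph above mentions C++/OpenMP and TPL; as an analogous pure-Python sketch (an illustration, not the article's method), each MPI rank could fan its slice of work out to its node's cores using the standard multiprocessing module:

from mpi4py import MPI
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    rank = MPI.COMM_WORLD.Get_rank()
    # each rank owns one slice of the data and a worker pool,
    # sized to its node's core count by default
    pool = Pool()
    results = pool.map(square, range(rank * 100, (rank + 1) * 100))
    pool.close()
    pool.join()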


Hovering over objects provides a quick glance at their current values, as shown here for comm.Allreduce:

[Screenshot: i.debug.tooltip.png]

In the MPI Debugger launcher you can set various other advanced options, but in the majority of cases you just point VS at the cluster's head node, set the number of ranks, and hit F5 - that's it...

MPI programming and debugging with IronPython is pretty much the same - install MPI.Net, install the runtime module on each of the cluster nodes, and run. Note that IronPython uses the .Net debugger, which uses the C# expression evaluator - but it enables debugging through C#/F#/etc. as well, which the CPython debugger doesn't (yet):

[Screenshot: i.mpi.mpi.net.png]

In general the MPI.Net calls require fewer parameters, since data types, sizes, etc. are known. MPI.Net in turn constructs the C API calls and P/Invokes into MSMPI.dll. When you use the standard MPI types (int, real, ...) or user-defined types, the performance hit of using MPI.Net compared to C will be in the noise range (i.e., no serialization penalty). MPI.Net is usable from C# and F# as well.

Ok, that was the overview of parallel programming using Python and MPI on a cluster. But wait... we're not done yet: IPython!


IPython Quick Start Guide

Beyond being an awesome Python development environment, IPython has built-in capability for interactive parallel computing, including support for Windows HPC. Running Python code this way is particularly amenable to prototyping and interactively building up your parallel program - something that's somewhat unnatural with MPI. You can use IPython from VS directly (Tools/Options/Python/Interactive mode: IPython), or you can use it outside VS in standalone mode. IPython runs identically across Linux, MacOS and Windows, so it's a great choice if that's an important factor as well.

Assuming you've already installed IPython and the Windows HPC bits, let's write our first program. First, let's make sure IPython works on the local machine. Enter:

ipcluster -n 2

This will spit out a whole bunch of informational text and set up two "engines" ready to receive requests. Much like running MPI on your local box, IPython knows how to take advantage of multiple cores on your machine:

[Screenshot: i.ipcluster.start.png]

To let IPython schedule jobs through the WinHPC scheduler, the general steps are:

  • Create a cluster profile using: ipcluster create -p mycluster
  • Edit configuration files in the directory .ipython\cluster_mycluster


Both steps are covered in detail in the docs at http://ipython.scipy.org (it's not complicated). Once done, you can start up the desired number of engines like this:

ipcluster start -p mycluster -n 32

IPython will handshake with the WinHPC scheduler and submit a job to spin up the engines as seen here:

[Screenshot: i.ipython.hpc.png]


Now you can start IPython in a command shell, create a MultiEngineClient instance for your profile ("mycluster"), and use the resulting instance to do a simple interactive parallel computation.


In this example, we take a simple Python function and apply it to each element of an array of integers in parallel using the MultiEngineClient.map() method:

[Screenshot: i.ipython.pi.png]
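
The session in the screenshot goes roughly like this (a sketch assuming the IPython 0.10-era IPython.kernel client API; the function and range are illustrative):

from IPython.kernel import client

# connect to the engines started under the "mycluster" profile
mec = client.MultiEngineClient(profile='mycluster')

def f(x):
    return x**10

# apply f to each element of the sequence, spread across the engines
result = mec.map(f, range(32))
print(result)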


Profiling on a cluster

We haven't implemented VS remote profiling on a cluster in an integrated fashion, but you can log into any node on the cluster and run the PTVS profiler directly. You can do this locally as well; since these are SPMD programs where most of the ranks do similar things, you can profile one rank, optimize your code, and redeploy.
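
As a lightweight, code-level alternative to the PTVS profiler (a sketch, not the article's method), you can gate Python's standard cProfile module on the rank so that only rank 0 records a profile. This assumes the Pi sample from earlier is saved as pi.py (a hypothetical name):

import cProfile
from mpi4py import MPI
from pi import computePi   # the Pi sample shown earlier

if __name__ == '__main__':
    # every rank must still call computePi, or its Allreduce would deadlock;
    # only rank 0 pays the profiling overhead and writes a stats file
    if MPI.COMM_WORLD.Get_rank() == 0:
        cProfile.run('computePi(10000)', 'pi.rank0.prof')
    else:
        computePi(10000)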

Another option for profiling on a cluster is ETW (Event Tracing for Windows), which is somewhat like Solaris's excellent DTrace system. ETW performs low-overhead, system-wide profiling, including all the action inside the MPI library. For information on ETW and its integration with Windows HPC/MPI, please see Tracing MPI Applications. Once the profiling data is collected, it is clock-synced across the cluster (since different nodes run at slightly different frequencies), and the merged results can be viewed using JumpShot or Vampir for Windows:

[Screenshot: i.vampir.timeline.png]

Scaling Python code to the Azure cloud

We haven't done anything for this yet - hopefully by this summer!

Scaling Python code with Dryad (a Data-Intensive parallel computing Engine)

We haven't done anything for this either - hopefully by this summer!
