Lab 11 Material - November 6th, 2008

Files for this lab are in svn, at:

http://class.ged.idyll.org/svn/files/lab-11/

or you can copy them from

~ctb/lab-11/

on the CSE cluster.

---

Accurately and reproducibly measuring programs is a tough issue in its own right. Here are some simple techniques that you can use to get a rough handle on program performance.

Measuring program runtime on Linux/UNIX

The 'time' utility will give you numbers for runtime, user-space time (think "computation"), and kernel time (disk and network I/O). Try it out on arctic:

/bin/time python simple-loop.py 500000

(This program just adds together all the numbers between 1 and 500000.)

NOTE: 'time' alone will use the built-in shell utility; on arctic and other Linux boxes, be sure to use '/bin/time', otherwise you'll get weird-looking output.

The output should be something like this:

real    0m0.240s
user    0m0.211s
sys     0m0.019s

The 'real' number is the wall-clock seconds (how long YOU thought it took to run). The 'user' number is the number of seconds spent in user-space, and the 'sys' number is the number of seconds spent in kernel-space, accessing the disk etc.

(I don't know how to get numbers like this in Windows; any thoughts?)

If you run 10x more loops, you should see a roughly 10x increase in the time:

/bin/time python simple-loop.py 5000000

real    0m2.372s
user    0m2.107s
sys     0m0.113s

Note how the numbers add up -- is that always the case?

Sleeping

No! Try this:

/bin/time python simple-wait.py 2

real    0m2.031s
user    0m0.015s
sys     0m0.015s

This program just executes a simple 'sleep', telling the OS to wake it up after 2 seconds. So, the "wall clock" time is 2 seconds, but the OS recorded essentially no actual time spent doing anything -- because all it had to do was wake it up after 2 seconds!

In particular, this is what you'll see if a program is I/O bound -- if it spends a lot of time waiting for data. That's because on something like a blocking 'socket.recv' call, or on a 'read' of a file from a slow disk, the OS just puts the program to sleep until data arrives.

Q: What does this measurement look like for your blocking echo server, vs your non-blocking echo server, vs your threaded blocking echo server, from earlier HWs & labs?

Q: What if you switch to using the 'multitask' module to do cooperative multitasking -- how does that compare to non-blocking, blocking, threaded-blocking, etc. in terms of overall speed?

Q: What happens if you vary the number of threads in your threaded echo server -- what's the "most efficient" number of threads, and why?

Q: What happens if you vary the amount of data being sent by your echo server from 1 byte to 100k to 5mb? (Vary across blocking, non-blocking, threaded, etc.)

Q: What happens if you run your echo server code on 127.0.0.1, and only connect from the local machine, vs connecting to the external interface (the public IP address, which you find with ipconfig (Windows) or '/sbin/ifconfig -a' (Linux)), vs connecting across machines?

Overachieving

Here's a weirder case for you to consider. Try:

python thread-compress.py ecoli.fa 5

real    0m0.629s
user    0m1.079s
sys     0m0.031s

Hmmmmm... the real time elapsed was .6 seconds, but the user time is more than that -- 1.08 seconds. What's going on here?

It's threading! The code is using threads to do the heavy lifting (each thread is compressing the first 100kb of the E. coli genome...) and so it takes less wallclock time to do the work than the work it actually does.

Q: How do these numbers change when you increase the size of the file being read, the amount of sequence being compressed, and the number of threads doing the compression?

(Note, you can generate different size files by concatenating the ecoli genome multiple times into a single, bigger file.)

Q: How would these numbers change when you run it on a single-CPU machine? Can you think of a way to emulate that?

(Hint, look up the Semaphore object in the threading module.)

OK, so that's understandable: things are running in parallel. Neat! So what happens if you run this?

/bin/time python thread-loop.py 500000 5

real    0m1.056s
user    0m0.826s
sys     0m0.312s

This uses threads to run the 'simple-loop' code above -- why is 'real' more or less equal to the sum of 'user' and 'sys here?

Q: What if you use the multiprocess module available in Python 2.6 to run the compress and loop code in different processes rather than in different threads? Does this change your numbers (yes) -- and why? (Hint, look up the GIL...)

Q: how does the compression code speed scale with number of threads?

Q: What happens if you run all of this on Windows, in comparison to Linux?

Measuring program runtime from Python

While you can use the 'timeit' module,

http://www.python.org/doc/2.5.2/lib/module-timeit.html

it's not meant for anything other than CPU-intensive stuff, from what I can tell; if you try to use it to measure network or disk I/O, it may not help.

The simplest way to measure actual program performance on "slow" tasks like I/O (as opposed to algorithm performance) is to use 'time.time':

start_time = time.time()
... # do stuff
end_time = time.time()
print '%f seconds', end_time - start_time

Warning, 'time.time' is not particularly accurate for sub-second information!

Locking of centralized resources

You can use 'twill-fork' or urllib to run a bunch of simultaneous queries against your Web server; if you modify your sqlite3 interface to work in threads (by using locks and opening/closing the database connection inside the locked area) you can look at the impact of locking on throughput and CPU usage, too.

More essay notes

This is meant to be a hypothesis paper -- do a set of measurements, and try to explain them to yourself (and me) in a paper. I expect to see a graph or three, along with a legitimate and thoughtful attempt to explain the trends that you see. (Three points don't make a trend unless you're an astrophysicist, BTW.) You DON'T NEED TO BE CORRECT, but it helps, of course.

If your paper is too short (< 3-5 pages double-spaced Times-Roman 12pt with 1" margins and no more than a 3 inch masthead), do more measurements. Explanatory diagrams can substitute for some text but graphs do not add to the page count.

Do cite Web resources you use, if only informally; e.g.

This neat discussion (http://foo.com/some-threading-stuff) explains why I get the results I do. Basically, fib does foo and bar, so...

It is entirely appropriate to speculate wildly on how you might tune or otherwise adjust your Web server to deal with tens or hundreds or even thousands of simultaneous users.

PLEASE don't use arctic or the other centralized machines to run CPU, disk, or network-intensive stuff more than a few times, please; i.e. do all your code development and testing somewhere else and then use adriatic or arctic for benchmarking measurements only.

In terms of an essay proposal, I would like to see a sentence or two like this:

I propose to measure performance across non-blocking, threading, and multiprocess echo servers. I will also vary the total amount of data sent to see how network performance degrades with data size.

Please check the essay proposal in by Tuesday evening on November 11th into 'essay/proposal.txt'. The final essay is due December 2nd.

English (grammar) and spelling and logic all matter! Feel free to check in your source code under 'essay' and refer to the files by name in your paper if you want sympathy points for your partial work.

Writing C code that Python can call

Suppose you want to call C code from Python -- how do you do that? You can go through a full tutorial here,

http://docs.python.org/extending/index.html

but here's a quick demo of extending Python with a simple addition function in C.

Under 'lab-11/c-embed', there are two important files: 'module.c', which is our extension module, and 'setup.py', which tells Python how to compile 'module'. First, let's take a look at module.c.

module.c has three components: the function ('add') that we want to make available to Python; the methods table, that tells Python what the name of the 'add' function is; and the module init function, which initializes the module on import. (In this case, it just tells Python to load the methods in the methods table.)

The 'add' function does three things:

  • it unpacks arguments from a Python object into two integers;
  • it sums the integers;
  • and it packs the result back into a Python object, which it then returns.

Now, try

python setup.py build_ext -i

This tells Python to build the module, and put the result in the current directory; otherwise it hides it away under the build/ directory.

Now you can run python and try

>>> import module
>>> module.add(5, 6)
11

I've put this code in the 'test-module.py' file so you can try running that if you like.

Q: What happens if you put the looping code in this C function and then run it in multiple threads -- will it run in parallel? (A: No. Talk to me if you want to get it to run in parallel.)

Using pyrex to wrap C code

There are a bunch of options for wrapping C and C++ in Python; go to this URL for more info:

http://ivory.idyll.org/articles/advanced-swc/#wrapping-c-c-for-python

I've started to use 'pyrex' quite a bit. Pyrex is an intermediate language that can speak to both C and Python code; this intermediate language gets translated into C code with a Python interface automatically.

If you look under 'lab-11/pyrex', you'll see a few files:

module.c
module.h
module_wrap.pyx
setup.py

The module.c and module.h files just contain the C implementation of the 'add' function, and are straight C code. setup.py is just another module telling Python how to compile the code. The 'module_wrap' function contains a bit more interesting stuff, though:

cdef extern from "module.h":
     long add(long, long)

# this function can call arbitrary C code, and where possible will be
# compiled into efficient C code itself.

def wrapped_add(a, b):
    return add(a, b)

What this does is tell pyrex where to get the function 'add' -- from C code -- and then defines the 'wrapped_add' function to call 'add' and return the result.

To build this, type

python setup.py build_ext -i

and watch the monkeys go... then run

>>> import module_wrap
>>> module_wrap.wrapped_add(5,6)

The important thing here is that 'wrapped_add' (and other Pyrex functions) can call any C code that they want, so this is a good simple way to wrap a small set of C functions.

If you want to wrap lots of C code, or wrap C++ code, there are other tricks and options.

Calling external programs and keeping the output

Try:

import subprocess
cmd = 'ls'

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
(output, error_output) = p.communicate()

print output

(This code is in 'lab-11/subprocess-run.py')

This will let you run any UNIX or Windows command and get the output and error output into a Python string.