Tim Evans Python Tips Page


The Python web site is the best place to start. Python is easy to use and fast for developing projects. I know people have used this network package and it came highly recommended in a blog about Python packages.

Some places to look

Python documentation has most links.
For Imperial College users, try the physics first year lab pages (but access is ridiculously limited, you may have to ask for access).
Tutorialspoint has some nice summaries e.g. on string formatting.

Tips

For formatting output, the preferred way now seems to be the .format method of strings.

print(' Integer {0:d} Integer {0:5d} Float {1:.3g} Float {1:9.3g} String {2:s}'.format(239,12.356789,'abc'))
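As a rough illustration of what the format specifications in the line above do, here is a small sketch written for Python 3 (where print is a function). The names `line` and `row` and the values in the second example are made up for this illustration.

```python
# Positional fields {0}, {1}, {2} pick arguments by index; the part after
# the colon is the format specification (width, precision, type).
line = 'Integer {0:d} Integer {0:5d} Float {1:.3g} Float {1:9.3g} String {2:s}'.format(
    239, 12.356789, 'abc')
print(line)

# Named fields are often easier to read than numbered ones.
# '<' left-aligns and '>' right-aligns within the given width.
row = '{name:<8s}|{count:>6d}|{value:8.3f}'.format(name='edges', count=42, value=3.14159)
print(row)
```

Here `{0:5d}` pads the integer to five characters, `.3g` keeps three significant figures, and `8.3f` gives three decimal places in a field eight characters wide.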

    numberlines=0
    try:
        print("--- opening edgelist file  {0}".format(fullfilename))
        # 'with' closes the file even if a write fails part way through
        with open(fullfilename,"w") as f:
            for edge in edgelist:
                f.write('{0}\t{1}\n'.format(edge[0],edge[1]))
                numberlines=numberlines+1
        print("--- finished writing edgelist file  {0} {1} lines written".format(fullfilename,numberlines))
    except IOError:
        print("*** Failed to finish edgelist file  {0} {1} lines written".format(fullfilename,numberlines))

To write a line of numbers and strings to a file you can also use the C-like % format operator

print('%i \t %i \t %i \t %f \t %f \t %f \t %f \n'%(index,x,y,xcircle,ycircle,xrand,yrand))
with open(filename,"w") as f:
    f.write('%i \t %i \t %i \t %f \t %f \t %f \t %f \n'%(index,x,y,xcircle,ycircle,xrand,yrand))

A cheap way to deal with unicode and other non-ascii strings while remaining in an ascii environment is to use
```
ustring=u"unicode mess"
ustring.encode('ascii','replace')
```
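Note the u in front of the quote to indicate a unicode string. With the 'replace' option any character that cannot be encoded as ascii is replaced by a question mark.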

For regular expressions use the raw string option of python (r in front of double quotes) and the re package

import re
wikilinkregex=r"\[\[.*?[\]\|]"
text=" abc [[link1|text]] xyz [[link2]]"
re.findall(wikilinkregex,text)
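For illustration, this is what the pattern above returns on the sample text, together with a variant using a capturing group so that only the link target is returned. The variant and the name `targetregex` are my own assumption about what is usually wanted, not part of the original tip.

```python
import re

text = " abc [[link1|text]] xyz [[link2]]"

# Same pattern as above: "[[", then as few characters as possible,
# up to and including the first "]" or "|".
wikilinkregex = r"\[\[.*?[\]\|]"
matches = re.findall(wikilinkregex, text)
print(matches)

# Variant: the parentheses form a capturing group, so findall
# returns only the part between "[[" and the closing "]" or "|".
targetregex = r"\[\[(.*?)[\]\|]"
targets = re.findall(targetregex, text)
print(targets)
```

The first call gives the matches with their delimiters still attached, the second gives just the link names.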

To deal with file names try the os.path module
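A minimal sketch of the os.path functions I would expect to be most useful here. The path `data/networks/edgelist.txt` is a made-up example and does not need to exist.

```python
import os.path

# join builds a path using the separator of the current operating system
fullfilename = os.path.join("data", "networks", "edgelist.txt")  # hypothetical file

# split separates the directory from the file name,
# splitext separates the extension from the rest
directory, filename = os.path.split(fullfilename)
root, ext = os.path.splitext(filename)

# build the name of an output file next to the input file
outputname = os.path.join(directory, root + "_out" + ext)

print(directory, filename, root, ext)
print(outputname)
print(os.path.exists(fullfilename))  # check before trying to open a file
```

Using join rather than gluing strings together with "/" or "\\" means the same code should work on Windows and Linux.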

To find the version of a module at runtime try this

import pkg_resources
pkg_resources.get_distribution("moduleofinterest").version
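On newer versions of Python (3.8 or later) the standard library module importlib.metadata offers a similar lookup without needing setuptools. The sketch below assumes such a Python; the helper name `installed_version` is made up for this example. Note that the name to pass is the name of the installed distribution, which is not always the same as the name used in the import statement.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the version string of an installed distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# prints a version string if networkx is installed, otherwise None
print(installed_version("networkx"))
```

Many packages also expose a `__version__` attribute on the imported module, but that is only a convention and not guaranteed.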

When a python file is run or imported, all its top-level code is executed in order: functions are defined and any code outside a function is run. To call a main function only when the file is run as a script (and not when it is imported) use the following
```
if __name__ == "__main__":
    main()
```
There is a standard way to document python code. Look up docstrings; in particular the docstring conventions are in PEP 257.
When you show a plot in matplotlib python will normally stop executing until you close the plot window. To change this behaviour you need to be in interactive mode so you can say
```
import matplotlib.pyplot as plt
plt.ion() # this turns on interactive mode
```
then when you do plt.show() it will not block the rest of the execution.

Useful ways to change settings for file locations etc. depending on the user or machine running the code

import getpass
username = getpass.getuser()
import socket
hostname = socket.gethostname()
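One way these two values might be used is sketched below: look up the host name in a table of known machines and fall back to a per-user default otherwise. The host names, directories and the helper names `choose_data_dir` and `current_user_and_host` are all hypothetical.

```python
import getpass
import os.path
import socket

# Hypothetical per-machine settings; replace the keys with your own host names.
DATA_DIRS = {
    "office-pc": "C:/data",
    "laptop": "/home/user/data",
}


def current_user_and_host():
    """Return (username, hostname), using 'unknown' if no login name can be found."""
    try:
        username = getpass.getuser()
    except (KeyError, OSError, ImportError):
        # getuser can fail where no login name is available
        username = "unknown"
    return username, socket.gethostname()


def choose_data_dir(hostname, username, default_root="."):
    """Pick a data directory for this machine, falling back to a per-user default."""
    if hostname in DATA_DIRS:
        return DATA_DIRS[hostname]
    return os.path.join(default_root, username + "_data")


username, hostname = current_user_and_host()
print(choose_data_dir(hostname, username))
```

This keeps machine-specific paths in one place at the top of a script rather than scattered through the code.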

Running Python

There are several approaches and useful tools when running python. One way is just to run scripts from a command line in which case you need an editor. There are other (usually) built in graphical editors or a full scale IDE approach.

Editors

Google's Python Class has some useful tips on how to set up editors. They suggest that you want the tab key to insert two spaces rather than a tab character. They also suggest that files are saved with the Unix end-of-line convention; if running a script produces an "Unknown option: -" error, the file may have the wrong line endings. For Notepad++ they suggest the following change

Tabs: Settings > Preferences > Edit Components > Tab settings, and Settings > Preferences > MISC for auto-indent. Line endings: Format > Convert, set to Unix

GUI

I use IDLE which is part of Python. There is a page on IDLE by Anne Dawson which helped me to get going. Note that the command history is obtained using alt+P. It's useful when developing code as it is interactive. I often try things out interactively (simple examples to check command syntax or to see if a command behaves as I think it should) then write them into my full programme in eclipse.

To change directory use the following

>>> import os
>>> os.getcwd()
'/home/user'
>>> os.chdir("/tmp/")
>>> os.getcwd()
'/tmp'

Eclipse and pydev

For most work I actually prefer to use a proper IDE (integrated development environment). For python I use the Eclipse IDE (also useful for java) with the pydev package added through the Eclipse system of updates, try this help on eclipse and pydev. I have seen many recommendations for this.

Like all IDEs there is a learning curve and a large amount of non-python overhead to learn, always similar but different from IDE to IDE. It is not worth it for the odd project, and not strictly necessary even for larger projects. However I do think for any long term work it will repay the investment handsomely. I found the tutorial by Lars Vogel on using eclipse with python a good place to start.

(31/07/13) In fact I tried adding a new project, picked python and eclipse prompted me to go to another window to install pydev.

(27/07/2013) Changed access to python directory to see if it helped adding libraries. Perhaps best to run as administrator when running executable installation files.

(25/6/14) I can not get Eclipse to run with the Enthought Canopy python distribution. I can set the path to the correct python location but then Eclipse can not find the libraries. I tried to set up a parallel cPython installation (64bit Windows version) to use with Eclipse but then some of the Windows install packages for Numpy and so forth only find the Canopy distribution and won't let you change this. This seems to be because the packages are only compiled for 32 bit Windows due to compiler licence restrictions. Currently stuck on this. It's Canopy or Eclipse and I want to stick to an IDE I can use for other things too. Now trying the WinPython distribution with Eclipse.

WinPython and Eclipse. I found some pretty good instructions on how to link Eclipse and pydev to WinPython. I placed my WinPython in C:\WinPython so I needed to point pydev to C:\WinPython\python-2.7.6.amd64\python.exe. I'm pretty sure the autoconfig will work and that the key here is to restart eclipse after making these changes. Perhaps that is true for other reconfigurations. I didn't mess around with grammars and setting explicit version of Python though that could be useful if I need a standard 32bit cPython installation for something later.

Packages and Libraries

Windows 64bit nightmare

It appears that most of my problems come from trying to use a 64 bit version of the standard Python (cpython) as the standard scientific libraries do not come in a Windows 64 bit version due to compiler licence issues. Solution may be to switch to the completely portable WinPython distribution.

Installation of packages

I have been running into trouble with different versions of packages being accessed and it is hard to see from python what version of a package your system has found. It is much easier to switch to the Enthought Python package which installs the scientific packages by default and maintains them to the latest versions.

The best way is to use the easy_install command mentioned on many web pages. What they fail to tell you is that this is part of another package that you have to install first. So first try to install setuptools from the Python package index. This worked easily enough. Note that you download a script for a python programme so you need python installed to run it (either double click on the file ez_setup.py or run it from a command window with python ez_setup.py). Before you use easy_install you may have to import setuptools inside python.

(27/07/2013) I could not get easy_install to work inside python and could not find any Windows executables to download. However running the ez_setup.py python file creates an exe file in the Scripts subdirectory of the main python directory. Then I ran this from a Windows command line to install packages. For instance I used

easy_install networkx

followed by

easy_install --upgrade networkx

from the command line to install networkx (for some reason it first installed an old version).

Alternatively, most major packages come with a Windows installer - just make sure you pick the one for the correct version of Python (it's obvious from the file names). The main Python package worked fine and I have a version 2.7 with the IDLE GUI interface. Just make sure that c:\Python27 and its Scripts subdirectory are on the PATH environment variable.

Finally some suggest using pip. However again this is not a command you type inside python but a programme run from the operating system's command line. You get errors containing lines such as NameError: name 'pip' is not defined if you try typing it at the python prompt. I ran it from the command window in Windows; it was in C:\python27\Scripts, so I changed to this directory and then typed pip install networkx. If you have already been messing around then you might need to upgrade, so use pip install networkx --upgrade.
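If several python installations are present it can be unclear which one a bare pip command belongs to. One commonly suggested way round this is to run pip as a module of a specific interpreter (python -m pip install networkx). The sketch below builds that command from inside a script; the helper names `pip_command` and `pip_install` are made up for this example, and the actual installation call is left commented out as it needs network access.

```python
import subprocess
import sys


def pip_command(package, upgrade=False):
    """Build the command that runs pip with the same interpreter as this script."""
    command = [sys.executable, "-m", "pip", "install", package]
    if upgrade:
        command.append("--upgrade")
    return command


def pip_install(package, upgrade=False):
    """Run pip in a separate process; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_command(package, upgrade))


# show the command that would be run
print(" ".join(pip_command("networkx", upgrade=True)))
# pip_install("networkx")  # uncomment to actually install (needs network access)
```

Because sys.executable is the interpreter running the script, the package ends up in the installation you are actually using.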

Scientific packages

There is a useful blog about Python packages. Scientific libraries of interest include (watch the order you add them, one may depend on the other so best use easy_install).

numpy numerical arrays and often used in other packages. I found that there was an easy Windows executable but I would probably now use easy_install. Numpy documentation is alright but I have found some things hard to follow. My Numpy tips are below.
scipy general package for maths, science and engineering. I found that there was an easy Windows executable but I would probably now use easy_install. Scipy documentation is OK but I have found some things hard to follow so see my Scipy tips below.
Matplotlib is a 2D plotting library producing publication quality figures in many formats.
NetworkX
Describes itself as "High productivity software for complex networks". It's a free package for Python. Lots of information on the web. If you have a problem, google it and you will probably find something relevant.
I installed by downloading the source code, following instructions on the web site. The easy_install suggested on the web site would be easier and better.
The problem I found is that there is no immediate visualisation here. You seem to have to install other packages to link through to Graphviz drawing programmes or matlab graph drawing.
powerlaw has a python package which needs several of the packages above plus mpmath
pydot.
Pycluster
RPy
scikit-learn an easy to use machine learning library recommended in Kaggle.

Numpy tips

Reading in data

The loadtxt command is an easy way to read in text data

      import numpy as np
      tweets, authors = np.loadtxt('c:/data/textdata.txt', float, skiprows=1, usecols=(0,1), unpack=True)

This will give two arrays, tweets and authors, with tweets being the first column (numbered 0 by loadtxt) and authors being the second column (numbered 1 internally). Both will be of the float data type, as set by the second argument. The skiprows option defaults to 0 but if there is a row of column headings then you need to set skiprows=1. If you have missing entries then the more general genfromtxt routine is needed

      import numpy as np
      missingCode = 'missing'
      jid1, rating1 = np.genfromtxt('c:/data/textdata.txt', np.str_, skip_header=1, usecols=(0,1), unpack=True, missing_values=missingCode, invalid_raise=False)
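To try both routines without a data file, the sketch below reads from small in-memory strings instead. The numbers are made up, and it assumes a reasonably recent Numpy in which genfromtxt takes skip_header and missing_values.

```python
import io
import numpy as np

# Small in-memory stand-in for a data file with one header row (made-up numbers).
sample = u"tweets authors\n10 3\n25 7\n40 12\n"
tweets, authors = np.loadtxt(io.StringIO(sample), dtype=float, skiprows=1,
                             usecols=(0, 1), unpack=True)
print(tweets, authors)

# Same layout but with a missing entry marked by the word 'missing'.
sample_missing = u"jid rating\n1 4.5\n2 missing\n3 2.0\n"
jid, rating = np.genfromtxt(io.StringIO(sample_missing), dtype=float, skip_header=1,
                            usecols=(0, 1), unpack=True,
                            missing_values='missing', filling_values=np.nan)
print(jid, rating)

# nan marks the missing rating, so it can be filtered out afterwards
print(rating[~np.isnan(rating)])
```

With unpack=True each column comes back as its own array, which is why two names can be assigned at once.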

Scipy tips

Statistical distributions

Statistical distributions are a bit odd at first so read the introduction on Scipy statistical distributions. All statistical distributions have a name (such as uniform, norm and lognorm) and various functions giving the pdf, cdf or a sequence of random numbers drawn from the distribution, e.g. in Scipy these would be called name.pdf, name.cdf and name.rvs. You need to import the distributions to use them

>>>from scipy import stats
>>>from scipy.stats import name

All statistical functions take two special arguments, the shifting (loc) and scaling (scale) parameters. Suppose our random variable is x and name.pdf gives the function p(x), that is with probability p(x)dx we will get a value between x and x+dx. Then the standard python command to get the p(x) value is name.pdf(x) which is shorthand for name.pdf(x,loc=0,scale=1). Now suppose p(x) has a fixed mean of mu and a standard deviation of sigma and the python name routines have no additional parameters to change this. What you need to do is use the shift and scale to get a distribution of the same shape but with a different position and width. That is, we need to use something like

name.pdf(X,loc=L,scale=S)

Warning: this is not the same as name.pdf((X-L)/S,loc=0,scale=1). When working with the pdf of a continuous distribution there is a subtle difference as the two are densities with respect to different variables. The form name.pdf((X-L)/S,loc=0,scale=1) is just the original standard density p evaluated at the point x=(X-L)/S, so it is still a density per unit x. The form name.pdf(X,loc=L,scale=S) is the density q(X) of the new variable X=L+S*x. To keep the probability of landing in corresponding intervals the same we must demand that q(X)dX=p(x)dx, so q(X)=p((X-L)/S)*dx/dX=p((X-L)/S)/S. That is, name.pdf(X,loc=L,scale=S) automatically includes the factor dx/dX=1/S needed to convert the density from one variable to the other. This form is probably the version you want. Note that for other functions, such as the cdf, the two forms give the same value, but it is best to stick to the name.cdf(X,loc=L,scale=S) style in all cases.

One way to see this is to realise that if the standard form p(x) has mean mu and standard deviation sigma then q(X) = name.pdf(X,loc=L,scale=S) has mean L+S*mu and standard deviation S*sigma. In contrast p((X-L)/S) = name.pdf((X-L)/S,loc=0,scale=1), regarded as a function of X, is not even normalised: it integrates to S rather than 1.

Another way to see this is to look at the uniform distribution. The standard Scipy form is p(x)=1 if 0 < x < 1 and is zero otherwise. If we define X=L+S*x, so that x=(X-L)/S, then the associated pdf in the X variable is q(X)=1/S=dx/dX if L < X < (L+S) and is zero otherwise. Indeed we find that

>>>from scipy import stats
>>>from scipy.stats import uniform
>>>uniform.pdf(0.5)
1.0
>>>uniform.pdf(0.5,loc=0.25,scale=2)
0.5
>>>uniform.pdf(0.125)
1.0

The last two are evaluated at the same value of the standard variable, (0.5-0.25)/S=0.125 with S=2, but as we have said they are densities with respect to different variables, which is why they differ by a factor of 1/S.
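The relationship between the two forms can be checked numerically. The sketch below uses the normal and uniform distributions with arbitrary values L=0.25 and S=2, and assumes the usual Scipy convention that name.pdf(X,loc=L,scale=S) equals name.pdf((X-L)/S)/S.

```python
import numpy as np
from scipy.stats import norm, uniform

L, S = 0.25, 2.0                  # arbitrary shift and scale for this check
X = np.linspace(-3.0, 5.0, 9)     # points at which to compare the two forms

shifted = norm.pdf(X, loc=L, scale=S)      # density of the new variable X
standard_at_x = norm.pdf((X - L) / S)      # standard density at x=(X-L)/S

# the two forms differ by exactly the factor 1/S
print(np.allclose(shifted, standard_at_x / S))

# the standard normal has mean 0 and standard deviation 1,
# so the shifted and scaled version has mean L and standard deviation S
print(norm.mean(loc=L, scale=S), norm.std(loc=L, scale=S))

# the same factor for the uniform distribution
print(uniform.pdf(0.5, loc=L, scale=S), uniform.pdf((0.5 - L) / S) / S)

# for the cdf there is no extra factor
print(np.allclose(norm.cdf(X, loc=L, scale=S), norm.cdf((X - L) / S)))
```

The first and last comparisons should both print True, which is the difference between a density and a probability in action.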

In general the distributions in python represent a family of distributions and accessing different members of the family is done by what are called shape arguments. This can be confusing as you might think that loc and scale change the shape too; certainly plots of name.pdf(x,loc=0,scale=1) and name.pdf(x,loc=L,scale=S) will not be the same. However Scipy wants to regard these as straightforward changes of variable, not as fundamentally different shapes, so you have to get used to this division of parameters. Take the gamma distribution, which on Wikipedia is defined in different ways in terms of different sets of parameters. One parameter is always Scipy's shape parameter. The other parameter discussed in the definitions in the literature is achieved by Scipy's scale parameter (it is called scale in many cases for the gamma distribution, and is the inverse of the rate in the others). However in all cases, this gamma distribution is defined as zero for x<0. So should one need a gamma distribution shape but starting at x=Z you would need to call gamma.pdf(x-Z,n)=gamma.pdf(x,n,loc=Z,scale=1) where n is the sole shape parameter needed by the Scipy gamma distribution routine. What is this shape parameter doing? You have to look at the documentation where it is written out. In general if there is more than one shape parameter, as there can be for more complicated distributions, then each one is passed as a separate argument after x, in the order given in the documentation, e.g. beta.pdf(x,a,b).
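A short check of the gamma distribution statements, as a sketch with arbitrary values: shifting the start of the distribution to x=Z with loc, and using scale for the second parameter theta of the common shape and scale definition x^(n-1) exp(-x/theta) / (Gamma(n) theta^n).

```python
import numpy as np
from scipy.special import gamma as gamma_function
from scipy.stats import gamma

n = 2.5        # shape parameter (arbitrary value for this check)
Z = 2.0        # where the shifted distribution should start
theta = 3.0    # scale parameter of the shape and scale definition
x = np.linspace(2.5, 8.0, 12)

# shifting by hand and shifting with loc give the same density
print(np.allclose(gamma.pdf(x - Z, n), gamma.pdf(x, n, loc=Z, scale=1)))

# below the starting point the shifted density is zero
print(gamma.pdf(Z - 0.5, n, loc=Z))

# scale plays the role of theta in the textbook formula
explicit = x ** (n - 1) * np.exp(-x / theta) / (gamma_function(n) * theta ** n)
print(np.allclose(gamma.pdf(x, n, scale=theta), explicit))
```

If a definition is given in terms of a rate instead, the corresponding Scipy scale is one over the rate.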

Example: lognormal. The manual gives this as a function of one variable x and one shape value, s, where

lognorm.pdf(x, s) = 1 / (s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2)

so in general we have that

lognorm.pdf(X, sigma, loc=L, scale=S) = 1 / (sigma*(X-L)*sqrt(2*pi)) * exp(-1/2*(log((X-L)/S)/sigma)**2)

Note the following

The second form is in terms of the variable y=(X-L)/S, which explains why there is no factor of S with the (X-L) outside the exponential: it is cancelled by the dy/dX=1/S factor.
Looking at the first form, the standard Scipy lognormal lognorm.pdf(x, s), it looks like a normal distribution in terms of a variable ln(x) with mean m=0 and standard deviation s. In fact lognormals are usually defined with a mu-like parameter m as
```
1 / (s*x*sqrt(2*pi)) * exp(-1/2*( (log(x)-m)/s)**2)
```
That is, the variable z=ln(x) is normally distributed with mean m and standard deviation s. However the mean and standard deviation for the x variable are not these parameters. For instance the mean of x is exp(m+(s*s/2)) not m (zero for the Scipy form) and the variance is [exp(s*s)-1]exp(2m+(s*s)) not s*s.
From this we see that the general form of the lognormal is obtained in Scipy by setting the loc parameter to zero and the scale parameter equal to exp(m), where again m is not the mean of the lognormal (it's the mean of log(x)) but it is the standard parameter used when describing lognormals
```
1 / (s*x*sqrt(2*pi)) * exp(-1/2*( (log(x)-m)/s)**2)  = lognorm.pdf(x, s, loc=0, scale=exp(m) ) = 1 / (s*x*sqrt(2*pi)) * exp(-1/2*(log(x/exp(m))/s)**2)
```

These can be illustrated as follows

>>> from scipy.stats import lognorm
>>> lognorm.mean(1,loc=0,scale=1)
1.6487212707001282
>>> lognorm.std(1,loc=0,scale=1)
2.1611974158950877
>>> from numpy import exp
>>> exp(0.5)
1.6487212707001282
>>> from numpy import sqrt
>>> sqrt((exp(1)-1)*exp(1))
2.1611974158950877
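The same kind of check can be done for a lognormal with a non-zero m. The sketch below uses arbitrary values m=0.7 and s=0.4, passes scale=exp(m) to Scipy, and compares with the usual lognormal formulae for the pdf, mean and variance.

```python
import numpy as np
from scipy.stats import lognorm

m, s = 0.7, 0.4                 # mean and standard deviation of log(x), arbitrary values
x = np.linspace(0.2, 6.0, 8)

# usual textbook lognormal density written out by hand
textbook = np.exp(-0.5 * ((np.log(x) - m) / s) ** 2) / (s * x * np.sqrt(2 * np.pi))

# Scipy version: shape s, loc zero and scale exp(m)
scipy_form = lognorm.pdf(x, s, loc=0, scale=np.exp(m))
print(np.allclose(textbook, scipy_form))

# mean and variance of x itself, which are not m and s*s
mean_formula = np.exp(m + s * s / 2)
var_formula = (np.exp(s * s) - 1) * np.exp(2 * m + s * s)
print(np.isclose(lognorm.mean(s, loc=0, scale=np.exp(m)), mean_formula))
print(np.isclose(lognorm.var(s, loc=0, scale=np.exp(m)), var_formula))
```

All three comparisons should print True if scale=exp(m) is the right translation between the two parametrisations.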

Plotting tips

Saving Plots as Files

I like to save a plot in several file formats so I don't need to rerun the code: pdf is my current vector format of choice for LaTeX documents, svg so I can edit the file in a vector package like Inkscape, jpg for presentations and quick discussions, and so forth. Each is specified in matplotlib by the standard extension used for that format. So you pass this routine a list of strings containing the desired file types, something I usually set as a global at the start of my programme. The routine I use is below but you may want to adapt it so you can pass or set other parameters, e.g. dpi for bitmap formats.

def saveFigure(plt,filenameroot,extlist=['pdf'],messageString='Plot'):
    """Save figures as files in different formats
    Inputs
    plt - a plot
    filenameroot - the full name of the file to be used but without the extension
    extlist=['pdf'] - a list of strings, each string is the extension of an allowed graphics type
    messageString='Plot' - string to print before printing name of file being created. Note empty string will produce no message at all.

    Output
    For every string, ext, in the extlist, it will produce the plot from plt in the format
    specified by the extension, ext, in the file in filenameroot.ext
    """
    for ext in extlist:
        if filenameroot.endswith('.'):
            plotfilename=filenameroot+ext
        else:
            plotfilename=filenameroot+'.'+ext
        if len(messageString)>0:
            print(messageString+' file '+plotfilename)
        plt.savefig(plotfilename)

The following is an outline of how I use this.

import matplotlib.pyplot  as plt
extlist=['pdf','jpg']
screenOn=False
if screenOn:
    print('--- plots shown on screen')
else:
    print('--- plots not shown on screen')

# http://matplotlib.org/examples/color/named_colors.html
colourlist = ['red', 'green', 'blue',  'cyan', 'magenta', 'brown', 'black', 'pink', 'purple', 'yellow']

(stuff)

    fig, ax = plt.subplots(figsize=(15, 15)) # set size
    ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
    ax.plot(X, Y, linestyle='', mec='none',c=colourList, ms=sizeList)
    ax.set_aspect('auto')
    if len(extlist)>0:
        saveFigure(plt,fullfilenameroot,extlist)
    if screenOn:
        plt.show()