Tim Evans R Tips Page

Tim Evans Computing Tips Main Page Tim Evans Informal Home Page | Tim Evans Imperial College page |

R

The R statistics package is free, widely used, and is extremely powerful. It is a statistics package first so its analysis tools, e.g. non-linear curve fitting, is very good. It has extensive plotting abilities and can produce pdf and postscript output suitable for publications. It can also display networks, e.g. using iGraph. I have found it painful to learn as it does not work in the same way as normal programming languages (works more like Matlab) and the manuals are dense if comprehensive. The best way in is to take the time to read the introductory manual and then to find a good book of which there are several (see the R Wiki book list). On the other hand being open source there is masses of material on the web. A search starting with the terms R statistics followed by other relevant keywords should throw up what you need. There are GUI front ends for Windows etc. but most commands need to be entered on the command line. I tend to test the commands on the command line, copying what works into simple scripts so I can repeat this quickly in the future. Scripts are easy to run (or "source" them in R language).

Some places to look

Tips

Non linear Fitting example

This can be very frustrating, but this is often due to the inherent nature of non-linear fitting and not the fault of R. I find that playing around with the starting values and bounds for the parameters being fitted can often turn rubbish into beautiful results. Persevere is the watch word here.

Create a data frame containing the data, here with two columns labelled k and logp

df <- data.frame(k = dataf$k[2:kmax], logp = log(dataf$pk[2:kmax]));
The parameters in this example are called Afit, gammafit and k0fit and one needs to specify lower, upper and starting values for these. Of course the starting values must be larger (smaller) than the corresponding value specified in the lower (upper) list. For instance I might use
 startlist <- list(Afit=Aest,   gammafit=gammaest,   k0fit=k0est);
 lowerlist <- list(Afit=Alower, gammafit=gammalower, k0fit=k0lower)
 upperlist <- list(Afit=Aupper, gammafit=gammaupper, k0fit=k0upper )
where Aest etc. all have numerical values. To fit see if the logp data fits form with a linear and log k terms, I use one of the following
  fitres <- nls( logp ~ log(Afit)-gammafit*log(k) - k/k0fit  , df, startlist, lower=lowerlist, upper=upperlist,  algorithm="port")
  fitres <- nls( logp ~ log(Afit)-gammafit*log(k) - k/k0fit  , df, startlist, lower=lowerlist, upper=upperlist)
Note the first part is how you specify a formula. The names in the formula are either data or parameters. The data variable are those present in the data frame df, e.g. logp refers to the values in df$logp. Any names appearing in the in the upper, lower and starting value lists are treated as the parameters whose values are being sought, e.g. Afit in the formula appears in startlist so its value is being found. Anything else better have a fixed numerical value.

A summary is given on screen and the results are in fitres which is a "a type of fitted model object". You need to look this up to find out how to get the information out of it. Look up the nls function in the stats package, for instance as discussed in the reference manual. For instance coef(fitres) gives the coefficients of the fit, summary(fitres) gives a summary.

Don't forget that non-linear fitting is not guaranteed to converge and you may fail to get an answer at first. You may need to think about what sort of values are likely, play around with start/lower/upper values or whatever to get it to work. If all else fails try to get it working with a trivial example.

Plotting example

See R Basic Graphs page for basic ideas and the R Graph Gallery for some fancy ones.

One idea is to use the package gplots which you get get installed from the GUI package menu.

One use for a list of objects is when you have run the same routine on different data sets and you want to plot the results on the same plot. Here ResultsList is a list of lists where the sublist ResultsList[[iii]]] has one set of data and a fit (for instance the output from nls,m see below) to that data as its entries. The trick is to open a new window then to plot an example with null data where you specify the features of the plot (log axes, titles etc). In particular here you must set the range of the axes by hand, e.g. by calculating them for yourself as illustrated by the first three lines below. You then add more data points or lines as required. Also note that this code generates different point symbols, line types and colours, the latter are a subset of those known to the system which can be found by typing colors()

  #First calculate the minimum value of y axis.
  yMin<-ResultsList[[1]]$yMin;
  for (iii in 2:length(ResultsList)) yMin<-min(yMin, ResultsList[[iii]]$yMin)
  # Short list of allowed colours suitable for display or printing
  colourlist=c("black", "red", "blue", "green", "magenta", "brown");
  # Now plot on screen
  windows()
  plot(x=NULL,  y=NULL, type="n", log="xy", xlim=c(xMin,xMax) , ylim=c(yMin,yMax) , main="Main Title", sub="(subtitle)", xlab="x axis name", ylab="y axis name")
  for (iii in 1:length(ResultsList)) {
  points(ResultsList[[iii]]$y,   ResultsList[[iii]]$x, pch=iii, col=colourlist[iii])
  lines(ResultsList[[iii]]$y, fitted(ResultsList[[iii]]$fit), col=colourlist[iii], lty=iii)
  }
  # Finally add legend
  legend (x="topright",y=NULL, nameArray[1:length(ResultsList)], col=colourlist[1:length(ResultsList)],lty=1:length(ResultsList),pch=1:length(ResultsList));
Note that on a MAC you need quartz() to get displays in a window, the window() is for MS Windows platforms so I often define a boolean OSWindows and use if (OSWindows) windows() else quartz(). One could probably access some machine environment variable to do this too.

To get output to a file or a new window on a screen you use device drivers. These may sound scary but there is just one for every different type of file or output device and by default the appropriate screen is set up so most of the time you don"worry about these. For instance postscript() or pdf() with extra arguments should be used to produce files of each of these types. Note that for the former you must close the device (using dev.off(which = dev.cur()) or similar) or you get empty output. Thats a good idea anyway for all `devices' except for screen output though many types of device (file/screens) don't seem to care. Also for eps (best for inclusion in documents) not ps output you must include the onefile=FALSE option.

  rootName="IC090521"
  hepsFileName<-paste(rootName,"yearhist.ps",sep="_")
  print(paste("ps plotting",hepsFileName), quote=FALSE)
  postscript(hepsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10)
  hist(yearList)
  dev.off(which = dev.cur())

Here is a simple example of how to alter the size and colours of the points or lines. It also shows how to set up a simple routine to do plots. A simple copying of a single line calling this routine can be used to reproduce the same plot for different outputs such as screen, eps file, pdf file. Also shown are labels for the curves.

#Next three lines are altered as needed
  screenOn=TRUE
  epsOn=FALSE
  pdfOn=FALSE
  pngOn=FALSE
  OSWindows=TRUE
  
fileNameFullRoot=paste("figure",sep="")
colourlist=c("black", "red", "blue", "green", "magenta", "brown");

plotFig<-function(){
    # next two control size of text and lines
	cexvalue=2
	lwdvalue=2
	xmin=0
	xmax=2.0
	# plots cross at this point and at (0,1)
	xcross=0.5
	ycross=0.5
	nvalues=100 	# number of xvalues to use
	xvec = seq(from=xmin, to=xmax, by=(xmax-xmin)/nvalues)
	# plot exponential e^(-gamma*x)
	gammavalue = -(1.0/xcross)*log(ycross)
	lll=1 ## dummy variable, helps if you later insert or remove some of the lines on the plot
    yvec=exp(-gammavalue*xvec)
	plot(xvec, yvec,  ylim=c(-0.01,1.01), type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue)
	xlabelpos=1.0
	text(xlabelpos,yvec[xvec==xlabelpos],labels=c("exponential"), col=colourlist[lll], offset = 2, pos=1, cex=cexvalue)
	# pos argument is 1, 2, 3 and 4, for below, to the left of, above and to the right of the specified coordinates.
	# plot power law 1/(x+x0)^alpha
	alphavalue = 2
	x0value = xcross/( (ycross^(-1.0/alphavalue))-1)
	lll=lll+1
	yvec=(x0value/(xvec+x0value))^alphavalue
    lines(xvec, yvec,  type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue)
	text(xlabelpos,yvec[xvec==xlabelpos],labels=c("power law"), col=colourlist[lll], offset = 2, pos=3, cex=cexvalue)
	# plot theta function
	lll=lll+1
    yvec=rep(1,times=length(xvec))
	yvec[xvec>xcross]=0
	lines(xvec, yvec,  type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue)
	text(xcross,0.8,labels=c("sharp cut off"), col=colourlist[lll], offset = 0.5, pos=4, cex=cexvalue)
}

 if (screenOn){
         if (OSWindows) windows() else quartz()
         plotFig()
  }
  # EPS plot
  if (epsOn){
         epsFileName<- paste(fileNameFullRoot,".eps",sep="")
         print(paste("eps plotting",epsFileName), quote=FALSE)
         postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=10, width=10, pointsize=20)
         plotFig()
         dev.off(which = dev.cur())
  }
  # PDF plot
  if (pdfOn){
         pdfFileName<- paste(fileNameFullRoot,".pdf",sep="")
         print(paste("pdf plotting",pdfFileName), quote=FALSE)
         pdf(pdfFileName, onefile=FALSE, height=10, width=10, pointsize=20, units="cm")
         plotFig()
         dev.off(which = dev.cur())
  }
# PNG plot
  if (pngOn){
         dpivalue=1200
         pngFileName<- paste(fileNameFullRoot,dpi,".png",sep="")
         print(paste("png plotting",pngFileName), quote=FALSE)
         png(pngFileName, height=10, width=10, pointsize=20, units="cm", res=dpivalue)
         plotFig()
         dev.off(which = dev.cur())
  }
 plotDataTableLines<-function(dataTable1,dataTable2, legendString1,legendString2){
    ltyList=c(2,3);
    plot(dataTable1$gamma, dataTable1$Communities,  ylim=c(0,12), xlim=c(0,2.5), main=NULL, xlab=expression(gamma), ylab="Number Communities", type="l", col=colourlist[2], lty=ltyList[1], lwd=2)
    lines(dataTable2$gamma,dataTable2$Communities, col=colourlist[3], lty=ltyList[2], lwd=2)
    legend(x="bottomright" ,y=NULL, c(legendString1,legendString2), lty=ltyList, lwd=c(4,4), col=colourlist[2:3]);
}
So then I would call this as
  #Next three lines are altered as needed
  screenOn=TRUE
  epsOn=TRUE
  pdfOn=TRUE
  pngOn=TRUE
  OSWindows=TRUE
  if (screenOn){
         if (OSWindows) windows() else quartz()
         plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4)
  }
  # EPS plot
  if (epsOn){
         epsFileName<- paste(fileNameFullRoot,".eps",sep="")
         print(paste("eps plotting",epsFileName), quote=FALSE)
         postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10)
         plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4)
         dev.off(which = dev.cur())
  }
  # PDF plot
  if (pdfOn){
         pdfFileName<- paste(fileNameFullRoot,".pdf",sep="")
         print(paste("pdf plotting",pdfFileName), quote=FALSE)
         pdf(pdfFileName, onefile=FALSE, height=6, width=6, pointsize=10)
         plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4)
         dev.off(which = dev.cur())
  } 
  # PNG plot
  if (pngOn){
         pngFileName<- paste(fileNameFullRoot,".png",sep="")
         print(paste("png plotting",pngFileName), quote=FALSE)
         png(pngFileName, onefile=FALSE, height=6, width=6, pointsize=20)
         plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4)
         dev.off(which = dev.cur())
  }

A useful function to increase the scale of the plot e.g. to leave room for text labels

plotrange<-function(coordinates,factor=0.05){
 cmin=min(coordinates)
 cmax=max(coordinates)
 crange=cmax-cmin
 cextra=crange*factor
 c(cmin-cextra,cmax+cextra)
}
I call this as
plot(xvalues,yvalues,xlim=plotrange(xvalues,0.05),ylim=plotrange(yvalues,0.05))
which gives an five percent whitespace border.

This is the old way I used to combine several data sets and lines on one plot. The options in the plot command illustrate several of the useful options. Here s010 and the other parts of SimonDataList are data frames formed using a read.table. The double square bracket is confusing and comes from the default name given to entries in list. The FitList is a list of fits coming out of some call to nls or similar fitting command. ifirst and ilast are there just for flexibility.

  colourlist=c("black", "red", "blue", "green", "magenta", "brown");
  SimonDataList <- list(s010 ,s250, s750, s950, s990 , s999);
  ifirst=1
  ilast=length(SimonDataList)
  plot(SimonDataList[[ifirst]],   type="p", log="xy" ,  col=colourlist[ifirst], pch=ifirst, xlim=c(kmin,kmax), ylim=c(1e-9,1), xlab="degree k", ylab="p(k)", main="Laird and Jensen EPL 2006 Fig 1 (ll, ac fit)")
  for ( i in (ifirst+1):ilast) points(SimonDataList[[i]],  col=colourlist[i], pch = i);
  for ( i in ifirst:ilast) lines(SimonDataList[[i]]$k, exp(fitted(FitList[[i]])), col=colourlist[i]);

Offset of axes

The axes are usually drawn offset from the data. To change this you need the xaxs or yaxs options. This example is taken from the blog by Kelly O’Day which contains many other useful tips for R.
# Here are two simple x and y vectors to be plotted
    x <- c(0,1,2,3,4,5,6,7,8)
   y <- c(0,1,2,3,4,5,6,7,8)
# Set layout() for side by side plots
    layout(matrix(c(1,2), byrow=TRUE, ncol=2))
#  xaxs = “r” adds 4% over specified limits
    plot (y~x, xlim=c(0,10), ylim=c(0,10),
         xaxs=”r”, yaxs = “r”, las = 1,
         main = “Plot with ‘r’ axes placement”)
# The 4% expansion can be eliminated by specifying xaxs = “i” and yaxs = “i”
   plot (y~x, xlim=c(0,10), ylim = c(0,10),
      xaxs = “i”, yaxs = “i” ,  las = 1,
      main = “Plot with ‘i’ axes placement”)

Greek Letters, Equations and Numerical Values in Plots and strings in general

To get R to display mathematical expressions in plots (or in strings in general) use the expression(), bquote() or substitute() commands, which seem to be part of plotmath. There are some webpages with examples, and more symbols so search for these terms.

A simple example I had working is as follows

  subtitle=paste(substitute(alpha),"=",alpha,", ",substitute(gamma),"=",gamma)
  plot(x,y,main=substitute(y==Phi*alpha-sum(x^gamma)),type="l",subtitle=subtitle)
  text(20,30,substitute(Delta[z]==0.1))

It took me time to piece a complicated mixture of angle brackets, subscripts, mathematical expressions and numerical values. You can't always paste bits together as you can with normal strings. For instance the following work as labels in a plot

  graphKfitScale=1.0
  graphKfitShiftAbs=3.0
  xLabel = bquote(k/(.(graphKfitScale)*symbol("\341") * k[s] * symbol("\361")  - .(graphKfitShiftAbs) ) )
  yLabel = bquote(log(n(k))*p*( .(graphKfitScale)*symbol("\341") * k[s] * symbol("\361")  - .(graphKfitShiftAbs) ) )
This will substitute the values of the graphKfit parameters while the other symbols such as k are left unchanged. The whole is treated as an expression so the * symbols do not appear yet the angle brackets (the symbol commands) are left alone.

When values of variables appear in labels, leading and trailing zeros are usually suppressed. Sometimes this is not wanted so to control the conversion of numerical values to strings look at the ,tt>format function.

Dendrogram Plot Example

Here is an example of how to read in a table of distances and a list of labels for the sites (with a header) and to produce a dendrogram.
  rootName="aegean34";
  typeName="S1L3";
  namesFileName<-paste(rootName,"names.dat",sep="");
  distanceFileName<-paste(rootName,typeName,"dist.dat",sep="")
  nameList <- scan(namesFileName, what="", skip=1);
  distanceVector <- scan(distanceFileName,0);
  distanceMatrix <- matrix(distanceVector,nrow=34);
  distanceDist <- as.dist(distanceMatrix)
  hclustMethod <- c("single", "complete", "ward", "average", "mcquitty", "median", "centroid");
  methodName=hclustMethod[2]
  distanceHclust <- hclust(distanceDist, method=methodName)
  plclust(distanceHclust, labels=nameList, frame.plot=FALSE, hang=-10, xlab="", ylab="Distance", sub="Sites", main=paste(rootName,typeName," ",methodName,sep="") )

Histogram Barplot Example

Here we are plotting columns of data in file SiteHist.dat as horizontal bars. Note that though I'd call this a histogram one actually wants to use the barplot command. The las=1 is vital here to turn the labels around to be horizontal so each bar is clearly labelled.
  df <- read.table("SiteHist.dat", header=TRUE, sep="\t", fill=TRUE);
  wmax <- max(df$Weight)
  rmax <- max(df$Ranking)
  imax <- max(df$Influence)
  dataArray <- rbind(df$Weight,df$Ranking,df$Influence)
  barplot(ddd, beside=TRUE, names.arg=df$ShortName, horiz=TRUE, las=1)

Error bars

Look in the gplots library for confidence limits. plotci and barplot2 for instance have this.

Networks and Graphs

I have used the iGraph package. After install.packages("igraph") (I had to fix the location for libraries on my college PC, see R tips above) you load the library with library(igraph). Typical commands I use to get a figure of a graph are
  # convert flowMatrix N-by-N array to adjacency matrix
  minw=1/N # minw is a minimum threshold for weights
  adjMatrix <- ifelse(flowMatrix>=minw,1,0)
  flowgGraph <- graph.adjacency(adjMatrix, mode="upper", diag=FALSE)
  # make an N-by-2 matrix for the locations of vertices, here taken from a data frame of site input parameters - sitedf -
  # containing latitude and longitude of sites
  xvec <- sitedf$Long
  yvec <- sitedf$Lat
  vc <- array(c(xvec,yvec),dim=c(numberRows,2))
  # now make a vector from some data to produce vertices of different sizes, here from results vector
  maxInFlowVector <- max(siteresultsdf$inFlowVector)
  vertexSize <- siteresultsdf$inFlowVector*10/maxInFlowVector+1
  # How to colour the vertices using some membership vector - here generated by strong component membership
  flowgClusters <- clusters(flowg,mode="strong")
  colourListLong=c("black", "red", "blue", "green", "magenta", "brown","cyan","gold","darkred","darkblue", "darkgreen","yellow","hotpink","lightblue", "olivedrab");
  ncolours <-length(colourList)
  vertexColour <- colourList[ (flowgClusters$membership +1)%%ncolours +1]
  # now produce graphic with edges of different widths and named vertices of different sizes
  plot(flowg, layout=vc, vertex.label=sitedf$ShortName, vertex.size=vertexSize, vertex.color=vertexColour)
  # oops the edgewidth only works if we used the edge list method below
  #plot(flowg, edge.width=edgedf$weight*10, layout=vc, vertex.label=sitedf$ShortName, vertex.size=vertexSize)
Note how simple it is to specify the locations of the vertices, a two column matrix (three for 3D). I found this quite hard to find in the documentation. flowg <- graph.adjacency(adjMatrix, mode="upper", diag=FALSE) Also note that I had tried to create the graph from an edge list, which I generated from the adjacency matrix. This is how I did that but I think there can be issues with multiple edges.
  # convert flowMatrix N-by-N array to three column edge data frame edgedf
  # First create a N*N vector of weights, then do same for sources and targets.
  weight<- flowMatrix
  dim(weight)<-N*N
  #Internally the vertices are numbered from 0 but can use any unique numbering for each row and column
  source<-array(1:N,dim=dim(weight))
  target <- t(array(1:N,dim=c(N,N)))
  dim(target)=dim(source)
  # minw is a minimum threshold for weights
  minw=1/N
  edgedf <-data.frame(source=source[weight>minw], target=target[weight>minw], weight=weight[weight>minw])
  # create a graph structure from a data frame whose first two columns are the source then the target vertices.
  flowg <- graph.data.frame(edgedf, directed=T)
  # Trouble is multiple edges are allowed as this command will show
  count.multiple(flowg)

Finally note that there was a problem with fonts and poscript/pdf output. The solution is to load the fonts needed in the postscript (or pdf) command. The serif is the default needed to get this to work.

       postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10, fonts=c("serif", "Palatino"))
       plot(fixedRGraph,  layout=vc, vertex.label=dataList$sitedf$ShortName, vertex.color=vertexColour, vertex.size=vertexSize, vertex.label.dist=1)
For more details on plot poscript/pdf output, see examples above.

Ordering vectors

Use order as in the following example output
>xxx <- c(1:4)
>yyy <- c("one", "two", "three", "four")
>aorder <- order(yyy,xxx)
>aorder
[1] 4 1 3 2
> yyy[aorder]
[1] "four"  "one"   "three" "two"
> xxx[aorder]
[1] 4 1 3 2

Libraries

To install try install.packages("libraryname") where libraryname is the name of the library. Note that I had to fix the location for libraries on my college PC, see R tips above as it tried to put things on a networkk drive by default. You load the library with library(libraryname).