The R statistics package is free, widely used, and is extremely powerful. It is a statistics package first so its analysis tools, e.g. non-linear curve fitting, is very good. It has extensive plotting abilities and can produce pdf and postscript output suitable for publications. It can also display networks, e.g. using iGraph. I have found it painful to learn as it does not work in the same way as normal programming languages (works more like Matlab) and the manuals are dense if comprehensive. The best way in is to take the time to read the introductory manual and then to find a good book of which there are several (see the R Wiki book list). On the other hand being open source there is masses of material on the web. A search starting with the terms R statistics followed by other relevant keywords should throw up what you need. There are GUI front ends for Windows etc. but most commands need to be entered on the command line. I tend to test the commands on the command line, copying what works into simple scripts so I can repeat this quickly in the future. Scripts are easy to run (or "source" them in R language).
Some places to look
HOME=c:/PRG/R/ R_LIBS=c:/PRG/R/win64-library/
s250 <- read.table("tabseparateddata.dat", header=TRUE, sep="\t", fill=TRUE);This reads in a table. The fill option means that it works even if there are more header items than columns, it just fills the extra lines with rubbish.
A useful check for problems with numbers of entries on lines is to use
count.fields("tabseparateddata.dat", header=TRUE, sep="\t");
The typical commands for output to text files are write.table for data frames and write for matrices. For instance
tfdf <- data.frame(source=sourceVector, target=targetvector) write.table(tfdf, file = "outputFile.dat", append = FALSE, sep = "\t", quote=FALSE, row.names = FALSE, col.names = TRUE)
cat("This is information\n") cat(paste("variable =", variable, "\n",sep=""))
cat can direct to the screen ( file = "") chooses the standard output usually the screen) or to a specified file
cat("This is information\n",file="output.txt") cat(paste("variable =", variable, "\n",sep=""),file="output.txt")However I also use a way of redirecting screen output, sink, as follows:-
sink(distanceTextFileName) # all screen sent to file, use split=TRUE to copy to both screen and file cat(paste("This is information", "\n",sep="")) cat(paste("variable =", variable, "\n",sep="")) sink()
Note if you use the print command (with a quote=FALSE option to ensure no quotes around strings in the output)
print("This is information", quote=FALSE)this produces a nasty number in square brackets at the start of the line that I could not get rid of.
paste("string 1"," and ","string 2",sep="")Note that the default for the separation string used in between each string is a single space so for true concatenation you must assign this to be the empty string as shown above if you don't want spaces.
Note paste will do conversions to. convert lists of numbers into strings for use in plot legends etc
EEE <- 1000; KKK <- 4.0; pdetitle <- paste("pde E=",EEE," K=",KKK,sep=" "); prlist = c(0.9999,0.1,1e-2,1e-3,1e-4); prliststring <- paste(prlist); legend (xxx,yyy, prliststring, col=colourlist[1:length(prlist)],lty=1:length(prlist),pch=1:length(prlist)
result=list() # create empty list for ( i in 1:6) { name <- list(filename=paste("filename",i,sep=""), dirname=paste("dir",i,sep="") ) # create list of names output <- list(value=i,sq=i*i,cube=i*i*i) # create list of results singlelist <- list(c(name,output)) # combine these into a single list #result <- c(result,singlelist) # make a new list of lists result[[i]] <- singlelist # create new entry in lest }This means that singlelist has four elements such as singlelist$sq the numerical value of the square of the value. result is now a list of length 6, each of whose entries is one of the singlelist, so each entry in the list result is a list of four objects. Thus result[[2]]$sq will be the square of result[[2]]$value .
This can be very frustrating, but this is often due to the inherent nature of non-linear fitting and not the fault of R. I find that playing around with the starting values and bounds for the parameters being fitted can often turn rubbish into beautiful results. Persevere is the watch word here.
Create a data frame containing the data, here with two columns labelled k and logp
df <- data.frame(k = dataf$k[2:kmax], logp = log(dataf$pk[2:kmax]));The parameters in this example are called Afit, gammafit and k0fit and one needs to specify lower, upper and starting values for these. Of course the starting values must be larger (smaller) than the corresponding value specified in the lower (upper) list. For instance I might use
startlist <- list(Afit=Aest, gammafit=gammaest, k0fit=k0est); lowerlist <- list(Afit=Alower, gammafit=gammalower, k0fit=k0lower) upperlist <- list(Afit=Aupper, gammafit=gammaupper, k0fit=k0upper )where Aest etc. all have numerical values. To fit see if the logp data fits form with a linear and log k terms, I use one of the following
fitres <- nls( logp ~ log(Afit)-gammafit*log(k) - k/k0fit , df, startlist, lower=lowerlist, upper=upperlist, algorithm="port") fitres <- nls( logp ~ log(Afit)-gammafit*log(k) - k/k0fit , df, startlist, lower=lowerlist, upper=upperlist)Note the first part is how you specify a formula. The names in the formula are either data or parameters. The data variable are those present in the data frame df, e.g. logp refers to the values in df$logp. Any names appearing in the in the upper, lower and starting value lists are treated as the parameters whose values are being sought, e.g. Afit in the formula appears in startlist so its value is being found. Anything else better have a fixed numerical value.
A summary is given on screen and the results are in fitres which is a "a type of fitted model object". You need to look this up to find out how to get the information out of it. Look up the nls function in the stats package, for instance as discussed in the reference manual. For instance coef(fitres) gives the coefficients of the fit, summary(fitres) gives a summary.
Don't forget that non-linear fitting is not guaranteed to converge and you may fail to get an answer at first. You may need to think about what sort of values are likely, play around with start/lower/upper values or whatever to get it to work. If all else fails try to get it working with a trivial example.
One idea is to use the package gplots which you get get installed from the GUI package menu.
One use for a list of objects is when you have run the same routine on different data sets and you want to plot the results on the same plot. Here ResultsList is a list of lists where the sublist ResultsList[[iii]]] has one set of data and a fit (for instance the output from nls,m see below) to that data as its entries. The trick is to open a new window then to plot an example with null data where you specify the features of the plot (log axes, titles etc). In particular here you must set the range of the axes by hand, e.g. by calculating them for yourself as illustrated by the first three lines below. You then add more data points or lines as required. Also note that this code generates different point symbols, line types and colours, the latter are a subset of those known to the system which can be found by typing colors()
#First calculate the minimum value of y axis. yMin<-ResultsList[[1]]$yMin; for (iii in 2:length(ResultsList)) yMin<-min(yMin, ResultsList[[iii]]$yMin) # Short list of allowed colours suitable for display or printing colourlist=c("black", "red", "blue", "green", "magenta", "brown"); # Now plot on screen windows() plot(x=NULL, y=NULL, type="n", log="xy", xlim=c(xMin,xMax) , ylim=c(yMin,yMax) , main="Main Title", sub="(subtitle)", xlab="x axis name", ylab="y axis name") for (iii in 1:length(ResultsList)) { points(ResultsList[[iii]]$y, ResultsList[[iii]]$x, pch=iii, col=colourlist[iii]) lines(ResultsList[[iii]]$y, fitted(ResultsList[[iii]]$fit), col=colourlist[iii], lty=iii) } # Finally add legend legend (x="topright",y=NULL, nameArray[1:length(ResultsList)], col=colourlist[1:length(ResultsList)],lty=1:length(ResultsList),pch=1:length(ResultsList));Note that on a MAC you need quartz() to get displays in a window, the window() is for MS Windows platforms so I often define a boolean OSWindows and use if (OSWindows) windows() else quartz(). One could probably access some machine environment variable to do this too.
To get output to a file or a new window on a screen you use device drivers. These may sound scary but there is just one for every different type of file or output device and by default the appropriate screen is set up so most of the time you don"worry about these. For instance postscript() or pdf() with extra arguments should be used to produce files of each of these types. Note that for the former you must close the device (using dev.off(which = dev.cur()) or similar) or you get empty output. Thats a good idea anyway for all `devices' except for screen output though many types of device (file/screens) don't seem to care. Also for eps (best for inclusion in documents) not ps output you must include the onefile=FALSE option.
rootName="IC090521" hepsFileName<-paste(rootName,"yearhist.ps",sep="_") print(paste("ps plotting",hepsFileName), quote=FALSE) postscript(hepsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10) hist(yearList) dev.off(which = dev.cur())
Here is a simple example of how to alter the size and colours of the points or lines. It also shows how to set up a simple routine to do plots. A simple copying of a single line calling this routine can be used to reproduce the same plot for different outputs such as screen, eps file, pdf file. Also shown are labels for the curves.
#Next three lines are altered as needed screenOn=TRUE epsOn=FALSE pdfOn=FALSE pngOn=FALSE OSWindows=TRUE fileNameFullRoot=paste("figure",sep="") colourlist=c("black", "red", "blue", "green", "magenta", "brown"); plotFig<-function(){ # next two control size of text and lines cexvalue=2 lwdvalue=2 xmin=0 xmax=2.0 # plots cross at this point and at (0,1) xcross=0.5 ycross=0.5 nvalues=100 # number of xvalues to use xvec = seq(from=xmin, to=xmax, by=(xmax-xmin)/nvalues) # plot exponential e^(-gamma*x) gammavalue = -(1.0/xcross)*log(ycross) lll=1 ## dummy variable, helps if you later insert or remove some of the lines on the plot yvec=exp(-gammavalue*xvec) plot(xvec, yvec, ylim=c(-0.01,1.01), type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue) xlabelpos=1.0 text(xlabelpos,yvec[xvec==xlabelpos],labels=c("exponential"), col=colourlist[lll], offset = 2, pos=1, cex=cexvalue) # pos argument is 1, 2, 3 and 4, for below, to the left of, above and to the right of the specified coordinates. # plot power law 1/(x+x0)^alpha alphavalue = 2 x0value = xcross/( (ycross^(-1.0/alphavalue))-1) lll=lll+1 yvec=(x0value/(xvec+x0value))^alphavalue lines(xvec, yvec, type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue) text(xlabelpos,yvec[xvec==xlabelpos],labels=c("power law"), col=colourlist[lll], offset = 2, pos=3, cex=cexvalue) # plot theta function lll=lll+1 yvec=rep(1,times=length(xvec)) yvec[xvec>xcross]=0 lines(xvec, yvec, type="l", col=colourlist[lll], lty=lll, cex=cexvalue, lwd=lwdvalue) text(xcross,0.8,labels=c("sharp cut off"), col=colourlist[lll], offset = 0.5, pos=4, cex=cexvalue) } if (screenOn){ if (OSWindows) windows() else quartz() plotFig() } # EPS plot if (epsOn){ epsFileName<- paste(fileNameFullRoot,".eps",sep="") print(paste("eps plotting",epsFileName), quote=FALSE) postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=10, width=10, pointsize=20) plotFig() dev.off(which = dev.cur()) } # PDF plot if (pdfOn){ pdfFileName<- paste(fileNameFullRoot,".pdf",sep="") print(paste("pdf plotting",pdfFileName), quote=FALSE) pdf(pdfFileName, onefile=FALSE, height=10, width=10, pointsize=20, units="cm") plotFig() dev.off(which = dev.cur()) } # PNG plot if (pngOn){ dpivalue=1200 pngFileName<- paste(fileNameFullRoot,dpi,".png",sep="") print(paste("png plotting",pngFileName), quote=FALSE) png(pngFileName, height=10, width=10, pointsize=20, units="cm", res=dpivalue) plotFig() dev.off(which = dev.cur()) }
plotDataTableLines<-function(dataTable1,dataTable2, legendString1,legendString2){ ltyList=c(2,3); plot(dataTable1$gamma, dataTable1$Communities, ylim=c(0,12), xlim=c(0,2.5), main=NULL, xlab=expression(gamma), ylab="Number Communities", type="l", col=colourlist[2], lty=ltyList[1], lwd=2) lines(dataTable2$gamma,dataTable2$Communities, col=colourlist[3], lty=ltyList[2], lwd=2) legend(x="bottomright" ,y=NULL, c(legendString1,legendString2), lty=ltyList, lwd=c(4,4), col=colourlist[2:3]); }So then I would call this as
#Next three lines are altered as needed screenOn=TRUE epsOn=TRUE pdfOn=TRUE pngOn=TRUE OSWindows=TRUE if (screenOn){ if (OSWindows) windows() else quartz() plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4) } # EPS plot if (epsOn){ epsFileName<- paste(fileNameFullRoot,".eps",sep="") print(paste("eps plotting",epsFileName), quote=FALSE) postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10) plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4) dev.off(which = dev.cur()) } # PDF plot if (pdfOn){ pdfFileName<- paste(fileNameFullRoot,".pdf",sep="") print(paste("pdf plotting",pdfFileName), quote=FALSE) pdf(pdfFileName, onefile=FALSE, height=6, width=6, pointsize=10) plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4) dev.off(which = dev.cur()) } # PNG plot if (pngOn){ pngFileName<- paste(fileNameFullRoot,".png",sep="") print(paste("png plotting",pngFileName), quote=FALSE) png(pngFileName, onefile=FALSE, height=6, width=6, pointsize=20) plotDataTable(gnct2,gnct4,typeStringt2,typeStringt4) dev.off(which = dev.cur()) }
A useful function to increase the scale of the plot e.g. to leave room for text labels
plotrange<-function(coordinates,factor=0.05){ cmin=min(coordinates) cmax=max(coordinates) crange=cmax-cmin cextra=crange*factor c(cmin-cextra,cmax+cextra) }I call this as
plot(xvalues,yvalues,xlim=plotrange(xvalues,0.05),ylim=plotrange(yvalues,0.05))which gives an five percent whitespace border.
This is the old way I used to combine several data sets and lines on one plot. The options in the plot command illustrate several of the useful options. Here s010 and the other parts of SimonDataList are data frames formed using a read.table. The double square bracket is confusing and comes from the default name given to entries in list. The FitList is a list of fits coming out of some call to nls or similar fitting command. ifirst and ilast are there just for flexibility.
colourlist=c("black", "red", "blue", "green", "magenta", "brown"); SimonDataList <- list(s010 ,s250, s750, s950, s990 , s999); ifirst=1 ilast=length(SimonDataList) plot(SimonDataList[[ifirst]], type="p", log="xy" , col=colourlist[ifirst], pch=ifirst, xlim=c(kmin,kmax), ylim=c(1e-9,1), xlab="degree k", ylab="p(k)", main="Laird and Jensen EPL 2006 Fig 1 (ll, ac fit)") for ( i in (ifirst+1):ilast) points(SimonDataList[[i]], col=colourlist[i], pch = i); for ( i in ifirst:ilast) lines(SimonDataList[[i]]$k, exp(fitted(FitList[[i]])), col=colourlist[i]);
# Here are two simple x and y vectors to be plotted x <- c(0,1,2,3,4,5,6,7,8) y <- c(0,1,2,3,4,5,6,7,8) # Set layout() for side by side plots layout(matrix(c(1,2), byrow=TRUE, ncol=2)) # xaxs = “r” adds 4% over specified limits plot (y~x, xlim=c(0,10), ylim=c(0,10), xaxs=”r”, yaxs = “r”, las = 1, main = “Plot with ‘r’ axes placement”) # The 4% expansion can be eliminated by specifying xaxs = “i” and yaxs = “i” plot (y~x, xlim=c(0,10), ylim = c(0,10), xaxs = “i”, yaxs = “i” , las = 1, main = “Plot with ‘i’ axes placement”)
To get R to display mathematical expressions in plots (or in strings in general) use the expression(), bquote() or substitute() commands, which seem to be part of plotmath. There are some webpages with examples, and more symbols so search for these terms.
A simple example I had working is as follows
subtitle=paste(substitute(alpha),"=",alpha,", ",substitute(gamma),"=",gamma) plot(x,y,main=substitute(y==Phi*alpha-sum(x^gamma)),type="l",subtitle=subtitle) text(20,30,substitute(Delta[z]==0.1))
It took me time to piece a complicated mixture of angle brackets, subscripts, mathematical expressions and numerical values. You can't always paste bits together as you can with normal strings. For instance the following work as labels in a plot
graphKfitScale=1.0 graphKfitShiftAbs=3.0 xLabel = bquote(k/(.(graphKfitScale)*symbol("\341") * k[s] * symbol("\361") - .(graphKfitShiftAbs) ) ) yLabel = bquote(log(n(k))*p*( .(graphKfitScale)*symbol("\341") * k[s] * symbol("\361") - .(graphKfitShiftAbs) ) )This will substitute the values of the graphKfit parameters while the other symbols such as k are left unchanged. The whole is treated as an expression so the * symbols do not appear yet the angle brackets (the symbol commands) are left alone.
When values of variables appear in labels, leading and trailing zeros are usually suppressed. Sometimes this is not wanted so to control the conversion of numerical values to strings look at the ,tt>format function.
rootName="aegean34"; typeName="S1L3"; namesFileName<-paste(rootName,"names.dat",sep=""); distanceFileName<-paste(rootName,typeName,"dist.dat",sep="") nameList <- scan(namesFileName, what="", skip=1); distanceVector <- scan(distanceFileName,0); distanceMatrix <- matrix(distanceVector,nrow=34); distanceDist <- as.dist(distanceMatrix) hclustMethod <- c("single", "complete", "ward", "average", "mcquitty", "median", "centroid"); methodName=hclustMethod[2] distanceHclust <- hclust(distanceDist, method=methodName) plclust(distanceHclust, labels=nameList, frame.plot=FALSE, hang=-10, xlab="", ylab="Distance", sub="Sites", main=paste(rootName,typeName," ",methodName,sep="") )
df <- read.table("SiteHist.dat", header=TRUE, sep="\t", fill=TRUE); wmax <- max(df$Weight) rmax <- max(df$Ranking) imax <- max(df$Influence) dataArray <- rbind(df$Weight,df$Ranking,df$Influence) barplot(ddd, beside=TRUE, names.arg=df$ShortName, horiz=TRUE, las=1)
# convert flowMatrix N-by-N array to adjacency matrix minw=1/N # minw is a minimum threshold for weights adjMatrix <- ifelse(flowMatrix>=minw,1,0) flowgGraph <- graph.adjacency(adjMatrix, mode="upper", diag=FALSE) # make an N-by-2 matrix for the locations of vertices, here taken from a data frame of site input parameters - sitedf - # containing latitude and longitude of sites xvec <- sitedf$Long yvec <- sitedf$Lat vc <- array(c(xvec,yvec),dim=c(numberRows,2)) # now make a vector from some data to produce vertices of different sizes, here from results vector maxInFlowVector <- max(siteresultsdf$inFlowVector) vertexSize <- siteresultsdf$inFlowVector*10/maxInFlowVector+1 # How to colour the vertices using some membership vector - here generated by strong component membership flowgClusters <- clusters(flowg,mode="strong") colourListLong=c("black", "red", "blue", "green", "magenta", "brown","cyan","gold","darkred","darkblue", "darkgreen","yellow","hotpink","lightblue", "olivedrab"); ncolours <-length(colourList) vertexColour <- colourList[ (flowgClusters$membership +1)%%ncolours +1] # now produce graphic with edges of different widths and named vertices of different sizes plot(flowg, layout=vc, vertex.label=sitedf$ShortName, vertex.size=vertexSize, vertex.color=vertexColour) # oops the edgewidth only works if we used the edge list method below #plot(flowg, edge.width=edgedf$weight*10, layout=vc, vertex.label=sitedf$ShortName, vertex.size=vertexSize)Note how simple it is to specify the locations of the vertices, a two column matrix (three for 3D). I found this quite hard to find in the documentation. flowg <- graph.adjacency(adjMatrix, mode="upper", diag=FALSE) Also note that I had tried to create the graph from an edge list, which I generated from the adjacency matrix. This is how I did that but I think there can be issues with multiple edges.
# convert flowMatrix N-by-N array to three column edge data frame edgedf # First create a N*N vector of weights, then do same for sources and targets. weight<- flowMatrix dim(weight)<-N*N #Internally the vertices are numbered from 0 but can use any unique numbering for each row and column source<-array(1:N,dim=dim(weight)) target <- t(array(1:N,dim=c(N,N))) dim(target)=dim(source) # minw is a minimum threshold for weights minw=1/N edgedf <-data.frame(source=source[weight>minw], target=target[weight>minw], weight=weight[weight>minw]) # create a graph structure from a data frame whose first two columns are the source then the target vertices. flowg <- graph.data.frame(edgedf, directed=T) # Trouble is multiple edges are allowed as this command will show count.multiple(flowg)
Finally note that there was a problem with fonts and poscript/pdf output. The solution is to load the fonts needed in the postscript (or pdf) command. The serif is the default needed to get this to work.
postscript(epsFileName, horizontal=FALSE, onefile=FALSE, height=6, width=6, pointsize=10, fonts=c("serif", "Palatino")) plot(fixedRGraph, layout=vc, vertex.label=dataList$sitedf$ShortName, vertex.color=vertexColour, vertex.size=vertexSize, vertex.label.dist=1)For more details on plot poscript/pdf output, see examples above.
>xxx <- c(1:4) >yyy <- c("one", "two", "three", "four") >aorder <- order(yyy,xxx) >aorder [1] 4 1 3 2 > yyy[aorder] [1] "four" "one" "three" "two" > xxx[aorder] [1] 4 1 3 2