R is an open-source programming language principally used for statistical computations and data science. It has been created in the 1990's by two statisticians, Ross Ihaka and Robert Gentleman, from whose initials is derived the name of the language. It can be installed from https://www.r-project.org/ and is usually used by means of the integrated development environment (IDE) RStudio (https://rstudio.com/).

The R Studio environment

When you open the RStudio software, you will see 4 windows. The two on the left are the most important ones and represent where you write a script (top) and execute commands and scripts (bottom). On the top right are listed the current variables in the environment. On the bottom right is a window with several tabs, the one most often used being the "Plots" tab.

When executing commands requiring access to files on your computer (e.g. executing a script, reading or exporting a data file), this will be done relatively to the "path" where you are located (working directory). You can see where you are by typing:

In [ ]:
getwd() # get working directory

Create a new folder for the purpose of the session (either outside of RStudio or using the command dir.create("dir_name")) and navigate until the folder using setwd("path/to/file"). When using setwd(), use the Tab key to autocomplete the names of the folders and see list of possibilities. This feature of autocompletion is used constantly when coding.

First commands

Expressions and assignations

There are two types of commands (to be executed in the command window): expressions, such as execution of a function (or a script, see later)

In [ ]:
3+2
In [ ]:
3*2
In [ ]:
3^2
In [ ]:
log(1)
In [ ]:
exp(2)

and assignations

In [ ]:
x <- 2

or

In [ ]:
x = 2

In the latter we have assigned the value 2 to the variable x. We can then manipulate x in various expressions:

In [ ]:
2*x
In [ ]:
y <- x^2
In [ ]:
z <- x + y

This is used constantly in programming as we don't want to have to remember all the time the exact value of the variable to be manipulated. In addition, if the variable is for instance the value of some parameter of a model, we only have to change this value once to affect the entire code in a script.

When executing commands in the command window, a useful tip is to recall the previous command by typing ↑ ('up' arrow)

You can get help from the documentation about a given function by typing ?

In [ ]:
?log

Scripts

For reproductibility, saving and ease of manipulation, we usually do not code directly in the command window. Instead, we group the commands to be executed for a given purpose in a script. This is nothing else than lines of text written in the R language aimed at being called in the command window. A script can be written using any text editor but RStudio provides an integrated editor that is convenient to use.

Create your own first script with commands you choose (you can pick from above). Save it in the current working directory.

To execute your script you have three options:

  • run it from the command window by using the source function
In [ ]:
source('test.R')
  • run it from the editor window by clicking 'Source' in the top right
  • run it from the editor window using the keyboard shortcut + shift + S (Ctrl + shift + S on Windows)

During the writing of your script you can also execute the current line by typing +

Types of variables

In R, a variable can be of four basic types, which you can obtain using the typeof() function

numbers (called 'double' relative to the fact that numbers are stored in double precision)

In [ ]:
x <- 2
typeof(x)

character (also called a string)

In [ ]:
x <- "abc"
typeof(x)

logical (or boolean)

In [ ]:
x <- TRUE
typeof(x)

To test if two objects are equal, we use ==

In [ ]:
x <- 2 
x == 1

To test if they are different, !=

In [ ]:
x != 2
In [ ]:
!(x==2)

The operator ! is the NOT logical operation, it transforms FALSE to TRUE and conversely. Other important logical operations are AND (operator &) and OR (operator |)

In [ ]:
(x > 1) & (x <= 5)
In [ ]:
(x < 1) | (x <= 5)

list : lists are collection of objects, that don't have necessarily the same types. An important subtype of lists are dataframes (see below)

In [ ]:
x <- list(x=1, y="a", z=FALSE)
x

Vectors and matrices

Vectors

Several 'atomic' variables (i.e. containing only one element) of the same type can be combined (or concatenated) into a vector

In [ ]:
x <- c(2,3)
x

Long vectors (such as for use in plotting lines) can also be defined using the function seq()

Exercise: look at the help of the seq() function and build the vector $x=(0, 2, 4, 6, 8, 10)$

In [ ]:
x <- seq(from=0, to=10, by=2)

A basic property of a vector is its length

In [ ]:
length(x)

Elements of a vector can be accessed by their position

In [ ]:
x[3]

One can also extract several values by their positions

In [ ]:
x[c(1, 4)]

or in a given range

In [ ]:
x[2:4]

Values of a vector can be excluded using

In [ ]:
x[-4]
In [ ]:
x[-(2:4)]

When extracting data from a vector, a particularly useful feature is to use boolean extraction

In [ ]:
x[x>2]

Note that

In [ ]:
x>2
In [ ]:
x[(x>2) & (x<=8)]

Matrices

The basic function to define matrices organizes data (usually given as a vector) into a matrix

In [ ]:
M <- matrix(c(1,2,3,
              4,5,6
             ), nrow=2, ncol=3, byrow=TRUE)
M

However usually matrices are defined from pre-existing vectors using the functions cbind() (for concatenating by column) or rbind() (row concatenation)

In [ ]:
x <- c(1, 2, 3)
y <- c(4, 5, 6)
M_col <- cbind(x, y )
M_col
In [ ]:
M_row <- rbind(x, y)
M_row

Elements of matrices can be extracted by their indices

In [ ]:
M[2, 3]

To extract an entire row

In [ ]:
M[2,]

or entire column

In [ ]:
M[,2]

The extensions of length for a matrix are its dimensions (dim()), number of rows (nrow()) and number of columns (ncol())

In [ ]:
dim(M)
In [ ]:
nrow(M)
In [ ]:
ncol(M)

Operations on matrices and vectors

Addition

In [ ]:
x + y

Vector multiplication is defined term by term (although this leads to zero divisors)

In [ ]:
c(1, 0)*c(0, 1)

To distinguish term-by-term multiplication from matrix product, the latter is defined by %*%

In [ ]:
P <- cbind(c(-1, 0, 2), c(1, 1, 1))
P
In [ ]:
M%*%P

Exercise: Define the diagonal matrix $D=\begin{pmatrix} -2 & 0 \\ 0 & 3 \end{pmatrix}$. Can you multiply $M\cdot D$? Or $P \cdot D$? And $D\cdot M$? What is the effect of left or right multiplication by $D$? Try also with $D\cdot v$, $v=\begin{pmatrix}1 \\ 2\end{pmatrix}$

In [ ]:
D <- cbind(c(-2, 0), c(0, 3))
M%*%D
In [ ]:
P%*%D
In [ ]:
D%*%M
In [ ]:
v<-c(1,2)
D%*%v

Transposition of a matrix

In [ ]:
t(M)

The determinant of a square matrix

In [ ]:
Q <- cbind(c(-1, 2), c(3, 4))
det(Q)

Exercise: Without computing it first, what is the determinant of $D$?

Linear least-squares regression

In the following we will analyze pharmacokinetics data of Sunitinib in rats after short 1-min intra-venous infusion at dose 4 mg/kg. We will determine the elimination rate constant $k$ and volume of distribution $V$ under the assumption of one-compartmental first-order linear elimination kinetics:

$$ C(t) = \frac{D}{V}e^{-kt}. $$

This exponential model transforms to a linear model when taking the logarithm:

$$ \ln\left(C(t)\right) = \ln(D) - \ln(V) - kt $$

Exercise: Denoting $y_j = \ln\left(C_j \right)$ the observed concentrations and assuming a linear regression of $y$ on $t$:

$$ y_j = \theta_0 + \theta_1 t +\varepsilon, $$

what is the correspondence between $\theta = \left(\theta_0, \theta_1\right)$ and $\left(D, V, k\right)$?

Load data

The basic structure to store data in R is called a data frame. It is a two-dimensional array and is similar to a matrix, except that the columns can be of different types. They usually have names and correspond to the variables under study. The rows are for the observations (e.g. time index or subject number). Within a given column, all elements must have the same type.

In [ ]:
df <- data.frame(x=c(1,2), y=c("a", "b"))
df

Data can be read from a file into a data frame. The usual file format to store data is .csv (comma separated value) where columns are separated by commas or another separator (e.g ;, space or Tab, which can be specified in the sep parameter). The file can be located locally on your machine or distantly at an url. You have to specify whether the first line contains the name of the columns ("header=TRUE") or not ("header=FALSE"). Packages exist to read from excel files (e.g. readxl).

In [ ]:
url_data <- 'http://benzekry.perso.math.cnrs.fr/DONNEES/TP_R/linear_algebra/sunitinib_rat_iv_data_haznedar.csv'
df <- read.csv(url_data, header = TRUE)

To see the first lines of the dataframe

In [ ]:
head(df)

The columns can be accessed either by their index

In [ ]:
df[1]

their name

In [ ]:
df['conc']

or using the $ sign

In [ ]:
df$conc

The important difference between df['conc'] and df$conc is the type of what is returned. df['conc'] returns a dataframe containing only the column conc (the type of a dataframe is list)

In [ ]:
typeof(df['conc'])

whereas df$conc returns the vector of the actual values in the column conc

In [ ]:
typeof(df$conc)

This is important because it is not the same type of functions that can be applied to a vector or a dataframe.

Plot the data

Exercise Define variables t and conc containing the respective column values of time and concentrations. Define also a variable y containing the log (in base $e$) of the concentrations. Search on the Internet how to plot one variable against another and use it to plot conc as a function of t, both in linear and logarithmic scales. Add labels with units (hours for time, ng/mL for concentration)

In any programming language, it is impossible to entirely draw a continuous curve, since it contains an infinite number of points. Instead, we just draw a discretized version of the curve, i.e a large number of points on the curve linked together by straight lines. This is also the reason why you always have a finite number of pixels in an image, defining the resolution of the image.

Example: if we want to plot the function $x \mapsto x^2$ between $-1$ and $1$, we define an x_plot vector that discretizes the segment $[-1, 1]$ into a large number of points, say 1000 (this parameter is given in the length.out parameter of the seq() function):

In [ ]:
x_plot <- seq(from = -1, to=1, length.out=1000)

We define then the corresponding $y$ values, plot $y$ against $x$ and link the points by lines (type='l' in plot)

In [ ]:
y_plot <- x_plot^2
plot(x_plot, y_plot, type = 'l')

Exercise: explore the effect of discretizing with less values, say, 100, 10 and 5

Line going through two points

Exercise: Extract the first two (log) data points into two vectors t_2 and y_2.

Exercise: On paper and using $\theta_0$ and $\theta_1$ as unknowns write the system of two equations for the line $y =\theta_0 + \theta_1 t$ going through the first two points $(t_1, y_1)$ and $(t_2, y_2)$. Write it in matrix form.

Exercise: Define in R the matrix $M$ above and solve the system using the function solve()

Exercise: Plot on the same graph the two first (log) data points and the line in red. Tip: to add a line to an existing plot, use lines() after the first plot. The red color can be defined with the parameter col='red'

Plot the same line with all the data points

*Polynomial interpolation

Exercise:

  • Write the system of equations for the coefficients of the 3$^{\text{rd}}$ order polynomial passing through the first 4 data points, in matrix form: $$ y = \theta_0 + \theta_1 t + \theta_2 t^2 + \theta_3 t^3 $$
  • Solve the system, plot the four points together with the polynome.
  • Plot all points together with the polynome

Linear least-squares regression

Exercise:

  • Define the matrix $M$ corresponding to: $$ y = \theta_0 + \theta_1 t \Leftrightarrow y = M\cdot \theta $$ tip: use the rep() function to build a vector from repeating multiple elements.
  • Define the least squares matrix $A = M^{T}\cdot M$
  • Solve the least-squares problem (caution, the vector to use is not $y$)
  • Plot the least-squares regression line together with the data
  • Compute the distribution volume (in L/kg), elimination rate (in h$^{-1}$), clearance (in L/(kg.h)) and half-life (in hours) of the drug

Full data set

In reality, a late measurement has been omitted at $t=24$ hours in the previous data set. The full dataset is at the url:

In [ ]:
url_data_full <- 'http://benzekry.perso.math.cnrs.fr/DONNEES/TP_R/linear_algebra/sunitinib_rat_iv_data_haznedar_full.csv'
df_full <- read.csv(url_data_full, header = TRUE)

Re-perform the linear least-squares regression on this entire data set. What do you conclude about the validity of the model? What alternative model would you suggest?