`R`

is an open-source programming language principally used for statistical computations and data science. It has been created in the 1990's by two statisticians, Ross Ihaka and Robert Gentleman, from whose initials is derived the name of the language. It can be installed from https://www.r-project.org/ and is usually used by means of the integrated development environment (IDE) `RStudio`

(https://rstudio.com/).

When you open the `RStudio`

software, you will see 4 windows. The two on the left are the most important ones and represent where you write a script (top) and execute commands and scripts (bottom). On the top right are listed the current variables in the environment. On the bottom right is a window with several tabs, the one most often used being the "Plots" tab.

When executing commands requiring access to files on your computer (e.g. executing a script, reading or exporting a data file), this will be done relatively to the "path" where you are located (working directory). You can see where you are by typing:

In [ ]:

```
getwd() # get working directory
```

Create a new folder for the purpose of the session (either outside of `RStudio`

or using the command `dir.create("dir_name")`

) and navigate until the folder using setwd("path/to/file"). When using setwd(), use the `Tab`

key to autocomplete the names of the folders and see list of possibilities. This feature of *autocompletion* is used constantly when coding.

There are two types of commands (to be executed in the command window):
expressions, such as execution of a `function`

(or a `script`

, see later)

In [ ]:

```
3+2
```

In [ ]:

```
3*2
```

In [ ]:

```
3^2
```

In [ ]:

```
log(1)
```

In [ ]:

```
exp(2)
```

and assignations

In [ ]:

```
x <- 2
```

or

In [ ]:

```
x = 2
```

In the latter we have assigned the *value* `2`

to the *variable* `x`

. We can then manipulate `x`

in various expressions:

In [ ]:

```
2*x
```

In [ ]:

```
y <- x^2
```

In [ ]:

```
z <- x + y
```

This is used constantly in programming as we don't want to have to remember all the time the exact value of the variable to be manipulated. In addition, if the variable is for instance the value of some parameter of a model, we only have to change this value once to affect the entire code in a script.

When executing commands in the command window, a useful tip is to recall the previous command by typing ↑ ('up' arrow)

You can get help from the documentation about a given function by typing `?`

In [ ]:

```
?log
```

For reproductibility, saving and ease of manipulation, we usually do not code directly in the command window. Instead, we group the commands to be executed for a given purpose in a `script`

. This is nothing else than lines of text written in the `R`

language aimed at being called in the command window. A script can be written using any text editor but `RStudio`

provides an integrated editor that is convenient to use.

Create your own first script with commands you choose (you can pick from above). Save it in the current working directory.

To execute your script you have three options:

- run it from the command window by using the
`source`

function

In [ ]:

```
source('test.R')
```

- run it from the editor window by clicking 'Source' in the top right
- run it from the editor window using the keyboard shortcut
`⌘`+`shift`+ S (`Ctrl`+`shift`+ S on Windows)

During the writing of your script you can also execute the current line by typing `⌘` + `⏎`

In `R`

, a variable can be of four basic types, which you can obtain using the `typeof()`

function

**numbers** (called 'double' relative to the fact that numbers are stored in double precision)

In [ ]:

```
x <- 2
typeof(x)
```

**character** (also called a string)

In [ ]:

```
x <- "abc"
typeof(x)
```

**logical** (or boolean)

In [ ]:

```
x <- TRUE
typeof(x)
```

To test if two objects are equal, we use `==`

In [ ]:

```
x <- 2
x == 1
```

To test if they are different, `!=`

In [ ]:

```
x != 2
```

In [ ]:

```
!(x==2)
```

The operator `!`

is the `NOT`

logical operation, it transforms `FALSE`

to `TRUE`

and conversely. Other important logical operations are `AND`

(operator `&`

) and `OR`

(operator `|`

)

In [ ]:

```
(x > 1) & (x <= 5)
```

In [ ]:

```
(x < 1) | (x <= 5)
```

**list** : lists are collection of objects, that don't have necessarily the same types. An important subtype of lists are dataframes (see below)

In [ ]:

```
x <- list(x=1, y="a", z=FALSE)
x
```

Several 'atomic' variables (i.e. containing only one element) of the *same type* can be combined (or concatenated) into a **vector**

In [ ]:

```
x <- c(2,3)
x
```

Long vectors (such as for use in plotting lines) can also be defined using the function `seq()`

*Exercise*: look at the help of the `seq()`

function and build the vector $x=(0, 2, 4, 6, 8, 10)$

In [ ]:

```
x <- seq(from=0, to=10, by=2)
```

A basic property of a vector is its length

In [ ]:

```
length(x)
```

Elements of a vector can be accessed by their position

In [ ]:

```
x[3]
```

One can also extract several values by their positions

In [ ]:

```
x[c(1, 4)]
```

or in a given range

In [ ]:

```
x[2:4]
```

Values of a vector can be excluded using

In [ ]:

```
x[-4]
```

In [ ]:

```
x[-(2:4)]
```

When extracting data from a vector, a particularly useful feature is to use boolean extraction

In [ ]:

```
x[x>2]
```

Note that

In [ ]:

```
x>2
```

In [ ]:

```
x[(x>2) & (x<=8)]
```

The basic function to define matrices organizes data (usually given as a vector) into a matrix

In [ ]:

```
M <- matrix(c(1,2,3,
4,5,6
), nrow=2, ncol=3, byrow=TRUE)
M
```

However usually matrices are defined from pre-existing vectors using the functions `cbind()`

(for concatenating by column) or `rbind()`

(row concatenation)

In [ ]:

```
x <- c(1, 2, 3)
y <- c(4, 5, 6)
M_col <- cbind(x, y )
M_col
```

In [ ]:

```
M_row <- rbind(x, y)
M_row
```

Elements of matrices can be extracted by their indices

In [ ]:

```
M[2, 3]
```

To extract an entire row

In [ ]:

```
M[2,]
```

or entire column

In [ ]:

```
M[,2]
```

The extensions of length for a matrix are its dimensions (`dim()`

), number of rows (`nrow()`

) and number of columns (`ncol()`

)

In [ ]:

```
dim(M)
```

In [ ]:

```
nrow(M)
```

In [ ]:

```
ncol(M)
```

Addition

In [ ]:

```
x + y
```

Vector multiplication is defined term by term (although this leads to zero divisors)

In [ ]:

```
c(1, 0)*c(0, 1)
```

To distinguish term-by-term multiplication from matrix product, the latter is defined by `%*%`

In [ ]:

```
P <- cbind(c(-1, 0, 2), c(1, 1, 1))
P
```

In [ ]:

```
M%*%P
```

*Exercise*: Define the diagonal matrix $D=\begin{pmatrix} -2 & 0 \\ 0 & 3 \end{pmatrix}$. Can you multiply $M\cdot D$? Or $P \cdot D$? And $D\cdot M$? What is the effect of left or right multiplication by $D$? Try also with $D\cdot v$, $v=\begin{pmatrix}1 \\ 2\end{pmatrix}$

In [ ]:

```
D <- cbind(c(-2, 0), c(0, 3))
M%*%D
```

In [ ]:

```
P%*%D
```

In [ ]:

```
D%*%M
```

In [ ]:

```
v<-c(1,2)
D%*%v
```

Transposition of a matrix

In [ ]:

```
t(M)
```

The determinant of a *square* matrix

In [ ]:

```
Q <- cbind(c(-1, 2), c(3, 4))
det(Q)
```

*Exercise*: Without computing it first, what is the determinant of $D$?

In the following we will analyze pharmacokinetics data of Sunitinib in rats after short 1-min intra-venous infusion at dose 4 mg/kg. We will determine the elimination rate constant $k$ and volume of distribution $V$ under the assumption of one-compartmental first-order linear elimination kinetics:

$$ C(t) = \frac{D}{V}e^{-kt}. $$This exponential model transforms to a linear model when taking the logarithm:

$$ \ln\left(C(t)\right) = \ln(D) - \ln(V) - kt $$*Exercise*: Denoting $y_j = \ln\left(C_j \right)$ the observed concentrations and assuming a linear regression of $y$ on $t$:

what is the correspondence between $\theta = \left(\theta_0, \theta_1\right)$ and $\left(D, V, k\right)$?

The basic structure to store data in `R`

is called a `data frame`

. It is a two-dimensional array and is similar to a matrix, except that the columns can be of different types. They usually have names and correspond to the variables under study. The rows are for the observations (e.g. time index or subject number). Within a given column, all elements must have the same type.

In [ ]:

```
df <- data.frame(x=c(1,2), y=c("a", "b"))
df
```

Data can be read from a file into a data frame. The usual file format to store data is `.csv`

(comma separated value) where columns are separated by commas or another separator (e.g `;`

, space or Tab, which can be specified in the `sep`

parameter). The file can be located locally on your machine or distantly at an url. You have to specify whether the first line contains the name of the columns (`"header=TRUE"`

) or not (`"header=FALSE"`

). Packages exist to read from excel files (e.g. `readxl`

).

In [ ]:

```
url_data <- 'http://benzekry.perso.math.cnrs.fr/DONNEES/TP_R/linear_algebra/sunitinib_rat_iv_data_haznedar.csv'
df <- read.csv(url_data, header = TRUE)
```

To see the first lines of the dataframe

In [ ]:

```
head(df)
```

The columns can be accessed either by their index

In [ ]:

```
df[1]
```

their name

In [ ]:

```
df['conc']
```

or using the `$`

sign

In [ ]:

```
df$conc
```

The important difference between `df['conc']`

and `df$conc`

is the *type* of what is returned. `df['conc']`

returns a *dataframe* containing only the column `conc`

(the type of a dataframe is `list`

)

In [ ]:

```
typeof(df['conc'])
```

whereas `df$conc`

returns the vector of the actual values in the column `conc`

In [ ]:

```
typeof(df$conc)
```

This is important because it is not the same type of functions that can be applied to a vector or a dataframe.

*Exercise* Define variables `t`

and `conc`

containing the respective column values of time and concentrations. Define also a variable `y`

containing the log (in base $e$) of the concentrations. Search on the Internet how to plot one variable against another and use it to plot `conc`

as a function of `t`

, both in linear and logarithmic scales. Add labels with units (hours for time, ng/mL for concentration)

In any programming language, it is impossible to entirely draw a *continuous* curve, since it contains an infinite number of points. Instead, we just draw a *discretized* version of the curve, i.e a large number of points on the curve linked together by straight lines. This is also the reason why you always have a finite number of pixels in an image, defining the resolution of the image.

Example: if we want to plot the function $x \mapsto x^2$ between $-1$ and $1$, we define an `x_plot`

vector that *discretizes* the segment $[-1, 1]$ into a large number of points, say 1000 (this parameter is given in the `length.out`

parameter of the `seq()`

function):

In [ ]:

```
x_plot <- seq(from = -1, to=1, length.out=1000)
```

We define then the corresponding $y$ values, plot $y$ against $x$ and link the points by lines (`type='l'`

in `plot`

)

In [ ]:

```
y_plot <- x_plot^2
plot(x_plot, y_plot, type = 'l')
```

*Exercise*: explore the effect of discretizing with less values, say, 100, 10 and 5

*Exercise*: Extract the first two (log) data points into two vectors `t_2`

and `y_2`

.

*Exercise*: On paper and using $\theta_0$ and $\theta_1$ as unknowns write the system of two equations for the line $y =\theta_0 + \theta_1 t$ going through the first two points $(t_1, y_1)$ and $(t_2, y_2)$. Write it in matrix form.

*Exercise*: Define in `R`

the matrix $M$ above and solve the system using the function `solve()`

*Exercise*: Plot on the same graph the two first (log) data points and the line in red. Tip: to add a line to an existing plot, use `lines()`

after the first plot. The red color can be defined with the parameter `col='red'`

Plot the same line with all the data points

*Exercise*:

- Write the system of equations for the coefficients of the 3$^{\text{rd}}$ order polynomial passing through the first 4 data points, in matrix form: $$ y = \theta_0 + \theta_1 t + \theta_2 t^2 + \theta_3 t^3 $$
- Solve the system, plot the four points together with the polynome.
- Plot all points together with the polynome

*Exercise*:

- Define the matrix $M$ corresponding to:
$$
y = \theta_0 + \theta_1 t \Leftrightarrow y = M\cdot \theta
$$
tip: use the
`rep()`

function to build a vector from repeating multiple elements. - Define the least squares matrix $A = M^{T}\cdot M$
- Solve the least-squares problem (caution, the vector to use is not $y$)
- Plot the least-squares regression line together with the data
- Compute the distribution volume (in L/kg), elimination rate (in h$^{-1}$), clearance (in L/(kg.h)) and half-life (in hours) of the drug

In reality, a late measurement has been omitted at $t=24$ hours in the previous data set. The full dataset is at the url:

In [ ]:

```
url_data_full <- 'http://benzekry.perso.math.cnrs.fr/DONNEES/TP_R/linear_algebra/sunitinib_rat_iv_data_haznedar_full.csv'
df_full <- read.csv(url_data_full, header = TRUE)
```

Re-perform the linear least-squares regression on this entire data set. What do you conclude about the validity of the model? What alternative model would you suggest?