The essentials
Numerical approximations
Approximating \(\pi\)
What do you mean?
Pseudorandom numbers, logistic map, and chaos!?
Simulating ecology
Ecological communities
Species interactions & coexistence
Note: This workshop is about simulations, not programming. Don’t worry about the details; rather, focus on building intuition and a mental model.
There are only a couple things you will need to run a simulation: 1) generate a sequence of numbers, 2) have some way to store data, 3) be able to manipulate that data, and 4) have imagination 😄
Before any simulation can happen, we need to generate some numbers or a range of numbers. Think of it as the “spacetime” or canvas on which the reality of your simulation unfolds. Oftentimes this is a sequence of ordered integers or a random sequence of numbers drawn from some probability distribution.
These are the commonly used functions that you can use to generate such numbers in R:
Ordered sequence: seq()
Random sequence
Uniform distribution: runif()
Normal distribution: rnorm()
Examples
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
runif(n = 5, min = 0, max = 1)
## [1] 0.5581363 0.4001734 0.3617694 0.5873508 0.2044456
rnorm(n = 5, mean = 5, sd = 1)
## [1] 5.681982 4.245183 3.589591 4.480622 4.185396
Once we have generated some numbers and also when we run the actual simulation, we need to have some way to store the data.
In R, the commonly used data structures are:
Vectors - a sequence of items of the same type (integers, numerics, characters)
vec <- c(1, 2, 3)
vec
## [1] 1 2 3
Matrices or Arrays - generalization of vectors in multiple dimensions
mat <- matrix(data = 1:9, nrow = 3, ncol = 3)
mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Data frames - a collection of variables organized in a table. Each column is a variable and each row is an observation/datum
df <- data.frame(varA = c(1, 5, 10), varB = c('blue', 'red', 'green'))
df
## varA varB
## 1 1 blue
## 2 5 red
## 3 10 green
Lists - the “meta” data structure; can store objects of any type. Can be a list of lists, a list of vectors, a list of data frames, or a list of different combinations of data structures!
list(vec, mat, df)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## [[3]]
## varA varB
## 1 1 blue
## 2 5 red
## 3 10 green
In general, loops are used to do repetitive tasks. Although R has ways of doing repetitive tasks without loops (via vectorized operations), loops are particularly useful for iterating functions that are time-dependent. If your simulation has some time-dependent component, use loops! The two types of loops are for loops and while loops. Sometimes either approach can work, but the main difference between them is that with for loops you usually know a priori how many iterations your program will run. Sometimes, however, you don’t know the number of iterations in advance; instead, you have some condition or criterion that determines when the program stops (e.g. the accuracy of an approximation).
For loops - iterate (via an index i) through some sequence and apply some function at each iteration
for (i in 1:length(vec)){
  print(vec[i])
}
While loops - while a condition is true, keep iterating
val <- 1
while (val < 10){
  #some function
  print(val)
  #update test variable
  val <- val + 1
}
Some inspiration from the great explainer
“Our imagination is stretched to the utmost, not, as in fiction, to imagine things which are not really there, but just to comprehend those things which are there.” -Richard Feynman
Sometimes an analytical solution is hard to derive or is simply not available. One way to get around this is to approximate a solution using what we already know how to do.
Believe it or not, mathematics used to be an experimental science. Before the concepts of infinity and calculus were a thing, people had no way of dealing with curves. But curves are everywhere! To estimate the circumference of a circle, the Greek polymath Archimedes started by fitting a hexagon (a shape with straight lines!) inside a circle. Since he knew that the distance from each corner of the polygon to the center of the circle is equal to the radius (r), he could apply the Pythagorean theorem to get the distances between those corners. Summing up those distances then gives an approximation of the circle’s circumference. But a hexagon is a crude approximation. So Archimedes kept adding more edges to the polygon, and with each added edge, the approximation got closer and closer to the actual value.
Instead of drawing polygons and solving the Pythagorean equation by hand(!), we can now have computers do it for us!
Suppose we have a unit circle enclosed within a unit square, like so:
The value of \(\pi\) is equal to 4 times the ratio between the circle’s area and the square’s area:
\[ \pi = 4 * \frac{Area_{circle}}{Area_{square}} \]
Obviously, anyone who’s taken elementary geometry would tell you, “but we need \(\pi\) to get the area of the circle!” Instead of using equations, we can fill up the space that makes up the shapes with quantities we already know and use those to approximate the areas, much like estimating the volume of a vase by adding water from a graduated cylinder. Instead of water, we can fill up that space with data points on a computer. For example, we can use the seq() function to fill up the space between 0 and 1, and the number of points within that range gives us some quantitative information about the space. Since the circle is smaller than the square, some points will lie within the circle while others lie outside. The numbers of points that lie within and outside of the circle therefore give us an approximation of the ratio between the area of the circle and the square. The more points we use, the closer we get to the actual value. Just as we can fill our vase with discrete, coarse objects like marbles, pebbles, or coins instead of water, coarser objects simply reduce the resolution of our approximation.
res <- 100 #resolution; the higher the better the approximation
points <- seq(from = 0, to = 1, length.out = res)
xy_points <- expand.grid(x = points, y = points) #gives a pairwise combination of each value
#Plot with points
circle_square_p +
geom_point(data = xy_points, aes(x = x, y = y), size = 0.5, alpha = 0.3)+
lims(x = c(0, 1), y = c(0,1))
To check whether a point lies within the circle, we can use the equation of the circle:
\[ (x - x_o)^2 + (y - y_o)^2 = r^2 \]
\[ (x - 0.5)^2 + (y - 0.5)^2 = 0.5^2 \]
\[ \sqrt{(x - 0.5)^2 + (y - 0.5)^2} = 0.5 \]
Any point that gives a value greater than \(r\) (or 0.5) lies outside of the circle. To implement this, we simply substitute x and y with our “data” (apply the LHS of the equation), and check whether each result is less than 0.5.
library(dplyr) #for the pipe and mutate() used below
approx_pi_fun <- function(n = 100){
#Number of points used to approximate
points <- seq(from = 0, to = 1, length.out = n)
#Expand it to span 2 dimensions (x and y)
xy_points <- expand.grid(x = points, y = points)
#Check whether points are within circle
area_with_points <- xy_points%>%
mutate(distance_to_center = sqrt((x-0.5)^2 + (y-0.5)^2))%>% #circle equation
mutate(in_circle = ifelse(distance_to_center < 0.5, yes = 1, no = 0))
#check condition
#Number of points inside square
area_square <- n*n #all points should be within square
area_circle <- sum(area_with_points$in_circle) #sum up all the 1s
#Approximate pi
approx_pi <- 4*(area_circle/area_square)
return(approx_pi)
}
approx_pi_fun(n = 100)
## [1] 3.0672
Task 1a: Use the approx_pi_fun() function and apply it to varying levels of sample sizes (n) and see how the approximation of \(\pi\) changes
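One way Task 1a could be set up (a sketch; the vector of sample sizes ns and the use of sapply() are just one option, assuming approx_pi_fun() from above is defined):

```r
#Apply approx_pi_fun() to increasing sample sizes
ns <- c(10, 50, 100, 500, 1000)
pi_estimates <- sapply(ns, function(n) approx_pi_fun(n = n))
data.frame(n = ns, approx_pi = pi_estimates) #the estimate should approach pi as n grows
```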
Task 1b: Instead of using seq() can you think of a stochastic way of approximating pi? Imagine throwing darts and adding up the darts that lie within and outside of the circle.
Hint: draw from a probability distribution but think about which one specifically
# #I. take random shots with "equal probability" within the unit square
#
# # random_darts <- data.frame(x = some distribution b/w 0 and 1,
# # y = some distribution b/w 0 and 1)
#
# #Check where darts lie
# pi_exp2 <- random_darts%>%
# mutate(distance_to_center = sqrt((x-0.5)^2 + (y-0.5)^2))%>% #circle equation
# mutate(in_circle = ifelse(distance_to_center < 0.5, yes = 1, no = 0))
#
# #Approximate value of pi
# # No. of points inside circle
# # No. of points inside square
# # ratio * 4
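For reference (try it yourself first!), one possible answer to Task 1b samples dart positions uniformly with runif(); the object names and the number of darts below are just illustrative:

```r
library(dplyr)

#Throw random darts with equal probability anywhere in the unit square
n_darts <- 10000
random_darts <- data.frame(x = runif(n_darts, min = 0, max = 1),
                           y = runif(n_darts, min = 0, max = 1))

#Check where darts lie relative to the circle
pi_exp2 <- random_darts%>%
  mutate(distance_to_center = sqrt((x - 0.5)^2 + (y - 0.5)^2))%>% #circle equation
  mutate(in_circle = ifelse(distance_to_center < 0.5, yes = 1, no = 0))

#Approximate value of pi: ratio of darts inside the circle to all darts, times 4
4*sum(pi_exp2$in_circle)/n_darts
```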
I think it’s safe to say that we all know what the mean (or average) is. It’s a summary statistic (a single value) that best represents a set of values. From Wikipedia: “[The mean] is a central value of a finite set of numbers”. And you probably also know the formula for it:
\[ \overline{x} = \frac{\sum_{i = 1}^{N}x_i}{N} \]
But WHY!?
We can think of the mean in more physical terms: the “center of mass.” In seesaw language, that means the left and the right side of the fulcrum have the same mass; they are balanced:
One way we can quantify the “centerness” of a particular point is by calculating the squared distances between that point and each of the N points in a set of values:
\[ f(\overline{x}) = \sum_{i = 1}^{N}(x_i - \overline{x})^2 \]
If a point is centered, then the distances to the left should be balanced by the distances to the right. Squaring ensures the distances are all positive (as you may have been told before) but it also does something very special.
Task 2. Using the sum of squares (SS) equation, calculate the SS for a range of possible means of a random variable
#Sample the values of a random variable
set.seed(10)
rand <- runif(30)
#Sum of square function
ss_fun <- function(xi, xhat){
#Desired output: Sum of squares as a function of xhat
#output should be a vector of length = length(xhat)
sum_of_squares <- vector(mode = 'numeric', length = length(xhat))
#Loop through each xhat and compute its sum of squares
for(i in 1:length(xhat)){
squares <- (xi - xhat[i])^2 #compute the squares
sum_of_squares[i] <- sum(squares) #sum the squares and save the value
}
return(sum_of_squares)
}
#Generate a sequence of "possible" means
possible_means <- seq(0, 1, length.out = 10)
#Apply SS function to get SS for each xhat
ss <- ss_fun(xi = rand, xhat = possible_means)
#Plot
data.frame(SumOfSquares = ss, possible_mean = possible_means)%>%
mutate(Minimum = ifelse(min(SumOfSquares) == SumOfSquares, TRUE, FALSE))%>%
ggplot(aes(x = possible_mean, y = SumOfSquares))+
geom_line()+
geom_point(aes(col = Minimum))+
geom_vline(xintercept = mean(rand))+
labs(y = "Sum of Squares", x = "Candidate mean")+
theme_bw()
#Experiment with the number of possible means and see what happens to the plot
Apply the chain rule to find the derivative of the SS function: the outer square contributes a factor of 2, and the derivative of \(-\overline{x}\) inside the square contributes \(-1\): \[SS = f(\overline{x}) ={\sum_{i = 1}^{N}(x_i-\overline{x})^2}\] \[\frac{\partial f}{\partial \overline{x}} = \sum_{i = 1}^{N} 2(x_{i} - \overline{x})(-1) = -2\sum_{i = 1}^{N}(x_{i} - \overline{x})\] Setting the derivative to zero and solving for \(\overline{x}\): \[0 = -2\sum_{i = 1}^{N}x_{i} + 2\overline{x}\sum_{i = 1}^{N}1 = -2\sum_{i = 1}^{N}x_{i} + 2\overline{x}N\] \[2\overline{x}N = 2\sum_{i = 1}^{N}x_{i}\] \[\overline{x} = \frac{\sum_{i = 1}^{N}x_{i}}{N}\]
Notice that the resulting function is a curve and that the minimum of that curve lies at the actual mean. Knowing this, and with a bit of calculus, is there anything “special” about the minimum point that you can say? What can we do with that information?
Despite claims of a deterministic universe (Sakarchi 2021, personal communication), random numbers are central to sooo many things (e.g. statistics, Monte Carlo methods, cryptography, ecology & evolution(!)). In computers, we use them all the time without even thinking about it. How is it that every time we call rnorm(1), the computer gives us a number that we have no way of knowing a priori? How is it possible for a deterministic system (i.e. a computer) to generate random numbers? In fact, it is impossible for a computer to generate truly random numbers. The solution is to generate something that is random-like, hence pseudo. Enter chaos: a deterministic phenomenon with irregular (random-like) behavior.
Strictly speaking, chaos is a type of attractor (or equilibrium dynamic) of a dynamical system, a kind of equilibrium just like the carrying capacities of single-species pop’n growth and limit cycles in predator-prey systems. It is similar to carrying capacities and limit cycles in the sense that it is bounded: the trajectories of the state variable/s will always stay within a range of values, and if you perturb the system, it will return to the bounded trajectory. The difference, however (and this is why chaotic systems are useful for generating random numbers), is that chaos has these two characteristics:
Sensitivity to initial conditions - given two identical chaotic systems (i.e. the same governing equation), even the slightest difference in their initial conditions will cause their trajectories to diverge. BUT if the initial conditions are exactly the same, the trajectories will be exactly the same
Aperiodicity - although the trajectories are bounded, they never repeat themselves
Many dynamical systems can give rise to chaotic dynamics: think double pendulums, the weather, and of course, ecology. Although chaos is more likely to arise in complicated systems (with many interacting components), it can nonetheless arise in the simplest non-linear dynamical system: the logistic map, or population growth with non-overlapping generations, popularized by Robert May.
This is the equation:
\[ N_{t+1} = rN_{t}(1-N_{t}) \]
Different values of \(r\) (i.e. the intrinsic growth rate) can give rise to a fixed-point equilibrium, limit cycles, or chaos:
Task 3a: Convert the logistic map equation into a computer function/program
Hint: What arguments would that function have? What is the population size at time t+1 determined by?
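If you get stuck, here is one possible implementation (a sketch: the name logistic_map and the arguments r, N0, and timesteps are chosen so it plugs into the code below, and it returns the trajectory as a numeric vector; your version may differ):

```r
#Logistic map: iterate N[t+1] = r * N[t] * (1 - N[t]) for a given number of timesteps
logistic_map <- function(r, N0, timesteps){
  N <- vector(mode = 'numeric', length = timesteps + 1)
  N[1] <- N0 #initial condition
  for(t in 1:timesteps){
    N[t + 1] <- r*N[t]*(1 - N[t])
  }
  return(N)
}
logistic_map(r = 2, N0 = 0.3, timesteps = 5)
```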
Let’s take a look at the dynamics of the system with different values of \(r\):
#Values of r
rs <- c(1.5, 2.5, 3.3, 4)
#Apply each to logistic map function
out <- lapply(rs, function(x)logistic_map(r = x, N0 = 0.3, timesteps = 30))%>%
setNames(nm = rs)%>%
bind_cols()%>% #one column per value of r
tibble::rowid_to_column(var = 'time')%>%
pivot_longer(cols = -1, names_to = 'r', values_to = 'N')
#Plot
out%>%
ggplot(aes(x = time, y = N))+
geom_line()+
facet_wrap(~r)+
theme_bw()
What is similar and what is different about the resulting dynamics?
The simulation above only shows how the periodicity changes among different values of \(r\). Let’s now see what it means to exhibit sensitivity to initial conditions.
Here is an example with varying N0 for the system with r = 1.5:
#Set of initial conditions
N0 <- seq(from = 0.3, to = 0.4, length.out = 5)
#Apply each N0 to logistic_map(r = 1.5)
lmap_stable <- lapply(N0, function(x)logistic_map(r = 1.5, N0 = x, timesteps = 30))%>%
setNames(nm = N0)%>%
bind_cols()%>% #one column per initial condition
tibble::rowid_to_column(var = 'time')%>%
pivot_longer(cols = -1, names_to = 'N0', values_to = "N")%>%
mutate(N0 = as.numeric(N0))
#Plot
lmap_stable%>%
ggplot(aes(x = time, y = N, group = N0, col = N0))+
scale_color_continuous(type = "gradient")+
geom_line()+
theme_bw()
Task 3b: Modify the code above and see what happens when we change \(r\) to 4.
As you can see, the trajectory of a chaotic system is highly dependent on initial conditions. If we know the initial conditions, then we know exactly how the system will progress. In other words, every time we simulate the chaotic system with N0 = 0.3, it will always produce the exact same solution. In that sense, there is nothing random about it! So going back to our original question, how is it that rnorm(1) gives us a different value every single time? If the chaotic system is “seeded” with a random initial condition, can we really say that the sequence of numbers is random? So how do we seed it with something “random”?
Coupling randomness w/ deterministic chaos
#Generate a random seed
random_seed <- format(Sys.time(), "%OS5")%>%
substr(start = 3, stop = 7)%>%
as.numeric
#Use the seed as initial condition
logistic_map(r = 4, N0 = random_seed, timesteps = 30)%>%
tail(20)%>% #remove transient
order() #ignore the values and just look at their order
## [1] 19 20 7 4 14 10 8 12 2 5 17 15 16 1 11 9 13 3 6 18
Let’s apply the CLT to see if we can convert a chaotic system into a Gaussian random variable.
The Central Limit Theorem states that when independent random variables are summed, the distribution of those sums will approximate a normal distribution.
We can “simulate” independent variables by dividing a single chaotic series into chunks and assuming that each chunk represents an independent variable.
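Before applying the chunk-and-sum idea to the chaotic series, it can be sketched with ordinary uniform draws (a sketch; the chunk size of 10 and the use of runif() are my choices for illustration):

```r
#Draw 1000 uniform values and split them into 100 chunks of 10
draws <- runif(1000)
chunks <- matrix(draws, nrow = 10, ncol = 100) #each column is one "independent variable"
chunk_sums <- colSums(chunks) #sum within each chunk
hist(chunk_sums) #roughly bell-shaped, centered near 10 * 0.5 = 5
```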
Task 3c. Using chaos coupled with a random seed and the CLT, create a random variable that follows a normal distribution.