Simulation is one of the most common applications of programming. In statistics, simulation is almost necessary if the results depend on a model that may not be totally correct or an asymptotic conclusion when small sample properties are unknown. In 2006 September issue of JSAS, there are 25 theory and methods articles, 16 of them used simulation.
In R Functions & datasets, there are many functions to generate random deviates. Most of them start with r. So it is easy to find them. Here are some of them
One very important step in simulation is set the initial seed for your simulation. It is the function rngseed(x), where x should be a large integer. The set.seed() function works the same way.
Here are the two most important random deviates.
Let us starts with 4 different distributions
postscript("Four_distr.ps")
par(mfrow=c(2,2))
set.seed(12345)
x<-runif(1000)
hist(x)
y<-rnorm(1000)
hist(y)
z<-rgamma(1000,shape=4,scale=2.5) # f(x) = C x^(4-1)exp(-x/2.5)
hist(z)
w<-rbeta(1000,5,2) # f(x) = C x^(5-1)*(1-x)^(2-1)
hist(w)
dev.off()
<\PRE>
Four_distr.ps
Here are first order autoregressive time series with different correlation
coefficient. (The model is given in the comments of the program.)
# Different shapes of a autoregressive process
# Saved as autoreg3.s
# Various forms of (1-0.8B)(1+0.3B)Y(t)=Z(t) #1-#3.
# 1 Y(t)=0.5Y(t-1)+Z(t)
# 2 Y(t)=0.5Y(t-1)+0.24Y(t-1)+Z(t)
# 3 Y(t)=0.8Y(t-1)+Z(t)
# 4 Y(t)=-0.8Y(t-1)+Z(t)
postscript("timeseries.ps")
par(mfrow=c(2,2)) # Four figures in the output
t<-c(1:200); # Generate t from 1 to 200
set.seed(12345); # Set seed from random number generatots
par(mfrow=c(2,2))
try.ar1<-function(n,a1,a2)
{
y<-c(1:n); y[1]<-0; y[2]<-0
for(j in c(3:n)) y[j]<-a1*y[j-1]+a2*y[j-2]+rnorm(1)
y}
n<-200
a1<- 0.5
a2<- 0.0
y1<-try.ar1(n,a1,a2)
ts.plot(y1,main="AR(1), 0.5") # Plot the time series y1
a1<- 0.5
a2<- 0.24
y2<-try.ar1(n,a1,a2)
ts.plot(y2,main="AR(2), 0.5, 0.24")
a1<- 0.9
a2<- 0.0
y3<-try.ar1(n,a1,a2)
ts.plot(y3,main="AR(1), 0.9")
a1<- -0.8
a2<- 0.0
y4<-try.ar1(n,a1,a2)
ts.plot(y4,main="AR(1), -0.8")
dev.off()
Suppose we are interested in how robust is the least squares estimate
b=(X'X)^(-1)X'Y when the error terms are not normal. The two other error
terms are uniform (short tail distribution) and exponential (long tail
distribution). Both of them are adjusted so that the errors have mean 0
and variance 1.
postscript("robustness.ps")
par(mfrow=c(2,2))
set.seed(123456)
n=20
x<-runif(n)*5
y1<-c(1:n)
e1<-rnorm(n) # normal residuals
for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
result1<-lm(y1~x)
summary(result1)
plot(x,y1)
abline(lsfit(x,y1),lty=2)
c=sqrt(12.0)
e2<-(runif(n)-0.5)*c # uniform residuals
y2<-c(1:n)
for(i in c(1:n)) y2[i]=1+0.5*x[i]+e2[i]
result2<-lm(y2~x)
summary(result2)
plot(x,y2)
abline(lsfit(x,y2),lty=2)
e3<-rexp(n)-1 # exponential residuals
y3<-c(1:n)
for(i in c(1:n)) y3[i]=1+0.5*x[i]+e3[i]
result3<-lm(y3~x)
summary(result3)
plot(x,y3)
abline(lsfit(x,y3),lty=2)
dev.off()
The outputs from the summary all show significance of the slope and
intercept. The p-values for the slopes are 0.015, 0.0002, and 0.015 for
normal, uniform and exponential errors respectively.
From the range of the residuals, we can see that y3 has much longer tail
than the other two. This also shows the merit of mixture of graphics and
data analysis. The when the seed is changed to set.seed(1234567), we have
the following graphs. The p-values are 0.00004, 0.0001 and 0.0005 for
normal, uniform and exponential errors respectively. It is surprising it
have such a large variation.
A typical output from summary() is
>summary(result1)
Residuals:
Min 1Q Median 3Q Max
-1.18925 -0.55789 0.07722 0.54001 1.80446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7557 0.3935 4.461 0.000302 ***
x 0.3194 0.1184 2.699 0.014688 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.7897 on 18 degrees of freedom
Multiple R-Squared: 0.2881, Adjusted R-squared: 0.2485
F-statistic: 7.284 on 1 and 18 DF, p-value: 0.01469
Repeated simulations
The previous example tells us that we need to do more than one experiment
to check the accuracy of an estimate.
In other words, we need to check the distribution of the estimates.
To do this, we need to know how to get the R lm() output and summary them.
The manual usually provided the information. For the lm example,
see the statement slope[j]=fit$coef[2].
robust<-function(ns,n,error,a,c) {
slope<-c(1:ns)
x<-runif(n)*10
y1<-c(1:n)
for (j in c(1:ns)) {
e1<-(error(n)-a)*c # normalizing residuals
for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
fit<-lm(y1~x)
slope[j]=fit$coef[2] }
mean<-mean(slope)
std<-var(slope)^0.5
list(mean,std)
}
set.seed(12345)
test1<-robust(1000,10,rnorm,a=0,c=1) # Note the error
# function in the function statement
test2<-robust(1000,10,runif,a=0.5,c=3.464) # 3.464 = sqrt(12)
test3<-robust(1000,10,rexp,a=1,c=1)
The results shows
mean std
normal 0.515 0.125
uniform 0.490 0.158
exp 0.502 0.110
<\PRE>
Apparently, the regression estimate is not sensitive to the error
distribution. It would be better if we can check the relation between
the estimated slope and its standard error. Hopefully, the worse the estimate,
the larger the standard error. In other words, the confidence interval coverage is
robust. Unfortunately, there is no
way to recover the standrad error from the R output (see summary(result1).