** STATA Sample Simulation program by Colin Cameron
** Program stsim.do September 2001
** This runs under version 7 but not version 8
* For more explanation on basic Stata see stdemo.do
* Simulation can be used in a number of ways.
* (A) Simulate Once:
* Generate data using random number generators and estimate a model once.
* This is usual regression except that the data are generated rather than real.
* (B) Simulate a number of times
* Do the same thing a number of times and look at the sampling distribution of the estimators.
* This is a simulation and uses the Stata command simul
* (C) Simulate using bootstrap resample
* Resample from a data set to get a more refined small-sample distribution.
* This is a bootstrap and uses the Stata command bstrap
* (D) Simulate to Estimate
* Use simulation methods to permit estimation of regression parameters.
* This is maximum simulated likelihood or the simulated method of moments.
* This program considers (A), (B) and (C).
* It compares usual OLS and robust White standard errors
* for regression with heteroskedastic errors.
********** SETUP
clear
capture log close
log using stsim.txt, text replace
di "stsim.txt by Colin Cameron: Stata simulation example"
version 7
********** (A) SIMULATE ONCE ******************************
* Heteroskedastic error example
* y = a + bx + u where a=1 b=1 u ~ N(0,1*x^2)
* Sample size n is defined in the set obs command
* (1) Generate data
* Set seed so get same random data in future runs of same program
set seed 10101
drop _all /* Drop previous data */
* Change this for different sample size
set obs 20
* Generate N(0,1) using inverse-transformation method
* If u ~ Uniform(0,1) then x = PHI-inverse(u) ~ N(0,1), where PHI is the N(0,1) cdf.
gen x = invnorm(uniform())
gen u = x*invnorm(uniform())
gen y = 1 + 1*x + u
summarize
* (2) Estimate model by OLS with usual and robust standard errors
regress y x
regress y x, robust
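* As a hedged extra comparison (an addition to the original program; the
* scalar name serob is a scratch name introduced here for illustration),
* display the usual and robust standard errors of the slope side by side:
quietly regress y x, robust
scalar serob = _se[x]
quietly regress y x
di "usual se(b_slope) = " _se[x] "  robust se(b_slope) = " serob
scalar drop serob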
********** (B) SIMULATE A NUMBER OF TIMES ********************
* To perform a simulation:
* (1) Setup: generate and/or read in the initial data
* (2) Define the program that will be run a number of times
* In Stata this uses the command program, which has an arcane format
* (3) Run the simulations
* In Stata use the command simul
* (4) Summarize the results using descriptive statistics and/or a graph
* To define the program called progname, say, use the following outline:
* program define progname
* if "`1'"=="?" {
* global S_1 "variable names"
* exit
* }
* perform single simulation
* post `1' results
* end
* There must be the same number of results following the post command as variable
* names following the global S_1 command. See example below.
* Then Step (3) with 100 reps, say, uses the command
* simul progname, reps(100)
* Heteroskedastic error example
* y = a + bx + u where a=1 b=1 u ~ N(0,1*x^2)
* Post coefficient estimates and standard errors: a se(a) b se(b) c se(c)
* A big difference is between
* (B1) Completely random sampling, with both new draws of x and y.
* (B2) x fixed in repeated samples, with just new draws of y.
**** (B1) Completely random sampling
* For large n,S use n=100, S=10000. For small n,S use n=20, S=100.
* As written, n = 20 and S = 1000 simulations.
* Change n in set obs and change S in reps().
* (1) Setup:
* Set seed so get same random data in future runs of same program
set seed 10101
* (2) Define the simulation program.
* Compute both usual and White standard errors of slope coefficient.
program define simrandm
version 6.0
if "`1'" == "?" {
global S_1 "b_const se_const b_slope se_usual se_white"
exit
}
drop _all
set obs 20 /* Change sample size by changing set obs */
gen x = invnorm(uniform())
gen y = 1 + 1*x + x*invnorm(uniform())
regress y x, robust
scalar sewhite = _se[x]
regress y x
post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite)
end
* (3) Do the simulations
* Change number of simulations by changing reps
simul simrandm, reps(1000)
* (4) Summarize the results
* Theory says that for the dgp y = 1 + 1*x + x*e, where x~N(0,1) and e~N(0,1):
* b_slope is centered on 1 (from the dgp)
* se_usual is approximately sqrt(1/n), where n is the sample size
* se_white is approximately sqrt(3/n), since E[x^4] = 3 for x~N(0,1)
describe
summarize
kdensity b_slope, normal s(.) xlab ylab /*
*/ title("Density of slope coeff in heteroskedasticity example")
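* As a hedged check (an addition to the original program), compare the
* simulation averages of se_usual and se_white reported by summarize
* with the theoretical values sqrt(1/n) and sqrt(3/n) for n = 20:
di "theory: se_usual = " sqrt(1/20) "  se_white = " sqrt(3/20)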
**** (B2) x fixed in repeated sampling
* For large n,S use n=100, S=10000. For small n,S use n=20, S=100.
* As written, n = 20 and S = 1000 simulations.
* Change n in set obs and change S in reps().
* (1) Setup: First generate x to be used in repeated samples
set seed 10101 /* So get same random data in future runs */
drop _all /* Drop previous data */
set obs 20 /* Change sample size by changing set obs */
gen x = invnorm(uniform()) /* Using inverse-transformation method */
gen mean_y = 1+1*x /* Saves recalculating in each simulation */
save fixedx, replace /* Data to be re-used in each simulation */
sum
* (2) Define the simulation program
program define simfixed
version 6.0
if "`1'" == "?" {
global S_1 "b_const se_const b_slope se_usual se_white"
exit
}
use fixedx, clear
gen y = mean_y + x*invnorm(uniform())
regress y x, robust
scalar sewhite = _se[x]
regress y x
post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite)
end
* (3) Do the simulations
* Change number of simulations by changing reps
simul simfixed, reps(1000)
* (4) Summarize the results
* Theory gives results for random x; see above. For fixed x the variance of
* b_slope depends on the particular draw of x.
describe
summarize
kdensity b_slope, normal s(.) xlab ylab /*
*/ title("Density of slope coeff in heteroskedasticity example")
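* As a hedged extra check (an addition to the original program; xd, xd2 and
* num are scratch variable names introduced here): with x fixed the true
* variance of b_slope is sum(xd^2*x^2)/(sum(xd^2))^2, where xd = x - xbar
* and x^2 is the error variance from the dgp. Compute it from fixedx.dta:
use fixedx, clear
quietly summarize x
gen xd = x - r(mean)
gen xd2 = xd^2
gen num = xd2*x^2 /* xd_i^2 * var(u_i), with var(u_i) = x_i^2 from the dgp */
quietly summarize num
scalar sumnum = r(sum)
quietly summarize xd2
scalar sumxd2 = r(sum)
di "theory for this fixed x: se(b_slope) = " sqrt(sumnum/sumxd2^2)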
********** (C) BOOTSTRAP ********************
* This does a basic bootstrap.
* But for more details go to separate program stboot.do
* The program produces three bootstrap 100*(1-alpha) percent confidence intervals
* (1) Regular asymptotic normal: bhat +/- z_alpha/2*se(bhat)
* where se(bhat) is the standard deviation of bhat from the bootstrap reps
* (2) Percentile method: This orders the bhat's from the simulations and
* runs from the alpha/2 quantile to the 1-alpha/2 quantile of the bhat's
* (3) Bias-corrected (BC): This adjusts the percentile interval for median bias in bhat.
* See the manual or a textbook, e.g. Efron/Tibshirani (1993, pp.184-188) with a=0
* This orders the bhat's from the simulations and
* runs from the p1 quantile to the p2 quantile
* where p1 and p2 are bias-corrected adjustments to alpha/2 and 1-alpha/2
* Let p1 = Phi(2z0 - z_alpha/2) p2 = Phi(2z0 + z_alpha/2)
* z0 measures the median bias in bhat with
* z0 = Phi-inv(fraction of the bhat(s) < bhat)
* If z0 = 0 then p1 = alpha/2 and p2 = 1 - alpha/2, so there is no correction
* (1) and (2) are not asymptotic refinements.
* (3) is an asymptotic refinement and is better in small samples.
* (4) Stata does not give the bootstrap t-interval. See program stboot.do for this.
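* As a hedged numerical illustration (the numbers are invented): if 55% of
* the bhat's lie below bhat then z0 = invnorm(.55), and with z_alpha/2 = 1.96
* the bias-corrected percentiles are
* di normprob(2*invnorm(.55) - 1.96) /* p1, roughly .04 */
* di normprob(2*invnorm(.55) + 1.96) /* p2, roughly .99 */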
* Generate the data set to be used
set seed 10101 /* So get same random data in future runs */
drop _all /* Drop previous data */
set obs 20 /* Change sample size by changing set obs */
gen x = invnorm(uniform()) /* Using inverse-transformation method */
gen u = x*invnorm(uniform())
gen y = 1 + 1*x + u
sum
* Do the bootstrap
regress y x
bs "regress y x" "_b[x]", reps(999) level(95)
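* As a hedged alternative (the saving() option and the name of the saved
* replicate variable are assumptions; check with describe after loading),
* the percentile interval can be recomputed by hand from saved replicates:
* bs "regress y x" "_b[x]", reps(999) level(95) saving(bsreps)
* use bsreps, clear
* centile bs1, centile(2.5 97.5)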
********** CLOSE OUTPUT
log close