** STATA Sample Simulation program by Colin Cameron
** Program stsim.do September 2001
** This runs under version 7 but not version 8

* For more explanation on basic Stata see stdemo.do

* Simulation can be used a number of ways.

* (A) Simulate Once: 
*     Generate data using random number generators and estimate a model once.
*     This is usual regression except that data is generated rather than real.

* (B) Simulate a number of times
*     Do the same a number of times and look at sample distribution of estimators.
*     This is a simulation and uses the Stata command simul

* (C) Simulate using bootstrap resample
*     Resample from a data set to get more refined small sample distribution.
*     This is a bootstrap and uses the Stata command bstrap

* (D) Simulate to Estimate
*     Use simulation methods to permit estimation of regression parameters.
*     This is maximum simulated likelihood or simulated method of moments.

* This program considers (A), (B) and (C).
* It compares usual OLS and robust White standard errors 
* for regression with heteroskedastic errors.


********** SETUP 
clear
capture log close   
log using stsim.txt, text replace
di "stsim.txt by Colin Cameron: Stata simulation example"
version 7

**********  (A) SIMULATE ONCE  ******************************

* Heteroscedastic error example
* y = a + bx + u  where a=1 b=1 u ~ N(0,1*x^2)
* Sample size n is defined in the set obs command

* (1) Generate data
* Set seed so get same random data in future runs of same program
set seed 10101      
drop _all       /* Drop previous data */
* Change this for different sample size
set obs 20
* Generate N(0,1) using inverse-transformation method
* If x ~ N(0,1) then PHI(x) = u so x = PHI-inverse(u) where PHI is N(0,1) cdf.
gen x = invnorm(uniform())
gen u = x*invnorm(uniform())
gen y = 1 + 1*x + u
summarize

* (2) Estimate model by OLS with usual and robust standard errors
regress y x
regress y x, robust


**********  (B) SIMULATE A NUMBER OF TIMES  ******************** 

* To perform a simulation
* (1) Setup: generate and/or read in initial data
* (2) Define the program that will be run a number of times
*     In Stata this is command program and has an arcane format
* (3) Run the simulations
*     In Stata use command simul
* (4) Summarize the results using descriptive statistics and/or graph

* Step (2) in Stata uses an arcane format
* To define the progam called progname, say, use the following outline:
*        program define progname
*                if "`1'"=="?" {
*                        global S_1 "variable names"
*                        exit
*                }
*                perform single simulation
*                post `1' results
*        end
* There must be the same number of results following the post command as variable
* names following the global S_1 command. See example below.

* Then Step (3) with 100 reps, say, uses the command
* simul progname reps (100) 

* Heteroscedastic error example
* y = a + bx + u  where a=1 b=1 u ~ N(0,1*x^2)
* Post coefficient estimates and standard errors: a se(a) b se(b) c se(c)

* A big difference is between 
* (B1) Completely random sampling, with both new draws of x and y.
* (B2) x fixed in repeated samples, with just new draws of y.


**** (B1) Completely random sampling

* For large n,S use n=100, S=10000. For small n,S use n=20, S=100.
* As written there n = ?? and S = ?? simulations
* Change n in set obs ?? and change S in reps(??)

* (1) Setup: 
* Set seed so get same random data in future runs of same program
set seed 10101      

* (2) Define the simulation program. 
*     Compute both usual and White standard errors of slope coefficient.       
program define simrandm
version 6.0 
if "`1'" == "?" {
   global S_1 "b_const se_const b_slope se_usual se_white"
   exit  
   }
   drop _all
   set obs 20          /* Change sample size by change set obs */
   gen x = invnorm(uniform())
   gen y = 1 + 1*x + x*invnorm(uniform())
   regress y x, robust
   scalar sewhite = _se[x]
   regress y x
   post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite)
end

* (3) Do the simulations
* Change number of simulations by changing reps 
simul simrandm, reps(1000)

* (4) Post results
* Theory says that for the dgp y = 1+1*x+x*e where x~N(0,1) and e~N(0,1)
*   b_slope equals 1  (from the dgp)
*   se_usual equals square root (1/n) where n is sample size
*   se_white equals square root (3/n) where n is sample size
describe
summarize

kdensity b_slope, normal s(.) xlab ylab /*
  */ title("Density of slope coeff in heteroskedasticity example")


**** (B2) x fixed in repeated sampling

* For large n,S use n=100, S=10000. For small n,S use n=20, S=100.
* As written there n = ?? and S = ?? simulations
* Change n in set obs ?? and change S in reps(??)

* (1) Setup: First generate x to be used in repeated samples 
set seed 10101              /* So get same random data in future runs */
drop _all                   /* Drop previous data */
set obs 20                  /* Change sample size by change set obs   */
gen x = invnorm(uniform())  /* Using inverse-transformation method    */
gen mean_y = 1+1*x          /* Saves recalculating in each simulation */  
save fixedx, replace        /* Data to be re-used in each simulation  */
sum

* (2) Define the simulation program
program define simfixed
version 6.0 
if "`1'" == "?" {
   global S_1 "b_const se_const b_slope se_usual se_white"
   exit  
   }
   use fixedx, clear
   gen y = mean_y + x*invnorm(uniform())
   regress y x, robust
   scalar sewhite = _se[x]
   regress y x
   post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite)
end

* (3) Do the simulations
* Change number of simulations by changing reps 
simul simfixed, reps(1000)

* (4) Post results
* Theory gives results for random x. See above. For fixed ??
describe
summarize

kdensity b_slope, normal s(.) xlab ylab /*
  */ title("Density of slope coeff in heteroskedasticity example")


**********  (C) BOOTSTRAP  ******************** 

* This does some bootstrap.
* But for more details go to separate program stboot.do

* The program produces three bootstrap 100*(1-alpha) confidence intervals
* (1) Regular asymptotic normal: bhat +/- z_alpha/2*se(bhat)
*     where se(bhat) is the standard deviation of bhat from the bootstrap reps
* (2) Percentile method: This orders the bhat's from simulations and
*     goes from alpha/2 lowest bhat's to the alpha/2 highest bhat's 
* (3) Bootstrap-corrected. This works with the pivotal Wald statistic.
*     See manual or a textbook. e.g. Efron/Tibsharani (1993, pp.184-188) with a=0
*     This orders the bhat's from simulations and
*     goes from p1 to the p2 highest
*     where p1 and p2 are bias-correction adjustments to alpha/2 and 1-alpha/2
*     Let p1 = Phi(2z0 - z_alpha/2)  p2 = Phi(2z0 + z_alpha/2)
*         z0 measures the median bias in bhat with
*         z0 = Phi-inv(fraction of the bhat(s) < bhat)
*     And if z0=0 then p1 = alpha/2 and no correction 

* (1) and (2) are not asymptotic refinements.
* (3) is an asymptotic refinement and is better in small samples.
* (4) Stata does not give the bootstrap t-interval. See progran stboot.do for this.

* Generate the data set to be used
set seed 10101              /* So get same random data in future runs */
drop _all                   /* Drop previous data */
set obs 20                  /* Change sample size by change set obs   */
gen x = invnorm(uniform())  /* Using inverse-transformation method    */
gen u = x*invnorm(uniform())
gen y = 1 + 1*x + u
sum

* Do the bootstrap
regress y x
bs "regress y x"   "_b[x]", reps(999) level(95)


********** CLOSE OUTPUT
log close