** STATA Sample Simulation program by Colin Cameron ** Program stsim.do September 2001 ** This runs under version 7 but not version 8 * For more explanation on basic Stata see stdemo.do * Simulation can be used a number of ways. * (A) Simulate Once: * Generate data using random number generators and estimate a model once. * This is usual regression except that data is generated rather than real. * (B) Simulate a number of times * Do the same a number of times and look at sample distribution of estimators. * This is a simulation and uses the Stata command simul * (C) Simulate using bootstrap resample * Resample from a data set to get more refined small sample distribution. * This is a bootstrap and uses the Stata command bstrap * (D) Simulate to Estimate * Use simulation methods to permit estimation of regression parameters. * This is maximum simulated likelihood or simulated method of moments. * This program considers (A), (B) and (C). * It compares usual OLS and robust White standard errors * for regression with heteroskedastic errors. ********** SETUP clear capture log close log using stsim.txt, text replace di "stsim.txt by Colin Cameron: Stata simulation example" version 7 ********** (A) SIMULATE ONCE ****************************** * Heteroscedastic error example * y = a + bx + u where a=1 b=1 u ~ N(0,1*x^2) * Sample size n is defined in the set obs command * (1) Generate data * Set seed so get same random data in future runs of same program set seed 10101 drop _all /* Drop previous data */ * Change this for different sample size set obs 20 * Generate N(0,1) using inverse-transformation method * If x ~ N(0,1) then PHI(x) = u so x = PHI-inverse(u) where PHI is N(0,1) cdf. gen x = invnorm(uniform()) gen u = x*invnorm(uniform()) gen y = 1 + 1*x + u summarize * (2) Estimate model by OLS with usual and robust standard errors regress y x regress y x, robust ********** (B) SIMULATE A NUMBER OF TIMES ******************** * To perform a simulation * (1) Setup: generate and/or read in initial data * (2) Define the program that will be run a number of times * In Stata this is command program and has an arcane format * (3) Run the simulations * In Stata use command simul * (4) Summarize the results using descriptive statistics and/or graph * Step (2) in Stata uses an arcane format * To define the progam called progname, say, use the following outline: * program define progname * if "`1'"=="?" { * global S_1 "variable names" * exit * } * perform single simulation * post `1' results * end * There must be the same number of results following the post command as variable * names following the global S_1 command. See example below. * Then Step (3) with 100 reps, say, uses the command * simul progname reps (100) * Heteroscedastic error example * y = a + bx + u where a=1 b=1 u ~ N(0,1*x^2) * Post coefficient estimates and standard errors: a se(a) b se(b) c se(c) * A big difference is between * (B1) Completely random sampling, with both new draws of x and y. * (B2) x fixed in repeated samples, with just new draws of y. **** (B1) Completely random sampling * For large n,S use n=100, S=10000. For small n,S use n=20, S=100. * As written there n = ?? and S = ?? simulations * Change n in set obs ?? and change S in reps(??) * (1) Setup: * Set seed so get same random data in future runs of same program set seed 10101 * (2) Define the simulation program. * Compute both usual and White standard errors of slope coefficient. program define simrandm version 6.0 if "`1'" == "?" { global S_1 "b_const se_const b_slope se_usual se_white" exit } drop _all set obs 20 /* Change sample size by change set obs */ gen x = invnorm(uniform()) gen y = 1 + 1*x + x*invnorm(uniform()) regress y x, robust scalar sewhite = _se[x] regress y x post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite) end * (3) Do the simulations * Change number of simulations by changing reps simul simrandm, reps(1000) * (4) Post results * Theory says that for the dgp y = 1+1*x+x*e where x~N(0,1) and e~N(0,1) * b_slope equals 1 (from the dgp) * se_usual equals square root (1/n) where n is sample size * se_white equals square root (3/n) where n is sample size describe summarize kdensity b_slope, normal s(.) xlab ylab /* */ title("Density of slope coeff in heteroskedasticity example") **** (B2) x fixed in repeated sampling * For large n,S use n=100, S=10000. For small n,S use n=20, S=100. * As written there n = ?? and S = ?? simulations * Change n in set obs ?? and change S in reps(??) * (1) Setup: First generate x to be used in repeated samples set seed 10101 /* So get same random data in future runs */ drop _all /* Drop previous data */ set obs 20 /* Change sample size by change set obs */ gen x = invnorm(uniform()) /* Using inverse-transformation method */ gen mean_y = 1+1*x /* Saves recalculating in each simulation */ save fixedx, replace /* Data to be re-used in each simulation */ sum * (2) Define the simulation program program define simfixed version 6.0 if "`1'" == "?" { global S_1 "b_const se_const b_slope se_usual se_white" exit } use fixedx, clear gen y = mean_y + x*invnorm(uniform()) regress y x, robust scalar sewhite = _se[x] regress y x post `1' (_b[_cons]) (_se[_cons]) (_b[x]) (_se[x]) (sewhite) end * (3) Do the simulations * Change number of simulations by changing reps simul simfixed, reps(1000) * (4) Post results * Theory gives results for random x. See above. For fixed ?? describe summarize kdensity b_slope, normal s(.) xlab ylab /* */ title("Density of slope coeff in heteroskedasticity example") ********** (C) BOOTSTRAP ******************** * This does some bootstrap. * But for more details go to separate program stboot.do * The program produces three bootstrap 100*(1-alpha) confidence intervals * (1) Regular asymptotic normal: bhat +/- z_alpha/2*se(bhat) * where se(bhat) is the standard deviation of bhat from the bootstrap reps * (2) Percentile method: This orders the bhat's from simulations and * goes from alpha/2 lowest bhat's to the alpha/2 highest bhat's * (3) Bootstrap-corrected. This works with the pivotal Wald statistic. * See manual or a textbook. e.g. Efron/Tibsharani (1993, pp.184-188) with a=0 * This orders the bhat's from simulations and * goes from p1 to the p2 highest * where p1 and p2 are bias-correction adjustments to alpha/2 and 1-alpha/2 * Let p1 = Phi(2z0 - z_alpha/2) p2 = Phi(2z0 + z_alpha/2) * z0 measures the median bias in bhat with * z0 = Phi-inv(fraction of the bhat(s) < bhat) * And if z0=0 then p1 = alpha/2 and no correction * (1) and (2) are not asymptotic refinements. * (3) is an asymptotic refinement and is better in small samples. * (4) Stata does not give the bootstrap t-interval. See progran stboot.do for this. * Generate the data set to be used set seed 10101 /* So get same random data in future runs */ drop _all /* Drop previous data */ set obs 20 /* Change sample size by change set obs */ gen x = invnorm(uniform()) /* Using inverse-transformation method */ gen u = x*invnorm(uniform()) gen y = 1 + 1*x + u sum * Do the bootstrap regress y x bs "regress y x" "_b[x]", reps(999) level(95) ********** CLOSE OUTPUT log close