** STATA Sample Program by Colin Cameron
** Program stpanel.do  May 2001 (began October 1999)

* To run you need file
*   patr7079.asc
* in your directory

* This program demonstrates panel data models in Stata
*   read in data in wide form: one observation has data for all years
*   reshape data to long form: separate observations for each year
*   linear panel data models
*   nonlinear panel data models

********** STATA SETUP
clear
* clears previously opened files
capture log close
* capture in front means program continues even if no log file is open
log using stpanel.log, replace
* creates output file
* replace here means existing file of same name will be overwritten
* and if this file is already open then give command log close
di "stpanel.do by Colin Cameron: Stata panel regression example"
set maxvar 100 width 1000
* If you need more memory then in Stata give the command help memory

********** OVERVIEW OF STATA PANEL COMMANDS
*
* Stata has many panel commands.
* For linear models the main commands are
*   XTREG    Classic fixed and random effects
*   XTGEE    GLS of constant coeff model for panels independent over i
*            This is typical short panel such as NLSY, PSID, ...
*   XTGLS    GLS of constant coeff model for panels correlated over i
*            This is typical macro such as cross-country and region
* And other linear model commands include
*   XTDATA   exploratory analysis with the XTREG models
*   XTHAUS   test of fixed versus random effects
*   XTREGAR  fixed and random effects with AR(1) error
*   XTIVREG  IV estimation of fixed and random effects models
*   XTABOND  IV estimation of fixed effects model with lagged dep regressors
*   XTPCSE   variant of XTGLS with simpler model estimated with correct s.e.'s
*   XTRCHH   random coefficients model

* XTREG
*   for model y_it = x_it'b + a_i + e_it where a_i is unobserved indiv effect
*   Options
*     fe    Fixed effects or within or LSDV
*     be    Between
*     re    Random effects with GLS estimates of error component variances
*     mle   Random effects with ML estimates of error component variances
*     pa    Population-averaged - see xtgee

* XTGEE
*   Also for some nonlinear models - see below. Default is linear.
*   Linear model y_it = x_it'b + u_it for u_it independent over i and small T
*   Estimation is by WLS or GLS with different possible models for cor(u_it)
*   Options
*     robust gives robust standard errors that permit general cor(u_it,u_is)
*       b = (X'RX)^-1 * X'Ry where R is working matrix defined by corr( )
*       V[b] = (X'RX)^-1 without the robust option
*       V[b] = (X'RX)^-1 * X'RVR'X * (X'RX)^-1 if the robust option is used
*     corr(independent)   cor(u_it,u_is) = 0    t^=s  i.e. OLS
*     corr(exchangeable)  cor(u_it,u_is) = rho  t^=s  i.e.
*                         equicorrelation = RE
*                         This is the default
*     corr(ar g)          cor(u_it,u_is) = defined by an AR(g) model
*     corr(stationary g)  cor(u_it,u_is) = defined by an MA(g) model
*     corr(unstructured)  cor(u_it,u_is) = rho_ts has an unrestricted structure
*     user-specified R is also possible

* XTGLS
*   Linear model y_it = x_it'b + u_it for u_it correlated over i and small n
*   If n is large this command will need a lot of memory and take a long time
*   Stacking the observations for individual i gives y_i = x_i'b + u_i
*   Estimation is by GLS and GLS standard errors are given
*   Options
*     panels defines the correlation over i
*       panels(iid)         cov(u_i,u_j) = 0 and var(u_i) = s^2 * I
*                           so u_it uncorrelated over i with same variance over i
*                           This is the default
*       panels(hetero)      cov(u_i,u_j) = 0 and var(u_i) = (s_i)^2 * I
*                           so u_it uncorrelated over i with diff variance over i
*       panels(correlated)  cov(u_i,u_j) = s_ij * I
*                           so u_it correlated over i with different variance
*                           and covariance over i
*                           this is basic seemingly unrelated regressions
*     corr defines the correlation over t
*       corr(independent)   cor(u_it,u_is) = 0
*                           so u_it uncorrelated over t
*                           This is the default
*       corr(ar1)           cor(u_it,u_is) = defined by AR(1) model
*                           with same rho for each i
*       corr(psar1)         cor(u_it,u_is) = defined by AR(1) model
*                           with different rho for each i

* XTPCSE
*   Same model as XTGLS
*   Instead of GLS estimation
*   use OLS or AR1 error estimator with no correlation over i but then
*   get correct standard errors allowing for possible correlation over i
*   Stata calls this "panel-corrected standard errors"
*   where correction is for correlation across i.
*   corr defines the correlation over t as in XTGLS.
*   This is imposed in estimation
*     corr(independent)   cor(u_it,u_is) = 0  this is the default
*     corr(ar1)           cor(u_it,u_is) = defined by AR(1) model
*                         with same rho for each i
*     corr(psar1)         cor(u_it,u_is) = defined by AR(1) model
*                         with different rho for each i
*   and then to define the correlation over i as
*     blank               this is the default and has correlation over i
*                         and variance varying with i
*     hetonly             no correlation over i and variance varying with i
*     independent         no correlation over i and same variance over i

* For nonlinear models the data are always independent over i
* i.e. typical micro cross-section set up
*   XTGEE    Does a range of models
*   Other special commands exist for binary, tobit and count models

* XTGEE for nonlinear models
*   Same as PA population averaged
*   Can be applied to
*     Normal    various links or conditional mean functions
*     Binomial  various links including logit, probit, cloglog
*     Count     links or conditional mean functions include poisson and neg binomial
*     Gamma

* For the following special commands
*   - sometimes a fixed effects estimator is available and sometimes not
*   - sometimes the random effects integrate out
*     and sometimes instead gaussian quadrature is used
* Binary Models
*   XTLOGIT    Logit FE, RE and PA
*   XTPROBIT   Probit RE and PA (no FE)
*   XTCLOG     Complementary log-log RE and PA (no FE)
* Tobit Models
*   XTTOBIT    Tobit RE
*   XTINTREG   Interval regression RE
* Count Models
*   XTPOISSON  Poisson FE, RE and PA
*   XTNBREG    Negative binomial FE, RE and PA

********** DATA DESCRIPTION
*
* The original data is from
* Bronwyn Hall, Zvi Griliches, and Jerry Hausman (1986),
* "Patents and R&D: Is There a Lag?",
* International Economic Review, 27, 265-283.
* File patr7079.dat has data on 346 firms
* There are 4 lines per firm, with 25 variables
*   Time-invariant: CUSIP,ARDSSIC,SCISECT,LOGK,SUMPAT,
*   Time-varying X: LOGR70,LOGR71,LOGR72, ....., LOGR77,LOGR78,LOGR79
*   Time-varying Y: PAT70,PAT71,PAT72, ....., PAT77,PAT78,PAT79
* in the format:
*   I7,I3,I2,5F12.6/6F12.6/6F12.6/5F12.6/
* where
*   CUSIP    Compustat's identifying number for the firm (Committee on
*            Uniform Security Identification Procedures number).
*   ARDSSIC  A two-digit code for the applied R&D industrial classification
*            (roughly that in Bound, Cummins, Griliches, Hall, and Jaffe, in
*            the Griliches R&D, Patents, and Productivity volume).
*   SCISECT  Dummy equal to one for firms in the scientific sector.
*   LOGK     The logarithm of the book value of capital in 1972.
*   SUMPAT   The sum of patents applied for between 1972-1979.
*   LOGR70-  The logarithm of R&D spending during the year (in 1972 dollars).
*   LOGR79
*   PAT70-   The number of patents applied for during the year that were
*   PAT79    eventually granted.
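
* Illustration (not part of the original program): a minimal sketch of the
* wide layout described above and of the long layout that reshape produces.
* The data are made up, and the names Y and X are hypothetical stand-ins
* for PAT and LOGR; there are only two firms and two years.
clear
input id Y70 Y71 X70 X71
1 3 5 1.0 1.2
2 0 2 0.4 0.5
end
* Wide form: one observation per firm, one column per variable-year
list
* Long form: one observation per firm-year, with year taken from the suffix
reshape long Y X, i(id) j(year)
list
* Remove the toy data so that infile below starts with an empty data set
clear
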
********** READ DATA
*
* The data are in ascii file patr7079.asc
* There are 346 observations on 25 variables with four lines per obs
* The data are fixed format with
*   line 1 variables 1-8    I7,I3,I2,5F12.6
*   line 2 variables 9-14   6F12.6
*   line 3 variables 15-20  6F12.6
*   line 4 variables 21-25  5F12.6

* Read in using Infile: FREE FORMAT WITHOUT DICTIONARY
* As there is a space between each value the data are also space-delimited
* free format and then there is no need for a dictionary file
* The following command spans more than one line so use /* and */
infile CUSIP ARDSSIC SCISECT LOGK SUMPAT LOGR70 LOGR71 LOGR72 LOGR73 /*
   */ LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79 PAT70 PAT71 PAT72 /*
   */ PAT73 PAT74 PAT75 PAT76 PAT77 PAT78 PAT79 using patr7079.asc

* To drop off extra blanks (if any) at end of file patr7079.asc
* There are 346 firms so keep only the first 346 observations
drop if _n>346

********** DATA TRANSFORMATIONS AND CHECK
* Use observation number as an identifier, not just CUSIP
gen id = _n
label variable id "id"
* The following lists the variables in data set and summarizes data
describe
summarize

******** CHANGE ORGANIZATION OF DATA USING RESHAPE AND MORE TRANSFORMATIONS
*
reshape long PAT LOGR, i(id) j(year)
*
describe
summarize
* Create new variable log(patents) with adjustment for patents = 0
gen NEWPAT = PAT
replace NEWPAT = 0.5 if NEWPAT==0
gen LPAT = ln(NEWPAT)
label variable LOGR "Ln(R&D)"
label variable LPAT "Ln(Patents)"
label variable PAT "Patents"
* Create OLS residuals from regress LPAT on LOGR
regress LPAT LOGR
predict uols, residuals
* Check data and save data as Stata data set
describe
summarize
drop NEWPAT
save patr7079, replace
summarize
xtsum, i(id)

******** LOOK AT DATA AGAIN IN ORIGINAL FORM USING RESHAPE
reshape wide PAT LOGR LPAT uols, i(id) j(year)
summarize
corr LOGR70 LOGR71 LOGR72 LOGR73 LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79
corr LPAT70 LPAT71 LPAT72 LPAT73 LPAT74 LPAT75 LPAT76 LPAT77 LPAT78 LPAT79
corr uols70 uols71 uols72 uols73 uols74 uols75 uols76 uols77 uols78 uols79
corr LOGR70 LOGR71 LOGR72 LOGR73 LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79, cov
corr LPAT70 LPAT71 LPAT72 LPAT73 LPAT74 LPAT75 LPAT76 LPAT77 LPAT78 LPAT79, cov
corr uols70 uols71 uols72 uols73 uols74 uols75 uols76 uols77 uols78 uols79, cov

********* XTDATA: LINEAR PANEL - SPECIFICATION SEARCH
* XTDATA permits plots of between data, within data and overall data
* Useful for looking at the data. See Stata manual under xtdata for example.
* iis is an xt command that defines the variable for the ith individual
* tis is an xt command that defines the variable for the tth year
* Here only individual specific effects are considered. So do not use tis
* For plotting we can use
*   ksm                     kernel smoothing using lowess local regression line
*   graph with c(s) option  median bands using smoothing spline
* The latter is quicker so I use that

* Overall plot of data
use patr7079, clear
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving(stpantot, replace) /*
   */ title("Overall regression: Ln(Patents) on LOG(R&D)")
* OLS regression gives wrong s.e.'s as no attempt to control for clustering
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.)
*   c(s) saving(stpantot, replace)
* gphprint

* Within plot of data
use patr7079, clear
iis id
xtdata, fe
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving(stpanre, replace) /*
   */ title("Within (fixed effects) regression: Ln(Patents) on LOG(R&D)")
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.) c(s) saving(stpanre, replace)
* gphprint

* Between plot of data with lowess local regression line
use patr7079, clear
iis id
xtdata, be
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving(stpanbe, replace) /*
   */ title("Between: Ln(PATENTS) on LOG(R&D)")
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.) c(s) saving(stpanbe, replace)
* gphprint

* plot a graph again
*   graph using myexampl\hrvslnw
* To print the graph:
*   gphdot myexampl\hrvslnw           for default medium resolution
*   gphdot myexampl\hrvslnw /dhpl     for low resolution
*   gphdot myexampl\hrvslnw /dhplphr  for high resolution

********** XTREG: LINEAR PANEL - CLASSIC RANDOM AND FIXED EFFECTS
*
* Note that the first xt command needs to give , i(id)
* to indicate that the ith observation is for the ith id
* Time invariant regressors LOGK SCISECT are not included
use patr7079, clear
*
* Fixed effects
xtreg LPAT LOGR, fe i(id)
predict yfe, xbu
estimates list
gen ufe = LPAT - yfe
* Random effects
xtreg LPAT LOGR, re i(id)
estimates list
* Hausman test of fixed versus random effects
xthaus
* Between
xtreg LPAT LOGR, be i(id)
estimates list
* Random effects MLE will be slightly different from re
xtreg LPAT LOGR, mle i(id)
* Population averaged is similar to re (gives similar to mle version of re)
* Exactly the same as xtgee, i(id)
xtreg LPAT LOGR, pa i(id)
* Check the fixed effects residuals
keep ufe id year
reshape wide ufe, i(id) j(year)
correlate ufe70 ufe71 ufe72 ufe73 ufe74 ufe75 ufe76 ufe77 ufe78 ufe79
corr ufe70 ufe71 ufe72 ufe73 ufe74 ufe75 ufe76 ufe77 ufe78 ufe79, cov

****** XTGEE: LINEAR PANEL - CONSTANT COEFFICIENTS INDEPENDENCE OVER I
use patr7079, clear
* First OLS with no attempt to control for
*   correlation over t for given i
* Same estimates with wrong standard errors; xtgee divides by n not n-k
regress LPAT LOGR
xtgee LPAT LOGR, corr(independent) i(id)
* Second OLS with attempt to control for correlation over t for given i
* Same OLS estimates but standard errors are robust to clustering on id
regress LPAT LOGR, cluster(id)
xtgee LPAT LOGR, corr(independent) i(id) robust
* Third GLS with attempts to control for correlation over t for given i
* These are equicorrelation / exchangeable same as random intercept
xtgee LPAT LOGR, i(id)
xtgee LPAT LOGR, i(id) robust
* Fourth GLS with attempts to control for correlation over t for given i
* These are AR(1) error
xtgee LPAT LOGR, corr(ar 1) i(id) t(year)
xtgee LPAT LOGR, corr(ar 1) i(id) t(year) robust
* Fifth GLS with attempts to control for correlation over t for given i
* These are unstructured correlations similar to MA(T-1)
xtgee LPAT LOGR, corr(unstructured) i(id) t(year)
xtgee LPAT LOGR, corr(unstructured) i(id) t(year) robust

****** XTGLS: LINEAR PANEL - CONSTANT COEFFICIENTS CORRELATION OVER I
* These are not suitable for the data here but do for illustration
* These require MATSIZE of at least 346
* Intercooled Stata default is 40
set matsize 400
* and also need more memory
clear
set memory 40m
use patr7079, clear
* This gives same as OLS and same wrong standard errors as OLS
xtgls LPAT LOGR, i(id) t(year)
* The next two are not suitable for the data here as they assume indep over t
* They still understate true standard errors as they ignore clustering within i
* panels(hetero) does GLS with different variance over i
* panels(correlated) does GLS with different variance over i and corr over i
xtgls LPAT LOGR, i(id) t(year) panels(hetero)
xtgls LPAT LOGR, i(id) t(year) panels(correlated)
* The next two are possible alternatives to random effects / equicorrelation
* They permit AR(1) correlation over t
* corr(ar1) has same rho for different i
* corr(psar1) has different rho for different i
xtgls LPAT LOGR, i(id) t(year) corr(ar1)
xtgls LPAT LOGR, i(id) t(year) corr(psar1)
* To use xtpcse need to use tsset
* These take forever and are commented out
* Here I use independent option so no correlation over i and same variance
* This should be comparable to XTGEE independent and XTGLS
tsset id year, yearly
xtpcse LPAT LOGR, corr(independent) independent
xtpcse LPAT LOGR, corr(ar1) independent
xtpcse LPAT LOGR, corr(psar1) independent

****** XTRCHH: LINEAR PANEL - RANDOM COEFFICIENTS MODEL
* This is the only random effects model here that is more complicated
* than a random intercept
* Stata does not have an equivalent of SAS proc mixed
xtrchh LPAT LOGR, i(id) t(year)

********** NONLINEAR PANEL REGRESSION
* Note that the first xt command needs to give , i(id)
* to indicate that the ith observation is for the ith id
* Time invariant regressors LOGK SCISECT are not included
use patr7079, clear

****** XTPOIS: POISSON RANDOM AND FIXED EFFECTS
*
* Poisson Cross-section with Poisson standard errors
poisson PAT LOGR
* Poisson Cross-section with robust standard errors
poisson PAT LOGR, robust
* Poisson fixed effects
xtpois PAT LOGR, fe i(id)
* Poisson random effects
xtpois PAT LOGR, re i(id)

********** CLOSE OUTPUT
log close