** STATA Sample Program by Colin Cameron
** Program stpanel.do May 2001 (began October 1999)

* To run you need file
*   patr7079.asc    
* in your directory

* This program demonstrates panel data models in Stata 
*   read in data in wide form: one observation has data for all years
*   reshape data to long form: separate observations for each year   
*   linear panel data models
*   nonlinear panel data models


********** STATA SETUP
clear
* clears any data currently in memory
capture log close   
* capture in front means program continues even if no log file open 
log using stpanel.log, replace
* creates output file
* replace here means existing file of same name will be overwritten
* and if this file is already open then give command log close 
di "stpanel.do by Colin Cameron: Stata panel regression example"
set maxvar 100 width 1000
* If more memory is needed then in Stata give command help memory


********** OVERVIEW OF STATA PANEL COMMANDS
*
* Stata has many panel commands.  

* For linear models the main commands are 
*    XTREG  Classic fixed and random effects
*    XTGEE  GLS of constant coeff model for panels independent over i
*           This is typical short panel such as NLSY, PSID, ...
*    XTGLS  GLS of constant coeff model for panels correlated over i
*           This is typical macro such as cross-country and region

* And other linear model commands include
*    XTDATA  exploratory analysis with the XTREG models
*    XTHAUS  test of fixed versus random effects
*    XTREGAR fixed and random effects with AR(1) error
*    XTIVREG IV estimation of fixed and random effects models
*    XTABOND IV estimation of fixed effects model with lagged dep regressors 
*    XTPCSE  variant of XTGLS with simpler model estimated with correct s.e.'s
*    XTRCHH  random coefficients model 

* XTREG 
* for model  y_it = x_it'b + a_i + e_it  where a_i is unobserved indiv effect
* Options
*    fe     Fixed effects or within or LSDV
*    be     Between 
*    re     Random effects with GLS estimates of error component variances
*    mle    Random effects with ML estimates of error component variances
*    pa     Population-averaged - see xtgee

* XTGEE
* Also for some nonlinear models - see below. Default is linear.
* Linear model  y_it = x_it'b + u_it  for u_it independent over i and small T
* Estimation is by WLS or GLS with different possible model for cor(u_it)
* Options
*    robust  gives robust standard errors that permit general cor(u_it,u_is)
*            b = (X'RX)^-1 * X'Ry  where R is working matrix defined by corr( )
*            V[b] = (X'RX)^-1   without the robust option
*            V[b] = (X'RX)^-1 * X'RVR'X * (X'RX)^-1  if the robust option is used
*    corr(independent)  cor(u_it,u_is) = 0   t!=s   i.e. OLS
*    corr(exchangeable) cor(u_it,u_is) = rho t!=s   i.e. equicorrelation = RE
*                       This is the default 
*    corr(ar g)         cor(u_it,u_is) = defined by an AR(g) model
*    corr(stationary g) cor(u_it,u_is) = defined by an MA(g) model
*    corr(unstructured) cor(u_it,u_is) = rho_ts has an unrestricted structure
*    user-specified R is also possible

* XTGLS
* Linear model  y_it = x_it'b + u_it  for u_it correlated over i and small n
* and stacking   y_i = x_i'b + u_i
* If n is large this command will need a lot of memory and take a long time
* Estimation is by GLS and GLS standard errors are given
* Options
* panels defines the correlation over i
*    panels(iid)         cov(u_i,u_j) = 0 and var(u_i) = s^2 * I                  
*                        so u_it uncorrelated over i with same variance over i
*                        This is the default 
*    panels(hetero)      cov(u_i,u_j) = 0 and var(u_i) = (s_i)^2 * I   
*                        so u_it uncorrelated over i with diff variance over i
*    panels(correlated)  cov(u_i,u_j) = s_ij * I   
*                        so u_it correlated over i with different variance 
*                                                  and covariance over i
*                        this is basic seemingly unrelated regressions
* corr defines the correlation over t
*    corr(independent)   cor(u_it,u_is) = 0
*                        so u_it uncorrelated over t
*                        This is the default 
*    corr(ar1)           cor(u_it,u_is) = defined by AR(1) model 
*                                         with same rho for each i
*    corr(psar1)         cor(u_it,u_is) = defined by AR(1) model 
*                                         with different rho for each i

* XTPCSE
* Same model as XTGLS
* Instead of GLS estimation 
* use OLS or AR1 error estimator with no correlation over i but then
* get correct standard errors allowing for possible correlation over i
* Stata calls this "panel-corrected standard errors" 
* where correction is for correlation across i.
* corr defines the correlation over t as in XTGLS. 
* This is imposed in estimation
*    corr(independent)   cor(u_it,u_is) = 0  this is the default
*    corr(ar1)           cor(u_it,u_is) = defined by AR(1) model 
*                                         with same rho for each i
*    corr(psar1)         cor(u_it,u_is) = defined by AR(1) model 
*                                         with different rho for each i
* and then to define the correlation over i as
*    blank               this is the default and has correlation over i
*                                           and variance varying with i
*    hetonly             no correlation over i and variance varying with i
*    independent         no correlation over i and same variance over i

* For nonlinear models the data are always independent over i 
* i.e. typical micro cross-section set up

* XTGEE  Does a range of models 
* Other special commands exist for binary, tobit and count models
   
* XTGEE for nonlinear models
* Same as PA population averaged
* Can be applied to
*  Normal various links or conditional mean functions
*  Binomial various links including logit, probit, cloglog
*  Count links or conditional mean functions include poisson and neg binomial
*  Gamma

* For the following special commands 
*  - sometimes a fixed effects version is available and sometimes not
*  - sometimes the random effects can be integrated out analytically
*    and sometimes Gaussian quadrature is used instead

* Binary Models
*   XTLOGIT   Logit  FE, RE and PA
*   XTPROBIT  Probit RE and PA  (no FE)
*   XTCLOG    Complementary log-log RE and PA  (no FE)

* Tobit Models
*   XTTOBIT   Tobit  RE
*   XTINTREG  Interval regression  RE

* Count Models
*   XTPOISSON Poisson           FE, RE and PA
*   XTNBREG   Negative binomial FE, RE and PA


********** DATA DESCRIPTION
*
*  The original data is from 
*  Bronwyn Hall, Zvi Griliches, and Jerry Hausman (1986), 
* "Patents and R&D: Is There a Lag?", 
*  International Economic Review, 27, 265-283.

* File patr7079.asc has data on 346 firms
* There are 4 lines per firm, with 25 variables
*   Time-invariant:  CUSIP,ARDSSIC,SCISECT,LOGK,SUMPAT
*   Time-varying X:  LOGR70,LOGR71,LOGR72, ....., LOGR77,LOGR78,LOGR79
*   Time-varying Y:  PAT70,PAT71,PAT72, ....., PAT77,PAT78,PAT79 
* in the format:
*   I7,I3,I2,5F12.6/6F12.6/6F12.6/5F12.6/ 
* where
* CUSIP    Compustat's identifying number for the firm (Committee on
*          Uniform Security Identification Procedures number).
* ARDSSIC  A two-digit code for the applied R&D industrial classification
*          (roughly that in Bound, Cummins, Griliches, Hall, and Jaffe, in
*          the Griliches R&D, Patents, and Productivity volume).
* SCISECT  Dummy equal to one for firms in the scientific sector.
* LOGK     The logarithm of the book value of capital in 1972.
* SUMPAT   The sum of patents applied for between 1972-1979.
* LOGR70-  The logarithm of R&D spending during the year (in 1972 dollars).
*  LOGR79
* PAT70-   The number of patents applied for during the year that were
*  PAT79  eventually granted.


********** READ DATA
*
* The data are in ascii file patr7079.asc
* There are 346 observations on 25 variables with four lines per obs
* The data are fixed format with 
*   line 1  variables  1-8   I7,I3,I2,5F12.6 
*   line 2  variables  9-14  6F12.6
*   line 3  variables 15-20  6F12.6
*   line 4  variables 21-25  5F12.6

* Read in using Infile: FREE FORMAT WITHOUT DICTIONARY
* As the values are separated by spaces the data can also be read as
* space-delimited free format, and then there is no need for a dictionary file
* The following command spans more than one line so use /* and */
infile CUSIP ARDSSIC SCISECT LOGK SUMPAT LOGR70 LOGR71 LOGR72 LOGR73   /*
   */  LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79 PAT70 PAT71 PAT72     /*
   */  PAT73 PAT74 PAT75 PAT76 PAT77 PAT78 PAT79 using patr7079.asc
* To drop off extra blanks (if any) at end of file patr7079.asc
drop if _n>346
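
* Added check (a sketch, not part of the original program): the description
* above says there are 346 firms, so display how many observations were read
* and count any missing firm identifiers. Nothing is asserted, so the
* program continues whatever the result.
di "Observations read: " _N
count if CUSIP==.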


********** DATA TRANSFORMATIONS AND CHECK
* Use observation number as an identifier, not just CUSIP
gen id = _n
label variable id "id"
* The following lists the variables in data set and summarizes data
describe
summarize


******** CHANGE ORGANIZATION OF DATA USING RESHAPE AND MORE TRANSFORMATIONS
*
reshape long PAT LOGR, i(id) j(year)
*
describe
summarize
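
* Added check (a sketch, not part of the original program): after reshape
* the j() variable year should take the values 70,...,79, and in a balanced
* panel each value should appear once per firm.
tabulate year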

* Create new variable log(patents) with adjustment for patents = 0
gen NEWPAT = PAT
replace NEWPAT = 0.5 if NEWPAT==0.
gen LPAT = ln(NEWPAT)
label variable LOGR "Ln(R&D)"
label variable LPAT "Ln(Patents)"
label variable PAT "Patents"
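
* Added check (a sketch, not part of the original program): count the
* firm-years with zero patents, i.e. those set to 0.5 before taking logs.
count if PAT==0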

* Create OLS residuals from regress LPAT on LOGR
regress LPAT LOGR
predict uols, residuals

* Check data and Save data as Stata data set
describe
summarize
drop NEWPAT
save patr7079, replace
summarize
xtsum, i(id)


******** LOOK AT DATA AGAIN IN ORIGINAL FORM USING RESHAPE

reshape wide PAT LOGR LPAT uols, i(id) j(year)

summarize
corr LOGR70 LOGR71 LOGR72 LOGR73 LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79
corr LPAT70 LPAT71 LPAT72 LPAT73 LPAT74 LPAT75 LPAT76 LPAT77 LPAT78 LPAT79
corr uols70 uols71 uols72 uols73 uols74 uols75 uols76 uols77 uols78 uols79
corr LOGR70 LOGR71 LOGR72 LOGR73 LOGR74 LOGR75 LOGR76 LOGR77 LOGR78 LOGR79, cov
corr LPAT70 LPAT71 LPAT72 LPAT73 LPAT74 LPAT75 LPAT76 LPAT77 LPAT78 LPAT79, cov
corr uols70 uols71 uols72 uols73 uols74 uols75 uols76 uols77 uols78 uols79, cov


********* XTDATA: LINEAR PANEL - SPECIFICATION SEARCH

* XTDATA permits plots of between data, within data and overall data
* Useful for looking at the data. See Stata manual under xtdata for example.
 
* iis     is an xt command that defines the variable for the ith individual
* tis     is an xt command that defines the variable for the tth year

* Here only individual specific effects are considered. So do not use tis
 
* For plotting we can use 
*   ksm                kernel smoothing using lowess local regression line
*   graph with c(s) option  median bands using smoothing spline
* The latter is quicker so I use that

* Overall plot of data 
use patr7079, clear
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving (stpantot, replace) /*
   */ title("Overall regression: Ln(Patents) on LOG(R&D)")
* OLS regression gives wrong s.e.'s as no attempt to control for clustering
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.) c(s) saving (stpantot, replace)
* gphprint

* Within plot of data
use patr7079, clear
iis id
xtdata, fe
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving (stpanre, replace) /*
   */ title("Within (fixed effects) regression: Ln(Patents) on LOG(R&D)")
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.) c(s) saving (stpanre, replace)
* gphprint

* Between plot of data (lowess version is commented out below)
use patr7079, clear
iis id
xtdata, be
graph LPAT LOGR, xlab ylab s(p) c(s) bands(20) saving (stpanbe, replace) /*
   */ title("Between: Ln(PATENTS) on LOG(R&D)")
regress LPAT LOGR
* ksm LPAT LOGR, lowess xlab ylab s(.) c(s) saving (stpanbe, replace)
* gphprint

* plot a graph again
* graph using myexampl\hrvslnw
* To print the graph:
*   gphdot myexampl\hrvslnw          for default medium resolution
*   gphdot myexampl\hrvslnw /dhpl    for low resolution
*   gphdot myexampl\hrvslnw /dhplphr for high resolution


********** XTREG: LINEAR PANEL - CLASSIC RANDOM AND FIXED EFFECTS
*
* Note that the first xt command needs the option  , i(id)
* to indicate that id identifies the individual (panel) unit
* Time invariant regressors LOGK SCISECT are not included
use patr7079, clear
*
* Fixed effects
xtreg LPAT LOGR, fe i(id)
predict yfe, xbu
estimates list
gen ufe = LPAT - yfe
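
* Added illustration (a sketch, not part of the original program).
* The fe slope can be reproduced by OLS on deviations from individual means,
* assuming no missing values of LPAT and LOGR.
* The names mLPAT, mLOGR, dLPAT and dLOGR are illustrative only.
* The slope should match that from xtreg, fe but the standard error will be
* too small, as regress does not adjust the degrees of freedom for the
* estimated individual means.
egen mLPAT = mean(LPAT), by(id)
egen mLOGR = mean(LOGR), by(id)
gen dLPAT = LPAT - mLPAT
gen dLOGR = LOGR - mLOGR
regress dLPAT dLOGR, noconstant
drop mLPAT mLOGR dLPAT dLOGR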

* Random effects
xtreg LPAT LOGR, re i(id)
estimates list

* Hausman test of fixed versus random effects
xthaus
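
* Added illustration (a sketch, not part of the original program).
* With one regressor the Hausman statistic can be computed by hand as
*    H = (b_fe - b_re)^2 / (V[b_fe] - V[b_re])   chi-squared(1) under H0
* The scalar names are illustrative only. The result should be close to the
* xthaus output, and is only meaningful if V[b_fe] > V[b_re].
quietly xtreg LPAT LOGR, fe i(id)
scalar bfe = _b[LOGR]
scalar vfe = _se[LOGR]^2
quietly xtreg LPAT LOGR, re i(id)
scalar bre = _b[LOGR]
scalar vre = _se[LOGR]^2
scalar hstat = (scalar(bfe)-scalar(bre))^2/(scalar(vfe)-scalar(vre))
di "Hausman statistic = " scalar(hstat) "   p-value = " chi2tail(1,scalar(hstat))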

* Between
xtreg LPAT LOGR, be i(id)
estimates list

* Random effects MLE will be slightly different from re 
xtreg LPAT LOGR, mle i(id)

* Population averaged is similar to re  (gives similar to mle version of re)
* Exactly same as xtgee, i(id)
xtreg LPAT LOGR, pa i(id)

* Check the fixed effects residuals
keep ufe id year
reshape wide ufe, i(id) j(year)
correlate ufe70 ufe71 ufe72 ufe73 ufe74 ufe75 ufe76 ufe77 ufe78 ufe79
corr ufe70 ufe71 ufe72 ufe73 ufe74 ufe75 ufe76 ufe77 ufe78 ufe79, cov


****** XTGEE: LINEAR PANEL - CONSTANT COEFFICIENTS INDEPENDENCE OVER I

use patr7079, clear

* First OLS with no attempt to control for correlation over t for given i
* Both give the same coefficients and the same wrong standard errors,
* except that xtgee divides by n not n-k
regress LPAT LOGR
xtgee LPAT LOGR, corr(independent) i(id)

* Second OLS with attempt to control for correlation over t for given i
* Estimates are still OLS but the robust standard errors permit this correlation
regress LPAT LOGR, cluster(id)
xtgee LPAT LOGR, corr(independent) i(id) robust

* Third GLS with attempts to control for correlation over t for given i
* These are equicorrelation / exchangeable same as random intercept
xtgee LPAT LOGR, i(id)
xtgee LPAT LOGR, i(id) robust

* Fourth GLS with attempts to control for correlation over t for given i
* These are AR(1) error
xtgee LPAT LOGR, corr(ar 1) i(id) t(year) 
xtgee LPAT LOGR, corr(ar 1) i(id) t(year) robust

* Fifth GLS with attempts to control for correlation over t for given i
* These are unstructured correlations similar to MA(T-1)
xtgee LPAT LOGR, corr(unstructured) i(id) t(year)
xtgee LPAT LOGR, corr(unstructured) i(id) t(year) robust


****** XTGLS: LINEAR PANEL - CONSTANT COEFFICIENTS CORRELATION OVER I

* These are not suitable for the data here but are run for illustration
* These require MATSIZE of at least 346
* Intercooled Stata default is 40
set matsize 400
* and also need more memory
clear
set memory 40m
use patr7079, clear

* This gives same as OLS and same wrong standard errors as OLS 
xtgls LPAT LOGR, i(id) t(year)

* The next two are not suitable for the data here as they assume indep over t
* They still understate true standard errors as they ignore clustering over i
* panels(hetero) does GLS with different variance over i
* panels(correlated) does GLS with different variance over i and corr over i
xtgls LPAT LOGR, i(id) t(year) panels(hetero)
xtgls LPAT LOGR, i(id) t(year) panels(correlated)

* The next two are possible alternatives to random effects / equicorrelation
* They permit AR(1) correlation over t
* corr(ar1) has same rho for different i
* corr(psar1) has different rho for different i
xtgls LPAT LOGR, i(id) t(year) corr(ar1)
xtgls LPAT LOGR, i(id) t(year) corr(psar1)


* To use xtpcse need to use tsset
* These can take a very long time to run; comment them out if necessary
* Here I use independent option so no correlation over i and same variance
* This should be comparable to XTGEE independent and XTGLS
tsset id year, yearly
xtpcse LPAT LOGR, corr(independent) independent
xtpcse LPAT LOGR, corr(ar1) independent
xtpcse LPAT LOGR, corr(psar1) independent


****** XTRCHH: LINEAR PANEL - RANDOM COEFFICIENTS MODEL

* This is the only random effects model here that is more general than
* the random intercept model
* Stata does not have an equivalent of SAS proc mixed
xtrchh LPAT LOGR, i(id) t(year)


********** NONLINEAR PANEL REGRESSION

* Note that the first xt command needs the option  , i(id)
* to indicate that id identifies the individual (panel) unit
* Time invariant regressors LOGK SCISECT are not included

use patr7079, clear


****** XTPOIS: POISSON RANDOM AND FIXED EFFECTS
*
* Poisson Cross-section with Poisson standard errors
poisson PAT LOGR

* Poisson Cross-section with robust standard errors
poisson PAT LOGR, robust
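
* Added illustration (a sketch, not part of the original program).
* The robust option above still treats all firm-years as independent.
* As with regress and cluster(id) in the linear case, standard errors that
* permit correlation over t for a given firm can be obtained with cluster.
poisson PAT LOGR, cluster(id)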

* Poisson fixed effects
xtpois PAT LOGR, fe i(id)  

* Poisson random effects
xtpois PAT LOGR, re i(id)  


********** CLOSE OUTPUT
log close