Grid Enabling Empirical Economics: A Microdata Application

July 5, 2017 | Author: Anja Le Blanc | Category: Computational Economics, Econometrics, Ethnic Minorities, United Kingdom, Data Fusion, Survey Data, Economic Modelling, Social Science


Comput Econ (2007) 30:349–370 DOI 10.1007/s10614-007-9093-3

Grid Enabling Empirical Economics: A Microdata Application Simon Peters · Ken Clark · Pascal Ekin · Anja Le Blanc · Stephen Pickles

Received: 23 August 2006 / Accepted: 15 May 2007 / Published online: 7 July 2007 © Springer Science+Business Media B.V. 2007

Abstract This article discusses the use of Grid technology to integrate the data, computation and presentation elements of an empirical economic modelling process. We achieve this by using a form of statistical data fusion developed in the poverty mapping literature to address a substantive issue: determining United Kingdom ethnic minority welfare. Elements of this methodology appear to be well suited to such grid-enablement, and we present and illustrate our implementation using the context of this microdata application.

Keywords e-Social Science · Grid technology · Census data · Survey data · Statistical data fusion · Visualization · Welfare measures

JEL Classification C81 · C88 · I32 · J15

1 Introduction

Empirical economic modelling using secondary information can be considered as consisting of three steps: data handling, econometric computation and results presentation or visualization. One might loosely term these the workflow of an applied economist, though steps one and three are often overlooked as a necessary part of the full process. Consequently, research in computational economics has focused mainly on step

S. Peters (B) · K. Clark School of Social Sciences (Economics), University of Manchester, Manchester, M13 9PL, UK e-mail: [email protected] P. Ekin · A. Le Blanc · S. Pickles Manchester Computing, University of Manchester, Manchester, M13 9PL, UK


two, dealing with topics ranging from software choice (Kendrick and Amann 1999) to newer computing paradigms, such as parallel programming (Creel 2005), although Phillips (2003) does propose a basic web service which would encompass elements of steps two and three.

The purpose of this article is to note that it is now feasible to consider a computational environment that enables the full process by using the technology available from e-Science or e-Research (also known as cyberinfrastructure in the United States). In particular we are using Grid technology, which can be defined briefly for our purposes as the software and systems that deliver a service for sharing computer power, data storage capacity, and other computing resources over the Internet.¹ We develop this by discussing the motivation and experiences of implementing an e-Social Science pilot demonstrator project entitled Grid Enabled Micro-econometric Data Analysis (GEMEDA). The application of Grid technology to the underlying substantive problem allows one to integrate all three elements of the empirical modelling process within a Web browser based interface suitable for use by a casual (i.e. non-expert) user. Ultimately a grid-enabled analysis will allow users to produce their own bespoke output, addressing a problem of interest to themselves.

The perspective taken is from the economics discipline where, echoing a comment of Creel (2005), dedicated e-Science resources such as equipment and specialised staff are not readily available, and progress at the time of writing may require collaboration with computer scientists or service providers. One consequence of this is that early adopters of e-Research have tended to come from outside of traditional Social Science disciplines with research being based in informatics and related areas, resulting in a bias towards a technical rather than a substantive agenda.
Notwithstanding this point, important progress has been made using Grid technology to address a variety of problems with an empirical economic content. Examples of such research are: mixed quantitative/sentiment analysis of financial data (Ahmad et al. 2004), modelling interest rate asymmetry (Russell et al. 2003), and demographic forecasting (Birkin et al. 2005). The GEMEDA demonstrator is, in a sense, building on this earlier body of work, with the starting point being Russell et al.’s (2003) Seamless Access to Multiple Datasets (SAMD) project. This was a collaboration between computer scientists (database, middleware and HPC specialists) and a macroeconomist, that introduced the benefits that could be obtained by having a Grid based system handle the data and computational elements of an empirical analysis. The project demonstrated to researchers in economics, and in other social science disciplines, that Grid-based systems could manage an empirical analysis and perform it much faster than if it was performed locally. As a consequence, local systems would be free for other tasks, and the ongoing requirement to continually upgrade desktop resources would be alleviated. Russell et al.’s (2003) system, however, used bespoke modifications of a web-based data service (Cole and Pickles 2003) and was, therefore, difficult to replicate. The GEMEDA demonstrator, therefore, not only shows empirical economic research can be implemented on the Grid, but that it can be done in a manner that builds on available

1 This is based upon the short definition at: http://gridcafe.web.cern.ch/gridcafe/whatisgrid/whatis.html


infrastructure investments. The infrastructure used here is that of the United Kingdom’s National Grid Service (henceforth NGS), which was used to investigate a policy relevant Social Science issue: the welfare of ethnic minority groups in the United Kingdom. The problem requires an econometric analysis that uses quantitative data from more than one source, and the collaborative effort is spread between an econometrician, a microeconomist, and database, middleware and visualization specialists.

The remainder of this article is organised as follows. Section 2 discusses the details of the underlying micro-economic problem that we wish to address, and notes how using Grid technology may address issues associated with the data handling and econometric computation parts of the analysis process. Section 3 summarises the specific Grid technology adopted and why it was used for each part of the process. It is at this stage that we also introduce elements and examples of step three of our process, namely results visualization. Section 4 illustrates the methodology by applying it to the calculation and presentation of poverty measures for U.K. ethnic minorities by geographic area. It should be noted at this juncture that one is dealing with a demonstrator service rather than a production one, and this informs the discussion and concluding comments of Sect. 5.

2 The Background for the Empirical Application

Investigation of the experiences and prospects of the U.K.’s non-white ethnic groups paints a picture of multiple deprivation and disadvantage in areas such as employment and earnings. One consequence of this is that the welfare of such groups is of major policy concern. However, in order to evaluate minority welfare, a requirement for successful policy intervention, one needs to be able to produce appropriate estimates of welfare using, for example, poverty or inequality measures.
Due to data constraints the full modelling process required for this involves combining data from more than one source.

2.1 Data Handling

The economic welfare of ethnic minority groups in the United Kingdom raises data issues that require a multiple data set approach. The basic problem is that non-whites account for a small proportion of the population and sample surveys typically yield minority samples that are too small for meaningful results to be obtained. To some extent the availability (since 1991) of Census microdata has improved matters. However, while these provide relatively large samples of minority individuals and households, they do not contain any direct measure of income. Other surveys do contain such information but have limited sample sizes when minorities are analysed, with the problem being especially acute when reporting is required for small area geographies.

A major consequence of such data problems is that important questions about the welfare of minority groups have not been answered. For example, small sample sizes preclude useful measures of household welfare such as poverty rates or inequality measures at anything other than high levels of aggregation. Yet research suggests


that disaggregation along two dimensions is crucially important when discussing the welfare of Britain’s ethnic minority groups. First, it is clear that treating non-white, minority groups as a homogeneous entity is not valid. There is considerable diversity between groups such as Caribbeans, Indians, Pakistanis, Bangladeshis and the Chinese (Leslie 1998; Modood et al. 1997). This diversity is often quantitatively greater than the differences between non-whites, taken as a whole, and the majority white community. The second dimension where aggregation is important is geographical. The United Kingdom’s ethnic minorities tend to live in co-ethnic clusters, or enclaves, and this clustering has important consequences for economic activity and unemployment (Clark and Drinkwater 2002).

We use two data sources to address these problems, the British Household Panel Survey (the BHPS²) and the U.K. Census Samples of Anonymised Records (the SARs³). The basic idea is to combine data from the smaller-scale, detailed, BHPS source with the larger sample sizes and geographical coverage of the Census SARs. This provides welfare measures for ethnic minorities which are both broken down by particular ethnic group and geographically disaggregated. The levels of geography that are available are: the U.K., the U.K. regions, and the 1991 SARs areas.⁴

These Social Science micro-data sets are not large from an e-Science perspective, and are also in the main static. Preparation and cleaning of the data prior to analysis is, however, time consuming as the data are rarely delivered in a form which allows the researcher to immediately begin statistical modelling. This problem is compounded for repeated samples (variable definitions change), longitudinal samples (records require information from previous waves), and for the data combination approach considered in this article. Grid technology has the potential to integrate the tasks (e.g.
data extraction, file transfer) associated with processing quantitative data of this type, by hosting the information in an appropriate manner on a data Grid.

The BHPS data source is a longitudinal one, with waves occurring every year since 1991. The SARs are a sample of individuals and households taken from the decennial U.K. Census which are released several years after the Census has been collected. These were a 2% and 1% sample respectively for the 1991 U.K. Census which rose to 3% for 2001. Unfortunately, in the time between the 1991 SARs release and its 2001 equivalent, government statistical bodies have become preoccupied with confidentiality and data disclosure issues, resulting in us only using the 1991 data in our grid-enabled research. Data needs to be readily available to accredited researchers to allow deployment on a data Grid, and it was felt at the time of project conception that, given the confidentiality restrictions envisaged, the Licensed 2001 SARs (the public domain version of the 2001 SARs) would not contain variables suitable for the analysis required. The original release of the Licensed data was not scheduled to contain detailed information on ethnic minorities or on geographical details below regional level. There were also further restrictions (such as the grouping of age) on a variable’s response categories. These restrictions are lifted for the Controlled Access (CAMS) version of the 2001 SARs. However, as the 2001 CAMS are only available within a data enclave, a secure location where accredited researchers can analyse restricted data, they are presently unusable from a data Grid perspective.

2 University of Essex, Institute for Social and Economic Research, British Household Panel Survey, Waves 1-11, 1991-2002 [computer file], Colchester, Essex: U.K. Data Archive [distributor], May 2003, SN: 4561.
3 The 1991 SARs are discussed in Marsh (1993), with more information available at http://www.ccsr.ac.uk/sars/
4 A unitary local authority (or amalgamated adjacent authorities) with a population over 120,000. There are 287 SARs areas.

2.2 Econometric Computation

The methodology used in our research is taken from the poverty mapping literature and follows the approach of Elbers et al. (2002, 2003). This empirical analysis, a microeconometric one that relies upon the combination of two or more data sources, belongs to a broader group of modelling techniques associated with data linkage. Following the terminology of Chesher and Nesheim (2006), who provide a review of this area, as the linking is performed with no or unidentifiable common records between the data sets, it falls into the class known as statistical data fusion. The essence of the approach is to estimate a statistical model on one data set, the donor sample (a relatively small scale but detailed survey), and then apply elements of the fitted model (predicted responses for data imputation, residuals for re-sampling) to another data set, the recipient sample (possibly larger-scale but less detailed), taking due account of the statistical issues surrounding both model assumptions and data matching as required, such as the potentially heterogeneous nature of the survey data. Further, the underlying assumptions associated with standard statistical inference may well be violated in a combined analysis. As a consequence, it may be preferable to calculate poverty measures and their standard errors using simulation methods, rather than relying on expected closed form solutions and the delta method. As noted by Doornik et al.
(2004), simulation-based analyses have a component that allows for embarrassingly parallelisable computations, and as such are well suited to implementation on High Performance Computing (HPC) resources. However, such resources are, at present, not easily accessible to researchers in economics. This difficulty can be overcome by using Grid technology to enable access to the computational Grid.

The remainder of this sub-section presents our version of Elbers et al. (2002), dealing first with the BHPS survey data, and then the SARs Census data. A basic estimation approach has been used at this stage for both the income and idiosyncratic heteroscedasticity equations, namely ordinary least squares (OLS), as this is sufficient for the objectives of the demonstrator. A similar rationale has driven our choice of poverty measures, with simulated head count (SHC) and simulated poverty gap (SPG) measures being our initial choices, although a parametric (expected) head count measure (PHC) is also available.

2.2.1 Calculations on the Donor Sample

The survey data is used to estimate a model of income $y_{ic}$. The index $i$ refers to an individual in a sample cluster, which is indexed by $c$. Each cluster contains $n_c$ observations and there are $N = \sum_{c=1}^{C} n_c$ observations over the $C$ clusters. The income equation is specified as

$$\log(y_{ic}) = \beta' x_{ic} + u_{ic}$$

where $x_{ic}$ is a vector of suitably defined explanatory variables, and the error $u_{ic}$ combines cluster and individual idiosyncratic error terms:

$$u_{ic} = \eta_c + \varepsilon_{ic}$$

where $\eta_c$ and $\varepsilon_{ic}$ are uncorrelated with $x_{ic}$, independent of each other and $IID(0, \sigma_\eta^2)$ and $ID(0, \sigma_{ic}^2)$ respectively.

First step estimation uses OLS to obtain the coefficient estimates $\hat\beta$. The second step uses the fitted residuals, $\hat u_{ic} = \log(y_{ic}) - \hat\beta' x_{ic}$, to estimate a model of the variance components. Set $\hat u_{ic} = \hat u_{.c} + \hat e_{ic}$ where $\hat e_{ic} = \hat u_{ic} - \hat u_{.c}$; here the “.” subscript means that an average has been taken over that index. Proceed to model the idiosyncratic heteroscedastic component $\sigma_{ic}^2$ using a logistic style transformed equation

$$\log\left(\frac{\hat e_{ic}^2}{A - \hat e_{ic}^2}\right) = \alpha' z_{ic} + r_{ic}$$

where $z_{ic}$ is a vector of appropriate explanatory variables, $A$ is set to $1.05 \max(\hat e_{ic}^2)$ and $r_{ic}$ is a suitable error term. Estimation of the $\alpha$ coefficients is again done using OLS. Once $\hat\alpha$ has been obtained the appropriate prediction for $\sigma_{ic}^2$ can be calculated. The remaining variance component, $\sigma_\eta^2$, can be calculated as $\hat\sigma_\eta^2 = \max(\hat V(\eta_c), 0)$ where

$$\hat V(\eta_c) = \frac{1}{C-1} \sum_{c} (\hat u_{.c} - \hat u_{..})^2 - \frac{1}{C} \sum_{c} V(\hat e_{.c}) \quad \text{and} \quad V(\hat e_{.c}) = \frac{1}{(n_c - 1) n_c} \sum_{i=1}^{n_c} \hat e_{ic}^2 .$$
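The donor-sample steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' Fortran 95 code: the function names are hypothetical, $\hat u_{..}$ is taken here to be the mean of the cluster means (an assumption), the transform uses the natural logarithm, and every cluster is assumed to hold at least two observations.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def donor_estimates(log_y, X, Z, cluster):
    """Two-step estimation on the donor (survey) sample.

    log_y   : (N,) log incomes
    X, Z    : (N, p) and (N, q) regressor matrices (constant column included)
    cluster : (N,) integer cluster labels
    Returns (beta_hat, alpha_hat, sigma2_eta_hat, A).
    """
    beta = ols(X, log_y)                       # first step: income equation
    u = log_y - X @ beta                       # fitted residuals u_ic

    labels, idx = np.unique(cluster, return_inverse=True)
    C = labels.size
    n_c = np.bincount(idx)                     # cluster sizes (assumed >= 2)
    u_bar_c = np.bincount(idx, weights=u) / n_c    # cluster means u_.c
    e = u - u_bar_c[idx]                           # e_ic = u_ic - u_.c

    # second step: OLS on the logistic-style transform of squared residuals
    A = 1.05 * np.max(e ** 2)
    alpha = ols(Z, np.log(e ** 2 / (A - e ** 2)))

    # cluster variance component, truncated at zero
    u_bar = u_bar_c.mean()                     # assumption: mean of cluster means
    v_e = np.bincount(idx, weights=e ** 2) / ((n_c - 1) * n_c)
    v_eta = np.sum((u_bar_c - u_bar) ** 2) / (C - 1) - v_e.mean()
    return beta, alpha, max(v_eta, 0.0), A
```

The function stops at the estimates themselves; turning $\hat\alpha$ into a prediction of $\sigma_{ic}^2$ is left out because the text does not spell out that formula.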

2.2.2 Calculations on the Recipient Sample

Once the above estimates have been obtained one can impute or predict an individual’s income for any given set of comparable explanatory variables $x_{kc}$ and $z_{kc}$. The index $k$ indicates an individual in the Census data source. The Census based predictions can then be calculated, under an assumption of Normality, as:

$$\hat y_{kc} = \exp\left(\hat\beta' x_{kc} + \frac{\hat\sigma_\eta^2 + \hat\sigma_{kc}^2}{2}\right).$$

The variance prediction $\hat\sigma_{kc}^2$ is calculated using $\hat\alpha' z_{kc}$.

One can also calculate a wide variety of poverty measures. The PHC measure requires an assumption of Normality and is calculated as

$$PHC(p) = \frac{1}{n} \sum_{k=1}^{n} \Phi\left(\frac{\log(p) - \hat\beta' x_{kc}}{\sqrt{\hat\sigma_\eta^2 + \hat\sigma_{kc}^2}}\right)$$

where $\Phi(.)$ is the standard Normal distribution function and $p$ is a poverty line. Summation is taken over the sub-sample of interest, namely ethnic group within geographic area. If the parametric assumption of Normality was incorrect, this would cause misspecification problems that might affect the estimated measures and their associated standard errors. As noted, simulation can be used to counter this possibility and is used for SHC and SPG. The SHC measure is obtained as the average of $B$ simulated head count measures:

$$SHC(p) = \frac{1}{B} \sum_{b=1}^{B} SHC(p)_b \quad \text{where} \quad SHC(p)_b = \frac{1}{n} \sum_{k=1}^{n} I\left((\hat\beta_b' x_{kc} + \tilde u_{.c,b} + \tilde e_{kc,b}\, \hat\sigma_{kc,b}) < \log(p)\right).$$
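As a hedged illustration of the recipient-sample formulas, the sketch below computes the imputed income, the PHC measure, and a single SHC replicate. The function names are hypothetical, and the inputs (the fitted index $\hat\beta' x_{kc}$, the variance predictions and the resampled error draws) are assumed to have been prepared elsewhere and aligned to Census individuals.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard Normal distribution function, applied elementwise."""
    t = np.asarray(t, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(t / math.sqrt(2.0)))

def imputed_income(xb, sigma2_eta, sigma2_k):
    """Predicted income under Normality: exp(xb + (s2_eta + s2_k) / 2)."""
    return np.exp(xb + 0.5 * (sigma2_eta + sigma2_k))

def phc(p, xb, sigma2_eta, sigma2_k):
    """Parametric head count for poverty line p over one sub-sample."""
    t = (np.log(p) - xb) / np.sqrt(sigma2_eta + sigma2_k)
    return float(np.mean(normal_cdf(t)))

def shc_replicate(p, xb_b, u_draw, e_draw, sigma_k_b):
    """One simulated head count SHC(p)_b.

    xb_b      : index computed with the b-th resample coefficients
    u_draw    : cluster error draws, already mapped to individuals
    e_draw    : standardised idiosyncratic error draws
    sigma_k_b : standard error predictions for replicate b
    """
    log_y = xb_b + u_draw + e_draw * sigma_k_b
    return float(np.mean(log_y < np.log(p)))
```

Averaging `shc_replicate` over the $B$ replicates gives $SHC(p)$, and the spread of the $B$ values supplies its standard error.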

Note that $I(.)$ is an indicator function taking the value of 1 if the condition inside the parentheses is satisfied and zero otherwise. The standard error predictor, $\hat\sigma_{kc,b}$, is calculated using $\hat\alpha_b' z_{kc}$. The coefficients, $\hat\beta_b$ and $\hat\alpha_b$, are obtained from the $b$th casewise resample of the survey data; the error terms, $\tilde u_{.c,b}$ and $\tilde e_{kc,b}$, are drawn with replacement from the requisite error vectors $\tilde u_b$ and $\tilde e_b$, noting that $\tilde e_b$ needs to be appropriately standardised. The SPG measure is obtained in a similar manner:

$$SPG(p) = \frac{1}{B} \sum_{b=1}^{B} SPG(p)_b \quad \text{where} \quad SPG(p)_b = \frac{1}{n} \sum_{k=1}^{n} I(\hat y_{kc,b} < p) \left(1 - \frac{\hat y_{kc,b}}{p}\right).$$

Here $\hat y_{kc,b} = \exp(\hat\beta_b' x_{kc} + \tilde u_{.c,b} + \tilde e_{kc,b}\, \hat\sigma_{kc,b})$. Standard errors are calculated in the usual fashion using the $B$ simulated values of the chosen poverty measure: $SPG(p)_b$ or $SHC(p)_b$. A simulated version of the $PHC(p)$, $PHC(p)_b$, is used to calculate its standard error. This is based upon the $PHC(p)$ equation above, but with the $\hat\sigma_\eta^2$, $\hat\sigma_{kc}^2$, $\hat\beta$ and $\hat\alpha$ replaced by the values obtained from the $b$th simulation. Elbers et al. (2003) suggest $B$ can be set to 300.

3 The Grid Implementation

The present implementation builds on an existing infrastructure known as the NGS, whose core systems are described in the Appendix. This NGS core constitutes a Grid as it crosses administrative domains, namely those of the Universities of Oxford, Manchester and Leeds, and the Rutherford Appleton Laboratory. The decision to use the NGS also impacted upon the possibilities available for parallelization of the analysis code, and for the hosting of the data. These are discussed below.

The schematic presented in Fig. 1, an architecture diagram of the type commonly found in the informatics literature, shows how the GEMEDA service is arranged. Each of the three steps of the analysis process is related to Sects. 3.2, 3.3 and 3.4 below, along with the managing service annotated with Sect. 3.1. The parts of the diagram annotated with MS indicate components that are associated with the middleware and security. This is the Grid and related service software (Globus and others listed in the Appendix) that join all the steps together and communicate with the user. The diagram’s arrows are labelled to indicate how the service is linked together, with J_D and J_A referring to the data handling and econometric analysis parts of a job submitted to the service, and S and V referring to security and viewing. The labels can also be used as a reference to user initiation of the relevant steps as they also feature in Fig. 2 below.
Although further documentation and a demonstration of the service can be found at the links given in section (iii) of the Appendix, it is also useful to consider Fig. 2’s walkthrough of the stages a user and the GEMEDA service might pass through, as an aid to the material presented in Sect. 3.1 – 3.4. The user accesses the

[Figure 1: architecture diagram. Labels recoverable from the diagram: (3.1) GEMEDA, Apache Tomcat, Spring Framework, FTP server (MS); (3.2) NGS Oracle hosted datasets (BHPS, SARs Household, SARs Individual), OGSA-DAI, DAI Service Group Registry (MS), Globus Toolkit Core, Axis, Apache Tomcat; (3.3) NGS HPC computing nodes (Manchester, Oxford, Leeds, Rutherford), GEMEDA parallelized analysis code grid service; NGS MyProxy server (MS), Athens Server (MS), Grid Security Infrastructure; (3.4) JSP Web interface and results visualization applet; arrows labelled J_D, J_A, S and V.]

Fig. 1 The GEMEDA architecture

service via a Web portal, a Web site that offers a range of resources and services. This avoids difficulties in deploying complex middleware stacks on end-users’ computers, a problem indicated by earlier projects such as Russell et al. (2003).

3.1 The Web Service Client

The demonstrator is designed for a researcher who wishes to investigate the welfare of ethnic minority groups in the United Kingdom. Specifically it allows the researcher to choose different ethnic groups, to specify a level of geography, and to pick from a limited set of poverty measures. The aim is to provide an easy-to-use (Web-based) interface which allows the user to make choices about the type of analysis to be performed and which then returns the results of that analysis to the user. The details of the actual analysis and associated data management are invisible to the user.

A user needs login permissions for the GEMEDA service, an e-Science certificate to access the NGS, and an Athens username to allow use of the SARs and BHPS data. The user is asked for his/her pass-phrase which is sent using HTTPS to the service. This automatically initiates the creation of a proxy certificate by calling a designated MyProxy server containing the user’s certificate. Note that the user’s certificate needs

Development and Deployment

Set-up Stage. [Exploit available e-Science infrastructure and investments]. Has the GEMEDA service been set up?
  [Yes] Proceed to Security Stage.
  [No] The service team needs to:
    Populate databases.
    Host databases on the Manchester NGS node.
    Create analysis code.
    Store analysis code executable on the GEMEDA service server.

Routine Usage

Security Stage. [User logs in to the GEMEDA service]. Is full security set?
  [Yes] Proceed to Job Submission Stage.
  [No] The user needs to:
    S.1) send an e-certificate to the NGS MyProxy server.
    S.2) set the Athens credentials.
    S.3) set the MyProxy credentials.

Job Submission Stage. [Creating and submitting a job]. Have you already submitted a job?
  [Yes] Proceed to Job Status Stage.
  [No] The user needs to:
    J_D.1) select the economic variables for the analysis.
    J_D.2) select the geographic areas.
    J_D.3) select the ethnic minority groups.
    J_A.1) select the welfare measure to estimate.
    J_A.2) select the HPC node and the number of processors to use.
    Submit the job for processing through the GEMEDA service.
  GEMEDA Stages. [Data extraction, file creation and transfer, start the HPC job].
    GridFTP executable to the chosen HPC node.
    Query and filter the SARs and BHPS databases using the information obtained at stages J_D.1, J_D.2 and J_D.3.
    Convert the resulting XML datastream to data matrix files using XSLT.
    GridFTP the data matrix files to the chosen HPC node.
    Create a text file of analysis control commands using the information obtained at stages J_D.1, J_D.2, J_D.3 and J_A.1.
    GridFTP the command file to the chosen HPC node.
    Create a job instantiating script using the information from stage J_A.2.
    GridFTP the job script to the chosen HPC node and insert it in the job queue.

Job Status Stage. [Waiting in a pub with a wireless connection]. Has the job completed?
  [Yes] Proceed to Viewing Stage. The GEMEDA service does the following once a job has completed.
    GEMEDA Stages. [Results retrieval, housekeeping].
      GridFTP results back to the GEMEDA server.
      Delete all job related files from the chosen HPC node filestore.
      Convert the results text files to dBase files for the visualisation tool.
  [No] The user can interrogate the status of the GEMEDA service.

Viewing Stage. [Examine the results]. The user can view and work with the results of the analysis.
  V.1) View the results as text files.
  V.2) Launch the visualization tool.
  The results and their job description remain available until deleted by the user.

Fig. 2 The GEMEDA service walkthrough

to have been uploaded to the MyProxy server using a utility such as the Java Web Start application known as the MyProxy Upload Tool (CCRLC, 2007). The proxy credential is stored and used by the GEMEDA service throughout the lifetime of the session. A single sign-on mechanism allows the web service to query the data and to communicate with the HPC by means of the Grid Security Infrastructure (GSI). This provides message level encryption as well as authenticating and authorising the owner of the proxy credentials, as proxy certificates and certificate delegation are an integral part of GSI. The Athens username/password combination is verified by


calling the XML-RPC Athens security interface developed for Cole et al. (2006). Information concerning Athens and proxy credentials is stored on the GEMEDA server and is accessible through an HTTPS connection. Only one set of login details, those of the GEMEDA service, is required once the MyProxy and Athens usernames and passwords have been set. This removes the need to re-enter all security details whenever the service is used.

The service generates separate data queries targeted at the SARs and BHPS data sets as a result of user input, and uploads necessary files to the user’s selected HPC computation node. The service is effectively managing the workflow of the chosen analysis, a user’s submitted job, and, as it progresses through its constituent tasks, event notification is carried by the encapsulating Grid service. This periodically interrogates the status of the Grid service, returning the job status (e.g. extracting data, running, pending, halted, etc.) if required by the user. Once a job is completed, the results are written to file and downloaded through GridFTP by the web service before being processed for viewing as described in Sect. 3.4 below.
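To make the job submission choices concrete, the following sketch models a submitted job and renders a control file from it. Everything here is hypothetical: the class, field names and the `key = value` file layout are illustrative assumptions, not the GEMEDA service's actual formats. Only the split of information follows the walkthrough in Fig. 2, where the control commands draw on stages J_D.1 to J_D.3 and J_A.1, while the node and processor count (J_A.2) feed the separate job script.

```python
from dataclasses import dataclass

# Values named in the article; the sets themselves are illustrative.
NGS_NODES = {"Manchester", "Oxford", "Leeds", "Rutherford"}
MEASURES = {"SHC", "SPG", "PHC"}

@dataclass(frozen=True)
class JobSpec:
    variables: tuple       # J_D.1 economic variables
    areas: tuple           # J_D.2 geographic areas
    ethnic_groups: tuple   # J_D.3 ethnic minority groups
    measure: str           # J_A.1 welfare measure
    hpc_node: str          # J_A.2 HPC node
    processors: int        # J_A.2 number of processors

    def __post_init__(self):
        if self.measure not in MEASURES:
            raise ValueError(f"unknown welfare measure: {self.measure}")
        if self.hpc_node not in NGS_NODES:
            raise ValueError(f"unknown HPC node: {self.hpc_node}")
        if self.processors < 1:
            raise ValueError("at least one processor is required")

def render_control_file(job):
    """Render analysis control commands (hypothetical key = value layout).

    Uses only the J_D.* and J_A.1 choices; the node and processor count
    belong to the job script, which is not modelled here.
    """
    lines = [
        "variables = " + ",".join(job.variables),
        "areas = " + ",".join(job.areas),
        "ethnic_groups = " + ",".join(job.ethnic_groups),
        "measure = " + job.measure,
    ]
    return "\n".join(lines) + "\n"
```

Validating the choices when the job is created means a malformed request fails before any data extraction or file transfer is attempted.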

3.2 Grid-enabled Datasets

A key middleware requirement for our grid-enabled analysis is one that permits access to and extraction of variables from the desired datasets. It is also at this stage that any inconsistencies in the definition of variables between the data sets are resolved. Fortunately, another of the infrastructure investments we could exploit was the advent and use by the NGS of data management middleware known as Open Grid Services Architecture – Data Access and Integration (henceforth OGSA-DAI). This middleware supports access to relational databases and XML repositories. This software would, in principle, allow us access to data hosted in a variety of database formats (e.g. DB2, MySQL, Oracle, SQL Server) though in practice only Oracle is supported on the NGS data nodes.

The SARs and BHPS data sets are, consequently, stored in separate Oracle databases which occupy slightly over 1 GB of storage space, and hosted on a single data node of the NGS. All the available waves of the BHPS were grid-enabled, along with both the individual and household SARs for 1991; however, only the 1991 BHPS wave and individual SARs file are used in the present version of the demonstrator. Python was used to create and populate the Oracle databases from the original data tables (text files). The SARs individual file contains 1,116,181 observations (records) on 67 variables, while the household file contains 541,922 observations on 123 variables. The BHPS data set is longitudinal and contained, when the project started, 134 different files of varying numbers of observations and variables. The main file for the 1991 wave contains 38,177 observations on 1,101 variables.

In this context our “grid-enabling” now allows easier access to these datasets, avoiding the costs associated with maintaining local copies of the data, and automating the required data extraction for an appropriately accredited researcher.
Once such a user has initiated a job the data processing part of the service generates separate OGSA-DAI SQL queries targeted at the SARs and BHPS databases. These return query results as XML asynchronous data streams, a task performed by the driver that OGSA-DAI


uses to connect to the Oracle databases. The data streams are uploaded via FTP before being converted to a data format recognized by the econometric computation code using XML stylesheets (XSLT). The result is two matrix files which contain the coherent and consistent variable sets required for the analysis. The donor and recipient data set files contain $(y \mid X_{BHPS} \mid Z_{BHPS} \mid c_{BHPS})$ and $(X_{SARs} \mid Z_{SARs} \mid c_{SARs})$ respectively. The variables are used in the manner described in Sect. 2.2 above, with $c_{BHPS}$ and $c_{SARs}$ being vectors of cluster indicators for each data set. This part of the service, from job submission by the user to the files’ arrival on the HPC node, takes about 14 minutes in real time. This timing is for a single job that requires the full data set used for the analysis reported in Sect. 4 below.
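The conversion from an XML query result to a numeric matrix file can be illustrated as follows. This is a stand-in written in Python rather than the XSLT stylesheets the service uses, and the XML layout (`rows`, `row` and `col` elements with a `name` attribute) is a hypothetical one chosen for the example; it is not claimed to be the schema of an OGSA-DAI response.

```python
import xml.etree.ElementTree as ET

def rows_to_matrix(xml_text, columns):
    """Parse a (hypothetical) XML rowset into a list of numeric rows.

    columns fixes the column order, e.g. the (y | X | Z | c) ordering
    used for the donor file.
    """
    root = ET.fromstring(xml_text)
    matrix = []
    for row in root.iter("row"):
        values = {col.get("name"): col.text for col in row.iter("col")}
        matrix.append([float(values[name]) for name in columns])
    return matrix

def matrix_to_text(matrix):
    """Write one observation per line, values separated by spaces."""
    return "\n".join(" ".join(repr(v) for v in row) for row in matrix)

sample = """
<rows>
  <row><col name="y">1.5</col><col name="age">40</col><col name="cluster">3</col></row>
  <row><col name="y">2.25</col><col name="age">31</col><col name="cluster">7</col></row>
</rows>
"""
donor = rows_to_matrix(sample, ["y", "age", "cluster"])
```

Fixing the column order in one place keeps the donor and recipient files consistent with what the analysis code expects to read.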

3.3 Running the Analysis Code

This second step of the service, the econometric computation, was performed using an HPC node. The NGS investment in this area meant working with a distributed memory parallel architecture using Message Passing Interface (MPI). MPI allows parallel programs to be run on a wider configuration of computer architectures than the shared memory alternative, and has been used by other researchers such as Creel (2005) and Doornik et al. (2003). Such use of MPI appears to suggest it is presently a preferred paradigm for performing embarrassingly parallel computations in econometrics. All the housekeeping of the parallel operations takes place within the analysis code itself, and leaves options open for extending a problem into a scenario where it becomes less embarrassingly parallel in the sense that more communication may be required between elements of the econometric analysis. For our present application, all that is needed is to add the appropriate MPI code to two places in our Fortran 95 program in order to scatter the data between the processors being used, and then to gather up the results after the processors have completed their analysis.

The econometric analysis computation consists of four Fortran 95 and two MPI parts: command/data input, scatter the data between processors, estimation and prediction, welfare measure calculation, gather the results from the processors, and output of the results. These produce: estimation results for the Survey data (the donor sample), imputed income quantiles, poverty measures and associated standard error estimates for the Census data (the recipient sample). In our application, the timings for the raw processing are reduced from two hours on a single Pentium 4 (3.0 GHz) desktop to under five minutes when using an NGS node with 32 processors to run a full analysis of the type that produces the results presented in the figures below.
However, the choice of node and the number of processors must be made by the user, and can affect the amount of real time taken to run the job. The compute resource brokerage is, therefore, manual, and empirical observation of the real-time turn-round times during development of the GEMEDA system suggests a full analysis can be performed on 8 nodes in just over thirty minutes. The NGS can run MPI jobs on all of its four nodes, and observations during development suggested that the Oxford node appeared to be the most reliable.
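The scatter and gather steps described in this section can be illustrated with a short Python sketch. This is a stand-in for the pattern only: it uses plain Python lists rather than MPI or Fortran 95, the names `scatter`, `gather` and `run` are hypothetical, and the chunks are processed one after another where a real MPI job would give each chunk to its own processor.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def scatter(data: Sequence[T], n_procs: int) -> List[List[T]]:
    """Split data into n_procs contiguous chunks whose sizes differ by at most one."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    base, extra = divmod(len(data), n_procs)
    chunks: List[List[T]] = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional record each.
        size = base + (1 if rank < extra else 0)
        chunks.append(list(data[start:start + size]))
        start += size
    return chunks


def gather(partials: List[List[R]]) -> List[R]:
    """Concatenate the per-processor results in rank order."""
    return [item for part in partials for item in part]


def run(data: Sequence[T], n_procs: int, work: Callable[[T], R]) -> List[R]:
    """Scatter, apply `work` to every record of every chunk, then gather."""
    chunks = scatter(data, n_procs)
    # Sequential here; under MPI each chunk would be handled by its own rank.
    partials = [[work(record) for record in chunk] for chunk in chunks]
    return gather(partials)
```

Because no chunk needs anything from another chunk, the computation is embarrassingly parallel in the sense used above. For scale, the timings quoted in this section (two hours serially against under five minutes on 32 processors) imply a speed-up of more than 24 and hence a parallel efficiency above 0.75.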


3.4 Visualization

A micro-economic empirical analysis of the type dealt with in this article has the potential to generate a large quantity of output that requires presentation and examination in a suitable and accessible form. Standard practice in the economics discipline would be to produce large numbers of tables and plots for perusal and possible presentation. In the context of the present example, this would result in supplying information for 300 geographic areas, each with 10 ethnic minority categories, for all the desired statistics of interest. A more efficient approach would be to allow users access to a visualization tool that permitted preliminary assessment of the results produced. The application itself suggests a solution, namely to exploit the geography involved and use a simplified Geographic Information System (GIS) to filter results before selected subsets are presented in the usual manner. An added bonus is that this form of presentation may also provide a useful summary of the results itself, as shown in Figs. 3, 4 and 5 below.

Once the results files have been returned from the analysis to the GEMEDA service, they may be viewed either in raw form or through the visualization applet. A C utility converts the results text file data into a dBase file format (dbf) appropriate for the applet. This is done for both the regional and SARs area geographies. The applet, which runs under Java 1.4 and above and uses the GeoTools toolkit (Codehaus 2006), needs this information along with special mapping data files (shp files; see footnote 5) to produce choropleth maps at regional and SARs area geographies for the selected ethnic group and gender category. The shp files are combined to produce the map, along with the legend, linked plot, and buttons for gender/ethnic group/geography selection. The data and maps remain on the service’s server, with permission to use the mapping information ensured by the fact that a user has Athens authentication.
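The choropleth colouring can be illustrated by a simple binning rule built from the legend classes listed in the notes to Fig. 4. The Python sketch below is not the applet's Java code. The function name `legend_colour` is hypothetical, and two details are assumptions because the legend does not settle them: values are treated as percentages, and a value on a class boundary is assigned to the lower class.

```python
from bisect import bisect_left

# Upper bounds of the bounded legend classes, taken from the notes to Fig. 4.
UPPER_BOUNDS = [10.0, 20.0, 25.0, 33.0, 50.0, 66.6]
COLOURS = ["Dark Blue", "Light Blue", "Dark Green", "Light Green",
           "Yellow", "Orange", "Red"]


def legend_colour(poverty_pct: float) -> str:
    """Return the legend colour for a poverty measure given in percent."""
    if not 0.0 <= poverty_pct <= 100.0:
        raise ValueError("poverty measure must lie between 0 and 100")
    if poverty_pct == 0.0:
        return "Clear"
    # bisect_left finds the first bound >= value, so a boundary value such
    # as 10 falls in the lower class (assumed, not stated in the legend).
    return COLOURS[bisect_left(UPPER_BOUNDS, poverty_pct)]
```

With this rule a value of 19 maps to Light Blue and a value of 61 maps to Orange.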
The demonstrator service presently produces poverty measures using individual level data. Demonstrator options are the standard headcount measure and the poverty gap measure. These can be calculated for the two possible poverty lines used by U.K. government agencies, either 60% of the U.K. median income or 50% of the U.K. mean income. The poverty measures are then displayed on a GIS style choropleth map display for U.K. regional and SARs area geographies. The display presents the poverty measure for the chosen ethnic minority group. The use of individual income data also allows calculation of poverty measures by gender, and, if this option has been chosen for the analysis, the resulting poverty measures can be displayed by ethnic group within gender. The applet also produces a box-whisker style plot of the predicted income quantiles (minimum, 10% quantile, 25% quantile, median, 75% quantile, 90% quantile, maximum) for all the ethnic groups associated with a region. This is available for the whole of the U.K. and at the level of geography displayed by the map (U.K. region or SARs area). Other information, such as estimation results, is available from the supporting files returned by the demonstrator. It is also possible to request classification based upon the pseudo-geography (a set of area classifications)

5 The shp files are obtained from http://edina.ac.uk/ukborders/


Fig. 3 (a) Screenshots of the visualization applet: regional geography. (b) Screenshots of the visualization applet: SARs area geography


Fig. 4 United Kingdom regional poverty maps. Male ethnic minorities. Notes: Legend as Fig. 3a and 3b, and Clear (0), Dark Blue (0–10), Light Blue (10–20), Dark Green (20–25), Light Green (25–33), Yellow (33–50), Orange (50–66.6), Red (> 66.6).

available for the 1991 SARs, although these results are not accessible via the visualization tool. The poverty measure displayed using the map’s colour scheme is the one chosen by the user at job initiation. The applet allows the user to switch between different linked box-whisker style plots of predicted income quantiles by using the map cursor to point to the geographic area of interest. The map has a zoom facility which is useful when the finer SARs area geographies are displayed. Figure 3a and b present screenshots of the U.K. region and SARs area maps for White Male SHC poverty measures using the half mean income poverty line. The linked plot displays predicted income quantiles for all the groups available from the geography last pointed to, which was the North West in Fig. 3a (see footnote 6) and Manchester in Fig. 3b. The visualization applet does not display information at a geographic level if the sample size for an ethnic minority group is deemed small (20 or less). This is visible in the linked plot in Fig. 3b and in Figs. 4 and 5 below, and has also been implemented

6 This is annotated by an arrow for those unfamiliar with U.K. geography. Manchester is in the middle of this region.


Fig. 5 United Kingdom SARs area poverty maps for two ethnic minorities. Notes: As Fig. 4

for the Tables. The purpose of this is to mimic the type of confidentiality restrictions applied to the equivalent controlled access 2001 data.

4 Results

In this section we provide an overview of some estimates of ethnic minority poverty rates obtained using the demonstrator. While these results are of interest in their own right, in the current paper we present them to illustrate the type of analysis which is possible. Using data from 1991, models of individual income were estimated using the BHPS data separately for males and females. The specification of the regression equation included all of the variables available via the demonstrator: constant, gender, age, age squared, children present, marital status, labour force position, housing tenure, high qualifications, immigrant, and region. The results supported splitting the sample by gender as the signs of the parameters on some of the regressors were different for males and females (e.g. the variables indicating marital status and the presence of children in the household). Full regression results are not presented here; instead we note that the explanatory power of both prediction equations appeared reasonable, 53% and 40% for males and females respectively. Heteroscedasticity tests strongly rejected the null of homoscedasticity. This heterogeneity in the variance was modelled using the methodology of Elbers et al. (2002) described in Sect. 2.2.1 above. Table 1 shows the SHC measures (with standard errors) for all of the U.K. and the U.K. regions for males. Figure 4 presents the region map version of these measures which can be taken in tandem with the White male regional map shown in Fig. 3a. The breakdown of poverty measures by region and ethnic group for males


Table 1 Male SHC poverty measures for U.K. regions. 1991

                        Ethnic Group
Region                  White     Black Caribbean  Black African  Indian    Pakistani  Bangladeshi  Chinese
Male
All U.K.                19 (0.4)  26 (1.0)         37 (1.9)       22 (0.9)  33 (1.2)   33 (1.8)     29 (1.3)
North                   24 (1.3)  .                .              20 (3.4)  42 (4.3)   48 (6.9)     25 (4.8)
Yorkshire & Humberside  20 (0.9)  30 (2.6)         44 (5.6)       28 (1.9)  36 (2.1)   40 (6.0)     36 (3.4)
East Midland            20 (1.0)  27 (2.4)         20 (4.7)       24 (1.7)  46 (3.8)   29 (7.0)     28 (3.4)
East Anglia             19 (1.7)  21 (4.7)         .              19 (5.1)  36 (7.0)   .            17 (4.6)
Inner London            19 (1.4)  27 (1.9)         39 (2.6)       22 (2.3)  34 (3.5)   37 (3.0)     38 (3.1)
Outer London            15 (1.0)  20 (1.4)         32 (2.5)       18 (1.2)  28 (2.3)   28 (3.9)     22 (1.9)
Rest of S.E.            15 (0.6)  20 (1.7)         27 (3.6)       17 (1.2)  24 (1.6)   24 (2.9)     23 (2.2)
South West              20 (1.0)  30 (3.8)         35 (6.1)       23 (3.2)  29 (6.4)   .            19 (3.7)
West Midlands           20 (0.9)  33 (1.7)         44 (5.0)       26 (1.2)  37 (1.8)   38 (3.9)     27 (4.0)
North West              20 (0.8)  31 (3.1)         42 (4.5)       30 (2.0)  33 (2.0)   35 (5.0)     37 (2.7)
Wales                   24 (1.6)  .                61 (8.3)       22 (4.7)  33 (5.3)   17 (6.1)     40 (4.6)
Scotland                21 (1.0)  .                39 (7.3)       28 (3.6)  20 (5.3)   .            34 (4.3)

Notes: Black-other, Other-asian and Other-other omitted. Results report to whole percentage, and 1 decimal place for the standard error. Standard errors are in parentheses. "." cell indicates a SARs sample size of 20 or less. An italic entry indicates a sample size of 21–50

shows considerable diversity across each of these dimensions. In general, non-Whites have higher poverty measures than Whites and this conforms to what we know about the higher unemployment rates and lower earnings of ethnic groups in the U.K. Some groups, particularly Black Africans, Pakistanis and Bangladeshis, do particularly badly while the Indians and, to a lesser extent, the Chinese have poverty rates closer to those of Whites. This broad ranking is similar to that in Berthoud (1998). ‘Southern’ areas of the country generally have lower poverty headcounts than other regions, although it should be noted that we do not correct for regional price differentials here. Some regions have relatively high poverty rates for particular groups, for example Bangladeshis and Pakistanis in the North, Pakistanis in the East Midlands and Black Africans in Yorkshire and Humberside, the West Midlands, the North West and especially Wales. Table 2 presents the regional poverty measures and standard errors for females. The broad ranking of the groups and regions is similar for this gender but poverty rates are much higher. This is because these measures are based on individual income and females have lower participation rates in the labour market. Clearly this measure does not take account of intra-household income transfers. Pakistani and Bangladeshi females stand out as having extremely high poverty rates while, again, Indian and Chinese women are more comparable with their White counterparts. An advantage of using SARs data is the ability to examine sub-regional geography. However, at this level, small samples become a problem. While poverty rates can


Table 2 Female SHC poverty measures for U.K. regions. 1991

                        Ethnic Group
Region                  White     Black Caribbean  Black African  Indian    Pakistani  Bangladeshi  Chinese
Female
All U.K.                50 (0.7)  40 (1.4)         48 (2.1)       53 (1.5)  71 (1.7)   71 (2.5)     49 (1.8)
North                   52 (1.6)  .                .              56 (6.3)  73 (5.0)   .            65 (7.1)
Yorkshire & Humberside  55 (1.7)  52 (3.5)         56 (7.5)       64 (3.0)  77 (2.4)   78 (5.0)     57 (5.6)
East Midland            52 (1.8)  45 (3.6)         47 (7.9)       58 (2.3)  71 (4.4)   .            50 (6.4)
East Anglia             58 (2.4)  43 (7.7)         .              68 (7.6)  80 (7.0)   .            60 (7.1)
Inner London            34 (2.0)  36 (2.3)         45 (2.9)       39 (2.7)  59 (4.2)   67 (3.8)     44 (3.3)
Outer London            43 (2.2)  37 (2.5)         50 (2.8)       48 (2.3)  62 (3.3)   64 (4.9)     45 (3.1)
Rest of S.E.            48 (1.3)  39 (2.5)         45 (4.5)       47 (2.5)  68 (3.0)   70 (4.6)     46 (3.3)
South West              55 (1.9)  45 (5.4)         .              53 (4.8)  72 (10.3)  .            54 (6.0)
West Midlands           53 (1.7)  45 (2.5)         41 (7.8)       60 (2.1)  76 (2.6)   77 (4.5)     53 (5.2)
North West              50 (1.3)  46 (3.1)         56 (6.1)       59 (2.5)  69 (2.5)   77 (5.4)     54 (3.7)
Wales                   50 (2.1)  58 (9.1)         .              59 (6.1)  67 (6.3)   83 (8.2)     46 (7.0)
Scotland                48 (1.6)  .                .              58 (4.8)  70 (4.7)   .            54 (5.3)

Notes: See Table 1

be estimated for Whites in all the areas (Fig. 3b), meaningful comparisons between different ethnic groups are only possible for urban areas in which ethnic minorities tend to cluster. This is quite noticeable in Fig. 5, where the display has been moved down to the SARs area level for the Black Caribbean and Pakistani ethnic minority groups.
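To make the quantities behind Tables 1 and 2 concrete, the following Python sketch computes a headcount and a poverty gap measure against the two poverty lines named in Sect. 3.4, and suppresses results for groups of 20 or fewer, as in the "." cells. It is a simplified illustration and not the GEMEDA Fortran 95 code: it works on a plain list of incomes rather than the simulated imputed incomes the service uses, all function names are hypothetical, and treating an income exactly on the line as non-poor is an assumption.

```python
from statistics import mean, median
from typing import Optional, Sequence


def poverty_line(uk_incomes: Sequence[float], rule: str = "half_mean") -> float:
    """Poverty line from all U.K. incomes: 50% of the mean or 60% of the median."""
    if rule == "half_mean":
        return 0.5 * mean(uk_incomes)
    if rule == "60pct_median":
        return 0.6 * median(uk_incomes)
    raise ValueError(f"unknown rule: {rule}")


def headcount(incomes: Sequence[float], line: float) -> float:
    """Share of individuals whose income is strictly below the line."""
    return sum(1 for y in incomes if y < line) / len(incomes)


def poverty_gap(incomes: Sequence[float], line: float) -> float:
    """Mean proportional shortfall below the line; the non-poor contribute zero."""
    return sum((line - y) / line for y in incomes if y < line) / len(incomes)


def reported_headcount(incomes: Sequence[float], line: float,
                       min_n: int = 21) -> Optional[float]:
    """Headcount, or None when the group has 20 or fewer members."""
    if len(incomes) < min_n:
        return None
    return headcount(incomes, line)
```

For the incomes 50, 100, 150, 200 and 500 the half-mean line is 100, so one person in five is poor (headcount 0.2) and the poverty gap is 0.1; `reported_headcount` returns `None` for this group because it has only five members.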

Table 3 1991 SHC poverty measures for profiled areas

                        Ethnic Group
Profile                 White     Black Caribbean  Black African  Indian    Pakistani  Bangladeshi  Chinese
U.K. Male
Enclave                 23 (1.0)  27 (1.5)         39 (2.2)       24 (1.3)  36 (1.6)   37 (2.9)     39 (2.7)
Poor                    25 (0.6)  33 (1.6)         39 (2.5)       25 (1.4)  31 (1.7)   38 (3.7)     30 (2.2)
The Rest                17 (0.4)  22 (1.1)         31 (2.0)       20 (0.9)  29 (1.4)   25 (2.1)     25 (1.3)
U.K. Female
Enclave                 42 (1.4)  39 (1.8)         47 (2.3)       57 (1.6)  73 (1.8)   71 (2.7)     48 (2.7)
Poor                    54 (0.7)  46 (2.0)         51 (3.4)       56 (2.3)  72 (2.3)   72 (3.9)     56 (3.3)
The Rest                49 (0.6)  39 (1.7)         48 (2.5)       51 (1.4)  68 (1.9)   73 (3.1)     48 (2.0)

Notes: See Table 1. Enclave refers to GBprofile codes 5, 13, 18, 22, 29 and 33. Poor refers to GBprofile codes 1, 7, 17, 21, 23, 35, 36, 27, 43, 44, 45, 49. The Rest refers to the remaining GBprofile codes
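The grouping rule in the notes to Table 3 amounts to a lookup on the GBprofile code. A minimal Python sketch follows; the code lists are copied from those notes, while the names `profile_group` and `group_counts` are hypothetical and no check is made that a code is a valid GBprofile value, since the full range of codes is not given here.

```python
from collections import Counter
from typing import Dict, Iterable

# Code lists as given in the notes to Table 3.
ENCLAVE_CODES = frozenset({5, 13, 18, 22, 29, 33})
POOR_CODES = frozenset({1, 7, 17, 21, 23, 35, 36, 27, 43, 44, 45, 49})


def profile_group(code: int) -> str:
    """Map a GBprofile code to Enclave, Poor or The Rest."""
    if code in ENCLAVE_CODES:
        return "Enclave"
    if code in POOR_CODES:
        return "Poor"
    return "The Rest"


def group_counts(codes: Iterable[int]) -> Dict[str, int]:
    """Count records per profile group, e.g. to check sample sizes."""
    return dict(Counter(profile_group(c) for c in codes))
```

Counting records per group in this way gives the sample sizes to which the small-sample rule of Table 1 would then be applied.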


There are a number of solutions to the problem of small samples. First, more detail is available using the 2001 Individual SARs which feature considerably more ethnic minority respondents than the 1991 data set. Alternatively, it is possible to capture something of the results for different ‘types’ of area. Table 3 does this using the ‘GB Profiles’ area classifications of Dale and Oppenshaw (1996), which are attached to the 1991 SARs. All GB profiles that indicate categorization based on one or more ethnic minorities are grouped together as Enclaves. Clark and Drinkwater (2002) suggest that enclaves are associated with worse outcomes for ethnic minorities. The remainder are split into Poor (based on housing tenure categorization), and The Rest. The results do not indicate that there is a strong difference in ethnic minority wealth when comparing the SHCs in the Enclave profile grouping with the Poor profile grouping; however, both do worse than The Rest.

5 Concluding Comments

In this paper we have described how the GEMEDA service uses the NGS to aid the investigation of the welfare of ethnic minorities in the U.K. by grid-enabling the required empirical analysis. The user initiates the analysis (job) and views the substantive results through a standard web browser. The GEMEDA service, as well as managing the various data security and access protocols, arranges for the appropriate data sub-sets to be extracted by running queries against OGSA-DAI enabled Oracle databases hosted on the NGS. It then transfers these sub-sets, along with the MPI parallelized code for the econometric analysis, and its associated command file, to a compute node on the NGS. The results of the analysis are then returned to the service, and processed for presentation to the user via a GIS style visualization tool.
To the best of our knowledge, this is the first e-Social Science problem implementation on a true Grid, as previous substantive projects and research have only emulated Grid technology on a local network. It should be noted that the methodology is general and its future development offers opportunities for economics researchers to address a wide variety of questions using a number of different, complementary data sets. Indeed, the use of OGSA-DAI offers the potential to include data sets that are hosted in other database formats such as Microsoft SQL Server. In the context of the present analysis, this could be further concurrent data sets, or later versions (2001 for instance) of the BHPS and Census SARs. Extensibility is not restricted to classical quantitative applications, however, and there exists the possibility of integrating appropriate qualitative information using the techniques of Ahmad et al. (2005), although this is outside the domain of the authors. The applied work itself will not be discussed in detail here. Issues such as improving and evaluating the economic specification and econometrics, using more datasets, or extending the range of welfare measures and visualization options are work in progress. Some of these will fit neatly into our present framework. For example, inequality measures such as the Gini coefficient do not have exact closed form equivalents for our problem, and, although approximations may be available (Elbers et al. 2000), calculation by simulation is an easier and more flexible option. Others will not fit so easily into our approach, the prime example being the use of anonymised records within local


authority areas from the 2001 U.K. Census. As mentioned above, this data is only available within a secure physical data enclave and as such cannot be grid-enabled. However, this latter issue will be resolved in the near future as it is part of a continuing body of research that relates directly to the usability and benefits of systems such as GEMEDA. Work on issues related to secure access to confidential data is ongoing, as evidenced by Elliot et al. (2006) and Bradburn et al. (2006), and will lead to the creation of virtual data enclaves. The move from physical to virtual secure data hosting will increase the amount of quantitative information available to a researcher immensely, as disclosure and proprietary issues will be resolved for governmental and commercial data owners respectively. Access to all these new sources may only be feasible via Grid technology as this offers secure access to remotely hosted information. This leads to a general point that needs emphasising whenever the idea of “grid-enabling one’s data” is presented to an applied worker. Messy data remains messy wherever it is hosted, and whether it is public domain or on a virtual data enclave. What is actually required from an applied worker’s perspective is grid-enabled data cleaning and grid-enabled data manipulation. In the context of the present problem, adding more variables, or another data set, to the analysis would also require expert intervention to link up the OGSA-DAI enabled data sources with the calling service in order to obtain usable working data files. This is a topic that is also in the process of being addressed, through projects such as GEODE (Lambert et al. 2006), though it may well need the application of smart meta-data and associated semantics. Grid technology has allowed us to speed up the computations associated with the statistical analysis required for the problem; however, the choice of HPC node and the number of processors is a manual one for the GEMEDA service.
What is required is a compute resource broker, a system which will find optimal available capacity on the Grid independently of user intervention. The system would then deploy the software and associated files to the discovered computational resource. This type of approach also raises the problem of how to deploy software on any unknown HPC on the Grid, though fortunately this does not impact on our application as the core NGS nodes have the same computing architecture and software base. According to Richards and Pickles (2007), a “limited but significant” compute resource broker is scheduled to come into service on the NGS by the middle of 2007. Our speeding up of the computations involved bespoke programming in MPI and Fortran 95, and while this expertise is more readily available in the economics discipline, approaches which make the parallelisation transparent for standard problems may be more accessible, especially if they are implemented in an HLMP language (e.g. Creel 2005). Work on this from a Grid perspective has involved the R open source statistical language, notably Grose et al. (2006) and Hayes et al. (2005). The ability to adopt grid technology to address the full empirical modelling process requires a wide range of expertise in order to address all three of its constituent steps: data handling, econometric computation and visualization, as well as the middleware needed to link everything together. From the software choice perspective this extends the picture provided by Kendrick and Amann (1999) considerably. For our application, a sole researcher would need database languages (Oracle SQL), computation languages (Fortran 95), interface languages (Java), and glue languages (Python) as


well as middleware and mapping software experience. Tools will be, and have been, developed to address parts of the modelling process (e.g. Hayes et al. 2005); however, the watchword for full process grid enablement is collaboration. From a wider perspective this may well extend beyond collaboration between economists and computer scientists to collaboration between economists and other scientists, as some of the articles above indicate that work relevant to economic problems is being done not only in other areas of social science (for example Lambert et al. 2006), but in other sciences as well (the disclosure problem in Elliot et al. (2006) comes from biomedicine). This leads to a final comment: a call for engagement, at any level from awareness to adoption, by empirical and computational economists. The development and related infrastructure of many aspects of the Grid are proceeding apace, and the lack of involvement from the economics discipline with the e-Research agenda will see tools and services developed from a physical science perspective, or development driven by technical, rather than substantive, problems.

Acknowledgements

Research supported by ESRC grant number RES-149-25-0009, “Grid Enabled Microeconometric Data Analysis”. Details of the ESRC’s pilot demonstrator projects and present ongoing research can be found at the National Centre for e-Social Science (NCeSS) hub: http://www.ncess.ac.uk/. We benefited from discussion with members of the following projects: SAMD (Celia Russell, Mike Jones), ConvertGrid (Keith Cole), Hydra I Grid (Mark Birkin, Andrew Turner), NCeSS (Alex Voss) and with locally based NGS staff (Matt Ford). Additional support was provided in the form of an allocation of resources on the NGS itself. We would also like to thank an anonymous referee whose comments were instrumental in improving an earlier draft.
BHPS data were collected by the Institute of Social and Economic Research at the University of Essex and are made available through the U.K. Data Archive. The SARs are made available through the Centre for Census and Survey Research at the University of Manchester and are Crown Copyright. Mapping data was provided through EDINA UKBORDERS with the support of the ESRC and JISC and uses boundary material which is copyright of the Crown and the ED-LINE Consortium. The original depositors bear no responsibility for the use or interpretation of the data made here.

Appendix: Software and Systems Used

The United Kingdom’s National Grid Service

The core service has four nodes, two configured for compute intensive operations and two for data intensive operations. These are ClusterVision Beowulf type clusters with Myrinet high speed message passing interconnects, each running the RedHat ES 3.0 Linux operating system. The compute nodes each comprise 64 dual Intel Xeon 3.06 GHz CPUs with 2 GB of memory. The data nodes only have 20 dual CPUs, though with 4 GB of memory. Further details can be found at: http://www.ngs.ac.uk/.

GEMEDA

(i) Middleware and Services

Linux Mandrake 10.1, Java 1.4.2_06, Tomcat 5.0.28, Globus Toolkit 3.2, OGSA-DAI 4, Java CoG Kit 4, Spring J2EE 1.2.3, DWR Ajax library, ProFTP, Oracle 9i RAC edition, Python 2.3 and Eclipse IDE.


(ii) Other Software

Fortran 95 and MPI (analysis code). The NGS nodes use Intel compilers and mpich 1.2.5. C and GeoTools 2.0 Java mapping library (visualization applet).

(iii) Miscellaneous

The funding for the GEMEDA project has now ceased, and the service is being continued on an unofficial basis while resources are available. The service web portal is based at: http://pascal.mvc.mcc.ac.uk:9080/gemeda and contains links to the service itself and documentation. Access requires U.K. Athens and e-certificate security, as well as a username/pass-phrase for the service itself. There is also a video with linked slides available from the ReDReSS presentations at: http://redress.lancs.ac.uk/resources/imp_template.php?creator=Peters%20Simon&title=Gemeda which discusses some of the issues associated with its implementation on the NGS. ReDReSS, http://redress.lancs.ac.uk/, is one of the U.K.’s e-Social Science resource portals.

References

Ahmad, K., Gillam, L., & Cheng, D. (2005). Society grids, Proceedings of the UK e-Science all hands meeting 2005, EPSRC Sept. 2005.
Ahmad, K., Taskaya-Temizel, T., Cheng, D., Gillam, L., Ahmad, S., Traboulsi, H., & Nankervis, J. (2004). Financial information grid – An ESRC e-Social Science pilot, Proceedings of the UK e-Science all hands meeting 2004, EPSRC Sept. 2004.
Berthoud, R. (1998). The incomes of ethnic minorities, ISER Report 98-1, Colchester: University of Essex, Institute for Social and Economic Research.
Birkin, M., Dew, P., McFarland, O., & Hodrien, J. (2005). HYDRA: A prototype grid-enabled decision support system, Proceedings of the first international conference on e-Social Science, Mimeo.
Bradburn, N., Horton, R., Lane, J., & Tilkin, M. (2006). Developing a data enclave for sensitive microdata, Proceedings of the second international conference on e-Social Science, Mimeo.
CCLRC (2007). MyProxy Upload Tool, http://tiber.dl.ac.uk:8080/myproxy/
Chesher, A., & Nesheim, L. (2006). Review of the literature on the statistical properties of linked datasets, DTI Economics Papers, Occasional Paper No. 3.
Clark, K., & Drinkwater, S. (2002). Enclaves, neighbourhood effects and economic outcomes: Ethnic minorities in England and Wales, Journal of Population Economics, 15, 5–29.
Codehaus (2006). Geotools the open source Java GIS toolkit, http://geotools.codehaus.org/, Codehaus Foundation.
Cole, K. T. H., & Pickles, S. (2003). Seamless Access to Multiple Datasets (SAMD) – An ESRC e-Science demonstrator, Final Report, Mimeo.
Cole, K. L. M., Ekin, P., & Maclaren, J. (2006). Convertgrid, Proceedings of the second international conference on e-Social Science, Mimeo.
Creel, M. (2005). User-friendly parallel computations with econometric examples, Computational Economics, 26, 107–128.
Dale, A., & Oppenshaw, S. (1996). Adding area-based classification to the samples of anonymised records (SAR) of the 1991 Census, Mimeo. http://www.ccsr.ac.uk/sars/publications/Areaclassifpap.pdf
Doornik, J., Shepherd, N., & Hendry, D. F. (2004). Parallel computation in econometrics: A simplified approach, Nuffield Economics Working Paper.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2000). Welfare in villages and towns, Tinbergen Institute Discussion Paper TI 2000-029/2.


Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2002). Micro-level estimation of welfare, Policy Research Department Working Paper No. WPS2911, The World Bank.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003). Micro-level estimation of poverty and inequality, Econometrica, 71, 355–364.
Elliot, M., Purdam, K., & Smith, D. (2006). Patient record data: Statistical disclosure control for grid based data access, Proceedings of the second international conference on e-Social Science, Mimeo.
Grose, D., Crouchley, R., Van Ark, T., Allan, R., Kewley, J., Braimah, A., & Hayes, M. (2006). sabreR: Grid-enabling the analysis of multi-process random effect response data in R, Proceedings of the second international conference on e-Social Science, Mimeo.
Hayes, M., Morris, L., Crouchley, R., Grose, D., Van Ark, T., Allan, R., & Kewley, J. (2005). GROWL: A lightweight grid services toolkit and applications, Proceedings of the UK e-Science all hands meeting 2005, EPSRC Sept. 2005.
Kendrick, D., & Amann, H. (1999). Programming languages in economics, Computational Economics, 14, 151–181.
Lambert, P., Tan, K., Turner, K., Gayle, V., Prandy, K., & Sinnott, R. (2006). Development of a grid enabled occupational data environment, Proceedings of the second international conference on e-Social Science, Mimeo.
Leslie, D. (1998). An investigation of racial disadvantage, Manchester: Manchester University Press.
Marsh, C. (1993). The sample of anonymised records. In A. Dale & C. Marsh (Eds.), The 1991 Census User’s Guide, London: HMSO.
Modood, T., Berthoud, R., Lakey, J., Nazroo, J., Smith, P., Virdee, S., & Beishon, S. (1997). Ethnic Minorities in Britain: Diversity and Disadvantage, London: Policy Studies Institute.
Phillips (2003). Laws & limits of econometrics, Economic Journal, 113, C26–C52.
Richards, A., & Pickles, S. (2007). National grid service roadmap, Issue 1, Mimeo.
Russell, C., Cole, K., Jones, M. A. S., Pickles, S. M., Riding, M., Roy, K., & Sensier, M. (2003). Grid technology for social science: The SAMD Project, IASSIST Quarterly, 27 # 4.
