Billion Prices Project - India

Mandeepak Singh, Nitin Kohli, Satish Terala

December 19, 2014

Contents

1 Introduction
2 Architecture and Design
3 Implementation Tools
4 Implementation hurdles
5 Deployment
6 Results
  6.1 Analysis using EMR
  6.2 Statistical Analysis using R
7 Lessons Learnt and Post Implementation Notes
8 Limitations of our Approach
9 Appendix
  9.1 Appendix A
  9.2 Appendix B
  9.3 Appendix C

1 Introduction

Over the last few years, retail in India has made a huge leap from being a primarily brick-and-mortar business to an online one. With the entry of international players like eBay and Amazon and the emergence of formidable Indian ones like Flipkart, Pepperfry, Snapdeal and others, the market more than doubled from $6.3 billion in 2011 to nearly $14 billion in 2012. As online retailing becomes a significant part of India's retail space, it makes sense to use it to compute statistics based on retail prices, such as consumer price inflation. The consumer price index (CPI) is a measure that examines the weighted average of prices of a basket of consumer goods and services such as transportation, food, electronics and white goods. The CPI is calculated by taking the price changes for each item in a predetermined basket of goods and averaging them, with the goods weighted according to their importance. In this project we demonstrate that it is possible to calculate a consumer price index by scraping online retailer prices rather than by taking a survey approach. The theoretical underpinnings of this exercise are based on a similar project at MIT, the Billion Prices Project, developed by Profs. Alberto Cavallo and Roberto Rigobon [1]. They also show that online prices are valid substitutes for calculating inflation indexes.

2 Architecture and Design

The following model shows the architecture we chose to build this application. One of our major challenges in this exercise was data collection: we had to collect data daily, on a scheduled basis, from a diverse set of sources, and because this collection was a lengthy and bandwidth-intensive task, a purely cloud-based implementation would have been a highly expensive proposition. Instead we chose to run the spiders outside the cloud and host only the database on an EC2 instance in the cloud. This gave us the flexibility to run spiders from multiple desktops in different geographies and at different times, while MongoDB on EC2 served as a single repository of data.

[1] Cavallo, Alberto. "Scraped data and sticky prices: frequency, hazards." 2009.


Scrapy, the Python scraping framework, comes with a JSON pipeline that allowed us to export the scraped content directly as JSON objects into Mongo. A Mongo connector for the framework abstracted the actual persistence details from the implementation.

Why Mongo and JSON

At the outset of the project we were not sure what data structure to use for individual item-level details, since every item tends to be different and can have different characteristics. So instead of spending time designing an optimal data model, we chose to use JSON as a holder of content so that we could later use MapReduce functions to create data outputs better suited to various purposes. MRJob on Amazon EMR gave us Hadoop to run MapReduce jobs for the multiple analysis datasets we had to generate from the JSON dumps taken from the Mongo database. Jobs to aggregate by category, product name, vendor, etc. were written in Python using MRJob as the underlying framework and deployed on EMR. Since most of our statistical analysis was to be done in R, we wrote MapReduce tasks to produce the aggregated product data as CSV, since that is an easily digestible format for R. One of us did spend time investigating integrating MongoDB with R but was not successful, at which point we chose to go with the CSV-based approach. A similar approach was used to generate visualizations in Tableau: aggregated data as JSON was either generated from MRJob or exported from MongoDB into Tableau.
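For illustration, a minimal sketch of an item pipeline that persists scraped items to MongoDB via pymongo. This mirrors the standard pymongo-based pipeline pattern; the setting names (MONGO_URI, MONGO_DATABASE) and the collection name are illustrative rather than the exact code we deployed.

    # pipelines.py -- sketch of a Scrapy item pipeline persisting to MongoDB.
    # MONGO_URI / MONGO_DATABASE are hypothetical setting names.
    import pymongo

    class MongoPipeline(object):
        collection_name = "daily_prices"  # illustrative collection name

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read connection details from the Scrapy settings.
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "bpp"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Items are stored as-is, i.e. as JSON-like documents.
            self.db[self.collection_name].insert_one(dict(item))
            return item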

3 Implementation Tools

• Scrapy [2] - We used Scrapy for scraping multiple sites. Scrapy provides an extensible framework in which one can build a custom pipeline for publishing the data extracted while scraping to a storage backend of choice. Besides this, Scrapy provides a middleware mechanism that can be used to route requests through proxy servers. The item model can also be specified directly within Scrapy.

[2] http://scrapy.org/


• Scrapyd [3] - For scheduling Scrapy jobs we used Scrapyd. Scrapyd provides a web UI to control the execution of any job. We also wrote a few cron jobs to schedule the crawlers.

• Two personal laptops (in Toronto and Hyderabad) ran these spiders on a scheduled basis, typically late at night (around 3 am) or in the early hours, India time. This ensured that we were not scraping during the retailers' busiest hours, reducing our chances of getting blocked by them. All in all we scraped data for about 10,000 products on a daily basis from about 5 sites. While some of them were retailers themselves, we also used price aggregators, since they provided a one-stop shop for the prices of a given product across many retailers.

• MongoDB - All the scraped data was submitted to MongoDB through a MongoDB pipeline within Scrapy. We had a couple of shell scripts to export data from MongoDB and upload it to S3 for analysis with a MapReduce job.

• MrJob - MrJob is a Python framework for MapReduce. We used the MrJob configuration file to run MapReduce jobs on Amazon EMR (a sketch of such a job appears at the end of this section). The configuration we used was
  – ec2_instance_type = c1.medium
  – num_ec2_instances = 5

• Tableau - Tableau was used for a few visualizations, and the Simba driver was used to fetch data from MongoDB into Tableau.
  – MongoDB connector for Tableau [4]
  – MrJob

• R - R was used for computing the daily CPI and inflation values, for testing the unimodality of price differences to establish whether prices were sticky or not, and for constructing ARIMA(p,d,q) models.

Sites Crawled

• BigBasket - http://bigbasket.com/
• PriceDekho - http://www.pricedekho.com/
• Pepperfry - http://www.pepperfry.com/
• LocalBanya - http://www.localbanya.com/

[3] https://scrapyd.readthedocs.org/en/latest/
[4] http://www.simba.com/resources/webinars/connect-tableau-big-data-source


• APIs used - http://www.pricecheckindia.com/api
• Retail Price for Essential Commodity - http://dataweave.in/apis
• India Weather Data - http://dataweave.in/apis

Frequency of crawl

• Daily for all the sites above, except the Retail Price for Essential Commodity and India Weather Data feeds, which were crawled once for a given date range.
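As referenced in the MrJob bullet above, here is a minimal sketch of the kind of aggregation job we ran on EMR. Field names such as "category", "vendor" and "price" are illustrative, not the exact schema of our Mongo documents.

    # avg_price_by_category.py -- sketch of an MRJob aggregation over the JSON dump.
    # Run locally with:  python avg_price_by_category.py dump.json
    # or on EMR with:    python avg_price_by_category.py -r emr dump.json
    # (our mrjob.conf set ec2_instance_type: c1.medium and num_ec2_instances: 5)
    import json
    from mrjob.job import MRJob

    class AvgPriceByCategory(MRJob):

        def mapper(self, _, line):
            # Each input line is one scraped item exported from MongoDB as JSON.
            try:
                item = json.loads(line)
                yield (item["category"], item["vendor"]), float(item["price"])
            except (ValueError, KeyError):
                pass  # skip malformed records

        def reducer(self, key, prices):
            prices = list(prices)
            # Emit the average price for each (category, vendor) pair;
            # such output was later post-processed into CSV for R.
            yield key, sum(prices) / len(prices)

    if __name__ == "__main__":
        AvgPriceByCategory.run()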

4 Implementation hurdles

Crawling a website anonymously has its challenges, as the websites generally treated us as a bad bot and blocked us. To overcome this we used the Scrapy middleware [5] as a proxy layer and created a list of proxy IPs through which we made requests. Crawling the sites had other challenges:

• Retries in case the initial request fails.
• Handling AJAX requests.
• Ensuring that we get the required fields in every case, i.e. the code shouldn't break and post invalid data to the backend. All the data cleaning was planned before posting the data to MongoDB, to avoid that complexity during MapReduce or in aggregate functions in MongoDB.
• Ensuring that we did not get frequent request timeouts: at any time we restricted the number of calls we were making to each website and introduced a delay between calls. The parameters set in the Scrapy configuration (see the settings sketch below) were
  – DOWNLOAD_DELAY = 5
  – DOWNLOAD_TIMEOUT = 180
  – CONCURRENT_REQUESTS = 8
• When trying to connect R to MongoDB, we ran into problems with the "rmongodb" package. The package was a bit fragile, in particular with respect to writing queries involving R's array data type. Hence we opted to write the relevant information to a CSV and conduct the analysis locally with R. For the given scope of the analysis this was fine; however, had the database been larger we would have had to use Python to connect to Mongo and rewrite the scripts from R to Python.

[5] https://github.com/geekan/scrapy-examples/blob/master/misc/middleware.py
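For reference, a sketch of what the relevant portion of our Scrapy settings looked like. Only the three throttling values above come from our actual configuration; the project name, middleware class and priorities are illustrative.

    # settings.py -- sketch of the throttling and proxy-related Scrapy settings.
    BOT_NAME = "bpp_india"  # illustrative project name

    # Throttling, to avoid hammering the retailers and getting blocked.
    DOWNLOAD_DELAY = 5          # seconds between requests to the same site
    DOWNLOAD_TIMEOUT = 180      # give slow pages time to respond
    CONCURRENT_REQUESTS = 8     # cap on simultaneous requests

    # Retry failed requests a few times before giving up.
    RETRY_ENABLED = True
    RETRY_TIMES = 3

    # Route requests through a rotating list of proxy IPs.
    # RandomProxyMiddleware is an illustrative name for the middleware we adapted
    # from the scrapy-examples repository referenced above.
    DOWNLOADER_MIDDLEWARES = {
        "bpp_india.middlewares.RandomProxyMiddleware": 100,
    }

    ITEM_PIPELINES = {
        "bpp_india.pipelines.MongoPipeline": 300,
    }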


5 Deployment

MongoDB was set up on an EC2 instance and the crawlers were scheduled to run from multiple geographies. All the data from the crawlers was synced to a single MongoDB instance. Since the data set was not too large, we did not do any sharding. We had multiple collections in Mongo for the different data types. The deployment details are shown in the figure below.

[Figure: deployment diagram]
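On the scheduling side, cron entries on each laptop triggered the spiders through Scrapyd. One way to wire cron to Scrapyd is to hit its schedule.json endpoint; a sketch follows, with host, project and spider names that are illustrative rather than our exact scripts.

    # schedule_spiders.py -- sketch of a script a cron entry could invoke nightly
    # to start crawls via Scrapyd's HTTP API (https://scrapyd.readthedocs.org/).
    # Host, project and spider names below are illustrative.
    import requests

    SCRAPYD_URL = "http://localhost:6800/schedule.json"
    PROJECT = "bpp_india"
    SPIDERS = ["bigbasket", "pricedekho", "pepperfry", "localbanya"]

    def schedule_all():
        for spider in SPIDERS:
            resp = requests.post(SCRAPYD_URL, data={"project": PROJECT, "spider": spider})
            resp.raise_for_status()
            print(spider, resp.json().get("status"))

    if __name__ == "__main__":
        schedule_all()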


6 Results

6.1 Analysis using EMR

Product prices across online retail stores (Amazon, Flipkart, Snapdeal, etc.)

In India, Amazon, Flipkart and Snapdeal are major players in the online retail market. We analyzed the individual prices of different items across the various stores and also analyzed the average price for a given store.

• Comparing the average price of the same set of products across Flipkart and Amazon - for around 11,126 products (not unique, as products may be repeated daily) over a span of 20 days, Flipkart was cheaper than Amazon.

• Comparing the average price of the same set of products across Flipkart and Snapdeal - for around 38,964 products over a span of 20 days, Snapdeal was cheaper than Flipkart.

• Comparing the average price of the same set of products across Flipkart, Amazon and Snapdeal - for the 1,730 products available across all three stores, Snapdeal was the cheapest, followed by Flipkart.

• We also analyzed which store offers each item at the minimum price, and ranked the stores by price for each item from low to high. A sketch of this kind of comparison appears at the end of this subsection.

Essential commodity retail price analysis across different cities

We analyzed data on essential commodities (e.g. sugar, gasoline, rice, wheat) across different regions in India to check whether prices were the same across regions or whether there is a lot of variability across regions and tier-1 metros. The data is provided by the Ministry of Consumer Affairs, Food and Public Distribution.
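As referenced above, a minimal pandas sketch of the pairwise store comparison on common products. The file and column names are hypothetical placeholders for the fields in our CSV exports.

    # compare_stores.py -- sketch of the pairwise price comparison on common products.
    # Assumes a CSV export with columns: product, store, price (names are hypothetical).
    import pandas as pd

    df = pd.read_csv("daily_prices.csv")  # hypothetical export from the MRJob output

    # Average price per (product, store) over the window.
    avg = df.groupby(["product", "store"])["price"].mean().unstack("store")

    # Restrict to products listed on both stores, then compare means.
    both = avg.dropna(subset=["Flipkart", "Amazon"])
    print("Common products:", len(both))
    print("Mean price on Flipkart:", both["Flipkart"].mean())
    print("Mean price on Amazon:", both["Amazon"].mean())

    # Rank stores from cheapest to most expensive for each product.
    ranks = avg.rank(axis=1, method="min")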

6.2 Statistical Analysis using R

Once the data was compiled in Mongo, we wrote it out to a CSV to conduct the statistical analysis in R. The motivation for doing the analysis in R rather than Python was the wealth of statistical packages that R already has built in.

CPI and Inflation Computation: To determine whether prices are sticky at daily time intervals, as in the MIT Billion Prices Project, we examine only items whose dates are recorded consecutively. Thus, we limit our analysis to all data scraped from November 23rd to December 16th. Traditionally, the CPI on date b with base date a is computed as

\[ CPI_a^b = \frac{\sum_{x \in \text{Basket}} \text{price}_b(x)}{\sum_{x \in \text{Basket}} \text{price}_a(x)} \times 100\%, \]

where the Basket is a set of items that does not change over the time from a to b. However, when computing the CPI and inflation rate, the Indian Ministry of Finance groups merchandise into one of five buckets, with the following weights (http://www.knowledge.fintotal.com/What-is-Consumer-Price-Index-CPI-and-How-it-Matters/6458#sthash.RpYhoAcm.AwGvtHrX.dpbs):

• Food, Tobacco, Beverages (49.71%)
• Fuel and Light (9.49%)
• Housing (9.77%)
• Clothing and Footwear (4.73%)
• Misc. (26.31%)

Note that Misc. is technically defined to be anything that does not fit in (1)-(4). However, we were only able to find data on products excluding housing prices and oil prices during this time period. So, we reallocate the weights of the housing and fuel buckets by the method described in Appendix A, giving:

• Food, Tobacco, Beverages (61.57%)
• Clothing and Footwear (5.86%)
• Misc. (36.07%)

Our basket consists of whatever items were scraped between November 23rd and December 16th, so it technically changes frequently, as the items that appear on these websites can change in real time. Hence, we adopt a modified version of the CPI for this study. In our data we also collected information that classified each object under a specific category, such as cologne, video games, or granite. For each category we had pricing data on, we attached a label to the product that classifies it as either "Food, Tobacco, Beverages", "Clothing and Footwear", or "Misc." This was done by creating a CSV over the 353 categories with a column that mapped each category to a CPI bucket. After this mapping table was created, we assigned every item collected a "bucket" value determined solely from the product tag in Mongo. Next, we define the weight of a category z to be the weight of the bucket it belongs to, denoted w(z), where these weights are defined above. Furthermore, denote the average price of all items classified under the same category code y as A(y). Then,

\[ CPI_a^b = \frac{\sum_{x \in \text{Basket}} w(x)\,A_b(x)}{\sum_{x \in \text{Basket}} w(x)\,A_a(x)} \times 100\%, \]

where the sum ranges over the categories x present in the basket and \( A_d(x) \) denotes the average price A(x) computed on day d.
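The actual computation was done in R; as an illustration, here is a minimal pandas sketch of the same weighted computation. Column names such as "date", "category", "bucket" and "price" are hypothetical stand-ins for our CSV fields.

    # modified_cpi.py -- sketch of the modified, bucket-weighted CPI described above.
    # Column names (date, category, bucket, price) are illustrative.
    import pandas as pd

    BUCKET_WEIGHTS = {  # reallocated weights from the text
        "Food, Tobacco, Beverages": 0.6157,
        "Clothing and Footwear": 0.0586,
        "Misc.": 0.3607,
    }

    def daily_index(df):
        """Return, for each date, the weighted sum of category-average prices."""
        # A(y): average price per (date, category); w(y): weight of its bucket.
        avg = df.groupby(["date", "category", "bucket"])["price"].mean().reset_index()
        avg["weighted"] = avg["bucket"].map(BUCKET_WEIGHTS) * avg["price"]
        return avg.groupby("date")["weighted"].sum()

    def cpi(df, base_date, date):
        index = daily_index(df)
        return index[date] / index[base_date] * 100.0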

Despite this new definition, inflation is still computed in the same manner. So we have

\[ \text{Inflation}_c^d = \frac{CPI_a^d - CPI_a^c}{CPI_a^c} \times 100\%. \]

Under this framework, the results are given in Appendix C. Compared with the CPI and inflation numbers published by the Indian Ministry of Finance, these values are much more volatile. This is expected, though, due to two factors. The first is purely economic, in that online prices can be changed readily on a daily basis, more efficiently than they can be in stores. The second is due to the way the official values are calculated on a monthly and yearly basis: the prices of those items are measured less frequently and hence have fewer opportunities to change, as opposed to our data, which is compiled daily.

Testing Price Stickiness from Inflation with Unimodality Tests: Next we test the economic theory that prices are sticky. It is believed that prices are resistant to change in traditional marketplaces; however, the online marketplace is a very different environment. Note that if prices are sticky, then price changes should form a unimodal distribution centered at 0. To test this, we use two hypothesis tests: the Dip Test and the Proportional Mass Test. For both tests we take the null hypothesis to be that the distribution of price changes is unimodal and centered at 0.

We first test this hypothesis using Hartigan's Dip Test. While we would have liked to do this for all three buckets above, the Clothing and Footwear bucket did not have items crawled on enough days in this period to warrant a hypothesis test. The other two buckets have sufficient information to perform the test. We perform two versions of the Dip Test: one using linear interpolation and the other using Monte Carlo simulation with 1,000 replications. The p-values for the two buckets were as follows:

Bucket                      Linear Interpolation   Monte Carlo Simulation
Food, Tobacco, Beverages    0.9838                 0.986
Misc.                       0.1464                 0.14

Interestingly enough, we fail to reject the null hypothesis of unimodality in all cases. However, these results do make some sense - Misc. items are things such as electronics, furniture, and appliances. On the online marketplace, the prices of these are more likely to fluctuate than those of staple items, such as foods and beverages. This was verified by histograms of the price changes for each bucket.

Next we test this hypothesis using the Proportional Mass Test (http://www.nber.org/papers/w16760). This test was developed recently at MIT specifically to test the unimodality of distributions in economic settings. Since the test is very new, there is no R package that performs it; however, following the derivation in the MIT paper, we were able to implement the test in R. The MIT paper used threshold values of $1, $2.50, and $5 to measure how much proportional mass is added to the distribution. Converting to rupees at an exchange rate of 50 Rs = $1, we set our respective values to 50 Rs, 125 Rs, and 250 Rs. We use Monte Carlo simulation with 1,000 replications to find the p-values.

Bucket                      Proportional Mass Test p-value
Food, Tobacco, Beverages    0.947
Misc.                       0

Note that this time unimodality around 0 is rejected for the Misc. bucket. This is much more consistent with the histogram than the result of Hartigan's Dip Test.


In addition, we see that Food, Tobacco, Beverages has roughly the same p-value as under the Dip Test. Thus, we conclude that prices for miscellaneous items, such as electronics, furniture, and household items, are less sticky than those of essentials like food and beverages. This suggests that predicting prices for Misc. items will be much more difficult than for Food, Tobacco, Beverages.

Forecasts: Lastly, we use the values from Food, Tobacco, Beverages and Misc. to construct ARIMA(p,d,q) prediction models. To find the parameter values (p,d,q), we use cross-validation to pick the triplet that yields the model with the lowest Akaike Information Criterion (AIC). To evaluate the effectiveness of the prediction, we use the values from November 23rd to December 7th as our training set and December 8th to December 16th as our testing set. That is, if we imagined the day was December 8th, had data on the two prior weeks, and wanted to make a prediction about the next week based solely on the past, this is the model we would build. Here are the results of the analysis:

Bucket                      (p,d,q) via Cross-Validation   MSE
Food, Tobacco, Beverages    ARIMA(0,0,0)                   1401565
Misc.                       ARIMA(1,1,0)                   815312816425

Note that in this case the Food, Tobacco, Beverages prediction was simply to pick the average value, as evidenced by the horizontal forecast line. This is consistent with the histogram of price changes above, since many days had a price change of zero. For Misc. items the model has a much higher mean squared error than for Food, Tobacco, Beverages, but this is expected, as the scattered distribution of price changes showed that prices for these items were far less sticky than those of Food, Tobacco, Beverages.
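The model fitting was done in R; for illustration, here is a minimal sketch of the same AIC-based order selection and hold-out evaluation, translated to Python's statsmodels. The search ranges and series are illustrative, not the exact procedure we ran.

    # arima_forecast.py -- sketch of AIC-based ARIMA order selection and hold-out MSE.
    # `train` and `test` would be the daily index values for one bucket.
    import itertools
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def select_order(train, max_p=2, max_d=2, max_q=2):
        """Pick the (p, d, q) triplet with the lowest AIC on the training window."""
        best_order, best_aic = None, np.inf
        for order in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
            try:
                res = ARIMA(train, order=order).fit()
            except Exception:
                continue  # some orders fail to converge; skip them
            if res.aic < best_aic:
                best_order, best_aic = order, res.aic
        return best_order

    def evaluate(train, test):
        order = select_order(train)
        res = ARIMA(train, order=order).fit()
        forecast = res.forecast(steps=len(test))
        mse = np.mean((np.asarray(test) - np.asarray(forecast)) ** 2)
        return order, mse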

7 Lessons Learnt and Post Implementation Notes

• Scraping for data is difficult. Though we were able to get a lot of prices by scraping websites, in general it is difficult, since most retailers change their layouts regularly and even a small layout change causes your spiders to fail.

• In hindsight, we realize that choosing MongoDB as a data store gave us a place to dump data into, but we did not use it for much beyond that. Dumping JSON files on a local desktop and still using Amazon EMR to do the analysis might have worked just as well, saving us one component, and one headache, in our architecture.

• R-Hadoop integration is still not production ready.

• Our EC2 instance got compromised once because we had exposed more ports than necessary on the server.

8 Limitations of our Approach

The approach we have used demonstrates that such an endeavor is possible and can be built at a much larger scale. Nevertheless, some shortcomings exist, in the sense that we are limited by the number of data sources available. Though retail goods like food, electronics and other CPG goods are now sold by online retailers, services such as medical care or personal care, and large purchases such as real estate or rentals, do not have large online marketplaces. Similarly, not all geographical regions have the same kind of exposure to online marketplaces. So while the CPI we calculate does get the trend right, it is still an incomplete measure compared to the one calculated by the Reserve Bank of India.

9 Appendix

9.1 Appendix A

Below is the procedure for calculating the new Consumer Price Index weights after exclusions, given the old Consumer Price Index weights.

Reweighting Algorithm:

Suppose we have n categories but only want to keep m categories, for 0 < m < n. Denote the original weights by \( W = \{w_1, \ldots, w_n\} \) and the new weights we desire by \( Z = \{z_1, \ldots, z_m\} \). We impose the constraints that each \( 0 < w_j < 1 \) and \( \sum_{j=1}^{n} w_j = 1 \).

Step 1: Assign the retained categories preliminary weights equal to the original weights. Mathematically, assign \( z_1 = w_1, z_2 = w_2, \ldots, z_m = w_m \).

Step 2: Define the sum of the excluded weights as

\[ E = \sum_{j=m+1}^{n} w_j \]

and the sum of the new weights as

\[ N = \sum_{l=1}^{m} z_l. \]

Note that N is currently less than 1, as it is the sum of a proper subset of W.

Step 3: We need the new weights to sum to 1. To achieve this, we update the z's to reflect the excluded categories. For each \( k \in \{1, \ldots, m\} \), update

\[ z_k \mapsto z_k + \frac{z_k}{N} E, \]

which is equivalent to

\[ z_k \mapsto z_k \left( 1 + \frac{E}{N} \right). \]

Now we have constructed \( Z = \{z_1, \ldots, z_m\} \) such that \( \sum_{l=1}^{m} z_l = 1 \), by fairly reallocating the excluded weights.
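A small Python sketch of this reweighting; the category names and weights in the example are hypothetical, purely to show the mechanics.

    # reweight.py -- sketch of the reallocation procedure in Appendix A.
    def reweight(weights, keep):
        """Rescale the kept categories so their weights sum to 1.

        weights: dict mapping category name -> original weight (summing to 1)
        keep:    iterable of category names to retain
        """
        z = {c: w for c, w in weights.items() if c in set(keep)}   # Step 1
        N = sum(z.values())                                        # sum of kept weights
        E = 1.0 - N                                                # sum of excluded weights
        # Step 3: z_k <- z_k * (1 + E/N), so the kept weights sum to 1.
        return {c: w * (1.0 + E / N) for c, w in z.items()}

    # Hypothetical example: drop category "C" and reallocate its weight.
    print(reweight({"A": 0.5, "B": 0.3, "C": 0.2}, keep=["A", "B"]))
    # -> {'A': 0.625, 'B': 0.375}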

9.2 Appendix B

• GitHub Repository - https://github.com/satishkt/bpp-india/
• Tableau Visualization - https://public.tableausoftware.com/views/bpp_wpi_0/PriceAnalysis?:embed=y&:display_count=no

9.3 Appendix C


CPI           Inflation (%)
106.8312      -
137.511053    28.718065
133.878712    -2.641491
442.792329    230.741403
514.803536    16.262975
797.175041    54.850343
412.962419    -48.19677
823.945396    99.520673
455.254099    -44.747055
792.860554    74.157807
804.37898     1.452768
447.536169    -44.362523
477.174856    6.622635
1025.633529   114.938721
647.224839    -36.895117
749.369954    15.782014
832.454373    11.087236
463.089626    -44.37057
307.903021    -33.511138
977.612001    217.506466
699.300975    -28.468454
543.599912    -22.265243
1.839218      -99.66166
