Carbon dioxide emissions affect us all. But how do we affect carbon dioxide emissions?

**Team**
 * Omar Alghanmi
 * Andrew Daniel

//Carbon dioxide emissions affect us all. But how do we affect carbon dioxide emissions?//

=Introduction and Purpose =

In the last few years, discussion of greenhouse gas emissions and their negative consequences has increased substantially; for example, warmer weather around the globe has started to cause natural disasters and disturb agricultural production. Several questions have been raised in recent years about the way humans consume goods and services, the policies corporations adopt to drive their growth, and the consumption of fossil fuel as a driver of economic growth. For this project, we aim to use a set of economic and environmental indicators, such as oil and gold prices, electric power consumption, population, and other major commodities’ prices, not necessarily to predict, but to derive new insights into what has caused CO2 emissions to escalate in recent years. These insights could be used to conduct more advanced research. As a secondary goal, understanding the levels of CO2 emissions based on economic activity could help identify the industries that cause the most pollution or have the greatest need for green initiatives.

=Literature Review =

Recent reports suggest that CO2 emissions have increased as the global economy has recovered from the 2008 global financial meltdown. In our research, we found several studies that looked for a link between the increase in carbon dioxide emissions and GDP over time. CO2 emissions in developed countries, for example, decreased by 1.3% in 2008 and 7.6% in 2009, but increased by 3.4% in 2010. The International Energy Agency has estimated that power plants will account for 80% of CO2 emissions by 2020. The unprecedented population growth rate, especially in developing countries, has stimulated the demand for goods and services, pushing corporations to seek higher energy input for production. Major commodity prices have increased because of the high demand as well.

 Source: Nature Climate Change, www.nature.com

The dangers posed by high carbon dioxide emissions are of critical importance for the future of the planet. Emissions have a direct impact on climate change, causing disruption to food production through changes in weather cycles and the occurrence of natural disasters.

=Procedure =

** Objective: ** Analyze a list of economic and environmental indicators to predict and obtain new insights into the growth of carbon dioxide emissions.

** Indicators: ** We will use annual data for the following initial set of variables. We intend to reduce the number of variables during the process, as we have fewer than 40 years’ worth of records. Data were collected from the World Bank database.


 * ** Cereal production (metric tons) **
 * ** Electricity production from renewable sources (kWh) **
 * ** CO2 intensity (kg per kg of oil equivalent energy use) **
 * ** CO2 emissions from gaseous fuel consumption (kt) **
 * ** CO2 emissions (kt) **
 * ** CO2 emissions from liquid fuel consumption (kt) **
 * ** CO2 emissions (metric tons per capita) **
 * ** Agricultural land (sq. km) **
 * ** Foreign direct investment, net inflows (BoP, current US$) **
 * ** Foreign direct investment, net inflows (% of GDP) **
 * ** Electricity production (kWh) **
 * ** Electric power consumption (kWh) **
 * ** Electric power consumption (kWh per capita) **
 * ** Energy use (kg of oil equivalent per capita) **
 * ** GDP (current US$) **
 * ** Population growth (annual %) **
 * ** Population, total **
 * ** Aluminum, $/mt **
 * ** Barley, $/mt **
 * ** Coal, Australia, $/mt **
 * ** Coconut oil, $/mt **
 * ** Coffee, Arabica, cents/kg **
 * ** Coffee, Robusta, cents/kg **
 * ** Cotton, A Index, cents/kg **
 * ** Crude oil, avg, spot, $/bbl **
 * ** Gold, $/toz **
 * ** Agriculture, 2005=100 **
 * ** Energy, 2005=100 **
 * ** Metals and minerals, 2005=100 **
 * ** Non-energy commodities, 2005=100 **
 * ** Iron ore, cents/dmtu **
 * ** Natural gas, Europe, $/mmbtu **
 * ** Palm oil, $/mt **
 * ** Phosphate rock, $/mt **
 * ** Potassium Chloride, $/mt **
 * ** Silver, cents/toz **
 * ** Soybean oil, $/mt **
 * ** Steel cr coilsheet **
 * ** Steel hr coilsheet **
 * ** Steel, rebar, $/mt **
 * ** Steel wire rod, $/mt **
 * ** Sugar, world, cents/kg **
 * ** Wheat, US, HRW, $/mt **

=Process Overview =

To complete this project, we will use IBM SPSS Modeler to build and evaluate predictive models.

** Data Preparation **
Data preparation will be done using Modeler. Depending on the model, the data will have to be standardized (converted to z-scores) and/or transformed so that their distribution is closer to Gaussian. Outliers will be eliminated.
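As an illustration of the standardization step, here is a minimal Python sketch of a z-score conversion. It is not the Modeler stream we used; the function name `standardize`, the use of the population standard deviation, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert a list of numbers to z-scores (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up numbers on very different scales, for illustration only.
gdp_usd = [1.0e12, 2.0e12, 3.0e12, 4.0e12]
pop_growth_pct = [1.0, 2.0, 3.0, 4.0]

print(standardize(gdp_usd))
print(standardize(pop_growth_pct))
```

After standardizing, both columns are expressed in the same unit (standard deviations from the mean), so a variable measured in trillions no longer dominates one measured in single digits.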

** Modeling **
Modeling will be done with IBM SPSS Modeler. Because we are trying to predict a continuous variable, we are limited in the models that we can apply. Likely models include multiple linear regression, k-nearest neighbors, regression trees, and neural nets. Because of the sheer number of variables present in the data, it is likely that principal component analysis (PCA) will be used to reduce the dimensionality of the data and make it more manageable and understandable.
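One common way to decide how many PCA components to keep is to look at the cumulative share of variance they explain. The sketch below assumes the eigenvalues of the correlation matrix are already available; the eigenvalues shown are hypothetical and the function names are our own, not part of Modeler.

```python
def explained_variance_ratios(eigenvalues):
    """Share of total variance carried by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative
    share of variance reaches the given threshold (0 to 1)."""
    ratios = explained_variance_ratios(sorted(eigenvalues, reverse=True))
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    return len(ratios)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

print(explained_variance_ratios(eigenvalues))
print(components_needed(eigenvalues, 0.7))
```

The threshold is a judgment call: a higher value keeps more of the original information but leaves more components to interpret.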

** Evaluation **
Evaluation will be done in SPSS Modeler using a separate testing partition created from the data. Evaluation tools, such as lift and gains charts, will be used to determine the quality of each model created. Other means, such as confusion matrices comparing predicted to actual values, will also be used to evaluate models.
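For a continuous target such as CO2 emissions, predicted and actual values on the testing partition can also be compared with simple error measures. The following is a sketch under our own assumptions (root mean squared error, mean absolute error, and Pearson correlation, with invented helper names); it is not output from Modeler.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(rmse(actual, predicted), mae(actual, predicted), pearson_r(actual, predicted))
```

A high correlation alone can hide a systematic offset, so it helps to read it together with at least one error measure.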

** Deployment **
Models will be interpreted in order to provide insight into the initial question and to relate findings back to real-world scenarios.

** Project Revamp **
Based on feedback we received on our project proposal, we decided to expand the scope of the project in order to provide new insight. To do this, we took 50 years’ worth of social indicators and added them to the existing dataset to check correlations between societal changes and CO2 emissions. The new predictor variables are:

Life expectancy at birth for males was initially included as well, but it was found to be perfectly correlated with female life expectancy and was thus removed. Dozens of other variables were initially included, but their data were so sparse that they were not usable and had to be eliminated. The remaining variables heavily featured education and labor force participation rates among women.

In order to evaluate the two sets of variables thoroughly, we began by testing with only the environmental factors, then only the social factors, then the two together.

** Execution **
Here is our stream, broken into the three sections:

All three parts followed roughly the same procedure: partition the data into training and testing sets with a 60%/40% split, filter out non-relevant variables, and then use a mix of predictive models to identify the strongest predictors. On the environmental side, because there were so many variables, Feature Selection and PCA/Factor nodes were used to identify the most important variables and pare down the total number. On the social side, there were fewer variables, so filtering was unnecessary. Environmental data were also standardized, whereas social variables were not; some environmental numbers were in billions or trillions, while others were single digits, so it was necessary to put them all on the same scale. Social factors were mostly percentages, so standardizing did not seem necessary.
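The 60%/40% partition described above can be sketched in a few lines of Python. This is an illustration only: the function name `partition`, the fixed seed, and the list of years are assumptions, and Modeler's own Partition node may assign records differently (for example, by drawing each record at random so that the split is only approximately 60/40).

```python
import random


def partition(records, train_fraction=0.6, seed=42):
    """Shuffle records reproducibly and split them into
    training and testing sets."""
    shuffled = list(records)  # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 50 hypothetical annual records, identified by year.
years = list(range(1960, 2010))
train, test = partition(years)

print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same testing set.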

** Results Summary **

** Environmental **
Environmental findings were not terribly surprising: the biggest contributors to CO2 emissions were heavy industry (coal, steel), electricity, deforestation/agriculture, GDP, and population growth. The links to industry and agriculture are straightforward. The link to GDP is less direct: a large part of GDP is government and corporate spending, so GDP growth involves industrial purchases and infrastructure development, both of which presumably contribute to pollution. The link to population growth is also indirect: as the population increases, demand for electricity increases, which in turn increases electricity output and thereby greenhouse gas emissions.

PCA identified 28 factors, though the first three alone accounted for 74% of the variance in the data. These three factors corresponded to heavy industry, agriculture, and natural resource exploitation. Feature selection produced a similar list, though its unique entries include education expenditure and gross national savings (both of which likely factor into GDP). Another interesting result is the classification and regression tree, which split the data first by agricultural land and then used nothing but cereal production for the rest of the tree.

An Auto Numeric node was used to compare models. The graph of its predictions versus actual values shows that the model fits the dataset we used closely: the high correlation between the two indicates that predicted values were close to actual ones.

** Social **
On the social side, the most important factors are life expectancy and the percentage of women who work, based on linear regression and classification and regression trees (C&RT). C&RT also ranks the ratio of women to men in education as important. In essence, it seems that anything associated with a “modern developed country” contributes to carbon dioxide emissions, and it is shocking to find that women’s liberation is correlated with an increase in pollution. Upon further inspection, however, this seems reasonable. When we examined environmental factors, one of the variables most highly correlated with greenhouse gases was electricity production, and electricity has been one of the biggest enablers of modernization. It seems reasonable to state that a high percentage of women in the workforce is indicative of a certain level of modernity and liberalism that is associated with “modern” countries. So as countries modernize and give their citizens more rights, they also progress technologically and expand their electricity capacity, increasing pollution. It is important to restate that credo of statisticians: correlation does not imply causation. Human rights aren’t destroying the atmosphere, but the modernity that accompanies them is.

An Auto Numeric node was once again used to compare models. The graph of its predictions versus actual values shows that the model is entirely undependable for the first 25 or so years, with predictions well above the actual values. In the second half of the dataset, however, social predictors seem moderately successful at predicting emissions, as the line moves closer to the ideal 45-degree angle.

** Combined **
The combined dataset, which was evaluated using only an Auto Numeric node, does have both sets of variables contributing to the prediction. The C&R tree largely resembles the one from the environmental side in its reliance on agricultural land and cereal production; this time it also incorporates life expectancy and women working in fields other than agriculture.

The Auto Numeric graph produced mixed results; predictions generally correlate highly with actual values, though the inaccuracy found in the social variables graph is also present in the first half of the dataset here.

** Evaluation **
It’s important to point out the shortcomings of this project. Data reporting is complicated. The United States has conducted a census for more than two centuries, but developing countries such as Brazil and China have been less fastidious about record keeping and reporting, while countries like the Soviet Union kept a great number of secrets. Consequently, the data likely have huge gaps, and the numbers we found do not necessarily reflect the reality of the world. Furthermore, while many of the data entries from 2000 to 2010 are complete, entries closer to 1960 are sparse, missing enormous amounts of data and calling into question the reliability of the findings. The process, we believe, was technically solid. The data, on the other hand, likely were not.
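The sparsity problem described above can be quantified before modeling by measuring how much of each variable is missing. The sketch below is an assumption-laden illustration: missing entries are represented as `None`, the 50% cutoff is arbitrary, and the column names and values are invented.

```python
def missing_fraction(column):
    """Share of entries in a column that are missing (None)."""
    return sum(1 for value in column if value is None) / len(column)


def drop_sparse(table, max_missing=0.5):
    """Keep only the columns whose missing share is at most max_missing."""
    return {
        name: column
        for name, column in table.items()
        if missing_fraction(column) <= max_missing
    }


# Invented columns, for illustration only.
table = {
    "co2_kt": [1.0, 1.1, 1.2, 1.3],
    "sparse_indicator": [None, None, None, 0.4],
}

print({name: missing_fraction(col) for name, col in table.items()})
print(list(drop_sparse(table)))
```

Running the same check per year rather than per variable would show how much thinner the early records are than the recent ones.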

In order to enhance the findings of this project, further study is required. Compiling a huge dataset of the status of each country during each year from 1960 to 2011 would significantly increase the number of data points and give Modeler more to work with; this was not done because it would have required skills or a time commitment beyond our means. With that in mind, we do believe that our findings offer glimpses of insight.

** Deployment **
The biggest takeaway from the environmental side is that an increase in population and industry is going to lead to an increase in carbon dioxide. In terms of social factors, we can see that longer lifespans and more women in the workplace correlate with more pollution because of their shared source in modernity. So what can we do about it? What is the best recourse to eliminate pollution and save the world? Should we intervene in developing countries to put an end to their development? Institute a totalitarian government that will oversee all of us to ensure that no one lives beyond their means?

No!

Unfortunately, there’s not really anything we can do about the growth of heavy industries, the expansion of electricity capacity, or population growth. We can’t stop progress and we can’t stop the development of nations. All we can do is wait for new technology to come along and change the way that we harness power, and in the meantime take small steps to minimize our carbon footprint and slow the growth of emissions.

=Administration =

** Required Materials **
 * Computer
 * SPSS Modeler
 * Data Set
 * Data Mining for Business Intelligence textbook
 * Other online references available on WISE

** Specific Tasks: **
 * 1) Goal establishment
 * 2) Literature review
 * 3) Review climate change, environment, and economic indicators to select a set to be used in the modeling
 * 4) Import annual data from World Bank
 * 5) Visualize the data
 * 6) Evaluate the hypothesis
 * 7) Audit the data
 * 8) Process outliers and missing data
 * 9) Partition the data into 60 percent training and 40 percent validation
 * 10) Determine indicator importance
 * 11) Modeling
 * 12) Interpret the results and choose a final model based on accuracy, mean error, lift charts, and confusion matrix
 * 13) Prepare the report
 * 14) Present findings

** Timeline for Completion of Activities: **
 * Project Proposal: April 12, 2012
 * Data Pre-processing: April 19, 2012
 * Modeling: April 28, 2012
 * Presentation Preparation: April 30, 2012
 * Project Presentation: May 3, 2012

=Bibliography =

Wikipedia. (n.d.). //Carbon dioxide in Earth's atmosphere//. Retrieved 04 2012, from Wikipedia: http://en.wikipedia.org/wiki/CO2_emissions

World Bank. (n.d.). //Data.// Retrieved 04 2012, from The World Bank: http://data.worldbank.org/topic

Peters, G., Weber, C., Guan, D., & Hubacek, K. (2007). //China's Growing CO2 Emissions: A Race between Increasing Consumption and Efficiency Gains.// Retrieved 04 2012, from http://pubs.acs.org/doi/abs/10.1021/es070108f

Peters, G., Marland, G., Quere, C., Boden, T., Canadell, J., & Raupach, M. (2011). //Rapid growth in CO2 emissions after the 2008–2009 global financial crisis.// Retrieved 04 2012, from Nature Climate Change: http://www.nature.com/nclimate/journal/v2/n1/full/nclimate1332.html

Revkin, A. (2011, 05 30). //Tracking Economy, CO2 Emissions Hit New High.// Retrieved 04 2012, from NYTimes: http://dotearth.blogs.nytimes.com/2011/05/30/tracking-economy-co2-emissions-hit-new-high/