


=**AGSM Admissions Final Report**=

**Prepared For:** Dr. Paul Dwyer, GSM 672 – Data Mining

**Prepared By:** Majed Alharbi, Sebastien Burgain, Robin Peach, Gabriela Pop, Jon Williams

**May 12, 2009**

=**Part I: INTRODUCTION**=


__**Business Challenge**__

Willamette University prides itself on being “the best in the West”. With increased recognition, the candidate pool is becoming more competitive. In order to build and maintain a brand that exudes excellence in education, Willamette admissions must carefully attract and admit students with high academic potential.

Thus the ongoing business challenge for the University is whom to admit and how to allocate financial aid dollars in order to attract top students. A secondary challenge, which must be addressed first, is defining what success means to Willamette: is it a high GPA or graduate school placement rate, or is the focus on filling a specific student profile by department?

For the purposes of this project we will isolate a single business challenge: which students should be accepted in order to maintain a high-profile brand image, with GPA while at Willamette as the measure of success.

__**Project Goal**__

In order to address the specific business challenge identified above, we will work to determine what the best predictors of academic success are at Willamette University, specifically the Atkinson Graduate School of Management (AGSM). We will work to identify predictors of the most qualified candidates for admission to the school.

__**Data Mining Techniques**__

Using the data obtained, we will apply classification techniques such as logistic regression and classification trees to determine which attribute(s) signal future success at Atkinson. These techniques are explained in more detail in the data analysis methods section of Part III.

__**Audience and Results**__

This project is targeted toward the admissions staff, to provide them with historical analysis that may aid them in the candidate selection process. Furthermore, the desired outcome is a set of standards that reduces the guesswork in determining whom to admit.

The results of the project will be summarized and presented during the week of May 11, 2009. A Microsoft PowerPoint presentation will be prepared and presented to an audience of peers as well as University faculty and staff. It will provide details on the project itself, literature review, methodology, findings, and any other applicable information to the intended project audience (AGSM admissions staff).

It is our hope that the findings of this project will help present and future admissions staff make educated decisions regarding student admissions based on their desired student profile.

__**Hypothesis**__

It is expected that of the variables in the dataset, GMAT will be the most significant. Also, it is expected that US versus Non-US and undergraduate major (business versus non-business) will also be factors in determining success as defined for the purposes of this research (GPA). Factors that are not expected to be significant include the start term (Fall or Spring) and gender.

=**PART II: LITERATURE REVIEW**=

__**Past Research**__

Atkinson staff and students have worked in the past on developing a model that helps the admissions department choose who gets admitted from a pool of candidates. In addition, various professors, both past and present, have researched methods and determinants in hopes of assisting admissions personnel in finding the best mix of students and financial aid packages to meet the school's needs and desires in terms of student profile. There have also been efforts to identify predictors that would raise the average GMAT score of Atkinson Graduate School students. One specific effort is that of Professor Mike Hand, who works with students every year to apply optimization principles to this particular business problem: how to allocate financial aid to achieve certain student profile goals, including level of diversity, SAT scores, and departmental enrollment goals (athletics, theater, music, etc.). Various optimization and regression models have been considered when picking students, but no model has completely performed the task without human intervention; final decisions have always rested on the judgment of administrative staff.

__**Past Research Challenges**__

The main problem that has always arisen is that the dataset for any single year is very small, because Atkinson is a small school. In addition, the school's financial aid policy and curriculum change every year. Together these factors made each model highly specific to the dataset it was developed on and unreliable on new datasets.

The current project will work to determine predictors that have a significant impact on how well a candidate will perform, or whether a candidate will reach the top 10% of the class. To mitigate the risk of small class sizes, our database contains data for multiple years, from 1994 until 2001. Also, because financial aid changes substantially every year, we will not take students' financial aid into account. Financial aid information can be used later in an optimization model that takes the output of our model as input.

__**Project Research Procedure**__

Our project aims to predict which types of students are most likely to succeed at AGSM, to help the recruiting department establish admission criteria. The data was acquired in the Spring of 2008 from AGSM admissions staff, with whom we will meet to gain a full understanding of the data provided. Next, the data will be prepared for modeling and analysis. To complete the modeling portion of the project we will have to master various data mining techniques in SPSS Clementine 12.0, the data mining software used to run the appropriate models. The preparation and modeling stages will likely be repeated to gain a complete view of the data and its relationships. Results will be evaluated and summarized for the aforementioned presentation to our peers and the admissions staff.

=**Part III: METHODOLOGY**=

__**Technology Requirements**__

In order to complete this project our team requires a computer with SPSS Clementine 12.0. Additionally, it would be beneficial to have XLMiner 3 running in Microsoft Excel.

__**Project Tasks & Analysis Methods**__


 * Data collection: The dataset was made available to us by the admissions department at the Atkinson Graduate School of Management. The dataset contains 495 complete records.
 * Pre-process the data to show understanding of what we define as student success in AGSM (including setting dummy variables and transforming skewed ranges).
 * Partition the dataset into training and validation sets at 40% and 60% respectively.
 * Test different models, evaluating how well each can forecast students' success at AGSM. The models we will examine in SPSS Clementine include: C&R Tree, QUEST, CHAID, neural networks, C5.0, logistic regression, and Bayes Net.
 * The best two models will be identified. Using those models, we will experiment with advanced features and output configurations to maximize probability of model success.
 * Identify and subsequently evaluate the top prediction model by examining its effectiveness in identifying and selecting students for AGSM who will be successful (a GPA above a certain point, to be determined through our analysis as well as through discussions with admissions staff about their desires and expectations).
 * To evaluate and interpret the models, the team will use lift charts, prediction accuracy, and gains charts.
 * Findings and recommendations will be summarized and presented.
 * Note that the collection, preparation, and modeling stages may have to be revisited in order to examine all angles and report the most complete and accurate findings possible.
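The 40%/60% partitioning step can be sketched outside Clementine; below is a minimal illustration in Python using scikit-learn's `train_test_split`. The column names and toy frame are hypothetical stand-ins, not the actual AGSM fields.

```python
# Sketch of the 40% training / 60% validation partition described above.
import pandas as pd
from sklearn.model_selection import train_test_split

def partition(df: pd.DataFrame, seed: int = 42):
    """Split records into a 40% training set and a 60% validation set."""
    train, valid = train_test_split(df, train_size=0.4, random_state=seed)
    return train, valid

# Tiny illustrative frame standing in for the 495-record dataset
df = pd.DataFrame({
    "gmat": range(500, 600, 10),
    "gpa_bin": [1, 2, 3, 2, 1, 3, 2, 2, 3, 1],
})
train, valid = partition(df)
print(len(train), len(valid))  # 4 6
```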

The team will work together and divide into sub-teams when needed to analyze selected models. SPSS Clementine 12.0 will be used for applying each of the algorithms and for interpreting the results. Given the small dataset available to us, there will be no major costs beyond the time required to apply and interpret each model.

=**Part IV: MODELING & ANALYSIS**=


__**Process**__

The dataset obtained from Atkinson admissions staff was fairly clean to begin with; no records contained null values, so we were able to retain the entire 495-record dataset. Data was initially pre-processed, which included converting text to numeric and binary variables as appropriate. Descriptions of the variables in the cleaned data are as follows:
 * //**Stud ID#:**// assigned student identification number at Willamette University (this variable was filtered out when conducting analyses, as it is a randomly assigned variable and not an applicable attribute of interest)
 * //**Bus:**// undergraduate major – business (1) or non-business (0), binary
 * //**Male:**// student's gender – Male (1) or Female (0), binary
 * //**Application Start Term:**// start term applied to by the student – Fall (1) or Spring (0), binary
 * //**Age at Application:**// student's age when the application was received
 * //**US vs. Non-US:**// country of origin – US (1) or non-US (0), binary
 * //**GMAT or equivalent:**// Graduate Management Admission Test score, or equivalent
 * //**Cum GPA at graduation:**// cumulative GPA of the student upon graduation at AGSM

Several basic statistical measures (mean, minimum, maximum, standard deviation) were calculated and subsequently used to scale certain numeric variables for the purposes of evaluation and modeling: age at application, GMAT, cumulative GPA. The binning tool in Clementine was then used to generate 3 GPA bins:
||= **GPA Bin** ||= **GPA Range** ||
|| 1 || 3.01 - 3.33 ||
|| 2 || 3.34 - 3.66 ||
|| 3 || 3.67 - 4.00 ||
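As a sketch of this pre-processing step outside Clementine, the min-max scaling and the three-bin GPA cut can be reproduced in Python with pandas. The function and column names are our own illustration, not the actual Clementine stream.

```python
# Hypothetical sketch: min-max scaling of numeric fields and cutting
# cumulative GPA into the three bins from the table above.
import pandas as pd

def scale_minmax(s: pd.Series) -> pd.Series:
    """Rescale a numeric column to the 0-1 range used for modeling."""
    return (s - s.min()) / (s.max() - s.min())

def bin_gpa(gpa: pd.Series) -> pd.Series:
    """Assign each cumulative GPA to bin 1, 2, or 3 per the ranges above."""
    return pd.cut(gpa, bins=[3.00, 3.33, 3.66, 4.00], labels=[1, 2, 3])

gpas = pd.Series([3.10, 3.50, 3.90])
print(bin_gpa(gpas).tolist())  # [1, 2, 3]
```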

The various predictor variables were plotted against the binned GPA data to gain a greater understanding of the relationship between the variables and cumulative GPA. The graphical outputs are shown below:

//**Figure 1: Male vs. GPA**//

//**Figure 2: Bus vs. GPA**//

//**Figure 3: US/Non-US vs. GPA**//

//**Figure 4: GMAT vs. GPA**//

//**Figure 5: Age vs. GPA**//

__**Modeling**__

To predict future success at Atkinson, as defined by GPA, we applied several methods of modeling, prediction, categorization and clustering to the dataset. We did this to cast a large net and see what we could learn about our data in the process and ensure that we came up with the best model. Although it was known that prediction and clustering were not practical methods considering the business problem at hand, we felt that there was still the chance of gaining useful insights into the data through such models. Cluster analysis was used in the attempt to reveal the clusters with high and/or low GPA’s to determine which descriptors were associated with success or a lack thereof. Prediction techniques were also examined as they showed what variables are most important and how they affect student GPA.

Cluster analysis did not provide very useful information. The K-means, Kohonen, and TwoStep models each associated differing descriptors with the clusters having the highest average GPA. Also, the average GPAs of the various clusters were largely similar. This is understandable when one considers that demographic data is weighted equally with other data in these models, such as students' GMAT scores. Within similar demographic groups there are high, medium, and low achievers whose GPAs average to a mean similar to that of another cluster.
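A minimal sketch of this clustering attempt, assuming equally weighted 0-1 scaled features as described: with K-means on such features, the per-cluster mean GPAs come out nearly identical, mirroring the result reported above. The data here is synthetic, not the AGSM records.

```python
# K-means on equally weighted scaled features, then mean GPA per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Scaled demographic/GMAT features for 100 hypothetical students
X = rng.random((100, 4))
gpa = 3.0 + rng.random(100)  # GPAs between 3.0 and 4.0

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for k in range(3):
    # Cluster means land close together, as in the report
    print(k, round(gpa[labels == k].mean(), 2))
```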

Prediction models provided more useful information, despite the fact that different models gave contradictory results. Two variations of the neural net model were attempted. First, a traditional neural net model was run to predict an applicant's exact GPA. All non-binary data was converted to a normalized 0-to-1 scale. The traditional neural net model returned a high rate of accuracy, 81%. The model revealed that students' GMAT score was by far the most significant variable, with the start term at Atkinson (Fall or Spring) listed second. Unfortunately, the neural net model does not indicate whether the effect of the start term is positive or negative. Additionally, only 5 of the nearly 500 records represent students who started school in the spring; with so few records, the variable may have appeared significant due to outliers that are not representative of that group.

Next, a modified neural net model was created by first binning the scaled GPA data into 3 categories. This model attempted to predict which “success group” a student was in, rather than their precise GPA. The rate of accuracy of this model was 45%, significantly lower than that of the traditional neural net model described above. However, GMAT was again listed as the most significant predictor, with US vs. Non-US second. The lower accuracy is likely due to the fact that there is very little variance in GPA; the bins are therefore close together, and it is easy to predict the wrong group. Under this scoring, a small 4% error that would add little to the total inaccuracy of the first model instead places a prediction a full group off, and a much larger error of 25% is attributed to that record. Nonetheless, the second neural net is one of the most accurate models built overall, and produces the greatest lift above random. However, the “black box” nature of the neural net makes it unfit for the admissions process: student admissions requires human involvement, and results must be interpreted and weighed against instinct and unquantifiable factors, such as an admissions interview.

The regression model provided results that contradicted all other models applied to the dataset, including the neural networks. Regression was the only model that did not identify a student's GMAT score as the most significant variable. In fact, prior to normalizing the data, the model assigned a negative coefficient to GMAT score, meaning that an increase in GMAT score would predict a lower GPA. This likely indicates that the dataset does not meet the assumptions of the regression model.

The following categorical models were examined: logistic regression, CHAID, QUEST, C5.0, and C&R Tree. The logistic regression model yielded a low rate of accuracy and did not provide results that could be easily interpreted by admissions personnel. Decision trees, on the other hand, generated what could be used as a clear decision process by which to categorize and/or rank prospective students. The QUEST and CHAID models each generated trees with only one layer. Those decision trees split applicants by GMAT score at a cutoff of 555: those below the mark were predicted to be category-two performers (GPA 3.34 – 3.66), and those above category three (GPA 3.67 – 4.00). The C5.0 model gave results that were overly complicated and needed pruning. The C&R Tree model provided additional functionality in that the number of layers in the tree could be modified to achieve the desired balance between accuracy and over-fitting. Ultimately, a C&R Tree model with 3 layers and the normalized GPA data binned into only three categories yielded the best model.
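C&R Tree is Clementine's implementation of CART, so a depth-limited CART classifier gives a comparable sketch of the final model. The data below is synthetic, generated to loosely follow the GMAT/age cutoffs discussed in this report; it is not the AGSM dataset.

```python
# Depth-3 CART classifier, analogous to the 3-layer C&R Tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
gmat = rng.integers(450, 700, size=200)
age = rng.integers(22, 45, size=200)
# Synthetic GPA bins (2 or 3) loosely following the reported rules
y = np.where(
    gmat > 555,
    np.where((age >= 26) & (age <= 32), 2, 3),
    np.where((gmat > 515) & (age > 32), 3, 2),
)
X = np.column_stack([gmat, age])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["gmat", "age"]))
```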

__**Results – Best Model**__

The C&R Tree model correctly predicts an applicant's performance group (GPA bin) 54.25% of the time, the highest accuracy of the models examined. The modest lift chart illustrates that using the C&R Tree model to predict applicants' performance groups is only slightly better than randomly assigning groups, with performance peaking between 60% and 80% of the data. Also, the C&R Tree model only assigns applicants to the second or third performance group, meaning it will not predict any student as having a GPA below 3.34.
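The lift values read off the chart can also be computed directly from validation scores; the sketch below shows the standard calculation (the scores and labels are invented for illustration, not AGSM output).

```python
# Cumulative lift: hit rate among the top-scored records at a given
# depth of the file, divided by the overall hit rate.
import numpy as np

def cumulative_lift(y_true: np.ndarray, scores: np.ndarray, frac: float) -> float:
    """Lift at `frac` of the data, sorted by descending model score."""
    n = max(1, int(len(scores) * frac))
    top = np.argsort(scores)[::-1][:n]
    return y_true[top].mean() / y_true.mean()

y = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])        # 1 = top-bin student
s = np.array([.9, .8, .7, .6, .5, .4, .3, .2, .1, .05])
print(cumulative_lift(y, s, 0.2))  # top 20% are all hits vs a 40% base rate
```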

//**Figure 6: C&R Tree Lift Chart**//

//**Figure 7: C&R Tree Model Variable Importance**//

//**Figure 8: C&R Tree**//

The following rules are based on the C&R Tree:

The decision tree uses only two variables, GMAT and age. It distinguishes successful from less successful students as follows: those with above-average GMAT scores (550–555) are generally high performers (GPA 3.67 or above), though those in the middle age group may be only intermediate performers (GPA 3.34 to 3.66). Applicants with GMAT scores below the initial cutoff (555) are most likely intermediate performers (GPA 3.34 – 3.66), unless they are above 32 years old, in which case they are predicted to be category-3 performers (GPA 3.67 – 4.00).

The significance of GMAT scores for performance is easy to understand; the significance of age is harder to accept, yet some explanations can be made. Age is mostly used to distinguish groups that are exceptions to the GMAT-based rules. These groups are applicants 26 to 32 years of age with GMAT scores above 555 who are only intermediate performers, and applicants whose GMAT scores fall below the initial cutoff of 555 yet above 515, and who are above 32 years of age, who are still high performers. If GPA is thought of as a combination of work ethic and intelligence, these relationships can loosely be explained.
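The tree's rules, as described above, can be written out as a small scoring function. The cutoffs (555, 515) and age bands are taken from the text; the function itself is our illustration, not Clementine output.

```python
# The C&R Tree decision rules as a hypothetical scoring function.
def predict_bin(gmat: int, age: int) -> int:
    """Return the predicted GPA bin (2 or 3) per the reported tree rules."""
    if gmat > 555:
        # High scorers are category 3, except the 26-32 age band
        return 2 if 26 <= age <= 32 else 3
    # Below the cutoff: category 2, unless an older applicant with
    # a GMAT still above 515, who is predicted to reach category 3
    if gmat > 515 and age > 32:
        return 3
    return 2

print(predict_bin(600, 24), predict_bin(600, 28), predict_bin(530, 40))
# 3 2 3
```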

GMAT is a rough gauge of intelligence, which is why it is used first to divide students into performance groups. Motivation must then explain the anomalous groups. We hypothesize that applicants above 32 years old are more motivated to work hard in the program. They have been working for several years and are probably dissatisfied with their current position, looking to change career paths or move up within their firm. They are mostly more settled, possibly with families, and participate in fewer distracting activities than younger students. This may account for their high performance relative to their test scores. Applicants between 26 and 32, by contrast, seem to lack this motivation. They have spent less time in the workforce; while they too are probably trying to move up, they may not be as dissatisfied with their current position. If they have families, those families are probably younger and less settled, taking up more of their time; if they do not, they are more likely than their older counterparts to take part in the distractions of younger students.

A type I error, incorrectly predicting a student as one who will be very successful (GPA over 3.67), is only moderately costly. If scholarships are awarded based on the model's predictions, a less successful student may receive a large scholarship; however, since scholarships come with conditions to meet certain GPA requirements, this is a self-correcting error. A more damaging and long-term effect of admitting poorly performing students could be a diminished reputation and brand image, resulting in fewer (and/or lower-caliber) applicants, diminished job placement, or less placement in reputable organizations.

Falsely categorizing highly successful students as low performers, a type II error, would also have highly negative implications. This type of error could result in not admitting students who could potentially be the most successful. Over time this could produce issues similar to those described above: a lower-performing class, a diminished reputation, and thus fewer applications from students, faculty, etc. The most negative impact of type II error comes from alumni who are lower achievers in the workplace and do not contribute as much to capital drives.



__**Limitations & Recommendations for Future Research**__

Overall, there is the possibility that with more data a much more useful model could be created and used to assist in the admissions process at the Atkinson Graduate School of Management. Unfortunately, because Atkinson is such a small program, there was a distinct limitation in the amount and depth of data the admissions office could provide for this project. It is our hope that this project will mark the beginning of further research by admissions personnel who have access to more data.

Additional data, including GPA at previous institutions, length of full-time work experience, or type of previous degree (BA vs. BS), could all prove very useful. Furthermore, GPA is not the only, or necessarily the best, determinant of student success; job placement is ultimately what defines the success of a student and of the MBA program, and is also what generates donations and continued success in recruiting top-notch students and faculty. Information regarding job placement, industry, and salary would also help in developing a more robust, useful model.