Search Advertising Findings

By: Miranda Gestrin, Bre' Greenman, Sidnee Schaefer


**Analysis of Models**

Within the data set, there are thirteen variables. The following table describes each variable, with the variable’s name as it appears in the data set:
**Data Set Description**
|| ** Variable Abbreviation ** || ** Variable Name ** || ** Description ** ||
|| ** RawID ** || Raw Identification || Identification number ||
|| ** AdRelevanceZ ** || Ad Relevance || Number of tokens overlapping between query and ad, divided by number of tokens in query [1] ||
|| ** AdvAdCountZ ** || Advertiser Ad Count || Experience level of advertisers, measured by the number of ads created in the past ||
|| ** AdvDescripPerAdZ ** || Advertiser Descriptions Per Ad || Average number of descriptions per ad created by the advertiser ||
|| ** AdvAvgHitRateZ ** || Advertiser Average Hit Rate || Average ratio between the clicks and number of impressions (i.e., click through rate) for the advertiser ||
|| ** DepthZ ** || Depth || Number of ads displayed to user in a session ||
|| ** PositionZ ** || Position || The order of the ad in the impression list ||
|| ** UserClickinessZ ** || User Clickiness || Average click through rate of the user ||
|| ** AdUsrReachZ ** || Ad User Reach || Number of users that the ad was displayed to ||
|| ** AdAttractionZ ** || Ad Attraction || Average click through rate for the ad ||
|| ** ClickThruRate ** || Click Through Rate || Proportion of times that the ad is clicked relative to how often it is shown ||
|| ** UsrDescripImpressionsZ ** || User Description Impressions || Number of times a particular description is displayed for the user ||
|| ** UsrAdImpressionsZ ** || User Ad Impressions || Number of times an ad is displayed to the user ||

[1] Tokens are the individual words in the query or advertisement.
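To make the Ad Relevance definition concrete, the following is a minimal sketch of how such a ratio could be computed and then standardized. It assumes whitespace tokenization, case-insensitive matching, and that the `Z` suffix on the variable names denotes z-scores; none of these details are confirmed by the data set documentation, and the function names and sample queries are hypothetical.

```python
from statistics import mean, pstdev


def ad_relevance(query: str, ad: str) -> float:
    """Share of query tokens that also appear in the ad text.

    Assumption: tokens are whitespace-separated, lowercased words.
    """
    query_tokens = query.lower().split()
    ad_tokens = set(ad.lower().split())
    if not query_tokens:
        return 0.0
    overlap = sum(1 for token in query_tokens if token in ad_tokens)
    return overlap / len(query_tokens)


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical queries and ads, for illustration only.
pairs = [
    ("cheap running shoes", "running shoes on sale"),
    ("cheap running shoes", "discount hotel rooms"),
    ("hotel rooms", "discount hotel rooms"),
]
raw = [ad_relevance(q, a) for q, a in pairs]
standardized = z_scores(raw)
```

In this sketch the first pair shares two of the three query tokens, the second shares none, and the third shares both, so the raw ratios differ before standardization.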


**Data Stream:**

Prior to the principal components analysis (PCA), the outliers and extremes in the data set were removed. The click through rate was set as the target, and the raw identification was set as a record identifier. Additionally, the data set was partitioned, with 60 percent of the data in the training partition and 40 percent in the testing partition. The following components were found:
**Principal Components Analysis**

|| ** Component Name ** || ** Variable(s) Included in Component ** ||
|| ** Impressions ** || User Description Impressions, User Ad Impressions ||
|| ** Average Click Through Rate ** || Advertiser Average Hit Rate, Ad Attraction ||
|| ** Placement ** || Depth, Position ||
|| ** Relevance ** || Ad Relevance, Advertiser Ad Count, Advertiser Descriptions Per Ad ||
|| ** Reach ** || Ad User Reach ||
|| ** Clicks ** || User Clickiness ||
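The preparation steps described before the component table (removing outliers and extremes, then partitioning 60/40) could be sketched roughly as follows. The cutoff of three standard deviations, the fixed random seed, and the record layout are assumptions made for illustration; the report does not state the exact rule or settings that were used.

```python
import random


def remove_outliers(records, cutoff=3.0):
    """Keep records whose z-scored fields all lie within +/- cutoff.

    Assumption: the cutoff of 3 standard deviations is illustrative only.
    """
    return [r for r in records if all(abs(v) <= cutoff for v in r["z"])]


def partition(records, train_share=0.6, seed=42):
    """Shuffle the records and split them into training and testing sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_share)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical records: an ID plus a few z-scored predictor values.
rng = random.Random(0)
records = [
    {"RawID": i, "z": [rng.gauss(0, 1) for _ in range(3)]}
    for i in range(1000)
]
records.append({"RawID": 1000, "z": [0.1, 7.5, -0.2]})  # an obvious extreme

clean = remove_outliers(records)
train, test = partition(clean)
```

Removing outliers before partitioning, as in the report, means both partitions are drawn from the same cleaned set of records.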

Utilizing the components found in the PCA, the following neural network model was created:
**Neural Network Analysis**

In this neural network, the most important predictor is the reach of the ad. The model also indicates that average click through rate, clicks, relevance, and impressions are important in predicting click through rates, while placement of the ad is only slightly relevant. Judging by the lift charts, the model predicts click through rates better than random selection does, which means it is useful for prediction.
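As a reminder of what a lift chart measures, the sketch below computes lift for the top-scored share of records: the click rate among those records divided by the overall click rate. It assumes a binary clicked/not-clicked outcome, and the scores, outcomes, and function name are hypothetical; the report's charts were produced by the modeling tool, not by this code.

```python
def lift_at(scores, clicks, top_share=0.1):
    """Lift of the top-scored share of records over the overall click rate."""
    ranked = sorted(zip(scores, clicks), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_share))
    top_rate = sum(click for _, click in ranked[:n_top]) / n_top
    overall_rate = sum(clicks) / len(clicks)
    return top_rate / overall_rate


# Hypothetical model scores and click outcomes (1 = clicked).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
clicks = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at(scores, clicks, top_share=0.2)
```

A lift above 1 means the model's top-ranked records are clicked more often than records chosen at random, which is the sense in which the models in this report are "better than random."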

Utilizing the principal components analysis, the following CHAID tree was created:
**CHAID Tree**



The first variable the CHAID tree uses to split the predictions is impressions. The tree divides impressions into four segments, expressed as z-score values: less than -0.231, -0.231 to -0.081, -0.081 to 0.687, and greater than 0.687. Of these groups, the less than -0.231 segment contains 60 percent of the observations in the data set. This 60 percent of the data was further broken down using relevance, then clicks, and then relevance again. Other nodes in the impressions part of the tree were broken down by variables such as relevance, reach, clicks, and average click through rate. Of the predictors used in the model, the most important was relevance. Compared with the neural network, relevance is even more important in this model than reach was in the neural network. However, the model does indicate that reach is the next most important predictor. Judging by the lift chart, this model is better than predicting click through rates randomly.
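The first-level split can be illustrated with a small lookup that assigns an impressions z-score to one of the four reported segments. The split points come from the tree described above; how values that fall exactly on a boundary are handled is an assumption here (they are placed in the lower segment), and the function name is hypothetical.

```python
from bisect import bisect_left

# Split points reported for the impressions component (z-score units).
CUTS = [-0.231, -0.081, 0.687]
LABELS = ["< -0.231", "-0.231 to -0.081", "-0.081 to 0.687", "> 0.687"]


def impressions_segment(z: float) -> str:
    """Return the first-level CHAID segment for an impressions z-score.

    Assumption: a value exactly equal to a split point goes to the
    lower segment.
    """
    return LABELS[bisect_left(CUTS, z)]


# Hypothetical z-scores, for illustration only.
examples = [-1.0, -0.1, 0.0, 1.0]
segments = [impressions_segment(z) for z in examples]
```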

Separate from the PCA, after the outliers and extremes were removed and the partition was applied, a feature selection node was used. From this, the only predictors excluded were depth, user clickiness, and advertiser ad count. Depth and user clickiness had values close to one, so to test their significance, each of the following models was also run with them included; the results did not deviate substantially from the models without these variables. The models reported below exclude depth, user clickiness, and advertiser ad count.
**Feature Selection**

The first model used following the feature selection was the two step clustering. The following clusters were identified:
**Two Step**



From this model, three clusters were formed. Cluster three is the largest, with 60.8 percent of the data, cluster one has 29.9 percent, and cluster two has 9.3 percent. Cluster three relies on ad user reach and ad relevance primarily. Cluster one strongly relies on ad relevance and also relies on ad user reach. Cluster two relies strongly on position as well as on ad relevance.

To consider the significance of the cluster findings, the predictor importance was evaluated. In this evaluation, user ad impressions, user description impressions, ad attraction, ad user reach, position, advertiser average hit rate, advertiser descriptions per ad, and ad relevance were all strong predictors, with equal importance. Click through rate was also important, but less so than these other components. To further evaluate this model, a statistical correlation analysis was run, which shows that user description impressions, user ad impressions, and ad relevance have the strongest correlations in the model, all of which are statistically significant. However, the actual strength of these correlations is minimal.
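A correlation can be statistically significant and still be weak when the sample is large. The sketch below illustrates this with the usual test statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), and a normal approximation for the p-value. The values of `r` and `n` are hypothetical, chosen only to show the effect; the report does not state the actual correlations or sample size.

```python
import math


def correlation_t(r: float, n: int) -> float:
    """Test statistic for a Pearson correlation r from n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value using a normal approximation (reasonable for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical values: a very weak correlation in a large sample.
r = 0.02
n = 100_000

t_stat = correlation_t(r, n)
p_value = two_sided_p_normal(t_stat)
variance_explained = r * r
```

Under these assumed values the p-value falls well below 0.001, yet the correlation accounts for only a tiny fraction of the variance, which matches the pattern described above.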

The second model used following the feature selection was the Gen Lin. The following shows the results from the model:
**Gen Lin**

In this model, the beta coefficients are all small, indicating that these variables have a minimal effect on predicting click through rates. All of the coefficients are significant at the 0.001 level, with the exception of ad user reach, which is significant only at the 0.05 level. The most important predictors in the model were user description impressions and user ad impressions. Ad relevance was also important.

The third model used following the feature selection is the regression model. The following indicates the impact of the variables within the model:
**Regression**

Judging by the beta coefficients in the model, each of the variables has very little impact on the likelihood of the ad being clicked. However, for the relationships suggested by the model, all of the variables except ad user reach are statistically significant at the 0.001 level. In this model, the most important predictor of click through rates is user description impressions. User ad impressions and ad relevance are also important factors in the model.

**Conclusion**

After evaluation of the principal components analysis models, the most important predictors to use in predicting click through rates are impressions and reach. The impressions component includes user description impressions and user ad impressions, whereas reach includes ad user reach. The impressions component was only present in the CHAID tree, but its importance was far greater than that of reach in either the CHAID tree or the neural network. Reach was an important predictor in both the CHAID tree and neural network models. Both models showed significant lift on the lift charts, meaning that they are better than predicting at random.

The feature selection models included a two step clustering model, a Gen Lin model, and a regression model. The two step clustering created three clusters. Each cluster was formed primarily using relevance, and two of the clusters also relied on ad user reach. Within the two step, the most important predictors were user ad impressions, user description impressions, ad attraction, ad user reach, position, advertiser average hit rate, advertiser descriptions per ad, and ad relevance. For both the Gen Lin and the regression model, the most important predictors were user description impressions and user ad impressions. Across these three models, the common important predictors were user description impressions and user ad impressions.

Comparing the principal components analysis models with the feature selection models, the common finding is that impressions, in terms of user description impressions and user ad impressions, were the most significant variables. This is supported by all of the models used. Reach was another important variable in the PCA models, but the feature selection models mainly indicated user description impressions and user ad impressions as important.

For SOSO to incorporate these findings into its strategic marketing plans, it should repeatedly display the same advertisements, with the same descriptions, to users in order to increase ad click through rates. Judging by the models used in this study, frequency of impressions is the most important factor. Considering the lift charts previously evaluated, these models are better than predicting click through rates randomly.
