Sunday, April 28, 2013

Titanic Data Competition - Submission 8

I tried a couple of variations on my best submission to Kaggle today, but did not improve my score.

Remember that my best score (0.79904) so far is based on a logistic regression model with the following terms (a hedged R sketch follows the list):

  • male (ie gender)
  • pclass
  • fare
  • fare per person
  • Title
  • age * class
  • sex * class
  • combined age (this is age, with missing values imputed from the median age for the respective Title)
  • family (total of sibsp and parch)
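
For reference, a hedged R sketch of what this model might look like as a glm() formula. I actually fit it in SPSS; the variable names, and my reading of the interaction terms, are illustrative:

    # Logistic regression with the terms listed above. Title is a factor,
    # so it is expanded into dummy variables automatically.
    best <- glm(survived ~ male + pclass + fare + fare_per_person +
                  factor(title) + combined_age:pclass + male:pclass +
                  combined_age + family,
                family = binomial, data = train)
    summary(best)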

My first submission (31) was to remove family and add sibsp and parch. This resulted in a score of 0.79426 - slightly under my best score.

My second submission (32) was to run a backward elimination, which identified the following model:

  • pclass
  • Title
  • sex * class
  • combined age
  • family

This scored 0.78947 - not too far under the best score, but with 5 predictors rather than 9 (although the effective count will have been higher once SPSS automatically recoded Title into multiple dummy variables).

My third submission (33) was to create a new variable, "child". Using the combined age variable, cases with age less than or equal to 18 were coded 1, and cases over 18 were coded 0.

This produced a score of 0.79426.
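
For anyone replicating this in R rather than SPSS, a minimal sketch of deriving the child variable (the column name combined_age is illustrative):

    # Code passengers aged 18 or under as 1 (child), all others as 0 (adult).
    train$child <- ifelse(train$combined_age <= 18, 1, 0)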

Still haven't got past the 0.80 threshold!

Thursday, April 25, 2013

What I have learnt about statistics

I've been studying statistics part time (one unit at a time) since 2010. When I was reading in advance of my current unit (Advanced Topics in Regression), I realized that some parts of statistics were starting to make sense:

  1. Residuals are important. I doubt that I would have taken notice of residuals before, except perhaps to check that the standardized residuals were mostly under 2 or 3 standard deviations. Now I see how important they are for model fitting. They tell you for which cases (and combinations of predictors) the model doesn't work.
  2. Choosing the right statistical method to match your research question and data is an important skill. There are techniques other than Anova and Linear Regression.
  3. Feature and variable selection is important.
  4. Exploratory data analysis is important.
  5. Regression is everything: Anova = Regression = Machine Learning. There is a unified approach.
  6. Regression is a social construct.

Since participating in the Titanic data competition, I'd add a few more things that I've learnt are important:

  1. Cross validation is an important technique.
  2. Understanding a programming language is an important skill if you want to automate the data processing part (Excel is a slow way to process data, create new variables, etc.).

Titanic Data Competition - Submission 7

Best score yet yesterday.

First, I had to try to replicate my previous best score - I need to take more care in documenting my models. In the end, I found I had saved the SPSS output, and that had the model details.

My previous best score was 0.78947

I then retried the best-score model, just substituting regression age for combined age. This scored 0.77512.

Next, I added family to the independent variables (family is the total of sibsp and parch). This moved me up 211 places, to position 124. The model scored 0.79904. The SPSS classification table had 83.2% correctly classified.

Next, I added adj cabin as an independent variable. While the SPSS classification table showed 84.2% correctly classified, this model scored only 0.78469.

Finally, I took out adj cabin and added Age_Sex_Class. This 3-way interaction had 84% correctly classified as per SPSS, but scored 0.79426 - which, whilst not my best score overall, was better than the previous day's best.

The challenge now is to move my score into the 0.80000 range!

Using Regression To Predict Age

In a previous post, I mentioned that I had used a regression model to predict age where age was missing. Whilst the R squared was higher than for a model using the median age based on title, it resulted in worse predictions.

I've now plotted the standardized differences between observed age and regression age, by age. There is obviously something going on that I don't understand, as there is a clear linear relationship - the standardized difference increases as age increases. We are under-predicting the age of older people (those over, say, 30) and over-predicting the age of younger people.

Not sure what's going on here.
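For reference, a minimal R sketch of the kind of plot described above, assuming columns observed_age and regression_age (the names are illustrative):

    # Standardized differences between observed and regression-predicted age,
    # plotted against observed age.
    d <- df$observed_age - df$regression_age
    z <- (d - mean(d)) / sd(d)
    plot(df$observed_age, z,
         xlab = "Observed age", ylab = "Standardized difference")
    abline(h = 0, lty = 2)  # a flat band around 0 is what unbiased predictions give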

Logistic Regression

In this post I wanted to look at Logistic Regression, which is the statistical technique I've used so far to produce predictions for the Titanic Competition.

Logistic regression is a procedure commonly used when predicting a binary outcome from a set of continuous or categorical predictors. Other types of logistic regression allow for ordinal and multinomial outcomes.

Advantages of Logistic Regression

  • relatively free of restrictions
  • capacity to analyse a mix of continuous, discrete and dichotomous variables

Practical Issues

  • Need a sufficient number of cases relative to the number of variables. SAS has a logistic regression method which is not sensitive to data sparseness.
    • This is not a problem with the Titanic data.
  • Need sufficient expected frequencies in each cell. The usual rule of thumb is that logistic regression is not reliable if more than 20% of cells have expected frequencies less than five, or if there are expected frequencies less than one. The solution in these cases is to collapse categories for variables with more than two levels, and / or to accept lower power.
    • This is not a problem, at least at the moment, with the Titanic data.
  • Logistic Regression assumes a linear relationship between continuous predictors and the logit transform of the dependent variable. I'll cover this assumption in a separate post.
  • Absence of multicollinearity. This refers to the situation where two independent variables are highly correlated. If two variables were perfectly correlated, then the second variable would not add any additional information to the model, and would be redundant. In the Titanic data set, imagine a model that already contains sibsp and parch: a new variable "family", defined as the total of sibsp and parch, would be an exact linear combination of those two, and would add no additional information.
  • Independence of errors. Logistic regression, as is the case with most other forms of regression, assumes that the responses of different cases are independent of each other. This assumption is arguably violated with the Titanic dataset, as passengers were related by family, nationality, and class.
    • According to Tabachnick and Fidell (2013), the impact of non-independence in logistic regression is to produce overdispersion. This is when the variability in cell frequencies is greater than expected by the underlying model.
    • The authors indicate that this results in an inflated Type I error rate for tests for predictors. The suggested remedy is to undertake multilevel modelling.
    • I'm not sure at this stage how this relates to the Titanic data, how it might impact the prediction results, and what, if anything, I should do.
  • Absence of outliers in the solution. Outliers are cases not well predicted by the model. A case that is actually in one category of the outcome may show a high predicted probability of being in the other category.

Other Issues

  • Logistic regression requires categorical variables to be converted into dummy / indicator variables. This is not a problem with SPSS (and other programs), as SPSS automatically creates new variables for variables declared as categorical (see the sketch after this list).
  • Norusis outlines four diagnostic checking areas for logistic regression:
    • Is the relationship between the logit and continuous variables linear?
    • How well does the model discriminate between cases that experience the event and cases that do not (model discrimination)?
    • How well do predicted probabilities match observed probabilities over the entire range of values (model calibration)?
    • Are there unusual cases?
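
For those working outside SPSS, a minimal R sketch of the dummy coding mentioned above (the title column is illustrative):

    # Declaring title as a factor makes R create dummy (indicator)
    # variables automatically inside model formulas.
    train$title <- factor(train$title)
    # model.matrix() shows the dummy columns a formula would generate.
    head(model.matrix(~ title, data = train))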

Sunday, April 21, 2013

Titanic Data Competition - Predicting Age for Age = Missing Value

Age is a strong predictor of survival (or otherwise); therefore, with a relatively large number of missing values for age, it makes sense to try and predict an age value as accurately as possible. One suggestion on the competition discussion forum was to interpolate based on average ages for the applicable title. For example, the average age for people with the title "Miss" was 21.00.

I've used a regression model to attempt to predict age more accurately.

Using a regression model with indicator variables for the different titles, we achieve an R squared of 28.6%.

If we include basically all the other variables in the regression, the R squared increases to 43.7%.

Using backward elimination, a model with fewer variables had an R squared of 43.3%.

It will be interesting to see if and how this improves the predictions of a logistic regression model.

The final model was:

Predictors: (Constant), fare_per_person, cabin_G, cabin_F, Embarked_Q, Title_Other, Title_Miss, cabin_Y, Title_Master, Embarked_C, sibsp, male, fare, pclass, Title_Mr
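
A hedged R sketch of fitting a model along these lines, assuming the predictors listed above already exist as columns in the training data frame:

    # Linear regression predicting age from the predictors listed above,
    # then imputing age where it is missing.
    age_model <- lm(age ~ fare_per_person + cabin_G + cabin_F + Embarked_Q +
                      Title_Other + Title_Miss + cabin_Y + Title_Master +
                      Embarked_C + sibsp + male + fare + pclass + Title_Mr,
                    data = train)
    summary(age_model)$r.squared          # compare with the 43.3% above
    missing <- is.na(train$age)
    train$age[missing] <- predict(age_model, newdata = train[missing, ])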

I will do some more work on the residuals and other statistics. For example, the "all variables in the model" version had a maximum Mahalanobis value of 712; under the simpler model produced by backward elimination, the Mahalanobis value of this particular case reduced to 6.7, and the maximum value was now 218.

Titanic Data Competition - Submission 6

Five submissions today, and my best result so far. I improved 766 positions today - I'm now equal 334th.

Today's submissions were:

  • Binomial logistic regression, based on my previous best variables - gender, pclass, fare, and age (with missing values replaced by age imputed from the title median) - plus additional variables: the age & class interaction, the class & gender interaction, fare per person, and title.

          Score: 0.68900 - not an improvement. However, in this model I had inadvertently classified the age & class interaction as categorical.

  • Same as above, but without coding the age & class variable as categorical.

          Amazingly, this improved my score by 430 positions, to 0.77990.

  • Same as above, but with the cut point changed to 0.59.

          This resulted in a lower score: 0.77512.

  • Same as above, but with the cut point changed back to 0.5. I also removed more variables from being coded as categorical.

          This resulted in a further improvement and my best score to date: 0.78947. I moved up the public leaderboard by 336 places.

  • Same model as above, but using multinomial logistic regression. Factors and covariates were correctly coded, so this may have resulted in a lesser result. In future, I will see if I can code some of the factors as covariates to see what impact this has.

Titanic Prediction Competition - Where Am I Up To?

I wanted to stop for a moment, reflect on what approaches I've been taking with the Titanic Prediction Competition, and plan how I might tackle this competition in future.

This competition fits in nicely with my studies - I'm enrolled in the Master of Science (Applied Statistics) programme at Swinburne University (Melbourne, Australia). I am at the two-thirds mark, and the unit I'm currently doing (Advanced Topics in Regression) is directly relevant to this competition.

Up till now, the course has focussed primarily on what I'd call standard statistics - linear regression with normal distributions. Advanced Topics in Regression takes us beyond this and introduces us to modelling techniques that can be used when the assumptions of normality and linearity don't apply. These techniques include:
  • transforming either predictor and / or response variables
  • creating new predictors (eg: polynomial terms, indicator variables, interactions)
  • piecewise regression
  • non-linear regression
  • weighted least squares regression
  • loglinear analysis
  • generalised linear models
  • binary / ordinal / multinomial logistic regression
  • multilevel regression

But knowing new modelling techniques is not the only advantage of being two-thirds of the way through an Applied Statistics course. I'm also developing what might be called an intuitive understanding of what data analysis is all about:

  • becoming better at defining and re-defining your research question
  • knowing which analysis technique is the most appropriate for your data and research question (what are the data assumptions of the method, does the data at hand meet those assumptions, and what are the implications when the assumptions are violated?)
  • understanding that residuals are important
  • knowing that selection of predictors is important (and this includes extracting new variables from the data you start with)
  • knowing the importance of exploratory data analysis

So, given the knowledge gained from my course, competitions like this one provide valuable experience:

  • relatively small data set - this means the challenges are analytical rather than the programming / computer science challenges of handling big data; if my programming skills are not up to it, I can still do something manually (in Excel)
  • real world issues of missing data  - the data sets you deal with, even at grad school, tend to be neatly packaged and designed to illustrate the concept / technique being studied.
  • benchmarking - Kaggle tells you where you stand in comparison to others. Based on my current position (1098, with a score of 0.77512), there are a lot of people who have extracted a lot more information out of the data set than I have.

My aim is initially to use the techniques I'm learning in Advanced Topics in Regression:

  • Cross tab / chi square
  • Log linear analysis
  • binary and multinomial logistic regression / robust logistic regression
  • generalised linear regression

Next, I'm going to try out the techniques covered in my next unit (Statistical Marketing Tools), which covers data mining tools.

Finally, writing posts like this forces me to give a bit of thought to what I am doing.

Exploratory Data Analysis

Exploratory data analysis is a two-step process - it means checking the accuracy of your data set, and it means understanding your data set.

Here, we don't need to worry about accuracy. Kaggle have provided a data set that is internally consistent. That's unlike many real world data sets. There you're faced with transcription and coding errors, out of range values (negative or zero ages, incomes with a decimal point in the wrong place) and such like.

So with the Titanic data set, our aim is to understand the data.

For myself, the key parts of exploratory data analysis are to look at each variable as follows (a short sketch follows the list):

  1. Produce a histogram or bar plot to look at the distribution of values
  2. Crosstabulation / chi square analysis with the dependent variable (or, alternatively, a logistic regression with just the variable in question as the predictor)
  3. Summary report (like the Frequencies or Descriptives reports in SPSS)
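
A minimal R sketch of these three steps for a single variable, say pclass (assuming the usual train data frame):

    # 1. Distribution of values
    barplot(table(train$pclass))
    # 2. Crosstabulation / chi square against the dependent variable
    tab <- table(train$pclass, train$survived)
    chisq.test(tab)
    # 3. Summary report
    summary(train$pclass)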

Missing Values

One piece of information that comes from exploratory data analysis is whether there are any missing values for a variable.

With the Titanic data set, there are several variables with missing values, including:

  • age
  • cabin

Age

There are 177 missing values for age.

The discussion forum on the competition website has had a fair amount of comment devoted to handling the missing age values. Age is a significant predictor of survival, and therefore it makes sense to impute as accurate an age as possible where it is missing.

The methods of interpolating the missing values for age include

  1. If age is missing, predict non-survival
  2. Substitute the overall average age
  3. Use the average age for each title group (title = Mr, Mrs, Master, Miss, etc.)

It's also possible to substitute some other measure of central tendency, usually the median. I also think it is worth seeing if a regression model can be built to predict age from all the other variables (excluding survival). A hedged sketch of method 3 follows.
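
A minimal R sketch of the title-based imputation, using the median (swap in mean if preferred; column names are illustrative):

    # Impute missing ages with the median age for the passenger's title.
    medians <- tapply(train$age, train$title, median, na.rm = TRUE)
    missing <- is.na(train$age)
    train$age[missing] <- medians[as.character(train$title[missing])]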

Finally, one could manually interpolate age by looking at the position of the individual in a family. If a person travelled as part of a family, then it may be possible to "guess" their age based on who the other family members are (e.g., if a person travelled with a spouse or sibling, that would potentially give a reasonable indication of the person's age).

Cabin

One contributor to the discussion forum suggested the following method to replace missing cabin values:

  • replace missing value with passenger class

There are 687 missing values for cabin - so the values that are present would arguably not provide a lot of information.


Feature Selection and Construction

Selecting the right variables, and extracting all predictive features from the data set is important.

Ways to improve the model could include (a sketch of a few of these constructions follows the list):

  • Use backward elimination to build model.
  • age and pclass are not linear predictors of survival - how best to construct an interaction term
  • extract title and use as a variable
  • investigate if there are other interactions present
  • construct fare per person variable, and look how this compares with total fare variable
  • which procedures require dummy / indicator variables for analysis
  • what information does the cabin variable provide (particularly with so many missing values)
  • do I need to normalize continuous variables
  • can these methods be used in an ensemble manner
  • with decision cut points, how to model the effect of different cuts on results
  • would any variables benefit from squaring or cubing etc
  • create a family variable
  • is it worth using name to create ethnicity variable
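
A hedged R sketch of three of these constructions - title extraction, family, and fare per person (column names are illustrative):

    # Extract the title from names like "Braund, Mr. Owen Harris"
    # (multi-word titles such as "the Countess" would need extra handling).
    train$title <- sub(".*, *(\\w+)\\..*", "\\1", train$name)
    # Family size, and fare per person in the travelling group.
    train$family <- train$sibsp + train$parch
    train$fare_per_person <- train$fare / (train$family + 1)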

Monday, April 15, 2013

Titanic Data - Submission 5

Submission 5 (which I actually submitted on the same day as submission 4) has shown me that a logical approach is worth pursuing.

I looked at my previous best submission, which used:

  1. gender
  2. pclass
  3. fare
  4. age - with survival coded as zero where age was missing.

With submission 5, I followed a suggestion another competitor made on the forum, and used the title in the name to calculate an age.

This submission scored 0.77512, which was an improvement of 0.00957 over my previous score. I am now ranked 1068 out of 2449, having improved 359 positions!!

Titanic Data Competition - Submission 4

Submission 4 is actually three submissions (a, b and c), and it's taught me that I need to go back to basics - all three submissions scored worse than one of my first submissions, which was a simple model comprising:

- gender, pclass, age, with survival = 0 where age was missing.

Clearly not an original model!


These are the variables I tossed into today's submissions:

  1. pclass
  2. sex
  3. Age missing
  4. combined age - which substituted an imputed age where age was missing.
  5. sibsp
  6. parch
  7. family - which is the total of sibsp and parch
  8. fare
  9. log fare
  10. adjusted cabin - which takes the first letter of the cabin, and substitutes X, Y or Z for missing values (where X, Y and Z represent 1st, 2nd and 3rd class respectively; see the sketch after this list)
  11. cabin missing
  12. embarked
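
A minimal R sketch of the adjusted cabin construction (column names are illustrative):

    # Adjusted cabin: first letter of the cabin, with X / Y / Z standing in
    # for missing values in 1st / 2nd / 3rd class respectively.
    letter <- substr(train$cabin, 1, 1)
    no_cabin <- is.na(train$cabin) | train$cabin == ""
    train$adj_cabin <- ifelse(no_cabin, c("X", "Y", "Z")[train$pclass], letter)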

Submission 4b included the same variables. The difference was that with this submission I experimented with changing the cut point, and ended up going with 0.59, which gave me the best classification score in SPSS.

Submission 4c included the same variables as the first two, but in this case I treated combined age and fare as categorical variables.

These models should have resulted in a great score according to the SPSS classification table (version c correctly classified 92.7% according to SPSS). The problem, no doubt, is that this figure is calculated on the training set, not the test set (I've not split the training set into training / test portions).

So future strategy has to be to go back to the simple model referred to above, and build from there.
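
In the meantime, a minimal R sketch of the kind of holdout split that would give a more honest accuracy figure than the SPSS classification table (the 70/30 proportion, seed, and column names are illustrative):

    # Hold out 30% of the training data to estimate out-of-sample accuracy.
    set.seed(42)
    idx <- sample(nrow(train), floor(0.7 * nrow(train)))
    fit <- glm(survived ~ sex + pclass + fare, family = binomial,
               data = train[idx, ])
    p <- predict(fit, newdata = train[-idx, ], type = "response")
    mean((p >= 0.5) == (train$survived[-idx] == 1))  # holdout accuracy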

Thursday, April 4, 2013

Kaggle Titanic Competition - Submission 3

Submission 3 gave me a surprise - it didn't produce my best score to date!

Submission 3 was supposed to be an improvement on submission 2 - where I had used a logistic regression to produce predictions using age, gender, pclass and fare as predictors. This model generated missing predictions where age was missing. In this case, I merely replaced the missing predictions with zeros - this made sense as a majority of Titanic passengers did not survive.

Submission 3 used a secondary model to generate predictions where the main model produced a missing prediction. The secondary model was a logistic regression with gender, pclass, and fare.

I'm surprised that this "thoughtful" model didn't outperform the somewhat arbitrary model from submission 2.

The kaggle scores were 0.74641 (submission 3) and 0.76555 (submission 2).

Other misc observations:-

  • the test set has one missing value for fare. The training set may also have had effectively missing fare values - cases where the fare was recorded as 0.
  • the primary and secondary models mentioned above generated very similar predictions - excluding missing predictions, there were only 28 (out of 418 cases) where the predictions differed.

Where to next? Firstly, I need to spend some more time producing a full training data set, with all the indicator and / or generated variables that I consider worth working with. That way I can produce a more nuanced logistic regression model. I also need to understand logistic regression in more detail. For example, in my current models, I've used prob = 0.5 as the threshold for predicting survival versus non-survival. I need to see if altering the threshold improves the predictions. Maybe other flavours of logistic regression will produce a better result.
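
On the threshold point, a hedged R sketch of comparing cut points, assuming a vector of predicted probabilities p and observed outcomes y (ideally computed on held-out data rather than the training set):

    # Accuracy at a range of classification thresholds.
    thresholds <- seq(0.30, 0.70, by = 0.05)
    acc <- sapply(thresholds, function(t) mean((p >= t) == (y == 1)))
    data.frame(threshold = thresholds, accuracy = acc)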

Wednesday, April 3, 2013

Kaggle Titanic Competition - Submission 2

My second submission was a bit of an experiment. I'd just learnt about logistic regression in class, and I wanted to try the technique out on the Titanic data set.

I had to use SPSS on this occasion; I couldn't get logistic regression in R to work for me. For some reason, the predict() function generated 891 predictions (the number of cases in the train data set), whereas I wanted it to generate 418 predictions - the number of cases in the test data set.
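
In hindsight, the likely cause is that predict() returns fitted values for the training data when no newdata argument is supplied. A minimal sketch (the column names are illustrative and may differ from the competition CSVs):

    # Fit on the training set...
    model <- glm(survived ~ pclass + sex + age + fare,
                 family = binomial, data = train)
    # ...then pass the test set explicitly. Without newdata, predict()
    # returns the 891 fitted values for the training cases.
    p <- predict(model, newdata = test, type = "response")
    length(p)  # 418 - one prediction per test case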

It was an interesting experiment (for a novice like myself).

Using the Binary Logistic function in SPSS, I set "survived" as the dependent variable and "pclass", "sex" , "age" and "fare" as covariates or independent variables.

The problem I encountered with this is that logistic regression (at least in SPSS) won't generate a predicted group membership value if there is a missing value, and the test data set has 86 missing "age" values. This doesn't cause a problem (?) when generating a model, as the default in SPSS is (I think) case-wise deletion.

As this was an experiment, I decided to submit two entries:
- first replacing the missing predictions with 0
- then replacing the missing predictions with 1

The first submission generated the following message from Kaggle :

You improved on your best score by 0.13876.
You just moved up 857 positions on the leaderboard.

Using the above logistic regression and substituting missing predictions with "0" scored 0.76555, up from 0.62679 from my first submission.

Interestingly, this was the same score as the default gender-based model:

                    If the passenger is female then survives, if not then does not.

I then replaced missing predictions with "1" - this scored poorly at 0.64115.

My observations:

- perhaps a relationship between missing age and survival.
- importance of gender in survival.

My next steps are to:

- be able to generate predictions using logistic regression in R
- develop a more sophisticated logistic regression model


In particular, I'd like to see what sort of result I can get from a model that includes age, but defaults to a model that excludes age if age is missing.