Project 1

Project 1 is due 10-27 at 11:59 PM. This is an individual project, but, as with the homework, you can ask others for advice directly and in the Slack channel.

Project Objective

The goal is to develop an understanding of how individuals approach data science projects, seeing the entire process from exploratory data analysis to modeling and evaluation.

Project Selection

Kaggle hosts a large number of projects that would make excellent selections. The following projects are currently approved.

Easier
  1. Porto Seguro’s Safe Driver Prediction
  2. Zillow Prize: Zillow’s Home Value Prediction (Zestimate) (It seems this is now closed, but it is a good choice if you got in early.)
  3. House Prices: Advanced Regression Techniques (This is a beginner dataset, but it could work for us.)

Advanced
  4. Web Traffic Time Series Forecasting (This is an advanced project.)

A list of solutions can be found here.

If there is another Kaggle contest you would like to do, just ask in the #projects Slack channel. Make sure that existing solutions (kernels) are available and that the project has business relevance. Avoid image-based data and projects that only require visualizations.

Deliverables

Your goal is to develop a 6-page report (1-inch margins, single-spaced) on some initial explorations of a Kaggle project.

  1. Executive Summary. This should be a 1-page summary, in your own words, of the problem, data, and findings.
  2. Data description and initial processing. This section should include a basic characterization of the data. You should run and report basic statistics on the data and generate 3 visualizations. You can review other kernels to understand different approaches to the data, but in this section you are required to generate all analyses yourself (3 pages).
  3. Modeling and evaluation of 3 other solutions. Identify 3 Kaggle solutions completed by others. You can do this by selecting the project and then clicking on the link to Kernels. Summarize the features, modeling approach, and performance in a table. Then do some further research to comment on (2 pages)
  4. Appendix. Link to a notebook containing all exploratory data analysis code from part (2) as well as each of the original solutions of (3).
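As a starting point for the basic statistics in part (2), here is a minimal sketch of a field-by-field summary. It makes several assumptions: the sample data and the column names (`lot_area`, `year_built`, `sale_price`) are made up for illustration and do not come from any of the approved projects, and the `summarize` helper is hypothetical. It uses only the Python standard library so it runs anywhere; in a notebook you would more likely load the real training file with a data analysis library and use its built-in summary tools.

```python
import csv
import io
import statistics

# Made-up sample standing in for a Kaggle training file.
# Column names and values are illustrative assumptions only.
SAMPLE_CSV = """id,lot_area,year_built,sale_price
1,1000,1990,100000
2,2000,2000,150000
3,3000,,200000
4,4000,2010,250000
"""


def summarize(rows, column):
    """Return count, missing count, and basic statistics for one numeric column."""
    raw = [row[column] for row in rows]
    # Treat empty strings as missing values and skip them in the statistics.
    values = [float(v) for v in raw if v.strip() != ""]
    return {
        "column": column,
        "count": len(values),
        "missing": len(raw) - len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = [summarize(rows, col) for col in ("lot_area", "year_built", "sale_price")]
for stats in summary:
    print(stats)
```

Reporting the missing count next to each statistic makes it easier to justify any cleaning steps you describe in the report.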

For the Appendix, you are welcome to provide links however you want, but I have set up a private GitHub repository here: https://classroom.github.com/a/maQB6ufJ.

Project Evaluation Metrics

The rubric below describes an ideal project. Projects will be evaluated subjectively by the instructor according to this rubric.

  • Formatting (10 points). The student presented the report in a format that indicated professionalism and care in the organization, writing, and presentation of the overall report.

  • Executive summary (20 points, 1 page). The student presented the problem and its context in a way that is rich and interesting. They demonstrated a strong understanding of the problem context and of the specific modeling problem itself. The data, the data exploration, and the modeling/evaluation are clearly represented.

  • Data description and initial processing (40 points, 3 pages). The student clearly presented an overall picture of the data using techniques presented in the class. This includes the basic structure and field-by-field descriptions, as well as visualizations and basic statistics. Where necessary, they have adequately used techniques for cleaning the data or generating new features.

  • Modeling and Evaluation (30 points, 2 pages). There is a clear, insightful comparison of approaches, and the predictive characteristics of the different models are clearly compared in a table with appropriate conclusions. Outside resources are consulted in the description of specific algorithms where relevant.
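The comparison table for the three reviewed solutions could be assembled along these lines. This is only a sketch under stated assumptions: the kernel names, features, models, and scores are invented placeholders (replace them with what you actually find in the kernels), `format_table` is a hypothetical helper, and the sort assumes a metric where higher is better, which will not hold for error-based metrics.

```python
# Placeholder entries: every name and score below is invented for illustration.
solutions = [
    {"kernel": "Kernel A", "features": "raw numeric columns",
     "model": "linear regression", "cv_score": 0.71},
    {"kernel": "Kernel B", "features": "one-hot categoricals",
     "model": "random forest", "cv_score": 0.78},
    {"kernel": "Kernel C", "features": "engineered interactions",
     "model": "gradient boosting", "cv_score": 0.82},
]


def format_table(records, columns):
    """Render a list of dicts as an aligned plain-text table."""
    # Each column is as wide as its longest cell, including the header.
    widths = {c: max(len(c), *(len(str(r[c])) for r in records)) for c in columns}
    header = " | ".join(c.ljust(widths[c]) for c in columns)
    rule = "-+-".join("-" * widths[c] for c in columns)
    body = [" | ".join(str(r[c]).ljust(widths[c]) for c in columns) for r in records]
    return "\n".join([header, rule, *body])


columns = ["kernel", "features", "model", "cv_score"]
# Assumes higher scores are better; reverse the order for error metrics.
ranked = sorted(solutions, key=lambda r: r["cv_score"], reverse=True)
table = format_table(ranked, columns)
print(table)
```

Keeping one row per solution and one column per attribute (features, modeling approach, performance) mirrors what the rubric asks the table to compare.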

Project Submission

  • The project is to be submitted to the LMS.

NOTE: Copying and pasting from the Kaggle description is plagiarism. If you do so, you will be reported to the Associate Dean’s office and receive a 0 on the project.