Machine Learning Project

Goal : The purpose of this project is to learn to use a real-world machine learning library of your choice and apply it to some data that interests you. Unlike in data mining, where the goal is often just to explore the data and look for patterns, this project should focus on determining if and how a set of features can be used to predict another feature (this assumes you’re doing supervised learning, though unsupervised is possible as well).

Guidelines : 1-3 people per group.

Libraries : You may use any modern machine learning library. Some of the ones I suggest are:

PyTorch (very popular, lots of algorithms, including neural nets)
TensorFlow (mostly neural nets, though PyTorch is more popular now)
OpenCV (computer vision + ML algorithms)
Keras (neural nets)
Scikit-Learn (used in class, lots of algorithms)

The project is extremely open-ended. It should consist of the following:

Find or collect a data set of interest. There are many sources on the web for data sets. I would prefer the data set to be reasonably large, but really large data sets can bog down computers. A lower limit for data size should be 100 training examples, though in special circumstances you might get away with something lower (run it by me).
Formulate at least two questions you would like to answer from your data, in the form of predicting some variable from other variables.
Using the machine learning library, train at least two machine learning models per question, for a total of at least four trained models.
For each model, you should evaluate how well it does. There should be a training set and a testing set, and you should report how well your models perform.
What conclusions can you draw?
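
As a rough illustration of the train/evaluate loop described above, here is a minimal, library-free sketch in Python. Everything in it is an assumption made for illustration: the synthetic two-class data, the helper names (make_toy_data, MajorityClass, OneNearestNeighbor, accuracy), and the 75/25 split. In your project, the library's own tools replace these helpers (Scikit-Learn, for example, provides train_test_split in sklearn.model_selection), and your real data replaces the toy data.

```python
import random
from collections import Counter


def make_toy_data(n=200, seed=1):
    """Hypothetical data: two numeric features, one binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 3.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        rows.append((features, label))
    return rows


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


class MajorityClass:
    """Baseline model: always predict the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneNearestNeighbor:
    """Predict the label of the closest training example."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in self.X_]
            preds.append(self.y_[dists.index(min(dists))])
        return preds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rows = make_toy_data()
train, test = train_test_split(rows)
X_train, y_train = [f for f, _ in train], [l for _, l in train]
X_test, y_test = [f for f, _ in test], [l for _, l in test]

# Train two models for the same question and score each on both sets.
results = {}
for name, model in [("majority baseline", MajorityClass()),
                    ("1-nearest neighbor", OneNearestNeighbor())]:
    model.fit(X_train, y_train)
    results[name] = (accuracy(y_train, model.predict(X_train)),
                     accuracy(y_test, model.predict(X_test)))
    print(f"{name}: train={results[name][0]:.2f} test={results[name][1]:.2f}")
```

Reporting both numbers matters: the nearest-neighbor model scores perfectly on its own training data by construction, so only the test score says anything about how well it generalizes. Comparing against a trivial baseline is one way to judge whether a model has learned anything at all.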

Data

There are lots of data sets available online. Pick something that you will enjoy working on, and something where there is a rich source of data available. Take some time in selecting a good data set - feel free to ask me for suggestions.

A nice selection of data sets is at the KDNuggets website.
The University of California at Irvine has put together a large repository of data sets for machine learning.
Another repository of data sets is at the University of Edinburgh.
Statlib is a general repository for all things statistical, and it has a nice collection of data sets.
Kaggle - This site hosts data mining/ML competitions. Each competition comes with a data set. You can access most datasets without taking part in the competition, but feel free to submit your results if you’re so inclined.
KDD Cup Datasets - KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
Baseball/Basketball statistics - there are a number of repositories for this type of data.
US Government datasets - tons of census, voting, demographics datasets available.
- https://catalog.data.gov/dataset/
- https://data.cdc.gov/
Memphis Data Hub: (https://data.memphistn.gov/) datasets collected from the Memphis area about public safety and community resources.
https://healthcaresummit.ieee.org/data-hackathon/ieee-covid-19-public-health-informatics-challenge/
COVID-19 dataset
https://github.com/fivethirtyeight/data - 538 is a popular interactive news and sports site that makes the data sets used in its articles available online.
https://github.com/awesomedata/awesome-public-datasets
Open-source dataset that contains topic-centric public data. Collected and sorted from various blogs, answers, and user feedback, it combines free and paid data sets on physics, sports, software, natural language, and machine learning.
https://data.unicef.org/resources/resource-type/datasets/
Datasets collected by UNICEF from all over the world.
https://datasetsearch.research.google.com/ - Google Dataset Search is a search engine for data sets. Try it out if there’s specific data that you’re looking for.

Use of AI to write code

I am experimenting with a new policy for this project: I will let you use AI-generated code for this final project (no other projects), but if you do so, you must understand your code well enough to explain it, line-by-line, to me.

Many machine learning frameworks have so many options and parameters that it can feel daunting to get started. Therefore, I want you to take advantage of AI-generated code, which can be very good for getting started with a new project. What I suggest is picking a dataset and the features you want to use as input and output, and then asking an LLM, “What machine learning model should I use for this?” Then, once you decide on something, you can have it generate code for you to load the data and run/train the model.

Here is the catch: The goal of this project is for you to learn about real-world machine learning, with or without the LLM. So if you choose to do this, you must understand the code that is being generated, and the decisions that went into it. You can do this by iterating on the code with the LLM to understand why it is doing what it’s doing. For instance, question every parameter it makes up, every model it selects, every option it generates, every line of code it writes. I want you to understand the code as well as if you wrote it yourself.

One part of your grade will be an “oral quiz” with me during office hours where you will explain your code and what it does, along with the choices you made leading up to using it. The goal of this is not to trip you up or catch you off-guard, but rather to illustrate that you truly understand the code the LLM has generated for you.

You can expect I will ask you questions about why you chose a certain model, why you chose certain parameters, what the meaning of one or more sections of code is, and how to interpret the output of the model.

Proposal (due on Canvas at 11:59pm, April 11)

Your ML project proposal should be 1 page and contain the following elements:

List of group members.
What data set is being used - where does the data come from, and what are some characteristics of it (number of features, number of training examples, types of attributes).
Is there a reason you picked this data set? Tell me.
What are the question(s) of interest - be specific. Tell me the variables you’re going to predict, and which variables you will use as your input features.
What machine learning algorithms do you plan to use - understanding that this might change.
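
If it helps with the data set characteristics asked for above, the following is a minimal sketch of one way to count examples, features, and attribute types in a CSV file using only Python's standard library. The toy data, the column names, the target column "bought", and the helper names infer_type and summarize are all hypothetical, and the numeric/categorical rule is a deliberate simplification (a column counts as numeric only if every non-empty value parses as a float).

```python
import csv
import io

# Hypothetical toy data standing in for your real CSV file.
TOY_CSV = (
    "age,city,income,bought\n"
    "34,Memphis,52000,yes\n"
    "29,Nashville,,no\n"
    "41,Memphis,61000,yes\n"
)


def infer_type(values):
    """Return 'numeric' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "categorical"
    return "numeric"


def summarize(csv_file, target):
    """Count examples, features, attribute types, and missing values."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    columns = reader.fieldnames or []
    # Short rows yield None for absent fields, so normalize to "".
    column_values = {c: [(r[c] or "") for r in rows] for c in columns}
    return {
        "n_examples": len(rows),
        "n_features": len([c for c in columns if c != target]),
        "types": {c: infer_type(v) for c, v in column_values.items()},
        "n_missing": {c: sum(1 for x in v if x.strip() == "")
                      for c, v in column_values.items()},
    }


summary = summarize(io.StringIO(TOY_CSV), target="bought")
for key, value in summary.items():
    print(f"{key}: {value}")
```

To run this on a real file, pass an open file object (for example, open("your_data.csv", newline="")) instead of the io.StringIO wrapper. The missing-value counts are worth including in the proposal too, since they affect how many usable training examples you actually have.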

While I’d like you to think through your plan carefully, please understand that this is a proposal, and nothing you write is set in stone.

Final Deliverables

Details are provided here for presentation, oral quiz, and final report.

Lightning presentations: In class on Thursday, May 1. 4 minutes per presentation. Send me visuals, PowerPoint slides, PDFs, etc., in advance.
Oral quiz: Will take place at times arranged during finals week (or before, if you are ready). You shouldn’t do the quiz until you are done with the project (or at least have all the code written).
Final report: Due on Canvas by Friday, May 9, 11:59pm.