Machine Learning Project
Goal: The purpose of this project is to learn to use a real-world machine learning library of your choice and apply it to some data that interests you. Unlike in data mining, where the goal is often just to explore the data and look for patterns, this project should focus on determining whether and how a set of features can be used to predict another feature (this assumes you’re doing supervised learning, though unsupervised is possible as well).
Guidelines: 1-3 people per group.
Libraries: You may use any modern machine learning library. Some of the ones I suggest are:
- TensorFlow (mostly neural nets)
- OpenCV (computer vision + ML algorithms)
- Keras (neural nets)
- PyTorch (mostly neural nets)
- Scikit-Learn (used in class, lots of algorithms)
The project is extremely open-ended. It should consist of the following:
- Find or collect a data set of interest. There are many sources on the web for data sets. I would prefer the data set to be reasonably large, but really large data sets can bog down computers. A lower limit for data size should be 100 training examples, though in special circumstances you might get away with something lower (run it by me).
- Formulate at least two questions you would like to answer from your data, in the form of predicting some variable from other variables.
- Using the machine learning library, train at least two machine learning models per question, for a total of four models trained.
- For each model, you should evaluate how well it does. There should be a training set and a testing set, and you should report how well your models perform.
- What conclusions can you draw?
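The train/evaluate steps above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn is the chosen library and using its bundled breast cancer data set as a stand-in for your own data. The function name `evaluate_models`, the two model choices, and the 25% test split are assumptions made for the example, not requirements of the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(random_state=0):
    """Train two classifiers for one question; return train/test accuracy."""
    # Stand-in data (an assumption): replace with your own features X and target y.
    X, y = load_breast_cancer(return_X_y=True)

    # Hold out 25% of the examples as a testing set, keeping class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )

    # Two different models for the same prediction question.
    models = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(
            max_depth=4, random_state=random_state
        ),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            # Accuracy on data the model has seen (optimistic).
            "train_accuracy": model.score(X_train, y_train),
            # Accuracy on held-out data (the number to report).
            "test_accuracy": model.score(X_test, y_test),
        }
    return results


if __name__ == "__main__":
    for name, scores in evaluate_models().items():
        print(
            f"{name}: train={scores['train_accuracy']:.3f} "
            f"test={scores['test_accuracy']:.3f}"
        )
```

Comparing the training and testing accuracy for each model is one simple way to spot overfitting: a large gap suggests the model memorized the training set. For a regression question, the same structure applies with a regressor and a metric such as mean squared error.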
Data
There are lots of data sets available online. Pick something that you will enjoy working on, and something where there is a rich source of data available. Take some time in selecting a good data set - feel free to ask me for suggestions.
- A nice selection of data sets is at the KDNuggets website.
- The University of California at Irvine has put together a large repository of data sets for machine learning.
- Another repository of data sets is at the University of Edinburgh.
- Statlib is a general repository for all things statistical, and it has a nice collection.
- Kaggle - This site hosts data mining/ML competitions. Each competition comes with a data set. You can access most datasets without taking part in the competition, but feel free to submit your results if you’re so inclined.
- KDD Cup Datasets - KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
- Baseball/Basketball statistics - there are a number of repositories for this type of data.
- US Government datasets - tons of census, voting, demographics datasets available.
- Memphis Data Hub (https://data.memphistn.gov/) - datasets collected from the Memphis area about public safety and community resources.
- https://healthcaresummit.ieee.org/data-hackathon/ieee-covid-19-public-health-informatics-challenge/ - COVID-19 dataset.
- https://github.com/fivethirtyeight/data - 538 is a popular interactive news and sports site that makes the data sets used in its articles available online.
- https://github.com/awesomedata/awesome-public-datasets - An open-source collection of topic-centric public data. Collected and sorted from various blogs, answers, and user feedback, it combines free and paid data sets on physics, sports, software, natural language, and machine learning.
- https://data.unicef.org/resources/resource-type/datasets/ - Datasets collected by UNICEF from all over the world.
- https://datasetsearch.research.google.com/ - Google Dataset Search is a search engine for data sets. Try it out if there’s specific data that you’re looking for.
Proposal (due on Canvas at 11:59pm, April 5)
Your ML project proposal should be 1 page and contain the following elements:
- List of group members.
- What data set is being used - where does the data come from, and what are some characteristics of it (number of features, number of training examples, types of attributes).
- Is there a reason you picked this data set? Tell me.
- What are the question(s) of interest? Be specific. Tell me the variables you’re going to predict, and which variables you will use as your input features.
- What machine learning algorithms do you plan to use - understanding that this might change.
While I’d like you to think through your plan carefully, please understand that this is a proposal, and nothing you write is set in stone.
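For the data set characteristics the proposal asks for (number of features, number of training examples, types of attributes), a short summary script can save some manual counting. The sketch below assumes the data fits in a pandas DataFrame. The helper name `summarize_dataset` and the tiny made-up table are hypothetical and exist only for illustration.

```python
import pandas as pd


def summarize_dataset(df, target):
    """Return basic characteristics of a data set for a proposal write-up."""
    # Everything except the column being predicted counts as an input feature.
    features = df.drop(columns=[target])
    return {
        "n_examples": len(df),
        "n_features": features.shape[1],
        # Map each feature name to its pandas dtype as a string.
        "feature_types": {
            col: str(dtype) for col, dtype in features.dtypes.items()
        },
        # Total count of missing cells across the whole table.
        "missing_values": int(df.isna().sum().sum()),
    }


if __name__ == "__main__":
    # Made-up example rows; in practice load your own file, e.g. with pd.read_csv.
    df = pd.DataFrame(
        {
            "age": [34, 51, 29, 42],
            "city": ["Memphis", "Nashville", "Memphis", "Knoxville"],
            "income": [48000.0, 61500.0, None, 55250.0],
            "owns_home": [False, True, False, True],
        }
    )
    print(summarize_dataset(df, target="owns_home"))
```

Counting missing values up front is a design choice worth keeping: many models cannot handle them directly, so knowing how many there are helps when planning the project.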