Data Analysis in Physics and Astronomy
To get a graded course certificate, students are required to complete and submit a data project in the form of a term paper. The details will be discussed during the lectures and then listed here. The project is designed to allow students to demonstrate what they learned in class and to apply their skills to a real-world data set.
Every student can choose any of the data sets described in the following section. I only provide the pointer to the data. I am always open to alternative ideas for a data project - contact me! I do not ask a particular question or demand a particular analysis. Students are invited to investigate the data themselves and come up with an interesting question or analysis. For each data set I will make suggestions to spark your imagination.
Deadline for submission: 30 September 2021
Look at all the datasets. Try to understand what each one contains. Find and check out complementary information. Does it sound interesting to you? Only then pick the project.
After picking the project, download all the available data and all the documentation that comes with it. Start by looking at the data directly. Figure out the meaning of every element.
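One possible way to start "looking at the data directly" is to summarize every column of a CSV file before doing any analysis. The following is a minimal sketch using only the Python standard library; the helper name `summarize_columns` and the sample columns are made up for illustration and do not correspond to any of the project data sets.

```python
import csv
import io
from collections import Counter


def summarize_columns(csv_file, max_examples=3):
    """Per column: count of non-empty values, count of distinct values,
    and a few example values (sorted as strings)."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    distinct = {}
    for row in reader:
        for name, value in row.items():
            # Skip surplus fields (name is None) and missing or empty values.
            if name is None or value is None or value.strip() == "":
                continue
            counts[name] += 1
            distinct.setdefault(name, set()).add(value)
    summary = {}
    for name in reader.fieldnames or []:
        values = distinct.get(name, set())
        summary[name] = {
            "non_empty": counts[name],
            "distinct": len(values),
            "examples": sorted(values)[:max_examples],
        }
    return summary


# Made-up sample standing in for a real file; with real data you would pass
# an opened file instead, e.g. open("some_file.csv", newline="").
sample = io.StringIO(
    "station,duration_s,user_type\n"
    "A,300,member\n"
    "B,,casual\n"
    "A,620,member\n"
)
for column, info in summarize_columns(sample).items():
    print(column, info)
```

Columns with many empty values or with only one distinct value are worth noting early, because they limit which questions the data can answer.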
Now you can begin to 'play' with the data. Plot some of it. Try to visualize associations. Maybe map it. Note anything that catches your attention. Use this phase to come up with an initial 'question' to address. Focus on this topic and analyze it. Also follow some subsequent paths and try to extend your results logically. Always carefully document what you are doing - keep a log of your analysis steps.
If you use complementary data or resources, cite them properly and keep a reference list.
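Keeping a log of the analysis steps can be as simple as appending timestamped entries to a list while you work. This is only a sketch of one possible approach; the functions `log_step` and `format_log` are hypothetical helpers, and a notebook or a plain text file serves the same purpose.

```python
from datetime import datetime


def log_step(log, description, **details):
    """Append a timestamped entry describing one analysis step."""
    entry = {
        "time": datetime.now().isoformat(timespec="seconds"),
        "step": description,
        "details": details,
    }
    log.append(entry)
    return entry


def format_log(log):
    """Render the log as numbered lines, ready to paste into a report."""
    lines = []
    for number, entry in enumerate(log, start=1):
        detail_text = ", ".join(f"{k}={v}" for k, v in entry["details"].items())
        suffix = f" ({detail_text})" if detail_text else ""
        lines.append(f"{number}. [{entry['time']}] {entry['step']}{suffix}")
    return "\n".join(lines)


analysis_log = []
log_step(analysis_log, "Loaded trip data", rows=3)
log_step(analysis_log, "Dropped rows without duration", removed=1)
print(format_log(analysis_log))
```

Recording the parameters of each step (thresholds, number of rows removed, and so on) makes it much easier to describe the analysis in the final write-up.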
The project will be graded according to the following criteria:
Maybe you are planning to apply methods that we did not cover in class. This is OK. You are not limited to what was taught in class. The course should have prepared you to pick up other techniques fairly quickly. If you do so, summarize the applied technique and give references.
Project 1 - Hubway Data Visualization Challenge
“HUBWAY” means the Hubway public bike share system, operating in Boston, Brookline, Cambridge, and Somerville, Massachusetts.
This project focuses on data visualization rather than on prediction or machine learning (though no one stops you from applying those). The official data challenge is closed now, but it was rewarding:
From the webpage:
“In 2012, after Hubway’s first year of operations, we held our first Data Visualization Challenge. The last 5 years have seen Hubway triple in size and much has changed in terms of access, availability, routes, and system usage. Hubway trip data is now released publicly each month.”
“Where do Hubway users ride? When do they ride? How far do they go? Which stations are most popular? On what days of the week are most rides taken? How do user patterns differ between members and casual riders? How does weather affect usage? These and many other questions can be answered by the ride data.”
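Questions such as "On what days of the week are most rides taken?" and "How do user patterns differ between members and casual riders?" come down to counting trips per group. The sketch below shows the idea on a few made-up trips. The field names `start_time` and `user_type` and the timestamp format are assumptions for illustration; check the actual Hubway files for the real column names and formats.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def rides_per_weekday(trips, time_format="%Y-%m-%d %H:%M:%S"):
    """Count trips per (weekday, user type) pair.

    `trips` is an iterable of dicts with the hypothetical keys
    'start_time' and 'user_type'.
    """
    counts = Counter()
    for trip in trips:
        start = datetime.strptime(trip["start_time"], time_format)
        counts[(WEEKDAYS[start.weekday()], trip["user_type"])] += 1
    return counts


# Made-up trips, not real Hubway data.
trips = [
    {"start_time": "2012-07-02 08:15:00", "user_type": "member"},
    {"start_time": "2012-07-02 18:00:00", "user_type": "member"},
    {"start_time": "2012-07-07 14:00:00", "user_type": "casual"},
]
for (weekday, user_type), n in sorted(rides_per_weekday(trips).items()):
    print(weekday, user_type, n)
```

The resulting counts can be passed to any plotting library, for example as a grouped bar chart with one bar per user type and weekday.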
You can also access present bike sharing data from https://www.thehubway.com/system-data
“Access the Hubway System Data to find metrics on all trips taken – including trip duration, start and stop time, station name, and user type (casual or member) – to inform your visualizations and other creative projects.” Also use the related data (Census, neighborhoods, bike facilities, elevation, etc.), packaged as the Hack Day Treat (100 MB zip).
You can view the public entries submitted to the challenge at http://hubwaydatachallenge.org/ and use them to get an idea of what other people did and what you could do.
You are free to use available complementary data in your project. For example:
Boston Housing Data
Project 2 – MovieLens Dataset
GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://grouplens.org/datasets/movielens/). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.
Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.
Full: 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Includes tag genome data with 14 million relevance scores across 1,100 tags. Last updated 9/2018.
Carefully read the README files that explain the data.
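A typical first step with rating data is to compute the mean rating per movie while ignoring movies with only a handful of ratings. The sketch below assumes a ratings file with the columns `userId`, `movieId`, `rating` and `timestamp`; verify the column names against the README of the data set you download. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def mean_ratings(ratings_file, min_ratings=2):
    """Mean rating per movie, keeping only movies with at least
    `min_ratings` ratings. Movie ids are kept as strings."""
    ratings = defaultdict(list)
    for row in csv.DictReader(ratings_file):
        ratings[row["movieId"]].append(float(row["rating"]))
    return {
        movie: sum(values) / len(values)
        for movie, values in ratings.items()
        if len(values) >= min_ratings
    }


# Made-up sample rows, not taken from MovieLens.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,0\n"
    "2,10,5.0,0\n"
    "1,20,3.0,0\n"
)
print(mean_ratings(sample))
```

The threshold `min_ratings` is a choice you have to justify: a movie with a single five-star rating has the highest mean, but that says little about its popularity.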
Data from The Telecom Italia Big Data Challenge
At the beginning of 2014, Telecom Italia, in collaboration with several international partners, launched the Telecom Italia Big Data Challenge. The contest made available to developers, designers and scientists a large dataset of 30+ kinds of data (mobile, weather, energy, etc.)
A Dandelion account is needed to access the data.
The World Database on Protected Areas (WDPA)
The WDPA is the most comprehensive global database of marine and terrestrial protected areas. It is updated on a monthly basis and is one of the key global biodiversity data sets, widely used by scientists, businesses, governments, international secretariats and others to inform planning, policy decisions and management.
Global Power Plant Database
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one's own analysis. Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. As of June 2018, the database includes around 28,500 power plants from 164 countries. It will be continuously updated as data becomes available. The most recent release of the Global Power Plant Database 1.1 includes the addition of two countries (China and Fiji), over 3,000 power plants, and nearly 1300 gigawatts of power capacity.
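Since each entry carries a capacity and a fuel type, one simple starting point is the total installed capacity per fuel type. The following sketch shows the aggregation on made-up rows. The column names `primary_fuel` and `capacity_mw` are assumptions and may differ between releases of the database, so check the header of the file you download.

```python
import csv
import io
from collections import defaultdict


def capacity_by_fuel(plants_file, fuel_column="primary_fuel",
                     capacity_column="capacity_mw"):
    """Sum plant capacity per fuel type, largest total first.
    Rows with a missing or non-numeric capacity are skipped."""
    totals = defaultdict(float)
    for row in csv.DictReader(plants_file):
        try:
            capacity = float(row[capacity_column])
        except (TypeError, ValueError):
            continue
        totals[row[fuel_column]] += capacity
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up sample rows, not taken from the database.
sample = io.StringIO(
    "name,primary_fuel,capacity_mw\n"
    "Plant A,Hydro,100\n"
    "Plant B,Coal,250.5\n"
    "Plant C,Hydro,50\n"
    "Plant D,Solar,\n"
)
for fuel, total_mw in capacity_by_fuel(sample).items():
    print(f"{fuel}: {total_mw:.1f} MW")
```

The same grouping works per country or per owner by passing a different `fuel_column`, and the geolocation of each plant allows the totals to be mapped.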
Choose a suitable dataset from kaggle.com. Browse the available datasets and select an interesting one. Make sure that the data provides sufficient opportunity to do some non-trivial analysis. Also consider combining multiple datasets in your analysis. Contact me if you are not sure whether the data you would choose is suitable.
Check out what others did (see the kernels).
Berlin Airbnb Data https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
World Happiness Report https://www.kaggle.com/unsdsn/world-happiness (possibly in combination with additional data)
European Soccer Database https://www.kaggle.com/hugomathien/soccer
120 years of Olympic history: athletes and results https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
Avocado Prices https://www.kaggle.com/neuromusic/avocado-prices