Data Science and Visualization, S2026
Working with data in pandas
Exercise 1: Libraries
1.1 Use the empty code cell below to install the library pandas. Then comment out or delete the installation code before using the cell to import pandas. If you need help, go to the pandas installation guide and see lecture slides for the syntax used to import pandas
1.2 Check out the websites of the libraries listed below. Find out what the purpose of the libraries are and how to install them. When you have installed them, comment out the code that you used for this and rerun the cell to clear verbose output from your notebook
- Polars: Fast data wrangling library written for effective parallelism
- Matplotlib: Comprehensive library for data visualization
- NetworkX: Library for creation, manipulation, and analysis of complex networks
Exercise 2: Series and DataFrames
The code cell below creates a list of Danish population counts in number of people from 2000 to 2024. The numbers are taken from https://www.statbank.dk/statbank5a/default.asp?w=1536
# Create a list of Danish population counts from 2000 to 2024
population_count = [
5330020, 5349212, 5368354, 5383507, 5397640,
5411405, 5427459, 5447084, 5475791, 5511451,
5534738, 5560628, 5580516, 5602628, 5627235,
5659715, 5707251, 5748769, 5781190, 5806081,
5822763, 5840045, 5873420, 5932654, 5961249
]Use the empty code cell below to do the following:
- Create a new Series containing the years from 2000 to 2024. Assign the Series to the variable
year - Create a new DataFrame that has the Series
yearand the listpopulation_countas its columns. Assign the DataFrame to the variableDK_POP_00_24 - Output
DK_POP_00_24
Excercise 3: Loading
In this exercise, you will use the pandas library to explore a classic dataset of passengers on the Titanic. The dataset contains the following columns:
- PassengerId: Unique identifier
- Survived: Survival status (0, 1)
- Pclass: Ticket class (1, 2, 3)
- Name: Passenger name
- Sex: Gender (male, female)
- Age: Age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare in GBP
- Cabin: Cabin number
- Embarked: Port of embarkation
Download the file titanic.csv from Moodle and save it in the same folder as this notebook. Then use the empty code cell below to read it as DataFrame object and assign it to the variable df. Finally, find use a DataFrame method to find out how many rows and columns the data has and print the results
Exercise 4: Selecting and filtering
Use the loc operator and Boolean indexing to answer the questions listed below. Print your answers to the questions
- In which cabin did the passenger with the name Dodge, Master. Washington sleep?
- What was the last name of the oldest passenger on the Titanic?
- How many passengers embarked the Titanic in Cherbourg?
- How many passengers aged 60 years or older survived the sinking of the Titanic?
- How many men under the age of 30 bought a ticket for 50 GBP or more?
- What was the last name of the youngest woman with a ticket price below 10 GBP to survive the sinking of the Titanic?
# In which cabin did the passenger
# with the name Dodge, Master. Washington sleep?
# What was the last name of the oldest passenger on the Titanic?
# How many passengers embarked the Titanic in Cherbourg?
# How many passengers aged 60 years or older survived
# the sinking of the Titanic?
# How many men under the age of 30 bought a ticket for 50 GBP or more?
# What was the last name of the youngest woman
# with a ticket price below 10 GBP to survive
# the sinking of the Titanic?Exercise 5: Cleaning
5.1 Use the empty code cell below to do the following:
- Find out how many missing values there are on each of the columns in
df. Print the Series containing the results - Replace missing values on the column Embarked with
Cherbourg
5.2 Use the empty code cell below to do the following:
- Find out how many duplicated rows there are in
df. Print the result - Remove the duplicated rows while keeping the first copy of each
Exercise 6: Creating and transforming
6.1 Use the empty code cell below to do the following:
- Create a new column in
dfthat contains the number of family members aboard the Titanic for each passenger. The columns SibSp and Parch contain the information needed for this. Label the column FamMen - Create a new column in
dfthat contains the last name of each passenger. You will need to use the.map()method, a built-in string method and define a function to do this. Label the column LastName - Create a new column in
dfthat contains the deck where each passenger’s cabin was located. The cabin number starts with a letter that identifies the deck where the cabin was located. You will to use the.map()method and define a function that includes a conditional statement to do this. Label the column Deck
6.2 Use the empty code cell below to do the following
- Create a new column in
dfthat categorizes passengers by the number of family members they were travelling with. Define categories for passengers travelling alone, passengers travelling with 1-2 family members, and passengers travelling with 3 or more family members. The column FamMem that you created under exercise 6.1 contains the information needed for this. You will need to define a function and use the.map()method to do this. Label the new column FamSize - Based on the categorical column FamSize that you just created, create 3 new dummy variables for travelling alone, travelling with a small family, and travelling with a large family. Then use the method
.join()to merge the dummies back ontodf - Use the the function
pd.qcut()to create a new column indfthat groups passengers into bins defined by quartiles on the age variable. Label the column AgeQuart
Exercise 7: Merging and concatenating
This exercise will use the DataFrame DK_GDP_00_24 that you created in exercise 2. The code cell below creates a list of Danish GDP in Danish Kroner from 2000 to 2024. The numbers are taken from https://www.statbank.dk/statbank5a/default.asp?w=1536
# Create a list of Danish GDP from 2000 to 2024
GDP = [
1767252e6, 1783295e6, 1800910e6, 1815208e6, 1872596e6,
1933235e6, 2000184e6, 2014425e6, 2025536e6, 1927603e6,
1981158e6, 1988226e6, 1994714e6, 2033469e6, 2076855e6,
2128263e6, 2195314e6, 2263289e6, 2295506e6, 2334244e6,
2326592e6, 2501738e6, 2561083e6, 2470753e6, 2535701e6
]Use the empty code cell below to do the following:
- Create a new DataFrame that has the Series
yearand the listGDPas its columns. Assign the DataFrame to the variableDK_GDP_00_24. You created the Seriesyearin exercise 2 - Use the function
pd.merge()to merge the DataFrameDK_GDP_00_24to the DataFrameDK_POP_00_24on the Year column. You you created the DataFrameDK_GDP_00_24in exercise 2. Assign the resulting DataFrame to the variableDK_POP_GDP_00_24 - Download the file DK_POP_GDP_80_99.csv from Moodle and save it in the same folder as this notebook. Then read it as DataFrame object and assign it to the variable
DK_POP_GDP_80_99 - Stack the DataFrames
DK_POP_GDP_80_99andDK_POP_GDP_00_24using the functionpd.concat(). Assign the resulting DataFrame to the variableDK_POP_GDP_80_24 - Create a new column in
DK_POP_GDP_80_24containing Danish GDP per capita for the years 1980-2024. Use the formula \(PCAP = GDP / POP\) where PCAP denotes GDP per capita and POP denotes population count - Output
DK_POP_GDP_80_24
Exercise 8: Summarizing
In this exercise we will return to the Titanic data and the DataFrame df. We will use the original variables and the variables that you created in exercise 6 to compute summary statistics. As you compute the summary statistics, think about what information they provide about the passengers at the Titanic
8.1 Use the empty code cells below to do the following:
- Create and ouput a table of summary statistics for the variables Age, Fare, and FamMem
- Create and output a table of frequencies and relative frequencies of the categories defined by the column Pclass
- Create and ouput a table of frequencies and relative frequencies of the categories defined by the column Survived
8.2 Use the empty code cells below to do the following:
- Create a table displaying the correlation between Survived and the variables Pclass, Age, Fare, and FamMem. Start by creating a list containing the correlation coefficients. Then create a DataFrame with a single column and 4 rows from this list. Then assign a list containing the string
Correlation with Survived
as the column label and a list containing the stringsPclass
,Age
,Fare
, andFamMem
as the row labels. Outputting this DataFrame will give you a table displaying the correlations - Create a contingency table displaying relative frequencies of the categories defined by the column Survived across groups defined by the column Pclass
Exercise 9: split-apply-combine
9.1 In the empty code cell below, use the .groupby() and .agg() methods to create a table displaying number of observations, mean, standard deviation, mininum and maximum of Fare across groups defined by the column Embarked. Assign the labels N
, Mean
, SD
, Min
, and Max
to the columns of the table
9.2 In the empty code cell below, use the .groupby() and .agg() methods to create a table displaying number of observations, the frequency of passengers who died and survived, respectively, and the percentage of passengers who survived across groups defined by the columns Pclass and FamSize. Assign the labels N
, Survived
, Died
, and Percent survived
to the columns of the table You will need to define at least two custom functions and pass them to the agg() method to do this. Think about what information the table provides about which passengers survived the sinking of the Titanic