This notebook is a modified version of a guided exercise offered in the very first section of Udacity's Data Analyst Nanodegree. The goal of this exercise is to make sure the students have a general view of the data analysis process, from wrangling to communicating results.
Bay Area Bike Share is a company that provides on-demand bike rentals for customers in San Francisco, Redwood City, Palo Alto, Mountain View, and San Jose. Users can unlock bikes from a variety of stations throughout each city, and return them to any station within the same city. Users pay for the service either through a yearly subscription or by purchasing 3-day or 24-hour passes. Users can make an unlimited number of trips, with trips under thirty minutes in length having no additional charge; longer trips will incur overtime fees.
We're going to take a look at the data, do a little wrangling and analysis, and answer questions that will help us understand our customers' behavior, develop our network, and stock our stations accordingly.
The data comes in three parts: the first half of Year 1 (files starting 201402), the second half of Year 1 (files starting 201408), and all of Year 2 (files starting 201508). There are three main datafiles associated with each part: trip data showing information about each trip taken in the system (*_trip_data.csv), information about the stations in the system (*_station_data.csv), and daily weather data for each city in the system (*_weather_data.csv).
We'll start by looking at only the first month of the bike trip data, from 2013-08-29 to 2013-09-30. The code below will take the data from the first half of the first year, then write the first month's worth of data to an output file. This code exploits the fact that the data is sorted by date (though it should be noted that the first two days are sorted by trip time, rather than being completely chronological).
# import all necessary packages and functions.
import csv
from datetime import datetime
import numpy as np
import pandas as pd
from babs_datacheck import question_3
from babs_visualizations import usage_stats, usage_plot
from IPython.display import display
%matplotlib inline
# file locations
file_in = '201402_trip_data.csv'
file_out = '201309_trip_data.csv'
with open(file_out, 'w') as f_out, open(file_in, 'r') as f_in:
    # set up csv reader and writer objects
    in_reader = csv.reader(f_in)
    out_writer = csv.writer(f_out)

    # write rows from in-file to out-file until specified date reached
    while True:
        datarow = next(in_reader)
        # trip start dates in 3rd column, m/d/yyyy HH:MM format
        if datarow[2][:9] == '10/1/2013':
            break
        out_writer.writerow(datarow)
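For reference, the same first-month extraction could be done with pandas instead of streaming through the csv module. A minimal sketch, assuming the raw file's Start Date column and date format used later in this notebook:

# load the whole file, parse the start dates, and keep rows before October 1
all_trips = pd.read_csv(file_in)
start_dates = pd.to_datetime(all_trips['Start Date'], format='%m/%d/%Y %H:%M')
all_trips[start_dates < '2013-10-01'].to_csv(file_out, index=False)

Unlike the loop above, this approach does not rely on the data being sorted by date, at the cost of reading the entire file into memory.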
Let's look at the first few rows of the sampled data file.
sample_data = pd.read_csv('201309_trip_data.csv')
display(sample_data.head())
In this exploration, we're going to concentrate on the factors in the trip data that affect the number of trips taken. Let's focus on a few selected columns: trip duration, start time, start terminal, end terminal, and subscription type. Start time will be divided into year, month, and hour components. We will also add a column for the day of the week, and abstract the start and end terminals to the start and end cities.
Let's tackle the last part of that wrangling first. We will see how the station information is structured, then look at how the code creates the station-city mapping.
# Display the first few rows of the station data file.
station_info = pd.read_csv('201402_station_data.csv')
display(station_info.head())
# This function will be called by another function later on to create the mapping.
def create_station_mapping(station_data):
    """
    Create a mapping from station IDs to cities, returning the
    result as a dictionary.
    """
    station_map = {}
    for data_file in station_data:
        with open(data_file, 'r') as f_in:
            # set up csv reader object - note that we are using DictReader,
            # which takes the first row of the file as a header row for
            # each row's dictionary keys
            station_reader = csv.DictReader(f_in)

            for row in station_reader:
                station_map[row['station_id']] = row['landmark']
    return station_map
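As a quick sanity check, we can build the mapping from the station file loaded above and inspect a few entries (the exact IDs and city names depend on the file contents):

# build the station-to-city mapping from the Year 1 station file
station_map = create_station_mapping(['201402_station_data.csv'])

# peek at a handful of station_id -> city pairs
for station_id in list(station_map)[:5]:
    print('{} -> {}'.format(station_id, station_map[station_id]))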
We can now use the mapping to condense the trip data to the selected columns noted above. This will be performed in the summarise_data() function below. As part of this function, the datetime module is used to parse the timestamp strings from the original data file as datetime objects (strptime), which can then be output in a different string format (strftime). The parsed objects also have a variety of attributes and methods for quickly obtaining individual components, such as the year, month, hour, and day of the week. We will convert the trip durations from seconds to minutes, and create columns for the year, month, hour, and day of the week.
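As a small illustration, here is how a single (made-up) timestamp in the data's m/d/yyyy HH:MM format is parsed and re-formatted:

# parse a timestamp string into a datetime object...
trip_date = datetime.strptime('8/29/2013 14:13', '%m/%d/%Y %H:%M')

# ...then output just the pieces we need as strings
print(trip_date.strftime('%Y-%m-%d'))  # '2013-08-29'
print(trip_date.strftime('%H'))        # '14' (hour of day)
print(trip_date.strftime('%A'))        # 'Thursday' (day of week)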
def summarise_data(trip_in, station_data, trip_out):
    """
    This function takes trip and station information and outputs a new
    data file with a condensed summary of major trip information. The
    trip_in and station_data arguments will be lists of data files for
    the trip and station information, respectively, while trip_out
    specifies the location to which the summarized data will be written.
    """
    # generate dictionary of station - city mapping
    station_map = create_station_mapping(station_data)

    with open(trip_out, 'w') as f_out:
        # set up csv writer object
        out_colnames = ['duration', 'start_date', 'start_year',
                        'start_month', 'start_hour', 'weekday',
                        'start_city', 'end_city', 'subscription_type']
        trip_writer = csv.DictWriter(f_out, fieldnames=out_colnames)
        trip_writer.writeheader()

        for data_file in trip_in:
            with open(data_file, 'r') as f_in:
                # set up csv reader object
                trip_reader = csv.DictReader(f_in)

                # collect data from and process each row
                for row in trip_reader:
                    new_point = {}

                    # convert duration units from seconds to minutes
                    ### Question 3a: Add a mathematical operation below ###
                    ### to convert durations from seconds to minutes. ###
                    new_point['duration'] = float(row['Duration']) / 60

                    # reformat datestrings into multiple columns
                    ### Question 3b: Fill in the blanks below to generate ###
                    ### the expected time values. ###
                    trip_date = datetime.strptime(row['Start Date'], '%m/%d/%Y %H:%M')
                    new_point['start_date'] = trip_date.strftime('%Y-%m-%d')
                    new_point['start_year'] = trip_date.strftime('%Y')
                    new_point['start_month'] = trip_date.strftime('%m')
                    new_point['start_hour'] = trip_date.strftime('%H')
                    new_point['weekday'] = trip_date.strftime('%A')

                    # remap start and end terminal with start and end city
                    new_point['start_city'] = station_map[row['Start Terminal']]
                    new_point['end_city'] = station_map[row['End Terminal']]
                    # two different column names for subscribers depending on file
                    if 'Subscription Type' in row:
                        new_point['subscription_type'] = row['Subscription Type']
                    else:
                        new_point['subscription_type'] = row['Subscriber Type']

                    # write the processed information to the output file.
                    trip_writer.writerow(new_point)
We will now process the data:
# Process the data by running the function we wrote above.
station_data = ['201402_station_data.csv']
trip_in = ['201309_trip_data.csv']
trip_out = '201309_trip_summary.csv'
summarise_data(trip_in, station_data, trip_out)
# Load in the data file and print out the first few rows
sample_data = pd.read_csv(trip_out)
display(sample_data.head())
# Verify the dataframe by counting data points matching each of the time features.
question_3(sample_data)
Let's now look at some initial trends in the data. Some code is already written and available in the babs_visualizations.py script to help summarize and visualize the data; it has been imported as the functions usage_stats() and usage_plot(). Let's use the usage_stats() function to see the total number of trips made in the first month of operations, along with some statistics regarding how long trips took.
trip_data = pd.read_csv('201309_trip_summary.csv')
usage_stats(trip_data)
There are over 27,000 trips in the first month, and the average trip duration is larger than the median trip duration (the point where 50% of trips are shorter, and 50% are longer). In fact, the mean is larger than 75% of trip durations. We will look into this later.
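As a cross-check of what usage_stats() reports, the same summary can be pulled directly from the dataframe with pandas:

# quartiles make the skew visible: the mean sits above the 75th percentile
print(trip_data['duration'].describe())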
Let's now look at how those trips are divided by subscription type. We'll use the usage_plot() function for this. The second argument of the function allows us to count trips across a selected variable, displaying the information in a plot.
usage_plot(trip_data, 'subscription_type')
It looks like subscribers made about 50% more trips than customers in the first month. Let's try a different variable now. What does the distribution of trip durations look like?
usage_plot(trip_data, 'duration')
Most rides are expected to be 30 minutes or less, since there are overage charges for taking extra time in a single trip. Yet the first bar spans durations up to about 1000 minutes, or over 16 hours. Based on the statistics we got out of usage_stats(), we should have expected some trips with very long durations that pull the average so much higher than the median: the plot shows this in a dramatic, but unhelpful, way.
We need to tweak the visualization's parameters to make the data easier to understand. The third argument of the usage_plot() function acts as a filter: conditions on the data points can be set as a list of strings. Let's start by limiting things to trips of less than 60 minutes.
usage_plot(trip_data, 'duration', ['duration < 60'])
Better! We can see that most trips are indeed less than 30 minutes in length, but there's more we can do to improve the presentation. Since the minimum duration is not 0, the left-hand bar starts slightly above 0. We want to be able to tell whether there is a clear boundary at 30 minutes, so it will look nicer if the bin sizes and bin boundaries correspond to round numbers of minutes. Setting "boundary" to 0 makes one of the bin edges (in this case the left-most) fall at 0 rather than at the minimum trip duration, and setting "bin_width" to 5 makes each bar count up data points in five-minute intervals.
usage_plot(trip_data, 'duration', ['duration < 60'], boundary = 0, bin_width = 5)
The 5-to-10-minute bin contains the most trips: over 9,000.
Now that we've done some exploration on a small sample of the dataset, it's time to put all of the data together in a single file and see what trends we can find. The code below will use the same summarise_data() function as before to process the data. After running the cell below, we'll have all the data processed into a single data file.
station_data = ['201402_station_data.csv',
                '201408_station_data.csv',
                '201508_station_data.csv']
trip_in = ['201402_trip_data.csv',
           '201408_trip_data.csv',
           '201508_trip_data.csv']
trip_out = 'babs_y1_y2_summary.csv'
# This function will take in the station data and trip data and
# write out a new data file to the name listed above in trip_out.
summarise_data(trip_in, station_data, trip_out)
trip_data = pd.read_csv('babs_y1_y2_summary.csv')
display(trip_data.head())
Let's explore the new dataset with usage_stats() and usage_plot(). A refresher on how to use the usage_plot() function might be useful. Filter conditions are given as strings of the form '<field> <op> <value>', using one of the following operations: >, <, >=, <=, ==, !=. Data points must satisfy all conditions to be counted or visualized. For example, ["duration < 15", "start_city == 'San Francisco'"] retains only trips that originated in San Francisco and are less than 15 minutes long. If the data is being split on a numeric variable (thus creating a histogram), the most useful additional keyword parameters are boundary and bin_width, as demonstrated earlier.
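For intuition, the example filter above amounts to the following boolean indexing in plain pandas (a sketch outside the provided helper functions):

# keep rows satisfying all conditions in
# ["duration < 15", "start_city == 'San Francisco'"]
short_sf_trips = trip_data[(trip_data['duration'] < 15) &
                           (trip_data['start_city'] == 'San Francisco')]
print('{} trips match both conditions'.format(len(short_sf_trips)))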
We can also add some customization to the usage_stats() function. Its second argument can be used to set up filter conditions, just like they are set up in usage_plot().
Let's begin by getting general information from our data.
usage_stats(trip_data)
usage_stats(trip_data, ['duration < 60'])
usage_plot(trip_data, 'duration', ['duration < 60'], boundary = 0, bin_width = 5)
We're getting the same results as we obtained from our sample: most trips last between 5 and 10 minutes.
usage_plot(trip_data, 'start_hour', bin_width=1)
We can clearly see that the service is mainly used at the beginning and end of the workday: people use it to get to the workplace and back.
usage_plot(trip_data, 'subscription_type')
Users of the service are mainly subscribers.
usage_plot(trip_data, 'weekday')
The service is used about a third as much on weekends, supporting our earlier hypothesis that it serves mainly as a means of commuting to work.
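That ratio can be checked quickly in plain pandas. A rough sketch that ignores the slightly different number of each weekday in the date range:

# total trips per day of the week (names come from strftime('%A'))
counts = trip_data['weekday'].value_counts()

weekend_avg = counts[['Saturday', 'Sunday']].mean()
weekday_avg = counts.drop(['Saturday', 'Sunday']).mean()
print('avg trips per weekday: {:.0f}'.format(weekday_avg))
print('avg trips per weekend day: {:.0f}'.format(weekend_avg))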
This project gave an overview of the data analysis process, and a few of the techniques used in this field.
There are many fields and topics of interest where I'd like to apply these techniques. I'm currently working on the survival chances of the Titanic passengers. I'm also looking for a large, high-quality dataset about the effects of mindfulness on well-being and productivity. The Apple Health app, and healthcare in general, also offers a lot of opportunities. Datasets about accidents and their contributing factors would be great as well. SpaceX also has a dataset on Kaggle with information about all of its launches, so it would be interesting to see which countries and companies are its most loyal customers.
Some topics might look sexier than others, but to me, it's more about how the analysis is conducted and the insights that can be discovered than about the 'theme' of the dataset.