In this project, I will explore Pittsburgh's restaurant review data available through the Yelp Dataset Challenge.
This assignment is broken into two parts:
Because Yelp reviews include the number of stars given by the user, the Yelp data set provides a unique opportunity to test how well our sentiment analysis works by comparing the number of stars to the polarity of reviews.
We'll explore geographic trends in the restaurant reviews, comparing our sentiment analysis results with user stars across the city. We'll also overlay review stars on maps of household income (using census data).
In this part, we'll load the data, perform a sentiment analysis, and explore the results.
import pandas as pd
# Load the Pittsburgh subset of the Yelp review data
pittReviews = pd.read_json("data/reviews_pittsburgh.json.gz", orient="records", lines=True)
The first step is to split the review text into its individual words and make all of the words lowercase.
We added a new column, called 'formatted_text', in which each entry is a list of the lowercased words in a review.
def format_text(row):
    # Lowercase the review and split it on whitespace
    return row.lower().split()

# Store the word lists in a new 'formatted_text' column
pittReviews['formatted_text'] = [format_text(row) for row in pittReviews['text']]
We used the nltk library to remove stop words from the list of words in each review, overwriting the 'formatted_text' column so that each entry is a list of lowercased words with no stop words.
import nltk
nltk.download('stopwords')
stop_words = list(set(nltk.corpus.stopwords.words('english')))
import string
punctuation = list(string.punctuation)
ignored = stop_words + punctuation

def removeStop_text(row):
    # Keep only the words that are not stop words or punctuation
    return [word for word in row if word not in ignored]

# Overwrite 'formatted_text' with the filtered word lists
pittReviews['formatted_text'] = [removeStop_text(row) for row in pittReviews['formatted_text']]
Using the formatted text column, we created a list of textblob.TextBlob() objects and then extracted the subjectivity and polarity of each review. We added two new columns to the review DataFrame: polarity and subjectivity.
import textblob

def getBlob_text(row):
    # Re-join the word list into a single string and wrap it in a TextBlob
    return textblob.TextBlob(' '.join(row))

blobs = [getBlob_text(row) for row in pittReviews['formatted_text']]

# Extract the polarity and subjectivity of each review's sentiment
data = {}
data['polarity'] = [blob.sentiment.polarity for blob in blobs]
data['subjectivity'] = [blob.sentiment.subjectivity for blob in blobs]
data = pd.DataFrame(data)

pittReviews = pd.concat([pittReviews, data], axis=1, ignore_index=False)
We used seaborn to make two box plots: one showing polarity vs. the number of user stars, and one showing subjectivity vs. the number of user stars.
import seaborn as sns
sns.set_theme(style="whitegrid")

ax = sns.boxplot(x=pittReviews["stars"], y=pittReviews["polarity"])
ax.set_title('polarity by the number of user stars')

ax = sns.boxplot(x=pittReviews["stars"], y=pittReviews["subjectivity"])
ax.set_title('subjectivity by the number of user stars')
The charts indicate that the sentiment analysis is fairly effective. The subjectivity chart shows that reviews with more stars tend to be slightly more subjective, but the distributions are fairly consistent across star groups. The polarity values, by contrast, increase noticeably with the number of stars, with much larger differences between groups. Since polarity separates the star groups far more clearly than subjectivity does, the analysis appears to be working well.
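One quick way to back this reading up with numbers (a minimal sketch, assuming the pittReviews DataFrame built above) is to compare each measure's mean and spread by star group:
# Group means and standard deviations; if the box plots are right,
# polarity's means should climb much more steeply across star groups
pittReviews.groupby('stars')[['polarity', 'subjectivity']].agg(['mean', 'std'])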
In this part, we explored the importance and frequency of individual words in Yelp reviews.
We identified the most common words and then plotted the average polarity vs. the user stars for the reviews where those words occur.
First, we selected 1,000 random rows from the DataFrame holding the review data, using the .sample() function to perform the selection.
pittReviews_random = pittReviews.sample(1000)
Then we passed the subset of review data from the previous part to the reshape_data() function defined below.
def reshape_data(review_subset):
    """
    Reshape the input dataframe of review data.
    """
    from pandas import Series, merge

    # Explode each review's word list so every word gets its own row,
    # keeping the original review index
    X = (review_subset['formatted_text']
         .apply(Series)
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame('word'))

    # Join each word back to its review's polarity, stars, and id
    R = review_subset[['polarity', 'stars', 'review_id']]
    return merge(R, X, left_index=True, right_index=True).reset_index(drop=True)
reshaped_pittReviews_random = reshape_data(pittReviews_random)
"The reshape_data()
function breaks each sentence in the formatted text column into words and then was joined to the polarity and stars review results of its sentence."
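To illustrate what reshape_data() does, here is a toy example with made-up reviews (the words, polarities, and ids are hypothetical, not from the Yelp data):
toy = pd.DataFrame({
    'formatted_text': [['great', 'pizza'], ['slow', 'service']],
    'polarity': [0.8, -0.3],
    'stars': [5, 2],
    'review_id': ['a1', 'b2'],
})

# Each word becomes its own row, carrying its review's polarity and stars:
#    polarity  stars review_id     word
# 0       0.8      5        a1    great
# 1       0.8      5        a1    pizza
# 2      -0.3      2        b2     slow
# 3      -0.3      2        b2  service
print(reshape_data(toy))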
Using the result from 1.6.2, we grouped the dataframe by the "word" column and calculated three quantities: the number of occurrences of each word (N), its average polarity, and its average stars.
# Count each word's occurrences and average its polarity and stars
word = (reshaped_pittReviews_random
        .groupby('word')
        .agg(N=('word', 'size'),
             polarity=('polarity', 'mean'),
             stars=('stars', 'mean'))
        .reset_index())
We trimmed the DataFrame from the last section to only include words that occurred at least 50 times.
word_over50 = word.loc[word['N']>=50]
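As a quick sanity check (not part of the assignment itself), we can peek at the most frequent words that survive the threshold:
# The most common remaining words; generic restaurant vocabulary
# like 'food', 'good', and 'place' should rank near the top
word_over50.sort_values('N', ascending=False).head(10)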
We used matplotlib to make a scatter plot of the average user stars vs. the average polarity for the words in the data frame from the last section. This involved two steps:
First, loop over each row of the data frame and, for each row, call plt.scatter(x, y) to plot a marker, where x is the average polarity and y is the average stars, then call plt.text(x, y, word) to label the marker with its word.
Second, using the data frame from section 1.4, add vertical and horizontal lines to the chart showing the average number of user stars and the average polarity across all reviews in the data set.
We saw a strong trend between polarity and user stars, and the chart highlights some of the most common words occurring in these reviews.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 15))
for i in word_over50.index:
    w = word_over50['word'][i]
    pol = word_over50['polarity'][i]
    sta = word_over50['stars'][i]
    # Plot each word as a point at (average polarity, average stars)
    plt.scatter(pol, sta, marker='o', c='pink')
    plt.text(pol + 0.001, sta + 0.001, w, size=18)
plt.xlabel("average polarity", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("average stars", fontsize=20)
plt.yticks(fontsize=14)
plt.axvline(x=pittReviews['polarity'].mean(), linestyle='--', label="average polarity")
plt.axhline(y=pittReviews['stars'].mean(), linestyle='--', c='green', label="average stars")
plt.legend(loc='upper left')
plt.title('average stars and polarity for words', size=22)
In this part, we will use the census API to download household income data and overlay restaurant locations.
We used the cenpy package to download median household income in the past 12 months by census tract from the 2018 ACS 5-year data set for our county of interest, Allegheny County (which contains Pittsburgh).
import cenpy
acs = cenpy.remote.APIConnection("ACSDT5Y2018")
variables = [
"NAME",
"B19013_001E", # MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS
]
# FIPS codes for Allegheny County, Pennsylvania
pitt_county_code = "003"
pa_state_code = "42"

# Query the ACS for every census tract in Allegheny County
pitt_inc_data = acs.query(
    cols=variables,
    geo_unit="tract:*",
    geo_filter={"state": pa_state_code, "county": pitt_county_code},
)

for variable in variables:
    # Convert all variables EXCEPT for NAME to numeric values
    if variable != "NAME":
        pitt_inc_data[variable] = pitt_inc_data[variable].astype(float)
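One caveat: the Census API reports unavailable estimates with large negative sentinel values (for example, -666666666 for a suppressed median income), so a reasonable cleanup step on the pitt_inc_data frame above is to mask them before mapping:
import numpy as np

# Replace the ACS missing-data sentinels (large negative values) with NaN
# so they don't distort the choropleth bins
pitt_inc_data.loc[pitt_inc_data['B19013_001E'] < 0, 'B19013_001E'] = np.nan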
We then used cenpy to set the correct map service and download the census tract geometries for the desired geography.
acs.set_mapservice("tigerWMS_ACS2018")
# Layer 8 of this map service contains the census tract geometries
acs.mapservice.layers[8]

# Use SQL to return geometries only for Allegheny County in PA
where_clause = f"STATE = {pa_state_code} AND COUNTY = {pitt_county_code}"

# Query for census tracts
pitt_tracts = acs.mapservice.layers[8].query(where=where_clause)
pitt_inc_final = pitt_tracts.merge(
    pitt_inc_data,
    left_on=["STATE", "COUNTY", "TRACT"],
    right_on=["state", "county", "tract"],
)
fig, ax = plt.subplots(figsize=(10, 10))

# Plot the choropleth
pitt_inc_final.plot(ax=ax, column='B19013_001E', legend=True,
                    legend_kwds=dict(loc="lower left"),
                    cmap='viridis', scheme="Quantiles", k=5)

# Format
ax.set_title("Median household income in Allegheny County by census tract", fontsize=16)
ax.set_axis_off()
We used the latitude and longitude columns to create a GeoDataFrame after loading the JSON data.
pittRestaurants = pd.read_json("data/restaurants_pittsburgh.json.gz", orient="records", lines=True)
pittRestaurants
import geopandas as gpd

pittRestaurants_gdf = gpd.GeoDataFrame(
    pittRestaurants,
    geometry=gpd.points_from_xy(pittRestaurants.longitude, pittRestaurants.latitude),
    crs="EPSG:4326",
)

# Match the CRS of the census tract geometries (Web Mercator)
pittRestaurants_gdf = pittRestaurants_gdf.to_crs(epsg=3857)
We overlaid the restaurants and colored the points according to the 'stars' column.
fig, ax = plt.subplots(figsize=(20, 15))

# Plot the income choropleth in grayscale as the base map
pitt_inc_final.plot(ax=ax, column='B19013_001E', legend=True,
                    legend_kwds=dict(loc="lower left"),
                    cmap='gray', scheme="Quantiles", k=5)

# Overlay the restaurants, colored by the 'stars' column
pittRestaurants_gdf.plot(ax=ax, column='stars', legend=True, alpha=0.8, cmap='coolwarm')
[xmin, ymin, xmax, ymax] = pitt_inc_final.total_bounds
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.set_axis_off()
ax.set_title("restaurants on the income map", fontsize=16)
Similar to what we saw in Section 1, there appears to be a strong correlation between review polarity and user stars, which the two hexbin subplots below show geographically.
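Before mapping, one simple way to quantify that relationship (a minimal check, not part of the original assignment) is the review-level correlation between stars and polarity:
# Pearson correlation between user stars and TextBlob polarity;
# a clearly positive value supports the visual impression from Section 1
print(pittReviews['stars'].corr(pittReviews['polarity']))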
merged_reviews = pittRestaurants_gdf.merge(pittReviews, on='business_id')
# create the axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,15))
# Extract the x/y coordinates of the Point objects
xcoords = merged_reviews.geometry.x
ycoords = merged_reviews.geometry.y
# Add the census tract boundaries
pitt_inc_final.plot(ax=ax1, facecolor="none", edgecolor="gray", linewidth=0.5)
pitt_inc_final.plot(ax=ax2, facecolor="none", edgecolor="gray", linewidth=0.5)
# Plot a hexbin chart
hex_vals1 = ax1.hexbin(xcoords, ycoords, C=merged_reviews['polarity'], gridsize=50, cmap='coolwarm')
hex_vals2 = ax2.hexbin(xcoords, ycoords, C=merged_reviews['stars_y'], gridsize=50, cmap='coolwarm')
# add a colorbar and format
cb1 = fig.colorbar(hex_vals1, ax=ax1, location="bottom")
cb1.set_label('polarity')
cb2 = fig.colorbar(hex_vals2, ax=ax2, location="bottom")
cb2.set_label('stars')
ax1.set_axis_off()
ax2.set_axis_off()
ax1.set_title('hex bins showing the polarity of restaurant reviews', size=22)
ax2.set_title('hex bins showing the number of stars of restaurants', size=22)
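To push past the visual comparison, a possible follow-up (a sketch assuming geopandas >= 0.10 and the pitt_inc_final and pittRestaurants_gdf frames built earlier, both in EPSG:3857) is to spatially join each restaurant to its census tract and correlate stars with median income:
# Assign each restaurant to the census tract that contains it
joined = gpd.sjoin(
    pittRestaurants_gdf,
    pitt_inc_final[['geometry', 'B19013_001E']],
    how='inner',
    predicate='within',
)

# Correlation between a restaurant's rating and its tract's median income
print(joined['stars'].corr(joined['B19013_001E']))
A value near zero here would suggest that the visual similarity between the maps reflects where restaurants cluster rather than income itself.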