Sentiment Analysis with Yelp Reviews

In this project, I will explore Pittsburgh's restaurant review data available through the Yelp Dataset Challenge.

This assignment is broken into two parts:

Part 1: testing how well sentiment analysis works.

Because Yelp reviews include the number of stars given by the user, the Yelp data set provides a unique opportunity to test how well our sentiment analysis works by comparing the number of stars to the polarity of reviews.

Part 2: analyzing correlations between restaurant reviews and census data

We'll explore geographic trends in the restaurant reviews, comparing our sentiment analysis results with user stars geographically. We'll also overlay review stars on maps of household income (using census data).

1. Does Sentiment Analysis Work?

In this part, we'll load the data, perform a sentiment analysis, and explore the results.

1.1 Load review data
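A rough sketch of this step with pandas; the file name and the newline-delimited JSON format are assumptions here, not part of the original data description.

```python
import pandas as pd

# Load the Yelp review data; "yelp_reviews.json" and the line-delimited
# format are assumptions about how the data was provided
reviews = pd.read_json("yelp_reviews.json", lines=True)
reviews.head()
```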

1.2 Format the review text

The first step is to split the review text into its individual words and make all of the words lower-cased.

We added a new column, called 'formatted_text', in which each entry is a list of the lower-cased words in a review.
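A minimal sketch of this step, assuming the raw review text lives in a column named 'text':

```python
# Lower-case the raw text and split it on whitespace;
# each entry of 'formatted_text' becomes a list of words
reviews["formatted_text"] = reviews["text"].str.lower().str.split()
```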

1.3 Remove stop words

We used the nltk library to remove any stop words from the list of words in each review.

We overwrote the 'formatted_text' column to contain a list of lower-cased words in each review, with no stop words.
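For example, using nltk's English stop word list (a sketch, assuming the columns defined above):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # only needs to run once
stop_words = set(stopwords.words("english"))

# Filter the stop words out of each review's word list
reviews["formatted_text"] = reviews["formatted_text"].apply(
    lambda words: [w for w in words if w not in stop_words]
)
```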

1.4 Calculate polarity and subjectivity

Using the formatted text column, we created a list of textblob.TextBlob() objects and then extracted the subjectivity and polarity.

We added two new columns to the review DataFrame: polarity and subjectivity.
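A sketch of this step with textblob; re-joining the word lists into a single string is an assumption about how the TextBlob objects were built.

```python
import textblob

# Build one TextBlob per review from the formatted word list
blobs = [textblob.TextBlob(" ".join(words)) for words in reviews["formatted_text"]]

# Extract the two sentiment measures into new columns
reviews["polarity"] = [blob.sentiment.polarity for blob in blobs]
reviews["subjectivity"] = [blob.sentiment.subjectivity for blob in blobs]
```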

1.5 Comparing the sentiment analysis to number of stars

We used seaborn to make two box plots, one showing the polarity vs number of user stars and one showing the subjectivity vs the number of user stars.
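A sketch of the two box plots, assuming the user rating is stored in a 'stars' column:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(ncols=2, figsize=(12, 5))

# Polarity vs. number of user stars
sns.boxplot(x="stars", y="polarity", data=reviews, ax=axes[0])

# Subjectivity vs. number of user stars
sns.boxplot(x="stars", y="subjectivity", data=reviews, ax=axes[1])
```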

The charts indicate that the sentiment analysis is reasonably effective. The subjectivity chart shows that reviews become somewhat more subjective as the number of stars increases, but the distributions are fairly consistent across the star groups and centered at similar values. The polarity values also increase with the number of stars, and here the differences between groups are much larger. In other words, subjectivity varies across groups, but not nearly as much as polarity does, which suggests the analysis is doing a reasonably effective job.

1.6 The importance of individual words

In this part, we explored the importance and frequency of individual words in Yelp reviews.

We identified the most common words and then plotted the average polarity vs. the average user stars for the reviews where those words occur.

1.6.1 Select a random sample of the review data

First, we selected 1,000 random rows from the DataFrame holding the review data, using the .sample() function to perform the selection.
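For example (the random_state is an arbitrary choice that makes the sample reproducible):

```python
# Select 1,000 random reviews
reviews_sample = reviews.sample(n=1000, random_state=42)
```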

1.6.2 Re-format the data

Then we passed the subset of review data from the previous part to the reshape_data() function defined below.

"The reshape_data() function breaks each sentence in the formatted text column into words and then was joined to the polarity and stars review results of its sentence."

1.6.3 Calculate the average number of stars and polarity for each word

Using the result from 1.6.2, we grouped the DataFrame by the "word" column and calculated the following three quantities (see the sketch after the list):

  1. the size of each group
  2. the average number of user stars for each word
  3. the average polarity for each word
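A sketch of that aggregation with pandas named aggregation, using the hypothetical names from the sketch above:

```python
# Group by word and compute the group size plus the average stars and polarity
word_stats = word_data.groupby("word").agg(
    count=("word", "size"),
    avg_stars=("stars", "mean"),
    avg_polarity=("polarity", "mean"),
)
```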

1.6.4 Select words that occur at least 50 times in reviews

We trimmed the DataFrame from the last section to only include words that occurred at least 50 times.
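For example, continuing with the names from the sketch above:

```python
# Keep only the words that occur at least 50 times
popular_words = word_stats.loc[word_stats["count"] >= 50]
```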

1.6.5 Plot the average polarity vs user stars

We used matplotlib to make a scatter plot of the average user stars vs average polarity for the words in the data frame from the last section. This involved two steps:

We looped over each row of the data frame from the last section and, for each row:

  1. used plt.scatter(x, y) to add a scatter marker, where x is the average polarity and y is the average user stars.
  2. used plt.text(x, y, word) to label that marker with the corresponding word.

Using the DataFrame from Section 1.4, we added vertical and horizontal lines showing the average polarity and the average number of user stars across all reviews in the data set.
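A sketch of the full plot, using the hypothetical names from the sketches above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))

# One marker and one text label per word
for word, row in popular_words.iterrows():
    plt.scatter(row["avg_polarity"], row["avg_stars"], color="steelblue")
    plt.text(row["avg_polarity"], row["avg_stars"], word, fontsize=8)

# Reference lines: the averages across all reviews (Section 1.4 DataFrame)
plt.axvline(reviews["polarity"].mean(), color="gray", linestyle="--")
plt.axhline(reviews["stars"].mean(), color="gray", linestyle="--")

plt.xlabel("Average polarity")
plt.ylabel("Average user stars")
```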

We saw a strong trend between average polarity and average user stars, and the chart highlights some of the most common words occurring in these reviews.

2. Correlating restaurant data and household income

In this part, we will use the census API to download household income data and overlay restaurant locations.

2.1 Query the Census API

We used the cenpy package to download median household income in the past 12 months by census tract from the 2018 ACS 5-year data set for our county of interest, Allegheny County, PA (which contains Pittsburgh).
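A sketch of one way to do this with cenpy's APIConnection interface; the dataset name, the B19013_001E variable code, and the Allegheny County FIPS codes (state 42, county 003) are spelled out here as assumptions about the specific query:

```python
import pandas as pd
from cenpy import remote

# Connect to the 2018 ACS 5-year detailed tables
acs = remote.APIConnection("ACSDT5Y2018")

# Median household income in the past 12 months (B19013_001E)
# for every census tract in Allegheny County, PA (state 42, county 003)
income = acs.query(
    cols=["NAME", "B19013_001E"],
    geo_unit="tract:*",
    geo_filter={"state": "42", "county": "003"},
)

# The API returns strings; convert the income values to numbers
income["B19013_001E"] = pd.to_numeric(income["B19013_001E"])
```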

2.2 Download census tracts from the Census and merge the data from Part 2.1
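One possible sketch, downloading the 2018 TIGER/Line tract boundaries for Pennsylvania with geopandas and merging on the state/county/tract codes; the URL and column names are assumptions based on the standard Census file layout:

```python
import geopandas as gpd

# 2018 census tract boundaries for Pennsylvania (state FIPS 42)
tracts = gpd.read_file(
    "https://www2.census.gov/geo/tiger/TIGER2018/TRACT/tl_2018_42_tract.zip"
)

# Keep Allegheny County and merge with the income data from Part 2.1
tracts = tracts.loc[tracts["COUNTYFP"] == "003"]
income_tracts = tracts.merge(
    income,
    left_on=["STATEFP", "COUNTYFP", "TRACTCE"],
    right_on=["state", "county", "tract"],
)
```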

2.3 Plot a choropleth map of the household income
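For example, with geopandas' built-in plotting (column names as in the sketch above):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))

# Choropleth of median household income by tract
income_tracts.plot(
    column="B19013_001E",
    cmap="viridis",
    legend=True,
    ax=ax,
)
ax.set_axis_off()
```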

2.4 Load the restaurants data

We used the latitude and longitude columns to create a GeoDataFrame after loading the JSON data.
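A sketch of this step; the restaurants file name is an assumption:

```python
import pandas as pd
import geopandas as gpd

# Load the restaurant data and build point geometries from longitude/latitude
restaurants = pd.read_json("restaurants.json", lines=True)
restaurants = gpd.GeoDataFrame(
    restaurants,
    geometry=gpd.points_from_xy(restaurants["longitude"], restaurants["latitude"]),
    crs="EPSG:4326",
)
```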

2.5 Overlay restaurants on the income map

We overlaid the restaurants on the income map and colored the points according to the 'stars' column.
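A sketch of the overlay, reprojecting the tracts to match the restaurant points:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))

# Household income choropleth as the base layer
income_tracts.to_crs("EPSG:4326").plot(
    column="B19013_001E", cmap="viridis", legend=True, ax=ax
)

# Restaurants on top, colored by their user stars
restaurants.plot(column="stars", cmap="RdYlGn", markersize=10, ax=ax)

ax.set_axis_off()
```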

2.6 Comparing polarity vs. stars geographically

We made two side-by-side maps of the restaurants, one colored by the average review polarity and one colored by the user stars. Similar to what we saw in Section 1, there appears to be a strong correlation between the two subplots.
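A sketch of those two maps, assuming the restaurants GeoDataFrame has been given a 'polarity' column (e.g., the average polarity of each restaurant's reviews) alongside 'stars':

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2, figsize=(16, 8))

# Left: restaurants colored by average review polarity
restaurants.plot(column="polarity", cmap="RdYlGn", markersize=10, legend=True, ax=axes[0])
axes[0].set_title("Average review polarity")

# Right: restaurants colored by user stars
restaurants.plot(column="stars", cmap="RdYlGn", markersize=10, legend=True, ax=axes[1])
axes[1].set_title("User stars")

for ax in axes:
    ax.set_axis_off()
```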