June 24, 2024

Harry Potter

  • Collection of 7 stories: each presents a mystery surrounding the Dark Lord that Harry and his friends have to face

  • Magical world setting: the story adds a portrayal of action, curiosity, and mystery through its mythical world

  • Themes of the power of love and friendship over evil and corruption: in their adventures, Harry and his friends often use their minds and their hearts to defeat evil

Goals

  • Analyze the frequency and usage of words

  • Analyze the sentiment of the 7 stories

  • Identify and explore common topics

Method

  • Collect text data
  • Preprocess the text data (cleaning, tokenization)
  • Perform word frequency analysis
  • Perform sentiment analysis
  • Perform tf-idf analysis
  • Perform topic modeling
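
The tf-idf step in the list above has no output of its own in these slides, so here is a minimal sketch of the calculation. It is written in Python rather than the R pipeline that produced the output shown here, and it assumes the common definition: term frequency is a term's count divided by the document's total token count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the toy token lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf for every (document, term) pair.

    docs maps a document name to its list of tokens.
    Returns a dict keyed by (document name, term).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        for term, n in counts.items():
            tf = n / total                              # share of this document
            idf = math.log(n_docs / doc_freq[term])     # 0 when the term is in every document
            scores[(name, term)] = tf * idf
    return scores


# Toy, made-up token lists standing in for two stories.
toy_docs = {
    "story_1": ["harry", "stone", "harry", "wand"],
    "story_2": ["harry", "chamber", "wand", "chamber"],
}
scores = tf_idf(toy_docs)
```

In this toy input, "harry" and "wand" appear in both documents, so their scores are 0, while "stone" and "chamber" each score above 0 because they are specific to one document.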

Text data: Raw

  • Source: https://www.gutenberg.org/cache/epub/1661/pg1661.txt
##  [1] "\f\fHP 1 - Harry Potter and the"      
##  [2] "Sorcerer's Stone"                     
##  [3] "Harry Potter and the Sorcerer's Stone"
##  [4] ""                                     
##  [5] "Harry Potter"                         
##  [6] "&"                                    
##  [7] "The Sorcerer’s Stone"                 
##  [8] ""                                     
##  [9] "by J.K. Rowling"                      
## [10] ""                                     
## [11] "\fHP 1 - Harry Potter and the"        
## [12] "Sorcerer's Stone"                     
## [13] ""                                     
## [14] "\fCHAPTER ONE"                        
## [15] "THE BOY WHO LIVED"

Text data: Dataframe

## Rows: 7,557
## Columns: 1
## $ text <chr> "\f\fHP 1 - Harry Potter and the", "Sorcerer's Stone", "Harry Pot…
text
HP 1 - Harry Potter and the
Sorcerer's Stone
Harry Potter and the Sorcerer's Stone
Harry Potter
&
The Sorcerer’s Stone
by J.K. Rowling

Text data: Tidy

## Rows: 78,065
## Columns: 2
## $ line <int> 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 7, 7, 7, 9, 9, 9,…
## $ word <chr> "hp", "1", "harry", "potter", "and", "the", "sorcerer's", "stone"…
line  word
1     hp
1     1
1     harry
1     potter
1     and
1     the
2     sorcerer's
2     stone
3     harry
3     potter
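
The one-word-per-row structure above can be sketched as follows. This is a Python approximation, not the R code behind these slides: the regular expression is an assumption that only roughly imitates word tokenization (lowercasing, dropping punctuation, keeping apostrophes inside words), and the names `TOKEN_RE` and `tidy_tokens` are hypothetical.

```python
import re

# Assumed token pattern: runs of letters/digits, optionally joined by an
# apostrophe or a period (so "sorcerer's" stays one token).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['’.][a-z0-9]+)*")


def tidy_tokens(lines):
    """Return one (line_number, word) row per token, numbering lines from 1."""
    rows = []
    for line_no, text in enumerate(lines, start=1):
        for word in TOKEN_RE.findall(text.lower()):
            rows.append((line_no, word))
    return rows


# The first raw lines shown earlier in this deck.
raw = [
    "\f\fHP 1 - Harry Potter and the",
    "Sorcerer's Stone",
    "Harry Potter and the Sorcerer's Stone",
    "",
    "Harry Potter",
    "&",
]
rows = tidy_tokens(raw)
```

Empty lines and lines with no word characters (such as "&") produce no rows, which is why line numbers 4 and 6 are missing from the tidy data.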

Word frequencies

Word frequencies: after removing "potter" as a stopword
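
A minimal Python sketch of counting words with and without a custom stopword. The base stopword set here is a tiny illustrative list, not the stopword lexicon used for the plots, and `word_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real lexicon).
BASE_STOPWORDS = {"the", "and", "a", "of", "to"}


def word_frequencies(words, extra_stopwords=()):
    """Count words after dropping base stopwords plus any custom ones."""
    stop = BASE_STOPWORDS | set(extra_stopwords)
    return Counter(w for w in words if w not in stop)


words = ["harry", "potter", "and", "the", "stone", "harry", "potter", "harry"]

before = word_frequencies(words)
after = word_frequencies(words, extra_stopwords={"potter"})
```

Adding "potter" as a custom stopword removes it from the counts without changing the counts of the remaining words.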

Wordcloud

Sentiment analysis: Most common positive and negative words

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 41555 of `x` matches multiple rows in `y`.
## ℹ Row 2698 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
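
The warning above says that some rows on each side of the join match more than one row on the other side. The Python sketch below reproduces that situation with made-up data: a word that occurs several times in the text matches one lexicon row many times, and a word listed twice in the lexicon matches twice per occurrence. Whether the lexicon used here really contains such double entries is an assumption; the toy lexicon and the `inner_join` helper are hypothetical.

```python
def inner_join(words, lexicon):
    """Pair each word with every lexicon row for that word.

    words is a list of strings; lexicon is a list of (word, sentiment) rows.
    """
    return [(w, s) for w in words for (lw, s) in lexicon if lw == w]


# Made-up lexicon in which "envious" is listed under both sentiments.
toy_lexicon = [
    ("brave", "positive"),
    ("dark", "negative"),
    ("envious", "positive"),
    ("envious", "negative"),
]

words = ["brave", "envious", "dark", "dark", "wand"]
joined = inner_join(words, toy_lexicon)
```

Here four of the five words have a lexicon entry, yet the join returns five rows, because the single "envious" is counted once as positive and once as negative. Counts built on such a join should be checked for words that were counted twice.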

Track sentiment in stories
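
One common way to track sentiment through a text is to split it into fixed-size chunks of lines and compute positive minus negative word counts per chunk. The Python sketch below assumes that approach; the chunk size, the toy lexicon, and the function name `net_sentiment_by_chunk` are illustrative assumptions, not values taken from the analysis.

```python
from collections import defaultdict


def net_sentiment_by_chunk(rows, lexicon, chunk_size=80):
    """Return {chunk_index: positive_count - negative_count}.

    rows holds (line_number, word) pairs; lexicon maps a word to
    "positive" or "negative". Words missing from the lexicon are skipped.
    """
    totals = defaultdict(int)
    for line_no, word in rows:
        sentiment = lexicon.get(word)
        if sentiment is None:
            continue
        chunk = line_no // chunk_size          # integer division groups lines
        totals[chunk] += 1 if sentiment == "positive" else -1
    return dict(sorted(totals.items()))


# Made-up lexicon and rows for illustration.
toy_lexicon = {"happy": "positive", "afraid": "negative", "dark": "negative"}
toy_rows = [
    (1, "happy"), (2, "harry"), (3, "happy"),
    (4, "dark"), (5, "afraid"), (6, "dark"),
]
trend = net_sentiment_by_chunk(toy_rows, toy_lexicon, chunk_size=3)
```

For the toy rows, `trend` is `{0: 1, 1: -1, 2: -1}`: the opening chunk is net positive and the later chunks are net negative.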

Summary

  • Word frequency analysis: most common words in the whole book
  • Sentiment analysis: most common positive and negative words, and sentiment tracked through the stories
  • Topic modeling: trained a topic model with 6 topics
  • Workflow for topic modeling:
    • Use tidy tools for initial data exploration and preparation
    • Cast to a non-tidy structure to run a machine learning algorithm
    • Tidy the results of the statistical modeling and apply tidy data principles again to understand the model results
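
The cast and tidy steps of the workflow above can be sketched in Python as a round trip between tidy counts and a document-term matrix. The modeling step itself is left out: a topic model would take the matrix as its input. The helper names and the toy counts are hypothetical.

```python
def cast_to_matrix(tidy_counts):
    """Turn tidy (document, term, n) rows into a dense document-term matrix.

    Returns (docs, terms, matrix), where matrix[i][j] is the count of
    terms[j] in docs[i].
    """
    docs = sorted({d for d, _, _ in tidy_counts})
    terms = sorted({t for _, t, _ in tidy_counts})
    row = {d: i for i, d in enumerate(docs)}
    col = {t: j for j, t in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in docs]
    for d, t, n in tidy_counts:
        matrix[row[d]][col[t]] += n
    return docs, terms, matrix


def tidy_matrix(docs, terms, matrix):
    """Turn a document-term matrix back into tidy rows, skipping zeros."""
    return [
        (d, t, matrix[i][j])
        for i, d in enumerate(docs)
        for j, t in enumerate(terms)
        if matrix[i][j] != 0
    ]


# Made-up per-chapter word counts.
tidy_counts = [
    ("chapter_1", "harry", 3),
    ("chapter_1", "owl", 1),
    ("chapter_2", "harry", 2),
    ("chapter_2", "wand", 4),
]
docs, terms, matrix = cast_to_matrix(tidy_counts)
round_trip = tidy_matrix(docs, terms, matrix)
```

The matrix form is what most modeling libraries expect, while the tidy form is easier to filter, join, and plot, which is why the workflow moves between the two.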

Thank you!