June 24, 2024

Romance of the Three Kingdoms

  • A 14th-century Chinese historical novel attributed to Luo Guanzhong
  • It is based on events at the end of the Han dynasty and during the Three Kingdoms period that followed
  • It focuses on the power struggles among three rival states: Shu Han, Cao Wei, and Eastern Wu
  • Three rulers recur throughout the novel: Cao Cao, Liu Bei, and Sun Quan

Goals

  • Analyze the frequency and usage of words
  • Analyze the sentiment

Method

  • Collect text data
  • Preprocess the text data (cleaning, tokenization)
  • Perform word frequency analysis
  • Perform sentiment analysis
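The cleaning step can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code behind these slides; the helper names (`to_rows`, `drop_empty`, `strip_punctuation`) are hypothetical, and the sample lines are copied from the raw text of the novel's opening.

```python
import re

# Sample lines copied from the opening of the raw text.
raw_lines = [
    "《三国演义》罗贯中",
    "",
    "第一回 宴桃园豪杰三结义 斩黄巾英雄首立功",
    "",
    "    \t\t",
    "",
    "    滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。",
]


def to_rows(lines):
    """Pair each line with its 1-based line number and trim whitespace."""
    return [(i, line.strip()) for i, line in enumerate(lines, start=1)]


def drop_empty(rows):
    """Discard rows whose text is empty after trimming."""
    return [(i, text) for i, text in rows if text]


def strip_punctuation(text):
    """Replace runs of non-word characters (punctuation, brackets) with a space.

    In Python 3, \\w matches CJK characters, so the Chinese text is preserved.
    """
    return re.sub(r"[^\w]+", " ", text).strip()


rows = drop_empty(to_rows(raw_lines))
cleaned = [(i, strip_punctuation(text)) for i, text in rows]

for line_number, text in cleaned:
    print(line_number, text)
```

Keeping the original line number on each row is a design choice that makes it possible to trace any token back to its position in the book later on.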

Text data: Raw

##  [1] "《三国演义》罗贯中"                                
##  [2] ""                                                  
##  [3] "第一回 宴桃园豪杰三结义 斩黄巾英雄首立功"          
##  [4] ""                                                  
##  [5] "    \t\t"                                          
##  [6] ""                                                  
##  [7] "    滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。"
##  [8] ""                                                  
##  [9] "    青山依旧在,几度夕阳红。白发渔樵江渚上,惯"    
## [10] ""                                                  
## [11] "    看秋月春风。一壶浊酒喜相逢。古今多少事,都付"  
## [12] ""                                                  
## [13] "    笑谈中。"                                      
## [14] ""                                                  
## [15] "    ——调寄《临江仙》"

Text data: Dataframe

line text
1 《三国演义》罗贯中
2
3 第一回 宴桃园豪杰三结义 斩黄巾英雄首立功
4
5
6
7 滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。
8
9 青山依旧在,几度夕阳红。白发渔樵江渚上,惯
10

Text data: Tidy

## Rows: 392,412
## Columns: 2
## $ line <int> 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7,…
## $ word <chr> "三国", "演义", "罗贯中", "第一", "回", "宴", "桃园", "豪杰", "三…
line word
1 三国
1 演义
1 罗贯中
3 第一
3 回
3 宴
3 桃园
3 豪杰
3
3

Chinese Word Segmentation

  • Unlike English, where word tokens are delimited by white space, written Chinese has no explicit word boundaries, so tokens must be inferred by a segmenter. Words nevertheless remain an important semantic unit in many linguistic analyses.

  • Chinese word segmenter: the jiebaR package

## Warning: package 'jiebaR' was built under R version 4.4.1
## Warning: package 'jiebaRD' was built under R version 4.4.1
## Warning: package 'tmcn' was built under R version 4.4.1
line word
1 三国演义
1 罗贯中
3 第一回
3 宴
3 桃园
3 豪杰
3 三
3 结义
3 斩
3 黄巾
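To illustrate why a dictionary matters for segmentation, here is a minimal sketch of forward maximum matching, a simple dictionary-based technique. It is not jiebaR's algorithm and is written in Python only for illustration; the mini-dictionary `DICTIONARY` and the function `segment` are hypothetical.

```python
# Hypothetical mini-dictionary for illustration; real segmenters ship far larger ones.
DICTIONARY = {
    "三国演义", "罗贯中", "第一回", "桃园", "豪杰",
    "结义", "黄巾", "英雄", "立功",
}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word become single-character tokens.
    Whitespace is skipped and never emitted as a token.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("第一回 宴桃园豪杰三结义 斩黄巾英雄首立功"))
```

Greedy matching of this kind is easy to follow but can mis-segment ambiguous strings, which is one reason dedicated segmentation packages are normally used instead.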

Word frequencies
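Counting word frequencies from a list of tokens can be sketched as below. This is an illustrative Python version, not the code used for these slides; the token list, the stop-word set, and the function `word_frequencies` are hypothetical.

```python
from collections import Counter


def word_frequencies(tokens, stop_words=frozenset(), top_n=None):
    """Count tokens, ignoring stop words, and return (word, count) pairs.

    Pairs are ordered from most to least frequent; top_n=None returns all.
    """
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(top_n)


# Hypothetical token list standing in for the segmented book.
sample_tokens = ["英雄", "桃园", "豪杰", "英雄", "结义", "黄巾", "英雄", "黄巾", "的"]

for word, count in word_frequencies(sample_tokens, stop_words={"的"}, top_n=2):
    print(word, count)
```

Filtering stop words before counting keeps very common function words from dominating the ranking and the wordcloud.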

Wordcloud

Sentiment analysis

Track sentiment in book
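One common way to track sentiment through a book is to score each word against a sentiment lexicon and sum the scores over fixed-size blocks of lines. The sketch below assumes that approach; the slides do not state the lexicon or block size that were used, so the toy `LEXICON`, the default `block_size`, and the function `sentiment_by_block` are all hypothetical, and Python is used only for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"英雄": 1, "豪杰": 1, "立功": 1, "败": -1, "空": -1}


def sentiment_by_block(rows, lexicon=LEXICON, block_size=80):
    """Sum lexicon scores over consecutive blocks of lines.

    rows: iterable of (line_number, word) pairs with 1-based line numbers.
    Returns {block_index: net_score}; words missing from the lexicon score 0.
    """
    scores = defaultdict(int)
    for line_number, word in rows:
        block = (line_number - 1) // block_size
        scores[block] += lexicon.get(word, 0)
    return dict(sorted(scores.items()))


# Hypothetical (line, word) rows in the same shape as the tidy table.
sample_rows = [
    (3, "豪杰"), (3, "英雄"), (3, "立功"),
    (7, "英雄"), (7, "空"),
    (85, "败"),
]

print(sentiment_by_block(sample_rows))
```

Plotting the net score against the block index then shows how the tone rises and falls over the course of the book.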

Summary

  • Segmentation of Chinese words
  • Word frequency analysis: most common words in the book
  • Sentiment analysis

Thank you!