Title: | Dictionaries and Word Lists for the 'qdap' Package |
---|---|
Description: | A collection of text analysis dictionaries and word lists for use with the 'qdap' package. |
Authors: | Tyler Rinker |
Maintainer: | Tyler Rinker <[email protected]> |
License: | GPL-2 |
Version: | 1.0.8 |
Built: | 2024-12-28 05:09:23 UTC |
Source: | https://github.com/trinker/qdapdictionaries |
A dataset containing abbreviations and their qdap friendly form.
data(abbreviations)
A data frame with 14 rows and 2 variables
abv. Common transcript abbreviations
rep. qdap representation of those abbreviations
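A two-column lookup of this shape can be applied to text with a simple replacement loop. The following is a minimal sketch, not the way qdap itself performs the replacement; the rows in `abbr_lookup` and the helper `replace_abbr` are hypothetical stand-ins, and the real contents of `abbreviations` may differ.

```r
# Hypothetical stand-in rows with the same two columns as `abbreviations`;
# the actual values shipped in the package may differ.
abbr_lookup <- data.frame(
  abv = c("Mr.", "Dr.", "vs."),
  rep = c("Mister", "Doctor", "versus"),
  stringsAsFactors = FALSE
)

# Replace each abbreviation with its representation, one row at a time.
# fixed = TRUE treats the period as a literal character, not a regex wildcard.
replace_abbr <- function(text, lookup) {
  for (i in seq_len(nrow(lookup))) {
    text <- gsub(lookup$abv[i], lookup$rep[i], text, fixed = TRUE)
  }
  text
}

replace_abbr("Dr. Smith met Mr. Jones.", abbr_lookup)
```

With the package installed, the stand-in could presumably be swapped for `qdapDictionaries::abbreviations`.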
A dataset containing a vector of action words. This is a subset of the Moby project: Moby Part-of-Speech.
data(action.verbs)
A vector with 1569 elements
From Grady Ward's Moby project: "This second edition is a particularly thorough revision of the original Moby Part-of-Speech. Beyond the fifteen thousand new entries, many thousand more entries have been scrutinized for correctness and modernity. This is unquestionably the largest P-O-S list in the world. Note that the many included phrases means that parsing algorithms can now tokenize in units larger than a single word, increasing both speed and accuracy."
A dataset containing a vector of adverbs. This is a subset of the Moby project: Moby Part-of-Speech.
data(adverb)
A vector with 13398 elements
From Grady Ward's Moby project: "This second edition is a particularly thorough revision of the original Moby Part-of-Speech. Beyond the fifteen thousand new entries, many thousand more entries have been scrutinized for correctness and modernity. This is unquestionably the largest P-O-S list in the world. Note that the many included phrases means that parsing algorithms can now tokenize in units larger than a single word, increasing both speed and accuracy."
A dataset containing a vector of words that amplify word meaning.
data(amplification.words)
A vector with 49 elements
Valence shifters are words that alter or intensify the meaning of polarized words; they include negators and amplifiers. Negators are, generally, adverbs that negate sentence meaning: for example, the word "like" in the sentence "I do like pie." is given the opposite meaning in the sentence "I do not like pie.", which now contains the negator "not". Amplifiers are, generally, adverbs or adjectives that intensify sentence meaning. Using the previous example, the sentiment of the negator-altered sentence "I seriously do not like pie." is heightened by the addition of the amplifier "seriously". De-amplifiers, by contrast, decrease the intensity of a polarized word, as in the sentence "I barely like pie", where "barely" de-amplifies the word "like".
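The interaction of negators, amplifiers, and de-amplifiers can be illustrated with a toy scorer. This is a rough sketch under stated assumptions, not the algorithm used by qdap: the word lists are tiny hypothetical stand-ins, the function `score_sentence` and its `weight` of 0.8 are invented for illustration, and every word before the polarized word is treated as context.

```r
# Tiny hypothetical word lists; the package's vectors are much longer.
positive     <- c("like", "love")
negative     <- c("hate")
negators     <- c("not", "never")
amplifiers   <- c("seriously", "really")
deamplifiers <- c("barely", "hardly")

# Score a sentence: each polarized word starts at +1 or -1, is flipped once
# per preceding negator, and is scaled up or down by preceding (de)amplifiers.
score_sentence <- function(sentence, weight = 0.8) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", "", sentence))
  words <- strsplit(cleaned, " +")[[1]]
  total <- 0
  for (i in which(words %in% c(positive, negative))) {
    base <- if (words[i] %in% positive) 1 else -1
    context <- words[seq_len(i - 1)]  # all words before the polarized word
    flip <- (-1)^sum(context %in% negators)
    shift <- weight * (sum(context %in% amplifiers) -
                       sum(context %in% deamplifiers))
    total <- total + flip * base * max(0, 1 + shift)
  }
  total
}

score_sentence("I do like pie.")                # positive
score_sentence("I do not like pie.")            # negated
score_sentence("I seriously do not like pie.")  # negated and amplified
score_sentence("I barely like pie")             # de-amplified
```

Under these assumptions, the negated sentence scores below zero, the amplified one moves further from zero, and the de-amplified one stays positive but closer to zero.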
A stopword list containing a character vector of stopwords.
data(BuckleySaltonSWL)
A character vector with 546 elements
From Onix Text Retrieval Toolkit API Reference: "This stopword list was built by Gerard Salton and Chris Buckley for the experimental SMART information retrieval system at Cornell University. This stopword list is generally considered to be on the larger side and so when it is used, some implementations edit it so that it is better suited for a given domain and audience while others use this stopword list as it stands."
Reduced from the original 571 words to 546.
http://www.lextek.com/manuals/onix/stopwords2.html
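A stopword vector like this one is typically used to filter tokens before analysis. The sketch below assumes a very small hypothetical stopword vector and a hypothetical helper `remove_stopwords`; with the package installed, `BuckleySaltonSWL` could presumably be passed in place of the stand-in.

```r
# Hypothetical stand-in; the real list has 546 entries.
stopwords <- c("the", "a", "of", "is")

# Lower-case the text, split on anything that is not a letter or apostrophe,
# and drop tokens that appear in the stopword vector.
remove_stopwords <- function(sentence, stopwords) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]  # discard empty tokens from leading separators
  words[!words %in% stopwords]
}

remove_stopwords("The cat is a friend of the dog.", stopwords)
```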
A dataset containing common contractions and their expanded form.
data(contractions)
A data frame with 70 rows and 2 variables
contraction. The contraction word.
expanded. The expanded form of the contraction.
A dataset containing a vector of words that de-amplify word meaning.
data(deamplification.words)
A vector with 13 elements
Valence shifters are words that alter or intensify the meaning of polarized words; they include negators and amplifiers. Negators are, generally, adverbs that negate sentence meaning: for example, the word "like" in the sentence "I do like pie." is given the opposite meaning in the sentence "I do not like pie.", which now contains the negator "not". Amplifiers are, generally, adverbs or adjectives that intensify sentence meaning. Using the previous example, the sentiment of the negator-altered sentence "I seriously do not like pie." is heightened by the addition of the amplifier "seriously". De-amplifiers, by contrast, decrease the intensity of a polarized word, as in the sentence "I barely like pie", where "barely" de-amplifies the word "like".
A dataset containing syllable counts.
data(DICTIONARY)
A data frame with 20137 rows and 2 variables
word. The word
syllables. Number of syllables
This data set is based on the Nettalk Corpus but has some researcher word deletions and additions based on the needs of the syllable_sum algorithm.
Sejnowski, T.J., and Rosenberg, C.R. (1987). "Parallel networks that learn to pronounce English text" in Complex Systems, 1, 145-168. Retrieved from: http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Nettalk+Corpus)
UCI Machine Learning Repository website
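A word-to-syllable table of this shape supports a simple lookup-based syllable count. The sketch below is only an illustration and is not the syllable_sum algorithm itself; the three rows in `dict` and the helper `count_syllables` are hypothetical, and words missing from the lookup are reported rather than estimated.

```r
# Hypothetical stand-in rows with the same two columns as DICTIONARY.
dict <- data.frame(
  word = c("the", "happy", "dictionary"),
  syllables = c(1, 2, 4),
  stringsAsFactors = FALSE
)

# A named vector gives direct lookup of syllable counts by word.
syl_key <- setNames(dict$syllables, dict$word)

# Sum the syllables of known words and list the words with no entry.
count_syllables <- function(sentence, key) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", "", sentence))
  words <- strsplit(cleaned, " +")[[1]]
  counts <- unname(key[words])  # NA for words absent from the key
  list(total = sum(counts, na.rm = TRUE), unknown = words[is.na(counts)])
}

count_syllables("The happy dictionary sings.", syl_key)
```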
A dataset containing discourse markers
data(discourse.markers.alemany)
A data frame with 97 rows and 5 variables
A dictionary of discourse markers from Alemany (2005). "In this lexicon, discourse markers are characterized by their structural (continuation or elaboration) and semantic (revision, cause, equality, context) meanings, and they are also associated to a morphosyntactic class (part of speech, PoS), one of adverbial (A), phrasal (P) or conjunctive (C)... Sometimes a discourse marker is underspecified with respect to a meaning. We encode this with a hash. This tends to happen with structural meanings, because these meanings can well be established by discursive mechanisms other than discourse markers, and the presence of the discourse marker just reinforces the relation, whichever it may be." (p. 191).
marker. The discourse marker
type. The semantic type (typically overlaps with semantic, except in the special types)
structural. How the marker is used structurally
semantic. How the marker is used semantically
pos. Part of speech: adverbial (A), phrasal (P) or conjunctive (C)
Alemany, L. A. (2005). Representing discourse for automatic text summarization via shallow NLP techniques (Unpublished doctoral dissertation). Universitat de Barcelona, Barcelona.
http://www.cs.famaf.unc.edu.ar/~laura/shallowdisc4summ/tesi_electronica.pdf
http://russell.famaf.unc.edu.ar/~laura/shallowdisc4summ/discmar/#description
Edward William Dolch's list of 220 Most Commonly Used Words.
data(Dolch)
A vector with 220 elements
Dolch's Word List made up 50-75% of all printed text in 1936.
Dolch, E. W. (1936). A basic sight vocabulary. Elementary School Journal, 36, 456-460.
A dataset containing common emoticons (adapted from Popular Emoticon List).
data(emoticon)
A data frame with 81 rows and 2 variables
meaning. The meaning of the emoticon
emoticon. The graphic representation of the emoticon
http://www.lingo2word.com/lists/emoticon_listH.html
A vector containing Fry's list of the 1000 most commonly used words.
data(Fry_1000)
A vector with 1000 elements
Fry's 1000 Word List makes up 90% of all printed text.
Fry, E. B. (1997). Fry 1000 instant words. Lincolnwood, IL: Contemporary Books.
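A coverage claim of this kind can be checked against any text by computing the share of tokens that appear in the word list. The following is a minimal sketch; the helper `coverage` and the four-word list are hypothetical, and with the package installed `Fry_1000` could presumably be supplied as `word_list`.

```r
# Proportion of tokens in `text` that appear in `word_list`.
coverage <- function(text, word_list) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]  # discard empty tokens
  mean(words %in% word_list)
}

# Hypothetical four-word list standing in for a real frequency list.
coverage("The cat sat on the mat.", c("the", "on", "of", "and"))
```

Here three of the six tokens are in the list, so the function returns 0.5.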
A vector of function words from John and Muriel Higgins's list used for the text game ECLIPSE. The list is augmented with additional contractions from contractions.
data(function.words)
A vector with 350 elements
http://myweb.tiscali.co.uk/wordscape/museum/funcword.html
A dataset containing a vector of Grady Ward's English words augmented with DICTIONARY, Mark Kantrowitz's names list, other proper nouns, and contractions.
data(GradyAugmented)
A vector with 122806 elements
A dataset containing a vector of Grady Ward's English words augmented with proper nouns (U.S. States, Countries, Mark Kantrowitz's Names List, and months) and contractions. The dataset is augmented for spell-checking purposes.
Moby Thesaurus List by Grady Ward http://www.gutenberg.org
List of names from Mark Kantrowitz http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/.
A copy of the README is available here per the author's request.
A dataset containing a character vector of common interjections.
data(interjections)
A character vector with 139 elements
http://www.vidarholen.net/contents/interjections/
A dataset containing a polarity lookup key (see polarity).
data(key.pol)
A hash key with words and corresponding values.
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. National Conference on Artificial Intelligence.
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
A dataset containing a power lookup key.
data(key.power)
A hash key with power words.
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
A dataset containing a strength lookup key.
data(key.strength)
A hash key with strength words.
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
A dataset containing a syllable lookup key (see DICTIONARY).
data(key.syl)
A hash key with a modified DICTIONARY data set.
For internal use.
UCI Machine Learning Repository website
A dataset containing a synonym lookup key.
data(key.syn)
A hash key with 10976 rows and 2 variables (words and synonyms).
Scraped from: Reverso Online Dictionary. The word list fed to Reverso consists of the unique words from the combination of DICTIONARY and labMT.
A dataset containing words, average happiness score (polarity), standard deviations, and rankings.
data(labMT)
A data frame with 10222 rows and 8 variables
word. The word.
happiness_rank. Happiness ranking of words based on average happiness scores.
happiness_average. Average happiness score.
happiness_standard_deviation. Standard deviations of the happiness scores.
twitter_rank. Twitter ranking of the word.
google_rank. Google ranking of the word.
nyt_rank. New York Times ranking of the word.
lyrics_rank. Lyrics ranking of the word.
Dodds, P.S., Harris, K.D., Kloumann, I.M., Bliss, C.A., & Danforth, C.M. (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PLoS ONE 6(12): e26752. doi:10.1371/journal.pone.0026752
Edward William Dolch's list of 220 Most Commonly Used Words by reading level.
data(Leveled_Dolch)
A data frame with 220 rows and 2 variables
Dolch's Word List made up 50-75% of all printed text in 1936.
Word. The word
Level. The reading level of the word
Dolch, E. W. (1936). A basic sight vocabulary. Elementary School Journal, 36, 456-460.
A dataset containing 1990 U.S. census data on first names.
data(NAMES)
A data frame with 5493 rows and 7 variables
name. A first name.
per.freq. Frequency in percent of the name by gender.
cum.freq. Cumulative frequency in percent of the name by gender.
rank. Rank of the name by gender.
gender. Gender of the combined male/female list (M/F).
gender2. Gender of the combined male/female list with "B" in place of overlapping (M/F) names.
pred.sex. Predicted gender of the names with B's in gender2 replaced with the gender that had a higher per.freq.
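The relationship between gender, gender2, and pred.sex described above can be sketched on a small example. The rows and frequencies below are hypothetical and are not taken from the census data; the code only illustrates the stated rule, marking names that occur under both genders with "B" and predicting the gender with the higher per.freq.

```r
# Hypothetical rows; "jordan" appears under both genders.
names_df <- data.frame(
  name     = c("mary", "james", "jordan", "jordan"),
  per.freq = c(2.6, 3.3, 0.05, 0.02),
  gender   = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# gender2: "B" for names listed under both genders, else the single gender.
overlapping <- names_df$name %in% names_df$name[duplicated(names_df$name)]
names_df$gender2 <- ifelse(overlapping, "B", names_df$gender)

# pred.sex: for each name, the gender of the row with the highest per.freq.
best <- sapply(
  split(names_df, names_df$name),
  function(d) d$gender[which.max(d$per.freq)]
)
names_df$pred.sex <- unname(best[names_df$name])

names_df
```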
A list version of the NAMES_SEX dataset broken down by first letter.
data(NAMES_LIST)
A list with 26 elements
Alphabetical list of dataframes with the following variables:
name. A first name.
gender2. Gender of the combined male/female list with "B" in place of overlapping (M/F) names.
pred.sex. Predicted gender of the names with B's in gender2 replaced with the gender that had a higher per.freq.
A truncated version of the NAMES dataset used for predicting.
data(NAMES_SEX)
A data frame with 5162 rows and 3 variables
name. A first name.
gender2. Gender of the combined male/female list with "B" in place of overlapping (M/F) names.
pred.sex. Predicted gender of the names with B's in gender2 replaced with the gender that had a higher per.freq.
A dataset containing a vector of words that negate word meaning.
data(negation.words)
A vector with 23 elements
Valence shifters are words that alter or intensify the meaning of polarized words; they include negators and amplifiers. Negators are, generally, adverbs that negate sentence meaning: for example, the word "like" in the sentence "I do like pie." is given the opposite meaning in the sentence "I do not like pie.", which now contains the negator "not". Amplifiers are, generally, adverbs or adjectives that intensify sentence meaning. Using the previous example, the sentiment of the negator-altered sentence "I seriously do not like pie." is heightened by the addition of the amplifier "seriously". De-amplifiers, by contrast, decrease the intensity of a polarized word, as in the sentence "I barely like pie", where "barely" de-amplifies the word "like".
A dataset containing a vector of negative words.
data(negative.words)
A vector with 4776 elements
A sentence containing more negative words would be deemed a negative sentence, whereas a sentence containing more positive words would be considered positive.
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. National Conference on Artificial Intelligence.
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
A stopword list containing a character vector of stopwords.
data(OnixTxtRetToolkitSWL1)
A character vector with 404 elements
From Onix Text Retrieval Toolkit API Reference: "This stopword list is probably the most widely used stopword list. It covers a wide number of stopwords without getting too aggressive and including too many words which a user might search upon."
Reduced from the original 429 words to 404.
http://www.lextek.com/manuals/onix/stopwords1.html
A dataset containing a vector of positive words.
data(positive.words)
A vector with 2003 elements
A sentence containing more negative words would be deemed a negative sentence, whereas a sentence containing more positive words would be considered positive.
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. National Conference on Artificial Intelligence.
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
A subset of the Harvard IV Dictionary containing a vector of words indicating power.
data(power.words)
A vector with 624 elements
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
A dataset containing a vector of common prepositions.
data(preposition)
A vector with 162 elements
Prints a view_data object.
## S3 method for class 'view_data'
print(x, ...)
x | The view_data object. |
... | ignored |
A collection of dictionaries and word lists to accompany the qdap package.
A subset of the Harvard IV Dictionary containing a vector of words indicating strength.
data(strong.words)
A vector with 1474 elements
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
A subset of the Harvard IV Dictionary containing a vector of words indicating submission.
data(submit.words)
A vector with 262 elements
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
A stopword list containing a character vector of stopwords.
data(Top100Words)
A character vector with 100 elements
Fry's Word List: The first 25 make up about one-third of all printed material in English. The first 100 make up about one-half of all printed material in English. The first 300 make up about 65% of all printed material in English.
Fry, E. B. (1997). Fry 1000 instant words. Lincolnwood, IL: Contemporary Books.
A stopword list containing a character vector of stopwords.
data(Top200Words)
A character vector with 200 elements
Fry's Word List: The first 25 make up about one-third of all printed material in English. The first 100 make up about one-half of all printed material in English. The first 300 make up about 65% of all printed material in English.
Fry, E. B. (1997). Fry 1000 instant words. Lincolnwood, IL: Contemporary Books.
A stopword list containing a character vector of stopwords.
data(Top25Words)
A character vector with 25 elements
Fry's Word List: The first 25 make up about one-third of all printed material in English. The first 100 make up about one-half of all printed material in English. The first 300 make up about 65% of all printed material in English.
Fry, E. B. (1997). Fry 1000 instant words. Lincolnwood, IL: Contemporary Books.
Lists and describes all the data sets available in qdapDictionaries.
view_data(package = "qdapDictionaries")
package | The name of the package. |
Returns the data sets of qdapDictionaries as a dataframe.
view_data()
A subset of the Harvard IV Dictionary containing a vector of words indicating weakness.
data(weak.words)
A vector with 647 elements
http://www.wjh.harvard.edu/~inquirer/inqdict.txt