Title: | Tools for Stemming and Lemmatizing Text |
---|---|
Description: | Tools that stem and lemmatize text. Stemming is a process that removes endings such as suffixes. Lemmatization is the process of grouping inflected forms together as a single base form. |
Authors: | Tyler Rinker [aut, cre] |
Maintainer: | Tyler Rinker |
License: | GPL-2 |
Version: | 0.1.5 |
Built: | 2025-01-09 06:28:00 UTC |
Source: | https://github.com/trinker/textstem |
Lemmatize a vector of strings.
lemmatize_strings(x, dictionary = lexicon::hash_lemmas, ...)
x |
A vector of strings. |
dictionary |
A dictionary of base terms and lemmas to use for
replacement. The first column should be the full word form in lower case
while the second column is the corresponding replacement lemma. The default
uses lexicon::hash_lemmas; a dictionary built from the text with
make_lemma_dictionary can be passed instead. |
... |
Other arguments passed to |
Returns a vector of lemmatized strings.
The lemmatizer splits the string apart into tokens for speed optimization. After the lemmatizing occurs the strings are pasted back together. The strings are not guaranteed to retain exact spacing of the original.
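The note on spacing can be illustrated with a minimal base-R sketch. This is not the package's implementation: the helper `split_and_rejoin` is hypothetical and splits on whitespace only, whereas a real tokenizer would also separate punctuation. It shows only why the original spacing may be lost when tokens are pasted back together with single spaces.

```r
# Hypothetical helper (not part of textstem): split each string on runs of
# whitespace, then paste the tokens back together with single spaces.
split_and_rejoin <- function(x) {
    vapply(strsplit(x, "\\s+"), function(tokens) {
        # Keep missing values as NA instead of turning them into "NA"
        if (length(tokens) == 1L && is.na(tokens)) return(NA_character_)
        # Drop empty tokens produced by leading whitespace
        paste(tokens[nzchar(tokens)], collapse = " ")
    }, character(1))
}

# Runs of spaces in the input collapse to single spaces in the output
split_and_rejoin(c("the  dirtier   dog", NA))
```

Any per-token replacement (such as a lemma lookup) would happen between the split and the paste; the round trip itself is what changes the spacing.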
x <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)

## Default lexicon::hash_lemmas dictionary
lemmatize_strings(x)

## Hunspell dictionary
lemma_dictionary <- make_lemma_dictionary(x, engine = 'hunspell')
lemmatize_strings(x, dictionary = lemma_dictionary)

## Bigger data set
library(dplyr)
presidential_debates_2012$dialogue %>%
    lemmatize_strings() %>%
    head()

## Not run:
## Treetagger dictionary
lemma_dictionary2 <- make_lemma_dictionary(x, engine = 'treetagger')
lemmatize_strings(x, lemma_dictionary2)

lemma_dictionary3 <- presidential_debates_2012$dialogue %>%
    make_lemma_dictionary(engine = 'treetagger')

presidential_debates_2012$dialogue %>%
    lemmatize_strings(lemma_dictionary3) %>%
    head()

## End(Not run)
Lemmatize a vector of words.
lemmatize_words(x, dictionary = lexicon::hash_lemmas, ...)
x |
A vector of words. |
dictionary |
A dictionary of base terms and lemmas to use for
replacement. The first column should be the full word form in lower case
while the second column is the corresponding replacement lemma. The default
uses lexicon::hash_lemmas. |
... |
ignored. |
Returns a vector of lemmatized words.
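The two-column dictionary format described above can be sketched in base R as follows. The data frame `my_dictionary` and the helper `lookup_lemmas` are hypothetical names introduced for illustration, and the lookup is only an approximation of what a dictionary-based lemmatizer does (exact lower-case matching on the first column), not necessarily the package's own code.

```r
# Illustrative two-column dictionary: full word form (lower case) first,
# replacement lemma second. The column names are arbitrary.
my_dictionary <- data.frame(
    token = c("doggies", "running", "pies"),
    lemma = c("doggy", "run", "pie"),
    stringsAsFactors = FALSE
)

# Hypothetical helper (not part of textstem): replace each word that has a
# lower-case match in column 1 with the lemma in column 2.
lookup_lemmas <- function(words, dictionary) {
    locs <- match(tolower(words), dictionary[[1]])
    found <- !is.na(locs)
    words[found] <- dictionary[[2]][locs[found]]
    words
}

# Unmatched words, punctuation and NA pass through unchanged
lookup_lemmas(c("the", NA, "Doggies", "running", "."), my_dictionary)
```

With textstem installed, a data frame shaped like `my_dictionary` should be usable as the `dictionary` argument, for example `lemmatize_words(x, dictionary = my_dictionary)`.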
x <- c("the", NA, 'doggies', ',', 'well', 'they', "aren\'t", 'Joyfully', 'running', '.')
lemmatize_words(x)
Given a set of text strings, the function generates a dictionary of lemmas corresponding to words that are not in base form.
make_lemma_dictionary(..., engine = "hunspell", path = NULL, lang = switch(engine, hunspell = { "en_US" }, treetagger = { "en" }, lexicon = { NULL }, stop("engine not found")))
engine |
One of: "hunspell", "treetagger" or "lexicon". The lexicon and hunspell choices use the lexicon and hunspell packages, which may be faster than TreeTagger and need no external tools, but are likely less accurate. TreeTagger is likely more accurate but requires installing the TreeTagger program (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger). |
path |
Path to the TreeTagger program if engine = "treetagger". |
lang |
A character string naming the language to be used in koRpus
(treetagger) or hunspell. The default language is English ("en_US" for hunspell, "en" for treetagger). |
... |
A vector of texts to generate lemmas for. |
Returns a two column data.frame with tokens and corresponding lemmas.
x <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!'
)

make_lemma_dictionary(x)

## Not run:
make_lemma_dictionary(x, engine = 'treetagger')

## End(Not run)
A dataset containing a cleaned version of all three presidential debates for the 2012 election.
data(presidential_debates_2012)
A data frame with 2912 rows and 4 variables
person. The speaker
tot. Turn of talk
dialogue. The words spoken
time. Variable indicating which of the three debates the dialogue is from
A dataset containing a character vector of the text from Seuss's 'Sam I Am'.
data(sam_i_am)
A character vector with 169 elements
Seuss, Dr. (1960). Green Eggs and Ham.
Stem a vector of strings.
stem_strings(x, language = "porter", ...)
x |
A vector of strings. |
language |
The name of a recognized language (see
|
... |
Other arguments passed to |
Returns a vector of stemmed strings.
The stemmer requires splitting the string apart into tokens. After the stemming occurs the strings are pasted back together. The strings are not guaranteed to retain exact spacing of the original.
x <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)

stem_strings(x)
Stem a vector of words.
stem_words(x, language = "porter", ...)
x |
A vector of words. |
language |
The name of a recognized language (see
|
... |
ignored. |
Returns a vector of stemmed words.
x <- c("the", 'doggies', ',', 'well', 'they', "aren\'t", 'Joyfully', 'running', '.')
stem_words(x)
Tools that stem and lemmatize text. Stemming is a process that removes endings such as suffixes. Lemmatization is the process of grouping inflected forms together as a single base form.