Title: | Text Cleaning Tools |
---|---|
Description: | Tools to clean and process text. Tools are geared toward checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards (2001) <doi:10.1006/csla.2001.0169>) or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms. The replace_emoticon() function replaces emoticons with word equivalents. |
Authors: | Tyler Rinker [aut, cre], ctwheels StackOverflow [ctb], Surin Space [ctb] |
Maintainer: | Tyler Rinker <[email protected]> |
License: | GPL-2 |
Version: | 0.9.7 |
Built: | 2024-10-25 02:59:54 UTC |
Source: | https://github.com/trinker/textclean |
Use %like% as a SQL-esque operator for pattern matching. %like% is case insensitive, while %slike% is case sensitive. This is most useful in a dplyr::filter.
var %like% pattern
var %LIKE% pattern
var %slike% pattern
var %SLIKE% pattern
var |
A variable/column. |
pattern |
A search pattern. |
state.name[state.name %like% 'or']
state.name[state.name %LIKE% 'or']
state.name[state.name %slike% 'or'] ## No Oregon
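Since the description notes these operators are most useful inside dplyr::filter, here is a minimal sketch of that usage (an assumption-laden illustration: it assumes the dplyr package is installed, and builds a made-up `states` data frame from base R's state.name vector, which the examples above already use):

```r
## Sketch: %like% / %slike% inside dplyr::filter().
## `states` is a hypothetical data frame built for illustration.
library(dplyr)
library(textclean)

states <- data.frame(name = state.name)

## Case-insensitive: keeps "Oregon", "California", etc.
states %>% filter(name %like% "or")

## Case-sensitive: "Oregon" is dropped because its "Or" is capitalized
states %>% filter(name %slike% "or")
```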
Adds a space after a comma, as strip and many other functions may consider a comma-separated string to be one word (i.e., "one,two,three" becomes "onetwothree" rather than "one two three").
add_comma_space(x)
x |
The text variable. |
Returns a vector of strings with commas that have a space after them.
## Not run: 
x <- c("the, dog,went", "I,like,it", "where are you", NA, "why", ",", ",f")
add_comma_space(x)
## End(Not run)
Detect missing endmarks and replace with the desired symbol.
add_missing_endmark(x, replacement = "|", endmarks = c("?", ".", "!"), ...)
x |
The text variable. |
replacement |
Character string equal in length to pattern or of length one which are a replacement for matched pattern. |
endmarks |
The potential ending punctuation marks. |
... |
Additional arguments passed to
|
Returns a vector with missing endmarks added.
x <- c(
    "This in a",
    "I am funny!",
    "An ending of sorts%",
    "What do you want?"
)
add_missing_endmark(x)
check_text - Uncleaned text may result in errors, warnings, and incorrect results in subsequent analysis. check_text checks text for potential problems and suggests possible fixes. Potential text anomalies that are detected include: factors, missing ending punctuation, empty cells, double punctuation, non-space after comma, no alphabetic characters, non-ASCII, missing value, and potentially misspelled words.
available_checks - Provides a data.frame view of all the available checks in the check_text function.
check_text(x, file = NULL, checks = NULL, n = 10, ...)
available_checks()
x |
The text variable. |
file |
A connection, or a character string naming the file to print to.
If |
checks |
A vector of checks to include from |
n |
The number of affected elements to print out (the rest are truncated). |
... |
ignored. |
Returns a list with the following potential text faults report:
contraction- Text elements that contain contractions
date- Text elements that contain dates
digit- Text elements that contain digits/numbers
email- Text elements that contain email addresses
emoticon- Text elements that contain emoticons
empty- Text elements that contain empty text cells (all white space)
escaped- Text elements that contain escaped back spaced characters
hash- Text elements that contain Twitter style hash tags (e.g., #rstats)
html- Text elements that contain HTML markup
incomplete- Text elements that contain incomplete sentences (e.g., uses ending punctuation like ...)
kern- Text elements that contain kerning (e.g., 'The B O M B!')
list_column- Text variable that is a list column
missing_value- Text elements that contain missing values
misspelled- Text elements that contain potentially misspelled words
no_alpha- Text elements that contain elements with no alphabetic (a-z) letters
no_endmark- Text elements that contain elements with missing ending punctuation
no_space_after_comma- Text elements that contain commas with no space afterwards
non_ascii- Text elements that contain non-ASCII text
non_character- Text variable that is not a character column (likely factor)
non_split_sentence- Text elements that contain unsplit sentences (more than one sentence per element)
tag- Text elements that contain Twitter style handle tags (e.g., @trinker)
time- Text elements that contain timestamps
url- Text elements that contain URLs
The output is a list containing meta checks and elemental checks but prints as a pretty formatted output with potential problem elements, the accompanying text, and possible suggestions to fix the text.
## Not run: 
v <- list(c('foo', 'bar'), NA, c('hello', 'world'))
check_text(v)

w <- factor(unlist(v))
check_text(w)

x <- c("i like", "<p>i want. </p>thet them ther .", "I am ! that|", "", NA,
    "\"they\",were there", ".", " ", "?", "3;", "I like goud eggs!",
    "i 4like...", "\\tgreat", 'She said "yes"')
check_text(x)
print(check_text(x), include.text=FALSE)
check_text(x, checks = c('non_split_sentence', 'no_endmark'))

elementals <- available_checks()[is_meta != TRUE,][['fun']]
check_text(
    x,
    checks = elementals[!elementals %in% c('non_split_sentence', 'no_endmark')]
)

y <- c("A valid sentence.", "yet another!")
check_text(y)

z <- rep("dfsdsd'nt", 120)
check_text(z)
## End(Not run)
A fictitious dataset useful for small demonstrations.
data(DATA)
A data frame with 11 rows and 5 variables
person. Speaker
sex. Gender
adult. Dummy coded adult (0-no; 1-yes)
state. Statement (dialogue)
code. Dialogue coding scheme
drop_element - Filter to drop the matching elements of a vector.
keep_element - Filter to keep the matching elements of a vector.
drop_element(x, pattern, regex = TRUE, ...)
drop_element_regex(x, pattern, ...)
drop_element_fixed(x, ...)
keep_element(x, pattern, regex = TRUE, ...)
keep_element_fixed(x, ...)
keep_element_regex(x, pattern, ...)
x |
A character vector. |
pattern |
A regex pattern to match for exclusion. |
regex |
logical. If setting this to |
... |
Other arguments passed to |
Returns a vector with matching elements removed.
x <- c('dog', 'cat', 'bat', 'dingo', 'dragon', 'dino')
drop_element(x, '^d.+?g')
keep_element(x, '^d.+?g')
drop_element(x, 'at$')
drop_element(x, '^d')
drop_element(x, '\\b(dog|cat)\\b')
drop_element_fixed(x, 'bat', 'cat')
drops <- c('bat', 'cat')
drop_element_fixed(x, drops)
drop_row - Remove rows from a data set that contain a given marker/term.
keep_row - Keep rows from a data set that contain a given marker/term.
drop_empty_row - Removes the empty rows of a data set that are common in reading in data.
drop_NA - Removes the NA rows of a data set.
drop_row(dataframe, column, terms, ...)
keep_row(dataframe, column, terms, ...)
drop_empty_row(dataframe)
drop_NA(dataframe, column = TRUE, ...)
dataframe |
A dataframe object. |
column |
Column name to search for markers/terms. |
terms |
The regex terms/markers of the rows that are to be removed from the dataframe. |
... |
Other arguments passed to |
drop_row
- returns a dataframe with the termed/markered rows
removed.
drop_empty_row
- returns a dataframe with empty rows removed.
drop_NA
- returns a dataframe with NA
rows removed.
## Not run: 
## drop_row EXAMPLE:
drop_row(DATA, "person", c("sam", "greg"))
keep_row(DATA, "person", c("sam", "greg"))
drop_row(DATA, 1, c("sam", "greg"))
drop_row(DATA, "state", c("Comp"))
drop_row(DATA, "state", c("I "))
drop_row(DATA, "state", c("you"), ignore.case=TRUE)

## drop_empty_row EXAMPLE:
(dat <- rbind.data.frame(DATA[, c(1, 4)],
    matrix(rep(" ", 4), ncol = 2,
        dimnames = list(12:13, colnames(DATA)[c(1, 4)]))))
drop_empty_row(dat)

## drop_NA EXAMPLE:
DATA[1:3, "state"] <- NA
drop_NA(DATA)
## End(Not run)
This is a stripped down version of gsubfn from the gsubfn package. It finds a regex match, and then uses a function to operate on these matches and uses them to replace the original matches. Note that the stringi package is used for matching and extracting the regex matches. For more powerful or flexible needs please see the gsubfn package.
fgsub(x, pattern, fun, ...)
x |
A character vector. |
pattern |
Character string to be matched in the given character vector. |
fun |
A function to operate on the extracted matches. |
... |
ignored. |
Returns a vector with the pattern replaced.
## In this example the regex looks for words that contain a lower case letter
## followed by the same letter at least 2 more times. It then extracts these
## words, splits them apart into letters, reverses the string, pastes them
## back together, wraps them with double angle braces, and then puts them back
## at the original locations.
fgsub(
    x = c(NA, 'df dft sdf', 'sd fdggg sd dfhhh d', 'ddd'),
    pattern = "\\b\\w*([a-z])(\\1{2,})\\w*\\b",
    fun = function(x) {
        paste0('<<', paste(rev(strsplit(x, '')[[1]]), collapse = ''), '>>')
    }
)

## In this example we extract numbers, strip out non-digits, coerce them to
## numeric, cut them in half, round up to the closest integer, add the commas
## back, and replace back into the original locations.
fgsub(
    x = c(NA, 'I want 32 grapes', 'he wants 4 ice creams',
        'they want 1,234,567 dollars'),
    pattern = "[\\d,]+",
    fun = function(x) {
        prettyNum(ceiling(as.numeric(gsub('[^0-9]', '', x))/2), big.mark = ',')
    }
)

## In this example we extract leading zeros and convert them to an equal
## number of spaces.
fgsub(
    x = c(NA, "00:04", "00:08", "00:01", "06:14", "00:02", "00:04"),
    pattern = '^0+',
    fun = function(x) {gsub('0', ' ', x)}
)
Uses regular expressions to sub out a single day or month with a leading zero and then coerces to a date object.
fix_mdyyyy(x, ...)
x |
A character date in the form of m/d/yyyy where m and d can be single integers like 1 for January. |
... |
ignored. |
Returns a date vector.
fix_mdyyyy(c('4/23/2017', '12/1/2016', '3/3/2013', '12/12/2012', '2013-01-01'))

## Not run: 
library(dplyr)
data_frame(
    x = 1:4,
    y = LETTERS[1:4],
    start_date = c('4/23/2017', '12/1/2016', '3/3/2013', '12/12/2012'),
    end_date = c('5/23/2017', '12/9/2016', '3/3/2016', '2/01/2012')
) %>%
    mutate_at(vars(ends_with('_date')), fix_mdyyyy)
## End(Not run)
A logical test of missing sentence ending punctuation.
has_endmark(x, endmarks = c("?", ".", "!"), ...)
x |
A character vector. |
endmarks |
The potential ending punctuation marks. |
... |
ignored. |
Returns a logical vector.
x <- c(
    "I like it.",
    "Et tu?",
    "Not so much",
    "Oh, I understand.",
    "At 3 p.m., we go",
    NA
)
has_endmark(x)
Add -s, -es, or -ies to words.
make_plural(
    x,
    keep.original = FALSE,
    irregular = lexicon::pos_df_irregular_nouns
)
x |
A vector of words to make plural. |
keep.original |
logical. If |
irregular |
A |
Returns a vector of plural words.
x <- c('fox', 'sky', 'dog', 'church', 'fish', 'miss', 'match', 'deer', 'block')
make_plural(x)
Given a text, find all the tokens that match a regex(es). This function is particularly useful with replace_tokens.
match_tokens(x, pattern, ignore.case = TRUE, ...)
x |
A character vector. |
pattern |
Character string(s) to be matched in the given character vector. |
ignore.case |
logical. If |
... |
ignored. |
Returns a vector of tokens from a text matching a specific regex pattern.
with(DATA, match_tokens(state, c('^li', 'ou')))
with(DATA, match_tokens(state, c('^Th', '^I'), ignore.case = TRUE))
with(DATA, match_tokens(state, c('^Th', '^I'), ignore.case = FALSE))
mgsub - A wrapper for gsub that takes a vector of search terms and a vector or single value of replacements.
mgsub_fixed - An alias for mgsub.
mgsub_regex - A wrapper for mgsub with fixed = FALSE.
mgsub_regex_safe - A wrapper for mgsub.
mgsub(
    x,
    pattern,
    replacement,
    leadspace = FALSE,
    trailspace = FALSE,
    fixed = TRUE,
    trim = FALSE,
    order.pattern = fixed,
    safe = FALSE,
    ...
)

mgsub_fixed(
    x,
    pattern,
    replacement,
    leadspace = FALSE,
    trailspace = FALSE,
    fixed = TRUE,
    trim = FALSE,
    order.pattern = fixed,
    safe = FALSE,
    ...
)

mgsub_regex(
    x,
    pattern,
    replacement,
    leadspace = FALSE,
    trailspace = FALSE,
    fixed = FALSE,
    trim = FALSE,
    order.pattern = fixed,
    ...
)

mgsub_regex_safe(x, pattern, replacement, ...)
x |
A character vector. |
pattern |
Character string to be matched in the given character vector. |
replacement |
Character string equal in length to pattern or of length one which are a replacement for matched pattern. |
leadspace |
logical. If |
trailspace |
logical. If |
fixed |
logical. If |
trim |
logical. If |
order.pattern |
logical. If |
safe |
logical. If |
... |
Additional arguments passed to |
mgsub
- Returns a vector with the pattern replaced.
mgsub(DATA$state, c("it's", "I'm"), c("it is", "I am"))
mgsub(DATA$state, "[[:punct:]]", "PUNC", fixed = FALSE)

## Not run: 
library(textclean)
hunthou <- replace_number(seq_len(1e5))

textclean::mgsub(
    "'twenty thousand three hundred five' into 20305",
    hunthou,
    seq_len(1e5)
)
## "'20305' into 20305"

## Larger example from: https://stackoverflow.com/q/18332463/1000343
## A slower approach
fivehunthou <- replace_number(seq_len(5e5))

testvect <- c("fifty seven", "four hundred fifty seven",
    "six thousand four hundred fifty seven",
    "forty six thousand four hundred fifty seven",
    "forty six thousand four hundred fifty seven",
    "three hundred forty six thousand four hundred fifty seven"
)

textclean::mgsub(testvect, fivehunthou, seq_len(5e5))

## Safe substitution: Uses the mgsub package as the backend
dubious_string <- "Dopazamine is a fake chemical"
pattern <- c("dopazamin", "do.*ne")
replacement <- c("freakout", "metazamine")

mgsub(dubious_string, pattern, replacement, ignore.case = TRUE, fixed = FALSE)
mgsub(dubious_string, pattern, replacement, safe = TRUE, fixed = FALSE)
## End(Not run)
Prints a check_text object.
## S3 method for class 'check_text'
print(x, include.text = TRUE, file = NULL, n = NULL, ...)
x |
The check_text object. |
include.text |
logical. If |
file |
A connection, or a character string naming the file to print to.
If |
n |
The number of affected elements to print out (the rest are truncated) |
... |
ignored |
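This help page has no examples of its own, so here is a short usage sketch grounded in the check_text examples earlier in this document (it assumes the textclean package is attached; the sample vector is made up for illustration):

```r
## Sketch: controlling how a check_text report prints.
library(textclean)

x <- c("i like", "it!", "", NA, "they,were there")
chk <- check_text(x)

## Suppress the offending text itself; show only locations and suggestions
print(chk, include.text = FALSE)

## Truncate the report to at most 2 affected elements per check
print(chk, n = 2)
```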
Prints a sub_holder object.
## S3 method for class 'sub_holder'
print(x, ...)
x |
The sub_holder object |
... |
ignored |
Prints a which_are_locs object.
## S3 method for class 'which_are_locs'
print(x, n = 100, file = NULL, ...)
x |
A which_are_locs object |
n |
The number of affected elements to print out (the rest are truncated) |
file |
Path to an external file to print to |
... |
ignored. |
This function replaces contractions with long form.
replace_contraction(
    x,
    contraction.key = lexicon::key_contractions,
    ignore.case = TRUE,
    ...
)
x |
The text variable. |
contraction.key |
A two column hash of contractions (column 1) and
expanded form replacements (column 2). Default is to use
|
ignore.case |
logical. Should case be ignored? |
... |
ignored. |
Returns a vector with contractions replaced.
## Not run: 
x <- c("Mr. Jones isn't going.",
    "Check it out what's going on.",
    "He's here but didn't go.",
    "the robot at t.s. wasn't nice",
    "he'd like it if i'd go away")
replace_contraction(x)
## End(Not run)
Replaces dates with word equivalents.
replace_date(x, pattern = NULL, replacement = NULL, ...)
x |
The text variable. |
pattern |
Character date regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with the pattern replaced.
x <- c(
    NA,
    '11-16-1980 and 11/16/1980',
    "and 2017-02-08 but then there's 2/8/2017 too"
)
replace_date(x)
replace_date(x, replacement = '<<DATE>>')
Replaces email addresses.
replace_email(x, pattern = qdapRegex::grab("rm_email"), replacement = "", ...)
x |
The text variable. |
pattern |
Character email regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with email addresses replaced.
x <- c(
    "fred is [email protected] and joe is [email protected] - but @this is a",
    "twitter handle for [email protected] or [email protected]/[email protected]",
    "hello world",
    NA
)
replace_email(x)
replace_email(x, replacement = '<<EMAIL>>')
replace_email(x, replacement = '<a href="mailto:$1" target="_blank">$1</a>')

## Replacement with a function
replace_email(x,
    replacement = function(x){
        sprintf('<a href="mailto:%s" target="_blank">%s</a>', x, x)
    }
)

replace_email(x,
    replacement = function(x){
        gsub('@.+$', ' {{at domain}}', x)
    }
)
Replaces emojis with word equivalents or a token identifier for use in the sentimentr package. Note that this function will coerce the text to ASCII using Encoding(x) <- "latin1"; iconv(x, "latin1", "ASCII", "byte").
The function replace_emoji replaces emojis with text representations while replace_emoji_identifier replaces with a unique identifier that corresponds to lexicon::hash_sentiment_emoji for use in the sentimentr package.
replace_emoji(x, emoji_dt = lexicon::hash_emojis, ...)
replace_emoji_identifier(x, emoji_dt = lexicon::hash_emojis_identifier, ...)
x |
The text variable. |
emoji_dt |
A data.table of emojis (ASCII byte representations) and corresponding word/identifier meanings. |
... |
Other arguments passed to |
Returns a vector of strings with emojis replaced with word equivalents.
fls <- system.file("docs/emoji_sample.txt", package = "textclean")
x <- readLines(fls)[1]
replace_emoji(x)
replace_emoji_identifier(x)
Replaces emoticons with word equivalents.
replace_emoticon(x, emoticon_dt = lexicon::hash_emoticons, ...)
x |
The text variable. |
emoticon_dt |
A data.table of emoticons (graphical representations) and corresponding word meanings. |
... |
Other arguments passed to |
Returns a vector of strings with emoticons replaced with word equivalents.
x <- c(
    paste(
        "text from:",
        "http://www.webopedia.com/quick_ref/textmessageabbreviations_02.asp"
    ),
    "... understanding what different characters used in smiley faces mean:",
    "The close bracket represents a sideways smile )",
    "Add in the colon and you have sideways eyes :",
    "Put them together to make a smiley face :)",
    "Use the dash - to add a nose :-)",
    paste(
        "Change the colon to a semi-colon ;",
        "and you have a winking face ;) with a nose ;-)"
    ),
    paste(
        "Put a zero 0 (halo) on top and now you have a winking,",
        "smiling angel 0;) with a nose 0;-)"
    ),
    "Use the letter 8 in place of the colon for sunglasses 8-)",
    "Use the open bracket ( to turn the smile into a frown :-(",
    "I have experience with using the xp emoticon"
)
replace_emoticon(x)
Replaces grades with word equivalents.
replace_grade(x, grade_dt = lexicon::key_grade, ...)
x |
The text variable. |
grade_dt |
A data.table of grades and corresponding word meanings. |
... |
ignored. |
Returns a vector of strings with grades replaced with word equivalents.
(text <- replace_grade(c(
    "I give an A+",
    "He deserves an F",
    "It's C+ work",
    "A poor example deserves a C!"
)))
Replaces Twitter style hash tags (e.g., '#rstats').
replace_hash(x, pattern = qdapRegex::grab("rm_hash"), replacement = "", ...)
x |
The text variable. |
pattern |
Character hash regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with hashes replaced.
x <- c(
    "@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)
replace_hash(x)
replace_hash(x, replacement = '<<HASH>>')
replace_hash(x, replacement = '$3')

## Replacement with a function
replace_hash(x,
    replacement = function(x){
        paste0('{{', gsub('^#', 'TOPIC: ', x), '}}')
    }
)
Replaces HTML markup. The angle braces are removed and the HTML symbol markup is replaced with equivalent symbols.
replace_html(x, symbol = TRUE, ...)
x |
The text variable. |
symbol |
logical. If TRUE the symbols are retained with appropriate replacements. If |
... |
Ignored. |
Replacements for symbols are as follows:
html | symbol |
&copy; | (c) |
&reg; | (r) |
&trade; | tm |
&ldquo; | " |
&rdquo; | " |
&lsquo; | ' |
&rsquo; | ' |
&bull; | - |
&middot; | - |
&sdot; | [] |
&ndash; | - |
&mdash; | - |
&cent; | cents |
&pound; | pounds |
&euro; | euro |
&ne; | != |
&frac12; | half |
&frac14; | quarter |
&frac34; | three fourths |
&deg; | degrees |
&larr; | <- |
&rarr; | -> |
&hellip; | ... |
&nbsp; |   |
&lt; | < |
&gt; | > |
&laquo; | << |
&raquo; | >> |
&amp; | & |
&quot; | " |
&apos; | ' |
&yen; | yen |
Returns a vector with HTML markup replaced.
x <- c(
    "<bold>Random</bold> text with symbols: &lt; &gt; &amp; &quot; &apos;",
    "<p>More text</p> &cent; &pound; &yen; &euro; &copy; &reg; &laquo; &raquo;"
)
replace_html(x)
replace_html(x, FALSE)
replace_white(replace_html(x, FALSE))
Replaces incomplete sentence end marks (.., ..., .?, ..?, en & em dash etc.) with "|".
replace_incomplete(x, replacement = "|", ...)
x |
The text variable. |
replacement |
A string to replace incomplete punctuation marks with. |
... |
ignored. |
Returns a text variable (character string) with incomplete sentence marks (.., ..., .?, ..?, en & em dash etc.) replaced with "|".
x <- c("the...", "I.?", "you.", "threw..", "we?")
replace_incomplete(x)
replace_incomplete(x, '...')
Replaces Internet slang.
replace_internet_slang(
    x,
    slang = paste0("\\b", lexicon::hash_internet_slang[[1]], "\\b"),
    replacement = lexicon::hash_internet_slang[[2]],
    ignore.case = TRUE,
    ...
)
x |
The text variable. |
slang |
A vector of slang strings to replace. |
replacement |
A vector of string to replace slang with. |
ignore.case |
logical. If |
... |
Other arguments passed to |
Returns a vector with Internet slang replaced.
x <- c(
    "Marc the n00b needs to RTFM otherwise ymmv.",
    "TGIF and a big w00t! The weekend is GR8!",
    "Will will do it",
    'w8...this PITA needs me to say LMGTFY...lmao.',
    NA
)
replace_internet_slang(x)
replace_internet_slang(x, ignore.case = FALSE)
replace_internet_slang(x, replacement = '<<SLANG>>')
replace_internet_slang(
    x,
    replacement = paste0('{{ ', lexicon::hash_internet_slang[[2]], ' }}')
)
In typography kerning is the adjustment of spacing. Often, in informal writing, adding manual spaces (a form of kerning) coupled with all capital letters is used for emphasis. This tool looks for 3 or more consecutive capital letters with spaces in between and removes the spaces. Essentially, the capitalized, kerned version is replaced with the word equivalent.
replace_kern(x, ...)
x |
The text variable. |
... |
ignored. |
Returns a vector with kern spaces removed.
StackOverflow user @ctwheels
https://stackoverflow.com/a/47438305/1000343
x <- c( "Welcome to A I: the best W O R L D!", "Hi I R is the B O M B for sure: we A G R E E indeed.", "A sort C A T indeed!", NA ) replace_kern(x)
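The de-kerning step described above can be sketched in base R (an illustrative approximation, not textclean's actual implementation): locate runs of three or more single capital letters separated by spaces and drop the internal spaces.

```r
## Illustrative sketch only -- not textclean's implementation.
## Find runs of 3+ spaced single capital letters, remove internal spaces.
collapse_kern <- function(x) {
  pat <- "\\b[A-Z](?: [A-Z]){2,}\\b"
  m <- gregexpr(pat, x, perl = TRUE)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(runs) gsub(" ", "", runs, fixed = TRUE)
  )
  x
}

collapse_kern("Welcome to the best W O R L D!")
## [1] "Welcome to the best WORLD!"
```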
Replace misspelled words with their most likely replacement. This function uses hunspell in the backend. hunspell must be installed in order to use this feature.
replace_misspelling(x, ...)
x |
A character vector. |
... |
ignored. |
Returns a vector of strings with misspellings replaced.
The function splits the string apart into tokens for speed optimization. After the replacement occurs the strings are pasted back together. The strings are not guaranteed to retain exact spacing of the original.
Surin Space and Tyler Rinker <[email protected]>.
## Not run: bad_string <- c("I cant spelll rigtt noow.", '', NA, 'Thiss is aslo mispelled?', 'this is 6$ and 38 cents in back2back!') replace_misspelling(bad_string) ## End(Not run)
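The split-replace-paste strategy described in the note above can be sketched with a toy lookup table (the real function consults hunspell for its replacements; the lookup table here is purely illustrative):

```r
## Toy sketch of split-replace-paste -- the real replace_misspelling()
## gets replacements from hunspell; this lookup table is made up, and
## punctuation handling is ignored in this sketch.
fix_spelling <- function(x,
    lookup = c(spelll = "spell", rigtt = "right", noow = "now")) {
  tokens <- unlist(strsplit(x, " ", fixed = TRUE))  # split into tokens
  hits <- tokens %in% names(lookup)                 # flag misspellings
  tokens[hits] <- lookup[tokens[hits]]              # swap in replacements
  paste(tokens, collapse = " ")                     # paste back together
}

fix_spelling("I cant spelll rigtt noow")
## [1] "I cant spell right now"
```

As the note warns, the paste step normalizes spacing, so the original white space is not guaranteed to survive.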
Replaces money with word equivalents.
replace_money( x, pattern = "(-?)([$])([0-9,]+)(\\.\\d{2})?", replacement = NULL, ... )
x |
The text variable. |
pattern |
Character money regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with the pattern replaced.
x <- c( NA, '$3.16 into "three dollars, sixteen cents"', "-$20,333.18 too", 'fff' ) replace_money(x) replace_money(x, replacement = '<<MONEY>>')
Replaces first/last names.
replace_names( x, names = textclean::drop_element(gsub("(^.)(.*)", "\\U\\1\\L\\2", c(lexicon::freq_last_names[[1]], lexicon::common_names), perl = TRUE), "^([AIU]n|[TSD]o|H[ea]Pa|Oh)$"), replacement = "", ... )
x |
The text variable. |
names |
A vector of names to replace. This may be made more custom through a vector provided from a named entity extractor. |
replacement |
A string to replace names with. |
... |
Other arguments passed to
|
Returns a vector with names replaced.
x <- c( "Mary Smith is not here", "Karen is not a nice person", "Will will do it", NA ) replace_names(x) replace_names(x, replacement = '<<NAME>>')
replace_non_ascii
- Replaces common non-ASCII characters.
replace_non_ascii2
- Replaces all non-ASCII characters (defined as '[^ -~]+').
This provides a subset of functionality found in replace_non_ascii
that
is faster and likely less accurate.
replace_curly_quote
- Replaces curly single and double quotes. This
provides a subset of functionality found in replace_non_ascii
specific
to quotes.
replace_non_ascii(x, replacement = "", remove.nonconverted = TRUE, ...) replace_non_ascii2(x, replacement = "", ...) replace_curly_quote(x, ...)
x |
The text variable. |
replacement |
Character string equal in length to pattern or of length one which is used as the replacement for the matched pattern. |
remove.nonconverted |
logical. If |
... |
ignored. |
Returns a text variable (character string) with non-ASCII characters replaced.
x <- c( "Hello World", "6 Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher", 'This is a \xA9 but not a \xAE', '6 \xF7 2 = 3', 'fractions \xBC, \xBD, \xBE', 'cows go \xB5', '30\xA2' ) Encoding(x) <- "latin1" x replace_non_ascii(x) replace_non_ascii(x, remove.nonconverted = FALSE) z <- '\x95He said, \x93Gross, I am going to!\x94' Encoding(z) <- "latin1" z replace_curly_quote(z) replace_non_ascii(z)
replace_number
- Replaces numeric represented numbers with words
(e.g., 1001 becomes one thousand one).
as_ordinal
- A convenience wrapper for english::ordinal
that
takes integers and converts them to ordinal form.
replace_number(x, num.paste = FALSE, remove = FALSE, ...) as_ordinal(x, ...)
x |
The text variable. |
num.paste |
logical. If |
remove |
logical. If |
... |
Other arguments passed to |
Returns a vector with numbers replaced.
The user may want to use replace_ordinal
first to remove ordinal number notation. For example
replace_number
would turn "21st" into
"twenty onest", whereas replace_ordinal
would
generate "twenty first".
Fox, J. (2005). Programmer's niche: How do you spell that number? R News. Vol. 5(1), pp. 51-55.
x <- c( NA, 'then .456 good', 'none', "I like 346,457 ice cream cones.", "I like 123456789 cashes.", "They are 99 percent good and 45678.2345667" ) replace_number(x) replace_number(x, num.paste = TRUE) replace_number(x, remove=TRUE) ## Not run: library(textclean) hunthou <- replace_number(seq_len(1e5)) textclean::mgsub( "'twenty thousand three hundred five' into 20305", hunthou, seq_len(1e5) ) ## "'20305' into 20305" ## Larger example from: https://stackoverflow.com/q/18332463/1000343 ## A slower approach fivehunthou <- replace_number(seq_len(5e5)) testvect <- c("fifty seven", "four hundred fifty seven", "six thousand four hundred fifty seven", "forty six thousand four hundred fifty seven", "forty six thousand four hundred fifty seven", "three hundred forty six thousand four hundred fifty seven" ) textclean::mgsub(testvect, fivehunthou, seq_len(5e5)) as_ordinal(1:10) textclean::mgsub('I want to be 1 in line', 1:10, as_ordinal(1:10)) ## End(Not run)
Replaces mixed text/numeric represented ordinal numbers with words (e.g., "1st" becomes "first").
replace_ordinal(x, num.paste = FALSE, remove = FALSE, ...)
x |
The text variable. |
num.paste |
logical. If |
remove |
logical. If |
... |
ignored. |
Currently only implemented for ordinal values 1 through 100.
x <- c( "I like the 1st one not the 22nd one.", "For the 100th time stop!" ) replace_ordinal(x) replace_ordinal(x, TRUE) replace_ordinal(x, remove = TRUE) replace_number(replace_ordinal("I like the 1st 1 not the 22nd 1."))
Replaces ratings with word equivalents.
replace_rating(x, rating_dt = lexicon::key_rating, ...)
x |
The text variable. |
rating_dt |
A data.table of ratings and corresponding word meanings. |
... |
ignored. |
Returns a vector of strings with ratings replaced with word equivalents.
x <- c("This place receives 5 stars for their APPETIZERS!!!", "Four stars for the food & the guy in the blue shirt for his great vibe!", "10 out of 10 for both the movie and trilogy.", "* Both the Hot & Sour & the Egg Flower Soups were absolutely 5 Stars!", "For service, I give them no stars.", "This place deserves no stars.", "10 out of 10 stars.", "My rating: just 3 out of 10.", "If there were zero stars I would give it zero stars.", "Rating: 1 out of 10.", "I gave it 5 stars because of the sound quality.", "If it were possible to give them 0/10, they'd have it." ) replace_rating(x)
This function replaces symbols with word equivalents (e.g., @ becomes "at").
replace_symbol( x, dollar = TRUE, percent = TRUE, pound = TRUE, at = TRUE, and = TRUE, with = TRUE, ... )
x |
A character vector. |
dollar |
logical. If |
percent |
logical. If |
pound |
logical. If |
at |
logical. If |
and |
logical. If |
with |
logical. If |
... |
ignored. |
Returns a character vector with symbols replaced.
x <- c("I am @ Jon's & Jim's w/ Marry", "I owe $41 for food", "two is 10% of a #" ) replace_symbol(x)
Replaces Twitter style handle tags (e.g., '@trinker').
replace_tag(x, pattern = qdapRegex::grab("rm_tag"), replacement = "", ...)
x |
The text variable. |
pattern |
Character tag regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with tags replaced.
x <- c("@hadley I like #rstats for #ggplot2 work.", "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio", "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1" ) replace_tag(x) replace_tag(x, replacement = '<<TAG>>') replace_tag(x, replacement = '$3') ## Replacement with a function replace_tag(x, replacement = function(x){ gsub('@', ' <<TO>> ', x) } )
Replaces time stamps with word equivalents.
replace_time( x, pattern = "(2[0-3]|[01]?[0-9]):([0-5][0-9])[.:]?([0-5]?[0-9])?", replacement = NULL, ... )
x |
The text variable. |
pattern |
Character time regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with the pattern replaced.
x <- c( NA, '12:47 to "twelve forty-seven" and also 8:35:02', 'what about 14:24.5', 'And then 99:99:99?' ) ## Textual: Word version replace_time(x) ## Normalization: <<TIME>> replace_time(x, replacement = '<<TIME>>') ## Normalization: hh:mm:ss or hh:mm replace_time(x, replacement = function(y){ z <- unlist(strsplit(y, '[:.]')) z[1] <- 'hh' z[2] <- 'mm' if(!is.na(z[3])) z[3] <- 'ss' glue_collapse(z, ':') } ) ## Textual: Word version (forced seconds) replace_time(x, replacement = function(y){ z <- replace_number(unlist(strsplit(y, '[:.]'))) z[3] <- paste0('and ', ifelse(is.na(z[3]), '0', z[3]), ' seconds') paste(z, collapse = ' ') } ) ## Normalization: hh:mm:ss replace_time(x, replacement = function(y){ z <- unlist(strsplit(y, '[:.]')) z[1] <- 'hh' z[2] <- 'mm' z[3] <- 'ss' glue_collapse(z, ':') } )
replace_to
- Grab from beginning of string to a character(s).
replace_from
- Grab from character(s) to end of string.
replace_to(x, char = " ", n = 1, include = FALSE, ...) replace_from(x, char = " ", n = 1, include = FALSE, ...)
x |
A character string |
char |
The character from which to grab until/from. |
n |
Number of times the character appears before the grab. |
include |
logical. If |
... |
ignored. |
returns a vector of text with begin/end of string to/from character removed.
Josh O'Brien and Tyler Rinker <[email protected]>.
https://stackoverflow.com/q/15909626/1000343
## Not run: x <- c("a_b_c_d", "1_2_3_4", "<_?_._:") replace_to(x, "_") replace_to(x, "_", 2) replace_to(x, "_", 3) replace_to(x, "_", 4) replace_to(x, "_", 3, include=TRUE) replace_from(x, "_") replace_from(x, "_", 2) replace_from(x, "_", 3) replace_from(x, "_", 4) replace_from(x, "_", 3, include=TRUE) x2 <- gsub("_", " ", x) replace_from(x2, " ", 2) replace_to(x2, " ", 2) x3 <- gsub("_", "\\^", x) replace_from(x3, "^", 2) replace_to(x3, "^", 2) x4 <- c("_a_b", "a__b") replace_from(x4, "_", 1) replace_to(x4, "_", 1) ## End(Not run)
Replace tokens with a single substring. This is much faster than
mgsub
if one wants to replace fixed tokens
with a single value or remove them altogether. This can be useful
for quickly replacing tokens, like names in a string, with a single
value in order to reduce noise.
replace_tokens(x, tokens, replacement = NULL, ignore.case = FALSE, ...)
x |
A character vector. |
tokens |
A vector of tokens to be replaced. |
replacement |
A single character string to replace the tokens with.
The default, |
ignore.case |
logical. If |
... |
ignored. |
Returns a vector of strings with tokens replaced.
The function splits the string apart into tokens for speed optimization. After the replacement occurs the strings are pasted back together. The strings are not guaranteed to retain exact spacing of the original.
replace_tokens(DATA$state, c('No', 'what', "it's")) replace_tokens(DATA$state, c('No', 'what', "it's"), "<<TOKEN>>") replace_tokens( DATA$state, c('No', 'what', "it's"), "<<TOKEN>>", ignore.case = TRUE ) ## Not run: ## Now let's see the speed ## Set up data library(textshape) data(hamlet) set.seed(11) tokens <- sample(unique(unlist(split_token(hamlet$dialogue))), 2000) tic <- Sys.time() head(replace_tokens(hamlet$dialogue, tokens)) (toc <- Sys.time() - tic) tic <- Sys.time() head(mgsub(hamlet$dialogue, tokens, "")) (toc <- Sys.time() - tic) ## Amp it up 20x more data tic <- Sys.time() head(replace_tokens(rep(hamlet$dialogue, 20), tokens)) (toc <- Sys.time() - tic) ## Replace names example library(lexicon) library(textshape) nms <- gsub("(^.)(.*)", "\\U\\1\\L\\2", common_names, perl = TRUE) x <- split_portion( sample(c(sample(grady_augmented, 5000), sample(nms, 10000, TRUE))), n.words = 12 ) x$text.var <- paste0( x$text.var, sample(c('.', '!', '?'), length(x$text.var), TRUE) ) replace_tokens(x$text.var, nms, 'NAME') ## End(Not run)
Replaces URLs.
replace_url(x, pattern = qdapRegex::grab("rm_url"), replacement = "", ...)
x |
The text variable. |
pattern |
Character URL regex string to be matched in the given character vector. |
replacement |
A function to operate on the extracted matches or a character string which is a replacement for the matched pattern. |
... |
ignored. |
Returns a vector with URLs replaced.
x <- c("@hadley I like #rstats for #ggplot2 work. ftp://cran.r-project.org/incoming/", "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio", "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. https://ramnathv.github.io/user2014-rcharts/#1", NA ) replace_url(x) replace_url(x, replacement = '<<URL>>') ## Not run: ## Replacement with a function library(urltools) replace_url(x, replacement = function(x){ sprintf('{{%s}}', urltools::url_parse(x)$domain) } ) ## End(Not run)
Pre-process data to replace one or more white space characters with a single space (this includes new line characters).
replace_white(x, ...)
x |
The character vector. |
... |
ignored. |
Returns a vector of character strings with runs of white space characters (including new line and tab characters) replaced by a single space.
x <- "I go \r to the \tnext line" x replace_white(x)
In informal writing people may use a form of text embellishment called
elongation (a.k.a. "word lengthening") to emphasize or alter word meanings. For
example, "Whyyyyy" conveys frustration, "Heyyyy there" may be used
flirtatiously, and "This is so gooood" adds
emphasis. This function uses an augmented form
of Armstrong & Fogarty's (2007) algorithm. The algorithm first attempts to
replace the elongation with known semantic replacements (optional; default is
FALSE
). After this the algorithm locates all places where the same
letter (case insensitive) appears 3 times consecutively. These elements are
then further processed. The matches are replaced via fgsub
by first
taking the elongation to its canonical form (drop all > 1 consecutive
letters to a single letter) and then replacing with the most common word
used in 2008 in Google's ngram data set that takes the canonical form. If
the canonical form is not found in the Google data set then the canonical
form is used as the replacement.
replace_word_elongation( x, impart.meaning = FALSE, elongation.search.pattern = "(?i)(?:^|\\b)\\w+([a-z])(\\1{2,})\\w*(?:$|\\b)", conservative = FALSE, elongation.pattern = sprintf("([a-z])(\\1{%s,})", as.integer(conservative) + 1), ... )
x |
The text variable. |
impart.meaning |
logical. If |
elongation.search.pattern |
The elongation pattern to search for. The default
only considers a repeat of |
conservative |
By default the |
elongation.pattern |
The actual pattern used for replacement. We use a search pattern and then this pattern with the assumption that an elongated word must have 3 or more letters in a row but often these elongations can also contain 2 or more letters in a row as well. |
... |
ignored. |
Returns a vector with word elongations replaced.
Armstrong, D. B., Fogarty, G. J., & Dingsdag, D. (2007). Scales measuring
characteristics of small business information systems.

Brody, S., & Diakopoulos, N. (2011). Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!:
Using word lengthening to detect sentiment in microblogs. Proceedings of the
2011 Conference on Empirical Methods in Natural Language Processing (pp.
562-570). Edinburgh, Scotland. Retrieved from
http://www.aclweb.org/anthology/D11-1052
https://storage.googleapis.com/books/ngrams/books/datasetsv2.html
https://www.theatlantic.com/magazine/archive/2013/03/dragging-it-out/309220
https://english.stackexchange.com/questions/189517/is-there-a-name-term-for-multiplied-vowels
x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds', 'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!', "I'm liiiike whyyyyy me?", "WwwhhaTttt!", NA) replace_word_elongation(x) #Look at "WwwhhaTttt!" as "what!" replace_word_elongation(x, conservative = TRUE) #Look at "WwwhhaTttt!" as "whhat!" replace_word_elongation(x, impart.meaning = TRUE) replace_word_elongation(c('online mookkkkk!', "WwwhhaTttt!")) replace_word_elongation(c('online mookkkkk!', "WwwhhaTttt!"), conservative = TRUE)
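The canonical-form step described above (collapse every run of a repeated letter down to a single letter) reduces to a single back-referencing gsub. This sketch covers only that step, not the Google ngram lookup:

```r
## Sketch of the canonical-form step only (lowercase input assumed);
## not textclean's full algorithm. Runs of a repeated letter collapse
## to one letter; the full algorithm then maps the canonical form to
## the most common matching word in the ngram data.
canonical <- function(x) gsub("([a-z])\\1+", "\\1", x)

canonical(c("noooooo", "heyyy", "gooood"))
## [1] "no"  "hey" "god"
```

Note that "gooood" canonicalizes to "god" -- and so does "good", which is how the ngram lookup can recover "good" as the replacement.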
Strip text of unwanted characters.
strip( x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = FALSE, lower.case = TRUE ) ## S3 method for class 'character' strip( x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = FALSE, lower.case = TRUE ) ## S3 method for class 'factor' strip( x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, lower.case = TRUE ) ## Default S3 method: strip( x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, lower.case = TRUE ) ## S3 method for class 'list' strip( x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, lower.case = TRUE )
x |
The text variable. |
char.keep |
A character vector of symbols (i.e., punctuation) that
|
digit.remove |
logical. If |
apostrophe.remove |
logical. If |
lower.case |
logical. If |
Returns a vector of text that has been stripped of unwanted characters.
## Not run: DATA$state #no strip applied strip(DATA$state) strip(DATA$state, apostrophe.remove=TRUE) strip(DATA$state, char.keep = c("?", ".")) ## End(Not run)
This function holds the place for particular character values, allowing the user to manipulate the vector and then revert the place holders back to the original values.
sub_holder( x, pattern, alpha.type = TRUE, holder.prefix = "zzzplaceholder", holder.suffix = "zzz", ... )
x |
A character vector. |
pattern |
Character string to be matched in the given character vector. |
alpha.type |
logical. If |
holder.prefix |
The prefix to use before the alpha key in the place
holder when |
holder.suffix |
The suffix to use after the alpha key in the place
holder when |
... |
Additional arguments passed to |
Returns a list with the following:
output |
keyed place holder character vector |
unhold |
A function used to revert back to the original values |
The unhold
function for sub_holder
will only work on keys
that have not been disturbed by subsequent alterations. The key follows the
pattern of holder.prefix ('zzzplaceholder') followed by lower case letter
keys followed by holder.suffix ('zzz') when alpha.type = TRUE
,
otherwise the holder is numeric.
## `alpha.type` as TRUE library(lexicon); library(textshape) (fake_dat <- paste(hash_emoticons[1:11, 1, with=FALSE][[1]], DATA$state)) (m <- sub_holder(fake_dat, hash_emoticons[[1]])) m$unhold(strip(m$output)) ## `alpha.type` as FALSE (numeric keys) vowels <- LETTERS[c(1, 5, 9, 15, 21)] (m2 <- sub_holder(toupper(DATA$state), vowels, alpha.type = FALSE)) m2$unhold(gsub("[^0-9]", "", m2$output)) mtabulate(strsplit(m2$unhold(gsub("[^0-9]", "", m2$output)), ""))
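The place-holding idea can be sketched in a few lines of base R (an illustrative sketch, not the actual implementation): each pattern is swapped for a unique key, and the returned function swaps the keys back:

```r
## Illustrative sketch of the hold/unhold round trip -- not the actual
## sub_holder() implementation. Keys follow the prefix + letter + suffix
## pattern described in the note above.
hold <- function(x, pattern, prefix = "zzzplaceholder", suffix = "zzz") {
  keys <- paste0(prefix, letters[seq_along(pattern)], suffix)
  for (i in seq_along(pattern)) x <- gsub(pattern[i], keys[i], x, fixed = TRUE)
  list(
    output = x,
    unhold = function(y) {
      for (i in seq_along(pattern)) y <- gsub(keys[i], pattern[i], y, fixed = TRUE)
      y
    }
  )
}

m <- hold("a :) b :(", c(":)", ":("))
m$output                           # keys stand in for the emoticons
m$unhold(gsub(" ", "", m$output))  # manipulate, then revert the keys
```

As the note above warns, any manipulation that disturbs the keys themselves (e.g., upper-casing) will prevent the revert from matching.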
Swap pattern x for pattern y and pattern y for pattern x in one fell swoop.
swap(x, pattern1, pattern2, ...)
x |
A text variable. |
pattern1 |
Character string to be matched in the given character vector.
This will be replaced by |
pattern2 |
Character string to be matched in the given character vector.
This will be replaced by |
... |
ignored. |
Returns a vector with patterns 1 & 2 swapped.
x <- c("hash_abbreviation", "hash_contractions", "hash_grade", "key_emoticons", "key_power", "key_sentiment", "key_sentiment_nrc", "key_strength", "key_syllable", "key_valence_shifters") x swap(x, 'hash_', 'key_')
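One straightforward way to implement such a simultaneous swap (an illustrative sketch; not necessarily how swap is written internally) is to route the first pattern through a temporary placeholder so the two substitutions do not clobber each other:

```r
## Sketch: swap via a temporary placeholder assumed absent from the text.
swap_sketch <- function(x, pattern1, pattern2, placeholder = "\001") {
  x <- gsub(pattern1, placeholder, x, fixed = TRUE)  # pattern1 -> placeholder
  x <- gsub(pattern2, pattern1, x, fixed = TRUE)     # pattern2 -> pattern1
  gsub(placeholder, pattern2, x, fixed = TRUE)       # placeholder -> pattern2
}

swap_sketch(c("hash_grade", "key_power"), "hash_", "key_")
## [1] "key_grade"  "hash_power"
```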
Detect/locate potential issues with text data. This family of functions generates a list of detection/location functions that can be accessed via the dollar sign or square bracket operators. Accessible functions include:
which_are() is_it()
Contains contractions
Contains dates
Contains digits
Contains email addresses
Contains emoticons
Contains just white space
Contains escaped backslash character
Contains Twitter style hash tags
Contains html mark-up
Contains incomplete sentences (e.g., ends with ...)
Contains kerning (e.g. "The B O M B!")
Is a list of atomic vectors (not provided by which_are
)
Contains potentially misspelled words
Contains a sentence with no ending punctuation
Contains commas with no space after them
Contains non-ASCII characters
Is a non-character vector (not provided by which_are
)
Contains non split sentences
Contains a Twitter style handle used to tag others (use of the at symbol)
Contains a time stamp
Contains a URL
The functions above that have a description starting with 'is' rather than 'contains'
are meta functions that describe the attribute of the column/vector being passed
rather than attributes about the individual elements of the column/vector. The
meta functions will return a logical of length one and are not available under
which_are
.
which_are
returns an environment of functions that can be used to
locate and return the integer locations of the particular non-normalized text
named by the function.
is_it
returns an environment of functions that can be used to
detect and return a logical atomic vector of equal length to the input vector
(except for meta functions) of the particular non-normalized text
named by the function.
wa <- which_are() it <- is_it() wa$digit(c('The dog', "I like 2", NA)) it$digit(c('The dog', "I like 2", NA)) is_it()$list_column(c('the dog', 'ate the chicken'))