Title: | Regular Expression Removal, Extraction, and Replacement Tools |
---|---|
Description: | A collection of regular expression tools associated with the 'qdap' package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes. |
Authors: | Jason Gray [ctb], Tyler Rinker [aut, cre] |
Maintainer: | Tyler Rinker <[email protected]> |
License: | GPL-2 |
Version: | 0.7.8 |
Built: | 2025-01-10 02:44:52 UTC |
Source: | https://github.com/trinker/qdapregex |
This convenience function wraps left and right boundaries around each element
of a character vector. The default is to use "\b" for both the left and right
boundaries.
bind( ..., left = "\\b", right = left, dictionary = getOption("regex.library") )
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
dictionary |
A dictionary of canned regular expressions to search within. |
... |
Regular expressions to wrap with boundaries, or a named expression from the
default regular expression dictionary prefixed with a single at sign (@). |
Returns a character vector.
bind(LETTERS, "[", "]")

## More useful default parameters/usage
x <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
    "What should we do?", "You liar, it stinks!", "I am telling the truth!",
    "How can we be certain?", "There is no way.", "I distrust you.",
    "What are you talking about?", "Shall we move on? Good then.",
    "I'm hungry. Let's eat. You already?")

Fry25 <- c("the", "of", "and", "a", "to", "in", "is", "you", "that", "it",
    "he", "was", "for", "on", "are", "as", "with", "his", "they", "I", "at",
    "be", "this", "have", "from")

gsub(pastex(list(bind(Fry25))), "[[ELIM]]", x)
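The effect of bounding can be approximated in base R without the package. The sketch below is an illustration only: `bound_sketch` is a hypothetical helper, not the qdapRegex implementation, and it assumes that bounding amounts to pasting the left and right strings around each element.

```r
# Hypothetical stand-in for the bounding idea; not the qdapRegex source.
bound_sketch <- function(x, left = "\\b", right = left) {
  paste0(left, x, right)
}

words <- c("the", "of", "and")

# Join the words with a regex 'or', with and without boundaries.
bounded   <- paste(bound_sketch(words), collapse = "|")
unbounded <- paste(words, collapse = "|")

txt <- "The cat and the other hat of doom"

# With word boundaries only whole words are replaced.
with_bounds <- gsub(bounded, "[[ELIM]]", txt, perl = TRUE)

# Without boundaries the "the" inside "other" is replaced as well.
without_bounds <- gsub(unbounded, "[[ELIM]]", txt, perl = TRUE)

print(with_bounds)
print(without_bounds)
```

The comparison shows why the default "\b" boundary is useful when removing short function words from running text.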
A wrapper for bind and pastex that wraps each sub-expression element with
left/right boundaries (\b by default) and then concatenates/joins the bound
strings with a regex 'or' ("|"). Equivalent to pastex(bind(...), sep = "|").
bind_or(..., group.all = TRUE, left = "\\b", right = left)
group.all |
logical. If |
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with a single at
sign (@). |
bind_or(LETTERS)
bind_or("them", "those", "that", "these")
bind_or("them", "those", "that", "these", group.all = FALSE)
Combines an extracted object.
## S3 method for class 'extracted'
c(x, ...)
x |
The extracted object |
... |
ignored |
Print a cheat sheet of common regex task chunks. cheat prints a left
justified version of regex_cheat.
cheat(dictionary = qdapRegex::regex_cheat, print = TRUE)
dictionary |
A dictionary of cheat terms. Default is regex_cheat. |
print |
logical. If |
Prints a cheat sheet of common regex tasks such as lookaheads.
Invisibly returns regex_cheat.
cheat()
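Left-justified printing of a data frame is available in base R through the `right` argument of `print`. The sketch below uses a small made-up data frame (`cheat_sketch`, with invented rows and a shortened third column name), not the contents of regex_cheat, and only illustrates the kind of output cheat is described as producing.

```r
# Made-up rows for illustration; not the contents of regex_cheat.
cheat_sketch <- data.frame(
  Name  = c("Lookahead", "Lookbehind"),
  Regex = c("(?=...)", "(?<=...)"),
  What  = c("Asserts what follows", "Asserts what precedes"),
  stringsAsFactors = FALSE
)

# right = FALSE left-justifies the character columns.
printed <- capture.output(
  print(cheat_sketch, right = FALSE, row.names = FALSE)
)
cat(printed, sep = "\n")
```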
Escape literal strings that begin with an at sign (@) so that qdapRegex does not parse them.
escape(pattern)
pattern |
A character string that should not be parsed. |
Many qdapRegex functions parse pattern strings that begin with an at
character (@) and compare them against the default and supplemental
(regex_supplement) dictionaries. This means that a string such as "@before_"
will be returned as "\\w+?(?= ((%s|%s)\\b))". If the user wants a regular
expression that is literally "@before_", the escape function classes the
character string and tells the qdapRegex functions not to parse it (i.e., to
keep it as a literal string).
Returns a character vector of the class "escape" and "character".
escape("@rm_caps")

x <- "...character vector. Default, \\code{@rm_caps} uses..."
rm_default(x, pattern = "@rm_caps")
rm_default(x, pattern = escape("@rm_caps"))
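The class-tagging idea described above can be sketched in base R. Everything here is hypothetical: `toy_dictionary`, `escape_sketch`, and `resolve_sketch` are illustrative names, and the lookup logic is an assumption about how a parser might honor the "escape" class, not the package's actual code.

```r
# Toy dictionary with a made-up entry; not a qdapRegex dictionary.
toy_dictionary <- list(digits = "\\d+")

# Tag a string so the resolver below leaves it alone (hypothetical helper).
escape_sketch <- function(pattern) {
  class(pattern) <- c("escape", "character")
  pattern
}

# Look up "@name" patterns unless they carry the "escape" class
# (hypothetical helper).
resolve_sketch <- function(pattern, dictionary) {
  if (inherits(pattern, "escape")) {
    return(as.character(unclass(pattern)))
  }
  if (grepl("^@", pattern)) {
    key <- sub("^@", "", pattern)
    if (key %in% names(dictionary)) return(dictionary[[key]])
  }
  pattern
}

resolved <- resolve_sketch("@digits", toy_dictionary)
literal  <- resolve_sketch(escape_sketch("@digits"), toy_dictionary)

print(resolved)  # the regex stored in the toy dictionary
print(literal)   # the untouched literal string
```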
Visualize regular expressions using https://regexper.com/
explain( pattern, open = FALSE, print = TRUE, dictionary = getOption("regex.library") )
pattern |
A character string containing a regular expression or a
character string starting with |
open |
logical. If |
print |
logical. Should |
dictionary |
A dictionary of canned regular expressions to search within. |
Note that https://regexper.com/ is a JavaScript based regular expression viewer. Positive and negative lookbehinds are not respected.
Prints the https://regexper.com/ URL to the console, attempts to open the URL of the visual representation provided by https://regexper.com/, and invisibly returns a list with the URLs.
Ananda Mahto, Matthew Flickinger, and Tyler Rinker <[email protected]>.
https://stackoverflow.com/a/27489977/1000343
https://regexper.com/
https://stackoverflow.com/a/27574103/1000343
explain("\\s*foo[A-Z]\\d{2,3}")
explain("@rm_time")

## Not run:
explain("\\s*foo[A-Z]\\d{2,3}", open = TRUE)
explain("@rm_time", open = TRUE)
## End(Not run)
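A link of this kind can be assembled by hand with base R. The sketch below rests on an assumption: that the viewer accepts the percent-encoded regex after a "#" in the URL. `regexper_url_sketch` is a hypothetical helper and is not taken from the package.

```r
# Assumption: the viewer reads the URL-encoded regex after "#".
regexper_url_sketch <- function(pattern) {
  paste0("https://regexper.com/#",
         utils::URLencode(pattern, reserved = TRUE))
}

url <- regexper_url_sketch("\\s*foo[A-Z]\\d{2,3}")
print(url)

# Opening the link needs a browser, so it is left commented out.
# utils::browseURL(url)
```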
A convenience function to retrieve a regular expression from one of the qdapRegex dictionaries.
grab(pattern, dictionary = getOption("regex.library"))
pattern |
A character string starting with |
dictionary |
A dictionary of canned regular expressions to search within. |
Many R regular expressions contain doubled backslashes that are not
used in other regex interpreters. Use cat to remove the backslash escapes
(see Examples), or URLencode if the regex is used in a URL.
Returns a single string regular expression from one of the qdapRegex dictionaries.
grab("@rm_white")

## Not run:
## Throws an error
grab("@foo")
## End(Not run)

cat(grab("@pages2"))

## Not run:
cat(grab("@pages2"), file="clipboard")
## End(Not run)
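The difference between the escaped and the raw form of a regex can be seen with any string in base R. The sketch below uses a made-up pattern rather than one retrieved by grab, so it runs without the package.

```r
# A regex as it is typed in R source: backslashes are doubled.
regex <- "\\d{2}\\s+"

print(regex)      # shows the escaped form, with doubled backslashes
cat(regex, "\n")  # shows the raw regex, with single backslashes

# The string itself holds single backslashes, so it has 8 characters.
n_chars <- nchar(regex)

# Percent-encode the regex before placing it in a URL, then decode it back.
encoded <- utils::URLencode(regex, reserved = TRUE)
decoded <- utils::URLdecode(encoded)
print(encoded)
```

The raw form printed by cat is the one to paste into other regex tools.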
group - A wrapper for paste(collapse="|") that also searches the default and
supplemental (regex_supplement) dictionaries for regular expressions before
pasting them together with a pipe (|) separator.
group(..., left = "(", right = ")", dictionary = getOption("regex.library"))
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
dictionary |
A dictionary of canned regular expressions to search within. |
... |
Regular expressions to add grouping parentheses to, or a named expression
from the default regular expression dictionary prefixed with a single at
sign (@). |
Returns a single string of regular expressions with grouping parenthesis added.
group(LETTERS)
group(1)
(grouped <- group("(the|them)\\b", "@rm_zip"))
pastex(grouped)
A wrapper for group and pastex that wraps each sub-expression element with
grouping parentheses and then concatenates/joins the grouped strings with a
regex 'or' ("|"). Equivalent to pastex(group(...), sep = "|").
group_or(..., group.all = TRUE)
group.all |
logical. If |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with a single at
sign (@). |
group_or("@rm_hash", "@rm_tag")
group_or("them", "those", "that", "these")
group_or("them", "those", "that", "these", group.all = FALSE)
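The stated equivalence (group each element, then join with "|") can be mimicked in base R. `group_sketch` is a hypothetical helper; the sketch covers only the per-element grouping and the join, and makes no claim about what group.all does.

```r
# Hypothetical helper mirroring the stated equivalence; not the package source.
group_sketch <- function(x, left = "(", right = ")") {
  paste0(left, x, right)
}

# Group each word, then join the groups with a regex 'or'.
grouped_or <- paste(
  group_sketch(c("them", "those", "that", "these")),
  collapse = "|"
)
print(grouped_or)

# The joined pattern matches text containing any one of the words.
hits <- grepl(grouped_or, c("I like those", "nothing here"), perl = TRUE)
print(hits)
```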
Acts as a logical test of a regular expression's validity. is.regex uses gsub
and tests for errors to determine a regular expression's validity. The regular
expression must conform to R's regular expression rules (see ?regex for
details about how R handles regular expressions).
is.regex(pattern)
pattern |
A regular expression to be tested. |
Returns a logical (TRUE if pattern is a valid regular expression).
is.regex("I|***")
is.regex("I|i")
sapply(regex_usa, is.regex)
sapply(regex_supplement, is.regex) ## `version` is not a valid regex
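The approach described above (run gsub and treat an error as an invalid pattern) can be sketched in base R. `is_regex_sketch` is a hypothetical re-implementation of the idea, and the choice of perl = TRUE is an assumption rather than something taken from the package.

```r
# Hypothetical re-implementation of the idea; not the package source.
is_regex_sketch <- function(pattern) {
  tryCatch({
    # gsub raises an error when the pattern cannot be compiled.
    suppressWarnings(gsub(pattern, "", "hello", perl = TRUE))
    TRUE
  }, error = function(e) FALSE)
}

valid   <- is_regex_sketch("I|i")        # a well-formed alternation
invalid <- is_regex_sketch("(unclosed")  # an unmatched parenthesis

print(c(valid = valid, invalid = invalid))
```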
pastex - A wrapper for paste(collapse="|") that also searches the default and
supplemental (regex_supplement) dictionaries for regular expressions before
pasting them together with a pipe (|) separator.
%|% - A binary operator version of pastex that joins two character strings
with a regex or ("|"). Equivalent to pastex(x, y, sep="|").
%+% - A binary operator version of pastex that joins two character strings
with no space. Equivalent to pastex(x, y, sep="").
pastex(..., sep = "|", dictionary = getOption("regex.library"))
x %|% y
x %+% y
sep |
The separator to use between the expressions when they are collapsed. |
dictionary |
A dictionary of canned regular expressions to search within. |
x, y |
Two regular expressions to paste together. |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with a single at
sign (@). |
Returns a single string of regular expressions pasted together with
pipe(s) (|).
Note that while pastex is designed for pasting purposes it can also be used
to call a single regex from the default regional dictionary or the
supplemental dictionary (regex_supplement) (see Examples).
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
    "14% is $26 or $25.99", "It's 12:30 pm to 4:00 am")

pastex("@rm_percent", "@rm_dollar")
pastex("@rm_percent", "@time_12_hours")

rm_dollar(x, extract=TRUE, pattern=pastex("@rm_percent", "@rm_dollar"))
rm_dollar(x, extract=TRUE, pattern=pastex("@rm_dollar", "@rm_percent", "@time_12_hours"))

## retrieve regexes from dictionary
pastex("@rm_email")
pastex("@rm_url3")
pastex("@version")

## pipe operator (%|%)
"x" %|% "y"
"@rm_url" %|% "@rm_twitter_url"

## paste operator (%+%)
"x" %+% "y"
"@rm_time" %+% "\\s[AP]M"

## Remove Twitter Short URL
x <- c("download file from http://example.com",
    "this is the link to my website http://example.com",
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net",
    "twitter type: t.co/N1kq0F26tG",
    "still another one https://t.co/N1kq0F26tG :-)")

rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)

## Combine removing Twitter URLs and standard URLs
rm_twitter_n_url <- rm_(pattern="@rm_twitter_url" %|% "@rm_url")
rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
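The two binary operators can be imitated in base R to show how small regex pieces combine. The operators `%or%` and `%cat%` and the number, percent, and dollar patterns below are all hypothetical stand-ins written for this sketch; they are not the package's operators or its dictionary regexes, and they perform no dictionary lookup.

```r
# Hypothetical operators for this sketch; they do no dictionary lookup.
`%or%`  <- function(x, y) paste(x, y, sep = "|")
`%cat%` <- function(x, y) paste0(x, y)

# Made-up building blocks, not the dictionary regexes.
number  <- "\\d+(\\.\\d+)?"
percent <- number %cat% "%"
dollar  <- "\\$" %cat% number

# Match either a percentage or a dollar amount.
either <- percent %or% dollar

x <- "14% is $26 or $25.99"
found <- regmatches(x, gregexpr(either, x, perl = TRUE))[[1]]
print(found)
```

Building the pattern from named pieces keeps each part readable and reusable, which is the same motivation behind joining canned expressions.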
Prints an explain object.
## S3 method for class 'explain'
print(x, ...)
x |
The explain object |
... |
ignored |
Prints an extracted object.
## S3 method for class 'extracted'
print(x, ...)
x |
The extracted object. |
... |
Ignored. |
Prints a regexr object.
## S3 method for class 'regexr'
print(x, ...)
x |
The regexr object. |
... |
Ignored. |
qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes.
The qdapRegex package does not aim to compete with string manipulation
packages such as stringr or stringi but is meant to provide access to canned,
common regular expression patterns that can be used within qdapRegex, with R's
own regular expression functions, or with add-on string manipulation packages
such as stringr and stringi.
A dataset containing the regex chunk name, the regex string, and a description of what the chunk does.
data(regex_cheat)
A data frame with 6 rows and 3 variables
Name. The name of the regex chunk.
Regex. The regex chunk.
What it Does. Description of what the regex chunk does.
A dataset containing a list of supplemental, canned regular expressions. The
regular expressions in this data set are considered useful but have not been
included in a formal function (of the type rm_XXX). Users can utilize the rm_
function to generate functions that can sub/replace/extract as desired.
data(regex_supplement)
A list with 24 elements
The following canned regular expressions are included:
single word after the word "a"
single word after the word "the"
find single word after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)
find n words (not including punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)
find n words (plus punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
find single word before ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
find all occurrences of a substring except the first; regex pattern retrieved from StackOverflow's akrun: https://stackoverflow.com/a/31458261/1000343
substring beginning with hash (#) followed by either 3 or 6 select characters (a-f, A-F, and 0-9)
substring of four chunks of 1-3 consecutive digits separated with dots (.)
last occurrence of a delimiter; note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the delimiter)
substring with "pp." or "p.", optionally followed by a space, followed by 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for extraction/removal purposes
substring 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for validation purposes
punctuation characters ([:punct:]) with the ability to negate; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
a regex that is useful for splitting strings in the characters runs (e.g., "wwxyyyzz" becomes "ww", "x", "yyy", "zz"); regex pattern retrieved from Robert Redd: https://stackoverflow.com/a/29383435/1000343
regex string that splits on a delimiter and retains the delimiter
chunks digits > 4 into groups of 3 from right to left allowing for easy insertion of thousands separator; regex pattern retrieved from StackOverflow's stema: https://stackoverflow.com/a/10612685/1000343
substring of valid hours (1-12) followed by a colon (:) followed by valid minutes (0-60), followed by an optional space and the character chunk am or pm
substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS NOT contained in the substring
substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS contained in the substring
substring of white space after a comma
A true word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
A true left word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries
A true right word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries
substring of the video id from a YouTube video; taken from Jacob Overgaard's submission found https://regex101.com/r/kU7bP8/1
Regexes from this data set can be added to the pattern argument of any rm_XXX
function via an at sign (@) followed by a regex name from this data set (e.g.,
pattern = "@after_the"), provided the regular expression does not contain
non-regex content such as the sprintf character string %s.
Use qdapRegex:::examine_regex(regex_supplement) to interactively explore the
regular expressions in regex_supplement. This will provide a browser + console
based breakdown of each regex in the dictionary.
Note that regexes containing %s are templates whose %s is replaced by sprintf;
they are not valid regexes on their own. The S function is useful for
supplying these missing %s parameters.
time <- rm_(pattern="@time_12_hours")
time("I will go at 12:35 pm")

x <- "v6.0.156 for Windows 2000/2003/XP/Vista Server version 1.1.20 Client Manager version 1.1.24"
rm_default(x, pattern = "@version", extract=TRUE)
rm_default(x, pattern = "@version2", extract=TRUE)

x <- "this is 1000000 big 4356. And little 123 number."
rm_default(x, pattern="@thousands_separator", replacement="\\1,")
rm_default(x, pattern="@thousands_separator", replacement="\\1.")

rm_default("I was,but it costs 10,000.", pattern="@white_after_comma", replacement=", ")

x <- "I like; the donuts; a lot"
strsplit(x, ";")
strsplit(x, S(grab("split_keep_delim"), ";"), perl=TRUE)
stringi::stri_split_regex(x, S(grab("split_keep_delim"), ";"))
stringi::stri_split_regex("I like; the donuts; a lot:cool", S(grab("split_keep_delim"), ";|:"))

## Grab words around a point
x <- c(
    "the magic word is e",
    "the dog is red and they are blue",
    "I am new but she is not new",
    "hello world",
    "why is it so cold? Perhaps it is Winter.",
    "It is not true the 7 is 8.",
    "Is that my drink?"
)

rm_default(x, pattern = S("@around_", 1, "is", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 2, "is", 2), extract=TRUE)
rm_default(x, pattern = S("@around_", 1, "is|are|am", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 1, "is not|is|are|am", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 1, "is not|[Ii]s|[Aa]re|[Aa]m", 1), extract=TRUE)

x <- c(
    "hello world",
    "45",
    "45 & 5 makes 50",
    "x and y",
    "abc and def",
    "her him foo & bar for Jack and Jill then"
)

around_and <- rm_(pattern = S("@around_", 1, "and|\\&", 1), extract=TRUE)
around_and(x)

## Split runs into chunks
x <- "1111100000222000333300011110000111000"
strsplit(x, grab("@run_split"), perl = TRUE)

## Not run:
library(qdap);library(ggplot2);library(reshape2)

out <- setNames(lapply(c("@after_a", "@after_the"), function(x) {
    o <- rm_default(stringi::stri_trans_tolower(pres_debates2012$dialogue),
        pattern = x, extract=TRUE)
    m <- qdapTools::matrix2df(data.frame(freq=sort(table(unlist(o)), TRUE)), "word")
    m[m$freq> 7, ]
}), c("a", "the"))

dat <- setNames(Reduce(function(x, y) {
    merge(x, y, by = "word", all = TRUE)}, out), c("Word", "A", "THE"))

dat <- reshape2::melt(dat, id="Word", variable.name="Article", value.name="freq")

dat <- dat[order(dat$freq, dat$Word), ]

ord <- aggregate(freq ~ Word, dat, sum)

## Order the Word factor by total frequency so the plot is sorted
dat$Word <- factor(dat$Word, levels=ord[order(ord[[2]]), 1])

ggplot(dat, aes(x=freq, y=Word)) +
    geom_point()+
    facet_grid(~Article)
## End(Not run)

## remove/extract pages numbers
x <- c("I read p. 36 and then pp. 45-49", "it's on pp. 23-24;28")

rm_pages <- rm_(pattern="@pages", extract=TRUE)
rm_pages(x)

rm_default(x, pattern = "@pages")
rm_default(x, pattern = "@pages", extract=TRUE)
rm_default(x, pattern = "@pages2", extract=TRUE)

## Validate pages
page_val <- validate("@pages2", FALSE)
page_val(c(66, "78-82", "hello world", TRUE, "44-45; 56"))

## Split on last occurrence
x <- c(
    "test@[email protected]",
    "[email protected]",
    "test@xyz@[email protected]",
    "[email protected]@zz.vv.net"
)
strsplit(x, S("@last_occurrence", "\\."), perl=TRUE)
strsplit(x, S("@last_occurrence", "@"), perl=TRUE)

## True Word Boundaries
x <- "this is _not a word666 and this is not a word too."
## Standard regex word boundary
rm_default(x, pattern=bind("not a word"))
## Alphabetic only word boundaries
rm_default(x, pattern=S("@word_boundary", "not a word"))

## Remove all but first occurrence of something
x <- c(
    "12-3=4-5=678-9",
    "ABC-D=EF2-GHI-JK3=L-MN=",
    "9-87=65",
    "a - de=4fgh --= i5jkl",
    NA
)

rm_default(x, pattern = S("@except_first", "-"))
rm_default(x, pattern = S("@except_first", "="))
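How a "%s" template becomes a usable regex can be shown with plain sprintf in base R. The template below is made up for this sketch (a word that follows a user-defined word); it is not one of the regex_supplement entries, and the package's S function is not used.

```r
# Made-up template; "%s" marks the user-defined word.
after_template <- "(?<=%s )\\w+"

# Fill the placeholder to obtain a complete regex.
after_the <- sprintf(after_template, "the")
print(after_the)

x <- "the dog saw the cat"

# Extract every word that directly follows "the ".
words_after <- regmatches(x, gregexpr(after_the, x, perl = TRUE))[[1]]
print(words_after)
```

Until the placeholder is filled, the template only matches text that literally contains "%s", which is why such entries need their parameters supplied first.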
A dataset containing a list of U.S.-specific, canned regular expressions for use in various functions within the qdapRegex package.
data(regex_usa)
A list with 54 elements
The following canned regular expressions are included:
abbreviations containing single lower case or capital letter followed by a period and then an optional space (this must be repeated 2 or more times)
Remove characters between a left and right boundary including the boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
Remove characters between a left and right boundary NOT including the boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
words containing 2 or more consecutive upper case letters and no lower case
phrases of 1 word or more containing 1 or more consecutive upper case letters and no lower case; if phrase is one word long then phrase must be 2 or more consecutive capital letters
substring that looks for in-text and parenthetical APA6 style citations (attempts to exclude references)
substring that looks for in-text APA6 style citations (attempts to exclude references)
substring that looks for parenthetical APA6 style citations (attempts to exclude references)
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters)
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) & zip code (exactly 5 or 5+4 consecutive digits)
dates in the form of 2 digit month, 2 digit day, and 2 or 4 digit year. Separator between month, day, and year may be dot (.), slash (/), or dash (-)
dates in the form of 3-9 letters followed by one or more spaces, 2 digits, a comma(,), one or more spaces, and 4 digits
dates in the form of XXXX-XX-XX; hyphen separated string of 4 digit year, 2 digit month, and 2 digit day
dates in the form of any of rm_date, rm_date2, and rm_date3
substring with dollar sign ($) followed by (1) just dollars (no decimal), (2) dollars and cents (whole number and decimal), or (3) just cents (decimal value); dollars may contain commas
substring with (1) alphanumeric characters or dash (-), plus (+), or underscore (_) (This may be repeated) (2) followed by at (@), followed by the same regex sequence as before the at (@), and ending with dot (.) and 2-14 digits
common emoticons (logic is complicated to explain in words) using ">?[:;=8XB]{1}[-~+o^]?[|\")(>DO>{pP3/]+|</?3|XD+|D:<|x[-~+o^]?[|\")(>DO>{pP3/]+" regex pattern; general pattern is optional hat character, followed by eyes character, followed by optional nose character, and ending with a mouth character
substring of the last endmark group in a string; endmarks include (! ? . * OR |)
substring of the last endmark group in a string; endmarks include (! ? OR .)
substring of the last endmark group in a string; endmarks include (! ? . * | ; OR :)
substring that begins with a hash (#) followed by a word
substring of letters (that may contain apostrophes) n letters long (apostrophe not counted in length); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
substring of letters (that may contain apostrophes) n letters long (apostrophe counted in length); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
substring of 2 digits or letters a-f inside of a left and right angle brace in the form of "<a4>"
substring of any character that isn't a letter, apostrophe, or single space
substring that may begin with dash (-) for negatives, and is (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value; regex pattern provided by Jason Gray
substring beginning with (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value and followed by a percent sign (%)
phone numbers in the form of optional country code, valid 3 digit prefix, and 7 digits (may contain hyphens and parenthesis); logic is complex to explain (see https://stackoverflow.com/a/21008254/1000343 for more)
U.S. state abbreviations (and District of Columbia) that is constrained to just possible U.S. state names, not just two consecutive capital letters; taken from Mike Hamilton's submission found https://regexlib.com/REDetails.aspx?regexp_id=2177
substring with a repetition of repeated characters within a word; regex pattern retrieved from StackOverflow's, vks: https://stackoverflow.com/a/29438461/1000343
substring with a phrase (a sequence of 1 or more words) that is repeated 2 or more times (case is ignored; separating periods and commas are ignored); regex pattern retrieved from StackOverflow's, BrodieG: https://stackoverflow.com/a/28786617/1000343
substring with a word (marked with a boundary) that is repeated 2 or more times (case is ignored)
substring that begins with an at (@) followed by a word
Twitter substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters
substring beginning with title (Mrs., Mr., Ms., Dr.) that is case independent or full title (Miss, Mizz, mizz) followed by a single lower case word or multiple capitalized words
substring that (1) must begin with 0-2 digits, (2) must be followed by a single colon (:), (3) optionally may be followed by either a colon (:) or a dot (.), (4) optionally may be followed by 1 or more digits (if the previous condition is true)
substring that is identical to rm_time with the additional search for Ante Meridiem/Post Meridiem abbreviations (e.g., AM, p.m., etc.)
substring that is specific to transcription time stamps in the form of HH:MM:SS.OS where OS is milliseconds. HH: and .OS are optional. The SS.OS period divide may also be a comma or additional colon. The HH:SS divide may also be a period. String may be affixed with a pound sign (#).
Twitter short link/url; substring optionally beginning with http, followed by t.co ending on a space or end of string (whichever comes first)
substring beginning with http, www., or ftp and ending on a space or end of string (whichever comes first); note that this regex is simple and may not cover all valid URLs or may include invalid URLs
substring beginning with http, www., or ftp and more constrained than rm_url; based on @imme_emosol's response from https://mathiasbynens.be/demo/url-regex
substring beginning with http or ftp and more constrained than rm_url & rm_url2 though light-weight, making it ideal for validation purposes; taken from @imme_emosol's response found at https://mathiasbynens.be/demo/url-regex
substring of white space(s); this regular expression combines rm_white_bracket, rm_white_colon, rm_white_comma, rm_white_endmark, rm_white_lead, rm_white_trail, and rm_white_multiple
substring of white space(s) following left brackets ("{", "(", "[") or preceding right brackets ("}", ")", "]")
substring of white space(s) preceding colon(s)/semicolon(s)
substring of white space(s) preceding a comma
substring of white space(s) preceding a single occurrence/combination of period(s), question mark(s), and exclamation point(s)
substring of leading white space(s)
substring of leading/trailing white space(s)
substring of multiple, consecutive white spaces
substring of white space(s) preceding a comma or a single occurrence/combination of colon(s), semicolon(s), period(s), question mark(s), and exclamation point(s)
substring of trailing white space(s)
substring of 5 digits optionally followed by a dash and 4 more digits
Use qdapRegex:::examine_regex() to interactively explore the regular expressions in regex_usa. This will provide a browser + console based breakdown of each regex in the dictionary.
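The descriptions above can be tried out in plain base R. The sketch below uses simplified stand-in patterns for three of the entries (zip code, percent, and Twitter-style tag). The pattern strings and the helper name extract_all are assumptions made for illustration; they are not the expressions stored in regex_usa, which may be stricter.

```r
# Simplified stand-ins for three dictionary entries (assumed patterns,
# not the ones shipped in regex_usa).
zip_pat     <- "\\b\\d{5}(?:-\\d{4})?\\b"       # 5 digits, optional dash + 4 digits
percent_pat <- "(?:\\d+(?:\\.\\d+)?|\\.\\d+)%"  # whole, whole + decimal, or decimal, then %
tag_pat     <- "@\\w{1,15}"                     # at sign plus up to 15 word characters

# Hypothetical helper: return every match of a PCRE pattern per element.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

x <- "Mail @r_user at 14222-1234 or 90210; growth was 3.5% then .5%."

zips     <- extract_all(x, zip_pat)[[1]]
percents <- extract_all(x, percent_pat)[[1]]
tags     <- extract_all(x, tag_pat)[[1]]

print(zips)
print(percents)
print(tags)
```

Comparing the output of such a stand-in with the corresponding ex_XXX function is a quick way to see where the dictionary pattern is more constrained.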
Remove/replace/extract substrings from a string. A function generator used to make regex functions that operate like other qdapRegex rm_XXX functions. Use rm_ for removal and ex_ for extraction.
rm_(...) ex_(...)
... |
Arguments passed to
|
Returns a function that operates like other qdapRegex rm_XXX functions but with user-defined defaults.
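The generator idea can be sketched with a base R closure. The function make_rm below is hypothetical and only mimics the behaviour described here (fix a pattern once, get back a removal or extraction function); it is not how rm_ is implemented in the package, and its white space clean-up is a simplification.

```r
# Hypothetical generator mimicking the idea behind rm_()/ex_();
# it is not the package implementation.
make_rm <- function(pattern, extract = FALSE, replacement = "") {
  force(pattern); force(extract); force(replacement)
  function(text.var) {
    if (extract) {
      # return every match per element
      regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))
    } else {
      # remove (or replace) matches, then collapse and trim white space
      out <- gsub(pattern, replacement, text.var, perl = TRUE)
      trimws(gsub("\\s+", " ", out))
    }
  }
}

drop_digits  <- make_rm("[0-9]")
grab_numbers <- make_rm("[0-9]+", extract = TRUE)

x <- " I 12 li34ke ice56cream78. "
removed <- drop_digits(x)
numbers <- grab_numbers(x)[[1]]

print(removed)
print(numbers)
```

Capturing the defaults in a closure keeps the call sites short, which is the same convenience rm_ offers for patterns that are reused often.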
rm_digit <- rm_(pattern="[0-9]")
rm_digit(" I 12 li34ke ice56cream78. ")

rm_lead <- rm_(pattern="^\\s+", trim = FALSE, clean = FALSE)
rm_lead(" I 12 li34ke ice56cream78. ")

rm_all_except_letters <- rm_(pattern="[^ a-zA-Z]")
rm_all_except_letters(" I 12 li34ke ice56cream78. ")

extract_consec_num <- rm_(pattern="[0-9]+", extract = TRUE)
extract_consec_num(" I 12 li34ke ice56cream78. ")

## Using the supplemental dictionary dataset:
x <- "A man lives there! The dog likes it. I want the map. I want an apple."
extract_word_after_the <- rm_(extract=TRUE, pattern="@after_the")
extract_word_after_a <- rm_(extract=TRUE, pattern="@after_a")
extract_word_after_the(x)
extract_word_after_a(x)

f <- rm_(pattern="@time_12_hours")
f("I will go at 12:35 pm")

x <- c(
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]"
)
file_ext2 <- rm_(pattern="(?<=\\.)[a-z]*$", extract=TRUE)
tools::file_ext(x)
file_ext2(x)
Remove/replace/extract abbreviations from a string containing lower case or capital letters followed by a period and then an optional space (this must be repeated 2 or more times).
rm_abbreviation( text.var, trim = !extract, clean = TRUE, pattern = "@rm_abbreviation", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_abbreviation( text.var, trim = !extract, clean = TRUE, pattern = "@rm_abbreviation", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with abbreviations removed.
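The rule stated in the description (a letter, a period, an optional space, repeated 2 or more times) can be written directly as a regular expression. The pattern abbr_pat below is an assumed approximation for illustration and may differ from the "@rm_abbreviation" entry in the dictionary.

```r
# Assumed approximation of the abbreviation rule: a letter followed by a
# period and an optional space, repeated two or more times.
abbr_pat <- "\\b(?:[A-Za-z]\\.\\s?){2,}"

x <- "She will send it A.S.A.P. (e.g. as soon as you can) said I."

# Extraction: matches may carry the optional trailing space, so trim it
abbreviations <- trimws(regmatches(x, gregexpr(abbr_pat, x, perl = TRUE))[[1]])

# Removal
without <- gsub(abbr_pat, "", x, perl = TRUE)

print(abbreviations)
print(without)
```

The final "I." is left alone because a single letter-plus-period does not satisfy the "2 or more times" requirement.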
Other rm_ functions: rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("I want $2.33 at 2:30 p.m. to go to A.n.p.",
    "She will send it A.S.A.P. (e.g. as soon as you can) said I.",
    "Hello world.",
    "In the U. S. A.")

rm_abbreviation(x)
ex_abbreviation(x)
Remove/replace/extract strings bounded between a left and right marker.
rm_between( text.var, left, right, fixed = TRUE, trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_between_multiple( text.var, left, right, fixed = TRUE, trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = FALSE, merge = TRUE ) ex_between( text.var, left, right, fixed = TRUE, trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) ex_between_multiple( text.var, left, right, fixed = TRUE, trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = FALSE, merge = TRUE )
text.var |
The text variable. |
left |
A vector of character or numeric symbols as the left edge to extract. |
right |
A vector of character or numeric symbols as the right edge to extract. |
fixed |
logical. If |
trim |
logical. If |
clean |
trim logical. If |
replacement |
Replacement for matched |
extract |
logical. If |
include.markers |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
merge |
logical. If |
Returns a character string with markers removed. rm_between returns merged strings and is significantly faster. With rm_between_multiple the strings are optionally merged by left/right symbols. The latter approach is more flexible and names extracted strings by symbol boundaries; however, it is slower than rm_between.
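For fixed (literal) markers, the removal and extraction described above can be approximated in base R. The helper between_pattern is hypothetical; it relies on PCRE's \Q...\E quoting so the markers are matched literally, and it is a sketch of the idea and not the code used by rm_between.

```r
# Hypothetical helper: build a lazy "between" pattern from fixed markers.
# \Q...\E makes PCRE treat the markers literally.
between_pattern <- function(left, right) {
  paste0("\\Q", left, "\\E(.*?)\\Q", right, "\\E")
}

x <- "I like [bots] (not) today."
pat <- between_pattern("(", ")")

# Removal, markers included, followed by white space clean-up
removed <- trimws(gsub("\\s+", " ", gsub(pat, "", x, perl = TRUE)))

# Extraction, markers dropped by keeping only the captured group
hits  <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
inner <- gsub(pat, "\\1", hits, perl = TRUE)

print(removed)
print(inner)
```

Keeping the captured group separate from the markers is what makes a choice like include.markers possible: returning hits keeps the markers, returning inner drops them.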
gsub, rm_bracket, stri_extract_all_regex
Other rm_ functions: rm_abbreviation(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- "I like [bots] (not)."
rm_between(x, "(", ")")
ex_between(x, "(", ")")

rm_between(x, c("(", "["), c(")", "]"))
ex_between(x, c("(", "["), c(")", "]"))

rm_between(x, c("(", "["), c(")", "]"), include.markers=FALSE)
ex_between(x, c("(", "["), c(")", "]"), include.markers=TRUE)

## multiple (naming and ability to keep separate bracket types but slower)
x <- c("Where is the /big dog#?",
    "I think he's @arunning@b with /little cat#.")

rm_between_multiple(x, "@a", "@b")
ex_between_multiple(x, "@a", "@b")
rm_between_multiple(x, c("/", "@a"), c("#", "@b"))
ex_between_multiple(x, c("/", "@a"), c("#", "@b"))

x2 <- c("Where is the L1big dogL2?",
    "I think he's 98running99 with L1little catL2.")
rm_between_multiple(x2, c("L1", 98), c("L2", 99))
ex_between_multiple(x2, c("L1", 98), c("L2", 99))

state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
    "What should we do?", "You liar, it stinks!", "I am telling the truth!",
    "How can we be certain?", "There is no way.", "I distrust you.",
    "What are you talking about?", "Shall we move on? Good then.",
    "I'm hungry. Let's eat. You already?")

rm_between_multiple(state, c("is", "we"), c("too", "on"))

## Use Grouping
s <- "something before stuff $some text$ in between $1$ and after"
rm_between(s, "$", "$", replacement="<B>\\2<E>")

## Using regular expressions as boundaries (fixed = FALSE)
x <- c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

ex_between(x, left='There', right = '[mb]illion', fixed = FALSE, include.markers=TRUE)
Remove/replace/extract bracketed strings.
rm_bracket( text.var, pattern = "all", trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_round( text.var, pattern = "(", trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_square( text.var, pattern = "[", trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_curly( text.var, pattern = "{", trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_angle( text.var, pattern = "<", trim = TRUE, clean = TRUE, replacement = "", extract = FALSE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) rm_bracket_multiple( text.var, trim = TRUE, clean = TRUE, pattern = "all", replacement = "", extract = FALSE, include.markers = FALSE, merge = TRUE ) ex_bracket( text.var, pattern = "all", trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) ex_bracket_multiple( text.var, trim = TRUE, clean = TRUE, pattern = "all", replacement = "", extract = TRUE, include.markers = FALSE, merge = TRUE ) ex_angle( text.var, pattern = "<", trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) ex_round( text.var, pattern = "(", trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... 
) ex_square( text.var, pattern = "[", trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... ) ex_curly( text.var, pattern = "{", trim = TRUE, clean = TRUE, replacement = "", extract = TRUE, include.markers = ifelse(extract, FALSE, TRUE), dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
pattern |
The type of bracket (and encased text) to remove. This is one
or more of the strings |
trim |
logical. If |
clean |
trim logical. If |
replacement |
Replacement for matched |
extract |
logical. If |
include.markers |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
merge |
logical. If |
rm_bracket - returns a character string with bracketed strings removed.
rm_round - returns a character string with round brackets removed.
rm_square - returns a character string with square brackets removed.
rm_curly - returns a character string with curly brackets removed.
rm_angle - returns a character string with angle brackets removed.
rm_bracket_multiple - returns a character string with multiple brackets removed. If extract = TRUE the results are optionally merged and named by bracket type. This is more flexible than rm_bracket but slower.
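Naming results by bracket type, as described for the extract = TRUE case, can be illustrated with a named vector of patterns in base R. The patterns in bracket_patterns are simplified assumptions (they do not handle nested brackets) and are not taken from the package.

```r
# Assumed, simplified patterns for each bracket type (no nesting support).
bracket_patterns <- c(
  square = "\\[[^]]*\\]",
  round  = "\\([^)]*\\)",
  curly  = "\\{[^}]*\\}",
  angle  = "<[^>]*>"
)

x <- "I like [bots] (not). And <likely> many do not {he he}"

# One extraction per bracket type; lapply keeps the names of the patterns
extracted <- lapply(bracket_patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})

# Removing a single bracket type
no_square <- gsub(bracket_patterns[["square"]], "", x, perl = TRUE)

print(extracted)
print(no_square)
```

Running one pattern per type is slower than a single combined pattern, which matches the speed versus flexibility trade-off noted above.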
Martin Morgan and Tyler Rinker <[email protected]>.
https://stackoverflow.com/q/8621066/1000343
gsub, rm_between, stri_extract_all_regex
Other rm_ functions: rm_abbreviation(), rm_between(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
examp <- structure(list(person = structure(c(1L, 2L, 1L, 3L),
    .Label = c("bob", "greg", "sue"), class = "factor"),
    text = c("I love chicken [unintelligible]!",
    "Me too! (laughter) It's so good.[interrupting]",
    "Yep it's awesome {reading}.", "Agreed. {is so much fun}")),
    .Names = c("person", "text"), row.names = c(NA, -4L),
    class = "data.frame")

examp
rm_bracket(examp$text, pattern = "square")
rm_bracket(examp$text, pattern = "curly")
rm_bracket(examp$text, pattern = c("square", "round"))
rm_bracket(examp$text)

ex_bracket(examp$text, pattern = "square")
ex_bracket(examp$text, pattern = "curly")
ex_bracket(examp$text, pattern = c("square", "round"))
ex_bracket(examp$text, pattern = c("square", "round"), merge = FALSE)
ex_bracket(examp$text)
ex_bracket(examp$text, include.markers=TRUE)

## Not run: 
library(qdap)
ex_bracket(examp$text, pattern="curly") %>%
    unlist() %>%
    na.omit() %>%
    paste2()
## End(Not run)

x <- "I like [bots] (not). And <likely> many do not {he he}"

rm_round(x)
ex_round(x)
ex_round(x, include.markers = TRUE)

rm_square(x)
ex_square(x)

rm_curly(x)
ex_curly(x)

rm_angle(x)
ex_angle(x)

lapply(ex_between('She said, "I am!" and he responded..."Am what?".',
    left='"', right='"'), "[", c(TRUE, FALSE))
Remove/replace/extract 'all caps' words containing 2 or more consecutive upper case letters from a string.
rm_caps( text.var, trim = !extract, clean = TRUE, pattern = "@rm_caps", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_caps( text.var, trim = !extract, clean = TRUE, pattern = "@rm_caps", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with "all caps" removed.
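A rough base R equivalent helps show what "2 or more consecutive upper case letters" means in practice, including the case-lowering replacement. The pattern caps_pat is an assumption for illustration; the "@rm_caps" dictionary entry may be defined differently. The "\\L" case conversion in the replacement is a base R feature available only with perl = TRUE.

```r
# Assumed approximation: a whole word made of two or more capital letters.
caps_pat <- "\\b([A-Z]{2,})\\b"

x <- "UGGG! When I use caps I am YELLING!"

# Extraction of the all-caps words
found <- regmatches(x, gregexpr(caps_pat, x, perl = TRUE))[[1]]

# Replacement that lower-cases each match instead of deleting it
lowered <- gsub(caps_pat, "\\L\\1", x, perl = TRUE)

print(found)
print(lowered)
```

The single-letter word "I" is untouched because it does not reach the two-letter minimum.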
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("UGGG! When I use caps I am YELLING!")

rm_caps(x)
rm_caps(x, replacement="\\L\\1")
ex_caps(x)
Remove/replace/extract 'all caps' phrases containing 1 or more consecutive upper case letters from a string. If the phrase is a single word, that word must be 3 or more letters long.
rm_caps_phrase( text.var, trim = !extract, clean = TRUE, pattern = "@rm_caps_phrase", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_caps_phrase( text.var, trim = !extract, clean = TRUE, pattern = "@rm_caps_phrase", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with "all caps phrases" removed.
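The two-part rule (a run of all-caps words, or a single all-caps word of 3 or more letters) can be sketched as an alternation. The pattern phrase_pat is an assumed approximation written for this example; it is not the "@rm_caps_phrase" entry and will differ from it on edge cases.

```r
# Assumed approximation: either several all-caps words separated by white
# space (apostrophes allowed inside a word), or one all-caps word of 3+ letters.
phrase_pat <- "\\b(?:[A-Z][A-Z']*\\s+)+[A-Z][A-Z']*\\b|\\b[A-Z]{3,}\\b"

x <- c("UGGG! When I use caps I am YELLING!",
       "or trying to make a LITTLE SEEM like IT ISN'T LITTLE")

phrases <- regmatches(x, gregexpr(phrase_pat, x, perl = TRUE))

print(phrases)
```

In the first string the lone "I" is skipped because a one-word phrase needs at least 3 letters, while in the second string "IT ISN'T LITTLE" is kept together as one phrase.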
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("UGGG! When I use caps I am YELLING!",
    "Or it may mean this is VERY IMPORTANT!",
    "or trying to make a LITTLE SEEM like IT ISN'T LITTLE"
)

rm_caps_phrase(x)
ex_caps_phrase(x)
Remove/replace/extract APA6 style citations from a string.
as_count - Counts of normalized citations ("et al." entries are converted to the original author, using an author + year standardization).
rm_citation( text.var, trim = !extract, clean = TRUE, pattern = "@rm_citation", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_citation( text.var, trim = !extract, clean = TRUE, pattern = "@rm_citation", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... ) as_count(x, ...)
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Ignored. |
x |
The output from |
The default regular expression used by rm_citation finds in-text and parenthetical citations. This behavior can be altered by using a secondary regular expression from the regex_usa data (or other dictionary) via pattern = "@rm_citation2" or pattern = "@rm_citation3". See Examples for example usage.
rm_citation - Returns a character string with citations removed.
as_count - Returns a data.frame of Authors, Years, and n (counts).
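The shape of the counts table can be pictured with a small base R tabulation. This is a hypothetical sketch: it starts from already extracted "Author, Year" strings, does none of the "et al." normalization, and is not the code behind as_count.

```r
# Already-extracted citations in "Author, Year" form (made-up input).
cites <- c("Rinker, 2014", "Baer, 2005", "Rinker, 2014", "Foo, 2012")

# Tabulate identical citations
counts <- as.data.frame(table(cites), stringsAsFactors = FALSE)
names(counts) <- c("citation", "n")

# Split each citation into author and year columns
counts$Author <- sub(",\\s*\\d{4}$", "", counts$citation, perl = TRUE)
counts$Year   <- sub("^.*,\\s*", "", counts$citation, perl = TRUE)

# Most frequent first, keeping only the three reported columns
counts <- counts[order(-counts$n), c("Author", "Year", "n")]
rownames(counts) <- NULL

print(counts)
```

Splitting author and year into separate columns is what allows later grouping by author alone or by year alone.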
This function is experimental.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
## All Citations
x <- c("Hello World (V. Raptor, 1986) bye",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)

rm_citation(x)
ex_citation(x)
as_count(ex_citation(x))
rm_citation(x, replacement="[CITATION HERE]")

## Not run: 
qdapTools::vect2df(sort(table(unlist(rm_citation(x, extract=TRUE)))),
    "citation", "count")
## End(Not run)

## In-Text
ex_citation(x, pattern="@rm_citation2")

## Parenthetical
ex_citation(x, pattern="@rm_citation3")

## Not run: 
## Mining Citation
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdap, qdapTools, dplyr, ggplot2)

url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

parts <- read_docx("whole_language_timeline-updated.docx") %>%
    rm_non_ascii() %>%
    split_vector(split = "References", include = TRUE, regex=TRUE)

parts[[1]]

parts[[1]] %>%
    unbag() %>%
    ex_citation() %>%
    c()

## Counts
parts[[1]] %>%
    unbag() %>%
    ex_citation() %>%
    as_count()

## By line
ex_citation(parts[[1]])

## Frequency
cites <- parts[[1]] %>%
    unbag() %>%
    ex_citation() %>%
    c() %>%
    data_frame(citation=.) %>%
    count(citation) %>%
    arrange(n) %>%
    mutate(citation=factor(citation, levels=citation))

## Distribution of citations (find locations and then plot)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end = m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),
        axis.title=element_text(color="grey50")
    )
## End(Not run)
Remove/replace/extract LaTeX citations from a string.
rm_citation_tex( text.var, trim = !extract, clean = TRUE, pattern = "@rm_citation_tex", replacement = "", extract = FALSE, split = extract, unlist.extract = TRUE, dictionary = getOption("regex.library"), ... ) ex_citation_tex( text.var, trim = !extract, clean = TRUE, pattern = "@rm_citation_tex", replacement = "", extract = TRUE, split = extract, unlist.extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or character string). |
replacement |
Replacement for matched |
extract |
logical. If |
split |
logical. If |
unlist.extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Additional arguments passed to
|
Returns a character string with citations (bibkeys) removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "I say \\parencite*{Ted2005, Moe1999} go there in \\textcite{Few2010} said to.",
    "But then \\authorcite{Ware2013} said it was so \\pcite[see][p. 22]{Get9999c}.",
    "then I \\citep[p. 22]{Foo1882c} him"
)

rm_citation_tex(x)
rm_citation_tex(x, replacement="[[CITATION]]")
ex_citation_tex(x)
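For readers who want to see the idea behind LaTeX citation matching without the package, the following base R sketch may help. The pattern `tex_cite` is a hypothetical stand-in written for this illustration; it is not the expression stored under "@rm_citation_tex", and the package's output may differ in detail.

```r
# Hypothetical pattern (an assumption, not the package's own regex):
# a backslash, a command name containing "cite", an optional star,
# any number of [optional] arguments, then {bibkeys}.
tex_cite <- "\\\\[A-Za-z]*cite[A-Za-z]*\\*?(?:\\[[^\\]]*\\])*\\{[^}]*\\}"

x <- c(
    "I say \\parencite*{Ted2005, Moe1999} go there in \\textcite{Few2010} said to.",
    "then I \\citep[p. 22]{Foo1882c} him"
)

# Whole citation commands, one list element per input string
cmds <- regmatches(x, gregexpr(tex_cite, x, perl = TRUE))

# Bibkeys: keep the text inside the braces, then split on commas
keys <- lapply(cmds, function(cmd) {
    inside <- sub("^.*\\{([^}]*)\\}$", "\\1", cmd, perl = TRUE)
    trimws(unlist(strsplit(inside, ",", fixed = TRUE)))
})

# Removal: drop the commands, then collapse the doubled spaces left behind
removed <- gsub(" {2,}", " ", gsub(tex_cite, "", x, perl = TRUE))
```

Here `keys` holds the bibkeys per input string and `removed` holds the text with the commands stripped, which roughly mirrors the extract and remove modes described above.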
Remove/replace/extract city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) from a string.
rm_city_state( text.var, trim = !extract, clean = TRUE, pattern = "@rm_city_state", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_city_state( text.var, trim = !extract, clean = TRUE, pattern = "@rm_city_state", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with city & state removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- paste0("I went to Washington Heights, NY for food! ",
    "It's in West ven,PA, near Bolly Bolly Bolly, CA!",
    "I like Movies, PG13")

rm_city_state(x)
ex_city_state(x)
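The description above (capitalized words, a comma, then two capital letters) can be approximated in base R. The pattern `city_state` below is a simplified, hypothetical expression written for this illustration only; it is not the one stored under "@rm_city_state" and it ignores the single lower case word case.

```r
# Hypothetical pattern (an assumption): one or more capitalized words,
# a comma, an optional space, then exactly two capital letters
# that end at a word boundary.
city_state <- "\\b(?:[A-Z][a-z]+\\s)*[A-Z][a-z]+,\\s?[A-Z]{2}\\b"

x <- c("I went to Washington Heights, NY for food!",
    "near Bolly Bolly Bolly, CA!",
    "I like Movies, PG13")

# One list element per input string; character(0) when nothing matches
found <- regmatches(x, gregexpr(city_state, x, perl = TRUE))
```

The trailing word boundary is what keeps "Movies, PG13" from being read as a city and state: "PG" is followed by a digit, so no boundary exists after the two capitals.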
Remove/replace/extract city (single lower case word or multiple consecutive capitalized words before a comma and state) + state (2 consecutive capital letters) + zip code (5 digits or 5 + 4 digits) from a string.
rm_city_state_zip( text.var, trim = !extract, clean = TRUE, pattern = "@rm_city_state_zip", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_city_state_zip( text.var, trim = !extract, clean = TRUE, pattern = "@rm_city_state_zip", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with city, state, & zip removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- paste0("I went to Washington Heights, NY 54321 for food! ",
    "It's in West ven,PA 12345, near Bolly Bolly Bolly, CA12345-1234!",
    "hello world")

rm_city_state_zip(x)
ex_city_state_zip(x)
Remove/replace/extract dates from a string in the form of (1) XX/XX/XXXX, XX/XX/XX, XX-XX-XXXX, XX-XX-XX, XX.XX.XXXX, or XX.XX.XX OR (2) March XX, XXXX or Mar XX, XXXX OR (3) both forms.
rm_date( text.var, trim = !extract, clean = TRUE, pattern = "@rm_date", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_date( text.var, trim = !extract, clean = TRUE, pattern = "@rm_date", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
The default regular expression used by rm_date finds numeric representations, not words/abbreviations. This means that "June 13, 2002" is not matched. This behavior can be altered (to include month names/abbreviations) by using a secondary regular expression from the regex_usa data (or other dictionary) via (pattern = "@rm_date2", pattern = "@rm_date3", or pattern = "@rm_date4"). See Examples for example usage.
Returns a character string with dates removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
## Numeric Date Representation
x <- paste0("Format dates as 04/12/2014, 04-12-2014, 04.12.2014. or",
    " 04/12/14 but leaves mismatched: 12.12/2014")
rm_date(x)
ex_date(x)

## Word/Abbreviation Date Representation
x2 <- paste0("Format dates as Sept 09, 2002 or October 22, 1887",
    "but not 04-12-2014 and may match good 00, 9999")
rm_date(x2, pattern="@rm_date2")
ex_date(x2, pattern="@rm_date2")

## Year-Month-Day Representation
x3 <- sprintf("R uses time in this format %s.", Sys.time())
rm_date(x3, pattern="@rm_date3")

## Grab all types
ex_date(c(x, x2, x3), pattern="@rm_date4")
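The "leaves mismatched" behavior in the numeric example can be reproduced with a back-reference that forces both separators to be the same character. The pattern `num_date` below is a hypothetical base R sketch of that idea and is not claimed to be the expression stored under "@rm_date".

```r
# Hypothetical pattern (an assumption): 1-2 digits, a separator captured
# in group 1, 1-2 digits, the SAME separator again, then a 4- or 2-digit year.
num_date <- "\\b\\d{1,2}([/.-])\\d{1,2}\\1(?:\\d{4}|\\d{2})\\b"

x <- paste0("Format dates as 04/12/2014, 04-12-2014, 04.12.2014. or",
    " 04/12/14 but leaves mismatched: 12.12/2014")

# All numeric dates whose two separators agree
dates <- regmatches(x, gregexpr(num_date, x, perl = TRUE))[[1]]
```

Because "12.12/2014" mixes "." and "/", the back-reference fails and that substring is left out of `dates`.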
Remove/replace/extract substring from a string. This is the template used by other qdapRegex rm_XXX functions.
rm_default( text.var, trim = !extract, clean = TRUE, pattern, replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_default( text.var, trim = !extract, clean = TRUE, pattern, replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with substring removed.
rm_, gsub, stri_extract_all_regex
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
## Built in regex dictionary
rm_default("I live in Buffalo, NY 14217", pattern="@rm_city_state_zip")

## User defined regular expression
pat <- "(\\s*([A-Z][\\w-]*)+),\\s([A-Z]{2})\\s(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"
rm_default("I live in Buffalo, NY 14217", pattern=pat)
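Since rm_default is described as the template for the other functions, a rough base R sketch of the remove-or-extract switch may make the shared arguments easier to follow. The helper `rm_sketch` is hypothetical and written for this illustration; it omits the "@" dictionary lookup and is not the package's implementation.

```r
# Hypothetical helper (an assumption, not qdapRegex code): one function
# that either extracts the matches or removes/replaces them.
rm_sketch <- function(text, pattern, replacement = "", extract = FALSE) {
    if (extract) {
        # A list with one character vector of matches per input string
        return(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
    }
    out <- gsub(pattern, replacement, text, perl = TRUE)
    # Rough stand-in for trim/clean: collapse white space, strip the ends
    trimws(gsub("\\s+", " ", out))
}

# The user defined pattern from the example above
pat <- "(\\s*([A-Z][\\w-]*)+),\\s([A-Z]{2})\\s(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"
txt <- "I live in Buffalo, NY 14217"

removed <- rm_sketch(txt, pat)
extracted <- rm_sketch(txt, pat, extract = TRUE)
```

With this pattern the match includes the space before "Buffalo", so the extracted value may need trimming; this is the kind of cleanup the trim argument is there to handle.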
Remove/replace/extract dollar amounts from a string.
rm_dollar( text.var, trim = !extract, clean = TRUE, pattern = "@rm_dollar", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_dollar( text.var, trim = !extract, clean = TRUE, pattern = "@rm_dollar", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with dollar amounts removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
    "14% is $26 or $25.99", "Really?...$123,234.99 is not cheap.")

rm_dollar(x)
ex_dollar(x)
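As a rough illustration of what a dollar-amount pattern has to cover (thousands separators and optional cents, but not percentages), here is a base R sketch. The pattern `dollar` is hypothetical and is not the expression stored under "@rm_dollar".

```r
# Hypothetical pattern (an assumption): a dollar sign, digits,
# optional comma-separated thousands groups, optional decimal part.
dollar <- "\\$\\d+(?:,\\d{3})*(?:\\.\\d+)?"

x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
    "14% is $26 or $25.99", "Really?...$123,234.99 is not cheap.")

# Matches per input string; character(0) where there is no amount
found <- regmatches(x, gregexpr(dollar, x, perl = TRUE))

# Convert the extracted strings to numbers by dropping "$" and ","
amounts <- as.numeric(gsub("[$,]", "", unlist(found)))
```

Requiring a digit after the decimal point keeps a sentence-ending period, as in "$25.99.", out of the match.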
Remove/replace/extract email addresses from a string.
rm_email( text.var, trim = !extract, clean = TRUE, pattern = "@rm_email", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_email( text.var, trim = !extract, clean = TRUE, pattern = "@rm_email", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with email addresses removed.
Barry Rowlingson and Tyler Rinker <[email protected]>.
The email regular expression was taken from: https://stackoverflow.com/a/25077704/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- paste("fred is fred@foo.com and joe is joe@example.com - but @this is a",
    "twitter handle for twit@here.com or foo+bar@google.com/fred@foo.fnord")

x2 <- c("fred is fred@foo.com and joe is joe@example.com - but @this is a",
    "twitter handle for twit@here.com or foo+bar@google.com/fred@foo.fnord",
    "hello world")

rm_email(x)
rm_email(x, replacement = '<a href="mailto:\\1" target="_blank">\\1</a>')
ex_email(x)
ex_email(x2)
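The example text mixes real addresses with a bare "@this" handle. The base R sketch below shows one simple way to tell them apart; the pattern `email` is a deliberately loose, hypothetical expression and is not the Stack Overflow pattern the package uses.

```r
# Hypothetical pattern (an assumption): a non-empty local part, "@",
# a domain, then a dot and an alphabetic top-level domain.
email <- "[[:alnum:]._+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

x <- paste("fred is fred@foo.com and joe is joe@example.com - but @this is a",
    "twitter handle for twit@here.com or foo+bar@google.com/fred@foo.fnord")

# Extract every address; "@this" has no local part, so it is skipped
found <- regmatches(x, gregexpr(email, x, perl = TRUE))[[1]]

# Wrap the pattern in a capture group so "\\1" can reuse the address
tagged <- gsub(paste0("(", email, ")"), "<\\1>", x, perl = TRUE)
```

The capture group plays the same role as the "\\1" back-reference in the mailto replacement shown in the example above.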
Remove/replace/extract common emoticons from a string.
rm_emoticon( text.var, trim = !extract, clean = TRUE, pattern = "@rm_emoticon", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_emoticon( text.var, trim = !extract, clean = TRUE, pattern = "@rm_emoticon", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with emoticons removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("are :-)) it 8-D he XD on =-D they :D of :-) is :> for :o) that :-/",
    "as :-D I xD with :^) a =D to =) the 8D and :3 in =3 you 8) his B^D was")

rm_emoticon(x)
ex_emoticon(x)
Remove/replace/extract endmarks from a string.
rm_endmark( text.var, trim = !extract, clean = TRUE, pattern = "@rm_endmark", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_endmark( text.var, trim = !extract, clean = TRUE, pattern = "@rm_endmark", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
The default regular expression used by rm_endmark finds endmark punctuation used in the qdap package; this includes ! . ? * AND |. This behavior can be altered (to ; AND : or to use just ! . AND ?) by using a secondary regular expression from the regex_usa data (or other dictionary) via (pattern = "@rm_endmark2" or pattern = "@rm_endmark3"). See Examples for example usage.
Returns a character string with endmarks removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("I like the dog.", "I want it *|", "I;",
    "Who is| that?", "Hello world", "You...")

rm_endmark(x)
ex_endmark(x)

rm_endmark(x, pattern="@rm_endmark2")
ex_endmark(x, pattern="@rm_endmark2")

rm_endmark(x, pattern="@rm_endmark3")
ex_endmark(x, pattern="@rm_endmark3")
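The default set of endmarks (! . ? * and |) can be illustrated with a small base R sketch. It assumes that an endmark is a run of those characters at the very end of the string; that anchoring is an assumption made for this illustration and is not confirmed as the behavior of "@rm_endmark".

```r
x <- c("I like the dog.", "I want it *|", "I;",
    "Who is| that?", "Hello world", "You...")

# Assumed definition: one or more of ! . ? * | at the end of the string.
# Inside a bracket expression these characters are all literal.
endmark <- "[!.?*|]+$"

# Remove the endmark together with any white space in front of it
stripped <- sub(paste0("\\s*", endmark), "", x)

# Extract the endmarks; strings without one are dropped by regmatches()
marks <- regmatches(x, regexpr(endmark, x))
```

Under this assumption the "|" in the middle of "Who is| that?" is left alone, and "I;" is untouched because ";" is not in the default set.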
Remove/replace/extract hash tags from a string.
rm_hash( text.var, trim = !extract, clean = TRUE, pattern = "@rm_hash", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_hash( text.var, trim = !extract, clean = TRUE, pattern = "@rm_hash", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with hash tags removed.
stackoverflow's hwnd and Tyler Rinker <[email protected]>.
The hash tag regular expression was taken from: https://stackoverflow.com/a/25096474/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

rm_hash(x)
rm_hash(rm_tag(x))
ex_hash(x)

## remove just the hash symbol
rm_hash(x, replace="\\3")
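The third example string ends in a URL fragment ("/#1") that should not count as a hash tag. One way to express that in base R is a negative lookbehind, sketched below. The pattern `hash` is hypothetical and simpler than the Stack Overflow expression the package credits, so its capture groups do not line up with the "\\3" used above.

```r
# Hypothetical pattern (an assumption): "#" plus word characters, but only
# when the "#" is NOT preceded by "/" or by another word character.
hash <- "(?<![/\\w])#\\w+"

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1")

# Hash tags per input string; the URL fragment "#1" is skipped
found <- regmatches(x, gregexpr(hash, x, perl = TRUE))

# Remove just the hash symbol by keeping the captured word
bare <- gsub("(?<![/\\w])#(\\w+)", "\\1", x[1], perl = TRUE)
```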
Remove/replace/extract words that are n letters in length (apostrophes not counted).
rm_nchar_words( text.var, n, trim = !extract, clean = TRUE, pattern = "@rm_nchar_words", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_nchar_words( text.var, n, trim = !extract, clean = TRUE, pattern = "@rm_nchar_words", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
n |
The number of letters counted in the word. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
The default regular expression used by rm_nchar_words counts letter length, not characters. This means that apostrophes are not included in the character count. This behavior can be altered (to include apostrophes in the character count) by using a secondary regular expression from the regex_usa data (or other dictionary) via (pattern = "@rm_nchar_words2"). See Examples for example usage.
Returns a character string with n letter words removed.
stackoverflow's CharlieB and Tyler Rinker <[email protected]>.
The n letter/character word regular expression was taken from: https://stackoverflow.com/a/25243885/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- "This is Jon's dogs' 'bout there in a word Mike's re'y."
rm_nchar_words(x, 4)
ex_nchar_words(x, 4)

## Count characters (apostrophes and letters)
ex_nchar_words(x, 5, pattern = "@rm_nchar_words2")

## nchar range
rm_nchar_words(x, "1,2")

## Not run: 
## Larger example
library(qdap)
ex_nchar_words(hamlet[["dialogue"]], 5)

## End(Not run)
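The difference between counting letters and counting characters can be checked by hand in base R. The sketch below tokenizes on letters and apostrophes and then counts each way; it is an illustration of the two counting rules only, and the package's regular expressions may treat apostrophes at word edges differently.

```r
x <- "This is Jon's dogs' 'bout there in a word Mike's re'y."

# Tokens made of letters and apostrophes (the final period is dropped)
words <- regmatches(x, gregexpr("[A-Za-z']+", x))[[1]]

# Letter count ignores apostrophes; character count includes them
n_letters <- nchar(gsub("'", "", words, fixed = TRUE))
n_chars <- nchar(words)

four_letters <- words[n_letters == 4]
five_chars <- words[n_chars == 5]
```

"Jon's" has four letters but five characters, so it appears in both results, while "there" appears only in the five character result.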
Remove/replace/extract non-ASCII substring from a string. This is the template used by other qdapRegex rm_XXX functions.
rm_non_ascii( text.var, trim = !extract, clean = TRUE, pattern = "@rm_non_ascii", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ascii.out = TRUE, ... ) ex_non_ascii( text.var, trim = !extract, clean = TRUE, pattern = "@rm_non_ascii", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ascii.out = TRUE, ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
ascii.out |
logical. If |
... |
ignored. |
Returns a character string with all non-ASCII characters removed.
macOS 14 (Sonoma), and likely all later versions, has a different implementation of iconv, which may not produce the expected results.
iconv is used within rm_non_ascii. iconv's behavior may not be consistent across operating systems.
stackoverflow's MrFlick, hwnd, and Tyler Rinker <[email protected]>.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c("Hello World", "Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x

rm_non_ascii(x)
rm_non_ascii(x, replacement="<<FLAG>>")
ex_non_ascii(x)
ex_non_ascii(x, ascii.out=FALSE)

## simple regex to remove non-ascii
rm_default(x, pattern="[^ -~]")
ex_default(x, pattern="[^ -~]")
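Because iconv may behave differently across operating systems, it can be useful to have a check that does not depend on it. The sketch below works on Unicode code points in base R and keeps only the printable ASCII range (space through tilde), the same range as the simple "[^ -~]" regex in the example. The helper `strip_non_ascii` is hypothetical and is not how rm_non_ascii is implemented.

```r
# Hypothetical helper (an assumption, not qdapRegex code): keep code points
# 32..126 and substitute `replacement` for everything else.
strip_non_ascii <- function(x, replacement = "") {
    vapply(x, function(s) {
        cp <- utf8ToInt(enc2utf8(s))                # code points of one string
        keep <- cp >= 32L & cp <= 126L              # printable ASCII only
        chars <- intToUtf8(cp, multiple = TRUE)     # one element per character
        paste0(ifelse(keep, chars, replacement), collapse = "")
    }, character(1), USE.NAMES = FALSE)
}

x <- c("Hello World", "Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")

cleaned <- strip_non_ascii(x)
flagged <- strip_non_ascii(x[2], replacement = "<<FLAG>>")
```

This sketch does not handle NA input; a production version would need to pass missing values through.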
rm_non_words - Remove/replace/extract non-words (anything that's not a letter or apostrophe; also removes multiple white spaces) from a string.
rm_non_words( text.var, trim = !extract, clean = TRUE, pattern = "@rm_non_words", replacement = " ", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_non_words( text.var, trim = !extract, clean = TRUE, pattern = "[^A-Za-z' ]+", replacement = " ", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with non-words removed.
Setting the argument extract = TRUE is not very useful. Use the following setup instead (see Examples for a demonstration).
rm_default(x, pattern = "[^A-Za-z' ]", extract=TRUE)
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "I like 56 dogs!",
    "It's seventy-two feet from the px290.",
    NA,
    "What",
    "that1is2a3way4to5go6.",
    "What do you*% want? For real%; I think you'll see.",
    "Oh some <html>code</html> to remove"
)

rm_non_words(x)
ex_non_words(x)
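The usage above shows "[^A-Za-z' ]+" as the default pattern of ex_non_words and " " as the replacement. A base R sketch of the same replace-then-tidy sequence follows; whether "@rm_non_words" resolves to exactly this expression, and how the package tidies white space internally, are assumptions.

```r
x <- c("I like 56 dogs!", "It's seventy-two feet from the px290.",
    NA, "that1is2a3way4to5go6.")

# Replace each run of non-letters (other than apostrophes and spaces) by a space
spaced <- gsub("[^A-Za-z' ]+", " ", x, perl = TRUE)

# Collapse the resulting runs of spaces and trim the ends; NA stays NA
words_only <- trimws(gsub(" +", " ", spaced))
```

Replacing with a space rather than an empty string is what keeps "seventy-two" from collapsing into "seventytwo".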
rm_number - Remove/replace/extract numbers from a string (works on numbers with commas, decimals and negatives).
as_numeric - A wrapper for as.numeric(gsub(",", "", x)), which removes commas and converts a list of vectors of strings to numeric. If a string cannot be converted to numeric, NA is returned.
as_numeric2 - A convenience function for as_numeric that unlists and returns a vector rather than a list.
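The conversion described for as_numeric and as_numeric2 can be sketched in base R as follows. The function name as_numeric_sketch and the input list are hypothetical, and suppressing the coercion warning is an assumption of this sketch rather than documented package behavior.

```r
# A list of character vectors, shaped like the output of an extraction call
extracted <- list(c("-2", "-4.3", "3.33"), c("123,456", "0"), "abc")

# Hypothetical stand-in: strip commas, then convert each vector to numeric
as_numeric_sketch <- function(x) {
  lapply(x, function(v) suppressWarnings(as.numeric(gsub(",", "", v))))
}

nums <- as_numeric_sketch(extracted)  # list of numeric vectors
flat <- unlist(nums)                  # one numeric vector, as as_numeric2 describes
flat
```

The string "abc" cannot be converted, so its slot holds NA.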
rm_number( text.var, trim = !extract, clean = TRUE, pattern = "@rm_number", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) as_numeric(x) as_numeric2(x) ex_number( text.var, trim = !extract, clean = TRUE, pattern = "@rm_number", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
x |
a character vector to convert to a numeric vector. |
rm_number - Returns a character string with numbers removed.
as_numeric - Returns a list of vectors of numbers.
as_numeric2 - Returns an unlisted vector of numbers.
The number regular expression was created by Jason Gray.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "-2 is an integer. -4.3 and 3.33 are not.",
    "123,456 is 0 alot -123456 more than -.2",
    "and 3456789123 fg for 345.",
    "fg 12,345 23 .44 or 18.",
    "don't remove this 444,44",
    "hello world -.q"
)
rm_number(x)
ex_number(x)

## Convert to numeric
as_numeric(ex_number(x))   # retain list
as_numeric2(ex_number(x))  # unlist
Remove/replace/extract percentages from a string.
rm_percent( text.var, trim = !extract, clean = TRUE, pattern = "@rm_percent", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_percent( text.var, trim = !extract, clean = TRUE, pattern = "@rm_percent", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with percentages removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "There is $5.50 for me.",
    "that's 45.6% of the pizza",
    "14% is $26 or $25.99"
)
rm_percent(x)
ex_percent(x)
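For readers without the package at hand, a rough base R approximation of percentage extraction is sketched below. The pattern is a hypothetical simplification (digits, an optional decimal point, then a percent sign) and is not the regular expression stored under "@rm_percent".

```r
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza", "14% is $26 or $25.99")

# Hypothetical pattern: optional leading digits, optional decimal point, digits, "%"
pct_pattern <- "[0-9]*\\.?[0-9]+%"

# Extract: one character vector of matches per input element
pct_found <- regmatches(x, gregexpr(pct_pattern, x))
pct_found

# Remove: drop the matches, then tidy the leftover white space
pct_removed <- trimws(gsub(" +", " ", gsub(pct_pattern, "", x)))
pct_removed
```

Dollar amounts are left untouched because nothing matches without a trailing percent sign.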
Remove/replace/extract phone numbers from a string.
rm_phone( text.var, trim = !extract, clean = TRUE, pattern = "@rm_phone", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_phone( text.var, trim = !extract, clean = TRUE, pattern = "@rm_phone", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with phone numbers removed.
stackoverflow's Marius and Tyler Rinker <[email protected]>.
The phone regular expression was taken from: https://stackoverflow.com/a/21008254/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    " Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
    "43 Butter Rd, Brossard QC K0A 3P0 - 613 213 4567",
    "Please contact Mr. Bean (613)2134567",
    "1.575.555.5555 is his #1 number",
    "7164347566",
    "I like 1234567 dogs"
)
rm_phone(x)
ex_phone(x)
Remove/replace/extract postal codes.
rm_postal_code( text.var, trim = !extract, clean = TRUE, pattern = "@rm_postal_code", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_postal_code( text.var, trim = !extract, clean = TRUE, pattern = "@rm_postal_code", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with postal codes removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "Anchorage, AK",
    "New York City, NY",
    "Some Place, Another Place, LA"
)
rm_postal_code(x)
ex_postal_code(x)
Remove/replace/extract words with repeating characters. The word must contain characters, each repeating at least 2 times.
rm_repeated_characters( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_characters", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_repeated_characters( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_characters", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with words containing repeating characters removed.
stackoverflow's vks and Tyler Rinker <[email protected]>.
https://stackoverflow.com/a/29438461/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- "aaaahahahahaha that was a good joke peep and pepper and pepe"
rm_repeated_characters(x)
ex_repeated_characters(x)
Remove/replace/extract repeating phrases from a string.
rm_repeated_phrases( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_phrases", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_repeated_phrases( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_phrases", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with repeating phrases removed.
stackoverflow's BrodieG and Tyler Rinker <[email protected]>.
https://stackoverflow.com/a/28786617/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see see how",
    "I like it. It is cool",
    "this is a big is a Big deal for those of, those of you who are."
)
rm_repeated_phrases(x)
ex_repeated_phrases(x)
Remove/replace/extract repeating words from a string.
rm_repeated_words( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_words", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_repeated_words( text.var, trim = !extract, clean = TRUE, pattern = "@rm_repeated_words", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with repeating words removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see see how",
    "I like it. It is cool",
    "this is a big is a Big deal for those of, those of you who are."
)
rm_repeated_words(x)
ex_repeated_words(x)
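The general idea behind matching repeated words is a backreference to an earlier capture group. The base R sketch below uses a hypothetical pattern that only handles white-space-separated repeats; it is not the regular expression stored under "@rm_repeated_words", and the package may treat punctuation and case differently.

```r
x <- c("I want want to see", "I want...want to see see see how")

# Hypothetical pattern: a word, then one or more white-space-separated copies of it
rep_pattern <- "\\b(\\w+)(?:\\s+\\1\\b)+"

# Extract the repeated runs
rep_found <- regmatches(x, gregexpr(rep_pattern, x, perl = TRUE))
rep_found

# Collapse each run to a single copy of the word
rep_collapsed <- gsub(rep_pattern, "\\1", x, perl = TRUE)
rep_collapsed
```

Note that "want...want" is not matched here because the two copies are separated by periods rather than white space.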
Remove/replace/extract person tags from a string.
rm_tag( text.var, trim = !extract, clean = TRUE, pattern = "@rm_tag", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_tag( text.var, trim = !extract, clean = TRUE, pattern = "@rm_tag", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
The default regex pattern "(?<![@\w])@([a-z0-9_]+)\b" is more liberal and searches for the at (@) symbol followed by any word. This can be accessed via pattern = "@rm_tag". Twitter user names are more constrained. A second regex ("(?<![@\w])@([a-z0-9_]{1,15})\b") is provided that constrains the match to a substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters. This can be accessed via pattern = "@rm_tag2" (see Examples).
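The difference between the two patterns can be seen by applying the quoted regular expressions directly in base R. This is a sketch under assumptions: the patterns are copied from the text above, the address me@example.com is a made-up placeholder, and the package may apply extra options (such as case handling) that are not reproduced here.

```r
x <- c(
  "@hadley I like #rstats",
  "me@example.com is my email",
  "A non valid Twitter is @abcdefghijklmnopqrstuvwxyz"
)

liberal <- "(?<![@\\w])@([a-z0-9_]+)\\b"      # pattern quoted for "@rm_tag"
twitter <- "(?<![@\\w])@([a-z0-9_]{1,15})\\b" # pattern quoted for "@rm_tag2"

# perl = TRUE is required for the lookbehind (?<!...)
tags_liberal <- regmatches(x, gregexpr(liberal, x, perl = TRUE))
tags_twitter <- regmatches(x, gregexpr(twitter, x, perl = TRUE))

tags_liberal
tags_twitter
```

The lookbehind keeps the email address from matching, since its @ is preceded by a word character. The 26-character name matches only the liberal pattern, because the 15-character limit combined with the trailing word boundary leaves no valid match.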
Returns a character string with person tags removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1",
    "[email protected] is my email",
    "A non valid Twitter is @abcdefghijklmnopqrstuvwxyz"
)
rm_tag(x)
rm_tag(rm_hash(x))
ex_tag(x)

## more restrictive Twitter regex
ex_tag(x, pattern="@rm_tag2")

## Remove only the @ sign
rm_tag(x, replacement = "\\3")
rm_tag(x, replacement = "\\3", pattern="@rm_tag2")
rm_time - Remove/replace/extract time from a string.
rm_transcript_time - Remove/replace/extract transcript specific time stamps from a string.
as_time - Convert a time stamp removed by rm_time or rm_transcript_time to a standard time format (HH:MM:SS.OS) and optionally convert to as.POSIXlt.
as_time2 - A convenience function for as_time that unlists and returns a vector rather than a list.
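To make the idea of a standard time format concrete, the base R sketch below pads a colon-separated stamp to HH:MM:SS. The helper to_hms is hypothetical and is not how as_time is implemented. It assumes two-field stamps mean MM:SS (as in the transcript example '08:15 8 minutes and 15 seconds') and handles whole seconds only.

```r
# Hypothetical helper: left-pad missing fields with zero and print as HH:MM:SS
to_hms <- function(stamp) {
  parts <- as.integer(strsplit(stamp, ":", fixed = TRUE)[[1]])
  stopifnot(length(parts) <= 3)
  parts <- c(rep(0L, 3 - length(parts)), parts)  # e.g. c(8, 15) -> c(0, 8, 15)
  sprintf("%02d:%02d:%02d", parts[1], parts[2], parts[3])
}

stamps <- c("08:15", "3:15", "01:22:30")
standardized <- vapply(stamps, to_hms, character(1), USE.NAMES = FALSE)
standardized
```

Fractional seconds and the other separators that appear in the transcript examples are out of scope for this sketch.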
rm_time( text.var, trim = !extract, clean = TRUE, pattern = "@rm_time", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) rm_transcript_time( text.var, trim = !extract, clean = TRUE, pattern = "@rm_transcript_time", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) as_time(x, as.POSIXlt = FALSE, millisecond = TRUE) as_time2(x, ...) ex_time( text.var, trim = !extract, clean = TRUE, pattern = "@rm_time", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... ) ex_transcript_time( text.var, trim = !extract, clean = TRUE, pattern = "@rm_transcript_time", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
x |
A list with extracted time stamps. |
as.POSIXlt |
logical. If |
millisecond |
logical. If |
The default regular expression used by rm_time finds time with no AM/PM. This behavior can be altered by using a secondary regular expression from the regex_usa data (or other dictionary) via pattern = "@rm_time2". See Examples for example usage.
Returns a character string with time removed.
... in as_time2 are the other arguments passed to as_time.
stackoverflow's hwnd and Tyler Rinker <[email protected]>.
The time regular expression was taken from: https://stackoverflow.com/a/25111133/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_title_name(), rm_url(), rm_white(), rm_zip()
x <- c(
    "R uses 1:5 for 1, 2, 3, 4, 5.",
    "At 3:00 we'll meet up and leave by 4:30:20",
    "We'll meet at 6:33.",
    "He ran it in :22.34"
)
rm_time(x)
ex_time(x)

## With AM/PM
x <- c(
    "I'm getting 3:04 AM just fine, but...",
    "for 10:47 AM I'm getting 0:47 AM instead.",
    "no time here",
    "Some time has 12:04 with no AM/PM after it",
    "Some time has 12:04 a.m. or the form 1:22 pm"
)
ex_time(x)
ex_time(x, pat="@rm_time2")
rm_time(x, pat="@rm_time2")
ex_time(x, pat=pastex("@rm_time2", "@rm_time"))

# Convert to standard format
as_time(ex_time(x))
as_time(ex_time(x), as.POSIXlt = TRUE)
as_time(ex_time(x), as.POSIXlt = FALSE, millisecond = FALSE)

# Transcript specific time stamps
x2 <-c(
    '08:15 8 minutes and 15 seconds 00:08:15.0',
    '3:15 3 minutes and 15 seconds not 1:03:15.0',
    '01:22:30 1 hour 22 minutes and 30 seconds 01:22:30.0',
    '#00:09:33-5# 9 minutes and 33.5 seconds 00:09:33.5',
    '00:09.33,75 9 minutes and 33.5 seconds 00:09:33.75'
)
rm_transcript_time(x2)
(out <- ex_transcript_time(x2))
as_time(out)
as_time(out, TRUE)
as_time(out, millisecond = FALSE)

## Not run:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(chron)
lapply(as_time(out), chron::times)
lapply(as_time(out, , FALSE), chron::times)
## End(Not run)
Remove/replace/extract title (honorific) + person name(s) from a string.
rm_title_name( text.var, trim = !extract, clean = TRUE, pattern = "@rm_title_name", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_title_name( text.var, trim = !extract, clean = TRUE, pattern = "@rm_title_name", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with titles + person names removed.
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_url(), rm_white(), rm_zip()
x <- c(
    "Dr. Brend is mizz hart's in mrs. Holtz's.",
    "Where is mr. Bob Jr. and Ms. John Kennedy?"
)
rm_title_name(x)
ex_title_name(x)
rm_url
- Remove/replace/extract URLs from a string.
rm_twitter_url
- Remove/replace/extract Twitter Short URLs from a
string.
rm_url( text.var, trim = !extract, clean = TRUE, pattern = "@rm_url", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) rm_twitter_url( text.var, trim = !extract, clean = TRUE, pattern = "@rm_twitter_url", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ... ) ex_url( text.var, trim = !extract, clean = TRUE, pattern = "@rm_url", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... ) ex_twitter_url( text.var, trim = !extract, clean = TRUE, pattern = "@rm_twitter_url", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ... )
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
The default regex pattern "(http[^ ]*)|(www\.[^ ]*)" is more liberal. More constrained versions can be accessed via pattern = "@rm_url2" and pattern = "@rm_url3" (see Examples).
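What "more liberal" means in practice can be checked by applying the quoted default pattern directly in base R. This is only a sketch: it uses the pattern as printed above, not the package functions, and the white space cleanup is an assumed stand-in for clean and trim.

```r
x <- " I like www.talkstats.com and http://stackoverflow.com"
url_pattern <- "(http[^ ]*)|(www\\.[^ ]*)"  # default pattern quoted above

# Extract everything from "http" or "www." up to the next space
urls <- regmatches(x, gregexpr(url_pattern, x, perl = TRUE))[[1]]
urls

# Remove the URLs, then tidy the leftover white space
no_urls <- trimws(gsub(" +", " ", gsub(url_pattern, "", x, perl = TRUE)))
no_urls

# The pattern stops only at a space, so trailing punctuation is captured too
greedy <- regmatches("See http://example.com.", gregexpr(url_pattern, "See http://example.com.", perl = TRUE))[[1]]
greedy
```

The last call shows the sentence-final period being swept into the match, which is the kind of case where a more constrained pattern may be preferable.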
Returns a character string with URLs removed.
The more constrained url regular expressions ("@rm_url2" and "@rm_url3") were adapted from imme_emosol's response:
https://mathiasbynens.be/demo/url-regex
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_white(), rm_zip()
x <- " I like www.talkstats.com and http://stackoverflow.com"
rm_url(x)
rm_url(x, replacement = '<a href="\\1" target="_blank">\\1</a>')
ex_url(x)
ex_url(x, pattern = "@rm_url2")
ex_url(x, pattern = "@rm_url3")

## Remove Twitter Short URL
x <- c(
    "download file from http://example.com",
    "this is the link to my website http://example.com",
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net",
    "twitter type: t.co/N1kq0F26tG",
    "still another one https://t.co/N1kq0F26tG :-)"
)
rm_twitter_url(x)
ex_twitter_url(x)

## Combine removing Twitter URLs and standard URLs
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
rm_white - Remove multiple white space (more than one becomes a single white space), white space before a comma, white space before a single or consecutive combination of a colon, semicolon, or endmark (period, question mark, or exclamation point), white space after a left bracket ("{", "(", "[") or before a right bracket ("}", ")", "]"), and leading or trailing white space.
rm_white_bracket - Remove white space after a left bracket ("{", "(", "[") or before a right bracket ("}", ")", "]").
rm_white_colon - Remove white space before a single or consecutive combination of a colon or semicolon.
rm_white_comma - Remove white space before a comma.
rm_white_endmark - Remove white space before endmark(s) (".", "?", "!").
rm_white_lead - Remove leading white space.
rm_white_lead_trail - Remove leading or trailing white space.
rm_white_trail - Remove trailing white space.
rm_white_multiple - Remove multiple white space (more than one becomes a single white space).
rm_white_punctuation - Remove multiple white space before a comma, white space before a single or consecutive combination of a colon, semicolon, or endmark (period, question mark, or exclamation point).
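As a rough illustration of the substitutions described above, the following base R sketch approximates several of the rules with `gsub`. The helper name `squish_white` and its patterns are assumptions made for this example. They are not the regular expressions stored in the qdapRegex dictionary, and the package functions may behave differently on edge cases.

```r
# Hypothetical helper (not part of qdapRegex): approximates some of the
# white space rules described above using base R only.
squish_white <- function(x) {
  x <- gsub("\\s+", " ", x, perl = TRUE)               # runs of white space -> one space
  x <- gsub("\\s+([,:;.?!])", "\\1", x, perl = TRUE)   # no space before , : ; . ? !
  x <- gsub("([({\\[])\\s+", "\\1", x, perl = TRUE)    # no space after a left bracket
  x <- gsub("\\s+([\\])}])", "\\1", x, perl = TRUE)    # no space before a right bracket
  gsub("^\\s+|\\s+$", "", x, perl = TRUE)              # no leading/trailing white space
}

squish_white(" There is (  $5.50 ) for , me . ")
```

With the input shown, the sketch returns "There is ($5.50) for, me.". Applying the rules one at a time, as the separate rm_white_* functions do, lets a caller pick only the clean-up steps that are wanted.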
rm_white(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_bracket(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_bracket", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_bracket(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_bracket", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_colon(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_colon", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_colon(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_colon", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_comma(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_comma", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_comma(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_comma", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_endmark(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_endmark", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_endmark(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_endmark", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_lead(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_lead", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_lead(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_lead", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_lead_trail(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_lead_trail", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_lead_trail(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_lead_trail", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_trail(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_trail", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_trail(text.var, trim = FALSE, clean = FALSE, pattern = "@rm_white_trail", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_multiple(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_multiple", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_multiple(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_multiple", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
rm_white_punctuation(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_punctuation", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_white_punctuation(text.var, trim = !extract, clean = TRUE, pattern = "@rm_white_punctuation", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
text.var |
The text variable. |
trim |
logical. If |
clean |
logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with extra white space removed.
rm_white_endmark/rm_white_punctuation - stackoverflow's hwnd and Tyler Rinker <[email protected]>.
The rm_white_endmark/rm_white_punctuation regular expression was taken from: https://stackoverflow.com/a/25464921/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_zip()
x <- c(" There is ( $5.50 ) for , me . ",
    " that's [ 45.6% ] of! the pizza !",
    " 14% is { $26 } or $25.99 ?",
    "Oh ; here's colon : Yippee !")

rm_white(x)
rm_white_bracket(x)
rm_white_colon(x)
rm_white_comma(x)
rm_white_endmark(x)
rm_white_lead(x)
rm_white_trail(x)
rm_white_lead_trail(x)
rm_white_multiple(x)
rm_white_punctuation(x)
Remove/replace/extract zip codes from a string.
rm_zip(text.var, trim = !extract, clean = TRUE, pattern = "@rm_zip", replacement = "", extract = FALSE, dictionary = getOption("regex.library"), ...)
ex_zip(text.var, trim = !extract, clean = TRUE, pattern = "@rm_zip", replacement = "", extract = TRUE, dictionary = getOption("regex.library"), ...)
text.var |
The text variable. |
trim |
logical. If |
clean |
logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Returns a character string with U.S. 5 and 5+4 zip codes removed.
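To make the "5 and 5+4" wording concrete, here is a minimal base R sketch that extracts such codes with lookarounds. The helper name `extract_zip` and its pattern are assumptions for illustration. The pattern is deliberately simpler than whatever `@rm_zip` resolves to in the package dictionary and only handles the dash-separated 5+4 form.

```r
# Hypothetical helper (not part of qdapRegex). The pattern is an assumption:
# five digits, optionally followed by "-" and four digits, and not embedded
# in a longer run of digits.
extract_zip <- function(x) {
  pat <- "(?<!\\d)\\d{5}(?:-\\d{4})?(?!\\d)"
  regmatches(x, gregexpr(pat, x, perl = TRUE))
}

extract_zip(c("Rat Race, XX, 12345",
              "Grab zips with dashes 12345-6789",
              "I like 1234567 dogs"))
```

The result is a list with one character vector per input element. The third element is empty because the seven-digit run is rejected by the lookarounds.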
stackoverflow's hwnd and Tyler Rinker <[email protected]>.
The zip code regular expression was taken from: https://stackoverflow.com/a/25223890/1000343
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white()
x <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
    "43 Butter Rd, Brossard QC K0A 3P0 - 613 213 4567",
    "Rat Race, XX, 12345",
    "Ignore phone numbers(613)2134567",
    "Grab zips with dashes 12345-6789 or no space before12345-6789",
    "Grab zips with spaces 12345 6789 or no space before12345 6789",
    "I like 1234567 dogs"
)

rm_zip(x)
ex_zip(x)

## ======================= ##
## BUILD YOUR OWN FUNCTION ##
## ======================= ##

## example from: https://stackoverflow.com/a/26092576/1000343
zips <- data.frame(id = seq(1, 6),
    address = c("Company, 18540 Main Ave., City, ST 12345",
        "Company 18540 Main Ave. City ST 12345-0000",
        "Company 18540 Main Ave. City State 12345",
        "Company, 18540 Main Ave., City, ST 12345 USA",
        "Company, One Main Ave Suite 18540m, City, ST 12345",
        "company 12345678")
)

## Function to grab even if a character follows the zip
# paste together a more flexible regular expression
pat <- pastex(
    "@rm_zip",
    "(?<!\\d)\\d{5}(?!\\d)",
    "(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)

# Create your own function with extract set to TRUE
ex_zip2 <- rm_(pattern=pat, extract=TRUE)
ex_zip2(zips$address)

## Function to extract just 5 digit zips
ex_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
ex_zip3(zips$address)
Convenience wrapper for sprintf that allows recycling of ... of length one.
S(x, ...)
x |
A single string containing |
... |
A vector of substitutions equal in length to the number of
|
Returns a string with "%s" replaced.
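The recycling behavior can be sketched in base R as follows. The helper name `fill_template` is hypothetical and is not the package's `S()`. In particular, it takes a literal template and does not look up "@" names in a regular expression dictionary.

```r
# Hypothetical helper (not the package's S()): fill every "%s" in a template,
# recycling a single substitution when only one is supplied.
fill_template <- function(template, ...) {
  subs <- list(...)
  # count the "%s" placeholders in the template
  hits <- gregexpr("%s", template, fixed = TRUE)[[1]]
  n <- if (hits[1] == -1) 0L else length(hits)
  # recycle a lone substitution across all placeholders
  if (length(subs) == 1 && n > 1) subs <- rep(subs, n)
  do.call(sprintf, c(list(template), subs))
}

fill_template("(?<=%s).*?(?=%s)", "LEFT", "RIGHT")
fill_template("(?<=%s).*?(?=%s)", "the")   # single value recycled
```

The first call yields "(?<=LEFT).*?(?=RIGHT)" and the second yields "(?<=the).*?(?=the)". Plain `sprintf` would raise an error in the second case because too few arguments are supplied.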
S("@after_", "the", "the")
# Recycle
S("@after_", "the")
S("@rm_between", "LEFT", "RIGHT")
TC - Capitalize titles according to traditional capitalization rules.
TC(text.var, lower = NULL, ...)
L(text.var, ...)
U(text.var, ...)
text.var |
The text variable. |
lower |
A vector of words to retain lower case for (unless first or last word). |
... |
Other arguments passed to: |
Case wrapper functions for stringi's stri_trans_tolower, stri_trans_toupper, and stri_trans_totitle. Functions are useful within magrittr style chaining.
Returns a character vector with new case (lower, upper, or title).
TC utilizes additional rules for capitalization beyond stri_trans_totitle that include:
Capitalize the first & last word
Lowercase articles, coordinating conjunctions, & prepositions
Lowercase "to" in an infinitive
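The first two rules above can be approximated in base R as shown below. The helper name `title_case` and its default list of small words are assumptions for this sketch. The sketch does not use stringi, handles one string at a time, and makes no attempt to detect infinitives beyond listing "to" as a small word.

```r
# Hypothetical helper (not the package's TC()): title-case a single string,
# keeping a short, assumed list of small words in lower case unless they
# are the first or last word.
title_case <- function(x, lower = c("a", "an", "the", "and", "but", "or",
                                    "of", "in", "on", "to", "for", "with")) {
  words <- strsplit(tolower(x), " ", fixed = TRUE)[[1]]
  cap <- function(w) paste0(toupper(substring(w, 1, 1)), substring(w, 2))
  out <- vapply(words, cap, character(1), USE.NAMES = FALSE)
  n <- length(out)
  if (n > 2) {
    inner <- 2:(n - 1)                    # positions that may stay lower case
    small <- words[inner] %in% lower
    out[inner][small] <- words[inner][small]
  }
  paste(out, collapse = " ")
}

title_case("the return of the king")
```

For the input shown, the sketch returns "The Return of the King": the leading "the" is capitalized because it is the first word, while the inner "of" and "the" stay lower case.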
stri_trans_tolower, stri_trans_toupper, stri_trans_totitle
y <- c(
    "I'm liking it but not too much.",
    "How much are you into it?",
    "I'd say it's yet awesome yet."
)
L(y)
U(y)
TC(y)
Generate a function to validate regular expressions.
validate( pattern, single = TRUE, trim = FALSE, clean = FALSE, dictionary = getOption("regex.library") )
pattern |
A character string containing a regular expression (or
character string for |
single |
logical. If |
trim |
logical. If |
clean |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
Returns a function that operates like other qdapRegex rm_XXX functions but with user-defined defaults.
validate uses qdapRegex's built-in regular expressions. Because these patterns are used for text analysis, they tend to be flexible and thus liberal. The user may wish to define more conservative validation regular expressions and supply them to pattern.
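The idea of generating a stricter, user-defined validator can be sketched with a closure in base R. The names `make_validator` and `whole` are hypothetical. The `whole` argument is an assumption of this sketch and is not claimed to mirror how `single` works in `validate`.

```r
# Hypothetical helper (not the package's validate()): returns a function that
# checks strings against a pattern. With whole = TRUE (an assumption of this
# sketch) the pattern must match the entire string.
make_validator <- function(pattern, whole = TRUE) {
  force(pattern)
  if (whole) pattern <- paste0("^(?:", pattern, ")$")
  function(x) grepl(pattern, x, perl = TRUE)
}

# A deliberately conservative 5 or 5+4 digit zip pattern (assumed).
valid_zip <- make_validator("\\d{5}(?:-\\d{4})?")
valid_zip(c("14217", "14217-0001", "zip 14217"))
```

Here the last element fails because the anchored pattern rejects any surrounding text, which is the kind of conservative check the note above suggests supplying.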
## Single element email
valid_email <- validate("@rm_email")
valid_email(c("[email protected]", "@trinker"))

## Multiple elements
valid_email_1 <- validate("@rm_email", single=FALSE)
valid_email_1(c("[email protected]", "@trinker"))

## single element address
valid_address <- validate("@rm_city_state_zip")
valid_address("Buffalo, NY 14217")
valid_address("buffalo,NY14217")
valid_address("buffalo NY 14217")

valid_address2 <- validate(paste0("(\\b([A-Z][\\w-]*)+),",
    "\\s([A-Z]{2})\\s(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"))
valid_address2("Buffalo, NY 14217")
valid_address2("buffalo, NY 14217")
valid_address2("buffalo,NY14217")
valid_address2("buffalo NY 14217")