Extract substring in R using grepl
I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
I would like to extract the last word in each filename. I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work*.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv
desired outcome:
start
complete
4 Answers
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note: I have to use gsub, because I first remove ^.*Work and then \.csv$.
For [\s\S] or \d\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akrun's approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
Note that . also matches \n when using the R engine (perl=F).
pretty much all the solutions work, but i think this is more concise. thanks
– ajax2000
Sep 5 at 13:04
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when the pattern is not found, sub returns the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or by marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
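As a rough illustration of the difference (a sketch, not part of the original answer): grepl only reports whether each element matches, while sub returns the rewritten string. Also note that in a regex, * is a quantifier on the preceding character rather than a wildcard, which is why the pattern in the question does not behave like a file glob. The data frame data and its column filename come from the question; the sample rows and the new column name step are assumptions made for this example.

```r
# Hypothetical sample data; only the column name `filename` comes from the question.
data <- data.frame(
  filename = c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv"),
  stringsAsFactors = FALSE
)

pattern <- ".*Work(.*)\\.csv$"

# grepl() answers "does it match?" with TRUE/FALSE, one value per element.
has_work <- grepl(pattern, data$filename)

# sub() replaces the whole match with the first capture group.
# Rows that do not match are set to NA instead of being left unchanged.
data$step <- ifelse(has_work, sub(pattern, "\\1", data$filename), NA_character_)

data$step
```

Using grepl for the test and sub for the extraction keeps the new column the same length as the table, which is usually what you want when adding a column.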
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches:
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
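One thing to keep in mind with this approach is that regmatches drops elements without a match, so the result can be shorter than the input. A minimal sketch of keeping the result aligned with the input follows; the variable names m and out are only illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns -1 for elements where the pattern is not found.
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with all NA, then fill in only the positions that matched.
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
out
```

This matters mostly when the result is assigned back to a column, where a shorter vector would either fail or be recycled.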
To be more precise, you can use "(?<=Work).*(?=.csv)".
– r2evans
Aug 28 at 15:18
@avid_useR But, I am using regmatches/regexpr
– akrun
Aug 28 at 15:24
@avid_useR Okay, that is right
– akrun
Aug 28 at 15:25
@AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use . ?
– r2evans
Aug 28 at 15:32
@r2evans I use either [.] or \. , though I find the former easier to type.
– akrun
Aug 28 at 15:35
please have a look at my edit @ajax2000. It's always a good practice to add the desired outcome to your question. This makes everything so much easier and ppl know exactly what you want. I encourage you to do this in your next question ;-).
– Andre Elrico
Aug 28 at 15:20