Extract substring in R using grepl

I have a table with a string column formatted like this

abcdWorkstart.csv abcdWorkcomplete.csv

And I would like to extract the last word in that filename. I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.

grepl("Work*.csv", data$filename)

Basically, I want to extract whatever is between Work and .csv.

desired outcome:

start complete

please have a look at my edit @ajax2000. It's always a good practice to add the desired outcome to your question. This makes everything so much easier and ppl know exactly what you want. I encourage you to do this in your next question ;-).
– Andre Elrico
Aug 28 at 15:20

4 Answers
Just as an alternative way, remove everything you don't want.

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"

Please note:
I have to use gsub rather than sub, because two separate matches are removed: first ^.*Work, then \.csv$.
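
As a small sketch of that point (the variable names are just for illustration), sub replaces only the first match in each string, so the trailing .csv survives, while gsub replaces every match:

```r
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
pattern <- "^.*Work|\\.csv$"

# sub() stops after the first match, so only the "...Work" prefix is removed
first_only <- sub(pattern, "", x)
first_only
# [1] "start.csv"    "complete.csv"

# gsub() keeps going and also removes the ".csv" suffix
all_matches <- gsub(pattern, "", x)
all_matches
# [1] "start"    "complete"
```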

For [\s\S] or \d\D ... (does not work with [g]?sub)

https://regex101.com/r/wFgkgG/1

Works with akruns approach:

regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))

str1 <- '12\n.2\n12'
gsub("[^.]", "m", str1, perl = TRUE)
gsub(".", "m", str1, perl = TRUE)
gsub(".", "m", str1, perl = FALSE)

. also matches \n when using the default R engine (perl = FALSE).

pretty much all the solutions work, but i think this is more concise. thanks
– ajax2000
Sep 5 at 13:04

I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"

You can work around this by filtering out the unchanged ones:

out[ out != fn ]
# [1] "start" "complete"

Or marking them invalid with NA (or something else):

out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
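
To make the grepl vs. sub distinction concrete, here is a minimal sketch (the names hit and res are only illustrative). grepl returns just TRUE/FALSE per element, so one possible variation is to use it as a filter and run sub only on the strings that actually match:

```r
fn <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")
pattern <- ".*Work(.*)\\.csv$"

# grepl() only reports whether the pattern matches; it extracts nothing
hit <- grepl(pattern, fn)
hit
# [1]  TRUE  TRUE FALSE

# start with NA everywhere, then fill in the captured group for the matches
res <- rep(NA_character_, length(fn))
res[hit] <- sub(pattern, "\\1", fn[hit])
res
# [1] "start"    "complete" NA
```

This avoids comparing the output with the input afterwards, at the cost of running the pattern twice.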

With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"

Here is an option using regmatches/regexpr from base R. A regex lookaround matches all characters that are not a . after the string 'Work' and before '.csv'; regmatches then extracts them.

v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
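
Note that regmatches silently drops elements without a match, so the result above has length 2 although v1 has length 3. If the output needs to stay aligned with the input (for example as a new data frame column), one possible sketch is to use the -1 that regexpr returns for non-matches; m and res are illustrative names:

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() gives the match position per element, or -1 when there is no match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# pre-fill with NA, then assign the extracted text only where a match exists
res <- rep(NA_character_, length(v1))
res[m != -1] <- regmatches(v1, m)
res
# [1] "start"    "complete" NA
```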

To be more precise, you can use "(?<=Work).*(?=.csv)".
– r2evans
Aug 28 at 15:18

@avid_useR But, I am using regmatches/regexpr
– akrun
Aug 28 at 15:24

@avid_useR Okay, that is right
– akrun
Aug 28 at 15:25

@AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use .?
– r2evans
Aug 28 at 15:32

@r2evans I use both [.] or \., though I feel easier to type the former.
– akrun
Aug 28 at 15:35
