Extract substring in R using grepl



I have a table with a string column formatted like this


abcdWorkstart.csv
abcdWorkcomplete.csv



I would like to extract the last word in each filename. I think the pattern should start at the word "Work" and end at ".csv". I wrote something using grepl, but it is not working.


grepl("Work*.csv", data$filename)



Basically, I want to extract whatever is between Work and .csv.



desired outcome:


start
complete





please have a look at my edit @ajax2000. It's always a good practice to add the desired outcome to your question. This makes everything so much easier and ppl know exactly what you want. I encourage you to do this in your next question ;-).
– Andre Elrico
Aug 28 at 15:20




4 Answers



Just as an alternative way, remove everything you don't want.


x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"



Please note:
I have to use gsub rather than sub, because the pattern has to match twice: first ^.*Work is removed, then \.csv$.
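
To illustrate the point about gsub, here is a minimal sketch comparing sub and gsub on the same alternation pattern. The variable names (pattern, first_only, both) are only illustrative.

```r
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
pattern <- "^.*Work|\\.csv$"

# sub() replaces only the first match, so the ".csv" suffix survives
first_only <- sub(pattern, "", x)
first_only
# [1] "start.csv"    "complete.csv"

# gsub() replaces every match: the prefix and then the suffix
both <- gsub(pattern, "", x)
both
# [1] "start"    "complete"
```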




A note on [\s\S] and [\d\D], which match any character including newlines: they do not work with sub/gsub under the default regex engine.


https://regex101.com/r/wFgkgG/1



They do work with akrun's approach:



regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))


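# A string that contains two line breaks
str1 <-
'12
.2
12'

gsub("[^.]", "m", str1, perl = TRUE)  # a negated class also matches \n
gsub(".", "m", str1, perl = TRUE)     # PCRE: . does not match \n
gsub(".", "m", str1, perl = FALSE)    # default engine: . also matches \n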



. also matches \n when using R's default regex engine (perl = FALSE), but not with perl = TRUE.
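
As a small sketch of why this matters for the lookaround pattern, consider a made-up string (s is hypothetical, not from the question) with a line break between "Work" and ".csv". With perl = TRUE, a plain . cannot cross the newline, while [\s\S] or the (?s) flag can.

```r
# Hypothetical input containing a line break between "Work" and ".csv"
s <- "Workfoo\nbar.csv"

# PCRE: . stops at the newline, so nothing is matched
dot_match <- regmatches(s, regexpr("(?<=Work).+(?=[.]csv)", s, perl = TRUE))

# [\s\S] matches any character, newline included
any_match <- regmatches(s, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", s, perl = TRUE))

# (?s) puts PCRE in "dotall" mode, so . matches newlines as well
dotall_match <- regmatches(s, regexpr("(?s)(?<=Work).+(?=[.]csv)", s, perl = TRUE))

length(dot_match)  # 0, no match
any_match          # "foo\nbar"
dotall_match       # "foo\nbar"
```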





Pretty much all the solutions work, but I think this one is the most concise. Thanks.
– ajax2000
Sep 5 at 13:04



I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:



fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"



You can work around this by filtering out the unchanged ones:


out[ out != fn ]
# [1] "start" "complete"



Or marking them invalid with NA (or something else):



out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
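
Putting the pieces together for a table, here is a rough sketch under some assumptions: the data frame data, its contents, and the new column name stage are all hypothetical stand-ins for the asker's table. It also shows why the original "Work*.csv" attempt finds nothing: that is glob syntax, not a regular expression.

```r
# Hypothetical data frame standing in for the asker's table
data <- data.frame(
  filename = c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv"),
  stringsAsFactors = FALSE
)

# As a regex, k* means "zero or more k" and . means "any one character",
# so "Work*.csv" does not match any of these filenames
glob_like <- grepl("Work*.csv", data$filename)

# grepl() detects which rows match; sub() extracts from those rows only
pattern <- ".*Work(.*)\\.csv$"
has_match <- grepl(pattern, data$filename)

data$stage <- NA_character_
data$stage[has_match] <- sub(pattern, "\\1", data$filename[has_match])
data
```

Here grepl is used only as a filter, which is what it is designed for, while sub does the actual extraction.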



With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":



x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"



Here is an option using regmatches/regexpr from base R. A regex lookaround matches all characters that are not a . after the string 'Work' and before '.csv', and regmatches then extracts the matched text.



v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
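
Note that regmatches drops elements without a match, so the result above is shorter than v1. If the output needs to stay aligned with the input (for example, as a new column), one possible sketch is to fill a vector of NAs using the match positions from regexpr. The names m and res are only illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns -1 for elements where the pattern does not match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with NA everywhere, then fill in only the matched positions
res <- rep(NA_character_, length(v1))
res[m != -1] <- regmatches(v1, m)
res
# [1] "start"    "complete" NA
```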





To be more precise, you can use "(?<=Work).*(?=.csv)".
– r2evans
Aug 28 at 15:18




@avid_useR But, I am using regmatches/regexpr
– akrun
Aug 28 at 15:24






@avid_useR Okay, that is right
– akrun
Aug 28 at 15:25





@AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use .?
– r2evans
Aug 28 at 15:32






@r2evans I use either [.] or \., though I find the former easier to type.
– akrun
Aug 28 at 15:35






