regex - Comparing two versions of the same string
I want to write a function that compares two strings in R. More precisely, if I have this data:
data <- list( "first sentence.", "very first sentence.", "very first , 1 sentences." )
I want this output:
[1] "very" " , 1 sentences"
My output is built from the substring that is not included in the previous string. For example:
Compare the 2nd vs the 1st, remove the matching string ("first sentence.") from the 2nd, and the result is "very".
#          "first sentence."
#     "very first sentence."
# match:    ^^^^^^^^^^^^^^^
Now compare the 3rd vs the 2nd, remove the matching string ("very first") from the 3rd, and the result is " , 1 sentences".
#       "very first sentence."
#       "very first , 1 sentences."
# match: ^^^^^^^^^^
Then compare the 4th vs the 3rd, and so on.
So, based on this example, the output should be:
c("very", " , 1 sentences")
# [1] "very" " , 1 sentences"
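One way to read the requirement above is "remove the longest common substring of the previous string from the current one". The following base R code is only a minimal sketch under that assumption; the function names `longest_common_substring` and `diff_from_previous` are made up for illustration, and the result is whitespace-trimmed, so it differs slightly from the spacing shown in the question.

```r
# Longest common substring of two strings (simple dynamic programming).
longest_common_substring <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  # lens[i + 1, j + 1] = length of the common run ending at ca[i] and cb[j]
  lens <- matrix(0L, nrow = length(ca) + 1L, ncol = length(cb) + 1L)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(ca)) {
    for (j in seq_along(cb)) {
      if (ca[i] == cb[j]) {
        lens[i + 1L, j + 1L] <- lens[i, j] + 1L
        if (lens[i + 1L, j + 1L] > best_len) {
          best_len <- lens[i + 1L, j + 1L]
          best_end <- i
        }
      }
    }
  }
  if (best_len == 0L) return("")
  substr(a, best_end - best_len + 1L, best_end)
}

# For each string after the first, drop the part it shares with its predecessor.
diff_from_previous <- function(strings) {
  strings <- unlist(strings)
  out <- character(0)
  for (k in seq_along(strings)[-1]) {
    common <- longest_common_substring(strings[k - 1], strings[k])
    rest <- strings[k]
    if (nzchar(common)) {
      # fixed = TRUE so "." and other regex metacharacters are matched literally
      rest <- sub(common, "", rest, fixed = TRUE)
    }
    out <- c(out, trimws(rest))
  }
  out
}

data <- list("first sentence.", "very first sentence.", "very first , 1 sentences.")
diff_from_previous(data)
```

Note that only the single longest shared run is removed, so text shared in several separate places would need a different strategy (for example, a word-level comparison).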
Here's a tidyverse approach:
library(dplyr)
library(tidyr)

# put the data in a data.frame
data_frame(string = unlist(data)) %>%
    # add an ID column so we can recombine later
    add_rownames('id') %>%
    # add a lagged column to compare against
    mutate(string2 = lag(string)) %>%
    # break the strings into words
    separate_rows(string) %>%
    # evaluate the following calls rowwise (until regrouped)
    rowwise() %>%
    # chop to rows that have a string to compare against,
    filter(!is.na(string2),
           # where the word is not in the comparison string
           !grepl(string, string2, ignore.case = TRUE)) %>%
    # regroup by ID
    group_by(id) %>%
    # reassemble the strings
    summarise(string = paste(string, collapse = ' '))

## # A tibble: 2 x 2
##      id         string
##   <chr>          <chr>
## 1     2           very
## 2     3 , 1 sentences.
Select out string if you'd like a vector, by appending:
... %>% `[[`('string')
## [1] "very" "and 1 sentences."
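For comparison, the same word-level idea can be sketched in base R without any packages. This is not the answer's code, just an illustrative equivalent: the function name `new_words` is hypothetical, the split pattern is an assumption chosen to break on anything that is not alphanumeric or a period, and words are matched literally (case-insensitively) rather than as regular expressions.

```r
# Keep only the words of each string that do not occur in the previous string.
new_words <- function(strings) {
  strings <- unlist(strings)
  vapply(seq_along(strings)[-1], function(k) {
    # split into words on runs of characters other than letters, digits, "."
    words <- strsplit(strings[k], "[^[:alnum:].]+")[[1]]
    words <- words[nzchar(words)]
    previous <- tolower(strings[k - 1])
    # literal, case-insensitive containment test for every word
    seen <- vapply(words, function(w) {
      grepl(tolower(w), previous, fixed = TRUE)
    }, logical(1))
    paste(words[!seen], collapse = " ")
  }, character(1))
}

data <- list("first sentence.", "very first sentence.", "very first , 1 sentences.")
new_words(data)
```

Because the comma acts as a separator under this split pattern, it does not survive into the result; a different pattern would be needed to keep punctuation as its own token.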