To extract part of a string, we can use sub() with "\\n" as the replacement, where n is the number of the capture group (a parenthesised part of the pattern) to keep, usually 1 or 2. The example below extracts the date from the file name "ABC_2014_10_23.csv".
sub("^[A-Za-z]+(_)([0-9]{4}(_)[0-9]{2}(_)[0-9]{2})\\.[A-Za-z]+","\\2","ABC_2014_10_23.csv")
"2014_10_23"
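As a small sketch of how the group number in the replacement selects what is kept, the same file name can be split into year, month and day by giving each part its own group. The variable names (fname, pat, year, month, day, iso_date) are illustrative and not part of the original example.

```r
# Hypothetical variable names; the file name is the one used above.
fname <- "ABC_2014_10_23.csv"

# Three capture groups: (year)_(month)_(day).
# The dot is escaped so that it only matches a literal ".".
pat <- "^[A-Za-z]+_([0-9]{4})_([0-9]{2})_([0-9]{2})\\.[A-Za-z]+$"

year  <- sub(pat, "\\1", fname)  # first group
month <- sub(pat, "\\2", fname)  # second group
day   <- sub(pat, "\\3", fname)  # third group

# Several groups can be recombined in one replacement string.
iso_date <- sub(pat, "\\1-\\2-\\3", fname)

print(c(year, month, day, iso_date))
```

Because the pattern is anchored with ^ and $, the whole file name is replaced by the chosen group, which is what makes sub() behave like an extraction here.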
To replace certain words with another word, we can use gsub() as follows.
gsub("\\s*\\b(meters|m|meter)\\b","m",
"DAS FE TDAS 5meter ADSA RE 12 meters SA 23 Meters DASD 3 m",
ignore.case=TRUE)
"DAS FE TDAS 5meter ADSA RE 12m SA 23m DASD 3m"
To find the elements of a vector that match a pattern, we can use grep(), which returns the positions of the matching elements.
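As a rough sketch of the forms the result can take, grep() can return either positions or the matching values themselves, and grepl() returns a logical vector of the same length as the input. The names idx, vals and flags are illustrative.

```r
A <- c("SDA09DA", "ADCZFD", "081382", "ASDF8673")

# Default: integer positions of the matching elements.
idx <- grep("[A-Z]+[0-9]+", A)

# value = TRUE: the matching elements themselves, same as A[idx].
vals <- grep("[A-Z]+[0-9]+", A, value = TRUE)

# grepl(): one TRUE/FALSE per element, handy for filtering data frames.
flags <- grepl("[A-Z]+[0-9]+", A)

print(idx)
print(vals)
print(flags)
```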
The example below looks for letters followed by digits:
A<-c("SDA09DA","ADCZFD","081382","ASDF8673")
A[grep("[A-Z]+[0-9]+",A)]
"SDA09DA" "ASDF8673"
The example below looks for digits between letters:
A<-c("SDA09DA","ADCZFD","081382","ASDF8673")
A[grep("[A-Z]+[0-9]+[A-Z]",A)]
"SDA09DA"
The example below uses a more complex pattern that looks for a typical street address format.
A<-c("100 ASD ST DASER","CSADAS SS ASD ADA",
"321 DSA XCSACXZ SADAS 213","321/32 ASA ASDDD RD WEASF",
"DSA 231 DDG BGBVCVB","SUITE 1/43 SAFDSA AVE ASDA",
"UNIT 21/2 ADSD AV SDADFF")
A[grep("^[A-Z]*\\s*[0-9]+\\s*(/)*\\s*[0-9]*\\s*[A-Z]+\\s*[A-Z]*\\s*(ST|RD|AVE|AV)\\s*[A-Z]*$",A)]
"100 ASD ST DASER"
"321/32 ASA ASDDD RD WEASF"
"SUITE 1/43 SAFDSA AVE ASDA"
"UNIT 21/2 ADSD AV SDADFF"
For more on regular expression syntax, see either of the following websites:
http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
http://www.rexegg.com/regex-quickstart.html