Friday 17 October 2014

Regular Expression

Regular expressions can be used in R to substitute, extract, or search for patterns in text.


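To extract part of a string, we can use sub() with "\\n" as the replacement, where n is the number of the capture group (a parenthesised sub-pattern, counted by its opening parenthesis from the left) to keep. Because the pattern below matches the whole string, the whole string is replaced by that group. The example extracts the date from the file name "ABC_2014_10_23.csv"; the date is the second group, so the replacement is "\\2".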

sub("^[A-Za-z]+(_)([0-9]{4}(_)[0-9]{2}(_)[0-9]{2})(\\.)[A-Za-z]+","\\2","ABC_2014_10_23.csv")

"2014_10_23"
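As an alternative to sub() with a backreference, the matched text can be pulled out directly with regexpr() and regmatches(), both in base R. The following is a minimal sketch, not the method used above; the second file name and the variable names are made up for illustration. Note that regmatches() silently drops elements that do not match, so the result can be shorter than the input.

```r
# Hypothetical file names; only the first one comes from the example above.
files <- c("ABC_2014_10_23.csv", "XYZ_2013_01_05.txt")

# Pattern for a date written as YYYY_MM_DD.
date_pattern <- "[0-9]{4}_[0-9]{2}_[0-9]{2}"

# regexpr() locates the first match in each string;
# regmatches() returns the matched text itself.
dates <- regmatches(files, regexpr(date_pattern, files))
print(dates)   # "2014_10_23" "2013_01_05"

# The extracted text can then be converted to a Date object.
parsed <- as.Date(dates, format = "%Y_%m_%d")
print(parsed)
```

This avoids having to describe the whole file name in the pattern, at the cost of losing the position of any non-matching elements.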


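To replace certain words with another word, we can use gsub(). In the example below, "\\s*" also removes any spaces before the unit, and "5meter" is left unchanged because there is no word boundary ("\\b") between the digit and the letter.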

gsub("\\s*\\b(meters|m|meter)\\b","m",
"DAS FE TDAS 5meter ADSA RE 12 meters SA 23 Meters DASD 3 m",
ignore.case=TRUE)

"DAS FE TDAS 5meter ADSA RE 12m SA 23m DASD 3m"
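If units written directly after a number (such as "5meter") should be normalised too, one option is to capture the preceding digit and put it back with "\\1" instead of relying on a leading word boundary. This is a sketch under that assumption, using the same input string; the variable names are illustrative only.

```r
x <- "DAS FE TDAS 5meter ADSA RE 12 meters SA 23 Meters DASD 3 m"

# ([0-9]) captures the last digit before the unit, \\s* absorbs optional
# spaces, and the trailing \\b makes sure the unit is a complete word ending.
# "\\1m" writes the captured digit back, followed by the short unit.
result <- gsub("([0-9])\\s*(meters|meter|m)\\b", "\\1m", x, ignore.case = TRUE)
print(result)  # "DAS FE TDAS 5m ADSA RE 12m SA 23m DASD 3m"
```

Requiring a digit in front of the unit also prevents a stand-alone word "m" that is not a measurement from being merged with the text before it.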


To find the elements of a vector that match a pattern, we can use grep(), which returns the positions of the matching elements.

The example below looks for letters followed by digits:

A<-c("SDA09DA","ADCZFD","081382","ASDF8673")
A[grep("[A-Z]+[0-9]+",A)]

"SDA09DA"  "ASDF8673"


The example below looks for digits between letters:

A<-c("SDA09DA","ADCZFD","081382","ASDF8673")
A[grep("[A-Z]+[0-9]+[A-Z]",A)]

"SDA09DA"
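The grep() calls above return positions, which are then used to subset A. Base R also offers value = TRUE to return the matching strings directly, and grepl() to return a logical vector. The short sketch below compares them on the same vector; the anchored pattern at the end is an added illustration of how ^ and $ restrict the match to the whole string.

```r
A <- c("SDA09DA", "ADCZFD", "081382", "ASDF8673")
pattern <- "[A-Z]+[0-9]+"

idx   <- grep(pattern, A)                # positions: 1 4
vals  <- grep(pattern, A, value = TRUE)  # "SDA09DA" "ASDF8673"
flags <- grepl(pattern, A)               # TRUE FALSE FALSE TRUE

# With ^ and $ the entire string must be letters followed by digits,
# so "SDA09DA" (which has trailing letters) no longer qualifies.
anchored <- grep("^[A-Z]+[0-9]+$", A, value = TRUE)  # "ASDF8673"
```

grepl() is often the more convenient form when filtering rows of a data frame, since it always returns one value per input element.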


The example below uses a more complex pattern that looks for a typical street address format.

A<-c("100 ASD ST DASER","CSADAS SS ASD ADA",
"321 DSA XCSACXZ SADAS 213","321/32 ASA ASDDD RD WEASF",
"DSA 231 DDG BGBVCVB","SUITE 1/43 SAFDSA AVE ASDA",
"UNIT 21/2 ADSD AV SDADFF")

A[grep("^[A-Z]*\\s*[0-9]+\\s*(/)*\\s*[0-9]*\\s*[A-Z]+\\s*[A-Z]*\\s*(ST|RD|AVE|AV)\\s*[A-Z]*$",A)]

"100 ASD ST DASER"           
"321/32 ASA ASDDD RD WEASF" 
"SUITE 1/43 SAFDSA AVE ASDA" 
"UNIT 21/2 ADSD AV SDADFF"  
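A long pattern like this is easier to read when it is assembled from named pieces with paste0(). The sketch below rebuilds exactly the same pattern; the piece names (prefix, number, street, type, suffix) are my own reading of what each part is meant to match, not labels from the original. The last line shows one limitation that follows from every "\\s*" allowing zero spaces.

```r
A <- c("100 ASD ST DASER", "CSADAS SS ASD ADA",
       "321 DSA XCSACXZ SADAS 213", "321/32 ASA ASDDD RD WEASF",
       "DSA 231 DDG BGBVCVB", "SUITE 1/43 SAFDSA AVE ASDA",
       "UNIT 21/2 ADSD AV SDADFF")

prefix <- "^[A-Z]*\\s*"                     # optional leading word, e.g. UNIT or SUITE
number <- "[0-9]+\\s*(/)*\\s*[0-9]*\\s*"    # street number, optionally written as a/b
street <- "[A-Z]+\\s*[A-Z]*\\s*"            # one or two words of street name
type   <- "(ST|RD|AVE|AV)"                  # street type
suffix <- "\\s*[A-Z]*$"                     # optional trailing word, then end of string

address_pattern <- paste0(prefix, number, street, type, suffix)
matches <- A[grep(address_pattern, A)]
print(matches)

# Because the spaces are optional, a name that merely ends in "ST" also matches.
loose <- grepl(address_pattern, "12 WEST")  # TRUE
```

If that looseness matters, replacing the "\\s*" in front of the street type with "\\s+" would require at least one space there.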



For more on regular expression syntax, see either of the following websites:


http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

http://www.rexegg.com/regex-quickstart.html










