Wednesday, 22 July 2015

reshape2 package - similar function to pivot table in Excel


recast() in "reshape2" package offers a function that can work like a pivot table in Excel.



library(reshape2)
recast(data = CO2, Plant + Type ~ Treatment, measure.var = "uptake", sum)
   Plant        Type nonchilled chilled
1    Qn1      Quebec      232.6     0.0
2    Qn2      Quebec      246.1     0.0
3    Qn3      Quebec      263.3     0.0
4    Qc1      Quebec        0.0   209.8
5    Qc3      Quebec        0.0   228.1
6    Qc2      Quebec        0.0   228.9
7    Mn3 Mississippi      168.8     0.0
8    Mn2 Mississippi      191.4     0.0
9    Mn1 Mississippi      184.8     0.0
10   Mc2 Mississippi        0.0    85.0
11   Mc3 Mississippi        0.0   121.1
12   Mc1 Mississippi        0.0   126.0


Sometimes, the variables need to be treated as vector.

recast(data=CO2, CO2$Plant + CO2$Type ~ CO2$Treatment, measure.var = "uptake", sum)

Tuesday, 7 July 2015

connecting R to database




There are several ways to do this, but the following are what worked for me.


Connections 



Connecting to a database using ODBC:

N.B. 32-bit ODBC will only work with 32-bit R, so for 64-bit R, 64-bit ODBC needs to be installed

library(RODBC)
  
con <- odbcConnect(dsn = "database server name as recorded in ODBC", uid = username, pwd = password) 

N.B For Microsoft SQL Server, I have been told to use odbcDriverConnect instead of odbcConnect.
  
Connecting to Teradata database using JDBC:

library(RJDBC)

drv <- JDBC("com.teradata.jdbc.TeraDriver", "C:/your path to jar file/terajdbc4.jar; C:/
your path to jar file/tdgssconfig.jar") 
con <- dbConnect(drv, "jdbc:teradata://host ip details e.g. 000.00.000.00/", user = "username", password = "password", port = port e.g. 00)  
 



Connecting to Oracle database:

library(ROracle)

drv <- dbDriver("Oracle")
   
con <- dbConnect(drv, "username", "password", dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=host ip details e.g. 000.000.0.000)(PORT= port e.g. 00))(CONNECT_DATA=(SID=sid name)))")  



Working in Database


To run SQL script from R:


sqlQuery(con, your SQL query)

For example, 
sqlQuery(con, "Select * from XYZ where rownum < 10")



To create and write to table:

sqlSave(con, R object name to write, tablename = "schema.table name")

  
To create and write to table without the rownames of the R object:

sqlSave(con, R object name to write, tablename = "schema.table name", rownames=FALSE)

  
To create and write to table by adding to the existing contents of the table:

sqlSave(con, R object name to write, tablename = "schema.table name", append = TRUE)



To disconnect from the database:

odbcClose(con)           #for RODBC

dbDisconnect(con)          #for RJDBC




For Oracle database or using RJDBC:


To run SQL script from R:


dbGetQuery(con, your SQL query)

For example, 
dbGetQuery(con, "Select * from XYZ where rownum < 10")


To read table:

Dat <- dbReadTable(con, "table name")  

To write to table:

dbWriteTable(con, "table name", object name)  


To write to table without the rownames of the R object:

dbWriteTable(con, "table name", object name, row.names = FALSE)  

  
To write to table by replacing the existing contents of the table:

dbWriteTable(con, "table name", object nameoverwrite = TRUE)  


To write to table by adding to the existing contents of the table:

dbWriteTable(con, "table name", object nameappend = TRUE)  

  


Sometime writing a table with date-time field can create an error when time zones are set differently in the R environment and Oracle. The below is the easiest to fix the error.

1. To change time zone of the R environment:

Sys.setenv(TZ="GMT")

  
2. To change time zone of the Oracle:

Sys.setenv(ORA_SDTZ="GMT")

  


To remove/delete a table:

dbRemoveTable(con, "table name", object name)  


To list all the tables:

dbListTables(con, schema = "schema name (if applicable)")  



To list all fields in a table:

dbListFields(con, "table name", schema = "schema name (if applicable)")  



To disconnect from the database:

dbDisconnect(con)









Friday, 3 July 2015

Batch processing & scheduling of R



Scheduling R script runs in Windows

1. create a batch file (.bat) with the below command

"Your path to R.exe (e.g. C:\Program Files\R-2.12.0\bin\x64\R.exe)" CMD BATCH "Your path to script file (e.g. C:\Documents\Rscript.r)"

2. Use Windows Task Scheduler to create trigger pointing to the batch file created above.



NOTE:

If the script has any dependency on the log-in credentials or if the script accesses folders requiring specific permissions, selecting the option 'Run whether user is logged on or not' may generate access denied errors.




Scheduling R script runs in Linux

There are several options, but one simple option is to use Crontab


http://www.adminschoice.com/crontab-quick-reference


e.g.

The following will run r script at 6.30 PM everyday

30     18     *     *     *        Rscript /home/your path/rscriptfile.r



If this is running from a server and if it should run regardless of whether user is logged in or not (i.e. run in the background or run as a root) disown the process as below

30     18     *     *     *        Rscript /home/your path/rscriptfile.r & disown  


If you are connecting to a database inside the R script, you may need to export the environment for the database used when you run the R script. Alternatively, you may save the environment in your bash_profile and call it before R script is run.

30     18     *     *     *        source ~/.bash_profile && Rscript /home/your path/rscriptfile.r & disown  



The above example used && as the R script is dependent on the successful set up of the database environment. If you are running scripts consecutively, independent of the outputs of each other, you may use ; (semicolon) instead of &&.


30     18     *     *     *       Rscript /home/your path/rscriptfile1.r;  Rscript /home/your path/rscriptfile2.r