
Week 9: A fun(ctional) example

Objectives

Let’s play with real data! Learn the most important data handling skills. In this tutorial:

Downloads

Download these files into your working directory:

Extract the data.zip folder.

Setting up

Start a new notebook file by selecting “File” -> “New File” -> “R Notebook”. Save the file as “yourname_week8.Rmd”. Delete the instructions starting from “This is an [R…”. As before, copy each piece of code below into your R notebook as an R chunk. An R chunk is code placed after a line that starts with `` ```{r} `` and ends before a line with `` ``` ``.

Reading in data

From command line

This works when you run an R script from the command line (and not in your notebook).

args = commandArgs(trailingOnly=TRUE)
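A minimal sketch of how the `args` vector might be used inside a script. The script name `my_script.R` and the helper `parse_args()` are hypothetical, made up for illustration. The point to notice is that command-line arguments always arrive as character strings, so numbers need converting.

```r
# Hypothetical helper: turn the character vector from commandArgs()
# into a named list. Assumes the call: Rscript my_script.R <input_file> <n_lines>
parse_args <- function(args) {
  if (length(args) < 2) {
    stop("Usage: Rscript my_script.R <input_file> <n_lines>")
  }
  list(
    input_file = args[1],
    n_lines    = as.integer(args[2])  # arguments always arrive as text
  )
}

# In a real script this would be: parse_args(commandArgs(trailingOnly = TRUE))
# Here a hand-made vector stands in for the command line.
opts <- parse_args(c("my_data.txt", "10"))
opts$input_file
opts$n_lines + 1  # arithmetic works after the conversion
```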

From files

- readline() reads from stdin (i.e. keystrokes when used in the console). Unlike scan(), you do not need to specify the type being returned, and you can supply a prompt:

readline(prompt = "Hello there, please type in your favorite number:\n")

   
- Another option is to read in the input with readLines(). This takes a "connection" (e.g. file, stdin, URL, zipped file) and the number of lines to read, set with the n parameter (the default is the whole file).

all_data = readLines(con = "my_other_data.gz")
all_data
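To see the effect of the n parameter without needing the course file, the sketch below writes a small gzip-compressed file to a temporary location and reads it back. The file contents are made up for illustration.

```r
# Write a small gzip-compressed text file to a temporary location
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("line one", "line two", "line three"), con)
close(con)

# Read only the first two lines; gzfile() decompresses on the fly
con <- gzfile(tmp, open = "rt")
first_two <- readLines(con, n = 2)
close(con)
first_two

# Without n, readLines() reads to the end of the file
all_lines <- readLines(tmp)
length(all_lines)

unlink(tmp)  # remove the temporary file
```

Opening the connection explicitly makes it clear when the file is closed; passing the file name directly, as in the second call, lets readLines() open and close it for you.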



- From Excel. Note, if this doesn't work properly, skip to the next part.  

```{r}
install.packages("xlsx")
# xlsx needs a few other packages, in particular rJava. If prompted, download the
# 64-bit version of Java (for Windows) from: https://www.java.com/en/download/manual.jsp
# This might not be necessary, but just in case:
install.packages("rJava")
Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre1.8.0_181') # adjust to where Java is installed

library(xlsx)
data <- read.xlsx(file = "pnasEisen1998.xlsx", sheetIndex = 2)   # second sheet, by position
key <- read.xlsx(file = "pnasEisen1998.xlsx", sheetName = "Key") # sheet selected by name

# From PDF
install.packages("pdftools")
library("pdftools")
library("glue")
library("tidyverse")

pdf_filename <- "elife-27469-v1.pdf"
pdf_text_extract <- pdf_text(pdf_filename) # one character string per page
length(pdf_text_extract)                   # number of pages
# load("R_datafun.Rdata") # if the pdf extraction didn't work
```

From other places, types

install.packages(c("httr", "jsonlite"))

library("httr")
library("jsonlite")

uniprot_acc <- c("P21675")  # change this to your favorite protein

# Get UniProt entry by accession 
acc_uniprot_url <- c("https://www.ebi.ac.uk/proteins/api/proteins?accession=")

comb_acc_api <- paste0(acc_uniprot_url, uniprot_acc)

# GET() is the basic httr function for querying the API
protein <- GET(comb_acc_api,
               accept_json())

status_code(protein)  # 200 means the request succeeded

# content() from httr parses the response into a list
protein_json <- content(protein) # gives a large nested list


# Get features 
feat_api_url <- c("https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=")

comb_acc_api <- paste0(feat_api_url, uniprot_acc)

# same pattern as above: query the API, then parse the response
prot_feat <- GET(comb_acc_api,
                 accept_json())

prot_feat_json <- content(prot_feat) # gives a large nested list
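The lists returned by content() are deeply nested, so it helps to practise navigating one. The sketch below uses a small made-up JSON string, not the real UniProt response layout, and parses it with jsonlite so that it runs without a network connection. The field names (`accession`, `features`, `type`) are assumptions chosen for illustration.

```r
library(jsonlite)

# Made-up JSON for illustration only; NOT the real UniProt response layout
toy_json <- '[{"accession": "X00001",
               "features": [{"type": "DOMAIN", "begin": "10", "end": "50"},
                            {"type": "HELIX",  "begin": "60", "end": "72"}]}]'

# simplifyVector = FALSE keeps everything as nested lists
toy <- fromJSON(toy_json, simplifyVector = FALSE)

str(toy, max.level = 2)   # overview of the nesting
toy[[1]]$accession        # first entry, "accession" field

# Pull one field out of every feature
feature_types <- sapply(toy[[1]]$features, function(f) f$type)
feature_types
```

The same steps (str() to see the shape, `[[ ]]` and `$` to step down, sapply() to collect a field) can be tried on protein_json and prot_feat_json once you have looked at their actual structure.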
 

Check data

It is always very important to check that you have correctly read in your data. Some basic data/sanity checks.
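A few such checks, sketched with base R functions. The built-in iris data set stands in here for whatever data you have just read in.

```r
data(iris)  # built-in data set, standing in for freshly read data

dim(iris)            # number of rows and columns: as expected?
str(iris)            # column names and types
head(iris, 3)        # first few rows
summary(iris)        # ranges and quartiles; look for impossible values
anyNA(iris)          # any missing values?
table(iris$Species)  # only the categories you expect?
```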

Clean up data, or data “wrangling”

Data is never tidy. Data wrangling is the process of cleaning and structuring data: removing inaccurate entries and making sure that what remains is of good enough quality for your analysis.
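A minimal sketch of some typical wrangling steps in base R. The data frame `raw` is made up for illustration and contains stray whitespace, a text marker for a missing value, and a duplicated row.

```r
# Made-up messy data
raw <- data.frame(
  id     = c("a1", "a2 ", "a3", "a3"),
  weight = c("12.5", "n/a", "9.1", "9.1"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$id <- trimws(clean$id)                # strip stray whitespace
clean$weight[clean$weight == "n/a"] <- NA   # recode the missing-value marker
clean$weight <- as.numeric(clean$weight)    # text -> number
clean <- clean[!duplicated(clean), ]        # drop exact duplicate rows
clean <- clean[!is.na(clean$weight), ]      # drop rows with no measurement
clean
```

Whether rows with missing values should be dropped or kept depends on the analysis; dropping them here is only one option.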

Saving (or exporting) data

Tidy data

Tidyverse packages are designed to work well together.

Tibbles

The basic data format for the tidyverse is the “tibble” (tidy table). Tibbles work (and look) a little different from the basic R data frame, but can be manipulated with tidyverse functions. R prints only the first ten rows of a tibble and only the columns that fit into your console window. R also adds useful summary information about the tibble, such as the data type of each column and the size of the data set. Note, when you do not have the tidyverse packages loaded, tibbles act as data.frames!

x <- c("A", "B", "C", "D")
y <- 4:1
z <- c("How", "do", "you", "do?")
data_frame = data.frame(ID=x, num=y, words=z)
tibble_data = as_tibble(data_frame)
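One practical difference can be checked directly. The sketch below rebuilds a smaller version of the two objects and compares their classes and what happens when a single column is selected with `[ , ]`.

```r
library(tibble)

data_frame  <- data.frame(ID = c("A", "B", "C", "D"), num = 4:1)
tibble_data <- as_tibble(data_frame)

class(data_frame)
class(tibble_data)   # a tibble is still a data.frame underneath

# Single-column subsetting: a data.frame drops to a vector, a tibble does not
is.data.frame(data_frame[, "num"])
is.data.frame(tibble_data[, "num"])
```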

Tidy functions

select(tibble_data, ID, num)
filter(tibble_data, ID == "A")
arrange(tibble_data, num)
iris_tibble = as_tibble(iris)

iris_tibble_set <- filter(iris_tibble, Species == "setosa", Petal.Length > 1.6 )
iris_tibble_set <- select(iris_tibble_set, Sepal.Length, Sepal.Width)
iris_tibble_set <- arrange(iris_tibble_set, desc(Sepal.Width))

These could be combined:

arrange(select(filter(iris_tibble, Species == "setosa", Petal.Length > 1.6), Sepal.Length, Sepal.Width), desc(Sepal.Width))

But this is a little hard to read. The pipe version “tidies” this to:

iris_tibble %>% 
  filter(Species == "setosa", Petal.Length > 1.6 ) %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  arrange(desc(Sepal.Width))
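To convince yourself that the nested and piped forms do the same thing, both results can be stored and compared. This is a small check using only the functions already shown, plus all.equal() from base R.

```r
library(dplyr)

iris_tibble <- as_tibble(iris)

# Nested form: read from the inside out
nested <- arrange(
  select(filter(iris_tibble, Species == "setosa", Petal.Length > 1.6),
         Sepal.Length, Sepal.Width),
  desc(Sepal.Width)
)

# Piped form: read from top to bottom
piped <- iris_tibble %>%
  filter(Species == "setosa", Petal.Length > 1.6) %>%
  select(Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Width))

all.equal(nested, piped)  # compares the two results
```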

Extracting summaries

summarise(iris_tibble, total = sum(Sepal.Width), max = max(Sepal.Width), mean = mean(Sepal.Width))
iris_tibble %>%
    group_by(Sepal.Width, Species) %>% 
    summarise( sum(Sepal.Length))
iris_tibble %>%
  mutate(ratio = Sepal.Width/Petal.Width)
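Grouped summaries are often easier to read when each summary column is given a name. A minimal sketch, grouping by Species only; the column names `n`, `mean_len` and `sd_len` are arbitrary choices, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

species_summary <- as_tibble(iris) %>%
  group_by(Species) %>%
  summarise(
    n        = n(),                  # rows per group
    mean_len = mean(Sepal.Length),
    sd_len   = sd(Sepal.Length),
    .groups  = "drop"                # return an ungrouped tibble
  )
species_summary
```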

Reshaping tidy data

Joining data

Other

Resources

Test yourself!

  1. Download and load the file “R_datafun.Rdata” into your environment.
  2. Plot the 13 sets within the datasaurus object. For each set, calculate the mean and standard deviation of x and y, and then the Pearson correlation between x and y. Record all these values. Why is it important to visualise your data?
  3. “Knit” your R Markdown file into an HTML page or a PDF.

[Solutions next week]
