Science Notes: 2015

Tuesday, November 17, 2015

Linux spawn shell windows using tmux

Start tmux session: tmux
Split the window: ^b (Ctrl + b), then type a double quote " for a horizontal split (panes stacked top and bottom), or type % for a vertical split (panes side by side)
Detach the current session: ^b, then press key d
Attach the previous session when starting tmux: tmux attach
List all sessions: ^b, then press key s (or run tmux ls from the shell; ^b followed by l switches to the previously active window instead)
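
The same layout can also be set up from a script rather than with key bindings. The following is only a sketch: the function name start_split_session and the default session name "work" are made up for illustration, and it assumes tmux is installed and on the PATH.

```shell
# Hypothetical helper: create a detached tmux session and split it into panes.
start_split_session() {
    local name=${1:-work}               # session name (an assumption; any name works)
    tmux new-session -d -s "$name"      # start a new session without attaching to it
    tmux split-window -h -t "$name"     # side-by-side split, same as ^b %
    tmux split-window -v -t "$name"     # top/bottom split of the active pane, same as ^b "
    tmux list-sessions                  # list all sessions, same as "tmux ls"
}
```

To use it, call start_split_session demo and then attach with tmux attach -t demo.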

For detailed usage, see these tmux tutorials:

https://danielmiessler.com/study/tmux/

https://gist.github.com/MohamedAlaa/2961058

Friday, November 6, 2015

Captions and Cross References in R Markdown File

See this great post:

https://rstudio-pubs-static.s3.amazonaws.com/98310_b44bc54001af49d98a7b891d204652e2.html

Tuesday, October 27, 2015

Read nth line from a text file using sed

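To read a particular line (e.g., the 10th line) from a text file and split it into a bash array on the shell's IFS delimiter (whitespace by default):
Array=($(sed -n "10{p;q;}" file.txt))
Here p prints line 10 and q quits immediately afterwards, so sed does not scan the rest of the file.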
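
If the line has to be split on a specific delimiter rather than on whitespace, one option is to set IFS for a single read. This is a minimal bash sketch, not the only way to do it: the function name read_line_fields and the global array name Fields are hypothetical, and the demo file is created on the fly.

```shell
# Hypothetical helper (bash only: it uses arrays and "read -a").
# Reads line N of FILE and splits it on DELIM into the global array Fields.
read_line_fields() {
    local n=$1 file=$2 delim=${3:-,}
    local line
    line=$(sed -n "${n}{p;q;}" "$file")          # print line N, then quit
    IFS="$delim" read -r -a Fields <<< "$line"   # IFS is changed for this read only
}

# Demo on a throwaway file
tmp=$(mktemp)
printf 'a,b,c\nd,e,f\n' > "$tmp"
read_line_fields 2 "$tmp" ','
echo "${Fields[1]}"    # second field of line 2
rm -f "$tmp"
```

Because IFS is set only for the read command, the rest of the script keeps the default word splitting.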

Friday, October 9, 2015

Determine Whether Two Regions Overlap

Say we have two genomic regions (x1,x2) and (y1,y2) from the same chromosome. The simplest way to check whether the two regions overlap is perhaps testing:

x1 <= y2 && y1 <= x2

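assuming x1 <= x2 and y1 <= y2, with both endpoints inclusive. The test fails only when one region ends before the other starts (x2 < y1 or y2 < x1).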
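
As a quick illustration, the test can be wrapped in a small bash function. This is a sketch under the same assumptions (integer coordinates, closed intervals, start <= end); the name regions_overlap is made up for this example.

```shell
# Hypothetical helper: succeeds (exit status 0) when the closed integer
# intervals [x1,x2] and [y1,y2] share at least one position.
regions_overlap() {
    local x1=$1 x2=$2 y1=$3 y2=$4
    (( x1 <= y2 && y1 <= x2 ))
}

if regions_overlap 100 200 150 250; then echo "overlap"; else echo "no overlap"; fi
if regions_overlap 100 200 201 300; then echo "overlap"; else echo "no overlap"; fi
```

The first call reports an overlap and the second does not, since the second region starts one base after the first one ends.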

Reference:
http://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-two-integer-ranges-for-overlap

How to choose between AUC PR and AUC ROC?

An excellent discussion of PR vs. ROC:
https://www.kaggle.com/forums/f/15/kaggle-forum/t/7517/precision-recall-auc-vs-roc-auc-for-class-imbalance-problems/41179

Tuesday, October 6, 2015

R: calling C from FORTRAN and vice versa

http://www.hep.by/gnu/r-patched/r-exts/R-exts_136.html

http://users.stat.umn.edu/~geyer/rc/

Tuesday, September 22, 2015

R Internal Utility Functions

A brief description of the R internal sort routines:
http://www.hep.by/gnu/r-patched/r-exts/R-exts_144.html

Monday, August 31, 2015

Variant Annotation and Comparison

A nice post about variant annotation tools
http://blog.goldenhelix.com/ajesaitis/the-sate-of-variant-annotation-a-comparison-of-annovar-snpeff-and-vep/

Wednesday, August 26, 2015

Makefile detect OS

#Detect OS and processor
ifeq ($(OS),Windows_NT)
    CCFLAGS += -D WIN32
    ifeq ($(PROCESSOR_ARCHITECTURE),AMD64)
        CCFLAGS += -D AMD64
    endif
    ifeq ($(PROCESSOR_ARCHITECTURE),x86)
        CCFLAGS += -D IA32
    endif
else
    UNAME_S := $(shell uname -s)
    ifeq ($(UNAME_S),Linux)
        CCFLAGS += -D LINUX
    endif
    ifeq ($(UNAME_S),Darwin)
        CCFLAGS += -D OSX
    endif
    UNAME_P := $(shell uname -p)
    ifeq ($(UNAME_P),x86_64)
        CCFLAGS += -D AMD64
    endif
    ifneq ($(filter %86,$(UNAME_P)),)
        CCFLAGS += -D IA32
    endif
    ifneq ($(filter arm%,$(UNAME_P)),)
        CCFLAGS += -D ARM
    endif
endif

Reference

http://stackoverflow.com/questions/714100/os-detecting-makefile

Friday, July 31, 2015

RNAseq library type explained

http://onetipperday.blogspot.com/2012/07/how-to-tell-which-library-type-to-use.html

Monday, July 27, 2015

Read zip file without unzipping in R

The following R function is modified from Joshua Ulrich's answer on Stack Overflow. An argument FUN is added to specify which R function processes the file connection.

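read.zip = function(file, FUN=read.table, ...) {
  # List the archive contents without extracting anything
  zipFileInfo = unzip(file, list=TRUE)
  # Only a single archived file is supported; it is opened as a
  # connection with unz() and passed on to FUN
  if(nrow(zipFileInfo) > 1)
    stop("More than one data file inside zip")
  else
    FUN(unz(file, as.character(zipFileInfo$Name)), ...)
}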

Reference:
http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r

Tuesday, July 14, 2015

How R Searches and Finds Stuff

http://blog.obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

Sunday, June 14, 2015

Install numpy with ATLAS support on Linux

Please see this excellent post
http://www.ankitsrivastava.net/2014/05/installing-pythonnumpy-with-atlas-support/

Installation of scipy is the same as that of numpy.

Saturday, May 23, 2015

awk: split string into array and select an array element

Say the input is a character string with a varying number of comma-separated elements, e.g.,
ABC,DEF,GHI,JKL,MNO

To extract the second-to-last element in this string (i.e., JKL):
echo ABC,DEF,GHI,JKL,MNO | awk '{len=split($0,arr,","); print arr[len-1]}'

Here split returns the number of elements it stored in arr, which is captured in the variable len. To print the array length itself:
echo ABC,DEF,GHI,JKL,MNO | awk '{len=split($0,arr,","); print len}'
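
An alternative that avoids split is to let awk do the splitting itself by setting the field separator with -F, so that NF holds the number of elements. This is a small sketch; the function names second_last and count_fields are made up for illustration.

```shell
# Print the second-to-last comma-separated field of each input line.
# Lines with fewer than two fields are skipped.
second_last() {
    awk -F',' 'NF >= 2 { print $(NF-1) }'
}

# Print the number of comma-separated fields on each input line.
count_fields() {
    awk -F',' '{ print NF }'
}

echo ABC,DEF,GHI,JKL,MNO | second_last
echo ABC,DEF,GHI,JKL,MNO | count_fields
```

For the example string, the two calls print JKL and 5.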

Wednesday, February 25, 2015

Extract rRNA and tRNA features from UCSC Browser

Credit to Matthew Speir.

Extract tRNA (ref1)
In the Table Browser, you can use the following steps to get the coordinates for tRNA genes, with all of the tRNA pseudogenes filtered out:

1. Select your assembly and tracks

    clade: Mammal
    genome: Human
    assembly: Feb. 2009 (GRCh37/hg19)
    group: Genes and Gene Predictions Tracks
    track: tRNA
    table: tRNAs
    output: GTF - gene transfer format
    output file: enter a file name to save your results to a file, or leave blank to display results in the browser

2. Click 'Filter'.

3. Enter 'Pseudo' into the aa field.
    The "aa" line should read: aa doesn't match Pseudo

4. Click 'Submit'.

5. After you return to the main Table Browser page, click 'get output'.

Extract rRNA (ref2)
The GENCODE v19 track,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV19,
contains genomic coordinates for human ribosomal RNA, snRNA, and 5S
ribosomal RNA. You can use the following steps to access this
information, and get the output in BED format:

1. Navigate to the table browser, http://genome.ucsc.edu/cgi-bin/hgTables.

2. Select your assembly and tracks

    clade: Mammal
    genome: Human
    assembly: Feb. 2009 (GRCh37/hg19)
    group: Genes and Gene Predictions Tracks
    track: GENCODE Genes V19
    table: Basic (wgEncodeGencodeBasicV19)
    output: BED - browser extensible data
    output file: enter a file name to save your results to a file, or leave blank to display results in your browser

3. Click 'Filter'.

4. Select the wgEncodeGencodeAttrsV19 from the 'Linked Tables' section

5. Click 'allow filtering using fields in checked tables'.

6. This step depends on whether you want the coordinates for the rRNA or snRNA genes.
    6.1 For rRNA, type 'rRNA' in the 'geneType' and 'transcriptType' fields of the hg19.wgEncodeGencodeAttrsV19 based filters section.
        The "geneType" line should read: geneType does match rRNA
        The "transcriptType" line should read: transcriptType does match rRNA
    6.2 For snRNA, type 'snRNA' in the 'geneType' and 'transcriptType' fields of the hg19.wgEncodeGencodeAttrsV19 based filters section.
        The "geneType" line should read: geneType does match snRNA
        The "transcriptType" line should read: transcriptType does match snRNA

7. Click 'Submit'.

8. After you return to the main Table Browser page, click 'get output'.

Many of the 5S rRNA positions in this table are pseudogenes, and you
may need to try different filtering parameters to exclude these from
the output.

The coordinates for piRNA are contained in the UCSC Genes track,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene. You can
get this information from the Table Browser using steps similar to
those I previously described:

1. Select your assembly and tracks

    clade: Mammal
    genome: Human
    assembly: Feb. 2009 (GRCh37/hg19)
    group: Genes and Gene Predictions Tracks
    track: UCSC Genes
    table: knownGene
    output: BED - browser extensible data
    output file: enter a file name to save your results to a file, or leave blank to display results in the browser

2. Click 'Filter'.

3. Type '*piRNA*' in the 'description' field of the hg19.kgXref based filters section.
    The "description" line should read: description does match *piRNA*

4. Click 'Submit'.

5. After you return to the main Table Browser page, click 'get output'.

Lastly, precursor miRNA coordinates can be found in the sno/miRNA
track, http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgRna.
Again, you can get this information using the Table Browser and steps
similar to those I previously described:

1. Select your assembly and tracks

    clade: Mammal
    genome: Human
    assembly: Feb. 2009 (GRCh37/hg19)
    group: Genes and Gene Predictions Tracks
    track: sno/miRNA
    table: wgRna
    output: BED - browser extensible data
    output file: enter a file name to save your results to a file, or leave blank to display results in the browser

2. Click 'Filter'.

3. Enter 'miRNA' into the type field.
    The "type" line should read: type does match *miRNA*

4. Click 'Submit'.

5. After you return to the main Table Browser page, click 'get output'.

References
1. https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/NWDhuxc360w
2. https://groups.google.com/a/soe.ucsc.edu/forum/#!msg/genome/jSAY8w1JVVo/P6lk4OJzDNEJ

Wednesday, February 18, 2015

WES versus WGS

http://macarthurlab.org/2014/07/21/what-do-we-miss-with-exome-sequencing/

http://www.biomedcentral.com/1471-2105/15/247

Thursday, February 12, 2015

PCA Terminology in R/prcomp

In R, prcomp() returns an object with the following components:

1. sdev, the standard deviations of the principal components (PCs), i.e., the square roots of the eigenvalues of the covariance (or correlation) matrix. The proportion of variance explained by each PC is sdev^2/sum(sdev^2). A scree plot is simply something like barplot(sdev^2). To choose the number of "important" PCs, look for an "elbow" in the scree plot: the point beyond which the remaining eigenvalues are relatively small and all about the same size.

2. rotation, the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

3. x, the rotated data (the centred, and scaled if requested, data multiplied by the rotation matrix). These values are also called the PC scores. Because the PCs are uncorrelated, cov(x) is the diagonal matrix diag(sdev^2). PC scores can be used to visualize sample outliers (e.g., plot(x[,1],x[,2])) and in subsequent analyses, for example as covariates in linear regression models to correct for hidden structure.
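
The relationship between the three components can be written out as a short derivation. The notation here is introduced only for illustration and is not part of prcomp: X_c is the n-by-p data matrix after centring (and scaling, if requested), S is its sample covariance matrix, V holds the eigenvectors, and the lambda_j are the eigenvalues.

```latex
% X_c: the n-by-p data matrix after centring (and scaling, if requested)
\begin{align*}
S &= \frac{1}{n-1} X_c^{\top} X_c = V \Lambda V^{\top},
  \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p) \\
\texttt{sdev}_j &= \sqrt{\lambda_j}, \qquad \texttt{rotation} = V, \qquad \texttt{x} = X_c V \\
\operatorname{cov}(\texttt{x}) &= \frac{1}{n-1} V^{\top} X_c^{\top} X_c V = V^{\top} S V = \Lambda
  = \operatorname{diag}(\texttt{sdev}^2) \\
\text{proportion of variance for PC } j &= \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k}
  = \frac{\texttt{sdev}_j^2}{\sum_{k} \texttt{sdev}_k^2}
\end{align*}
```

The third line uses the fact that the columns of V are orthonormal, which is why the scores are uncorrelated and their variances are exactly sdev^2.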

Sunday, February 1, 2015

Plot correlation matrix into a graph

http://stackoverflow.com/questions/5453336/plot-correlation-matrix-into-a-graph