Start tmux session: tmux
Split the window: ^b (Ctrl + b), then type " (double quote) to split into top and bottom panes, or type % to split into left and right panes
Detach the current session: ^b, then press key d
Reattach to the previous session: tmux attach
List all sessions: ^b, then press key s (or run tmux ls from the shell)
For detailed usage, see these tmux tutorials:
https://danielmiessler.com/study/tmux/
https://gist.github.com/MohamedAlaa/2961058
A weblog sharing great ideas, theory, and implementations in data, sciences, and beyond.
Tuesday, November 17, 2015
Friday, November 6, 2015
Captions and Cross References in R Markdown File
See this great post:
https://rstudio-pubs-static.s3.amazonaws.com/98310_b44bc54001af49d98a7b891d204652e2.html
Tuesday, October 27, 2015
Read nth line from a text file using sed
To read a particular line (e.g., the 10th line) from a text file and split it into an array (the split happens on whitespace, the shell's default IFS):
Array=(`sed -n "10{p;q;}" file.txt`)
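The one-liner above splits on whitespace. For other delimiters, a minimal portable sketch is shown below. The helper name nth_line_field and its argument order are hypothetical, and the positional parameters stand in for the array so the code also runs in plain sh.

```shell
# Hypothetical helper: print field number $4 of line number $2 in file $1,
# splitting that line on the single-character delimiter $3.
nth_line_field() {
    file=$1; line_no=$2; delim=$3; field_no=$4
    # p prints the requested line, q then stops sed from reading further
    line=$(sed -n "${line_no}{p;q;}" "$file")
    (
        IFS=$delim          # split on the chosen delimiter (subshell only)
        set -f              # disable globbing so '*' in the data stays literal
        set -- $line        # positional parameters now act as the array
        shift $((field_no - 1))
        printf '%s\n' "$1"
    )
}

# Demo on a throwaway file (assumed data, not from the post)
demo=$(mktemp)
printf 'a,b,c\nd,e,f\n' > "$demo"
nth_line_field "$demo" 2 , 3    # prints: f
rm -f "$demo"
```

Running the split inside a subshell keeps the IFS and globbing changes from leaking into the calling shell.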
Friday, October 9, 2015
Determine Whether Two Regions Overlap
Say we have two genomic regions (x1,x2) and (y1,y2) from the same chromosome. The simplest way to check whether the two regions overlap is perhaps testing:
x1 <= y2 && y1 <= x2
assuming x1 <= x2 and y1 <= y2.
Reference:
http://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-two-integer-ranges-for-overlap
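As a quick check of the test above, here is a minimal shell sketch. The function name regions_overlap is hypothetical, and the coordinates are made-up integers treated as closed intervals.

```shell
# Hypothetical helper: succeed (exit status 0) when the closed intervals
# [x1, x2] and [y1, y2] share at least one position.
# Assumes x1 <= x2 and y1 <= y2, as in the text.
regions_overlap() {
    x1=$1; x2=$2; y1=$3; y2=$4
    [ "$x1" -le "$y2" ] && [ "$y1" -le "$x2" ]
}

if regions_overlap 100 200 150 300; then echo "overlap"; else echo "no overlap"; fi   # prints: overlap
if regions_overlap 100 200 201 300; then echo "overlap"; else echo "no overlap"; fi   # prints: no overlap
```

With closed intervals, two regions that merely touch at one coordinate (e.g., 100-200 and 200-300) count as overlapping. For half-open coordinates such as BED, the comparisons would need to be strict.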
How to choose between AUC PR and AUC ROC?
An excellent discussion on the topic of PR vs ROC:
https://www.kaggle.com/forums/f/15/kaggle-forum/t/7517/precision-recall-auc-vs-roc-auc-for-class-imbalance-problems/41179
Tuesday, October 6, 2015
R: calling C from FORTRAN and vice versa
http://www.hep.by/gnu/r-patched/r-exts/R-exts_136.html
http://users.stat.umn.edu/~geyer/rc/
Tuesday, September 22, 2015
R Internal Utility Functions
A brief description of R's internal sort routines:
http://www.hep.by/gnu/r-patched/r-exts/R-exts_144.html
Monday, August 31, 2015
Variant Annotation and Comparison
A nice post about variant annotation tools
http://blog.goldenhelix.com/ajesaitis/the-sate-of-variant-annotation-a-comparison-of-annovar-snpeff-and-vep/
Wednesday, August 26, 2015
Makefile detect OS
# Detect OS and processor
ifeq ($(OS),Windows_NT)
    CCFLAGS += -D WIN32
    ifeq ($(PROCESSOR_ARCHITECTURE),AMD64)
        CCFLAGS += -D AMD64
    endif
    ifeq ($(PROCESSOR_ARCHITECTURE),x86)
        CCFLAGS += -D IA32
    endif
else
    UNAME_S := $(shell uname -s)
    ifeq ($(UNAME_S),Linux)
        CCFLAGS += -D LINUX
    endif
    ifeq ($(UNAME_S),Darwin)
        CCFLAGS += -D OSX
    endif
    UNAME_P := $(shell uname -p)
    ifeq ($(UNAME_P),x86_64)
        CCFLAGS += -D AMD64
    endif
    ifneq ($(filter %86,$(UNAME_P)),)
        CCFLAGS += -D IA32
    endif
    ifneq ($(filter arm%,$(UNAME_P)),)
        CCFLAGS += -D ARM
    endif
endif
Reference
http://stackoverflow.com/questions/714100/os-detecting-makefile
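To see which flags the non-Windows branch would pick on a given machine, the same uname-based mapping can be sketched in the shell. The helper names os_define and arch_define are hypothetical, and the patterns simply mirror the ifeq and filter tests in the Makefile.

```shell
# Hypothetical helper: map a `uname -s` string to the matching define.
os_define() {
    case "$1" in
        Linux)  echo "-D LINUX" ;;
        Darwin) echo "-D OSX" ;;
        *)      echo "" ;;
    esac
}

# Hypothetical helper: map a `uname -p` string to the matching define,
# using the same patterns as the Makefile (x86_64, %86, arm%).
arch_define() {
    case "$1" in
        x86_64) echo "-D AMD64" ;;
        *86)    echo "-D IA32" ;;
        arm*)   echo "-D ARM" ;;
        *)      echo "" ;;
    esac
}

echo "CCFLAGS would gain: $(os_define "$(uname -s)") $(arch_define "$(uname -p)")"
```

On some Linux systems uname -p reports "unknown", in which case neither the Makefile nor this sketch adds a processor define.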
Friday, July 31, 2015
RNAseq library type explained
http://onetipperday.blogspot.com/2012/07/how-to-tell-which-library-type-to-use.html
Monday, July 27, 2015
Read zip file without unzipping in R
The following R function is modified from Joshua Ulrich's post on Stack Overflow. An argument FUN is added to specify which R function is used to read from the file connection.
read.zip = function(file, FUN=read.table, ...) {
  zipFileInfo = unzip(file, list=TRUE)
  if(nrow(zipFileInfo) > 1)
    stop("More than one data file inside zip")
  else
    FUN(unz(file, as.character(zipFileInfo$Name)), ...)
}
Reference:
http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
Tuesday, July 14, 2015
Sunday, June 14, 2015
Install numpy with ATLAS support on Linux
Please see this excellent post
http://www.ankitsrivastava.net/2014/05/installing-pythonnumpy-with-atlas-support/
Installation of scipy is the same as that of numpy.
Saturday, May 23, 2015
awk: split string into array and select an array element
Say the input is a character string with a varying number of comma-separated elements, e.g.,
ABC,DEF,GHI,JKL,MNO
If we want to extract the second-to-last element of this string (i.e., JKL):
echo ABC,DEF,GHI,JKL,MNO | awk '{len=split($0,arr,","); print arr[len-1]}'
where the variable len captures the number of elements returned by split. To print the array length instead:
echo ABC,DEF,GHI,JKL,MNO | awk '{len=split($0,arr,","); print len}'
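The same idea generalizes to any position counted from the end. A minimal sketch follows; the wrapper name kth_from_last and its arguments are hypothetical, and the delimiter is assumed to be a single character.

```shell
# Hypothetical helper: print the k-th element from the end of a delimited
# string read from standard input (k=1 is the last element).
# $1 = k, $2 = single-character delimiter.
kth_from_last() {
    awk -v k="$1" -v sep="$2" '{
        len = split($0, arr, sep)          # len is the number of elements
        if (k >= 1 && k <= len)            # ignore out-of-range requests
            print arr[len - k + 1]
    }'
}

echo ABC,DEF,GHI,JKL,MNO | kth_from_last 2 ,    # prints: JKL
```

Because awk arrays created by split are indexed from 1, the last element is arr[len] and the k-th from the end is arr[len - k + 1].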
Wednesday, February 25, 2015
Extract rRNA and tRNA features from UCSC Browser
Credit to Matthew Speir.
Extract tRNA (ref1)
In the Table Browser, you can use the following steps to get the coordinates for tRNA genes, with all of the tRNA pseudogenes filtered out:
1. Select your assembly and tracks
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: tRNA
table: tRNAs
output: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser
2. Click 'Filter'.
3. Enter 'Pseudo' into the aa field.
The "aa" line should read: aa doesn't match Pseudo
4. Click 'Submit'.
Extract rRNA (ref2)
The GENCODE v19 track,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV19,
contains genomic coordinates for human ribosomal RNA, snRNA, and 5S
ribosomal RNA. You can use the following steps to access this
information, and get the output in BED format:
1. Navigate to the table browser, http://genome.ucsc.edu/cgi-bin/hgTables.
2. Select your assembly and tracks
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: GENCODE Genes V19
table: Basic (wgEncodeGencodeBasicV19)
output: BED - browser extensible data
output file: enter a file name to save your results to a file, or
leave blank to display results in your browser
3. Click 'Filter'.
4. Select the wgEncodeGencodeAttrsV19 from the 'Linked Tables' section
5. Click 'allow filtering using fields in checked tables'.
6. This step will change depending on whether you want the coordinates
for the rRNA or snRNA genes.
6.1 For rRNA, type 'rRNA' in the 'geneType' and 'transcriptType'
fields of the hg19.wgEncodeGencodeAttrsV19 based filters section.
The "geneType" line should read: geneType does match rRNA
The "transcriptType" line should read: transcriptType does match rRNA
6.2 For snRNA, type 'snRNA' in the 'geneType' and 'transcriptType'
fields of the hg19.wgEncodeGencodeAttrsV19 based filters section.
The "geneType" line should read: geneType does match snRNA
The "transcriptType" line should read: transcriptType does match snRNA
7. Click 'Submit'.
8. After you return to the main Table Browser page, click 'get output'.
Many of the 5S rRNA positions in this table are pseudogenes, and you
may need to try different filtering parameters to exclude these from
the output.
The coordinates for piRNA are contained in the UCSC Genes track,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene. You can
get this information from the Table Browser using steps similar to
those I previously described:
1. Select your assembly and tracks
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: UCSC Genes
table: knownGene
output: BED - browser extensible data
output file: enter a file name to save your results to a file, or
leave blank to display results in the browser
2. Click 'Filter'.
3. Type '*piRNA*' in the 'description' field of the hg19.kgXref based
filters section.
The "description" line should read: description does match *piRNA*
4. Click 'Submit'.
5. After you return to the main Table Browser page, click 'get output'.
Lastly, precursor miRNA coordinates can be found in the sno/miRNA
track, http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgRna.
Again, you can get this information using the Table Browser and steps
similar to those I previously described:
1. Select your assembly and tracks
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions Tracks
track: sno/miRNA
table: wgRna
output: BED - browser extensible data
output file: enter a file name to save your results to a file, or
leave blank to display results in the browser
2. Click 'Filter'.
3. Enter 'miRNA' into the type field.
The "type" line should read: type does match *miRNA*
4. Click 'Submit'.
5. After you return to the main Table Browser page, click 'get output'.
References
1. https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/NWDhuxc360w
2. https://groups.google.com/a/soe.ucsc.edu/forum/#!msg/genome/jSAY8w1JVVo/P6lk4OJzDNEJ
Wednesday, February 18, 2015
WES versus WGS
http://macarthurlab.org/2014/07/21/what-do-we-miss-with-exome-sequencing/
http://www.biomedcentral.com/1471-2105/15/247
Thursday, February 12, 2015
PCA Terminology in R/prcomp
In R, prcomp returns an object with the following components:
1. sdev, the standard deviations of the principal components (PCs) (i.e., the square roots of the eigenvalues of the covariance/correlation matrix). To calculate the variance explained by each PC: sdev^2/sum(sdev^2). A scree plot is simply something like barplot(sdev^2). To determine the appropriate number of "important" PCs, we can look for an "elbow" in the scree plot. The component number is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size.
2. rotation, the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
3. x, the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix). This is also called PCA scores. Hence, cov(x) is the diagonal matrix diag(sdev^2). These PC scores can be used in visualization of sample outliers (e.g., plot(x[,1],x[,2])) and subsequent data analyses, such as correction for hidden structure in linear regression models with PC scores incorporated as covariates.
Sunday, February 1, 2015
Plot correlation matrix into a graph
http://stackoverflow.com/questions/5453336/plot-correlation-matrix-into-a-graph