Data Science Hub: Directory of data tools
Bioconductor
Open-source software that offers a comprehensive collection of tools and packages for the analysis and interpretation of high-throughput genomic data in the R programming language.
Bowtie
An ultrafast, memory-efficient short read aligner that uses the Burrows-Wheeler transform and FM-index to align sequencing reads to large genomes.
BWA (Burrows-Wheeler Aligner)
A widely used alignment algorithm that maps low-divergent sequences against a large reference genome. It employs the Burrows-Wheeler transform and backward search with a suffix array to enable fast and accurate alignments.
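The Burrows-Wheeler transform and backward search named in the Bowtie and BWA entries can be illustrated with a small sketch. This is a toy, exact-match version written for readability, not the implementation used by either tool; production aligners use compressed, sampled indexes and tolerate mismatches and gaps. The function names (`bwt`, `build_fm_index`, `backward_search`, `find_all`) are made up for this example.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform and suffix array of text + '$'."""
    text += "$"  # sentinel that sorts before every other character
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    return "".join(text[i - 1] for i in sa), sa


def build_fm_index(bwt_str):
    """Build the two FM-index tables used by backward search."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def backward_search(pattern, first, occ, n):
    """Return the half-open suffix-array interval matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character leftwards
        if ch not in first:
            return 0, 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(reference, read):
    """Return sorted 0-based positions where read occurs exactly in reference."""
    transformed, sa = bwt(reference)
    first, occ = build_fm_index(transformed)
    lo, hi = backward_search(read, first, occ, len(transformed))
    return sorted(sa[i] for i in range(lo, hi))


print(find_all("ACGTACGTGA", "ACG"))
```

Each step of the search costs two table lookups regardless of reference length, which is why the approach scales to large genomes once the index is built.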
GATK (Genome Analysis Toolkit)
A comprehensive suite of tools for analyzing high-throughput sequencing data, primarily used for variant discovery in whole exome sequencing (WES) and whole genome sequencing (WGS) data.
HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts 2)
A fast and sensitive alignment program for mapping next-generation sequencing reads to a reference genome.
Salmon
A lightweight, quasi-mapping-based tool for quantifying transcript abundance from RNA-sequencing (RNA-seq) data.
Scanpy
A Python-based toolkit for analyzing single-cell gene expression data.
Seurat
An R package designed for analyzing and exploring single-cell RNA-seq (scRNA-seq) data.
STAR (Spliced Transcripts Alignment to a Reference)
A highly efficient RNA-seq aligner that maps sequencing reads to a reference genome while considering spliced alignments.
ClinPhen
A fast, high-accuracy algorithm that scans clinical notes and generates a prioritized list of patient phenotypes.
Epic Caboodle
Epic’s enterprise data warehouse, an abstracted database that supports the clinical information system and allows for the exploration and analysis of patient data from hospitals around the nation.
Epic Clarity Reports
A tool for generating reports from the Epic database that include longer timeframes or require complex analysis.
Epic Cosmos
A database of de-identified data from inpatient and outpatient electronic health records for use in clinical research. It includes records from over 180 million patients and over 6.6 billion encounters.
Epic SlicerDicer
A data exploration and visualization tool that allows users to analyze aggregate patient data.
Natural Language Processing
A set of techniques that use machine learning to process and analyze natural language text and data (e.g., free-text notes in electronic health records).
OMOP (Observational Medical Outcomes Partnership)
A common data model, maintained by the open-science OHDSI collaborative, that standardizes the way healthcare data is structured and analyzed for observational research.
PhenoGPT
A specialized version of the GPT language model designed to analyze clinical text, enabling tasks such as phenotype extraction, disease coding, and clinical decision support in electronic health records.
PHIS (Pediatric Health Information System) Database
A comparative database with clinical and resource utilization data from inpatient, ambulatory surgery, emergency department, and observation unit patient encounters from more than 49 children’s hospitals.
Population Builder: Stratification Module
A tool that allows organizations to rapidly identify patients best suited for population health programs through predefined criteria and customizable filters.
REDCap
A secure web application used to build and manage online surveys and databases.
G*Power
A software tool for statistical power analysis and sample size calculation. It helps in designing studies with appropriate statistical power to detect meaningful effects.
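As a rough illustration of the kind of sample size calculation described above, the sketch below uses the textbook normal approximation for a two-sided, two-sample comparison of means with equal group sizes. It is not G*Power's own procedure; exact t-based calculations typically give a slightly larger number. The function name and defaults are assumptions made for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). Uses the normal
    approximation n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up to a whole participant


print(sample_size_per_group(0.5))  # medium effect, 5% alpha, 80% power
```

Halving the effect size roughly quadruples the required sample, which is why the expected effect is usually the most consequential input.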
JAMOVI
Open-source statistical software that’s frequently used in biostatistics for its simplicity and flexibility in data analysis and visualization.
JASP
Free and open-source statistical software offering a user-friendly interface for both frequentist and Bayesian analyses.
Minitab
A statistical software package with a comprehensive set of statistical tools for data analysis, quality improvement, and experimental design.
SAS (Statistical Analysis System)
A software suite used for advanced analytics, business intelligence, data management, and predictive analytics.
SPSS (Statistical Package for the Social Sciences)
User-friendly software for interactive or batch statistical analysis.
STATA
A statistical software package favored for its efficiency in data management, analysis, and graphics. It’s particularly well-suited for handling large datasets common in biostatistics research.
ggplot2 (R)
An open-source data visualization package for the statistical programming language R.
Matplotlib (Python)
A comprehensive plotting library for creating static, animated, and interactive visualizations in Python.
Plotly (Python/R)
An open-source graphing library for creating interactive plots.
Informatica
A data processing platform for processing and managing large volumes of data from pre-compiled datasets built for research and operational needs.
Keras (Python)
An application programming interface (API) for building and training neural networks.
PyTorch (Python)
A deep learning framework with a strong focus on flexibility.
Scikit-learn (Python)
A machine learning library for Python that offers simple and efficient tools for data mining and analysis, including classification, regression, clustering, and dimensionality reduction.
TensorFlow (Python)
An open-source machine learning framework developed by Google for building and training deep learning models.
MATLAB
A programming language used for numerical computing and simulations.
Python
A programming language widely used for data analysis, machine learning, and scientific computing.
R
A programming language used for statistical analysis and visualization.
SQL
A programming language essential for database management and data manipulation.
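A small sketch of the data manipulation SQL is used for, run through Python's built-in `sqlite3` module against an in-memory database. The `encounters` table, its columns, and the function name are hypothetical and chosen only for illustration; they do not correspond to any system listed in this directory.

```python
import sqlite3


def mean_stay_by_unit(rows):
    """Load (patient_id, unit, length_of_stay) rows and summarize them per unit."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE encounters ("
        "patient_id INTEGER, unit TEXT, length_of_stay REAL)"
    )
    # Parameterized inserts keep data separate from the SQL text.
    conn.executemany("INSERT INTO encounters VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT unit, COUNT(*), AVG(length_of_stay) "
        "FROM encounters GROUP BY unit ORDER BY unit"
    ).fetchall()
    conn.close()
    return result


# Hypothetical example rows: (patient_id, unit, length_of_stay in days)
sample = [(1, "ED", 1.0), (2, "ED", 3.0), (3, "ICU", 5.0)]
print(mean_stay_by_unit(sample))
```

The same `SELECT ... GROUP BY` pattern carries over to larger database systems, although data types and function names vary between SQL dialects.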
Nextflow
A system that enables scalable and reproducible scientific workflows using software containers.
Snakemake
A workflow management system for creating reproducible and scalable data analyses.