Intro

Hunting for data

Inspecting the VCF

Finding population labels for the samples

Parsing VCF with pysam

Going from alleles to numbers for a numpy array

When to work in colab versus python script

Saving data with pandas

Adding population labels from the panel file

To Colab!

PCA

First plot! Mission accomplished

Using Altair for plotting with labels

Second plot with population labels!

Merging with the igsr_population.tsv data

TSNE

Exercise: PCA on the SNPs

Conclusion and origin story for this project

Description:

This bioinformatics project walkthrough explores the relationship between genes and geography using population genotype data from the 1000 Genomes Project. It covers downloading and inspecting VCF files, parsing them with pysam, converting alleles to numbers in a numpy array, and saving the data with pandas. The work moves between a Python script and Google Colab, where Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are run on the genotype data and plotted with matplotlib and Altair. Data points are colored by the population labels from the panel file, and additional population information is merged in from igsr_population.tsv. The video ends with an exercise on running PCA on the SNPs and the origin story behind the project.
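As a rough illustration of the "alleles to numbers" step, here is a minimal sketch. It is not the code from the video. It reads VCF text with plain string splitting instead of pysam so that it runs without extra dependencies. The sample names and variants are made up, the helper names `encode_genotype` and `vcf_to_matrix` are hypothetical, and encoding each genotype as a count of non-reference alleles is an assumption about one reasonable encoding.

```python
# Tiny made-up VCF fragment (NOT real 1000 Genomes data), tab-separated.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\tsampleC
21\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1
21\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|0\t.|.\t0|0
"""


def encode_genotype(gt):
    """Count non-reference alleles in a GT string like '0|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("/", "|").split("|")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_matrix(text):
    """Return (sample_names, matrix) with one row per sample, one column per variant."""
    samples, rows = [], []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        gt_index = fields[8].split(":").index("GT")
        rows.append(
            [encode_genotype(call.split(":")[gt_index]) for call in fields[9:]]
        )
    # rows are variants x samples; transpose to samples x variants
    matrix = [list(column) for column in zip(*rows)]
    return samples, matrix


samples, matrix = vcf_to_matrix(VCF_TEXT)
for name, row in zip(samples, matrix):
    print(name, row)
```

A samples-by-variants matrix like this is the shape that PCA and t-SNE implementations generally expect, once missing values have been filtered out or imputed.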

Genes and Geography - A Bioinformatics Project

OMGenomics

#Data Science #Bioinformatics #Data Analysis #Programming #Programming Languages #Python #Matplotlib #pandas #NumPy #Science #Biology #Genetics