Working with Real Data#

Note

There is a scripts/ folder in the main GitHub repo with a script named sfs_from_vcf.py. It contains more information and comments for each step described below.

To use data stored in a VCF file, dadi has several built-in functions that allow users to read in and format their data into a site frequency spectrum for use in downstream analyses. These steps include reading in the VCF file using population information (individual assignment to populations), which can handle individuals of any ploidy level based on the number of alleles in their genotype (GT) field.

Generating a frequency spectrum from a VCF file.#
 1import dadi
 2
 3# Put name of VCF file here. Can be gzipped or not.
 4vcf_file = "example.vcf.gz"
 5
 6# Put the name of your population info file
 7popinfo_file = "popinfo.txt"
 8
 9# Read the VCF data into a data dictionary using the population info
10data_dict = dadi.Misc.make_data_dict_vcf(
11    vcf_file,
12    popinfo_file
13)
14
15# Convert the data dictionary into a Spectrum object. A Spectrum object is
16# a wrapper around a NumPy array and is what dadi uses to work with site
17# frequency spectra
18sfs = dadi.Spectrum.from_data_dict(
19    data_dict=data_dict,
20    pop_ids=["pop1", "pop2"]  # names for populations in info file
21    projections=(20, 40)  # number of chromosomes sampled from each pop
22    polarized=False  # Can we determine ancestral vs. derived allelic states
23)
24
25# Write the SFS to file
26sfs.to_file("spectrum.fs")