Welcome to Vcfparser’s documentation!

vcfparser

https://img.shields.io/pypi/v/vcfparser.svg https://img.shields.io/travis/everestial/vcfparser.svg Documentation Status

Python (version >=3.6) package for parsing the genomics and transcriptomics VCF data.

Features

  • No external dependency except python (version >=3.6).
  • Minimalistic in nature.
  • Provides a lot of features to API users.
  • Cython compiling is provided to optimize performance.

Installation

Method A:
VCFsimplify uses vcfparser API, so the package is readily available if VCFsimplify is already installed.

This is only preferred while developing/optimizing VcfSimplify along with vcfparser.

Navigate to the VCFsimplify directory -> activate python -> call the ‘vcfparser’ package.

$ C:\Users\>cd VCF-Simplify
$ C:\Users\>cd VCF-Simplify>dir
  Volume in drive C is StorageDrive
  Volume Serial Number is .........

  Directory of C:\Users\VCF-Simplify

  07/12/2020  10:14 AM    <DIR>          .
  07/12/2020  10:14 AM    <DIR>          ..
  07/12/2020  08:55 AM    <DIR>          .github
  ............................
  ............................
  07/12/2020  10:42 AM    <DIR>          vcfparser
  07/12/2020  08:55 AM             1,494 VcfSimplify.py
          11 File(s)     20,873,992 bytes
          13 Dir(s)  241,211,793,408 bytes free

$ C:\Users\VCF-Simplify>python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from vcfparser import VcfParser
>>>

Method B (preferred method):
Pip is the preferred method of installing and using vcfparser API if custom python scripts/app are being developed.
$ pip install vcfparser

Method C:
For offline install, or in order to build from the source code, follow advance install.

Cythonize (optional but helpful)

The installed “vcfparser” package can be cythonized to optimize performance. Cythonizing the package can increase the speed of the parser by about x.x - y.y (?) times.

TODO: Bhuwan - add required cython method in here

Usage

>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')

Get metadata information from the vcf file

>>> metainfo = vcf_obj.parse_metadata()
>>> metainfo.fileformat
'VCFv4.2'
>>> metainfo.filters_
[{'ID': 'LowQual', 'Description': 'Low quality'}, {'ID': 'my_indel_filter', 'Description': 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0'}, {'ID': 'my_snp_filter', 'Description': 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0'}]
>>> metainfo.alt_
[{'ID': 'NON_REF', 'Description': 'Represents any possible alternative allele at this location'}]
>>> metainfo.sample_names
['ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> metainfo.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']

Get Records from the vcf file

>>> records = vcf_obj.parse_records()
    # Note: Records are returned as generator.
>>> first_record = next(records)
>>> first_record.CHROM
'2'
>>> first_record.POS
'15881018'
>>> first_record.REF
'G'
>>> first_record.ALT
'A,C'
>>> first_record.QUAL
'5082.45'
>>> first_record.FILTER
['PASS']
>>> first_record.get_mapped_samples()
{'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.\n'}}

TODO: Bhuwan (priority - high) The very last example “first_record.get_mapped_samples()” is returning the value of the last sample/key with “n”. i.e: ‘PC’: ‘.n’ Please fix that issue - strip(‘n’) in the line before parsing.


Alternately, we can loop over each record by using a for-loop:

for record in records:
    chrom = record.CHROM
    pos = record.POS
    id = record.ID
    ref = record.REF
    alt = record.ALT
    qual = record.QUAL
    filter = record.FILTER
    format_ = record.format_
    infos = record.get_info_dict()
    mapped_sample = record.get_mapped_samples()
For more specific use cases please check the examples in the following section:
For tutorials in metadata, please follow Metadata Tutorial.
For tutorials in record parser, please follow Record Parser Tutorial.

Installation

Stable release

To install vcfparser, run this command in your terminal:

$ pip install vcfparser

This is the preferred method to install vcfparser, as it will always install the most recent stable release.

If you don’t have pip installed, use this Python installation guide.

From sources

The sources for vcfparser can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/everestial/vcfparser

Or download the tarball:

$ curl  -OL https://github.com/everestial/vcfparser/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

vcfparser

vcfparser package

Submodules

vcfparser.meta_header_parser module

class vcfparser.meta_header_parser.MetaDataParser(header_file)[source]

Bases: object

Parses a meta lines of the vcf files.

parse_lines()[source]

Parse a vcf metadataline

vcfparser.meta_header_parser.split_to_dict(string)[source]

vcfparser.metaviewer module

class vcfparser.metaviewer.MetaDataViewer(vcf_meta_file, filename='vcfmetafile')[source]

Bases: object

print_requested_metadata(metadata_of_interest)[source]
save_as_json()[source]

Convert the dictionary to a json object and write to a file

save_as_orderdict()[source]

Converts the json type dictionary to dictionary that has all the values under same keys in one list of values.

save_as_table()[source]

write data to a file as text

vcfparser.metaviewer.obj_to_dict(metainfo)[source]
vcfparser.metaviewer.unpack_str(s)[source]

vcfparser.record_parser module

class vcfparser.record_parser.Alleles(mapped_samples, tag='GT')[source]

Bases: object

__init__(mapped_samples, tag='GT')[source]

This class is used to store sample names with their types.

class vcfparser.record_parser.GenotypeVal(allele)[source]

Bases: object

__init__(allele)[source]

” For a given genotype data like (‘0/0’, ‘1|1’, ‘0/1’); this class computes and store values like whether it is homref, hom_alt or hetvar

class vcfparser.record_parser.Record(record_values, record_keys)[source]

Bases: object

A class that converts the record lines from input VCF into accessible record object.

__init__(record_values, record_keys)[source]

Initializes the class with header keys and record values.

Parameters:
  • record_keys (list) –
    • list of record keys generated for the record values
    • generated from string in the VCF that starts with #CHROM
    • stays the same for a particular VCF file
  • record_values (list) –
    • list of record values generated from the VCF record line
    • genrated from the lines below # CHROM in VCF file
    • values are dynamically updated in each for-loop
deletion_overlapping_variant()[source]
get_format_to_sample_map(sample_names=None, formats=None, convert_to_iupac=None)[source]
Parameters:
  • sample_names (list) – list of sample names that needs to be processed (default = all samples are processed)
  • formats (list) – list of format tags that needs to be processed (default = all format tags are processed)
  • convert_to_iupac (list) – list of tags (from FORMAT) that needs to be converted into iupac bases (default tag = ‘GT’, default output = numeric bases)
Returns:

dict of filtered sample names along with filtered format “tags:values”

Return type:

dict

Examples

>>> import vcfparser.vcf_parser as vcfparse
>>> myvcf = vcfparse.VcfParser("input_test.vcf")
>>> records = myvcf.parse_records()
>>> record = first(record)
>>> record.mapped_format_to_sample = {'ms01e': {'GT': './.','PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}}
>>> record.get_format_to_sample_map(self, sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
>>> record.get_format_to_sample_map(self, sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC', 'PG'], convert_to_iupac= ['GT', 'PG'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
get_full_record_map(convert_to_iupac=None)[source]

Maps record values with record keys.

Parameters:convert_to_iupac (list) – list of genotpye tags that needs to be converted into iupac bases (default tag = ‘GT’, default output = numeric bases)
Returns:
  • dict – dict with key value pair with sample and infos modified
  • TODO (Done (Gopal) Add example input and output)

Examples

>>> record.get_full_record_map()
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.'}}}
get_info_as_dict(info_keys=None)[source]

Convert Info to dict for required keys

Parameters:info_keys (list) – Keys of interest (default = all keys will be mapped)
Returns:key: value pair of only required keys
Return type:dict

Notes

If ‘=’ isn’t present then it will return its value as ‘.’.

Examples

>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum'
>>> info_keys= ['AC', 'BaseQRankSum']
>>> record.get_info_as_dict(info_keys)
{'AC': '2,0', 'BaseQRankSum': '-7.710e-01'}
get_mapped_tag_list(sample_names=None, tag=None, bases='numeric')[source]
static get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at=None)[source]

Splits the tags of given samples from order_dict of mapped_samples

Parameters:
  • order_mapped_samples (OrderedDict) – Ordered dictionary of FORMAT tags mapped to SAMPLE values.
  • tag (str) – One of the FORMAT tag.
  • sample_names (list) – Name of the samples to extract the values from.
  • split_at (str) – Character to split the value string at. e.g “|”, “/”, “,” etc.
Returns:

List of list containing SAMPLE value for the FORMAT tag

Return type:

list of list

Examples

>>> order_mapped_samples = OrderedDict([('ms01e',{'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0','PI': '.'})])
>>> tag = 'GT'
>>> sample_names = ['ms01e', 'MA622']
>>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names)
[['./.'], ['0/0']]
>>> # using "/|"  # to split at GT values at both | and /
>>> get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at= "/|")
[['.', '.'], ['0', '0']]
hasAllele(allele='0', tag='GT', bases='numeric')[source]
Parameters:
  • allele (str) – allele to check if it is present in given samples(default = ‘0’)
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having given allele

Return type:

dict

Example

>>> record.hasAllele(allele='0', tag='GT', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
hasINDEL()[source]
hasSNP(tag='GT', bases='numeric')[source]
hasVAR(genotype='0/0', tag='GT', bases='numeric')[source]
Parameters:
  • genotype (str) – genotype to check if it is present in given samples(default = ‘0/0’)
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having given genotype

Return type:

dict

Example

>>> record.hasVAR(genotype='0/0')

{‘MA611’: ‘0/0’, ‘MA605’: ‘0/0’, ‘MA622’: ‘0/0’}

has_phased(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having ‘/’ in samples formats

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfpaser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_phased(tag="GT", bases="iupac")
{}
has_unphased(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having ‘/’ in samples formats

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_unphased(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
hasnoVAR(tag='GT')[source]

Returns samples with empty genotype

isHETVAR(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having homoref

Return type:

dict

Examples

>>> record.isHETVAR(tag="GT", bases="numeric")
{}
isHOMREF(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having homoref

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.isHOMREF(tag="GT", bases="iupac")
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
isHOMVAR(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having homoref

Return type:

dict

Examples

>>> record.isHOMVAR(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
isMissing(tag='GT')[source]
Parameters:tag (str) – format tags of interest (default = ‘GT’)
Returns:dict of sample with values having homoref
Return type:dict

Examples

>>> record.isMissing(tag='PI')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}
iupac_to_numeric(ref_alt, genotype_in_iupac)[source]
mapped_rec_to_str(mapped_sample_dict)[source]
static split_genotype_tags()[source]
unmap_fmt_samples_dict(mapped_dict)[source]

Converts mapped dict again into string to write into the file.

vcfparser.vcf_parser module

class vcfparser.vcf_parser.VcfParser(filename)[source]

Bases: object

A class to parse the metadata information and yield records from the input VCF.

parse_metadata()[source]
parse_records()[source]
__init__(filename)[source]
Parameters:filename (file) – input vcf file that needs to be parsed. bgzipped files are also supported.
Returns:VCF object for iterating and querying.
Return type:Object
parse_metadata()[source]

function to parse the metadata information from VCF header.

Returns:MetaDataParser object for iterating and querying the metadata information.
Return type:Object

Uses

MetaDataParser class to create MetaData object

parse_records(chrom=None, pos_range=None, no_processors=1)[source]

Parse records and yield it.

Parameters:
  • chrom (str) – chormosome name or number. Default = None
  • pos_range (tuple) – genomic position of interest, e.g: (5, 15). Both upper and lower limits are inclusive. Default = None
  • no_of_recs (int) – number of records to process

Uses

Record module to create a Record object

Yields:Record object for interating and quering the record information.

vcfparser.vcf_writer module

class vcfparser.vcf_writer.VCFWriter(filename)[source]

Bases: object

A vcf writer to write headerline and datalines into new file

add_contig(id, length, key='contig')[source]
add_filter(id, desc='', key='FILTER')[source]
add_filter_long(id, num='.', type='.', desc='', key='FILTER')[source]
add_format(id, num='.', type='.', desc='', key='FORMAT')[source]
add_header_line(record_keys)[source]
add_info(id, num='.', type='.', desc='', key='INFO')[source]
add_normal_metadata(key, value)[source]

This is used to add normal key value metadata like: fileformat, filedate, refrence

add_record_value(preheader, info, format_, sample_str)[source]

Module contents

Tutorial on MetaData

Advanced tutorial on vcf parser module showing available methods for parsing metadata.

First import VcfParser module and instantiate an vcf object by passing vcf file as an argument.

Initial setup:

>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')

We can also pass gzipped vcf file as an argument.

>>> vcf_obj = VcfParser('input_test.vcf.gz')

VcfParser module has two main methods:

  • parse_metadata: It contains methods for extracting information related to the metadata header.
  • parse_records: It contains methods for retrieving the record values from the vcf file.

Parsing VCF metadata:

To parse the metadata information we can call parse_metadata():

>>> # pass the VCF object to the 'parse_metadata()' function
>>> metainfo = vcf_obj.parse_metadata()

  • Metainfo provides several attributes and objects that helps in extracting specific metadata information.
  • These informations are reported as a list or dictionary.

To list all the available attributes within metainfo do:

>>> metainfo.__dir__()
['header_file', 'infos_', 'filters_', 'contig', 'format_', 'alt_', 'other_lines', 'fileformat', 'reference', 'sample_names', 'is_gvcf', 'gvcf_blocks', 'record_keys', 'VCFspec', 'gatk_commands', 'raw_meta_data', '_format_pattern', '_meta_pattern', 'sample_with_pos', '__module__', '__doc__', '__init__', '_parse_gvcf_block', '_parse_gatk_commands', 'parse_lines', '__dict__', '__weakref__', '__repr__', '__hash__', '__str__', '__getattribute__', '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__new__', '__reduce_ex__', '__reduce__', '__subclasshook__', '__init_subclass__', '__format__', '__sizeof__', '__dir__', '__class__']
>>> # or
>>> dir(metainfo)
['VCFspec', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_format_pattern', '_meta_pattern', '_parse_gatk_commands', '_parse_gvcf_block', 'alt_', 'contig', 'fileformat', 'filters_', 'format_', 'gatk_commands', 'gvcf_blocks', 'header_file', 'infos_', 'is_gvcf', 'other_lines', 'parse_lines', 'raw_meta_data', 'record_keys', 'reference', 'sample_names', 'sample_with_pos', 'testA']

To call the specific objects and attributes do:

>>> metainfo.VCFspec
[{'fileformat': 'VCFv4.2'}, {'GVCF': True}]

>>> metainfo.infos_
[{'ID': 'AF', 'Number': 'A', 'Type': 'Float', 'Description': 'Allele Frequency, for each ALT allele, in the same order as listed'}, {'ID': 'BaseQRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities'}, {'ID': 'ClippingRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases'}, {'ID': 'DP', 'Number': '1', 'Type': 'Integer', 'Description': 'Approximate read depth; some reads may have been filtered'},
{'ID': 'DS', 'Number': '0', 'Type': 'Flag', 'Description': 'Were any of the samples downsampled?'}, {'ID': 'END', 'Number': '1', 'Type': 'Integer', 'Description': 'Stop position of the interval'}, {'ID': 'ExcessHet', 'Number': '1', 'Type': 'Float', 'Description': 'Phred-scaled p-value for exact test of excess heterozygosity'}, {'ID': 'FS', 'Number': '1', 'Type': 'Float', 'Description': "Phred-scaled p-value using Fisher's exact test to detect strand bias"}, {'ID': 'HaplotypeScore', 'Number': '1', 'Type': 'Float', 'Description': 'Consistency of the site with at most two segregating haplotypes'}, {'ID': 'InbreedingCoeff', 'Number': '1', 'Type': 'Float', 'Description': 'Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation'}, {'ID': 'MLEAC', 'Number': 'A', 'Type': 'Integer', 'Description': 'Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed'}, {'ID': 'MLEAF', 'Number': 'A', 'Type': 'Float', 'Description': 'Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed'}, {'ID': 'MQ', 'Number': '1', 'Type': 'Float', 'Description': 'RMS Mapping Quality'}, {'ID': 'MQRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities'}, {'ID': 'QD', 'Number': '1', 'Type': 'Float', 'Description': 'Variant Confidence/Quality by Depth'}, {'ID': 'RAW_MQ', 'Number': '1', 'Type': 'Float', 'Description': 'Raw data for RMS Mapping
Quality'}, {'ID': 'ReadPosRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias'}, {'ID': 'SOR', 'Number': '1', 'Type': 'Float', 'Description': 'Symmetric Odds Ratio of 2x2 contingency table to detect strand bias'}, {'ID': 'set', 'Number': '1', 'Type': 'String', 'Description': 'Source VCF for the merged record in CombineVariants'}, {'ID': 'SF', 'Number': '.', 'Type': 'String', 'Description': 'Source File (index to sourceFiles, f when filtered)'}, {'ID': 'AC', 'Number': '.',
'Type': 'Integer', 'Description': 'Allele count in genotypes'}, {'ID': 'AN', 'Number': '1', 'Type': 'Integer', 'Description': 'Total number of alleles in called genotypes'}, {'ID': 'TS', 'Type': 'Test', 'Description': 'Allele count in genotypes'}]

>>> metainfo.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']

Tutorial on record parser

Advanced tutorial on vcf parser module showing available methods for parsing records.

First import VcfParser module and instantiate an vcf object by passing vcf file as an argument.

Initial setup:

>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')
We can also pass gzipped vcf file as an argument.
>>> vcf_obj = VcfParser('input_test.vcf.gz')

VcfParser module has two main methods:
  • parse_metadata: to extract the metadata information from VCF metadata header.
  • parse_records: to retrieve the record values from the VCF record lines.

Accessing VCF records:

Step 01:

>>> # pass the VCF object to the 'parse_records()' function
>>> records = vcf_obj.parse_records()

Step 02:

Yield record values - Method A: using next()

  • records is an generator object. Therefore, applying next(records) yields the very first record as Record object.
  • Subsequent next(records) will yield subsequent records after that first record from the VCF.
  • parse_records() uses the Record class which can be used directly if record_keys and record_vals are handy.

For more info about Record visit Record.

>>> first_record = next(records)
>>> print(first_record)
2       15881224        .       T       G       143.24  PASS    AC=0;AF=0.036;AN=12;BaseQRankSum=1.75;ClippingRankSum=0.00;DP=591;ExcessHet=3.0103;FS=3.522;InbreedingCoeff=-0.1072;MLEAC=1;MLEAF=0.036;MQ=41.48;MQRankSum=0.366;QD=15.92;ReadPosRankSum=0.345;SF=0,1,2,3,4,5,6;SOR=2.712;set=HignConfSNPs   GT:PM:PG:GQ:AD:PW:PI:PL:PC:PB:DP       ./.:.:./.:.:0:./.:.:.,.,.:.:.:0 0/0:.:0/0:3:1:0/0:.:.,.,.:.:.:1        0/0:.:0/0:12:4:0/0:.:.,.,.:.:.:4        0/0:.:0/0:3:4:0/0:.:.,.,.:.:.:4        0/0:.:0/0:30:17,0:0/0:.:0,30,450:.:.:17 0/0:.:0/0:15:7,0:0/0:.:0,15,225:.:.:7  0/0:.:0/0:39:25,0:0/0:.:0,39,585:.:.:25

Yield record values - Method B: using for-loop

Each record in the VCF can also be accessed on a for-loop

>>> for record in records:
...     print(record)
...     record.POS
...     break
...
2       15881018        .       G       A,C     5082.45 PASS    AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs
        GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC        0/1:5:.:0|1:.:./.:0,0:0,0,0,.,.,.:0:.:.        ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.        1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.        0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.        0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.        0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.
'15881018'

Step 03: Extract data using Record object attribute and methods

Record object also has several attributes and methods which allows us to extract the record values as list or dictionary.

>>> " list of available attribute and methods "
>>> dir(first_record)  # or print(dir(record)) on a for loop
['ALT', 'CHROM', 'FILTER', 'ID', 'POS', 'QUAL', 'REF', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_map_fmt_to_samples', '_to_iupac', 'deletion_overlapping_variant', 'format_', 'get_info_as_dict', 'get_mapped_samples', 'get_mapped_tag_list', 'hasAllele',
'hasINDEL', 'hasSNP', 'hasVAR', 'has_phased', 'has_unphased', 'hasnoVAR', 'info_str', 'isHETVAR', 'isHOMREF', 'isHOMVAR', 'isMissing', 'iupac_to_numeric', 'map_records_long', 'mapped_format_to_sample', 'rec_line', 'record_keys', 'record_vals', 'ref_alt', 'sample_names', 'sample_vals', 'get_tag_values_from_samples', 'unmap_fmt_samples_dict', 'vTest']

Attributes

>>> # available attributes in the "record" object are:
CHROM, POS, REF, ALT, ref_alt, QUAL, FILTER, info_str, format_, sample_names, sample_vals, mapped_format_to_sample

>>> "Access simple position level attribute values as"
>>> first_record.CHROM
'2'
>>> first_record.POS
'15881018'
>>> first_record.REF, first_record.ALT, first_record.QUAL, first_record.FILTER
('G', ['A', 'C'], '5082.45', ['PASS'])
>>> first_record.ref_alt  # call REF and ALT allele together
['C', 'CA']

>>> # keys represented in the "CHROM" line of the VCF
>>> first_record.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> # Note: "record_keys" available within record object are same as the one from metainfo object.
>>> metainfo.record_keys  # from "parse_metadata()"
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>>
>>> first_record.record_values  # record values as list
['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']

>>> "Population level information is provided by the INFO key"
>>> # accessed using 'info_str'
>>> first_record.info_str   # info values as string
'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs'

>>> "Sample level infomation are extracted by matching the FORMAT tags with their corresponding values in the SAMPLE"
>>> first_record.format_  # available tags in FORMAT
['GT', 'PI', 'GQ', 'PG', 'PM', 'PW', 'AD', 'PL', 'DP', 'PB', 'PC']

>>> first_record.sample_names  # sample names
['ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']

>>> first_record.sample_vals  # sample values as list
['./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']

>>> # a default full map of the FORMAT tags to SAMPLE values
>>> first_record.mapped_format_to_sample
OrderedDict([('ms01e', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.',
'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms02g', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms03g', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms04h', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('MA611', {'GT': '0/0', 'AD': '20,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '54', 'DP': '20', 'PB': '.', 'PC': '.', 'PL': '0,54,810'}), ('MA605',
{'GT': '0/0', 'AD': '6,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '18', 'DP': '6', 'PB': '.', 'PC': '.', 'PL': '0,18,206'}), ('MA622', {'GT': '0/0', 'AD': '27,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '72', 'DP': '27', 'PB': '.', 'PC': '.', 'PL': '0,72,1080'})])

Methods on record object

Very specific parsing of the record object can be done using the provided methods. These methods take several args and kwargs to narrow down the information available in the Record object.


>>> "Parse the INFO string data using get_info_as_dict()"
>>> first_record.info_str   # the original info values as string
'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs'
>>> first_record.get_info_as_dict() # info values as dictionary
{'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}

>>> # info_keys can be provided extract specific keys:value
>>> first_record.get_info_as_dict(info_keys= ['AC', 'AF'])
{'AC': '2,0', 'AF': '1.00'}

>>> "More controlled FORMAT tag to SAMPLE value mapping can be done using get_format_to_sample_map()"
>>> # it helps to extract specific FORMAT tag values from specific SAMPLE
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}

>>> "the mapped genotype values can be converted to IUPAC bases using the convert_to_iupac flag"
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'], convert_to_iupac=['GT'])
{'ms01e': {'GT': './.', 'PC': '.', 'GT_iupac': './.'}, 'MA611': {'GT': '0/0', 'PC': '.', 'GT_iupac': 'G/G'}}
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'], convert_to_iupac=['GT', 'PG'])
{'ms01e': {'GT': './.', 'PC': '.', 'GT_iupac': './.', 'PG_iupac': './.'}, 'MA611': {'GT': '0/0', 'PC': '.', 'GT_iupac': 'G/G', 'PG_iupac': 'G/G'}}

>>> # get a full mapping for all the record_keys and FORMAT within SAMPLE
>>> # Note: This mapping is only activated when called with lazy instantiation
>>> first_record.get_full_record_map()
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.'}}}

>>> # full mapping has the option to convert genotype bases to IUPAC
>>> first_record.get_full_record_map(convert_to_iupac= ['GT'])
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.', 'GT_iupac': 'A/A'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}}}
>>> # Note: "convert_to_iupac" will add the genotype tag with suffix "_iupac" to show the genotype in IUPAC bases.

Genotype parsing

Genotype checks and parsing are one of most important use case of VCF data. VcfParser provides several methods to do those checks and extract data.

  • Check samples that have alleles of your interest.

>>> first_record.hasAllele(allele='1', tag= 'GT', bases = 'iupac')
{'ms04h': 'A/A'}

>>> first_record.hasAllele(allele='1', tag= 'GT', bases = 'numeric')
{'ms04h': '1/1'}

>>> first_record.hasAllele(allele='1', tag= 'PG', bases = 'numeric')
{'ms04h': '1/1'}

>>> first_record.hasAllele(allele='0', tag= 'PG', bases = 'numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}

>>> first_record.hasAllele(allele='0', tag= 'PG', bases = 'iupac')
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}

  • Check samples with specific genotype. Both numeric and iupac checks are available.
>>> first_record.hasVAR(genotype='0/0', tag= 'PG', bases = 'numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.hasVAR(genotype='G/G', tag= 'PG', bases = 'iupac')
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
>>> first_record.hasVAR(genotype='1/1', tag= 'PG', bases = 'numeric')
{'ms04h': '1/1'}
>>> first_record.hasVAR(genotype='A/A', tag= 'PG', bases = 'iupac')
{'ms04h': 'A/A'}

>>> # genotypes can be checked in phased state
>>> first_record.hasVAR(genotype='0|0', tag='GT', bases='numeric')
{}

  • Check phased vs unphased genotype. Specific genotype tag can be checked; default is ‘GT’.
>>> first_record.has_phased()
{}
>>> first_record.has_unphased()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag= 'PG')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag='PG', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag= 'PG', bases = 'iupac')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}

  • Return samples with no variants (i.e. contains ‘./.’, ‘.|.’, ‘.’)
>>> first_record.hasnoVAR()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}
>>> first_record.hasnoVAR(tag='GT')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.'}
>>> first_record.hasnoVAR(tag= 'PG')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}

  • Samples with homozygous reference genotypes can be retrieved as.
>>> first_record.isHOMREF(tag='GT', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.isHOMREF(tag='GT', bases='iupac')
{'MA611': 'C/C', 'MA605': 'C/C', 'MA622': 'C/C'}

>>> #if another FORMAT tag also represents a genotype, specific the FORMAT tag
>>> first_record.isHOMREF(tag='PG', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.isHOMREF(tag='PG', bases='iupac')
{'MA611': 'C/C', 'MA605': 'C/C', 'MA622': 'C/C'}

  • Similarly, samples with homozygous variant genotypes can also be retrieved.
>>> first_record.isHOMVAR()
{'ms04h': '1/1'}
>>> first_record.isHOMVAR(tag= 'PG', bases= 'iupac')
{'ms04h': 'A/A'}

  • Samples with heterozygous variant genotypes in given record”
>>> first_record.isHETVAR()
{}

  • This returns samples with missing variants for certain FORMAT tags(i.e. contains ‘./.’, ‘.|.’, ‘.’). Currently we used ‘GT’ tag as default.
>>> first_record.isMissing()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}

>>> # missing checks can be applied to other FORMAT tags too.
>>> first_record.isMissing(tag = 'PI')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}

>>> first_record.isMissing(tag='GQ')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.'}

Credits

Development Lead

Contributors

TODO: Done Add email address?

Indices and tables