pylazybam¶
pylazybam.bam module¶
A module for reading and writing BAM format files.
Note that, for convenience, the functions from pylazybam.decoders and pylazybam.tags are also imported into the pylazybam.bam namespace.
-
class pylazybam.bam.FileReader(ubam: BinaryIO)¶
Bases: pylazybam.bam._FileBase
A Pure Python Lazy Bam Parser Class
Parameters: ubam (BinaryIO) – A binary (bytes) file or stream containing a valid uncompressed bam file conforming to the specification.
Yields: align (bytes) – A byte string of a bam alignment entry in raw binary format
-
index_to_ref¶ A dictionary mapping bam reference numeric identifiers to names
Type: Dict[int, str]
-
refs¶ A dictionary of reference_name keys with reference_length
Type: Dict[str, int]
-
ref_to_index¶ A dictionary mapping reference names to the bam numeric identifier
Type: Dict[str, int]
-
sort_order¶ The value of the SO field indicating sort type. Value is as given in the BAM file. Should be one of ‘unknown’, ‘unsorted’, ‘queryname’ or ‘coordinate’
Type: str
Notes
It is advisable not to call the private functions or operate directly on the underlying file object.
The detailed specification for the BAM format can be found at https://samtools.github.io/hts-specs/SAMv1.pdf
Example
This class requires an uncompressed bam file as input. For this example we will use gzip to decompress the test file included as a resource in the package.
>>> import gzip
>>> from pkg_resources import resource_stream
>>> ubam = gzip.open('/tests/data/paired_end_testdata_human.bam')
A bam filereader object can then be created and headers inspected
>>> from pylazybam import bam
>>> mybam = bam.FileReader(ubam)
>>> print(mybam.header)
The filereader object is an iterator and yields alignments in raw format
>>> align = next(mybam)
Alignments can be processed using functions from pylazybam.bam
>>> print(mybam.index_to_ref[bam.get_ref_index(align)],
...       bam.get_pos(align),
...       bam.get_AS(align))
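To make the iteration described above more concrete, here is a minimal standard-library sketch of reading an uncompressed BAM stream lazily, following the layout in the SAM/BAM specification. It is not pylazybam's implementation: the names read_header and iter_alignments are hypothetical, the demo stream is synthetic, and keeping the 4-byte block_size prefix on each yielded record is an assumption.

```python
import io
import struct


def read_header(stream):
    """Read the BAM header and return (SAM text, {reference name: length}).

    Leaves the stream positioned at the first alignment record.
    """
    if stream.read(4) != b"BAM\x01":
        raise ValueError("not an uncompressed BAM stream")
    (l_text,) = struct.unpack("<i", stream.read(4))
    text = stream.read(l_text).decode("utf-8")
    (n_ref,) = struct.unpack("<i", stream.read(4))
    refs = {}
    for _ in range(n_ref):
        (l_name,) = struct.unpack("<i", stream.read(4))
        name = stream.read(l_name).rstrip(b"\x00").decode("ascii")
        (l_ref,) = struct.unpack("<i", stream.read(4))
        refs[name] = l_ref
    return text, refs


def iter_alignments(stream):
    """Yield each record as raw bytes without decoding any field.

    Assumption: the block_size prefix is kept on the yielded bytes.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return
        (block_size,) = struct.unpack("<i", prefix)
        yield prefix + stream.read(block_size)


# Synthetic stream: a real header followed by two dummy payloads that stand in
# for alignment records (they are not valid alignments).
sam_text = b"@HD\tVN:1.6\tSO:coordinate\n"
demo = (
    b"BAM\x01"
    + struct.pack("<i", len(sam_text)) + sam_text
    + struct.pack("<i", 1)
    + struct.pack("<i", 5) + b"chr1\x00" + struct.pack("<i", 1000)
    + struct.pack("<i", 3) + b"abc"
    + struct.pack("<i", 2) + b"de"
)
stream = io.BytesIO(demo)
header_text, refs = read_header(stream)
alignments = list(iter_alignments(stream))
```

Because each record is only sliced out of the stream, no per-field decoding cost is paid until a getter is called on the bytes.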
-
close()¶ Close input file
-
get_full_raw_header()¶ Get a complete BAM header with all elements suitable for writing to a bam.FileWriter
Returns: The complete raw BAM header including the BAM magic and reference block Return type: bytes
-
get_updated_header(id: str, program: str, version: str, command: str = None, description: str = None, raw_header=None)¶ Get a modified version of the header with additional program information in a @PG BAM header suitable for writing to a new output file.
Does not include the BAM magic or the BAM reference block.
Note that the header should be written after self.magic and before self.raw_refs
Parameters: - id (str) – a unique identifier of this action of the program on the BAM
- program (str) – the name of the program used to process the BAM
- version (str) – version number of the program used to process the BAM
- command (str) – the command and arguments to the program used to process the BAM
- description (str) – optional description
- raw_header (bytes) – a raw BAM formatted header without BAM magic or REF block used for recursive modification to add multiple
Notes
Parameters will be utf-8 encoded. If there are existing @PG records, the ID of the last record will be used as PP.
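The @PG chaining described in these notes can be sketched on the SAM text alone. This is an illustration under assumptions, not the library's code: add_pg_line is a hypothetical helper, it works on a decoded text header rather than the raw bytes, and it appends the new record at the end of the header.

```python
def add_pg_line(sam_text, record_id, program, version, command=None, description=None):
    """Append a @PG record, chaining it to the last existing @PG via PP."""
    lines = sam_text.splitlines()
    previous = [line for line in lines if line.startswith("@PG")]
    fields = ["@PG", f"ID:{record_id}", f"PN:{program}"]
    if previous:
        # PP holds the ID of the previous program record in the chain
        for field in previous[-1].split("\t")[1:]:
            if field.startswith("ID:"):
                fields.append(f"PP:{field[3:]}")
    fields.append(f"VN:{version}")
    if command:
        fields.append(f"CL:{command}")
    if description:
        fields.append(f"DS:{description}")
    lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


original = "@HD\tVN:1.6\n@PG\tID:bwa\tPN:bwa\n"
updated = add_pg_line(original, "filter.1", "filter", "0.1", command="filter in.bam")
```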
-
reset_alignments()¶ Reset the file pointer to the beginning of the alignment block
-
update_header(*args, **kwargs)¶ Use get_updated_header to update self.raw_header inplace
See get_updated_header for Parameters and documentation
-
update_header_length(raw_header=None)¶ Update the length of the SAM format text component of the header
Parameters: raw_header (bytes) – optional raw SAM text header as bytes to process. Default: self.raw_header
Returns: the length-corrected raw header if a raw header is provided
Return type: bytes
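In the BAM specification the SAM text is preceded by a little-endian int32 giving its length (l_text), so editing the text requires rewriting that prefix. A minimal sketch, assuming the raw header is stored as the 4-byte length followed by the text (the exact representation used by the library is not stated here, and with_corrected_length is a hypothetical name):

```python
import struct


def with_corrected_length(raw_header: bytes) -> bytes:
    """Return the header with its l_text prefix recomputed from the text.

    Assumption: raw_header = int32 l_text + SAM text.
    """
    text = raw_header[4:]
    return struct.pack("<i", len(text)) + text


sam_text = b"@HD\tVN:1.6\n"
stale = struct.pack("<i", 3) + sam_text  # prefix no longer matches the text
fixed = with_corrected_length(stale)
```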
-
-
class pylazybam.bam.FileWriter(file, raw_header=None, raw_refs=None, mode='wb', compresslevel=6)¶
Bases: pylazybam.bam._FileBase
-
close(*args, **kwargs)¶ Flush and write any data to the BAM file before finalizing and closing
-
get_full_raw_header()¶ Get a complete BAM header with all elements suitable for writing to a bam.FileWriter
Returns: The complete raw BAM header including the BAM magic and reference block Return type: bytes
-
get_updated_header(id: str, program: str, version: str, command: str = None, description: str = None, raw_header=None)¶ Get a modified version of the header with additional program information in a @PG BAM header suitable for writing to a new output file.
Does not include the BAM magic or the BAM reference block.
Note that the header should be written after self.magic and before self.raw_refs
Parameters: - id (str) – a unique identifier of this action of the program on the BAM
- program (str) – the name of the program used to process the BAM
- version (str) – version number of the program used to process the BAM
- command (str) – the command and arguments to the program used to process the BAM
- description (str) – optional description
- raw_header (bytes) – a raw BAM formatted header without BAM magic or REF block used for recursive modification to add multiple
Notes
Parameters will be utf-8 encoded. If there are existing @PG records, the ID of the last record will be used as PP.
-
seekable()¶ Return the seek state of the file
-
tell()¶ Return the current location of the pointer in the file
-
update_header(*args, **kwargs)¶ Use get_updated_header to update self.raw_header inplace
See get_updated_header for Parameters and documentation
-
update_header_length(raw_header=None)¶ Update the length of the SAM format text component of the header
Parameters: raw_header (bytes) – optional raw SAM text header as bytes to process. Default: self.raw_header
Returns: the length-corrected raw header if a raw header is provided
Return type: bytes
-
write(data)¶ Write output to the BAM file. Data is buffered by the underlying bgzf method
Parameters: data (bytes) – The data to be written to the BAM file
-
write_header(raw_header=None, raw_refs=None)¶ Write the header information to the BAM file
Parameters: - raw_header (bytes) – A raw byte format header containing the standard SAM format header Default : self.raw_header
- raw_refs – A raw bytestring containing the reference sequence header data Default : self.raw_refs
Raises: RuntimeError – if the header information has already been written
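The write-once behaviour and the ordering of magic, header and reference block can be illustrated with a small stand-in. HeaderOnceWriter is hypothetical and deliberately simplified: it writes to a plain file object and applies no BGZF compression, unlike the real FileWriter.

```python
import io

BAM_MAGIC = b"BAM\x01"


class HeaderOnceWriter:
    """Hypothetical stand-in for a BAM writer; no BGZF compression is applied."""

    def __init__(self, fileobj, raw_header=None, raw_refs=None):
        self._file = fileobj
        self.raw_header = raw_header
        self.raw_refs = raw_refs
        self._header_written = False

    def write_header(self, raw_header=None, raw_refs=None):
        if self._header_written:
            raise RuntimeError("header information already written")
        header = self.raw_header if raw_header is None else raw_header
        refs = self.raw_refs if raw_refs is None else raw_refs
        # Order: magic, then SAM text header, then reference block
        self._file.write(BAM_MAGIC + header + refs)
        self._header_written = True

    def write(self, data):
        self._file.write(data)


buffer = io.BytesIO()
writer = HeaderOnceWriter(buffer, raw_header=b"<header>", raw_refs=b"<refs>")
writer.write_header()
writer.write(b"<alignment>")
try:
    writer.write_header()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
```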
-
-
pylazybam.bam.get_bin(alignment: bytes) → int¶ Extract the BAI index bin from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The integer value of the index bin Return type: int
-
pylazybam.bam.get_flag(alignment: bytes) → int¶ Extract the alignment flag from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The alignment flag Return type: int Notes
Flag values can be tested on a raw BAM alignment with pylazybam.bam.is_flag(). Common flag values are available from pylazybam.bam.FLAGS
>>> print(pylazybam.bam.FLAGS)
{"paired": 0x1, "aligned": 0x2, "unmapped": 0x4, "pair_unmapped": 0x8, "forward": 0x40, "reverse": 0x80, "secondary": 0x100, "qc_fail": 0x200, "duplicate": 0x400, "supplementary": 0x800,}
See https://samtools.github.io/hts-specs/SAMv1.pdf for details.
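Getters such as get_flag, get_mapq and get_bin read fixed-width fields from the start of the record. The sketch below decodes the whole fixed-length core with struct, using the field order from the BAM specification. It assumes the record bytes begin with the 4-byte block_size prefix, and it returns values exactly as stored (the specification stores positions zero-based), so it may differ by one from the one-based values the library documents. CORE, CORE_FIELDS and decode_core are hypothetical names.

```python
import struct

# Fixed-length core of a BAM record, starting right after the block_size prefix:
# refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq,
# next_refID, next_pos, tlen
CORE = struct.Struct("<iiBBHHHIiii")
CORE_FIELDS = (
    "ref_index", "pos", "len_read_name", "mapq", "bin",
    "number_cigar_operations", "flag", "len_sequence",
    "pair_ref_index", "pair_pos", "template_len",
)


def decode_core(alignment: bytes) -> dict:
    """Decode the fixed-width fields; offset 4 skips the block_size prefix."""
    return dict(zip(CORE_FIELDS, CORE.unpack_from(alignment, 4)))


# Synthetic record holding only the core (no name, CIGAR, sequence or tags)
body = CORE.pack(0, 99, 6, 60, 4681, 1, 0x1 | 0x2 | 0x40, 5, 0, 299, 250)
record = struct.pack("<i", len(body)) + body
core = decode_core(record)
```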
-
pylazybam.bam.get_len_read_name(alignment: bytes) → int¶ Extract the length of the read name from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The length of the read name Return type: int
-
pylazybam.bam.get_len_sequence(alignment: bytes) → int¶ Extract the sequence length from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: the length of the sequence Return type: int Notes
The decoded quality string will be the same length as the sequence
-
pylazybam.bam.get_mapq(alignment: bytes) → int¶ Extract the read mapping quality score from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The integer mapping quality score Return type: int
-
pylazybam.bam.get_number_cigar_operations(alignment: bytes) → int¶ Extract the number of cigar operations from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The number of operations in the cigar string Return type: int
-
pylazybam.bam.get_pair_pos(alignment: bytes) → int¶ Extract the one based position of this read's pair from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The one based left aligned position of this read's pair on the reference Return type: int
-
pylazybam.bam.get_pair_ref_index(alignment: bytes) → int¶ Extract the index identifying the reference sequence of this query sequence's pair from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The zero based rank of the reference in the BAM header Return type: int Notes
The index can be converted to the reference name with pylazybam.bam.FileReader().index_to_ref[index]
-
pylazybam.bam.get_pos(alignment: bytes) → int¶ Extract the one based position of this read
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The one based left aligned position of this read on the reference Return type: int
-
pylazybam.bam.get_raw_base_qual(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes¶ Extract the raw base qualities from a BAM alignment bytestring
Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
- number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
- len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns: The raw base qualities in BAM format as a binary bytestring
Return type: bytes
-
pylazybam.bam.get_raw_cigar(alignment: bytes, len_read_name: int, number_cigar_operations: int) → bytes¶ Extract the raw cigar string from a BAM alignment bytestring
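Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
- number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
Returns: The raw cigar string in BAM format as a binary bytestring
Return type: bytes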
-
pylazybam.bam.get_raw_read_name(alignment: bytes, read_name_length: int) → bytes¶ Extract the raw readname from a BAM alignment bytestring
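Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- read_name_length (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
Returns: The raw readname in BAM format as a binary bytestring
Return type: bytes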
-
pylazybam.bam.get_raw_sequence(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes¶ Extract the raw sequence from a BAM alignment bytestring
Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
- number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
- len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns: The raw sequence in BAM format as a binary bytestring
Return type: bytes
-
pylazybam.bam.get_read_name(alignment: bytes, read_name_length: int) → str¶ Extract the read name in ASCII SAM format from a BAM alignment bytestring
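Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- read_name_length (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
Returns: The read name in ASCII SAM format
Return type: str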
-
pylazybam.bam.get_ref_index(alignment: bytes) → int¶ Extract the reference index from a BAM alignment
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The zero based rank of the reference in the BAM header Return type: int Notes
The index can be converted to the reference name with pylazybam.bam.FileReader().index_to_ref[index]
-
pylazybam.bam.get_tag_bytestring(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes¶ Extract the raw tags from a BAM alignment bytestring
Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
- number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
- len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns: The raw tags in BAM format as a binary bytestring
Return type: bytes
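The three length arguments taken by the get_raw_* functions are what is needed to locate each variable-length section. The following sketch computes the byte ranges from the BAM specification, assuming the record starts with the 4-byte block_size prefix so that the read name begins at offset 36. section_slices is a hypothetical helper, not part of the library.

```python
def section_slices(len_read_name, number_cigar_operations, len_sequence):
    """Return slices for each variable-length section of a raw BAM record."""
    name = slice(36, 36 + len_read_name)  # NUL-terminated read name
    cigar = slice(name.stop, name.stop + 4 * number_cigar_operations)  # uint32 per op
    seq = slice(cigar.stop, cigar.stop + (len_sequence + 1) // 2)  # 4 bits per base
    qual = slice(seq.stop, seq.stop + len_sequence)  # 1 byte per base
    tags = slice(qual.stop, None)  # everything that remains
    return {"name": name, "cigar": cigar, "seq": seq, "qual": qual, "tags": tags}


# Synthetic record: zeroed prefix and core, then name, CIGAR, sequence, quals, one tag
record = bytes(36) + b"read1\x00" + bytes(4) + bytes(3) + bytes(5) + b"ASC*"
slices = section_slices(len_read_name=6, number_cigar_operations=1, len_sequence=5)
tag_bytes = record[slices["tags"]]
```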
-
pylazybam.bam.get_template_len(alignment: bytes) → int¶ Extract the template length from a BAM alignment bytestring
Parameters: alignment (bytes) – A byte string of a bam alignment entry in raw binary format Returns: The integer length of the template (The distance between aligned read pairs) Return type: int
pylazybam.decoders module¶
Format decoders for BAM alignment data types
-
pylazybam.decoders.decode_base_qual(raw_base_qual: bytes, offset: int = 33) → str¶ Decode raw BAM base quality scores into ASCII values
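Parameters: - raw_base_qual (bytes) – The base quality section of a BAM alignment record as bytes eg the output of pylazybam.bam.get_raw_base_qual()
- offset (int) – The offset added to each quality value to produce the ASCII character (default: 33)
Returns: The ASCII encoded SAM representation of the quality scores
Return type: str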
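As a rough illustration of the decoding step, BAM stores one Phred value per byte and SAM prints each value as the character with code value + offset. The function name below is hypothetical, and handling of missing qualities is left out of this sketch.

```python
def decode_base_qual_sketch(raw_base_qual: bytes, offset: int = 33) -> str:
    """Convert raw Phred values to their offset-encoded ASCII characters."""
    return "".join(chr(value + offset) for value in raw_base_qual)


quality_string = decode_base_qual_sketch(bytes([0, 30, 40]))
```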
-
pylazybam.decoders.decode_cigar(raw_cigar: bytes) → str¶ Decode raw BAM cigar strings into ASCII values
Parameters: raw_cigar (bytes) – The cigar section of a BAM alignment record as bytes eg the output of pylazybam.bam.get_raw_cigar() Returns: The ASCII encoded SAM representation of the cigar string Return type: str
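For reference, the BAM specification packs each CIGAR operation into a little-endian uint32 as op_len << 4 | op, with operations numbered in the order MIDNSHP=X. A minimal sketch of that decoding, under a hypothetical name and not taken from the library source:

```python
import struct

CIGAR_OPS = "MIDNSHP=X"


def decode_cigar_sketch(raw_cigar: bytes) -> str:
    """Decode packed CIGAR operations into the SAM text form."""
    count = len(raw_cigar) // 4
    values = struct.unpack(f"<{count}I", raw_cigar)
    # High 28 bits: operation length; low 4 bits: operation code
    return "".join(f"{value >> 4}{CIGAR_OPS[value & 0xF]}" for value in values)


raw = struct.pack("<3I", (5 << 4) | 4, (90 << 4) | 0, (2 << 4) | 1)
cigar = decode_cigar_sketch(raw)
```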
-
pylazybam.decoders.decode_sequence(raw_seq: bytes) → str¶ Decode raw BAM sequence into ASCII values
Parameters: raw_seq (bytes) – The sequence section of a BAM alignment record as bytes eg the output of pylazybam.bam.get_raw_sequence() Returns: The ASCII encoded SAM representation of the query sequence Return type: str
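In the BAM specification bases are packed two per byte, high nibble first, using the code table =ACMGRSVTWYHKDBN. The sketch below shows that unpacking. Unlike decode_sequence it takes the sequence length explicitly so that the padding nibble of an odd-length read can be dropped; that extra parameter and the function name are assumptions of this example.

```python
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def decode_sequence_sketch(raw_seq: bytes, len_sequence: int) -> str:
    """Unpack 4-bit base codes, trimming the pad nibble of odd-length reads."""
    bases = []
    for byte in raw_seq:
        bases.append(SEQ_CODES[byte >> 4])   # first base: high nibble
        bases.append(SEQ_CODES[byte & 0xF])  # second base: low nibble
    return "".join(bases)[:len_sequence]


sequence = decode_sequence_sketch(bytes([0x12, 0x48, 0xF0]), 5)
```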
-
pylazybam.decoders.is_flag(alignment: bytes, flag: int) → bool¶ Test BAM flag values against a BAM alignment bytestring
Parameters: - alignment (bytes) – A byte string of a bam alignment entry in raw binary format
- flag (int) – An integer representing the bitmask to compare to the alignment flag
Returns: Returns true if all bits in the bitmask are set in the flag value
Return type: bool
Notes
Common flag values are available from pylazybam.bam.FLAGS
>>> print(pylazybam.bam.FLAGS)
{"paired": 0x1, "aligned": 0x2, "unmapped": 0x4, "pair_unmapped": 0x8, "forward": 0x40, "reverse": 0x80, "secondary": 0x100, "qc_fail": 0x200, "duplicate": 0x400, "supplementary": 0x800,}
See https://samtools.github.io/hts-specs/SAMv1.pdf for details.
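The "all bits set" rule matters when masks are combined with |. A small sketch of that comparison on an already extracted flag value, using three of the FLAGS entries listed above (all_bits_set is a hypothetical helper that takes the integer flag rather than the raw alignment):

```python
FLAGS_SUBSET = {"paired": 0x1, "aligned": 0x2, "unmapped": 0x4}


def all_bits_set(flag_value: int, mask: int) -> bool:
    """True only when every bit of the mask is present in the flag value."""
    return flag_value & mask == mask


flag_value = FLAGS_SUBSET["paired"] | FLAGS_SUBSET["aligned"]
both = all_bits_set(flag_value, FLAGS_SUBSET["paired"] | FLAGS_SUBSET["aligned"])
mixed = all_bits_set(flag_value, FLAGS_SUBSET["paired"] | FLAGS_SUBSET["unmapped"])
```

Here mixed is False even though one of the two requested bits is set, which is the difference between an "all" test and an "any" test.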
pylazybam.tags module¶
Functions for extracting and decoding SAM tag data from BAM alignments
Extract the high scoring alignment score from an AS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: AS Tag Value – the integer value of the AS tag; returns the value of no_tag if the tag is absent (default: MIN32INT)
Return type:
Raises: ValueError – if more than one tag matches
Notes
When calling on a raw alignment, wrapping the call in try/except is recommended, with a fallback to calling on only the tag byte string.
Please test carefully on your BAM output: in complicated output, the regular expression based extraction of the tag can be error prone.
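The caution about regular expression based extraction can be seen in a small sketch. The code below searches tag bytes for a two-letter tag followed by an integer type code and unpacks the value that follows. It is an illustration rather than the library's implementation: the function name, the exact pattern and the value chosen for MIN32INT are assumptions.

```python
import re
import struct

MIN32INT = -(2 ** 31)  # assumed value of the library's default sentinel

# BAM integer type codes and their struct formats
INT_FORMATS = {b"c": "<b", b"C": "<B", b"s": "<h", b"S": "<H", b"i": "<i", b"I": "<I"}


def get_int_tag_sketch(tag_bytes: bytes, tag: bytes = b"AS", no_tag=MIN32INT):
    """Find one integer tag by pattern search and decode its value."""
    matches = list(re.finditer(re.escape(tag) + rb"([cCsSiI])", tag_bytes))
    if not matches:
        return no_tag
    if len(matches) > 1:
        raise ValueError(f"more than one match for tag {tag!r}")
    match = matches[0]
    return struct.unpack_from(INT_FORMATS[match.group(1)], tag_bytes, match.end())[0]


score = get_int_tag_sketch(b"NMC\x02ASC\x2a")
missing = get_int_tag_sketch(b"NMC\x02", tag=b"XS")
try:
    # A string tag whose text happens to contain "ASC" produces a second match
    get_int_tag_sketch(b"ASC\x2aXAZASCII\x00")
    ambiguous = False
except ValueError:
    ambiguous = True
```

The last call shows why unrelated bytes can trigger the ValueError, and why restricting the search to the tag section reduces, but does not remove, the risk.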
Extract the MD tag from a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: None)
Returns: MD Tag Value – an ASCII string representing the SAM format value of the MD tag; returns the value of no_tag if the tag is absent (default: None)
Return type:
Raises: ValueError – if more than one tag matches
Notes
The MD field aims to achieve SNP/indel calling independent of the reference. Numbers represent runs of matching bases, letters are the reference bases at mismatched positions, and bases preceded by ^ are bases deleted from the reference. A ^T0A indicates a deleted reference T immediately followed by a mismatch at a position where the reference base is A.
When calling on a raw alignment, wrapping the call in try/except is recommended, with a fallback to calling on only the tag byte string.
Please test carefully on your BAM output: in complicated output, the regular expression based extraction of the tag can be error prone.
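To make the MD notation concrete, the following sketch splits an MD string into match runs, mismatches and deletions. parse_md is a hypothetical helper for illustration and is not part of pylazybam.

```python
import re

# A number (run of matches), ^ followed by letters (deletion), or a single letter (mismatch)
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def parse_md(md: str):
    """Split an MD string into (kind, value) operations."""
    operations = []
    for number, deletion, mismatch in MD_TOKEN.findall(md):
        if number:
            operations.append(("match", int(number)))
        elif deletion:
            operations.append(("deletion", deletion))
        else:
            operations.append(("mismatch", mismatch))
    return operations


parsed = parse_md("10A5^AC6")
adjacent = parse_md("^T0A")
```

The zero in ^T0A is what separates the deleted T from the following mismatch letter; without it the A would be read as part of the deletion.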
Extract the suboptimal alignment score from an XS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: XS Tag Value – the integer value of the XS tag; returns the value of no_tag if the tag is absent (default: MIN32INT)
Return type:
Raises: ValueError – if more than one tag matches
Notes
This function is for the genome aligner definition of XS, where XS:i:<int> is the alignment score of the suboptimal alignment. This is not the same as the spliced aligner XS tag XS:C:<str> that represents the strand on which the intron occurs (equiv to TS:C:<str>).
When calling on a raw alignment, wrapping the call in try/except is recommended, with a fallback to calling on only the tag byte string.
Please test carefully on your BAM output: in complicated output, the regular expression based extraction of the tag can be error prone.
Extract the suboptimal alignment score from the ZS tag in a raw BAM alignment bytestring
Parameters: - tag_bytes (bytes) – a bytestring containing bam formatted tag elements
- no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns: ZS Tag Value – the integer value of the ZS tag; returns the value of no_tag if the tag is absent (default: MIN32INT)
Return type:
Raises: ValueError – if more than one tag matches
Notes
ZS is the equivalent of the XS:i:<int> tag in some spliced aligners, including HISAT2.
When calling on a raw alignment, wrapping the call in try/except is recommended, with a fallback to calling on only the tag byte string.
Please test carefully on your BAM output: in complicated output, the regular expression based extraction of the tag can be error prone.
Extract an integer format tag from a raw BAM alignment bytestring
Parameters:
Returns: the integer value of the tag; returns the value of no_tag if the tag is absent (default: MIN32INT)
Return type:
Raises: ValueError – if more than one tag matches
Notes
Potential values for the tag parameter include:
- AM:i:score The smallest template-independent mapping quality of any segment in the same template as this read. (See also SM.)
- AS:i:score Alignment score generated by aligner.
- CP:i:pos Leftmost coordinate of the next hit.
- FI:i:int The index of segment in the template.
- H0:i:count Number of perfect hits.
- H1:i:count Number of 1-difference hits (see also NM).
- H2:i:count Number of 2-difference hits.
- HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.
- IH:i:count Number of alignments stored in the file that contain the query in the current record.
- MQ:i:score Mapping quality of the mate/next segment.
- NH:i:count Number of reported alignments that contain the query in the current record.
- NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch.
- PQ:i:score Phred likelihood of the template, conditional on the mapping locations of both/all segments being correct.
- SM:i:score Template-independent mapping quality, i.e., the mapping quality if the read were mapped as a single read rather than as part of a read pair or template.
- TC:i: The number of segments in the template.
- UQ:i: Phred likelihood of the segment, conditional on the mapping being correct.
Extract a string format tag from a raw BAM alignment bytestring
Parameters:
Returns: an ASCII string representing the SAM format value of the tag; returns the value of no_tag if the tag is absent (default: None)
Return type:
Raises: ValueError – if more than one tag matches
Notes
Potential values for the tag parameter include:
- BQ:Z:qualities Offset to base alignment quality (BAQ), of the same length as the read sequence. At the i-th read base, BAQi = Qi − (BQi − 64) where Qi is the i-th base quality
- CC:Z:rname Reference name of the next hit; ‘=’ for same chromosome.
- E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ.
- FS:Z:str Segment suffix.
- MC:Z:cigar CIGAR string for mate/next segment.
- MD:Z: String for mismatching positions.
- Q2:Z:qualities Phred quality of the mate/next segment sequence in the R2 tag. Same encoding as QUAL.
- R2:Z:bases Sequence of the mate/next segment in the template.
- SA:Z:(rname,pos,strand,CIGAR,mapQ,NM;)+ Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list.
- U2:Z: Phred probability of the 2nd call being wrong conditional on the best being wrong.
- RG:Z:readgroup The read group to which the read belongs.
- LB:Z:library The library from which the read has been sequenced.
- PG:Z:program id Program. Value matches the header PG-ID tag
- PU:Z:platformunit The platform unit in which the read was sequenced.
- CO:Z:text Free-text comments.
- BC:Z:sequence Barcode sequence (Identifying the sample/library), with any quality scores (optionally) stored in QT tag. The BC tag should match the QT tag in length.
- QT:Z:qualities Phred quality of the sample barcode sequence in BC tag. Same encoding as QUAL, i.e., Phred score + 33.
- CB:Z:str Cell identifier, consisting of the optionally-corrected cellular barcode sequence and an optional suffix. The sequence part is similar to the CR tag
- CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY.
- CY:Z:qualities+ Phred quality of the cellular barcode sequence in CR tag. Same encoding as QUAL, i.e., Phred score + 33.
- MI:Z:str Molecular Identifier. A unique ID within the SAM file for the source molecule from which this read is derived.
- OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases, with quality scores (optionally) stored in the BZ tag.
- BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular identifier sequence in the OX tag. Same encoding as QUAL, i.e., Phred score + 33.
- RX:Z:sequence+ Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file.
- QX:Z:qualities+ Phred quality of the unique molecular identifier sequence in the RX tag. Same encoding as QUAL, i.e., Phred score + 33
- OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM;)+ The original alignment information of the record prior to realignment or unalignment by a subsequent tool.
- OC:Z:cigar Original CIGAR, usually before realignment.
- CT:Z:strand;type(;key(=value)?)* Complete read annotation tag used for consensus annotation dummy features.
- PT:Z:annotag(|annotag)* where each annotag matches start;end;strand;type(;key(=value)?)* Read annotations for parts of the padded read sequence.
See https://samtools.github.io/hts-specs/SAMtags.pdf
When calling on a raw alignment, wrapping the call in try/except is recommended, with a fallback to calling on only the tag byte string.
Please test carefully on your BAM output: in complicated output, the regular expression based extraction of the tag can be error prone.
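For string tags, BAM stores the two-letter tag, the type code Z, and a NUL-terminated value. A minimal pattern-based sketch of pulling one such tag out of tag bytes follows. It is an assumption-laden illustration: the function name, the pattern and the use of bytes for the tag argument are not taken from the library.

```python
import re


def get_str_tag_sketch(tag_bytes: bytes, tag: bytes = b"MD", no_tag=None):
    """Return the value of a Z-type tag as str, or no_tag when it is absent."""
    matches = re.findall(re.escape(tag) + rb"Z([^\x00]*)\x00", tag_bytes)
    if not matches:
        return no_tag
    if len(matches) > 1:
        raise ValueError(f"more than one match for tag {tag!r}")
    return matches[0].decode("ascii")


tag_bytes = b"NMC\x01MDZ10A5^AC6\x00RGZgrp1\x00"
md_value = get_str_tag_sketch(tag_bytes, b"MD")
read_group = get_str_tag_sketch(tag_bytes, b"RG")
absent = get_str_tag_sketch(tag_bytes, b"SA")
```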