pylazybam

pylazybam.bam module

A Module for reading and writing BAM format files

Note that for convenience the functions from pylazybam.decoders and pylazybam.tags are also imported into the pylazybam.bam namespace

class pylazybam.bam.FileReader(ubam: BinaryIO)

Bases: pylazybam.bam._FileBase

A Pure Python Lazy Bam Parser Class

Parameters:ubam (BinaryIO) – An binary (bytes) file or stream containing a valid uncompressed bam file conforming to the specification.
Yields:align (bytes) – A byte string of a bam alignment entry in raw binary format
header

The ASCII representation of the header

Type:str
index_to_ref

A dictionary mapping bam reference numeric identifiers to names

Type:Dict[int:str]
raw_header

The raw bytestring representing the bam header

Type:bytes
raw_refs

The raw bytestring representing the reference sequences

Type:bytes
refs

A dictionary of reference_name keys with reference_length

Type:Dict[str:int]
ref_to_index

A dictionary mapping reference names to the bam numeric identifier

Type:Dict[str:int]
sort_order

The value of the SO field indicating sort type. Value is as given in the BAM file. Should be one of ‘unknown’, ‘unsorted’, ‘queryname’ or ‘coordinate’

Type:str

Notes

It is advisable not to call the private functions or operate directly on the underlying file object.

The detailed specification for the BAM format can be found at https://samtools.github.io/hts-specs/SAMv1.pdf

Example

This class requires an uncompressed bam file as input. For this example we will use gzip to decompress the test file included as a resource in the package

>>> import gzip
>>> from pkg_resources import resource_stream
>>> ubam = gzip.open('/tests/data/paired_end_testdata_human.bam'))

A bam filereader object can then be created and headers inspected

>>> mybam = bam.FileReader(ubam)
>>> print(mybam.header)

The filereader object is an iterator and yields alignments in raw format

>>> align = next(mybam)

Alignments can be processed using functions from pylazybam.bam

>>> print(mybam.index_to_ref[get_ref_index(align)],
>>>       get_pos(align),
>>>       get_AS(align))
close()

Close input file

get_full_raw_header()

Get a complete BAM header with all elements suitable for writing to a bam.FileWriter

Returns:The complete raw BAM header including the BAM magic and reference block
Return type:bytes
get_updated_header(id: str, program: str, version: str, command: str = None, description: str = None, raw_header=None)

Get a modified version of the header with additional program information in a @PG BAM header suitable for writing to a new output file.

Does not include the BAM magic or the BAM reference block.

Note that the header should be written after self.magic and before self.raw_refs

Parameters:
  • id (str) – a unique identifier of this action of the program on the BAM
  • program (str) – the name of the program used to process the BAM
  • version (str) – version number of the program used to process the BAM
  • command (str) – the command and arguments to the program used to process the BAM
  • description (str) – optional description
  • raw_header (bytes) – a raw BAM formatted header without BAM magic or REF block used for recursive modification to add multiple

Notes

Parameters will be utf-8 encoded If there are existing @PG records the last record will be used as PP

reset_alignments()

Reset the file pointer to the beginning of the alignment block

update_header(*args, **kwargs)

Use get_updated_header to update self.raw_header inplace

See get_updated_header for Parameters and documentation

update_header_length(raw_header=None)

Update the length of the SAM format text component of the header

Parameters:raw_header (bytes) – optional raw SAM text header as bytes to process Default self.raw_header
Returns:returns the length corrected raw header if a raw header is provided
Return type:bytes
class pylazybam.bam.FileWriter(file, raw_header=None, raw_refs=None, mode='wb', compresslevel=6)

Bases: pylazybam.bam._FileBase

close(*args, **kwargs)

Flush and write any data to the BAM file before finalizing and closing

get_full_raw_header()

Get a complete BAM header with all elements suitable for writing to a bam.FileWriter

Returns:The complete raw BAM header including the BAM magic and reference block
Return type:bytes
get_updated_header(id: str, program: str, version: str, command: str = None, description: str = None, raw_header=None)

Get a modified version of the header with additional program information in a @PG BAM header suitable for writing to a new output file.

Does not include the BAM magic or the BAM reference block.

Note that the header should be written after self.magic and before self.raw_refs

Parameters:
  • id (str) – a unique identifier of this action of the program on the BAM
  • program (str) – the name of the program used to process the BAM
  • version (str) – version number of the program used to process the BAM
  • command (str) – the command and arguments to the program used to process the BAM
  • description (str) – optional description
  • raw_header (bytes) – a raw BAM formatted header without BAM magic or REF block used for recursive modification to add multiple

Notes

Parameters will be utf-8 encoded If there are existing @PG records the last record will be used as PP

seekable()

Return the seek state of the file

tell()

Return the current location of the pointer in the file

update_header(*args, **kwargs)

Use get_updated_header to update self.raw_header inplace

See get_updated_header for Parameters and documentation

update_header_length(raw_header=None)

Update the length of the SAM format text component of the header

Parameters:raw_header (bytes) – optional raw SAM text header as bytes to process Default self.raw_header
Returns:returns the length corrected raw header if a raw header is provided
Return type:bytes
write(data)

Write output to the BAM file. Data is buffered by the underlying bgzf method

Parameters:data (bytes) – The data to be written to the BAM file
write_header(raw_header=None, raw_refs=None)

Write the header information to the BAM file

Parameters:
  • raw_header (bytes) – A raw byte format header containing the standard SAM format header Default : self.raw_header
  • raw_refs – A raw bytestring containing the reference sequence header data Default : self.raw_refs
Raises:

RuntimeError – Raises a runtime error if header information already written

pylazybam.bam.get_bin(alignment: bytes) → int

Extract the BAI index bin from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The integer value of the index bin
Return type:int
pylazybam.bam.get_flag(alignment: bytes) → int

Extract the alignment flag from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The alignment flag
Return type:int

Notes

Flag values can be tested on raw BAM alignment with pylazybam.bam.is_flag() Common flag values are available from pylazybam.bam.FLAGS

>>> print(pylazybam.bam.FLAGS)
{"paired": 0x1,
"aligned": 0x2,
"unmapped": 0x4,
"pair_unmapped": 0x8,
"forward": 0x40,
"reverse": 0x80,
"secondary": 0x100,
"qc_fail": 0x200,
"duplicate": 0x400,
"supplementary": 0x800,}

See https://samtools.github.io/hts-specs/SAMv1.pdf for details.

pylazybam.bam.get_len_read_name(alignment: bytes) → int

Extract the length of the read name from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The length of the read name
Return type:int
pylazybam.bam.get_len_sequence(alignment: bytes) → int

Extract the sequence length from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:the length of the sequence
Return type:int

Notes

The decoded quality string will be the same length as the sequence

pylazybam.bam.get_mapq(alignment: bytes) → int

Extract the read mapping quality score from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The integer mapping quality score
Return type:int
pylazybam.bam.get_number_cigar_operations(alignment: bytes) → int

Extract the number of cigar operations from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The number of operations in the cigar string
Return type:int
pylazybam.bam.get_pair_pos(alignment: bytes) → int

Extract the one based position of this reads pair from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The one based left aligned position of this reads pair on the reference
Return type:int
pylazybam.bam.get_pair_ref_index(alignment: bytes) → int

Extract the index identifying the reference sequence of this query sequences pair from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The zero based rank of the reference in the BAM header
Return type:int

Notes

The index can be converted to the reference name with pylazybam.bam.FileReader().index_to_ref[index]

pylazybam.bam.get_pos(alignment: bytes) → int

Extract the one based position of this read

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The one based left aligned position of this read on the reference
Return type:int
pylazybam.bam.get_raw_base_qual(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes

Extract the raw base qualities from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
  • number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
  • len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns:

The raw base qualities in BAM format as a binary bytestring

Return type:

bytes

pylazybam.bam.get_raw_cigar(alignment: bytes, len_read_name: int, number_cigar_operations: int) → bytes

Extract the raw cigar string from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
  • number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
Returns:

The raw base cigar string in BAM format as a binary bytestring

Return type:

bytes

pylazybam.bam.get_raw_read_name(alignment: bytes, read_name_length: int) → bytes

Extract the raw readname from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
Returns:

The raw base readname in BAM format as a binary bytestring

Return type:

bytes

pylazybam.bam.get_raw_sequence(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes

Extract the raw sequence from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
  • number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
  • len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns:

The raw sequence in BAM format as a binary bytestring

Return type:

bytes

pylazybam.bam.get_read_name(alignment: bytes, read_name_length: int) → str

Extract the read name in ASCII SAM format from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
Returns:

The read name in ASCII SAM format

Return type:

str

pylazybam.bam.get_ref_index(alignment: bytes) → int

Extract the reference index from a BAM alignment

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The zero based rank of the reference in the BAM header
Return type:int

Notes

The index can be converted to the reference name with pylazybam.bam.FileReader().index_to_ref[index]

pylazybam.bam.get_tag_bytestring(alignment: bytes, len_read_name: int, number_cigar_operations: int, len_sequence: int) → bytes

Extract the raw tags from a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • len_read_name (int) – The length of the readname string eg from pylazybam.bam.get_len_read_name()
  • number_cigar_operations (int) – The number of cigar operations eg from pylazybam.bam.get_number_cigar_operations()
  • len_sequence (int) – The length of the sequence and quality score strings eg from pylazybam.bam.get_len_sequence
Returns:

The raw tags in BAM format as a binary bytestring

Return type:

bytes

pylazybam.bam.get_template_len(alignment: bytes) → int

Extract the template length from a BAM alignment bytestring

Parameters:alignment (bytes) – A byte string of a bam alignment entry in raw binary format
Returns:The integer length of the template (The distance between aligned read pairs)
Return type:int

pylazybam.decoders module

Format decoders for BAM alignment data types

pylazybam.decoders.decode_base_qual(raw_base_qual: bytes, offset: int = 33) → str

Decode raw BAM base quality scores into ASCII values

Parameters:
  • raw_base_qual (bytes) – The base quality section of a BAM alignment record as bytes eg the output from pylazybam.bam.get_raw_base_qual()
  • offset (int) – The offset to add to the quality values when converting to ASCII
Returns:

The ASCII encoded SAM representation of the quality scores

Return type:

str

pylazybam.decoders.decode_cigar(raw_cigar: bytes) → str

Decode raw BAM cigar strings into ASCII values

Parameters:raw_cigar (bytes) – The cigar section of a BAM alignment record as bytes eg the output of pylazybam.bam.get_raw_cigar()
Returns:The ASCII encoded SAM representation of the cigar string
Return type:str
pylazybam.decoders.decode_sequence(raw_seq: bytes) → str

Decode raw BAM sequence into ASCII values

Parameters:raw_seq (bytes) – The sequence section of a BAM alignment record as bytes eg the output of pybam.bam.get_raw_sequence()
Returns:The ASCII encoded SAM representation of the query sequence
Return type:str
pylazybam.decoders.is_flag(alignment: bytes, flag: int) → bool

Test BAM flag values against a BAM alignment bytestring

Parameters:
  • alignment (bytes) – A byte string of a bam alignment entry in raw binary format
  • flag – An integer representing the bitmask to compare to the
Returns:

Returns true if all bits in the bitmask are set in the flag value

Return type:

bool

Notes

Common flag values are available from pylazybam.bam.FLAGS

>>> print(pylazybam.bam.FLAGS)
{"paired": 0x1,
"aligned": 0x2,
"unmapped": 0x4,
"pair_unmapped": 0x8,
"forward": 0x40,
"reverse": 0x80,
"secondary": 0x100,
"qc_fail": 0x200,
"duplicate": 0x400,
"supplementary": 0x800,}

See https://samtools.github.io/hts-specs/SAMv1.pdf for details.

pylazybam.tags module

Functions for extracting and decoding SAM tag data from BAM alignments

pylazybam.tags.get_AS(tag_bytes: bytes, no_tag: Any = -2147483648) → int

Extract the high scoring alignment score from an AS tag in a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns:

AS Tag Value – the integer value of the AS tag returns the value of no_tag if tag absent (default:MIN32INT)

Return type:

int

Raises:

ValueError – raises a ValueError if more than one tag match

Notes

Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.

Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone

pylazybam.tags.get_MD(tag_bytes: bytes, no_tag: Any = None) → str

Extract the MD tag from a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • no_tag (Any) – return value for when tag not found (default: None)
Returns:

MD Tag Value – an ASCII string representing the SAM format value of the MD tag returns the value of no_tag if tag absent (default: None)

Return type:

str

Raises:

ValueError – raises a ValueError if more than one tag match

Notes

The MD field aims to achieve SNP/indel calling independent of the reference. Numbers represent matches, letters bases that differ from the reference and bases preceeded by ^ are deletions. A ^T0A indicates a base change to an A immediately following a deleted T.

Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.

Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone

pylazybam.tags.get_XS(tag_bytes: bytes, no_tag: Any = -2147483648) → int

Extract the suboptimal alignment score from an XS tag in a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns:

XS Tag Value – the integer value of the XS tag returns the value of no_tag if tag absent (default:MIN32INT)

Return type:

int

Raises:

ValueError – raises a ValueError if more than one tag match

Notes

This function is for the genome aligner definition of XS where XS:i:<int> is the alignment score of the suboptimal alignment. This is not the same as the spliced aligner XS tag XS:C:<str> that represents the strand on which the intron occurs (equiv to TS:C:<str>)

Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.

Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone

pylazybam.tags.get_ZS(tag_bytes: bytes, no_tag: Any = -2147483648) → int

Extract the suboptimal alignment score from the ZS tag in a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns:

ZS Tag Value – the integer value of the ZS tag returns the value of no_tag if tag absent (default:MIN32INT)

Return type:

int

Raises:

ValueError – raises a ValueError if more than one tag match

Notes

ZS is the equivalent to XS:i:<int> tag in some spliced aligners including HISAT2.

Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.

Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone

pylazybam.tags.get_int_tag(tag_bytes: bytes, tag: bytes, no_tag: Any = -2147483648) → int

Extract an integer format tag from a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • tag (bytes) – the two byte tag to be returned eg b’AS’ The tag must represent an integer format tag
  • no_tag (Any) – return value for when tag not found (default: MIN32INT)
Returns:

the integer value of the tag returns the value of no_tag if tag absent (default:MIN32INT)

Return type:

int

Raises:

ValueError – raises a ValueError if more than one tag match

Potential values for the tag parameter include:

AM:i:score The smallest template-independent mapping quality of any
segment in the same template as this read. (See also SM.)

AS:i:score Alignment score generated by aligner.

CP:i:pos Leftmost coordinate of the next hit.

FI:i:int The index of segment in the template.

H0:i:count Number of perfect hits.

H1:i:count Number of 1-difference hits (see also NM).

H2:i:count Number of 2-difference hits.

HI:i:i Query hit index, indicating the alignment record is the
i-th one stored in SAM.
IH:i:count Number of alignments stored in the file that contain the
query in the current record.

MQ:i:score Mapping quality of the mate/next segment.

NH:i:count Number of reported alignments that contain the query in
the current record.
NM:i:count Number of differences (mismatches plus inserted and deleted
bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch.
PQ:i:score Phred likelihood of the template, conditional on the mapping
locations of both/all segments being correct.
SM:i:score Template-independent mapping quality, i.e., the mapping
quality if the read were mapped as a single read rather than as part of a read pair or template.

TC:i: The number of segments in the template.

UQ:i: Phred likelihood of the segment,
conditional on the mapping being correct.

see https://samtools.github.io/hts-specs/SAMtags.pdf

pylazybam.tags.get_str_tag(tag_bytes: bytes, tag: bytes, no_tag: Any = None) → str

Extract an integer format tag from a raw BAM alignment bytestring

Parameters:
  • tag_bytes (bytes) – a bytestring containing bam formatted tag elements
  • tag (bytes) – the two byte tag to be returned eg b’AS’ The tag must represent an integer format tag
  • no_tag (Any) – return value for when tag not found (default: None)
Returns:

MD Tag Value – an ASCII string representing the SAM format value of the MD tag returns the value of no_tag if tag absent (default: None)

Return type:

str

Raises:

ValueError – raises a ValueError if more than one tag match

Notes

Potential values for the tag parameter include:

BQ:Z:qualities Offset to base alignment quality (BAQ), of the same
length as the read sequence. At the i-th read base, BAQi = Qi − (BQi − 64) where Qi is the i-th base quality

CC:Z:rname Reference name of the next hit; ‘=’ for same chromosome.

E2:Z:bases The 2nd most likely base calls.
Same encoding and same length as SEQ.

FS:Z:str Segment suffix.

MC:Z:cigar CIGAR string for mate/next segment.

MD:Z: String for mismatching positions.

Q2:Z:qualities Phred quality of the mate/next segment sequence in the
R2 tag. Same encoding as QUAL.

R2:Z:bases Sequence of the mate/next segment in the template.

SA:Z: (rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+
Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list.
U2:Z: Phred probability of the 2nd call being wrong
conditional on the best being wrong.

RG:Z:readgroup The read group to which the read belongs.

LB:Z:library The library from which the read has been sequenced.

PG:Z:program id Program. Value matches the header PG-ID tag

PU:Z:platformunit The platform unit in which the read was sequenced.

CO:Z:text Free-text comments.

BC:Z:sequence Barcode sequence (Identifying the sample/library),
with any quality scores (optionally) stored in QT tag. The BC tag should match the QT tag in length.
QT:Z:qualities Phred quality of the sample barcode sequence in BC tag.
Same encoding as QUAL, i.e., Phred score + 33.
CB:Z:str Cell identifier, consisting of the optionally-corrected
cellular barcode sequence and an optional suffix. The sequence part is similar to the CR tag
CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the
cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY.
CY:Z:qualities+ Phred quality of the cellular barcode sequence in CR tag
Same encoding as QUAL, i.e., Phred score + 33.
MI:Z:str Molecular Identifier. A unique ID within the SAM file
for the source molecule from which this read is derived.
OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases,
with quality scores (optionally) stored in the BZ tag.
BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular
identifier sequence in the OX tag. Same encoding as QUAL, i.e., Phred score + 33.
RX:Z:sequence+ Sequence bases from the unique molecular identifier.
These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file.
QX:Z:qualities+ Phred quality of the unique molecular identifier
sequence in the RX tag. Same encoding as QUAL, i.e., Phred score + 33
OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM;)+ The original alignment
information of the record prior to realignment or unalignment by a subsequent tool.

OC:Z:cigar Original CIGAR, usually before realignment.

CT:Z:strand ;type (;key (=value )?)* Complete read annotation tag
used for consensus annotation dummy features.
PT:Z:annotag(|annotag)* where each annotag matches
start;end;strand;type(;key(=value)?)* Read annotations for parts of the padded read sequence.

see https://samtools.github.io/hts-specs/SAMtags.pdf

Recommended try accept for use on raw alignment with fall back to calling on only the tag byte string.

Please test carefully on your BAM output as in complicated output the regular expression based extraction of the tag can be error prone