Reference Graph
Construction and Merging for
Human Genomic Sequences

27     January 2016 th
Tom Schiller, t_schiller@hotmail.com
Motivation
Motivation

Want to sequence an individual human's DNA

Knowing how an individual's DNA differs from usual DNA can allow diagnosis of genetic disorders

Done by sequencing many short reads

Reads aligned to reference to find individual's variations

Too many wrong assembly possibilities without reference

Graph reference lets us make better use of known information than reference string

Graph corresponds to several references with some similarities and some differences

Particular problem motivating this thesis:
Construct human reference graph

Current implementations for graph generation cannot deal with graphs of such size

Our approach: use human reference string and add local graphs in various positions without having to encode the entire reference as graph

Long sequence to be transformed into graph with local variations

Start with sequential data

Want to add local variations

Split sequence up

Replace one sequence with graph

Merge graphs together

Burrows-Wheeler Transform
Burrows-Wheeler Transform

BWT is transformation of a string
(Burrows and Wheeler, 1994)

Together with suffix array allows us to quickly find patterns in the string

Can be used to find reads in reference

Extended Burrows-Wheeler Transform
Extended Burrows-Wheeler Transform

BWT can be extended to XBW (Ferragina et al., 2009)

XBW can encode trees instead of sequences

XBW can be further extended (Sirén et al., 2014)

Extended XBW can encode directed acyclic graphs

To use extended XBW, add specific start and end nodes

Ensure that graph is reverse deterministic

Make graph prefix-sorted

Some prefixes are ambiguous, so use node splitting

Continue with node splitting

Graph can now be encoded using extended XBW

XBW Node Table
XBW Node Table

Represents a graph in a table

Has rows for BWT, prefixes, and bit-vectors M and F

Create XBW node table for given graph

Start with alphabetically sorted list of prefixes

Each column corresponds to exactly one node

Add BWT row with predecessor labels

Add rows with successor and predecessor amounts

Generated table now completely represents the graph

GGGC|GG|T#GTTTCCC$BWT
GGGC|GG|T#GTTTCCC$BWT
$CGCTG$GCGGCTGG$GGCGGGTGCTGGCTGGGTT#Prefix
1110011111111111M
1111010111111111F
Flat XBW Table
Flat XBW Table

Similar to XBW node table

Prefixes not stored, making it smaller

Each row consists of flat sequential data

Create flat XBW table based on XBW node table

Only keep first letters of prefixes and rename row to “FC”

Copy FC value for each 0 in M

Flatten all cells

Pairs of columns correspond to edges

GGGC|GG|T#GTTTCCC$BWT
$CGCTCCCG$GCGGCTGG$GGCGGGTGCTGGCTGGGTT#PrefixFC
1110011111111111M
1111010111111111F
GGGCGGT#GTTTCCC$BWT
$CCCCGGGGGGTTTT#FC
1110011111111111M
1111010111111111F
Graph Merging Library
Graph Merging Library

Implemented Graph Merging Library containing various algorithms for merging genomic graphs

Visualizes merging algorithms step by step

Merging XBW Node Tables
Merging XBW Node Tables

Core idea: put all columns from two node tables into one table, then sort columns alphabetically by prefixes

TG#A|TG$BWT
$AGT$TT#Prefix
1110111M
1111011F
C#G|AA$BWT
$ACG#Prefix
110111M
111011F
CTTGG|A#A#A|TG$BWT
$ACAGATCGAGCGTTATT#Prefix
1111111110110M
1111101111011F
TG#A|TG$C#G|AA$BWT
$AGT$TT#$ACG#Prefix
1110111110111M
1111011111011F
00000011111Origin

Problem 1: both tables might share prefixes

→ Solution: make prefix-sorted by splitting nodes - can be done locally

Problem 2: column ordering in individual tables slightly different than in merged table

→ Solution: introduce aftersort array

TGA|TG#AC$BWT
$AAATCGT$TA#Prefix
111110111M
111011111F
C#$BWT
$C#Prefix
111M
111F
CGA|TTG#CA$BWT
$AAATC$CTGTATC#Prefix
1111110111M
1110111111F

Problem 3: XBW node tables very big, as prefixes are stored entirely

→ Solution: merge flat XBW tables directly

Merging Flat XBW Tables
Merging Flat XBW Tables

Core idea: similarly to node table merging, put all columns into one table, alphabetically sorted by prefixes

Like with node table merging, aftersort array also used when ordering within first merged table changes

Problem 1: column pairs together, but not entire columns

→ Solution: merge BWT and F independently from FC and M

Problem 2: prefixes are not explicitly given

→ Solution: split nodes before joining columns

Problem 3: merging flat XBW tables can cause large amounts of node splitting

→ Large file size of merged graph

Also, merged graph needs to be constructed node for node

→ Direct opposite of what we wanted to achieve

→ Solution: fuse flat XBW tables instead of merging them

Fusing Flat XBW Tables
Fusing Flat XBW Tables

Core idea: keep tables separate and perform work on full graph by internally working on each table on its own

Problem: program working on fused table needs to keep track of interior tables

→ Solution: expanded navigation functions allow working with fused table without considering implementation details

Pattern Finding in Fused Table
Pattern Finding in Fused Table

Examine pattern finding in fused table in detail

Search for pattern “AC” in graph fused from two tables

Call highest-level navigation function find for pattern AC

Internally call LF for each fused table, here tables 0 and 1

First LF call for C on [0, 4], the entire interval in first table
(using absolute indexing)

Finds any C in the table's FC row, returns [2, 2]

Second LF call for A on interval [2, 2] in first table

Finds any A preceding nodes in the interval [2, 2]

A is found, and whole pattern has been stepped through, so that AC has been found in first table

Third and fourth LF call for C and A, now on second table

Overall result: [([1, 1]; 0), ([1, 1]; 1)]

012345i
G#ACA$BWT
$AACG#FC
110111M
111101F
0123456i
CTGAT#$BWT
$ACGTT#FC
1111101M
1110111F
Results
Results

Overall, merging possible, but complicated

Fusing in most cases preferable

→ Fusing also offers straightforward opportunities for parallelization

Through fusing, original problem of adding variations to long sequential data can be solved

Future Outlook
Future Outlook

Implementation of flat XBW table fusing in C++

Use fusing to generate entire human reference graph

Fusing as proposed here requires exactly one edge between fused graphs

→ Not very problematic, as focus is on long sequential data with local variations

Future work could focus on overcoming this requirement

Questions
Questions

Automated Tests
Automated Tests

Automated tests performed for flat XBW merging and flat XBW fusing

Over 4000 merge tests with random input graphs executed

All tests successful

Matrix of Cyclic Rotations
Matrix of Cyclic Rotations