BWAの使い方。

bwa mem, samtools sort, samtools index

これまでに準備したファイルを使って、BWAを用いたリファレンスゲノムへのWGSデータのマッピングを行います。
今回使う主なコマンドは 1. bwa mem 2. samtools sort 3. samtools index の３つ。

Twitterで記事の更新をお知らせしているので、興味を持たれた方は是非フォローをお願いします。フォローする @harrykun_blog

0. 作業するディレクトリへ移動する。前回の続きでフォルダの中身は

$ cd /path/to/working_dir/
$ ls
ID_list.txt  ERS_list.txt　3D7_genomes fasta fastq

1. 必要なツールをインストールする。

#samファイルを操作するツールsamtoolsをインストールする。
$ conda install -c bioconda samtools

#マッピングツールのbwaをインストールする。
conda install -c bioconda bwa

正しくインストールされたか確認する。

$ samtools

/opt/anaconda3/bin/samtools

が表示される。

eupatho-bioinfomatics.hatenablog.com

2. リファレンスゲノムへのマッピングを行う。
bwaでは様々なパラメーターが存在するが、ここでは詳細は説明しない...というかできません。いつか詳しく調べる機会があれば、その時に記事にしたいと思います。マッピングツールにはBWA以外にもHISAT2やBowtie2などあり、それぞれ得意不得意があるようです。

$ bwa mem

を実行すると、オプションの一覧が見られる。マニュアルはこちらのリンク

Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

       -t INT        number of threads [1]
       -k INT        minimum seed length [19]
       -w INT        band width for banded alignment [100]
       -d INT        off-diagonal X-dropoff [100]
       -r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
       -y INT        seed occurrence for the 3rd round seeding [20]
       -c INT        skip seeds with more than INT occurrences [500]
       -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
       -W INT        discard a chain if seeded bases shorter than INT [0]
       -m INT        perform at most INT rounds of mate rescues for each read [50]
       -S            skip mate rescue
       -P            skip pairing; mate rescue performed unless -S also in use

Scoring options:

       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
       -B INT        penalty for a mismatch [4]
       -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
       -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
       -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
       -U INT        penalty for an unpaired read pair [17]

       -x STR        read type. Setting -x changes multiple parameters unless overridden [null]
                     pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0  (PacBio reads to ref)
                     ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0  (Oxford Nanopore 2D-reads to ref)
                     intractg: -B9 -O16 -L5  (intra-species contigs to ref)

Input/output options:

       -p            smart pairing (ignoring in2.fq)
       -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
       -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
       -o FILE       sam file to output results to [stdout]
       -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
       -5            for split alignment, take the alignment with the smallest coordinate as primary
       -q            do not modify mapQ of supplementary alignments
       -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []

       -v INT        verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
       -T INT        minimum score to output [30]
       -h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
       -a            output all alignments for SE or unpaired PE
       -C            append FASTA/FASTQ comment to SAM output
       -V            output the reference FASTA header in the XR tag
       -Y            use soft clipping for supplementary alignments
       -M            mark shorter split hits as secondary

       -I FLOAT[,FLOAT[,INT[,INT]]]
                     specify the mean, standard deviation (10% of the mean if absent), max
                     (4 sigma from the mean if absent) and min of the insert size distribution.
                     FR orientation only. [inferred]

Note: Please read the man page for detailed description of the command line and options.

BWAの'-t'とsamtoolsの'-@'の後ろの数字は、使用するCPUのスレッド数を意味するので、自分の環境に応じて数値を変更する。マッピング後のsam形式のファイルは大変重いので、コマンドを|でつないで直接bam形式のファイルに変換する。

インデックスのパスを毎回指定するのは面倒なので、変数に入れておくと便利である。
私の場合、リファレンスデータを入れるフォルダを一つにして、そこに生物ごとのフォルダを作って中にゲノムデータを入れるのが好みです。

Index='/path/to/workinddir/3D7_genomes/PlasmoDB-52_Pfalciparum3D7_Genome.fasta'
mkdir bam #bamファイルを入れるフォルダを作成する。

#bwaのコマンドを入力する。
#基本はbwa mem  -M reference .fasta > .samの形
while read line;do
bwa mem -R "@RG\tID:L\tSM:"line"\tPL:illumina\tLB:lib1\tPU:unit1" -t 16 \
  -M $Index ./fastq/"$line"_1.fastq.gz ./fastq/"$line"_2.fastq.gz | \
  samtools sort -@ 16 > "$line".sort.bam
  samtools index $line.sort.bam
  done < ID.txt

ちなみにMacでのバックスラッシュの\の入力は'option+¥'で可能である。
-R以下で指定する Readgroup等は、後ほどSNP解析で必要になるのであらかじめ付加しておく。 RGとSMはサンプルごとに一意の値を取る必要があることに注意する。

参考_BAM fileにRead Groupを付ける（GATKへの対応）

正常に終わるとfastqの中に以下のファイルが生成される。　(解析が終わったら追記します）