liuyujie0136's Website


A website for self learning, collecting and sharing.


Understanding read counts, RPKM/FPKM, RPM (CPM), and TPM

https://www.jianshu.com/p/c25e84383ae3

0. Background

feature

Two types of sequencing bias

1. Methods for calculating abundance

Read count

RPKM/FPKM

Definition:

Formula (1): RPKM=\frac{ExonMappedReads \times 10^9}{TotalMappedReads \times ExonLength}

Formula (2): FPKM=\frac{ExonMappedFragments \times 10^9}{TotalMappedFragments \times ExonLength}

Formula (1) for RPKM can be derived from the following:

Formula (3): RPKM=\frac{(ExonMappedReads / TotalMappedReads) \times 10^6}{ExonLength} \times 10^3

Equivalently: RPKM = (RPM / ExonLength) * 10^3, where RPM = ExonMappedReads / TotalMappedReads * 10^6 and ExonLength is in base pairs.
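As a quick numeric check, the sketch below evaluates Formula (1) and the RPM-based form for a single exon and confirms that they agree. The counts and exon length are made-up illustration values, not taken from the source, and the function names are hypothetical.

```python
import math

def rpkm_direct(exon_mapped_reads, total_mapped_reads, exon_length_bp):
    """Formula (1): reads * 10^9 / (total mapped reads * exon length in bp)."""
    return exon_mapped_reads * 1e9 / (total_mapped_reads * exon_length_bp)

def rpm(exon_mapped_reads, total_mapped_reads):
    """Reads per million mapped reads."""
    return exon_mapped_reads / total_mapped_reads * 1e6

def rpkm_from_rpm(exon_mapped_reads, total_mapped_reads, exon_length_bp):
    """Formula (3): RPM divided by exon length in bp, scaled to per kilobase."""
    return rpm(exon_mapped_reads, total_mapped_reads) / exon_length_bp * 1e3

# Made-up illustration values (assumptions, not from the source).
reads, total, length_bp = 500, 20_000_000, 2_000

direct = rpkm_direct(reads, total, length_bp)
via_rpm = rpkm_from_rpm(reads, total, length_bp)

# The two forms should agree up to floating-point rounding.
assert math.isclose(direct, via_rpm)
print(direct, via_rpm)
```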

Formula (4): RPKM=\frac{(ExonMappedReads \times ReadLength) / ExonLength \times 10^9}{(TotalMappedReads \times ReadLength / GenomeLength)}

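Explanation: ExonMappedReads is the count of reads aligned to the exon; ReadLength is the length of a sequencing read; TotalMappedReads is the total count of all reads aligned to the genome; ExonLength is the length of the exon. ReadLength cancels out. GenomeLength is the total length of the genome; because the samples share the same genome, it is a constant and can also be dropped.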

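In Formula (4), TotalMappedReads * ReadLength / GenomeLength is the per-base sequencing depth across the genome, and ExonMappedReads * ReadLength / ExonLength can be thought of simply as the per-base "sequencing depth" of that exon. Dividing the second by the first gives the exon's per-base "sequencing depth" normalized against the genome. Because the samples are usually from the same species, with the same genome and the same read length, ReadLength cancels and GenomeLength is a shared constant that can be dropped, which turns Formula (4) into Formula (1). The differences between samples caused by exon length and by sequencing depth can then both be removed. In my view, RPKM/FPKM should therefore be able to remove both types of bias.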
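Written out, the reduction from Formula (4) to Formula (1) looks as follows. This is only a sketch of the algebra, under the assumption stated above that GenomeLength is the same constant for every sample. Note that ReadLength cancels exactly, whereas GenomeLength remains as a constant factor that is dropped by convention rather than cancelled.

```latex
\begin{align*}
\frac{(ExonMappedReads \times ReadLength) / ExonLength \times 10^9}
     {TotalMappedReads \times ReadLength / GenomeLength}
&= \frac{ExonMappedReads \times 10^9}{TotalMappedReads \times ExonLength}
   \times GenomeLength \\
&= RPKM_{\text{Formula (1)}} \times GenomeLength
\end{align*}
```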

RPM

TPM

2. Relationships between the measures

3. Use TPM!

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase of transcript per Million mapped reads) or FPKM (Fragments Per Kilobase of transcript per Million mapped reads). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:

  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
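The three RPKM steps above can be sketched in a few lines of Python. The gene names, counts, and lengths are made-up illustration values, and the sketch assumes that the total reads in the sample equal the sum of the per-gene counts passed in.

```python
def rpkm(counts, lengths_bp):
    """Compute RPKM per gene for one sample.

    counts:     dict mapping gene name -> read count
    lengths_bp: dict mapping gene name -> gene length in base pairs
    """
    # Step 1: "per million" scaling factor from the total reads in the sample.
    per_million = sum(counts.values()) / 1_000_000
    # Step 2: reads per million (RPM), which normalizes for sequencing depth.
    rpm = {gene: n / per_million for gene, n in counts.items()}
    # Step 3: divide each RPM value by the gene length in kilobases.
    return {gene: rpm[gene] / (lengths_bp[gene] / 1_000) for gene in counts}

# Made-up illustration values (assumptions, not from the source).
counts = {"A": 10, "B": 20, "C": 5}
lengths_bp = {"A": 2_000, "B": 4_000, "C": 1_000}

for gene, value in rpkm(counts, lengths_bp).items():
    print(gene, round(value, 2))
```

In this toy sample every gene has the same number of reads per kilobase, so all three RPKM values come out equal.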

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
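The counting rule behind FPKM can be illustrated with a small sketch. The input representation here (one `(fragment_id, is_mapped)` tuple per read) is hypothetical and chosen only for illustration; real pipelines derive this information from alignment files.

```python
def count_fragments(read_alignments):
    """Count fragments that have at least one mapped read.

    read_alignments: iterable of (fragment_id, is_mapped) tuples, one per read.
    A fragment is counted once whether one or both of its reads mapped.
    """
    mapped_fragments = {frag for frag, is_mapped in read_alignments if is_mapped}
    return len(mapped_fragments)

# Hypothetical paired-end reads: two reads per fragment.
reads = [
    ("frag1", True), ("frag1", True),    # both reads mapped: one fragment
    ("frag2", True), ("frag2", False),   # one read mapped: still one fragment
    ("frag3", False), ("frag3", False),  # neither read mapped: not counted
]

print(count_fragments(reads))  # 2
```

Counting reads instead would give 3 mapped reads for the same data, which is the double counting that FPKM avoids.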

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
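The TPM steps above can be sketched the same way. As before, the gene names, counts, and lengths are made-up illustration values.

```python
def tpm(counts, lengths_bp):
    """Compute TPM per gene for one sample.

    counts:     dict mapping gene name -> read count
    lengths_bp: dict mapping gene name -> gene length in base pairs
    """
    # Step 1: reads per kilobase (RPK), which normalizes for gene length.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1_000) for gene in counts}
    # Step 2: "per million" scaling factor from the sum of all RPK values.
    per_million = sum(rpk.values()) / 1_000_000
    # Step 3: divide each RPK value by the scaling factor.
    return {gene: value / per_million for gene, value in rpk.items()}

# Made-up illustration values (assumptions, not from the source).
counts = {"A": 10, "B": 20, "C": 5}
lengths_bp = {"A": 2_000, "B": 4_000, "C": 1_000}

values = tpm(counts, lengths_bp)
for gene, value in values.items():
    print(gene, round(value, 2))
print("sum:", round(sum(values.values())))
```

Because the scaling factor in step 2 is computed from the length-normalized values themselves, the TPM values of a sample add up to one million by construction.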

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.

When you use TPM, the sum of all TPMs in each sample is the same (one million). This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the TPMs in each sample always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at).

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
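To make the difference concrete, the following sketch computes both measures for two toy samples and compares the per-sample totals. The read counts and gene lengths are made-up illustration values, and the compact `rpkm` and `tpm` helpers are written here only for this comparison.

```python
def rpkm(counts, lengths_kb):
    """RPKM: scale by total reads first, then by gene length."""
    per_million = sum(counts) / 1_000_000
    return [c / per_million / kb for c, kb in zip(counts, lengths_kb)]

def tpm(counts, lengths_kb):
    """TPM: scale by gene length first, then by the total of those values."""
    rpk = [c / kb for c, kb in zip(counts, lengths_kb)]
    per_million = sum(rpk) / 1_000_000
    return [v / per_million for v in rpk]

# Made-up read counts for genes A, B, C with lengths 2 kb, 4 kb, 1 kb.
lengths_kb = [2.0, 4.0, 1.0]
sample_1 = [10, 20, 5]
sample_2 = [30, 10, 5]

for name, counts in [("Sample 1", sample_1), ("Sample 2", sample_2)]:
    print(name,
          "| sum of RPKM:", round(sum(rpkm(counts, lengths_kb))),
          "| sum of TPM:", round(sum(tpm(counts, lengths_kb))))
```

With these values the RPKM totals differ between the two samples, while the TPM totals are both one million, so only the TPM values can be read directly as shares of the same whole.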