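America's Golden State Killer evaded capture for 42 years and was finally undone by his own DNA. Researchers believe that in the near future, public DNA databases could be used to identify suspects in hundreds of unsolved cases.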
A relatively small database could match the entire population, according to a new paper in Science
Illustration by Alex Castro / The Verge
This April, police solved a decades-old mystery, the identity of the Golden State Killer, with a previously unused DNA technique. Searching for a sample match in existing databases turned up nothing, but a search through a public DNA database located 10 to 15 possible distant relatives, which let police narrow down a suspect list and ultimately gave them the lead they needed.
It was a new technique at the time, but after the high-profile success, that technique has proved to be one of the most powerful new tools in forensics. In the months since, groups like Parabon NanoLabs and the DNA Doe Project have identified at least 19 different cold case samples through this method, called familial DNA testing of public databases, providing crucial new leads for previously unsolvable cases.
Now, a pair of new discoveries could make that technique even more powerful. A paper published today in the journal Science finds that the same technique could span much further than contemporary labs realize, covering nearly the entire population from a relatively small base of samples. At the same time, researchers publishing in Cell have devised a way to extrapolate from incomplete samples, building out a broader picture of the genome than was originally tested. Taken together, those techniques would allow researchers to identify nearly anyone using only existing samples, a frighteningly powerful new tool for DNA forensics.
Familial DNA testing is a break from conventional DNA testing, which looks for positive matches, like matching the DNA from a bloody glove to the DNA from a specific suspect. Crucially, a match can only be made if the suspect's DNA can be collected, which makes it impractical for most cold cases. But familial DNA searches look for partial matches, which could indicate that the sample comes from a sibling or a parent rather than the same person. That's not enough to conclusively identify a person on its own, but it can give police a crucial lead that prompts further testing down the road.
To find those partial matches, labs have drawn heavily on public DNA databases like GEDMatch and DNALand. Those searches don't require court approval because the data is already public, but they're more limited in scope. The largest database, GEDMatch, contains just under a million genetic profiles, significantly limiting the scope of many searches. The FBI's National DNA Index, in contrast, contains more than 17 million profiles, but can only be accessed under specific legal circumstances. Consumer DNA services like 23andMe and MyHeritage also contain significantly more samples, but their policies typically rule out law enforcement searches of this kind.
The result is a new scramble for data, and new uncertainty about how far the public data could reach. "The big limitation is coverage," says Yaniv Erlich, a computer science professor at Columbia University and chief science officer of MyHeritage. "And even if you find an individual, it requires complex analysis from that point."
Now, Erlich has joined with other researchers from Columbia and Hebrew University to examine exactly how far that coverage could reach. For the Science paper, the team looked at a data set of 1.28 million individuals (largely drawn from the MyHeritage database) and produced a statistical analysis of how likely it is that a given person can be matched to a relative whose DNA is in the database. According to those results, more than 60 percent of searches would turn up a third-cousin-or-closer match (the same proximity used for the Golden State Killer suspect), giving a reasonable chance of identifying the target. As a result, researchers estimate a database would only need to cover 2 percent of a target population to provide a third-cousin-or-better match to nearly any person. "With the exponential growth of consumer genomics," the researchers write, "we posit that such database scale is foreseeable for some third-party websites in the near future."
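To see why a database covering only 2 percent of a population could still reach nearly everyone, a rough back-of-the-envelope sketch may help. This is not the statistical model used in the Science paper. It simply assumes that each of a person's relatives ends up in the database independently with probability equal to the coverage, and the relative counts below are hypothetical values chosen for illustration.

```python
def match_probability(coverage: float, n_relatives: int) -> float:
    """Chance that at least one of `n_relatives` is in a database.

    Simplifying assumption (not from the paper): every relative is in the
    database independently with probability `coverage`.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if n_relatives < 0:
        raise ValueError("n_relatives must be non-negative")
    # P(no relative is in the database) = (1 - coverage) ** n_relatives
    return 1.0 - (1.0 - coverage) ** n_relatives


if __name__ == "__main__":
    coverage = 0.02  # the 2 percent figure quoted in the article
    # Hypothetical counts of third-cousin-or-closer relatives per person.
    for n_relatives in (50, 100, 200, 400):
        p = match_probability(coverage, n_relatives)
        print(f"{n_relatives:>4} relatives -> {p:.1%} chance of a match")
```

Under these assumptions, the chance of at least one match climbs quickly as the number of third-cousin-or-closer relatives grows, which is the intuition behind the small coverage figure. Real populations are not this uniform, so the numbers should be read as illustrative only.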
Notably, that prediction is based on a homogeneous population, but most collections of genetic data show significant racial disparities. The most significant one is in law enforcement databases, which are drawn from arrestee or convict populations and skew toward black and Latino populations as a result. Consumer and public databases exhibit the opposite bias, skewing toward Caucasians, who are consequently more likely to be identified with a familial search, Erlich says.
At the same time, another group of scientists is expanding the reach of those techniques even further. Consumer genetic tests extract different portions of the genome than law enforcement tests, which has led to an ongoing comparison problem when a full sample cannot be obtained. But a group of researchers at Stanford University, University of California at Davis, and the University of Michigan have developed a method for comparing results even when portions of the genome don't overlap, drawing on known correlations between different portions of genetic code. The method isn't fully developed, but it could give forensic analysts much more flexibility in the type of data they can use.
According to UC Davis' Michael Edge, who worked on the Cell paper, the new research "suggests a framework that law enforcement could use to start thinking about backward compatibility of existing STR databases with SNP data, but more work would be necessary to see how practical it would be."