Bilingual: Cloning Voices: You Took the Words Right out of My Mouth
Published: March 22, 2018
Posted by: nanyuzi

Cloning Voices: You Took the Words Right out of My Mouth

语音克隆:你道出我心声

It is now possible to imitate people’s speech patterns easily and precisely. That could bring trouble

现在机器可以轻松又准确地模仿人类讲话,问题或许也随之而来

Utter 160 or so French or English phrases into a phone app developed by CandyVoice, a new Parisian company, and the app’s software will reassemble tiny slices of those sounds to enunciate, in a plausible simulacrum of your own dulcet tones, whatever typed words it is subsequently fed. In effect, the app has cloned your voice. The result still sounds a little synthetic but CandyVoice’s boss, Jean-Luc Crébouw, reckons advances in the firm’s algorithms will render it increasingly natural. Similar software for English and four widely spoken Indian languages, developed under the name of Festvox, by Carnegie Mellon University’s Language Technologies Institute, is also available. And Baidu, a Chinese internet giant, says it has software that needs only 50 sentences to simulate a person’s voice.

巴黎一家新公司CandyVoice开发了一款手机应用,只要对着它说出约160个法语或英语短语,程序就能将这些发音的片段重组,念出之后打字输入的任何字句,听起来和你自己的声音颇为神似。这个应用其实是克隆了你的语音。拼合出的语音听起来还是有点合成的味道,但CandyVoice的老板让·吕克·克莱伯(Jean-Luc Crébouw)认为,公司算法的改进会令声音变得越来越自然。此外还有一款类似的软件Festvox,由卡内基梅隆大学的语言技术研究所针对英语及四种广泛使用的印度语言开发。而中国互联网巨头百度则表示,其开发的软件仅凭50句话就可以模拟一个人的声音。

Until recently, voice cloning – or voice banking, as it was then known – was a bespoke industry which served those at risk of losing the power of speech to cancer or surgery. Creating a synthetic copy of a voice was a lengthy and pricey process. It meant recording many phrases, each spoken many times, with different emotional emphases and in different contexts (statement, question, command and so forth), in order to cover all possible pronunciations. Acapela Group, a Belgian voice-banking company, charges €3,000 ($3,200) for a process that requires eight hours of recording. Other firms charge more and require a speaker to spend days in a sound studio.

直到不久前,语音克隆,即过去所说的“语音银行”,还只是个定制业务,为那些有可能因癌症或手术丧失语言能力的人服务。过去,模仿并合成语音耗时漫长,花费不菲。过程中要录制许多短句,每一句都要以不同的情感侧重及根据不同的语境(陈述、疑问、命令等)重复多次,为的是涵盖所有可能的发音。比利时语音银行公司阿卡贝拉集团(Acapela Group)对需耗时八小时的录制过程收取3000欧元(3200美元)的费用。其他公司收费更高,还需要顾客在录音室里花上好几天的时间。

Not any more. Software exists that can store slivers of recorded speech a mere five milliseconds long, each annotated with a precise pitch. These can be shuffled together to make new words, and tweaked individually so that they fit harmoniously into their new sonic homes. This is much cheaper than conventional voice banking, and permits novel uses to be developed. With little effort, a wife can lend her voice to her blind husband’s screen-reading software. A boss can give his to workplace robots. A Facebook user can listen to a post apparently read aloud by its author. Parents often away on business can personalise their children’s wirelessly connected talking toys. And so on. At least, that is the vision of Gershon Silbert, boss of VivoText, a voice-cloning firm in Tel Aviv.

今非昔比。现有的软件可以存储仅五毫秒长的语音录音片段,并逐一精确标注音调。这些片段可以调换顺序组成新词,并可单独微调,让新词听起来自然顺耳。这比传统语音银行便宜得多,而且还可以开发新的用途。妻子不用太费劲,就可以把自己的声音植入盲人丈夫的屏幕阅读软件里。雇主可以把自己的声音用到工厂机器人身上。Facebook用户可以收听仿佛是由帖子作者亲自朗读的内容。经常出差的家长可以个性化配置孩子的无线联网说话玩具。诸如此类。至少,这是特拉维夫语音克隆公司VivoText的老板格森·希尔伯特(Gershon Silbert)的期望。
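
The slice-and-reassemble idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VivoText's or any vendor's actual method: the 16 kHz sample rate, the `Slice` type, the function names and the linear cross-fade used to smooth each seam are all hypothetical choices. Only the five-millisecond slice length and the per-slice pitch annotation come from the article.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16_000                         # assumed sample rate in Hz
SLICE_MS = 5                                 # slice length given in the article
SLICE_LEN = SAMPLE_RATE * SLICE_MS // 1000   # 80 samples per slice


@dataclass
class Slice:
    samples: List[float]   # raw audio for one 5 ms sliver
    pitch_hz: float        # pitch annotation stored with the sliver


def cut_into_slices(signal, pitch_hz):
    """Cut a recording into 5 ms slices, tagging each with a pitch label."""
    return [
        Slice(signal[i:i + SLICE_LEN], pitch_hz)
        for i in range(0, len(signal) - SLICE_LEN + 1, SLICE_LEN)
    ]


def concatenate(slices, fade=8):
    """Join slices end to end, cross-fading `fade` samples at each seam."""
    out = list(slices[0].samples)
    for s in slices[1:]:
        for k in range(fade):
            w = (k + 1) / (fade + 1)   # weight ramps toward the new slice
            out[-fade + k] = (1 - w) * out[-fade + k] + w * s.samples[k]
        out.extend(s.samples[fade:])   # the rest of the new slice is copied as-is
    return out


# Usage: slice a 20 ms test tone, then stitch the slices back together.
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(320)]
slices = cut_into_slices(tone, pitch_hz=200.0)
rebuilt = concatenate(slices)
```

A real system would select slices by their pitch labels and adjust each one to suit its neighbours. The cross-fade here only stands in for that per-slice tweaking.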

Next year VivoText plans to release an app that lets users select the emphasis, speed and level of happiness or sadness with which individual words and phrases are produced. Mr. Silbert refers to the emotive quality of the human voice as “the ultimate instrument”. Yet this power also troubles him. VivoText licenses its software to Hasbro, an American toymaker keen to sell increasingly interactive playthings. Hasbro is aware, Mr. Silbert notes, that without safeguards a prankster might, for example, type curses on his mother’s smartphone in order to see a younger sibling burst into tears on hearing them spoken by a toy using mum’s voice.

VivoText计划明年发布一款应用,可让用户选择每一个单词和短语的重音、语速、快乐或悲伤的程度。希尔伯特把人声中这种情感特性形容为“终极工具”,但这种力量也让他感到困扰。VivoText将其软件授权给美国玩具制造商孩之宝(Hasbro),这家公司一心想出售互动性更强的玩具。希尔伯特指出,孩之宝也意识到,若没有防范措施,可能会出现一些问题,比如淘气的孩子可能在妈妈的智能手机上输入骂人的话,就为了看弟弟妹妹被玩具用妈妈的声音责骂后嚎啕大哭。

More troubling, any voice – including that of a stranger – can be cloned if decent recordings are available on YouTube or elsewhere. Researchers at the University of Alabama, Birmingham, led by Nitesh Saxena, were able to use Festvox to clone voices based on only five minutes of speech retrieved online. When tested against voice-biometrics software like that used by many banks to block unauthorised access to accounts, more than 80% of the fake voices tricked the computer. Alan Black, one of Festvox’s developers, reckons systems that rely on voice-ID software are now “deeply, fundamentally insecure”.

更令人担忧的是,只要在YouTube或其他地方能找到质量不错的语音片段,任何声音都可以克隆,包括陌生人的声音。在尼特什·塞克森纳(Nitesh Saxena)的带领下,阿拉巴马大学伯明翰分校的研究人员凭借短短五分钟的网络讲话片段就用Festvox克隆出了语音。许多银行使用语音识别软件来阻止非法入侵账户,当用这类软件来测试时,超过80%的合成语音成功骗过了计算机。Festvox的开发人员之一艾伦·布莱克(Alan Black)认为,如今依赖语音识别软件的系统“从根本上来说,极为不安全”。

And, lest people get smug about the inferiority of machines, humans have proved only a little harder to fool than software is. Dr. Saxena and his colleagues asked volunteers if a voice sample belonged to a person whose real speech they had just listened to for about 90 seconds. The volunteers recognised cloned speech as such only half the time (i.e., no better than chance). The upshot, according to George Papcun, an expert witness paid to detect faked recordings produced as evidence in court, is the emergence of a technology with “enormous potential value for disinformation”. Dr. Papcun, who previously worked as a speech-synthesis scientist at Los Alamos National Laboratory, a weapons establishment in New Mexico, ponders on things like the ability to clone an enemy leader’s voice in wartime.

机器的表现是很差,但人也没什么好自鸣得意的。实验证明,相比软件,要骗过人类也难不了多少。塞克森纳博士及其同事先让志愿者听一段90秒的人声录音,然后播放另一个语音样本,让他们判断是否出自说话者本人之口。志愿者仅在半数情况下辨别出了克隆语音,准确率跟纯靠猜是一样的。受聘为法庭鉴定伪造录音证据的专家证人乔治·帕普森(George Papcun)称,这会产生一种“在制造假情报方面有巨大潜在价值”的科技。曾任洛斯阿拉莫斯国家实验室(Los Alamos National Laboratory)语音合成科学家的帕普森琢磨着它会有怎样的用途,例如能否在战时克隆敌方领导人的语音。这一位于新墨西哥州的实验室是军方的武器研发机构。

As might be expected, countermeasures to sniff out such deception are being developed. Nuance Communications, a maker of voice-activated software, is working on algorithms that detect tiny skips in frequency at the points where slices of speech are stuck together. Adobe, best known as the maker of Photoshop, an image-editing software suite, says that it may encode digital watermarks into speech fabricated by a voice-cloning feature called VoCo it is developing. Such wizardry may help computers flag up suspicious speech. Even so, it is easy to imagine the mayhem that might be created in a world which makes it easy to put authentic-sounding words into the mouths of adversaries – be they colleagues or heads of state.

正如所料,已有机构在开发识别这类骗术的对策。语音控制软件开发商Nuance通讯公司(Nuance Communications)正在研究算法,检测语音片段之间连接点上微小的频率跳跃。以出品图像编辑软件Photoshop闻名的Adobe公司表示,它正在开发的名为VoCo的语音克隆软件也许可在合成的语音中添加数字水印。这类精妙技术或许有助计算机辨别可疑语音。即便如此,既然人们轻易就能让对手“亲口”说出逼真的言语,不难想象未来会出现怎样的混乱局面。
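
To make the idea of spotting "skips in frequency" at splice points concrete, here is a rough Python sketch. It is not Nuance's algorithm, whose details the article does not give. It assumes a crude zero-crossing frequency estimate over 20 ms frames and flags any frame whose estimate jumps sharply from the previous one. The sample rate, frame length, threshold and all function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
FRAME_LEN = 320        # 20 ms analysis frames (an arbitrary choice)


def make_tone(freq_hz, n_samples, phase=0.3):
    """Generate a test sine tone; stands in for a slice of recorded speech."""
    return [
        math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE + phase)
        for n in range(n_samples)
    ]


def frame_frequency(frame):
    """Estimate frequency from sign changes (two per cycle of a pure tone)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings * SAMPLE_RATE / (2 * len(frame))


def find_frequency_skips(signal, max_jump_hz=100.0):
    """Return indices of frames whose frequency jumps from the previous frame."""
    freqs = [
        frame_frequency(signal[i:i + FRAME_LEN])
        for i in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN)
    ]
    return [
        i for i in range(1, len(freqs))
        if abs(freqs[i] - freqs[i - 1]) > max_jump_hz
    ]


# Usage: splice a 200 Hz tone onto a 400 Hz tone and look for the seam.
spliced = make_tone(200, 1600) + make_tone(400, 1600)
suspicious_frames = find_frequency_skips(spliced)
```

Real speech is far messier than two pure tones, and the skips the article mentions are described as tiny, so a practical detector would need a much finer pitch tracker than zero-crossing counts.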
