I Made an AI Clone of Myself

In November, a company called Synthesia emailed Motherboard and offered “an exclusive date with your AI twin.”

“Hello, ever thought about creating your own digital twin? You’ve been invited to Synthesia’s New York studio to build your own virtual avatar, like me!” an AI clone of Synthesia spokesperson Laura Morelli said in a video embedded in the email. “Don’t miss out on learning more about the new sexy sector. Lock in your one-hour slot now to build your own avatar with Synthesia. Hurry now because spots are limited and filling up fast.”

I was a bit hesitant. While I will die someday, an AI that looks and sounds like me will not (assuming the service stays online). It could also, in theory, be used to make me say things I don’t agree with.

But I was also intrigued. What does it mean to be represented as an avatar, one that can eliminate any camera-shyness and always look camera-ready even when I’m not? “Journalism is busy at the best of times – absolutely chaotic at worst. I get that. So how about a digital twin to help you out with some of that workflow?” Morelli wrote. Maybe my AI twin could make my job easier. Or, more grimly, maybe it could host my own funeral someday. I decided to go for it.

To create my AI clone, Synthesia told me, we would have to clone my voice and my body, and the whole process would take a little over two hours. Before the shoot, I was given a schedule of “Voice Clone,” “Prep [Hair and Makeup],” and “Video Performance.” No details beyond that. When I entered the studio that day, I had no idea what to expect; I felt like an actress responding to a call sheet, ready to do my best improv.

Most of the AI clones I’ve seen have been in viral deepfake videos of celebrities. Whether it’s Obama saying “Killmonger was right” or Mark Zuckerberg saying “whoever controls the data, controls the future,” AI is increasingly being used to make people say and do whatever its users want. Synthesia boasts that it has made digital clones of David Beckham and Lionel Messi. It says that more than 15,000 businesses have generated “more than 4.5 million videos using our SAAS [Software as a Service] platform.”

Synthesia told me that its primary clients use the platform for things like real estate tours and corporate HR training videos. Companies that have used the platform include Accenture, Reuters, and the BBC. Recently, Synthesia has made a number of headlines because people have used its platform to create propaganda videos. In January, for example, someone used Synthesia to generate AI videos expressing support for Burkina Faso’s new military dictatorship; the user was banned shortly afterward. A few weeks ago, the research firm Graphika discovered that pro-China campaign videos had been generated using Synthesia. This week, Synthesia videos with fabricated content about Venezuela’s economic improvement began trending on YouTube and TikTok.

Synthesia’s online studio offers more than 85 avatars who can speak more than 120 languages. These avatars can be made to say almost anything a user wants them to. And I was going to become one of them.

Going into this shoot, I was most worried about my avatar being used for inappropriate purposes or to say things I personally don’t believe, but the Synthesia team reassured me that only I would be able to use my avatar.

“Cases like this highlight how difficult moderation is. The scripts in question do not use explicit language and require deep contextual understanding of the subject matter. No system will ever be perfect, but to avoid similar situations arising in future we will continue our work towards improving systems,” Synthesia CEO Victor Riparbelli told VICE World News about the Burkina Faso case.

I created my digital clone at Gramercy Park Studios, a creative studio in Hell’s Kitchen. The Synthesia team first took me to a recording studio, where I was handed a script: eight pages of lengthy paragraphs sorted by tone, such as professional, marketing, instructional, casual, and cheerful. Perhaps fittingly, they told me the scripts had been written by ChatGPT.

I was shocked at how much I had to read, and didn’t think I would be able to finish all of the reading within the allotted hour. I also doubted my ability to read so much without stuttering or messing up.

I entered the recording booth. I could see the audio technician and the Synthesia team on the other side of the glass, but through my headphones I could hear only my own voice and every tiny sound, from the rustling of my pants to the tapping of my feet. The script sat on a music stand in front of me so that the paper wouldn’t make any noise.

As we began recording, I tried my best to channel every voice, from an audiobook narrator to a commercial salesperson. I recorded one paragraph at a time, and each had to be read through seamlessly before I could move on. The audio technician made me redo each recording until every word on the page had been read, my pronunciation was correct, and I wasn’t speaking too fast. By the end, I was parched and desperately needed to chug water.

Getting my hair and makeup done on set.

Hair and makeup came next. The makeup artist asked me what makeup and hairstyle I wear day to day. The key for filming, she said, is to enhance your natural features. The Synthesia team had also instructed me beforehand to wear basic clothing that was neither patterned nor reflective, so I chose an all-black outfit that I thought would be AI-avatar-ready. Like the existing AI avatars, I wanted mine to be business casual and versatile, ready to speak on everything from casual to serious topics.

The body cloning was done in a film studio, where I stood on a designated spot in front of a green screen and faced enormous bright lights. I was fitted with a microphone, which was hidden in my bra, and faced a full film crew, including a director, a director of photography, a wardrobe stylist, and a sound mixer. It was my first time being on the other side of the camera, and it felt daunting to have all eyes on me and to know that, to some degree, I was expected to “perform”: to combine my facial expressions, the tone of my voice, and my body movements into one smooth recording.

The director first had me point my head toward every position on a clock face. I looked straight up at 12 o’clock, then slightly to the left at 11 o’clock, and so on. Then I had to move my eyes in every direction without moving my head. Between takes, the wardrobe stylist would come over to smooth out my shirt, remove lint, and tell me not to move my arms too much. The team seemed quite experienced at directing AI-cloning shoots, even though they were freelancers hired for the project.

Finally, I had to read a script from a teleprompter, where the camera could pick up what I looked like when I talked. The director really emphasized positivity, telling me to smile with teeth before and after I spoke each line. He also told me to move my hands slightly in front of me when I talked, to accentuate my speech in a more animated fashion. “You want to have a really positive avatar,” he kept reminding me.

By the end of the shoot, I was exhausted but excited to see what my digital twin would look and sound like. I went home and waited several weeks. Then I got an email that my clone was ready—but not my voice.

I logged into the Synthesia platform and saw a very brightly lit profile image of myself as an avatar. I immediately began testing her out, asking her to say everything from a short intro to the lyrics of “Anaconda.” I wanted to test the limits of what she could say and do, and I realized that she could do an impressive number of things, including talking in a British accent and speaking Chinese.

If I weren’t familiar with my own voice, the AI clone would probably be very convincing, because her mouth moves in a deceptively natural way. Some coworkers who haven’t met me but have seen videos of me asked whether the clips they were shown were actually of me.

There are, thankfully, filters that prevent users from making their avatars say NSFW content, so I can’t make her say just anything. When I tried to make her curse in one script, the video failed to generate.

The Synthesia platform is pretty easy to navigate. Through my login, I can access Synthesia Studio, where I can create new videos either on a customizable blank screen or from a pre-made template, such as “financial presentation,” “office interior,” or “weekly business update.” After I select a template, I’m taken to the editing page, where I can choose an avatar, add text and shapes to the background, swap the backdrop for another image or video, and screen-record my browser tabs to include in the presentation.

Screenshot of Synthesia Studio’s interface

Below the video preview is a box where I can either type a script for the AI to say or upload an audio file for the AI to lip-sync to. When I type a script, I can preview the audio, override pronunciations by typing out the correct pronunciation, and add longer pauses between words. Once I’m done customizing, I can hit generate, and the platform shows roughly how long the video will take to render; the more text there is, the longer it takes.

A few weeks later, Synthesia synced my voice, and my clone was fully ready. Her voice had remnants of my own but mostly sounded like a Siri-ified version of it. It was fairly robotic and monotonous, and there were no settings to manually change its tone, such as making her scream or whisper.

Looking at my AI twin, I find her, as a whole, pretty accurate, especially if you don’t know what my real voice sounds like. And she’s definitely creepy. When I showed my friends videos of her, they immediately knew it was my AI and not me. Maybe it’s because I would never bob my head ever so slightly or talk like Siri. Either way, I told them that if they missed me and couldn’t reach me, they could watch those videos. I can now talk in over 120 languages, so this could be a good way to communicate with people around the world without a translator, and to impress them with my fluent language skills.

With my journalist AI twin, I could use the platform to make her the host of a news show, or maybe even send her into the metaverse as a field reporter: Motherboard Tonight, featuring Chloe’s AI. Who’s watching?