
Facebook’s Powerful Large Language Model Leaks Online


Facebook’s large language model, which is normally available only to approved researchers, government officials, or members of civil society, has leaked online for anyone to download. The model was shared on 4chan, where a member uploaded a torrent file for Facebook’s tool, known as LLaMA (Large Language Model Meta AI), last week.

This marks the first time a major tech firm’s proprietary AI model has leaked to the public. To date, firms like Google, Microsoft, and OpenAI have kept their newest models private, accessible only via consumer interfaces or an API, ostensibly to prevent misuse. 4chan members claim to be running LLaMA on their own machines, but the full implications of the leak are not yet clear.


In a statement to Motherboard, Meta did not deny that LLaMA had leaked, and stood by its approach of sharing its models with approved researchers.

“It’s Meta’s goal to share state-of-the-art AI models with members of the research community to help us evaluate and improve those models. LLaMA was shared for research purposes, consistent with how we have shared previous large language models. While the model is not accessible to all, and some have tried to circumvent the approval process, we believe the current release strategy allows us to balance responsibility and openness,” a Meta spokesperson wrote in an email.

Do you know anything else about the LLaMA leak? Are you using it for any projects? We’d love to hear from you. Using a non-work phone or computer, you can contact Joseph Cox securely on Signal on +44 20 8133 5190, Wickr on josephcox, or email joseph.cox@vice.com.

Like other AI models, including OpenAI’s GPT-3, LLaMA is trained on a massive collection of word fragments, or “tokens.” From there, LLaMA can take a sequence of words as input and predict the next word, applying that step recursively to generate longer passages of text, Meta explained in a blog post from February. LLaMA comes in multiple sizes; the two largest versions, LLaMA 65B and LLaMA 33B, were trained on 1.4 trillion tokens. According to the LLaMA model card, the model was trained on datasets scraped from Wikipedia, books, academic papers from arXiv, GitHub, Stack Exchange, and other sites.
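That next-word loop is simple enough to show in a few lines of code. The sketch below is an illustration, not Meta’s implementation: it assumes the Hugging Face transformers library and uses the openly downloadable GPT-2 as a stand-in, since the leaked LLaMA weights are not assumed here. It greedily picks the most likely next token and feeds it back in, the same recursive generation process Meta describes.

```python
# Minimal sketch of autoregressive generation: encode a prompt, predict the
# most likely next token, append it, and repeat. GPT-2 stands in for LLaMA
# here because its weights are openly downloadable; LLaMA-style models run
# the same loop at inference time, just with different weights and tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Encode the prompt into token IDs, shape (1, sequence_length).
input_ids = tokenizer.encode("The leaked model can", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):  # generate 20 new tokens, one at a time
        logits = model(input_ids).logits      # scores for every vocabulary token
        next_id = logits[0, -1].argmax()      # greedy pick: most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Production systems typically sample from the probability distribution instead of always taking the single most likely token, but the core mechanism is this same append-and-repeat loop.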

In that same February blog post, Meta says it is releasing LLaMA under a noncommercial license focused on research use cases to “maintain integrity and prevent misuse.”

“Access to the model will be granted on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world,” the post reads. Those protections have now been circumvented with the public release of LLaMA.

“We believe that the entire AI community—academic researchers, civil society, policymakers, and industry—must work together to develop clear guidelines around responsible AI in general and responsible large language models in particular. We look forward to seeing what the community can learn—and eventually build—using LLaMA,” Meta’s blog post adds.

Meanwhile, Meta appears to be filing takedown requests against copies of the model online in an attempt to control its spread.

Clement Delangue, CEO of open source AI firm Hugging Face, posted a staff update from GitHub regarding a user’s LLaMA repository. “Company Meta Platforms, Inc has requested a takedown of this published model characterizing it as an unauthorized distribution of Meta Properties that constitutes a copyright infringement or improper/unauthorized use,” the notice said.

Delangue cautioned users against uploading LLaMA weights to the internet. The flagged user’s GitHub repository is currently offline.
