Microsoft has unexpectedly taken offline a gigantic facial recognition database containing photos of people’s faces, but traces of the data trove remain online.
If you’ve ever uploaded photos of yourself to the internet under a Creative Commons license (which allows for re-use under certain conditions), they may already have been used to train AI programs to recognize human faces.
In 2016, Microsoft released MS-Celeb-1M, a dataset of roughly 10 million photos of 100,000 individuals collected from the internet. The database was designed to contain photos of celebrities, but as Berlin-based researcher Adam Harvey pointed out with his project Megapixels, the definition of “celebrity” was quite broad. The database also contained photos of “journalists, artists, musicians, activists, policy makers, writers, and academics,” Harvey wrote.
MS-Celeb-1M’s webpage is currently offline, but before the database was quietly pulled, it was used far and wide to train facial recognition programs. Entities that made use of images in the database, according to Harvey, include Chinese tech firms such as SenseTime and Megvii, which have been linked to the Chinese state’s use of facial recognition to track and oppress ethnic minorities.
In a statement to the Financial Times, Microsoft said that the database was taken down simply “because the research challenge is over.” Even so, it’s doubtful that the MS-Celeb-1M database’s life is over as well.
Like many facial recognition databases shared among researchers, such as Yahoo’s database of nearly 100 million Flickr photos (which has been used by AI researchers at IBM and beyond), MS-Celeb-1M got out of the pen. Even though Microsoft took it down, cleaned-up versions of the database are available to download from GitHub, for example. Tools for working with the database, such as labelling lists that can reveal the names of photo subjects, also remain easily accessible.
“Despite the recent termination of the msceleb.org website, the dataset still exists in several repositories on GitHub, the hard drives of countless researchers, and will likely continue to be used in research projects around the world,” Harvey wrote on Megapixels. A facial recognition challenge this year at Imperial College London plans to use a variant of the MS-Celeb-1M database, and offers download links.
According to Harvey, “it’s fairly clear that Microsoft has lost control of their MS Celeb dataset and biometric data of nearly 100,000 individuals.”