AI Art Generators Sometimes Copy & Paste Images And Faces, New Evidence Shows
By Alexa Heah, 02 Feb 2023

Art communities rallying against the proliferation of artificial intelligence art may now have scientific proof that their concerns about models copying their works are valid. In fact, researchers posit that image generators “memorize” images they’re trained on.
Scientists tested the hypothesis in a recently posted preprint paper, extracting over a thousand training images from AI models, ranging from photographs of individuals to film stills, press pictures, trademarked logos, and artwork.
Turns out, popular text-to-image generators, which scrape data from the web for free, can regurgitate near-exact copies of the images they’ve been trained on. The researchers demonstrated this on Stable Diffusion and Google’s Imagen, the same class of diffusion models that powers tools like DALL-E 2 and Midjourney.
Applying our method to Stable Diffusion and Google’s Imagen, we extract hundreds of images, and do so with high precision.
Many of these images are copyright or licensed, and some are photos of individuals. [4/9] pic.twitter.com/6YB2zHSraU
— Eric Wallace (@Eric_Wallace_) January 31, 2023
In theory, as per Vice, diffusion art generators learn from training data by adding noise to images and then learning to reverse the process, removing the noise step by step until a clean picture emerges (sketched below). Eventually, the systems are able to produce original images based on a description given by a human user.
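For the curious, here is a minimal sketch of that add-noise, remove-noise training setup. The noise schedule and the random array standing in for a real image are illustrative values, not drawn from any production model:

```python
import numpy as np

# Toy DDPM-style forward process: a clean image is mixed with Gaussian
# noise according to a schedule; the denoiser's training task is to
# predict that noise so it can be removed at generation time.
def add_noise(x0, t, betas):
    """Noise a clean image x0 to timestep t using the closed-form mix."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # fraction of signal remaining
    eps = np.random.randn(*x0.shape)         # the noise the model must predict
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

betas = np.linspace(1e-4, 0.02, 1000)        # a common linear schedule
x0 = np.random.rand(64, 64, 3)               # stand-in for a training image
xt, eps = add_noise(x0, t=500, betas=betas)
# Training drives model(xt, t) toward eps; sampling then starts from pure
# noise and repeatedly subtracts the predicted noise to form an image.
```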
However, in reality, many of these algorithms use real artists’ work without consent or compensation, and even leave distinguishable marks in the “original” works they produce in the form of similar art styles or botched signatures.
During the study, researchers demonstrated that there were times these AI models generated virtually the same image used in the training database, differing only in “inconsequential” components such as added or removed noise.
Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images.
Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time.
Paper: https://t.co/LQuTtAskJ9 [1/9] pic.twitter.com/ieVqkOnnoX
— Eric Wallace (@Eric_Wallace_) January 31, 2023
For example, when Stable Diffusion was asked to create an image based on the prompt “Ann Graham Lotz,” the software generated a picture of the American evangelist that looked extremely similar to the one on her Wikipedia page.
Though the AI-generated image was grainier, the scientists determined that the two pictures had “nearly identical pixel compositions,” suggesting the training image had been “memorized” by the system, just as they suspected.
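As a rough illustration, a generate-and-compare check like this could be reproduced with the open-source diffusers library. The model ID, the portrait filename, and the distance threshold here are all assumptions for the sketch; the paper’s actual criterion uses a more involved, patch-based distance:

```python
import numpy as np
from diffusers import StableDiffusionPipeline
from PIL import Image

# Generate an image from the caption (model ID assumed for illustration).
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
generated = np.asarray(pipe("Ann Graham Lotz").images[0])

def l2_distance(a, b):
    """Normalized Euclidean distance between two same-sized RGB images."""
    a = a.astype(np.float64) / 255.0
    b = b.astype(np.float64) / 255.0
    return np.sqrt(np.mean((a - b) ** 2))

# Compare against the suspected source image (hypothetical filename;
# assumes both images are the same resolution). Threshold is illustrative.
training_image = np.asarray(Image.open("wikipedia_portrait.png"))
if l2_distance(generated, training_image) < 0.1:
    print("Nearly identical pixel composition: likely memorized")
```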
Of course, this doesn’t happen all of the time. The systems in question could still produce an “original” image that accurately matched the description without copying any of the works in their training database.
We also train hundreds of our own diffusion models to study the impact of various factors. Some highlights:
- Diffusion models memorize more than GANs
- Outlier images are memorized more
- Existing privacy-preserving methods largely fail pic.twitter.com/bFB4RHMcmb
— Eric Wallace (@Eric_Wallace_) January 31, 2023
Upon prompting Stable Diffusion with “Obama,” the AI generated an image that resembled the former President but did not match any of the images found in its training dataset. The four nearest training images were found to be “very different” from the picture that appeared.
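To show how “nearest training images” might be located at all, here is one possible brute-force approach, reusing the l2_distance helper from the previous sketch. The researchers’ actual search operates at far larger scale and may well differ:

```python
# Brute-force scan for the k nearest training images to a generation.
# Fine for a toy set; real training sets need approximate search methods.
def nearest_training_images(generated, training_set, k=4):
    scored = sorted(
        (l2_distance(generated, img), idx)
        for idx, img in enumerate(training_set)
    )
    return scored[:k]  # (distance, index) pairs, closest first
```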
Vice points out that the susceptibility of art generators to “memorizing” and copying images directly from a source could potentially bring about major copyright issues. Users won’t be able to tell whether any image created is “original” enough to be reproduced or distributed without concern.
In addition, images of individuals’ faces could pose a privacy risk to those who have not expressly permitted their photographs to be used in datasets. Of the images the researchers extracted, over 35% carried an explicit non-permissive copyright notice, while another 61% could still fall under general copyright protection.
See our paper for a lot more technical details and results.
Speaking personally, I have many thoughts on this paper. First, everyone should de-duplicate their data as it reduces memorization. However, we can still extract non-duplicated images in rare cases! [6/9] pic.twitter.com/5fy8LsNbjb
— Eric Wallace (@Eric_Wallace_) January 31, 2023
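The de-duplication step the tweet recommends can be approximated with perceptual hashing. Below is a minimal sketch using the imagehash library, with an illustrative distance threshold; this is a common technique, not necessarily the paper’s exact pipeline:

```python
from PIL import Image
import imagehash

def dedupe(image_paths, max_hamming=4):
    """Keep one representative per cluster of near-duplicate images."""
    kept, seen_hashes = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))  # perceptual hash of the image
        # Subtracting two imagehash values yields their Hamming distance.
        if all(h - other > max_hamming for other in seen_hashes):
            kept.append(path)
            seen_hashes.append(h)
    return kept
```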
Overall, the team managed to reproduce over a hundred “identical” images, a figure the researchers call an undercount, since they only counted instances where the two pictures were “exactly” the same, while many others were very similar to the source.
As of now, it’s not clear exactly what artists or individuals can do to protect the rights to their appearances or artwork. But with the speed at which AI generators are developing, perhaps authorities should take the first step in ensuring that copyright protections are maintained.
[via Vice and arXiv, cover image via Judgar | Dreamstime.com]