Re: LUXULY HI-NEWS

amdudus, Thu Jun 06, 2019 11:13 pm

A neural network has been taught to "animate" portraits from just a single static image.
Russian specialists from the Samsung AI Center in Moscow, together with engineers from the Skolkovo Institute of Science and Technology, have developed a system capable of creating realistic animated images of people's faces from only a few static frames. Tasks like this usually require large databases of images, but in the example the developers presented, the system learned to produce an animated image of a person's face from just eight static frames, and in some cases a single one was enough.
As a rule, it is difficult to reproduce a photorealistic, personalized model of a human face because of the high photometric, geometric and kinematic complexity of the human head. This is explained not only by the difficulty of modeling the face as a whole (many modeling approaches exist for that), but also by the difficulty of modeling particular features: the mouth cavity, hair, and so on. A second complicating factor is our predisposition to notice even minor flaws in rendered human heads. This low tolerance for modeling errors explains the current prevalence of non-photorealistic, cartoon-like avatars in applications such as teleconferencing.
According to the authors, the system, built on few-shot learning, can create highly realistic models of talking heads, and can even animate portrait paintings. The algorithms synthesize images of a person's head driven by facial landmark lines taken from another video fragment, or even by the landmarks of a different person's face. As training material, the developers used an extensive database of celebrity videos. To obtain the most accurate "talking head", the system needs more than 32 images.
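For a concrete sense of what these "facial landmarks" are, here is a minimal extraction sketch using dlib's classic 68-point predictor. This is a generic, commonly used tool chosen purely for illustration; the article does not say which landmark tracker the team actually used, and the model file referenced below must be downloaded separately from dlib.net:

```python
import dlib

# Standard dlib face detector plus the pre-trained 68-point landmark model
# (shape_predictor_68_face_landmarks.dat, distributed separately by dlib).
# Illustrative only -- not the tracker used by the system in the article.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("face.jpg")          # any portrait photo
for rect in detector(img, 1):                  # detect faces (1 = upsample once)
    shape = predictor(img, rect)               # fit 68 points to eyes, nose, mouth, jaw
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print(f"{len(points)} landmarks, first three: {points[:3]}")
```

These (x, y) points, rasterized into a line drawing of the face, are the "driving" signal that the animation is synthesized from.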
To make the animated face images more realistic, the developers built on earlier work in generative adversarial networks (GANs, in which one neural network "imagines" the details of an image, effectively becoming an artist, while a second network judges the result), as well as on meta-learning, in which each element of the system is trained to solve its own specific task.
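For reference, the adversarial game that GAN training plays out is, in its generic textbook form (not necessarily the exact loss used in this work):

min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]

where the generator G tries to produce outputs that the discriminator D cannot tell apart from real images; in this system, G is conditioned on facial landmarks and an identity embedding rather than on random noise z.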
To process static images of people's heads and turn them into animation, three neural networks are used: an Embedder, a Generator and a Discriminator. The first maps head images (together with estimated facial landmarks) to embedding vectors that capture pose-independent information. The second network takes the facial landmarks and generates new image data from them through a set of convolutional layers, which provide robustness to changes in scale, rotation, viewing angle and other distortions of the original face image. The discriminator network assesses the quality and authenticity of the other two networks' output. As a result, the system turns a person's facial landmarks into realistic-looking, personalized photographs.
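As a rough illustration of how these three networks could fit together, here is a deliberately minimal PyTorch sketch. All layer sizes, the additive conditioning mechanism, and the variable names are assumptions made for readability; the published model is far deeper and conditions the generator differently, so treat this as a picture of the data flow, not the authors' architecture:

```python
import torch
import torch.nn as nn

class Embedder(nn.Module):
    # Maps (frame, rasterized landmarks) pairs of one person to an
    # identity embedding intended to be independent of head pose.
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 64, 4, 2, 1), nn.ReLU(),    # 3 RGB + 3 landmark channels
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, dim)

    def forward(self, frames, landmarks):
        # frames, landmarks: (K, 3, H, W); average over the K few-shot frames
        h = self.conv(torch.cat([frames, landmarks], dim=1)).flatten(1)
        return self.fc(h).mean(dim=0)

class Generator(nn.Module):
    # Turns a target landmark image into a photo of the person,
    # conditioned on the identity embedding.
    def __init__(self, dim=512):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
        )
        self.cond = nn.Linear(dim, 128)               # inject identity at the bottleneck
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, landmarks, embedding):
        h = self.enc(landmarks)
        h = h + self.cond(embedding).view(1, -1, 1, 1)  # broadcast over batch and space
        return self.dec(h)

class Discriminator(nn.Module):
    # Scores how realistic a (frame, landmarks) pair looks.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, frame, landmarks):
        return self.net(torch.cat([frame, landmarks], dim=1))

# Wiring: 8 reference frames -> identity embedding -> animate with new landmarks.
emb, gen, disc = Embedder(), Generator(), Discriminator()
few_frames = torch.rand(8, 3, 64, 64)      # eight static frames of one person
few_lmks   = torch.rand(8, 3, 64, 64)      # their rasterized landmarks
e = emb(few_frames, few_lmks)              # pose-independent identity vector
target_lmks = torch.rand(1, 3, 64, 64)     # landmarks from a driving video
fake = gen(target_lmks, e)                 # synthesized frame of the same person
realism = disc(fake, target_lmks)          # adversarial realism score
```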
The developers emphasize that their system can initialize the parameters of both the generator network and the discriminator network individually for each person in a picture, so that training can then proceed from just a few images and finish quickly, despite the tens of millions of parameters that have to be tuned.
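Continuing the sketch above, this person-specific adaptation can be pictured as a short fine-tuning loop over the handful of available frames. The loss choices here (L1 reconstruction plus a hinge adversarial term) and the step count are illustrative assumptions, not the authors' exact objective:

```python
import torch.nn.functional as F

e = e.detach()                                 # freeze the identity embedding
opt_g = torch.optim.Adam(gen.parameters(), lr=5e-5)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)

for step in range(40):                         # a few dozen steps, in the spirit of "fast"
    fake = gen(few_lmks, e)                    # reconstruct the reference frames
    # Generator: match the real frames and fool the discriminator.
    g_loss = F.l1_loss(fake, few_frames) - disc(fake, few_lmks).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # Discriminator: hinge loss separating real frames from generated ones.
    d_real = disc(few_frames, few_lmks)
    d_fake = disc(fake.detach(), few_lmks)
    d_loss = F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```

Because both networks start from meta-learned weights rather than random ones, only this short loop is needed per new person, which is what makes the few-shot regime practical.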