In a world dominated by visual impressions, seeing is a powerful gateway to reality. Based on our senses, we make judgements about the world and our place in it. While we take the complexity of seeing for granted in everyday life, closer inspection leads us to ancient philosophical questions about the nature of human existence. Even in antiquity, the mysteries of perception were explored, particularly those of vision, from the nature of light to the workings of the eye.
Research around machine vision therefore raises not only technical questions, but also fundamental considerations about human nature and our perception. This research leads us back to a question as old as humanity itself: How is human cognition possible?
Perhaps “computer vision” will even help us one day to see ourselves in a new light and gain a deeper insight into who we are.
What is Computer Vision?
Computer vision is a field within artificial intelligence (AI) that enables computers to derive information from images, videos and other inputs.
As we know, AI can process large amounts of data better than humans. But what about visual perception? Among our five senses, vision is possibly the most important. However, our vision cannot be compared to a camera that simply records reality: eye and brain already select what enters our perception. How this selection takes place, and whether there is an objectively observable reality at all, is a contested philosophical question.
Computer vision is likewise not simply the recording of a video or image, but the interpretation of what a computer sees. Analogous to human vision, it comprises several sub-tasks.
To process a video, numerous individual images must first be analysed. A video typically consists of 24 to 60 frames per second (high-frame-rate recordings reach 144 or more), with each frame consisting of millions of pixels. With the help of AI, these pixels are analysed and bodies are automatically recognised and outlined. The detected bodies are then further broken down into faces, mouths, hands and so on. This process is repeated for every single frame of the video: if a video lasts 50 seconds at 60 frames per second, that makes a total of 3,000 frames. In addition, the movement between frames is compared and tracked to identify fluid motion, and relationships between the objects in the video are detected.
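The arithmetic behind this workload is simple, and the per-frame loop can be sketched in a few lines (the detection step here is only a placeholder comment, not a real model):

```python
# Minimal sketch of the per-frame workload described above.

def total_frames(fps: int, duration_s: int) -> int:
    """Number of individual images a video of the given length contains."""
    return fps * duration_s

def process_video(fps: int, duration_s: int) -> int:
    """Walk over every frame, as a detection pipeline would."""
    frames_processed = 0
    for _ in range(total_frames(fps, duration_s)):
        # In a real pipeline, each frame would be passed to a detector
        # that outlines bodies, then faces, mouths, hands and so on.
        frames_processed += 1
    return frames_processed

# A 50-second clip at 60 frames per second:
print(total_frames(60, 50))  # 3000
```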
The human brain: the master of visual processing
What sounds complex as computer vision is something our eyes and brain perform every day: our eyes take in light and pass the signal on to the brain, where it is converted into the images we see.
At the same time, our eyes are a true marvel of nature. Approximately 80 percent of the information we receive from our environment is conveyed to the brain through the sense of sight, where it is processed. The eye converts the electromagnetic waves of light into nerve impulses that are transmitted to the brain via the optic nerve. These nerve impulses contain information about the visual image that the eye has taken in.
Our brain interprets what our eye sees
In the brain, these nerve impulses are then processed in the visual centre and related regions. Our brain compares the incoming signals with stored information and experiences to interpret the perceived world. It analyses shapes, colours, movements and other visual characteristics to provide us with a complete picture of our surroundings, and it converts the two-dimensional images from both eyes into a three-dimensional image of the world. This processing enables us to recognise objects and faces, perceive spatial depth, track motion and perform complex visual tasks. Our visual perception is a sophisticated process that allows us to understand and respond to the visual world around us.
Human vision: How can it be taught to an artificial brain?
First of all, image capture and processing are the starting point. With the help of cameras and sensors, three-dimensional scenes are recorded and captured as sequences of images. Each image is a two-dimensional grid of pixels, with numbers encoding colour. The objects contained in the images are detected by splitting them into distinctive regions and determining their positions. The objects are then identified and their characteristics interpreted, e.g. the type and colour of a parrot.
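The idea of an image as a two-dimensional grid of numbers can be illustrated with a tiny hand-built example (pure Python, no image library assumed):

```python
# A 2x3 RGB "image": each pixel is an (R, G, B) triple of 0-255 values.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],        # row 0: red, green, blue
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],  # row 1: black, grey, white
]

height = len(image)     # number of rows
width = len(image[0])   # number of pixels per row

# Accessing a single pixel by (row, column):
r, g, b = image[0][2]
print(height, width, (r, g, b))  # 2 3 (0, 0, 255)
```

Real frames work the same way, only with millions of such pixels per image.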
Object tracking, for example the parrot in flight, enables the tracking of moving objects in successive images or videos for motion analysis. Gesture and movement recognition plays a role, e.g. recognising dance movements in video games for interactive control. Scene understanding involves comprehensive understanding of a scene with subtle object relationships, e.g. a hungry cat looking at a mouse.
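Object tracking between successive frames can be sketched as nearest-centroid matching: each object detected in the new frame inherits the identity of the closest object from the previous frame. This is a deliberately simplified toy scheme; real trackers also use motion models and appearance features.

```python
import math

def track(prev_positions, new_centroids):
    """Assign each new centroid the id of the nearest previous centroid.
    prev_positions: dict id -> (x, y); new_centroids: list of (x, y).
    A toy nearest-neighbour tracker, not a production algorithm."""
    assignments = {}
    for point in new_centroids:
        nearest = min(prev_positions,
                      key=lambda i: math.dist(prev_positions[i], point))
        assignments[nearest] = point
    return assignments

# Frame 1: parrot (id 1) at (10, 10), second bird (id 2) at (100, 50).
# Frame 2: both have moved slightly; identities are carried over.
prev = {1: (10, 10), 2: (100, 50)}
print(track(prev, [(12, 11), (98, 52)]))  # {1: (12, 11), 2: (98, 52)}
```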
The library of the moment: when we see, we apply our accumulated knowledge of the world. Everything we have experienced and already seen in our lives is activated. Every time we look into the world, the world we have already seen looks with us.
All our accumulated knowledge of the world, our experience of perspective and geometry, our common sense works with every look we take at the world. Seeing is always synonymous with understanding the world. And that is the great challenge of computer vision research: a machine should not only be able to see, but also understand what it sees.
Inspired by the human brain, researchers invented convolutional neural networks, which learn filters that recognise patterns in images. This learning architecture was already discussed in the 1980s, when artificial intelligence was still a niche research field. The breakthrough came in 2012, when for the first time large numbers of images and videos, taken with smartphones, were available as training data. At the same time, better hardware had become available: computing power and storage capacity became affordable.
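What such a filter does can be shown with a hand-written 2D convolution in pure Python. In a trained network the kernel weights are learned from data; here a vertical-edge kernel is fixed by hand to make the effect visible:

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (strictly: cross-correlation, as in most
    deep-learning libraries) of a grayscale image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical edge in a 3x4 image: dark left half, bright right half.
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
edge_kernel = [[-1, 1]]  # responds where brightness jumps left-to-right
print(convolve2d(img, edge_kernel))  # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
```

The output is large exactly where the brightness changes, which is how stacked layers of such filters come to detect edges, shapes and eventually whole objects.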
The world through digital eyes: What do machines see today?
Machine vision has made amazing advances in various areas of everyday life. It detects when drivers are about to fall asleep, enables a smooth shopping experience without checkouts in autonomous shops and helps with airport security. Gesture recognition assesses movements in video games, while facial recognition enables the unlocking of mobile phones. Smart cameras produce aesthetic portraits, military applications distinguish enemies from civilians, and autonomous navigation enables safe movement of drones and vehicles. In addition, machine vision is used in medical image analysis for tumor detection, content moderation in social media, selection of appropriate advertisements, intelligent image search and even the creation of deepfakes.
Deepfakes and the old questions about human knowledge
Deepfakes are videos created or manipulated using deep learning. With this technology, a person’s outward appearance can be altered: their facial expressions and lip movements are transformed so that they appear to make expressions and utterances that never took place. An example of such deepfakes is the app Avatarify, which was developed in 2021. With this application, the person in any photo can be animated to appear to say or sing something. The bodies of real people can thus be placed in a completely different context today.
Deepfakes have become a widespread phenomenon and seriously test our perception. The consequences go far beyond the mere manipulation of videos and raise fundamental philosophical questions anew. We have to ask ourselves again how we acquire knowledge and what knowledge means in the first place. This affects all areas of our lives and has the potential to shake confidence in audio-visual records. We will increasingly need to be aware of the implications of this technology and take action to curb its misuse.
For example, in the future, fabricated videos, audio recordings or security camera footage could be presented as evidence in court and lead to serious miscarriages of justice. The rapid development of this technology suggests that it will increasingly find its way into crime as well.
There is no doubt that the market for deepfake detection software will grow strongly. It is no surprise that both Facebook and Google have offered prizes for the development of such programmes. A veritable arms race between counterfeiting and unmasking software is emerging.
Future anti-deception software will be, in some ways, a remake of today’s anti-virus software. It seems certain that new legal regulations will have to be introduced requiring the authentication of videos through the use of blockchain technologies for verification, for example.
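Whatever ledger is used, such verification ultimately rests on comparing a cryptographic fingerprint of the footage with one registered at publication time. A minimal sketch with SHA-256 follows; the "registry" here is just an in-memory dict standing in for an immutable blockchain ledger:

```python
import hashlib

registry = {}  # stand-in for an immutable ledger: video id -> fingerprint

def fingerprint(video_bytes: bytes) -> str:
    """Cryptographic hash of the raw footage."""
    return hashlib.sha256(video_bytes).hexdigest()

def register(video_id: str, video_bytes: bytes) -> None:
    """Record the fingerprint when the video is published."""
    registry[video_id] = fingerprint(video_bytes)

def is_authentic(video_id: str, video_bytes: bytes) -> bool:
    """Any later edit to the footage changes the hash and fails the check."""
    return registry.get(video_id) == fingerprint(video_bytes)

original = b"...raw video bytes..."
register("clip-42", original)
print(is_authentic("clip-42", original))              # True
print(is_authentic("clip-42", original + b"tamper"))  # False
```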
Deepfake videos change human perception and knowledge formation by undermining people’s ability to distinguish between reality and fiction. The boundaries between authenticity and manipulation become blurred, forcing us to question our cognitive capacity and the way we acquire knowledge: in the future, everything we see (digitally) will require critical scrutiny.
The old philosophical question of epistemology, which deals with how we gain knowledge and how we justify our beliefs, is experiencing a renaissance through digitalisation: in the age of deepfakes, we must ask anew how certain or reliable our knowledge really is.
This fundamental philosophical discipline of epistemology can once again help us understand and critically question the nature and scope of our knowledge.
The utopia in all this?
The utopia is that we become more sensitive to our own judgements and question our opinions and insights about the world. This could almost be a recipe for a better world. By recognising the illusion of certainty and absolute knowledge, we open ourselves to an ongoing process of learning and critical thinking.
This enables us to overcome our own prejudices and assumptions and develop a deeper understanding of the world.