Microsoft Azure AI Fundamentals:
Computer vision is an area of artificial intelligence (AI) in which software systems are designed to perceive the world visually, through cameras, images, and video. There are multiple specific types of computer vision problems that AI engineers and data scientists can solve using a mix of custom machine learning models and platform-as-a-service (PaaS) solutions, including many cognitive services in Microsoft Azure.
Introduction
Computer vision is one of the core
areas of artificial intelligence (AI), and focuses on creating solutions that
enable AI applications to "see" the world and make sense of it.
Of course, computers don't have
biological eyes that work the way ours do, but they are capable of processing
images, either from a live camera feed or from digital photographs or videos.
This ability to process images is the key to creating software that can emulate
human visual perception.
Some potential uses for computer vision include:
Content Organization:
Identify people or objects in photos and organize them based on that
identification. Photo recognition applications like this are commonly used in
photo storage and social media applications.
Text Extraction: Analyze
images and PDF documents that contain text and extract the text into a
structured format.
Spatial Analysis: Identify
people or objects, such as cars, in a space and map their movement within that
space.
To an AI application, an image is just
an array of pixel values. These numeric values can be used
as features to train machine learning models that make predictions
about the image and its contents.
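To make the idea of pixels as features concrete, here is a minimal sketch in plain Python. The 3x3 grayscale grid and the to_features helper are made up for illustration; real images are far larger and usually have three color channels.

```python
# A tiny 3x3 grayscale "image": each value is a pixel intensity
# from 0 (black) to 255 (white). The values are invented for illustration.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]


def to_features(pixels):
    """Flatten a 2D pixel grid into a 1D feature list scaled to 0.0-1.0."""
    return [value / 255 for row in pixels for value in row]


features = to_features(image)
print(len(features))  # 9 features, one per pixel
print(features[:3])   # [0.0, 1.0, 0.0]
```

A machine learning model would receive a list like this (one per image) as its input features.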
Training machine learning models from
scratch can be very time intensive and require a large amount of data.
Microsoft's Computer Vision service gives you access to pre-trained computer
vision capabilities.
Get started with image analysis on Azure
The Computer Vision service is a
cognitive service in Microsoft Azure that provides pre-built computer vision
capabilities. The service can analyze images, and return detailed information
about an image and the objects it depicts.
Azure resources for Computer Vision
To use the Computer Vision service, you
need to create a resource for it in your Azure subscription.
You can use either of the following resource types:
Computer Vision: A
specific resource for the Computer Vision service. Use this resource type if
you don't intend to use any other cognitive services, or if you want to track
utilization and costs for your Computer Vision resource separately.
Cognitive Services: A
general cognitive services resource that includes Computer Vision along with
many other cognitive services, such as Text Analytics, Translator Text, and
others. Use this resource type if you plan to use multiple cognitive services
and want to simplify administration and development.
Whichever type of resource you choose to
create, it will provide two pieces of information that you will need to use it:
A key that is used to
authenticate client applications.
An endpoint that provides the
HTTP address at which your resource can be accessed.
If you create a Cognitive Services
resource, client applications use the same key and endpoint regardless of the
specific service they are using.
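As a rough sketch of how a client application might combine the key and endpoint, the following Python builds (but does not send) an HTTP request using only the standard library. The endpoint, key, and image URL are placeholders, and the /vision/v3.2/analyze path and visualFeatures parameter assume the v3.2 REST API; check the documentation for the API version your resource uses.

```python
import json
import urllib.parse
import urllib.request


def build_analyze_request(endpoint, key, image_url, features):
    """Build (but do not send) an image analysis request.

    Assumes the v3.2 REST path; other API versions use different paths.
    """
    query = urllib.parse.urlencode({"visualFeatures": ",".join(features)})
    url = endpoint.rstrip("/") + "/vision/v3.2/analyze?" + query
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        # The key authenticates the client application.
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values: replace with the endpoint and key from your own resource.
request = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com/",
    "YOUR_KEY",
    "https://example.com/photo.jpg",
    ["Description", "Tags"],
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a real resource.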
Analyzing images with the Computer Vision service
After you've created a suitable resource
in your subscription, you can submit images to the Computer Vision service to
perform a wide range of analytical tasks.
Describing an image
Computer Vision has the ability to
analyze an image, evaluate the objects that are detected, and generate a
human-readable phrase or sentence that can describe what was detected in the
image. Depending on the image contents, the service may return multiple
results, or phrases. Each returned phrase will have an associated confidence
score, indicating how confident the algorithm is in the supplied description.
The highest confidence phrases will be listed first.
To help you understand this concept,
consider an image of the Empire State Building in New York.
The phrases returned for it are listed below in order of
confidence.
1. A black and white photo of a city
2. A black and white photo of a large city
3. A large white building in a city
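The ordering by confidence can be sketched in a few lines of Python. The response fragment below follows the shape of the v3.2 REST response (a description object containing captions), but the confidence values are invented for illustration and ranked_captions is a hypothetical helper.

```python
# Hypothetical response fragment; the confidence values are invented.
response = {
    "description": {
        "captions": [
            {"text": "a large white building in a city", "confidence": 0.31},
            {"text": "a black and white photo of a city", "confidence": 0.95},
            {"text": "a black and white photo of a large city", "confidence": 0.94},
        ]
    }
}


def ranked_captions(result):
    """Return the captions sorted from highest to lowest confidence."""
    captions = result.get("description", {}).get("captions", [])
    return sorted(captions, key=lambda c: c["confidence"], reverse=True)


ranked = ranked_captions(response)
for rank, caption in enumerate(ranked, start=1):
    print(f"{rank}. {caption['text']} ({caption['confidence']:.0%})")
```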
Tagging visual features
The image descriptions generated by
Computer Vision are based on a set of thousands of recognizable objects, which
can be used to suggest tags for the image. These tags can be
associated with the image as metadata that summarizes attributes of the image;
and can be particularly useful if you want to index an image along with a set
of key terms that might be used to search for images with specific attributes
or contents.
For example, the tags returned for the
Empire State Building image include:
1. skyscraper
2. tower
3. building
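One way to use tags as search metadata is to keep only the confident ones and build a simple index from tag to image. This is a sketch under assumptions: the file names, confidence values, threshold, and both helper functions are hypothetical, and the tags list mirrors the shape of the v3.2 response.

```python
from collections import defaultdict

# Hypothetical tag results for two images; the confidence values are invented.
results = {
    "empire_state.jpg": {"tags": [
        {"name": "skyscraper", "confidence": 0.99},
        {"name": "tower", "confidence": 0.97},
        {"name": "building", "confidence": 0.95},
        {"name": "fog", "confidence": 0.32},
    ]},
    "office.jpg": {"tags": [
        {"name": "building", "confidence": 0.91},
    ]},
}


def confident_tags(result, threshold=0.5):
    """Keep only the tag names at or above the confidence threshold."""
    return [t["name"] for t in result.get("tags", []) if t["confidence"] >= threshold]


def index_images(results_by_image, threshold=0.5):
    """Map each tag to the list of images it was suggested for."""
    index = defaultdict(list)
    for image_name, result in results_by_image.items():
        for tag in confident_tags(result, threshold):
            index[tag].append(image_name)
    return dict(index)


index = index_images(results)
print(index["building"])  # ['empire_state.jpg', 'office.jpg']
```

Searching for images by key term then becomes a dictionary lookup.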
Detecting objects
The object detection capability is similar
to tagging, in that the service can identify common objects; but rather than
only providing tags for the recognized objects, it can
also return what are known as bounding box coordinates. Not only will you get
the type of object, but you will also receive a set of coordinates that
indicate the top, left, width, and height of the object detected, which you can
use to identify the location of the object in the image.
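To show how the top, left, width, and height values locate an object, here is a small sketch that converts them into corner coordinates, which is the form many drawing libraries expect. The detected object below is invented, and the x, y, w, h field names assume the v3.2 response shape.

```python
# Hypothetical detected object; the values are invented for illustration.
# In the assumed response shape, x is the left edge and y is the top edge, in pixels.
detected = {
    "object": "building",
    "confidence": 0.87,
    "rectangle": {"x": 120, "y": 40, "w": 200, "h": 480},
}


def to_corners(rectangle):
    """Convert left/top/width/height into (left, top, right, bottom)."""
    left, top = rectangle["x"], rectangle["y"]
    return (left, top, left + rectangle["w"], top + rectangle["h"])


corners = to_corners(detected["rectangle"])
print(detected["object"], corners)  # building (120, 40, 320, 520)
```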
Detecting brands
This feature provides the ability to identify
commercial brands. The service has an existing database of thousands of
globally recognized logos from commercial brands of products.
When you call the service and pass it an
image, it performs a detection task and determines whether any of the identified
objects in the image are recognized brands. The service compares the brands
against its database of popular brands spanning clothing, consumer electronics,
and many more categories. If a known brand is detected, the service returns a
response that contains the brand name, a confidence score (from 0 to 1
indicating how positive the identification is), and a bounding box
(coordinates) for where in the image the detected brand was found.
For example, in the following image, a
laptop has a Microsoft logo on its lid, which is identified and located by the
Computer Vision service.
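The three pieces of a brand result (name, confidence score, and bounding box) can be read into a small data structure, as in this sketch. The response values are invented, "Contoso" is a fictional brand, and BrandDetection and parse_brands are hypothetical names; the field names assume the v3.2 response shape.

```python
from dataclasses import dataclass


@dataclass
class BrandDetection:
    name: str
    confidence: float  # from 0 to 1
    left: int
    top: int
    width: int
    height: int


def parse_brands(result, min_confidence=0.5):
    """Return the detected brands at or above the confidence threshold."""
    detections = []
    for brand in result.get("brands", []):
        if brand["confidence"] < min_confidence:
            continue
        box = brand["rectangle"]
        detections.append(BrandDetection(
            brand["name"], brand["confidence"],
            box["x"], box["y"], box["w"], box["h"],
        ))
    return detections


# Hypothetical response fragment; all values are invented.
response = {"brands": [
    {"name": "Microsoft", "confidence": 0.92,
     "rectangle": {"x": 210, "y": 150, "w": 60, "h": 60}},
    {"name": "Contoso", "confidence": 0.21,
     "rectangle": {"x": 5, "y": 5, "w": 20, "h": 10}},
]}

brands = parse_brands(response)
print([b.name for b in brands])  # ['Microsoft']
```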
Detecting faces
The Computer Vision service can detect
and analyze human faces in an image, including the ability to determine age and
a bounding box rectangle for the location of the face(s). The facial analysis
capabilities of the Computer Vision service are a subset of those provided by
the dedicated Face service. If you need basic face detection and analysis,
combined with general image analysis capabilities, you can use the Computer
Vision service; but for more comprehensive facial analysis and facial
recognition functionality, use the Face service.
The following example shows an image of a person with their face
detected and approximate age estimated.
Categorizing an image
Computer Vision can categorize images
based on their contents. The service uses a parent/child hierarchy with a
currently limited set of categories. When analyzing an image,
detected objects are compared to the existing categories to determine the best way
to provide the categorization. As an example, one of the parent categories
is people_. This image of a person on a roof is assigned a category
of people.
A slightly different categorization is
returned for the following image, which is assigned to the category people_group because
there are multiple people in the image:
The full list contains 86 categories.
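The parent/child naming can be unpacked with a little string handling, as in this sketch. It assumes that the part of a category name before the first underscore is the parent and the remainder, if any, is the child; the scores are invented, and split_category and best_category are hypothetical helpers.

```python
def split_category(name):
    """Split a category name into (parent, child); child is None for a parent-only name."""
    parent, _, child = name.partition("_")
    return parent, child or None


def best_category(result):
    """Return the highest-scoring category, or None if there are none."""
    categories = result.get("categories", [])
    if not categories:
        return None
    return max(categories, key=lambda c: c["score"])


# Hypothetical response fragment; the scores are invented.
response = {"categories": [
    {"name": "people_", "score": 0.12},
    {"name": "people_group", "score": 0.64},
]}

best = best_category(response)
print(split_category(best["name"]))  # ('people', 'group')
print(split_category("people_"))     # ('people', None)
```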
Detecting domain-specific content
When categorizing an image, the
Computer Vision service supports two specialized domain models:
Celebrities - The
service includes a model that has been trained to identify thousands of
well-known celebrities from the worlds of sports, entertainment, and business.
Landmarks - The
service can identify famous landmarks, such as the Taj Mahal and the Statue of
Liberty.
For example, when analyzing the
following image for landmarks, the Computer Vision service identifies the
Eiffel Tower, with a confidence of 99.41%.
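A landmark result like this one might be read as follows. This sketch assumes the v3.2 response shape, in which domain-specific results are nested under each category's detail object; the category name and score are invented, and find_landmarks is a hypothetical helper.

```python
def find_landmarks(result):
    """Collect (name, confidence) pairs for every landmark in the result."""
    found = []
    for category in result.get("categories", []):
        for landmark in category.get("detail", {}).get("landmarks", []):
            found.append((landmark["name"], landmark["confidence"]))
    return found


# Hypothetical response fragment; the category name and score are invented.
response = {"categories": [
    {"name": "building_", "score": 0.83,
     "detail": {"landmarks": [{"name": "Eiffel Tower", "confidence": 0.9941}]}},
]}

landmarks = find_landmarks(response)
for name, confidence in landmarks:
    print(f"{name}: {confidence:.2%}")  # Eiffel Tower: 99.41%
```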
Optical character recognition
The Computer Vision service can use
optical character recognition (OCR) capabilities to detect printed and
handwritten text in images.
Additional capabilities
In addition to these capabilities, the
Computer Vision service can:
Detect image types - for
example, identifying clip art images or line drawings.
Detect image color schemes -
specifically, identifying the dominant foreground, background, and overall
colors in an image.
Generate thumbnails -
creating small versions of images.
Moderate content -
detecting images that contain adult content or depict violent, gory scenes.
Summary
The Computer Vision service provides
many capabilities that you can use to analyze images, including generating a
descriptive caption, extracting relevant tags, identifying objects, determining
image type and metadata, detecting human faces, known brands, and celebrities,
and others.