Multimodal Learning
Meta-Transformer: A Unified Framework for Multimodal Learning
Figure: Unified Multimodal Learning. Meta-Transformer utilizes the same backbone to encode natural language, images, point clouds, audio, video, infrared, hyperspectral, X-ray, time-series, tabular, and graph data. It reveals the potential of transformer architectures for universal perception.
> Multimodal
learning involves utilizing data from various modalities to improve
model capacity. Despite years of development in this field, it
remains challenging to devise a unified framework for processing natural
language, 2D images, 3D point clouds, and audio spectrograms due to
crucial gaps among these different modalities. This study proposes a
novel approach that demonstrates a network with frozen parameters can
encode the data from the aforementioned four modalities and achieve
favorable performance, resulting in a unified framework called
Meta-Transformer. Using this framework, the raw input data from various
modalities are converted to a shared token space, allowing a subsequent
encoder with frozen parameters to extract high-level semantic features
of the input data. Meta-Transformer is composed of three main components: a
unified data tokenizer, a modality-shared encoder, and task-specific heads
for downstream tasks …
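
To make the three-component design concrete, here is a minimal PyTorch sketch of the idea: per-modality tokenizers project raw inputs into a shared token space, a frozen transformer encoder extracts semantic features, and a small task-specific head is trained on top. The class names (`ImageTokenizer`, `TextTokenizer`, `MetaTransformerSketch`) and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Meta-Transformer idea (illustrative, not the official code):
# modality-specific tokenizers -> shared token space -> frozen encoder -> task head.
import torch
import torch.nn as nn


class ImageTokenizer(nn.Module):
    """Patch-embed a 2D image into a token sequence (ViT-style)."""
    def __init__(self, dim=768, patch=16, in_ch=3):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2)       # (B, dim, N)
        return tokens.transpose(1, 2)          # (B, N, dim)


class TextTokenizer(nn.Module):
    """Embed token ids into the same shared token space."""
    def __init__(self, vocab=30522, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, ids):                    # ids: (B, L)
        return self.embed(ids)                 # (B, L, dim)


class MetaTransformerSketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.tokenizers = nn.ModuleDict({
            "image": ImageTokenizer(dim),
            "text": TextTokenizer(dim=dim),
        })
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # The shared encoder stays frozen; only tokenizers and heads train.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)   # task-specific head

    def forward(self, x, modality):
        tokens = self.tokenizers[modality](x)      # shared token space
        feats = self.encoder(tokens)               # frozen feature extraction
        return self.head(feats.mean(dim=1))        # pooled tokens -> logits


# Usage: the same frozen backbone serves both modalities.
model = MetaTransformerSketch()
img_logits = model(torch.randn(2, 3, 224, 224), modality="image")
txt_logits = model(torch.randint(0, 30522, (2, 32)), modality="text")
```

The key design choice reflected here is that the modality-shared encoder is kept frozen, so the same backbone can be reused across modalities; only the lightweight data tokenizers and the task-specific heads require training.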