AI glossary
Learn all terms related to AI development and machine learning.
Asset
Assets (or data assets) are individual files to be labeled, such as an image, a video, or a text file. Assets can be hosted in a cloud bucket, uploaded from a local file location, or copied from a remote data source.
JavaScript
JavaScript is a high-level programming language commonly used for creating interactive and dynamic content on web pages, with applications ranging from front-end web development to server-side programming.
Few-shot learning
A technique whereby we prompt an LLM with several concrete examples of task performance.
Zero-shot learning
A technique whereby we prompt an LLM without any examples, relying on the general reasoning patterns it has gleaned from training (i.e., a generalist LLM).
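For illustration, here is a minimal sketch contrasting the two prompting styles above; the `complete` function is a hypothetical stand-in for any LLM completion API:

```python
few_shot_prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Absolutely love this phone.' Sentiment: positive\n"
    "Review: 'Screen cracked within a week.' Sentiment: negative\n"
    "Review: 'The battery died after two days.' Sentiment:"
)

zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: 'The battery died after two days.'\n"
    "Sentiment:"
)

# answer = complete(few_shot_prompt)   # conditions the model on concrete examples
# answer = complete(zero_shot_prompt)  # relies on the model's general reasoning
```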
Fine tuning
A technique whereby we take an off-the-shelf open-source or proprietary model, re-train it on a variety of concrete examples, and save the updated weights as a new model checkpoint.
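A minimal fine-tuning sketch in PyTorch, assuming an image-classification task; the model choice, class count, and synthetic data are illustrative stand-ins:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Load an off-the-shelf pre-trained model and replace its output head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)  # 5 hypothetical classes

# Synthetic stand-in data; in practice this is your labeled dataset.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for batch_images, batch_labels in loader:  # re-train on concrete examples
    optimizer.zero_grad()
    loss = loss_fn(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "finetuned_checkpoint.pt")  # new model checkpoint
```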
Word embedding
A word embedding, trained on word co-occurrence in text corpora, represents each word (or common phrase) w as a d-dimensional word vector w ∈ ℝᵈ. It serves as a dictionary of sorts for computer programs that would like to use word meaning. First, words with similar semantic meanings tend to have vectors that are close together. Second, the vector differences between words in embeddings have been shown to represent relationships between words.
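For illustration, a toy sketch of both properties with hypothetical hand-picked vectors (real embeddings are learned and use hundreds of dimensions):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.2, 0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Vector differences encode relationships: king - man + woman ≈ queen
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))  # close to 1.0 for these toy vectors
```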
Variable
A variable is a characteristic of a unit being observed that may assume more than one of a set of values to which a numerical measure or a category from a classification can be assigned.
Variance
The variance is the mean square deviation of the variable around its average value. It reflects the dispersion of the empirical values around their mean.
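A quick worked example (population variance, dividing by n):

```python
values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(values) / len(values)                               # 5.0
variance = sum((x - mean) ** 2 for x in values) / len(values)  # 4.0
print(variance)
```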
Quality
The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs.
Knowledge
The sum of all information derived from diagnostic, descriptive, predictive, and prescriptive analytics embedded in or available to or from a cognitive computing system.
Embedding
An embedding is a representation of a topological object, manifold, graph, field, etc. in a certain space in such a way that its connectivity or algebraic properties are preserved. For example, a field embedding preserves the algebraic structure of plus and times, an embedding of a topological space preserves open sets, and a graph embedding preserves connectivity. One space X is embedded in another space Y when the properties of Y restricted to X are the same as the properties of X.
Model-assisted labeling
A process in machine learning and data annotation where human annotators label data with the assistance of pre-trained machine learning models. Instead of relying solely on manual annotation, which can be time-consuming and expensive, model-assisted labeling leverages the predictions of machine learning models to accelerate the annotation process, letting labeling teams focus time on refining, accepting, or rejecting model predictions rather than starting from scratch with each label.
ChatGPT
A system built with a neural network transformer type of AI model that works well in natural language processing tasks. In this case, the model: (1) can generate responses to questions (Generative); (2) was trained in advance on a large amount of the written material available on the web (Pre-trained); (3) and can process sentences differently than other types of models (Transformer).
Foundation models
Foundation models are large models trained on broad data that can serve as a foundation for developing other models. For example, generative AI systems build on large language foundation models. They can speed up the development of new systems, but there is controversy about using foundation models: depending on where their data comes from, they raise different issues of trustworthiness and bias.
AGI (Artificial General Intelligence)
Algorithms that can perform a wide variety of tasks and switch fluidly from one activity to another in the manner that humans do.
AI (Artificial Intelligence)
AI is a branch of computer science. AI systems use hardware, algorithms, and data to create “intelligence” to do things like make decisions, discover patterns, and perform some sort of action. AI is a general term and there are more specific terms used in the field of AI. AI systems can be built in different ways, two of the primary ways are: (1) through the use of rules provided by a human (rule-based systems); or (2) with machine learning algorithms.
Labelbox Workspace
Enables admins at large organizations to manage multiple instances of Labelbox with the same subscription account.
Labelbox Workflow
A workflow is a queue for labeling and reviewing assets within a project. Workflows provide granular control over data row reviews. Workflows are highly customizable and help define a step-by-step pipeline leading to an efficient and more accurate process.
Unsupervised learning
Algorithms that take a set of data consisting only of inputs and attempt to cluster the data objects based on the similarities or dissimilarities among them.
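A minimal sketch using scikit-learn's k-means on toy inputs with no labels:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])  # inputs only
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # two discovered groups, e.g. [0 0 1 1]
```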
Unstructured data
Unstructured data is defined as information that is not arranged according to a preset data model or schema, and therefore cannot be stored in a traditional database.
Underfitting
Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.
Transformers
A neural network architecture built around the self-attention mechanism, which lets a model weigh the relevance of different parts of its input sequence. Transformers underpin most modern language models, including BERT and the GPT family.
Transfer learning
A technique in machine learning in which an algorithm learns to perform one task, such as recognizing cars, and builds on that knowledge when learning a different but related task, such as recognizing cats.
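A minimal sketch of this idea in PyTorch, with torchvision's ImageNet-pretrained ResNet-18 as an illustrative starting model:

```python
import torch.nn as nn
from torchvision import models

# Reuse a network trained on one task for a related one, keeping its
# learned features and training only a new output head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                # freeze the transferred knowledge
model.fc = nn.Linear(model.fc.in_features, 2)  # new head, e.g. cat vs. not-cat
# A training loop would now update only model.fc.
```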
Training data
A dataset from which a model is learned.
Labelbox Template
If a data row needs to be relabeled, you can delete the annotations and then select existing annotations to use as a template for the next data row displayed in the editor. This allows you to curate a set of annotations, rather than start from scratch for each data row.
Taxonomy
Taxonomy refers to classification according to presumed natural relationships among types and their subtypes.
Supervised learning
A type of machine learning in which the algorithm compares its outputs with the correct outputs during training. In unsupervised learning, the algorithm merely looks for patterns in a set of data.
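A minimal supervised-learning sketch with scikit-learn, fitting against known correct outputs:

```python
from sklearn.linear_model import LogisticRegression

X = [[0.1], [0.4], [0.6], [0.9]]    # inputs
y = [0, 0, 1, 1]                    # correct outputs supplied during training
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.2], [0.8]]))  # [0 1]
```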
Labelbox Schema
The schema is the master blueprint for your training data and includes ontologies, features, and metadata.
Robotic process automation
A preconfigured software instance that uses business rules and predefined activity choreography to complete the autonomous execution of a combination of processes, activities, transactions, and tasks in one or more unrelated software systems to deliver a result or service with human exception management.
RLHF (Reinforcement learning with human feedback)
RLHF is an extension of Reinforcement Learning (RL), a reward and punishment-based training technique for AI models. It involves training a model through iterative interactions where humans provide guidance or evaluations to improve the model's decision-making process.
Reinforcement learning
A type of machine learning in which the algorithm learns by acting toward an abstract goal, such as “earn a high video game score” or “manage a factory efficiently.” During training, each effort is evaluated based on its contribution toward the goal.
Labelbox Queue
Labelbox has three queues to help move data rows through the labeling and review workflow: the batches queue, the labeling queue, and the review tasks queue.
Labelbox Project
The labeling environment in Labelbox, like a factory assembly line for producing labels. The initial state of the project can start with raw data, pre-existing ground truth, or pre-labeled data.
Preprocessing algorithm
A bias mitigation algorithm that is applied to training data.
Prediction
Output from your machine learning model that you can add to a data row to serve as a template for faster labeling.
Precision
A metric for classification models. Precision identifies the frequency with which a model was correct when classifying the positive class.
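As a quick worked example:

```python
true_positives = 8     # positive predictions that were correct
false_positives = 2    # positive predictions that were wrong
precision = true_positives / (true_positives + false_positives)
print(precision)       # 0.8
```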
Overfitting
In statistics and machine learning, overfitting occurs when a model tries to predict a trend in data that is too noisy. Overfitting is the result of an overly complex model with too many parameters. A model that is overfitted is inaccurate because the trend does not reflect the reality of the data. An overfitted model is a model with a trend line that reflects the errors in the data that it is trained with, instead of accurately predicting unseen data.
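A small numpy sketch of the idea: a high-degree polynomial chases the noise in ten training points, while a line captures the underlying trend:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.2, size=10)  # a linear trend plus noise

simple = np.polyfit(x, y, deg=1)         # captures the trend
complex_fit = np.polyfit(x, y, deg=9)    # too many parameters: fits the noise

x_unseen = 0.55                          # a point not in the training data
print(np.polyval(simple, x_unseen), np.polyval(complex_fit, x_unseen))
```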
Ontology
A collection of features and their relationships (also known as a taxonomy). Ontologies can be reused across different projects. Ontologies are essential for data labeling, model training, and evaluation. When you label or review a data asset, the ontology appears in the Tools panel.
Neural network
A highly abstracted and simplified model of the human brain used in machine learning. A set of units receives pieces of an input (pixels in a photo, say), performs simple computations on them, and passes them on to the next layer of units. The final layer represents the answer.
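A toy numpy sketch of this layered computation (weights here are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.random(4)              # a toy input: 4 "pixel" values
W1 = rng.random((3, 4))             # weights into a layer of 3 units
W2 = rng.random((1, 3))             # weights into the final layer

hidden = np.maximum(0, W1 @ pixels) # each unit computes and passes results on
answer = W2 @ hidden                # the final layer represents the answer
print(answer)
```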
Nested classification
A classification-type annotation that is nested within an object-type annotation (as opposed to a global classification).
NLP (Natural language processing)
A computer's attempt to “understand” spoken or written language. It must parse vocabulary, grammar, and intent, and allow for variation in language use. The process often involves machine learning.
Model run
A model run is a model training experiment within a model directory. Each model run has its data snapshot (data rows, annotations, and data splits) versioned. You can upload predictions to a model run, and compare results and performance against other model runs in the model directory.
Labelbox Model
A Model is a directory where you can create, manage, and compare a set of Model Runs related to the same machine learning task. Each Model is specified by an ontology of data: it defines the machine learning task of the Model Runs inside the directory.
Model
Machine learning algorithms and data processing designed, developed, trained and implemented to achieve set outputs, inclusive of datasets used for said purposes unless otherwise stated.
Metadata
Data employed to annotate other data with descriptive information, possibly including their data descriptions, data about data ownership, access paths, access rights, and data volatility.
Labelbox Metadata
Metadata is non-annotation information about the asset to be labeled. There are two types of metadata: reserved keys (which cannot be changed) and custom (user-defined). Metadata helps search and filter data rows.
Media attributes
When you upload data assets, Labelbox automatically computes media attributes appropriate for the data type and stores their values as part of the data row. Examples include mimeType, width, height, codec, and more.
MLOps (Machine learning operations)
MLOps (machine learning operations) stands for the collection of techniques and tools for the deployment of ML models in production.
Machine learning
The study or the application of computer algorithms that improve automatically through experience. Machine learning algorithms build a model based on training data in order to perform a specific task, like aiding in prediction or decision-making processes, without necessarily being explicitly programmed to do so.
LLM (Large Language Models)
A class of language models that use deep-learning algorithms and are trained on extremely large textual datasets that can be multiple terabytes in size. LLMs can be classed into two types: generative or discriminative.
Generative LLMs are models that output text, such as the answer to a question or an essay on a specific topic. They are typically unsupervised or semi-supervised learning models that predict the response for a given task. Discriminative LLMs are supervised learning models that usually focus on classifying text, such as determining whether a text was written by a human or an AI.
Inference
The stage of machine learning in which a model is applied to a task. For example, a classifier model produces the classification of a test sample.
Image segmentation
Image segmentation is the process of separating an image into multiple parts to simplify its representation and facilitate analysis for training a computer vision model. It is one of the most labor-intensive annotation tasks because it requires pixel-level accuracy; labeling a single image can take up to 30 minutes. With image segmentation, each annotated pixel in an image belongs to a single class. The output is a mask that outlines the shape of the object in the image.
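A toy sketch of a mask as data, where each pixel holds a single class ID:

```python
import numpy as np

mask = np.zeros((4, 6), dtype=np.uint8)  # a tiny 4x6 "image"; 0 = background
mask[1:3, 2:5] = 1                       # pixels assigned to class 1
print(mask)                              # the nonzero region outlines the object
```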
Hyperparameter
The parameters that are used to either configure a machine learning model (e.g., the penalty parameter C in a support vector machine, and the learning rate to train a neural network) or to specify the algorithm used to minimize the loss function (e.g., the activation function and optimizer types in a neural network, and the kernel type in a support vector machine).
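For illustration, the hyperparameters named above as they appear in scikit-learn constructors (values are arbitrary):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

svm = SVC(C=1.0, kernel="rbf")  # penalty parameter C and kernel type
net = MLPClassifier(learning_rate_init=0.01, activation="relu", solver="adam")
# learning rate, activation function, and optimizer for a neural network
```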
Ground truth
A ground truth is information that is known to be real or true, as supported by direct observation and measurement. Labels made by humans are considered to be empirical ground truths, as opposed to labels added through model inference.
GPU (Graphics processing unit)
A specialized chip capable of highly parallel processing. GPUs are well-suited for running machine learning and deep learning algorithms. GPUs were first developed for efficient parallel processing of arrays of values used in computer graphics. Modern-day GPUs are designed to be optimized for machine learning.
GAN (Generative Adversarial Network)
Generative Adversarial Networks, or GANs for short, are an approach to generative modeling using deep learning methods, such as convolutional neural networks. Generative modeling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
Feature extraction
A method in which one develops a transformation of the input space onto a low-dimensional subspace that preserves most of the relevant information.
Feature
A feature is the master definition of what you want the model to predict. It is also the blueprint for your ground truth. Ontologies consist of features, which include objects (e.g., bounding box) and classifications (e.g., radio buttons). Features can have multiple, nested classifications.
Labelbox Editor
The labeling interface you can use to create, review, and edit annotations. When creating a new project, you're prompted to configure the editor, which defines the data type and the interface used while labeling.
Deep learning
An approach to AI that allows computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined through its relation to simpler concepts.
By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all the knowledge that the computer needs. The hierarchy of concepts enables the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers.
Dataset
Datasets are containers for data rows; they collect a set of related data assets.
Data type
The type of a data row, such as image (JPG/PNG), video (MP4), or text (TXT files).
Data split
You can split the selected data rows into train, validation, and test splits to prepare for model training and evaluation.
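One common way to produce such splits, sketched with scikit-learn (the 70/15/15 ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for data rows
train, rest = train_test_split(rows, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))  # 70 15 15
```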
Data row
Represents an individual data asset, along with associated attributes (such as global ID) and annotations, which can include:
URL to your cloud-hosted file
Metadata
Media attributes (e.g., data type, size, etc.)
Attachments (files that provide context for your labelers)
Predictions
Consensus
The Consensus tool lets you compare labelers against each other by comparing annotations on a given asset. Consensus works in real-time so you can take immediate and corrective actions toward boosting team and model performance.
Computer vision
Computer vision is a field of Artificial Intelligence (AI) technology that enables computer systems to perform tasks that require visual perception.
Confusion matrix
A matrix showing the predicted and actual classifications. A confusion matrix is of size L × L, where L is the number of distinct label values.
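A quick worked example with scikit-learn (here L = 2):

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
print(confusion_matrix(actual, predicted))
# [[1 1]
#  [1 3]]   rows = actual class, columns = predicted class
```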
Chatbot
A chatbot is a computer program that responds like an intelligent entity when conversed with, whether through text or voice. A chatbot understands one or more human languages via natural language processing.
Catalog
An organization-wide platform for curating and exploring your unstructured data. Catalog enables you to easily browse, curate, and develop insights across all labeled and unlabeled data rows in your organization.
BERT (Bidirectional Encoder Representations from Transformers)
A language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Batch
A method for selecting data rows from Catalog and sending them to a labeling project. Sending batches of data rows to a labeling project is an alternative to attaching an entire dataset to a project.
Labelbox Benchmark
The Benchmark tool lets you designate a labeled asset as a “gold standard” and automatically compare all other labels on that asset to the benchmark label.
Attachments
Supplementary information you can attach to an asset in order to provide contextual information for your labeling team. When viewing data rows in detail view, attachments appear on a separate side panel.
Annotation
A human-made or computer-generated label on an asset. Annotations can be imported (as ground truth or pre-labels) or created manually in the Labelbox editor. Annotations are categorized as objects (such as bounding box or polygon) or classifications (such as radio or checklist).
Active learning
A method for improving machine learning algorithms by allowing them to choose which data points to query, raising accuracy with fewer labels. At any point, the algorithm can choose a new point x, observe the output y, and incorporate the new (x, y) pair into its training set. It has been applied to neural networks, prediction functions, and clustering functions.
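A minimal uncertainty-sampling sketch in this spirit; the oracle function is a hypothetical stand-in for observing the true output y:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oracle(x):
    # Hidden "true" labeling rule, standing in for observing the output y.
    return (x[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
pool_X = rng.uniform(-1, 1, size=(100, 1))     # unlabeled candidate points

labeled_X = np.array([[-0.9], [0.9]])          # tiny seed training set
labeled_y = oracle(labeled_X)

for _ in range(5):
    clf = LogisticRegression().fit(labeled_X, labeled_y)
    probs = clf.predict_proba(pool_X)[:, 1]
    i = int(np.argmin(np.abs(probs - 0.5)))    # choose the most uncertain point x
    x_new = pool_X[i:i + 1]
    y_new = oracle(x_new)                      # observe the output
    labeled_X = np.vstack([labeled_X, x_new])  # incorporate the (x, y) pair
    labeled_y = np.concatenate([labeled_y, y_new])
    pool_X = np.delete(pool_X, i, axis=0)
```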