.. _data-model:
.. currentmodule:: lenskit.data
Data Model
==========
LensKit defines a holistic data model for recommender training and evaluation
data. The model is graph-structured, but the interfaces and definitions center
on tabular (data frame) views of that data, for ease of training across a
variety of statistical modeling packages.
Apache Arrow is used as the common format for data, and data type definitions
are drawn from there. Data is transparently converted to NumPy arrays, Pandas
series or data frames, Torch tensors, etc. as requested.
Most code will either use one of the predefined dataset loading functions (such
as :func:`~lenskit.data.load_movielens`) or the
:class:`~lenskit.data.DatasetBuilder` to create data sets (see :ref:`data-api`).
.. note::

   Working with the data directly as a heterogeneous graph for integration with
   packages like PyTorch-Geometric is not difficult, and will be directly
   supported in an upcoming backwards-compatible revision.
Example
~~~~~~~
To warm up, here is a simple version of how a MovieLens rating dataset is
represented in the LensKit data model, in `Chen's entity-relationship notation`_:
.. _Chen's entity-relationship notation: https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
..
   Michael is annoyed that this requires GraphViz and cannot yet be done in Mermaid.
   https://github.com/mermaid-js/mermaid/issues/3723
.. graphviz::

   graph G {
       rankdir="LR"
       node [shape=box, fontname="Sans-Serif", style=filled, fillcolor="#3fb1c5"]
       U [label="user"]
       I [label="item"]
       node [shape=oval, fillcolor="#f3cf95", style="filled,dashed"]
       Uid [label=<user_id>]
       Iid [label=<item_id>]
       node [style="filled"]
       It [label="title"]
       subgraph R {
           rank="same"
           node [shape=diamond, fillcolor="#e47fd7"]
           R [label="rating"]
           node[shape=oval, fillcolor="#f3cf95"]
           Rv [label="rating"]
           Rt [label="timestamp"]
           Rt -- R;
           R -- Rv;
       }
       U -- R;
       R -- I;
       Uid -- U;
       I -- Iid;
       I -- It;
   }

This model shows two :term:`entities <entity>`, **user** and **item**, connected
by a :term:`relationship` **rating**. Users, items, and ratings have various
:term:`attributes <attribute>` further describing them.
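
As a rough illustration of the tabular view, the entities and the relationship
in this diagram can be pictured as three small tables. The sketch below uses
plain Python dictionaries with made-up rows and hypothetical variable names; it
is not LensKit's storage format or API, only an aid for reading the diagram.

```python
# Entity tables: one row per entity, keyed by its identifier column.
users = [{"user_id": 1}, {"user_id": 2}]
items = [
    {"item_id": 10, "title": "First Movie"},   # made-up titles
    {"item_id": 20, "title": "Second Movie"},
]

# Relationship table: one row per rating, with the relationship's own
# attributes (rating, timestamp) stored next to the entity identifiers.
ratings = [
    {"user_id": 1, "item_id": 10, "rating": 4.0, "timestamp": 1000},
    {"user_id": 1, "item_id": 20, "rating": 2.5, "timestamp": 1060},
    {"user_id": 2, "item_id": 20, "rating": 5.0, "timestamp": 2000},
]

# Following the relationship from users to item attributes.
titles_by_id = {row["item_id"]: row["title"] for row in items}
rated_titles = {}
for row in ratings:
    rated_titles.setdefault(row["user_id"], []).append(titles_by_id[row["item_id"]])
```

Here ``rated_titles`` maps each user identifier to the titles that user rated,
which is the kind of query the graph structure makes natural.
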
Core Concepts
~~~~~~~~~~~~~
The LensKit data model has several core concepts, derived from the
entity-relationship database model:
.. glossary::

   Entity
      The items, users, sessions, etc. about which the data set records data.
      In a graph view of the data, these are the nodes in the graph.

   Entity Class
      Each entity has a particular class, such as ``item`` or ``user``, based
      on its role in the dataset. All data sets have at least the ``item``
      entity class. Entities do not have subtypes in the raw data model; if
      components want to conceptually treat entities as having subtypes, such
      as different types of items, they can use attributes to distinguish the
      different subtypes.

   Entity Identifier
      Each entity has a unique (within its type) *identifier*. Entity
      identifiers can be either integers or strings.

   Attribute
      Entities and relationships can have one or more *attributes*.
      Attributes are consistent within an entity type (i.e., each entity or
      relationship of a particular type has the same attributes with the same
      types), and are nullable (any individual entity may be missing a value
      for an attribute).

   Relationship
      A relationship connects two (or more) entities and may have additional
      attributes attached to the relationship itself. Relationships may also
      be repeated (more than one relationship record may exist for the same
      combination of entities).

   Relationship Class
      Relationship classes are like entity classes, but describe the type of a
      particular relationship. This allows for models or client code to query
      for records of a particular relationship, such as “follows” or
      “purchased”. Each relationship class has a fixed list of entity classes
      that participate in relationships of that class. For example, a
      ``rating`` class typically has the ``user`` and ``item`` entity classes
      participating.

   Interaction
      An interaction is a specific type of relationship record that records an
      interaction between two or more entities, such as a user rating a book,
      or a user purchasing a product in a particular session. Interactions
      usually, but not always, have timestamps.

.. _data-entities:
Entities
~~~~~~~~
*Entities* in the LensKit data model represent individual objects in the data,
such as users or items. An entity is defined by its class and identifier, and
nothing else is directly recorded about the entity itself — the interesting data
resides in its attributes and relationships.
Entity identifiers can be integers or strings.
Every data set has the entity class ``item`` for the items that may be
recommended. Most datasets also have the class ``user``. Session-aware
recommendation data sets may have an entity class ``session``.
When representing entities or entity data in tabular form, identifiers are
stored in a column named for the entity class with an ``_id`` suffix (e.g.
``item_id``). Dataset functions that map identifiers to 0-based contiguous
array indexes use the corresponding ``_num`` suffix (e.g. ``item_num``) for
this index, referred to as the *entity number*.
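
The relationship between identifiers and entity numbers can be sketched in a
few lines of plain Python. The helper names here (``number_entities``,
``entity_table``) are hypothetical, and the first-seen ordering is an
assumption made for the example; LensKit may assign numbers in a different
order.

```python
def number_entities(ids):
    """Assign 0-based contiguous numbers to identifiers, in first-seen order."""
    numbers = {}
    for ident in ids:
        if ident not in numbers:
            numbers[ident] = len(numbers)
    return numbers


def entity_table(entity_class, ids):
    """Build rows with '<class>_id' and '<class>_num' style column names."""
    numbers = number_entities(ids)
    return [
        {f"{entity_class}_id": ident, f"{entity_class}_num": num}
        for ident, num in numbers.items()
    ]


# Identifiers may be integers or strings; duplicates map to the same number.
item_rows = entity_table("item", [1193, 661, 914, 1193])
user_rows = entity_table("user", ["alice", "bob"])
```

Because the numbers are contiguous and start at zero, they can be used directly
as row or column indexes in arrays and matrices.
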
.. _data-attributes:
Attributes
~~~~~~~~~~
Entities (and relationships) can have associated *attributes* providing data
about that entity, relationship, or interaction. This can be anything from a
timestamp to review text to complex item metadata. Attributes are associated
with entity or relationship *classes*, and have types that must be consistent
across the class (each entity or relationship class has a schema defining its
attributes and their types).
Attributes come in several forms (called a *layout*):
- **Scalar** attributes store a single value for each entity or relationship
  instance. The value can be any type supported by NumPy or Apache Arrow.
  Attribute values may be missing.
- **List** attributes store zero or more values for each entity or
  relationship instance. List elements must have the same type.
- **Vector** attributes store a fixed-length vector of integer or
  floating-point values for each entity or relationship instance. The vector
  length is defined by the entity or relationship class, and must be the same
  for all instances of that class for which the vector attribute is defined.
  The vector dimensions may have associated labels or names, or they may just
  be numbered (e.g., for representing embeddings from a language model).
- **Sparse** attributes are vector attributes that are stored in compressed
  sparse format, with missing values understood to be 0.
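
To make the four layouts concrete, the following sketch writes out made-up
attribute values for three entities using plain Python lists. The variable
names and the compressed-sparse-row arrays (``indptr``, ``indices``,
``values``) are illustrative assumptions, not LensKit's internal storage.

```python
# Scalar: one value per entity; None marks a missing value.
scalar_title = ["First Movie", "Second Movie", None]

# List: zero or more values per entity, all of the same type.
list_genres = [["Comedy", "Drama"], ["Crime"], []]

# Vector: the same fixed length (here 2) for every entity.
vector_embedding = [[0.1, 0.3], [0.0, 0.9], [0.5, 0.5]]
assert len({len(v) for v in vector_embedding}) == 1

# Sparse: a 4-dimensional vector per entity in compressed sparse row form.
# Row i owns the entries at positions indptr[i] .. indptr[i + 1] - 1.
N_DIMS = 4
indptr = [0, 1, 1, 3]
indices = [0, 2, 3]
values = [1.0, 3.0, 0.5]


def sparse_row_to_dense(row):
    """Expand one sparse row, treating entries that are not stored as 0."""
    dense = [0.0] * N_DIMS
    for k in range(indptr[row], indptr[row + 1]):
        dense[indices[k]] = values[k]
    return dense


dense_rows = [sparse_row_to_dense(i) for i in range(len(indptr) - 1)]
```

The second entity stores no entries at all, so its dense form is all zeros.
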
Attribute Name Restrictions
---------------------------
Attribute names can be freely chosen, subject to a few lightweight restrictions:
- Within an entity or relationship class, names must be unique.
- Names must not start with an underscore (e.g. ``_$FOO``).
- For each entity class ``$FOO``, the names ``$FOO_id`` and ``$FOO_num`` are
  reserved by LensKit and cannot be used by user-defined attributes (on any
  entity or relationship). We recommend avoiding all attribute names that
  begin with ``$FOO_``.
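
These rules are simple enough to check mechanically. The function below is a
hypothetical validator written for this page (``check_attribute_name`` is not a
LensKit function); it only restates the three restrictions above in code.

```python
def check_attribute_name(name, existing_names, entity_classes):
    """Return a list of problems with a proposed attribute name (empty if OK)."""
    problems = []
    if name in existing_names:
        problems.append("duplicate name within the class")
    if name.startswith("_"):
        problems.append("starts with an underscore")
    for cls in entity_classes:
        # '<class>_id' and '<class>_num' are reserved for every entity class.
        if name in (f"{cls}_id", f"{cls}_num"):
            problems.append(f"reserved for entity class {cls!r}")
    return problems


classes = ["user", "item"]
ok = check_attribute_name("title", {"genres"}, classes)
bad_prefix = check_attribute_name("_hidden", set(), classes)
bad_reserved = check_attribute_name("item_num", set(), classes)
bad_duplicate = check_attribute_name("title", {"title"}, classes)
```
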
Unsupported Features
--------------------
In the initial release of the new LensKit data model (in :ref:`2025.1`), not all
possible attribute and entity or relationship class combinations are supported.
In particular, relationships can only have scalar attributes. We intend to
relax this restriction in the future, with more time to determine an ergonomic
API for accessing such data. All attribute formats are supported for entities.
Repeated relationships are also not yet fully supported. Support is planned for
LensKit 2025.2.
.. _data-relationships:
Relationships
~~~~~~~~~~~~~
Relationships are links between two (or more) entities, optionally with
associated attributes. They are further divided into classes, with each class
defining its own set of relationship attributes.
Most relationships are between entities of different classes, in which case the
entity identifiers are stored in columns named for each entity class with an
``_id`` (or ``_num``) suffix, such as ``user_id`` and ``item_id``.
For self-relationships, in which the same entity class appears more than once,
this is not possible; such relationships must define *aliases* for one or more
of their appearances, and LensKit uses these
aliases to derive the appropriate column names. For example, a relationship
class that encodes citation relationships in a research paper recommender system
would be a self-relationship between items. It can alias ``item`` to ``citing``
and ``cited``, in which case the item identifiers are taken from ``citing_id``
and ``cited_id`` columns (or ``citing_num`` and ``cited_num``).
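
The column-naming rule, including aliases, can be sketched as follows. The
function ``relationship_columns`` and its ``(entity_class, alias)`` input
format are hypothetical and chosen only for this example; they do not describe
the actual LensKit interface.

```python
def relationship_columns(entities, suffix="id"):
    """Derive column names from (entity_class, alias_or_None) pairs."""
    columns = []
    for entity_class, alias in entities:
        # An alias, when present, replaces the class name in the column name.
        base = alias if alias is not None else entity_class
        column = f"{base}_{suffix}"
        if column in columns:
            raise ValueError(f"duplicate column {column!r}; an alias is needed")
        columns.append(column)
    return columns


rating_cols = relationship_columns([("user", None), ("item", None)])
citation_cols = relationship_columns([("item", "citing"), ("item", "cited")])
citation_num_cols = relationship_columns(
    [("item", "citing"), ("item", "cited")], suffix="num"
)
```

Without aliases, the citation relationship would need two ``item_id`` columns,
which is why the sketch raises an error in that case.
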
.. note::

   Entity and relationship class names must be unique (you cannot use the same
   name for an entity class and a relationship class).
.. _data-interactions:
Interactions
~~~~~~~~~~~~
An interaction is a relationship that indicates some kind of interaction between
entities for the purposes of learning and evaluating recommendations, such as
purchasing, shelving, clicking, or rating. There is no logical difference
between relationships and interactions; an interaction class is just a
relationship class that has been declared to represent interactions, so that
client and model code knows to treat it as interaction data. Most data sets
define a single interaction class, but can define more than one.
- Interactions should always involve the ``item`` entity class, without an
  alias, preferably as the last entity class in the relationship definition.
- Interactions usually have timestamps (although this is not strictly
  required). Timestamps can be either integers (treated as UNIX timestamps)
  or Arrow timestamp types.
- The dataset can designate a *default interaction class* so that model code
  can request the “interactions” without needing to know the different classes
  involved. If no default class is specified, and more than one class is
  defined, it is an error to request the interactions without specifying an
  interaction class.
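
The rule for choosing an interaction class can be written as a small decision
procedure. This is a sketch under stated assumptions: the function name and
arguments are hypothetical, and using the only defined class when there is no
default is inferred from the rule above rather than stated by it.

```python
def resolve_interaction_class(classes, default=None, requested=None):
    """Pick the interaction class to use for a request."""
    if requested is not None:
        if requested not in classes:
            raise KeyError(f"unknown interaction class {requested!r}")
        return requested
    if default is not None:
        return default
    if len(classes) == 1:
        # Assumption: a single class is unambiguous even without a default.
        return classes[0]
    raise ValueError("no default interaction class; specify one explicitly")


single = resolve_interaction_class(["rating"])
with_default = resolve_interaction_class(["click", "purchase"], default="purchase")
explicit = resolve_interaction_class(["click", "purchase"], requested="click")
```

Calling ``resolve_interaction_class(["click", "purchase"])`` with neither a
default nor an explicit request raises ``ValueError``, mirroring the error case
described in the last bullet.
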
Certain attribute names, if defined, have particular meaning for interaction
records:
``timestamp``
    The date and time of the interaction, as a UNIX or Arrow timestamp.

``rating``
    A user-supplied rating for the user-item pair.

``count``
    A count of the interactions between this pair. If client code requests a
    matrix of interaction counts, and this attribute is defined, then its sum is
    used as the total count of interactions between the entities. If no
    ``count`` attribute is defined, then a matrix of interaction counts is
    computed by counting the interaction records.

.. todo::

   Define what happens when ``count`` is NULL.
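
The two ways of deriving interaction counts described for the ``count``
attribute can be sketched in plain Python. The function and record layout are
hypothetical, the rows are made up, and the sketch assumes every ``count``
value is present, since the behavior for missing counts is not yet defined.

```python
from collections import defaultdict


def interaction_counts(records, has_count_attribute):
    """Total interactions per (user, item) pair."""
    totals = defaultdict(int)
    for rec in records:
        pair = (rec["user_id"], rec["item_id"])
        # Sum the 'count' attribute if the class defines it; otherwise each
        # record counts as one interaction.
        totals[pair] += rec["count"] if has_count_attribute else 1
    return dict(totals)


plays = [
    {"user_id": 1, "item_id": 10, "count": 3},
    {"user_id": 1, "item_id": 10, "count": 2},
    {"user_id": 2, "item_id": 10, "count": 1},
]
clicks = [
    {"user_id": 1, "item_id": 10},
    {"user_id": 1, "item_id": 10},
    {"user_id": 2, "item_id": 11},
]

play_totals = interaction_counts(plays, has_count_attribute=True)
click_totals = interaction_counts(clicks, has_count_attribute=False)
```
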
The order of entity classes in an interaction type is mildly meaningful: by
convention, the last entity class is the item, and the “interactor” (e.g.,
user or session) is first.
.. _data-schema:
Schemas
~~~~~~~
A data *schema* (:class:`~lenskit.data.DataSchema`) defines the layout of the
tables, entity types, and relationship types. Client code will rarely need to
create or work with the schema directly; it is created and maintained by the
:class:`~lenskit.data.DatasetBuilder`.
.. _data-format:
LensKit Native Format
~~~~~~~~~~~~~~~~~~~~~
LensKit supports saving and loading data sets in *native* format: an optimized
format that can fully serialize and deserialize a dataset, with all supported
features. The :meth:`Dataset.save` (or :meth:`DatasetBuilder.save`) and
:meth:`Dataset.load` methods save and load datasets in this format, respectively. The
native format is not intended to be directly manipulated; loading it with
:meth:`Dataset.load` and extracting data from the resulting :class:`Dataset` is
the best way to process it.
LensKit maintains backwards compatibility within the current and previous major
releases. That is, data saved with LensKit 2025.3 can be read by any later
release of LensKit 2025.x or LensKit 2026.x. Further, format changes will not
be introduced in patch releases, so data is mutually intelligible between
202X.Y.Z1 and 202X.Y.Z2 for any Z1 and Z2.
Forward compatibility is not yet maintained; data saved with one version of
LensKit may not be readable by earlier versions.
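
The compatibility policy can be restated as a predicate over version strings.
This is a sketch of the stated policy only: ``read_is_guaranteed`` is a
hypothetical helper, not part of LensKit, and a ``False`` result means that
compatibility is not promised, not that loading will necessarily fail.

```python
def parse_version(text):
    """Split 'YYYY.N' or 'YYYY.N.P' into (major, minor); the patch is ignored."""
    major, minor, *_patch = (int(part) for part in text.split("."))
    return major, minor


def read_is_guaranteed(saved_with, reader):
    """Does the policy promise that `reader` can load data from `saved_with`?"""
    saved = parse_version(saved_with)
    reading = parse_version(reader)
    if reading < saved:
        # Forward compatibility (older reader, newer data) is not maintained.
        return False
    # Backwards compatibility covers the same and the next major release.
    return reading[0] - saved[0] <= 1


same_year = read_is_guaranteed("2025.3", "2025.4")
next_year = read_is_guaranteed("2025.3", "2026.1")
two_years = read_is_guaranteed("2025.3", "2027.1")
older_reader = read_is_guaranteed("2025.3", "2025.2")
patch_only = read_is_guaranteed("2025.3.2", "2025.3.0")
```
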
.. _data-internal:
Internal Representation
~~~~~~~~~~~~~~~~~~~~~~~
Data should only be accessed through the :class:`~lenskit.data.Dataset` API, as
the internal storage is subject to change.
In the current version, logically, each entity or relationship type is
represented as a table, consisting of:
- One or more entity identifier or number columns
- Zero or more attribute columns
Data may be internally broken into sub-tables for efficiency (e.g., for very
sparse attributes), but this is the logical view. Internally, relationships use
entity numbers instead of entity IDs to record the entities involved in a
relationship record.
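
The difference between the ID-based logical view and the number-based internal
view of a relationship table can be sketched as follows. All names and rows are
made up for illustration, and the first-seen numbering is an assumption; the
actual internal storage is subject to change, as noted above.

```python
def build_index(ids):
    """Map each distinct identifier to a 0-based number, in first-seen order."""
    return {ident: num for num, ident in enumerate(dict.fromkeys(ids))}


user_index = build_index(["alice", "bob"])
item_index = build_index([10, 20, 30])

# Logical view: relationship records refer to entities by identifier.
ratings_by_id = [
    {"user_id": "alice", "item_id": 20, "rating": 4.0},
    {"user_id": "bob", "item_id": 10, "rating": 3.5},
    {"user_id": "alice", "item_id": 30, "rating": 5.0},
]

# Internal-style view: the same records, referring to entity numbers instead.
ratings_by_num = [
    {
        "user_num": user_index[rec["user_id"]],
        "item_num": item_index[rec["item_id"]],
        "rating": rec["rating"],
    }
    for rec in ratings_by_id
]
```
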
As of LensKit 2025.1, the native format for storing a dataset on disk (used by
:meth:`~lenskit.data.Dataset.save` and :meth:`~lenskit.data.Dataset.load`) is a
directory with a ``schema.json`` file containing the serialized logical schema
and a Parquet (``.parquet``) file for each entity or relationship class
containing the identifiers and attribute values. For entity classes, the
Parquet file contains both the entity IDs and entity numbers.