Entity Resolution¶
Multi-strategy resolution cascade: identifier match → string similarity → embedding match → LLM disambiguation → Splink linkage.
Orchestrator¶
orchestrator
¶
Multi-signal resolution orchestrator.
Combines all resolution methods (identifier, string, embedding, Splink,
graph, and LLM) into a single pipeline. Produces ResolvedEntity objects
with per-method confidence breakdowns and A0-A3 assurance levels.
The orchestrator implements a cascade pattern: cheap deterministic methods (identifier matching) run first, and more expensive probabilistic methods (embedding, Splink, LLM) only fire for records that remain unresolved. Signal weights are configurable per deployment.
Notes
This is the top-level entry point for Pipeline 2 (Entity Resolution) in the five-pipeline architecture. See Teikari (2026), Section 4 for the theoretical framework behind multi-signal resolution and the A0-A3 assurance level mapping.
See Also
music_attribution.resolution.identifier_match : Stage 1 -- exact ID matching.
music_attribution.resolution.string_similarity : Stage 2 -- fuzzy name matching.
music_attribution.resolution.embedding_match : Stage 3 -- semantic similarity.
music_attribution.resolution.splink_linkage : Stage 4 -- probabilistic linkage.
music_attribution.resolution.graph_resolution : Stage 5 -- graph evidence.
music_attribution.resolution.llm_disambiguation : Stage 6 -- LLM tie-breaking.
ResolutionOrchestrator
¶
Orchestrate multi-signal entity resolution.
Combines identifier matching, string similarity, embedding similarity, Splink, graph evidence, and LLM disambiguation into a unified pipeline. Each signal contributes a weighted score; the final confidence is the weighted average of all active signals.
| PARAMETER | DESCRIPTION |
|---|---|
| `weights` | Per-method weight overrides for score combination. Keys are method names. |
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_weights` | Active signal weights. |
| `_id_matcher` | Stage 1 identifier matcher. |
| `_string_matcher` | Stage 2 string similarity matcher. |
Examples:
>>> orchestrator = ResolutionOrchestrator()
>>> entities = await orchestrator.resolve(normalized_records)
Source code in src/music_attribution/resolution/orchestrator.py
resolve
async
¶
resolve(
records: list[NormalizedRecord],
) -> list[ResolvedEntity]
Resolve a list of NormalizedRecords into ResolvedEntities.
Executes the resolution cascade in order:
- Group records by shared identifiers (exact match).
- For ungrouped records, attempt string-similarity grouping.
- Remaining singletons form their own groups.
- Each group is resolved into a ResolvedEntity with confidence scores and assurance levels.
| PARAMETER | DESCRIPTION |
|---|---|
| `records` | Input records from the ETL pipeline. Each record represents a single source's view of an entity (artist, work, recording). TYPE: `list[NormalizedRecord]` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[ResolvedEntity]` | One resolved entity per group found. |
Notes
The cascade ordering ensures that high-confidence deterministic matches are found first, reducing the workload for expensive probabilistic methods downstream.
Source code in src/music_attribution/resolution/orchestrator.py
resolve_group
async
¶
resolve_group(
records: list[NormalizedRecord],
) -> ResolvedEntity
Resolve a pre-clustered group of records into a single entity.
Merges identifiers, picks the canonical name, detects cross-source conflicts, and computes a weighted confidence score from all available resolution signals.
| PARAMETER | DESCRIPTION |
|---|---|
| `records` | Pre-clustered records believed to represent the same entity. Must contain at least one record. TYPE: `list[NormalizedRecord]` |

| RETURNS | DESCRIPTION |
|---|---|
| `ResolvedEntity` | A merged entity with combined identifiers, canonical name, per-method confidence breakdown, and assurance level. |
Notes
Records with confidence below _REVIEW_THRESHOLD (0.5) are
automatically flagged for human review in the attribution pipeline.
Source code in src/music_attribution/resolution/orchestrator.py
Identifier Match¶
identifier_match
¶
Identifier-based exact matching for entity resolution.
Stage 1 of the resolution cascade. The simplest and highest-confidence resolution method: if two records share the same ISRC, ISWC, ISNI, or MBID, they refer to the same entity with confidence approaching 1.0.
Standardized identifiers provide the strongest resolution signal because they are globally unique by design:
- ISRC (International Standard Recording Code) -- identifies recordings
- ISWC (International Standard Musical Work Code) -- identifies compositions
- ISNI (International Standard Name Identifier) -- identifies contributors
- MBID (MusicBrainz Identifier) -- MusicBrainz-specific stable UUID
- AcoustID -- acoustic fingerprint identifier
Notes
This module implements the deterministic resolution layer described in Teikari (2026), Section 4.1. Because standardized identifiers are globally unique, matches found here bypass all downstream probabilistic methods and receive A1+ assurance levels automatically.
See Also
music_attribution.resolution.orchestrator : Cascade coordinator that calls this first.
music_attribution.resolution.string_similarity : Fallback for records without identifiers.
IdentifierMatcher
¶
Resolve entities by exact identifier matching.
Two NormalizedRecord objects sharing any standard identifier (ISRC,
ISWC, ISNI, MBID, AcoustID) are considered the same entity. This is the
highest-confidence resolution strategy because standardized identifiers
are globally unique by design.
The matcher uses a union-find data structure with path compression to efficiently cluster records that share identifiers, even transitively (e.g., record A shares ISRC with B, and B shares MBID with C, so A, B, C are all the same entity).
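The transitive clustering can be illustrated with a minimal union-find sketch. The dict records and identifier values below are simplified stand-ins for NormalizedRecord, not the module's actual data model:

```python
# Minimal union-find with path compression: records sharing any identifier
# value are merged, transitively. Records are simplified to {id_type: value}.
def cluster_by_identifier(records: list[dict[str, str]]) -> list[list[int]]:
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression (halving)
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen: dict[tuple[str, str], int] = {}  # (id_type, value) -> first index
    for idx, rec in enumerate(records):
        for id_type, value in rec.items():
            key = (id_type, value)
            if key in seen:
                union(seen[key], idx)
            else:
                seen[key] = idx

    groups: dict[int, list[int]] = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

# A shares an ISRC with B; B shares an MBID with C -> one transitive cluster.
# (Identifier values here are made up for illustration.)
records = [
    {"isrc": "XX0000000001"},
    {"isrc": "XX0000000001", "mbid": "fake-mbid-1"},
    {"mbid": "fake-mbid-1"},
]
print(cluster_by_identifier(records))  # -> [[0, 1, 2]]
```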
Notes
This is Stage 1 of the resolution cascade. Records matched here bypass all downstream probabilistic methods. See Teikari (2026), Section 4.1.
See Also
music_attribution.resolution.orchestrator : Cascade coordinator.
music_attribution.resolution.string_similarity : Stage 2 fallback.
match
¶
match(
records: list[NormalizedRecord],
) -> list[ResolvedEntity]
Match records by shared identifiers and produce ResolvedEntities.
Groups records using union-find on shared identifier values, then
builds a ResolvedEntity for each group with merged identifiers,
conflict detection, and assurance level computation.
| PARAMETER | DESCRIPTION |
|---|---|
| `records` | Input records to match. Records without any identifiers will form singleton groups. TYPE: `list[NormalizedRecord]` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[ResolvedEntity]` | One entity per distinct group found. Multi-record groups record the identifier match as their resolution method. |
Source code in src/music_attribution/resolution/identifier_match.py
String Similarity¶
string_similarity
¶
String similarity matching for entity resolution.
Stage 2 of the resolution cascade. Fast fuzzy matching for entity names using Jaro-Winkler distance (via jellyfish) and token-sort ratio (via thefuzz). Handles common music-domain variations:

- "The" prefix reordering ("Beatles, The" -> "the beatles")
- Accented character normalization ("Björk" matches "Bjork")
- Abbreviation expansion ("feat." -> "featuring", "ft." -> "featuring")
- Whitespace normalization

The two similarity algorithms are complementary:

- Jaro-Winkler excels at short strings and character-level typos.
- Token-sort ratio handles word reordering ("John Elton" matches "Elton John").
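The complementary behavior can be approximated with the standard library alone. The real matcher uses jellyfish and thefuzz; `difflib.SequenceMatcher` below is only a stand-in for both metrics:

```python
# Stdlib sketch of the two complementary similarity signals.
# SequenceMatcher.ratio() stands in for Jaro-Winkler (character-level) and,
# applied after token sorting, for thefuzz's token-sort ratio.
from difflib import SequenceMatcher

def _normalize(name: str) -> str:
    return " ".join(name.lower().split())

def char_similarity(a: str, b: str) -> float:
    # Character-level comparison: catches typos in short strings.
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()

def token_sort_similarity(a: str, b: str) -> float:
    # Sort tokens before comparing: makes word order irrelevant.
    sa = " ".join(sorted(_normalize(a).split()))
    sb = " ".join(sorted(_normalize(b).split()))
    return SequenceMatcher(None, sa, sb).ratio()

def score(a: str, b: str) -> float:
    # Take the maximum of both signals, as the documented matcher does.
    return max(char_similarity(a, b), token_sort_similarity(a, b))

print(score("John Elton", "Elton John"))  # token sort makes these identical -> 1.0
```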
Notes
This module implements the fuzzy string matching layer described in Teikari (2026), Section 4.2. It fires only for records that were not matched by exact identifiers in Stage 1.
See Also
music_attribution.resolution.identifier_match : Stage 1 (runs before this).
music_attribution.resolution.embedding_match : Stage 3 (semantic similarity).
StringSimilarityMatcher
¶
String similarity matcher for music entity names.
Combines Jaro-Winkler similarity (good for short strings and typos) with token-sort ratio (good for word reordering) for robust matching. Takes the maximum of both scores for each comparison.
| PARAMETER | DESCRIPTION |
|---|---|
| `threshold` | Minimum similarity score (0.0-1.0) to consider a match. Default is 0.85, which balances precision and recall for typical music entity names. TYPE: `float` |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_threshold` | Active similarity threshold. TYPE: `float` |
See Also
music_attribution.resolution.orchestrator.ResolutionOrchestrator : Uses this as Stage 2.
Source code in src/music_attribution/resolution/string_similarity.py
score
¶
Compute similarity score between two entity names.
Both names are normalized (accent stripping, abbreviation expansion, lowercase) before comparison. The score is the maximum of Jaro-Winkler similarity and token-sort ratio.
| PARAMETER | DESCRIPTION |
|---|---|
| `name_a` | First entity name (raw, unnormalized). TYPE: `str` |
| `name_b` | Second entity name (raw, unnormalized). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Similarity score in range [0.0, 1.0]. Returns 1.0 for exact matches after normalization. |
Source code in src/music_attribution/resolution/string_similarity.py
find_candidates
¶
find_candidates(
name: str,
corpus: list[str],
threshold: float | None = None,
) -> list[tuple[str, float]]
Find candidate matches from a corpus above the similarity threshold.
Compares name against every entry in corpus and returns
those exceeding the threshold, sorted by descending score.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | Query name to search for. TYPE: `str` |
| `corpus` | List of candidate names to compare against. TYPE: `list[str]` |
| `threshold` | Override the instance threshold for this query. If `None`, the instance threshold is used. TYPE: `float \| None` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[str, float]]` | Candidate matches as `(name, score)` tuples, sorted by descending score. |
Source code in src/music_attribution/resolution/string_similarity.py
Embedding Match¶
embedding_match
¶
Embedding-based semantic matching for entity resolution.
Stage 3 of the resolution cascade. Uses sentence-transformers to embed entity names and metadata into dense vectors and finds semantically similar entities via cosine similarity. Handles cases that string matching misses:
- Translations ("Die Fledermaus" ~ "The Bat")
- Very different spellings of the same name
- Contextual metadata similarity (genre, collaborators)
The default model (all-MiniLM-L6-v2) produces 384-dimensional embeddings
suitable for fast cosine similarity search. In production, embeddings are
stored in PostgreSQL via pgvector halfvec(768) for efficient approximate
nearest-neighbor queries.
Notes
This module implements the semantic similarity layer described in Teikari (2026), Section 4.3. It fires only for records that were not resolved by identifier matching (Stage 1) or string similarity (Stage 2).
See Also
music_attribution.resolution.string_similarity : Stage 2 (runs before this).
music_attribution.resolution.embedding_service : Persistence layer for pgvector.
music_attribution.resolution.splink_linkage : Stage 4 (probabilistic linkage).
EmbeddingMatcher
¶
Semantic entity matching using sentence-transformer embeddings.
Embeds entity names into dense vectors and finds similar entities via cosine similarity. Supports in-memory storage for development and pgvector for production deployments.
The model is lazy-loaded on first use to avoid heavy import-time dependencies when the embedding stage is not needed.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_name` | Sentence-transformer model to use. Default is `all-MiniLM-L6-v2`. TYPE: `str` |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_model_name` | Name of the sentence-transformer model. TYPE: `str` |
| `_model` | Lazy-loaded sentence-transformer model instance. |
| `_embeddings` | In-memory embedding store (entity_id -> vector). |
See Also
music_attribution.resolution.embedding_service : Production persistence via pgvector.
Source code in src/music_attribution/resolution/embedding_match.py
embed
async
¶
Embed a single text string into a dense vector.
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Text to embed (entity name, metadata string, etc.). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[float]` | Embedding vector. Dimensionality depends on the model (384 for `all-MiniLM-L6-v2`). |
Source code in src/music_attribution/resolution/embedding_match.py
embed_batch
async
¶
Embed multiple texts in a single batch for efficiency.
Batch encoding is significantly faster than calling embed()
in a loop because the model can parallelize across inputs.
| PARAMETER | DESCRIPTION |
|---|---|
| `texts` | List of texts to embed. TYPE: `list[str]` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[list[float]]` | List of embedding vectors, one per input text. |
Source code in src/music_attribution/resolution/embedding_match.py
store_embedding
async
¶
Store an embedding in the in-memory index for later similarity search.
In production, use EmbeddingService.store_embedding() for
pgvector-backed persistence instead.
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_id` | Unique identifier for the entity. TYPE: `str` |
| `embedding` | Embedding vector to store. TYPE: `list[float]` |
Source code in src/music_attribution/resolution/embedding_match.py
find_similar
async
¶
Find the most similar stored embeddings via brute-force cosine search.
Performs exhaustive comparison against all stored embeddings. For production-scale deployments, use pgvector's approximate nearest neighbor index instead.
| PARAMETER | DESCRIPTION |
|---|---|
| `query_embedding` | The query embedding vector. TYPE: `list[float]` |
| `top_k` | Number of top results to return. Default is 5. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[str, float]]` | Top-k results as `(entity_id, similarity)` tuples, sorted by descending similarity. |
Source code in src/music_attribution/resolution/embedding_match.py
cosine_similarity
staticmethod
¶
Compute cosine similarity between two vectors.
Defined as dot(a, b) / (||a|| * ||b||). Returns 0.0 if
either vector has zero magnitude.
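A standalone re-implementation of this definition (not the module's source, but the same formula and edge cases):

```python
# Cosine similarity: dot(a, b) / (||a|| * ||b||), with 0.0 for zero-magnitude
# vectors and a ValueError on mismatched dimensionality.
import math

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vectors carry no directional information
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # parallel    -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal  -> 0.0
```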
| PARAMETER | DESCRIPTION |
|---|---|
| `vec_a` | First vector. TYPE: `list[float]` |
| `vec_b` | Second vector (must have same dimensionality as `vec_a`). TYPE: `list[float]` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Cosine similarity in range [-1.0, 1.0]. For normalized sentence-transformer outputs, values are typically in [0, 1]. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If vectors have different lengths. |
Source code in src/music_attribution/resolution/embedding_match.py
LLM Disambiguation¶
llm_disambiguation
¶
LLM-assisted disambiguation for entity resolution.
Stage 6 (final) of the resolution cascade. For complex disambiguation cases (e.g., "John Williams the composer vs the guitarist"), uses PydanticAI with structured output to make a reasoned decision. The LLM is only called when other signals produce ambiguous results (confidence in the 0.4-0.7 range), providing strict cost control.
Key design decisions:

- Cost gating: LLM invocation is guarded by should_invoke(), which checks that the best existing signal falls in the ambiguity range.
- Deterministic caching: A SHA-256 cache key prevents duplicate LLM calls for the same candidate set and context.
- Structured output: The LLM returns a DisambiguationResult with chosen index, confidence, and reasoning (not free text).
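The deterministic cache key can be sketched as follows. The module's exact serialization is not shown here; canonical JSON over candidate names and context is one reasonable choice:

```python
# Sketch of a content-addressed cache key. The real module's serialization
# may differ; the point is that identical inputs always hash identically.
import hashlib
import json

def cache_key(candidate_names: list[str], context: str) -> str:
    # sort_keys + fixed separators give a canonical byte representation.
    payload = json.dumps(
        {"candidates": candidate_names, "context": context},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key(["John Williams (composer)", "John Williams (guitarist)"],
               "film score, 1977")
k2 = cache_key(["John Williams (composer)", "John Williams (guitarist)"],
               "film score, 1977")
assert k1 == k2  # identical inputs -> cache hit, no second LLM call
```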
Notes
This module implements the LLM disambiguation layer described in Teikari (2026), Section 4.6. The Oracle Problem (digital systems cannot fully verify physical reality) means LLM confidence is treated as one signal among many, not as ground truth.
See Also
music_attribution.resolution.graph_resolution : Stage 5 (runs before this).
music_attribution.resolution.orchestrator : Cascade coordinator.
DisambiguationResult
¶
Bases: BaseModel
Structured output from LLM disambiguation.
Represents the LLM's reasoned decision about which candidate entity (if any) matches the query, along with self-reported confidence and a natural-language explanation.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `chosen_index` | Index into the candidates list identifying the chosen entity. TYPE: `int \| None` |
| `confidence` | LLM's self-reported confidence in range [0.0, 1.0]. This is one signal among many and should not be taken at face value. TYPE: `float` |
| `reasoning` | Natural-language explanation of the LLM's decision. TYPE: `str` |
| `alternatives_considered` | Number of candidate entities the LLM evaluated. TYPE: `int` |
| `cached` | Whether this result was served from the in-memory cache. TYPE: `bool` |
LLMDisambiguator
¶
LLM-assisted entity disambiguation.
Only invoked when other resolution methods produce ambiguous results (confidence in the 0.4-0.7 range). Uses SHA-256 content-based caching to reduce LLM costs.
The _call_llm method is designed to be overridden in subclasses
or mocked in tests. In production, it would use a PydanticAI Agent
with structured DisambiguationResult output.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_cache` | In-memory cache keyed by SHA-256 hash of candidate + context. |
Notes
The ambiguity range constants (_AMBIGUITY_LOW=0.4,
_AMBIGUITY_HIGH=0.7) define when the LLM is invoked. Scores
above 0.7 are confident enough to not need LLM; scores below 0.4
are too uncertain for LLM to add value.
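The gating logic reduces to a few lines. This sketch uses a plain dict of method scores in place of ResolutionDetails:

```python
# Sketch of the ambiguity-gated invocation check. Constants mirror the
# documented _AMBIGUITY_LOW / _AMBIGUITY_HIGH values.
AMBIGUITY_LOW, AMBIGUITY_HIGH = 0.4, 0.7

def should_invoke_llm(existing_scores: dict[str, float]) -> bool:
    """Invoke the LLM only when the best other signal is ambiguous,
    or when no other signal exists at all (last resort)."""
    if not existing_scores:
        return True
    best = max(existing_scores.values())
    return AMBIGUITY_LOW <= best <= AMBIGUITY_HIGH

print(should_invoke_llm({"string": 0.55, "embedding": 0.6}))  # ambiguous -> True
print(should_invoke_llm({"string": 0.95}))                    # confident -> False
print(should_invoke_llm({"embedding": 0.2}))                  # too weak  -> False
```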
Source code in src/music_attribution/resolution/llm_disambiguation.py
disambiguate
async
¶
disambiguate(
candidates: list[NormalizedRecord], context: str
) -> DisambiguationResult
Disambiguate between candidate entities using LLM.
Checks the content-addressed cache first. On cache miss, calls
_call_llm() and caches the result. On LLM failure, returns
a safe fallback with chosen_index=None and confidence=0.0.
| PARAMETER | DESCRIPTION |
|---|---|
| `candidates` | List of candidate records that could not be resolved by earlier cascade stages. TYPE: `list[NormalizedRecord]` |
| `context` | Additional context for disambiguation (e.g., album name, genre, release year). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `DisambiguationResult` | The LLM's structured decision, or a safe fallback on error. |
Source code in src/music_attribution/resolution/llm_disambiguation.py
should_invoke
async
¶
should_invoke(existing_scores: ResolutionDetails) -> bool
Determine if LLM disambiguation is needed based on existing signals.
The LLM is only invoked when the best signal from other methods falls in the ambiguity range [0.4, 0.7]. If no other signals exist at all, the LLM is invoked as a last resort.
| PARAMETER | DESCRIPTION |
|---|---|
| `existing_scores` | Resolution scores from earlier cascade stages (string similarity, embedding similarity, graph path confidence). TYPE: `ResolutionDetails` |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | `True` if LLM disambiguation should be invoked. |
Source code in src/music_attribution/resolution/llm_disambiguation.py
Splink Linkage¶
splink_linkage
¶
Splink probabilistic record linkage for entity resolution.
Stage 4 of the resolution cascade. Implements Fellegi-Sunter probabilistic record linkage at scale using the Splink library. Estimates match/non-match probability distributions via expectation-maximization and produces calibrated linkage scores. Uses DuckDB backend for performance.
The Fellegi-Sunter model treats record comparison as a binary classification problem: for each pair of records, it estimates the probability that they refer to the same entity based on agreement/disagreement patterns across comparison fields. The model parameters (m-probabilities for matches, u-probabilities for non-matches) are estimated from the data using EM.
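The Fellegi-Sunter match weight for a record pair is the sum of per-field log-likelihood ratios over agreement patterns. A toy sketch with made-up m/u values (real values come from EM estimation, not hand-tuning):

```python
# Toy Fellegi-Sunter match weight. m = P(field agrees | records match),
# u = P(field agrees | records do not match). Values here are illustrative.
import math

M_U = {
    "name":       (0.95, 0.01),
    "birth_year": (0.90, 0.05),
}

def match_weight(agreements: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios across comparison fields.
    Positive totals favor 'match', negative totals favor 'non-match'."""
    weight = 0.0
    for field, agrees in agreements.items():
        m, u = M_U[field]
        if agrees:
            weight += math.log2(m / u)          # agreement evidence
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement evidence
    return weight

print(match_weight({"name": True, "birth_year": True}))   # strongly positive
print(match_weight({"name": False, "birth_year": False})) # strongly negative
```

The weight maps monotonically to a match probability, which is what the threshold in cluster() is applied to.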
When Splink is not available (e.g., in lightweight test environments), the matcher falls back to a simple exact-match heuristic on comparison columns.
Notes
This module implements the probabilistic record linkage layer described in Teikari (2026), Section 4.4. The Splink v4 API is used (`from splink import block_on`, not `splink.blocking_rules_library`).
References
.. [1] Fellegi, I. P., & Sunter, A. B. (1969). "A Theory for Record Linkage." Journal of the American Statistical Association, 64(328), 1183-1210.
See Also
music_attribution.resolution.embedding_match : Stage 3 (runs before this).
music_attribution.resolution.graph_resolution : Stage 5 (runs after this).
SplinkMatcher
¶
Probabilistic record linkage using the Splink library.
Uses the Fellegi-Sunter model with configurable comparison columns and blocking rules to efficiently link records at scale. The workflow is:
1. configure_model() -- define comparison columns.
2. estimate_parameters() -- learn m/u probabilities from data.
3. predict() -- compute match probabilities for all candidate pairs.
4. cluster() -- group records by match probability threshold.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_model_configured` | Whether configure_model() has been called. TYPE: `bool` |
| `_comparison_columns` | Column names used for record comparison. TYPE: `list[str]` |
| `_linker` | The Splink linker instance, or None when Splink is unavailable. |
Source code in src/music_attribution/resolution/splink_linkage.py
configure_model
¶
Configure the Splink model with comparison columns.
Must be called before estimate_parameters(). Each column
will be compared using exact-match comparisons with term
frequency adjustments.
| PARAMETER | DESCRIPTION |
|---|---|
| `comparison_columns` | Column names to compare. TYPE: `list[str]` |
Source code in src/music_attribution/resolution/splink_linkage.py
estimate_parameters
¶
Estimate Fellegi-Sunter m/u parameters from data.
Uses random sampling to estimate u-probabilities (probability of agreement among non-matches) and expectation-maximization to estimate m-probabilities (probability of agreement among matches) for each comparison column.
| PARAMETER | DESCRIPTION |
|---|---|
| `records` | DataFrame with a unique ID column and the configured comparison columns. TYPE: `DataFrame` |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If configure_model() has not been called first. |
Notes
If Splink is not installed, the linker falls back to None and subsequent calls to predict() use the simple exact-match fallback.
Source code in src/music_attribution/resolution/splink_linkage.py
predict
¶
Predict match probabilities for all candidate record pairs.
If the Splink linker is available, uses the trained model to predict. Otherwise, falls back to a simple exact-match heuristic on the comparison columns.
| PARAMETER | DESCRIPTION |
|---|---|
| `records` | DataFrame with comparison columns (used only in fallback mode). TYPE: `DataFrame` |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Candidate record pairs with a match probability column. |
Source code in src/music_attribution/resolution/splink_linkage.py
cluster
¶
Cluster records into entity groups based on match predictions.
Uses union-find with path compression to transitively merge records connected by match probabilities above the threshold.
| PARAMETER | DESCRIPTION |
|---|---|
| `predictions` | DataFrame of pairwise match predictions. TYPE: `DataFrame` |
| `threshold` | Minimum match probability to consider a pair as linked. Default is 0.85. TYPE: `float` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[list[int]]` | Clusters of record indices; each inner list is one entity group. |
Source code in src/music_attribution/resolution/splink_linkage.py
Graph Resolution¶
graph_resolution
¶
Graph-based entity resolution via relationship evidence.
Stage 5 of the resolution cascade. Uses relationship graph traversals to resolve entities based on shared connections. Two artist records sharing 3+ album relationships are likely the same artist, even if their names differ slightly.
The graph resolver computes confidence from two complementary signals:
- Jaccard coefficient of shared neighbor sets (structural similarity).
- Absolute shared count with diminishing returns (3+ shared neighbors is strong evidence regardless of total degree).
The in-memory adjacency graph is suitable for development and testing. In production, Apache AGE (PostgreSQL graph extension) provides the same traversal semantics with persistent storage and ACID guarantees.
Notes
This module implements the graph-based resolution layer described in Teikari (2026), Section 4.5. Graph evidence is particularly valuable for resolving entities with common names (e.g., "John Smith") where string similarity alone is insufficient.
See Also
music_attribution.resolution.splink_linkage : Stage 4 (runs before this).
music_attribution.resolution.llm_disambiguation : Stage 6 (runs after this).
music_attribution.resolution.graph_store : Persistent graph storage.
GraphResolver
¶
Resolve entities using relationship graph evidence.
Maintains an in-memory adjacency graph of entity relationships. Each entity is a node, and relationships (PERFORMED_ON, WROTE, PRODUCED, etc.) form bidirectional edges.
In production, this would query Apache AGE or a similar graph database. The in-memory implementation provides the same API for testing and development.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_graph` | Adjacency list mapping entity IDs to sets of connected neighbors. |
| `_test_ids` | Optional test-only ID mapping for deterministic tests. |
Source code in src/music_attribution/resolution/graph_resolution.py
add_relationship
¶
Add a bidirectional relationship to the graph.
Both directions are stored so that neighbor lookups work regardless of edge direction.
| PARAMETER | DESCRIPTION |
|---|---|
| `from_id` | Source entity ID. |
| `to_id` | Target entity ID. |
| `rel_type` | Relationship type (e.g., PERFORMED_ON). TYPE: `str` |
Source code in src/music_attribution/resolution/graph_resolution.py
find_candidate_matches
async
¶
Find candidate entity matches based on shared neighbor relationships.
Two entities that share many neighbors (e.g., both performed on the same albums) are likely the same entity or closely related. The confidence score combines the ratio of shared-to-total neighbors with an absolute shared-count bonus.
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_id` | Entity ID to find matches for. |
| `min_shared` | Minimum number of shared neighbors to qualify as a candidate. Default is 2. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[str, float]]` | Candidate matches as `(entity_id, confidence)` tuples. |
Source code in src/music_attribution/resolution/graph_resolution.py
score_graph_evidence
async
¶
Score the graph evidence that two entities are the same.
Combines two complementary signals:

- Jaccard coefficient: |shared| / |union| of neighbor sets.
- Shared count bonus: min(|shared| / 3, 1.0) (diminishing returns -- 3+ shared is strong evidence).
The final score is the average of both signals, capped at 1.0.
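A standalone re-derivation of the documented formula, with plain neighbor sets standing in for the graph lookups:

```python
# Sketch of the documented graph-evidence score: average of the Jaccard
# coefficient and a capped shared-count bonus.
def score_graph_evidence(neighbors_a: set[str], neighbors_b: set[str]) -> float:
    shared = neighbors_a & neighbors_b
    if not neighbors_a or not neighbors_b or not shared:
        return 0.0  # no relationships or no overlap -> no graph evidence
    jaccard = len(shared) / len(neighbors_a | neighbors_b)
    shared_bonus = min(len(shared) / 3, 1.0)  # diminishing returns past 3
    return min((jaccard + shared_bonus) / 2, 1.0)

# Two artist records sharing 3 of 4 albums score highly.
a = {"album1", "album2", "album3"}
b = {"album1", "album2", "album3", "album4"}
print(score_graph_evidence(a, b))  # -> 0.875
```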
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_a` | First entity ID. |
| `entity_b` | Second entity ID. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Confidence score in range [0.0, 1.0]. Returns 0.0 if either entity has no graph relationships or they share no neighbors. |
Source code in src/music_attribution/resolution/graph_resolution.py
Graph Store¶
graph_store
¶
Graph storage for ResolvedEntities.
Provides in-memory graph storage for development and testing, and defines the interface for Apache AGE integration in production. Enables relationship-based queries such as:
- "Find all entities that share an album with this artist."
- "What is the shortest path between two entities?"
- "Who are all performers on works by this composer?"
The graph is stored as an adjacency list of bidirectional edges with typed relationships and arbitrary attributes. BFS traversal supports depth-limited neighbor queries and shortest-path computation.
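The BFS traversal can be sketched over a plain adjacency list. String IDs stand in here for the store's UUID-keyed ResolvedEntity nodes, which the real methods resolve back to entities:

```python
# BFS shortest path over an adjacency list, mirroring the documented
# semantics: inclusive of both endpoints, [] when no path exists.
from collections import deque

def shortest_path(graph: dict[str, set[str]], start: str, goal: str) -> list[str]:
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])  # queue of partial paths
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], set()):
            if neighbor in visited:
                continue  # cycle-safe: each node visited at most once
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return []

# Bidirectional edges: artist -- album -- producer (illustrative IDs).
graph = {
    "artist": {"album"},
    "album": {"artist", "producer"},
    "producer": {"album"},
}
print(shortest_path(graph, "artist", "producer"))  # -> ['artist', 'album', 'producer']
```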
Notes
In production, Apache AGE (PostgreSQL graph extension) provides the same
traversal semantics with persistent storage, ACID guarantees, and Cypher
query support. The AsyncEdgeRepository provides the PostgreSQL-backed
edge storage layer.
See Also
music_attribution.resolution.graph_resolution : Graph-based entity resolution.
music_attribution.resolution.edge_repository : PostgreSQL edge persistence.
GraphStore
¶
Store and query ResolvedEntities as a graph.
Uses in-memory storage by default. Production implementations would use Apache AGE (PostgreSQL graph extension).
The store maintains two data structures:
- _entities: UUID-keyed map of ResolvedEntity objects (nodes).
- _edges: Adjacency list of bidirectional edges with relationship type and arbitrary string attributes.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `_entities` | Entity node storage. TYPE: `dict[UUID, ResolvedEntity]` |
| `_edges` | Adjacency list of bidirectional typed edges. |
Source code in src/music_attribution/resolution/graph_store.py
add_entity
async
¶
add_entity(entity: ResolvedEntity) -> None
Store a ResolvedEntity as a node in the graph.
If an entity with the same ID already exists, it is overwritten.
| PARAMETER | DESCRIPTION |
|---|---|
| `entity` | The resolved entity to store. TYPE: `ResolvedEntity` |
Source code in src/music_attribution/resolution/graph_store.py
get_entity
async
¶
get_entity(entity_id: UUID) -> ResolvedEntity | None
Retrieve a ResolvedEntity by its UUID.
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_id` | The entity ID to look up. TYPE: `UUID` |

| RETURNS | DESCRIPTION |
|---|---|
| `ResolvedEntity \| None` | The entity if found, None otherwise. |
Source code in src/music_attribution/resolution/graph_store.py
add_relationship
async
¶
Add a bidirectional relationship between two entities.
Both directions are stored to enable traversal from either endpoint.
The entities referenced by from_id and to_id should already
exist in the store (but this is not enforced).
| PARAMETER | DESCRIPTION |
|---|---|
| `from_id` | Source entity ID. TYPE: `UUID` |
| `to_id` | Target entity ID. TYPE: `UUID` |
| `rel_type` | Relationship type (e.g., PERFORMED_ON). TYPE: `str` |
| `attrs` | Additional relationship attributes. |
Source code in src/music_attribution/resolution/graph_store.py
find_related
async
¶
find_related(
entity_id: UUID, rel_type: str, depth: int = 1
) -> list[ResolvedEntity]
Find entities related by a specific relationship type.
Performs a breadth-first traversal following only edges of the specified type, up to the given depth. Each entity is visited at most once (cycle-safe).
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_id` | Starting entity ID. TYPE: `UUID` |
| `rel_type` | Relationship type to follow (e.g., PERFORMED_ON). TYPE: `str` |
| `depth` | Maximum traversal depth (number of hops). Default is 1. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[ResolvedEntity]` | Related entities found within the traversal depth. Does not include the starting entity. |
Source code in src/music_attribution/resolution/graph_store.py
shortest_path
async
¶
shortest_path(
from_id: UUID, to_id: UUID
) -> list[ResolvedEntity]
Find the shortest path between two entities using BFS.
Traverses all relationship types to find the shortest path (fewest hops) between two entities. Useful for understanding how two entities are connected in the knowledge graph.
| PARAMETER | DESCRIPTION |
|---|---|
| `from_id` | Starting entity ID. TYPE: `UUID` |
| `to_id` | Target entity ID. TYPE: `UUID` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[ResolvedEntity]` | Entities along the shortest path, inclusive of both endpoints. Returns a single-element list if `from_id` equals `to_id`. |