Add OpenAI, Mistral or open-source embeddings to your knowledge graph.

Dgraph v24 introduces the much-anticipated merge of vector and graph databases: vector support, an HNSW index, and similarity search in both DQL and GraphQL. It is a major step toward supporting GenAI RAG, classification, entity resolution, semantic search, and many other AI and graph use cases.

Vector databases are vector first: they let you find similar vectors, but what you actually need is the real data. That data is either stored as a payload associated with the vector or as a reference ID. If you use a payload, creating multiple vectors for the same data introduces duplicates and synchronization issues. If you use a reference ID, you need extra queries to fetch the data you need.

Dgraph is entity first: you can add many vector predicates to the same entity type. For example, a Product may have one vector embedding built from the text description and another created from the product image. When searching for similarity, you get similar entities, i.e. the data, not only the vector. You don’t need extra queries to get the information you need. The entities found are part of the graph, so you can also query any relationships in the same graph request.

Dgraph is a database and does not have the limitations of in-memory solutions: vectors are treated like any other predicate, stored and indexed in the core database.

This blog shows how to get started with vector embeddings in Dgraph using OpenAI, Mistral, or Hugging Face embedding models. It details how Product embeddings were added to an existing Dgraph instance, as shown in the video example.

Adding a vector predicate to an existing entity type

In our example, the following minimal GraphQL schema is deployed in Dgraph and the database is populated with existing Products:

type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
}

With v24 we can declare a new vector predicate and specify an index using the @search directive. Vector predicates support the hnsw index with the euclidean, cosine, or dotproduct metric. A vector is a predicate of type [Float!] with the directive @embedding.

Here is the updated GraphQL Schema:

type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
  characteristics_embedding: [Float!] @embedding @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
}

Notes

  • You can add more than one embedding to an entity type.
  • You don’t specify the vector size; the first mutation sets it. If you are using an embedding model producing vectors of size 384, for example, all the values of the predicate must have the same dimension. If you decide to change the embedding model, you can simply drop all the predicate values and recompute the embeddings of your entities with the new model, which may produce vectors of a different dimension.
  • When deploying the updated schema, your existing data is untouched; you have just added a new predicate and a vector index.

For our test with a local instance of Dgraph, we simply deploy the schema using:

curl -X POST http://localhost:8080/admin/schema --silent --data-binary  '@./schema.graphql'

GraphQL API

Dgraph uses the deployed GraphQL schema to expose a GraphQL API with queries, mutations, and subscriptions for the declared types.
For each entity type with at least one vector predicate, Dgraph v24 generates two new queries:

  • querySimilar<Entity>ByEmbedding
  • querySimilar<Entity>ById

querySimilar<Entity>ByEmbedding returns the topK closest entities to a given vector. The typical use case is semantic or natural-language search: the vector is computed in the client application from a sentence, i.e. a request expressed in natural language, using the same model used for the entities’ embeddings.

querySimilar<Entity>ById returns the topK closest entities to a given entity. The typical use case is recommendation systems using similarity search.
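
As an illustration of the first of these, here is a minimal Python sketch that embeds a natural-language request with sentence-transformers and posts a querySimilarProductByEmbedding request to the /graphql endpoint of a local instance, once the embeddings have been populated (next sections). The endpoint URL, the example request, and the vector argument name are assumptions taken from our local setup and from the generated schema; check them against your own deployment.

import json
import requests
from sentence_transformers import SentenceTransformer

DGRAPH_GRAPHQL = "http://localhost:8080/graphql"  # assumed local instance, adjust as needed

# Embed the request with the same model used for the stored Product embeddings.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vector = model.encode("an example natural-language request").tolist()

# The `vector` argument name is taken from the generated schema; verify it in your deployment.
graphql_query = """
query SimilarProducts($vector: [Float!]!) {
  querySimilarProductByEmbedding(by: characteristics_embedding, topK: 5, vector: $vector) {
    id
    title
    vector_distance
  }
}
"""

response = requests.post(
    DGRAPH_GRAPHQL,
    json={"query": graphql_query, "variables": {"vector": query_vector}},
    timeout=30,
)
print(json.dumps(response.json(), indent=2))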

Before experimenting with those new queries in the GraphQL API, we need to populate our graph with embeddings.

Adding embeddings

We are using a Python script from the examples folder of the pydgraph repository.

The script is provided as-is, as an example. Adapt the logic to your needs.

The logic of the shared Python script is as follows:

  • use paginated queries so we don’t hit a size limit.
  • use an embedding config file.
  • find all entities of a given type. We have two options: get all entities, or get only the entities for which the vector predicate is not present. The latter is useful to run the script only on newly added entities.
  • for each entity, use a DQL query to retrieve the predicates needed.
  • create a text prompt from the values of the retrieved predicates and a text template (with mustache notation, handled by pybars).
  • compute the vector embedding of the prompt using an OpenAI, Mistral, or Hugging Face model.
  • mutate the vector value in Dgraph (a condensed sketch of this loop is shown after the configuration below).

For our Product we defined the following embedding configuration:

{
    "embeddings" : [
        {
            "entityType":"Product",
            "attribute":"characteristics_embedding",
            "index":"hnsw(metric: "euclidean")",
            "provider": "huggingface",
            "model":"sentence-transformers/all-MiniLM-L6-v2",
            "config" : {
                    "dqlQuery" : "{ title:Product.title }",
                    "template": "{{title}}"
                }
        }
    ]
}
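
The following condensed Python sketch illustrates this loop for the configuration above. It is not the actual script from the pydgraph repository, only a simplified rendering of its logic: it assumes a local gRPC endpoint, skips pagination, and uses a simplified mutation format, and the helper name compute_embeddings is ours.

import json
import pydgraph
from pybars import Compiler
from sentence_transformers import SentenceTransformer

def compute_embeddings(client, entity_type, attribute, dql_fragment, template, model_name):
    model = SentenceTransformer(model_name)
    render = Compiler().compile(template)

    # Fetch entities of the given type that do not yet have the vector predicate.
    # The real script paginates over all entities; we only take the first 100 here.
    fields = "uid " + dql_fragment.strip("{} ")
    query = "{ items(func: type(%s), first: 100) @filter(NOT has(%s.%s)) { %s } }" % (
        entity_type, entity_type, attribute, fields)

    txn = client.txn(read_only=True)
    try:
        items = json.loads(txn.query(query).json).get("items", [])
    finally:
        txn.discard()

    for item in items:
        prompt = str(render(item))              # "{{title}}" -> the product title
        vector = model.encode(prompt).tolist()  # 384 floats for all-MiniLM-L6-v2
        # The vector is written as a quoted JSON array on the embedding predicate.
        nquad = '<%s> <%s.%s> "%s" .' % (item["uid"], entity_type, attribute, json.dumps(vector))
        txn = client.txn()
        try:
            txn.mutate(set_nquads=nquad, commit_now=True)
        finally:
            txn.discard()

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))
compute_embeddings(
    client, "Product", "characteristics_embedding",
    "{ title:Product.title }", "{{title}}",
    "sentence-transformers/all-MiniLM-L6-v2",
)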

Note that the script is using a DQL query on data generated from a GraphQL schema. You can learn more about this topic in the doc section GraphQL – DQL interoperability.

In a terminal window, declare the Dgraph GRPC endpoint.
For example, for a local instance:

export DGRAPH_GRPC=localhost:9080

If needed, for cloud instances, declare an admin client key.

export DGRAPH_ADMIN_KEY=<Dgraph cloud admin key>

and simply run the script:

python ./computeEmbeddings.py

We are using Python 3.11 with:

openai                    1.27.0
mistralai                 0.1.8
pybars3                   0.9.7
sentence-transformers     2.2.2

Similarity Queries

Once the vector predicates are populated with your embeddings, you can perform similarity searches using the auto-generated queries in the GraphQL API.

In our example we have identified one of the Products, with id 059446790X, and performed a similarity search:

query QuerySimilarProductById {
    querySimilarProductById(id: "059446790X", by: characteristics_embedding, topK: 10) {
        id
        title
        vector_distance
    }
}

Note that in the query you specify which vector predicate (here characteristics_embedding) to use for the similarity search. As previously mentioned, you may have more than one vector attached to the Product entity, so you can perform different similarity queries (similar description, similar image, etc.).

vector_distance is a generated predicate providing the distance between the given vector and each entity’s vector. It can be used to compute a similarity score or to apply thresholds.
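
As an example of that last point, a client could post-process the response body to turn the euclidean distance into a rough similarity score and keep only close matches. The 1 / (1 + distance) conversion, the filter_by_similarity helper, and the 0.5 default threshold below are illustrative choices, not something prescribed by Dgraph.

import json

# `body` is the JSON document returned by the QuerySimilarProductById request above.
def filter_by_similarity(body: str, min_similarity: float = 0.5):
    products = json.loads(body)["data"]["querySimilarProductById"]
    results = []
    for product in products:
        # One possible conversion from euclidean distance to a score in (0, 1].
        similarity = 1.0 / (1.0 + float(product["vector_distance"]))
        if similarity >= min_similarity:
            results.append({**product, "similarity": similarity})
    return sorted(results, key=lambda p: p["similarity"], reverse=True)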

Conclusion

Dgraph added vector support as a first-class citizen, with a fast HNSW index.

Using vector predicates to store embeddings computed by ML models from OpenAI, Mistral, Hugging Face, or others is a surprisingly powerful approach to many AI and NLP use cases.

In this blog, we showed how to quickly add embeddings to existing entities stored in Dgraph. Let us know what you are building by combining the power of Dgraph and ML models.

Photo by Tuur Tisseghem