Python client for Embedbase

Embedbase-py is a Python SDK for interacting with Embedbase. It provides a simple and convenient way to access Embedbase's features, such as searching datasets, adding data, and creating contexts.

⚠️ Status: Alpha release ⚠️

This is not an officially launched product and currently lacks documentation. Please use at your own risk. If you are using this library, please let us know by opening an issue or contacting us on Discord.

Getting Started

To install the Embedbase Python client library, run the following command:

pip install embedbase-client

Usage

Initializing the Client

To get started, import the EmbedbaseClient class and create a new instance:

from embedbase_client.client import EmbedbaseClient
 
embedbase_url = "https://api.embedbase.xyz"
embedbase_key = "<get your key here: https://app.embedbase.xyz>"
client = EmbedbaseClient(embedbase_url, embedbase_key)

In an async context, you can use the EmbedbaseAsyncClient class instead. This class provides the same methods as EmbedbaseClient, but they are all asynchronous.

from embedbase_client.client import EmbedbaseAsyncClient

Remember to use await when calling methods on EmbedbaseAsyncClient objects.
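
As a minimal sketch of what this looks like in practice, the hypothetical helper below awaits a search on whatever client object it is given. It assumes, based on the description above, that dataset(...) on the async client returns an object whose search method is awaitable. The commented wiring at the bottom further assumes that EmbedbaseAsyncClient takes the same URL and key arguments as EmbedbaseClient.

```python
import asyncio


async def search_dataset(client, dataset_name, query, limit=5):
    """Await a search on an async client (assumed to mirror the sync API)."""
    dataset = client.dataset(dataset_name)
    # The await is the only difference from the synchronous call.
    return await dataset.search(query, limit=limit)


# Example wiring (assumption: same constructor arguments as EmbedbaseClient;
# needs a real key and network access):
# from embedbase_client.client import EmbedbaseAsyncClient
# client = EmbedbaseAsyncClient("https://api.embedbase.xyz", "<your key>")
# results = asyncio.run(search_dataset(client, "your_dataset_name", "your_query"))
```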

Learn more about asynchronous Python here.

Searching Datasets

To search a dataset, call the search method on a Dataset object:

dataset = client.dataset("your_dataset_name")
search_results = dataset.search("your_query", limit=5)

Adding Data

To add data to a dataset, call the add method on a Dataset object:

document = "your_document_text"
metadata = {"key": "value"}
result = dataset.add(document, metadata)

Creating a Context

To create a context, call the create_context method on a Dataset object:

query = "your_query"
context = dataset.create_context(query, limit=5)
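
A context is typically placed in a prompt for a language model. The sketch below shows one way to do that, assuming create_context returns a list of text chunks (strings). The build_prompt helper and the prompt wording are hypothetical and not part of the library.

```python
def build_prompt(context_chunks, question):
    """Join retrieved chunks and a question into a single prompt string."""
    context_text = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {question}"
    )


# Stand-in for the value returned by dataset.create_context(query, limit=5).
context = ["Embedbase stores embeddings.", "Datasets group related documents."]
prompt = build_prompt(context, "What does Embedbase store?")
print(prompt)
```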

Splitting and Chunking Large Texts

AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks. We highly recommend using this function. To split and chunk large texts, use the split_text function from the split module:

from embedbase_client.split import split_text
 
text = "your_long_text"
# ⚠️ The right value for max_tokens depends on the embedder
# that Embedbase is configured with.
# With models such as OpenAI's embeddings model, you can
# use a max_tokens of 500. With other models, you may need
# a lower value.
# (Embedbase Cloud uses an OpenAI model at the moment.)
max_tokens = 500
# chunk_overlap is the number of tokens shared between consecutive chunks.
# Some overlap helps ensure that context is not lost when a chunk
# boundary falls in the middle of a sentence.
chunk_overlap = 200
 
chunks = split_text(text, max_tokens, chunk_overlap)
 
# Then wrap each chunk in the document format used by batch_add.
documents = [{"data": c.chunk} for c in chunks]
result = client.dataset("my-dataset").batch_add(documents)
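
To build intuition for how max_tokens and chunk_overlap interact, here is a simplified sliding-window sketch. It is not the implementation of split_text: the sliding_window_chunks helper is hypothetical, and it treats whitespace-separated words as tokens, whereas a real splitter would count tokens with the embedder's tokenizer.

```python
def sliding_window_chunks(tokens, max_tokens, chunk_overlap):
    """Split a token list into windows of max_tokens that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < max_tokens:
        raise ValueError("chunk_overlap must be >= 0 and smaller than max_tokens")
    # Each window starts this many tokens after the previous one.
    step = max_tokens - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in sliding_window_chunks(words, max_tokens=4, chunk_overlap=2):
    print(" ".join(chunk))
```

In this scheme each chunk starts max_tokens - chunk_overlap tokens after the previous one, so a larger overlap preserves more shared context at the cost of producing more chunks.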

Dealing with Large Datasets

If your dataset is large, we recommend running parallel requests like so:

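import asyncio

async def batch(my_list, fn, batch_size=100):
    # Start one task per slice of batch_size items so the requests run concurrently.
    tasks = [
        asyncio.create_task(fn(my_list[i:i + batch_size]))
        for i in range(0, len(my_list), batch_size)
    ]
    # gather returns the results in the same order as the tasks.
    return await asyncio.gather(*tasks)

dataset_id = "my-dataset"

async def batch_add_fn(chunk):
    # batch_add on the synchronous client blocks, so run it in a worker
    # thread (asyncio.to_thread requires Python 3.9+).
    return await asyncio.to_thread(client.dataset(dataset_id).batch_add, chunk)

async def main():
    results = await batch(documents, batch_add_fn)
    print(f"Results: {results}")

# In a notebook or another running event loop, use `await main()` instead.
asyncio.run(main())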
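
The example above starts every batch at once. For very large datasets it may be preferable to cap the number of requests in flight. The sketch below shows one way to do that with asyncio.Semaphore. The names run_bounded and fake_upload are hypothetical, and fake_upload only stands in for a real call such as batch_add so that the example runs without network access.

```python
import asyncio


async def run_bounded(batches, upload, max_concurrency=5):
    """Run upload(batch) for every batch with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(batch):
        # Wait for a free slot before starting the upload.
        async with semaphore:
            return await upload(batch)

    # Results come back in the same order as the batches.
    return await asyncio.gather(*(guarded(b) for b in batches))


async def fake_upload(batch):
    # Stand-in for a real request; it only reports the batch size.
    await asyncio.sleep(0.01)
    return len(batch)


documents = [{"data": f"doc {i}"} for i in range(250)]
batches = [documents[i:i + 100] for i in range(0, len(documents), 100)]
print(asyncio.run(run_bounded(batches, fake_upload, max_concurrency=2)))
```

A suitable max_concurrency depends on the rate limits of the Embedbase instance you are calling, so treat the value here as a placeholder.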

Contributing

We welcome contributions to Embedbase-py.

If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.
