Implement a local first Embedbase
⚠️ This is not recommended for production. For server deployment, you should never run a model in Embedbase directly (contact to learn more) ⚠️
A common concern of individuals and companies in the LLM era is that you need to send your data to a third-party.
Furthermore, this third-party is usually an American company which is subject to the jurisdiction of the United States. In the United States, the government has the right to access your data at any time.
In this example we will implement a local embedder for Embedbase which will respect your privacy concerns.
We will compute sentence embeddings using the awesome library Sentence Transformers (opens in a new tab).
Installation
Install the required dependencies in a virtual environment:
virtualenv env
source env/bin/activate
pip install embedbase sentence-transformers
Start Embedbase
Create a new file main.py
with the following code:
import typing
from embedbase import get_app
from embedbase.database.memory_db import MemoryDatabase
from embedbase.embedding.base import Embedder
import uvicorn
from sentence_transformers import SentenceTransformer
class LocalEmbedder(Embedder):
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
def __init__(
self, model: str = EMBEDDING_MODEL, **kwargs
):
super().__init__(**kwargs)
self.model = SentenceTransformer(model)
self._dimensions = self.model.get_sentence_embedding_dimension()
@property
def dimensions(self) -> int:
"""
Return the dimensions of the embeddings
:return: dimensions of the embeddings
"""
return self._dimensions
def is_too_big(self, text: str) -> bool:
"""
Check if text is too big to be embedded,
delegating the splitting UX to the caller
:param text: text to check
:return: True if text is too big, False otherwise
"""
return len(text) > self.model.get_max_seq_length()
async def embed(self, data: Union[List[str], str]) -> List[List[float]]:
"""
Embed a list of strings or a single string
:param data: list of strings or a single string
:return: list of embeddings
"""
embeddings = self.model.encode(data)
return embeddings.tolist() if isinstance(data, list) else [embeddings.tolist()]
app = (
get_app()
.use_embedder(LocalEmbedder())
.use_db(MemoryDatabase(dimensions=384))
.run()
)
if __name__ == "__main__":
uvicorn.run("main:app", reload=True)
Start the Embedbase application with the following command:
python3 main.py
Test the endpoint
TypeScript
import { createClient } from 'embedbase-js'
const embedbase = createClient("http://localhost:8000")
const SENTENCES = [
"The lion is the king of the savannah.",
"The chimpanzee is a great ape.",
"The elephant is the largest land animal.",
];
const DATASET_ID = "animals"
const add = async () => {
const data = await embedbase
.dataset(DATASET_ID)
.batchAdd(SENTENCES.map((data) => ({ data })))
console.log(data)
}
add()
You should get a similar response:
{
"results": [
{
"data": "The lion is the king of the savannah.",
"embedding": [...],
"hash": ...,
"metadata": null
},
{
"data": "The chimpanzee is a great ape.",
"embedding": [...],
"hash": ...,
"metadata": null
},
{
"data": "The elephant is the largest land animal.",
"embedding": [...],
"hash": ...,
"metadata": null
}
]
}
Let's try to search now
TypeScript
const search = async () => {
const data = await embedbase
.dataset(DATASET_ID)
.search("Animal that lives in the savannah", { limit: 1})
console.log(data)
}
search()
You should get a similar response:
"The lion is the king of the savannah."