In the animated science fiction show Voltron, several robot lions combine into one big robot to perform great tasks. Similarly, Voltron Data, which launched in 2022 with $110 million in funding, focuses on integrating multiple open-source technologies like Apache Arrow, Apache Parquet, and Ibis to enhance data access. Today, Voltron Data is advancing further by announcing the Theseus distributed query engine, aimed at accelerating data queries to meet the demands of AI workloads.
Theseus is built to speed up large-scale data pipelines and queries using GPUs and other hardware accelerators. According to Josh Patterson, co-founder and CEO of Voltron Data, Theseus is based on modular, composable, and accelerated libraries that enhance data systems. The product is the latest step in the company's effort to become a leading designer and builder of data systems.
Optimized for running distributed queries on datasets of 10 terabytes or more, Theseus caters to organizations with massive data processing needs, including Fortune 500 companies, government agencies, hedge funds, telcos, and media and entertainment firms. The main goal is to speed up ETL (extract, transform, load) processes, feature engineering, and other data preparation tasks so that AI and analytics systems are fed more efficiently. Faster AI systems depend on data being transformed in time to keep up with them.
Many users struggle to deliver data to their AI systems quickly enough. Traditional data queries are often constrained by CPU performance, but Theseus is built around accelerated computing hardware such as GPUs, making it “accelerator native.” This approach enables Theseus to run queries at scale faster than traditional CPU-based distributed engines like Apache Spark.
One major AI application for Theseus is hyper-parameter optimization, where rapid feature engineering and ETL can significantly improve model efficiency and accuracy.
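One way to see why data preparation speed matters here is a toy search in which the feature step itself has a tunable parameter, so it reruns on every trial. The sketch below uses only the Python standard library. The series, the rolling-window feature, and the error metric are all hypothetical illustrations, not anything taken from Theseus or Voltron Data.

```python
import statistics


def rolling_mean(values, window):
    """Feature step: trailing mean over the last `window` points."""
    return [
        statistics.fmean(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def trial_error(values, window):
    """Run one trial: rebuild the feature, then score it."""
    # The data-prep step is repeated for every candidate setting.
    features = rolling_mean(values, window)
    # Toy "model": predict the next value from the current feature.
    errors = [abs(features[i] - values[i + 1]) for i in range(len(values) - 1)]
    return statistics.fmean(errors)


# Hypothetical input series and search space.
series = [3.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0, 10.0]
candidates = [1, 2, 4]

# One feature-engineering pass per candidate window.
results = {w: trial_error(series, w) for w in candidates}
best_window = min(results, key=results.get)
print(best_window, results)
```

With three candidates the feature step runs three times. In a realistic search over many parameters and terabytes of data, that repeated preparation cost is likely to dominate, which is where a faster query engine would help.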
Theseus is designed to be interoperable, utilizing open standards like Apache Arrow, Apache Parquet, and Ibis. It’s not a proprietary system; data from any Apache Arrow-compatible data lake can be queried by Theseus. The system allows seamless data transfer to and from various machine learning tools and frameworks, including PyTorch and TensorFlow. Theseus operates through SQL queries and Ibis, making it easy to integrate into existing workflows without needing its own front-end interface.
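To make the SQL-driven workflow concrete, here is a minimal sketch of the kind of feature-engineering query such an engine would receive. It uses Python's built-in sqlite3 module purely as a stand-in engine. The `events` table, its columns, and the `build_features` helper are hypothetical, and nothing here represents Theseus's actual connection API, which the article does not describe.

```python
import sqlite3


def build_features(conn):
    """Aggregate raw events into one feature row per user with plain SQL."""
    query = """
        SELECT user_id,
               COUNT(*)    AS n_events,
               AVG(amount) AS avg_amount,
               MAX(amount) AS max_amount
        FROM events
        GROUP BY user_id
        ORDER BY user_id
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory data standing in for a large table in a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

features = build_features(conn)
print(features)  # [(1, 2, 20.0, 30.0), (2, 1, 5.0, 5.0)]
```

Because the logic lives in a standard SQL string, the same query could in principle be sent to a different engine without rewriting the surrounding pipeline, which is the integration benefit the article describes.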
Voltron Data is partnering with Hewlett Packard Enterprise (HPE) to integrate Theseus into the HPE GreenLake hybrid cloud platform, which provides the necessary infrastructure while enabling unified queries across other engines. Future plans for Theseus include expanding partnerships, adding functionality such as user-defined functions, and achieving tighter integration with data science pipelines. The aim for 2024 is to make integration with various data science tools faster and easier.