A few months ago, we said adios to the traditional text2sql approach and ran an experiment, one that would completely rebuild our Copilot engine.
Instead of relying on LLMs to generate accurate SQL queries in real-time, we predefined data buckets for each use case and relied on LLMs to pick the right bucket.
More on this approach here.
So, essentially,
Data Bucket + Context + User Prompt → LLM
The LLM in turn understands the context and uses the selected data bucket to answer.
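In code, the assembly looks roughly like the sketch below; the function name and prompt layout are illustrative, not our production implementation.

```python
# Illustrative sketch of the flow above; names and prompt layout are hypothetical.
def build_prompt(data_bucket: str, context: str, user_prompt: str) -> str:
    """Stitch the selected data bucket, the context, and the user's question
    into a single prompt for the LLM."""
    return (
        f"Context:\n{context}\n\n"
        f"Data:\n{data_bucket}\n\n"
        f"Question: {user_prompt}\n"
        "Answer using only the data provided above."
    )
```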
However, the bottleneck with this approach is the size of the data bucket.
For enterprises with data warehouses that make GBs look like KBs, it's simply not feasible to send any data bucket to the LLM directly - no matter how optimised it is.
While LLM context windows are steadily growing, there is still a lot of ground to cover.
So we did what we do best - EXPERIMENT!
Setting up the Experiment
We went back to the drawing board, broke down the process, and realized that all along we'd been sending data in JSON format.
So we decided to run a comparison between CSV and JSON formats when sending tabular data to the LLM to answer questions, using Claude 3.5 Sonnet (New).
The key focus of the comparison was evaluating the impact of the data format on accuracy, token usage, latency, and overall cost.
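To make the comparison concrete, here's a minimal sketch of how the same table can be serialized both ways; the columns and values are made up, and character counts are only a rough proxy for tokens, but the gap comes from the same place: JSON records repeat every column name on every row, while CSV states the header once.

```python
import json
import pandas as pd

# Minimal sketch: serialize the same (made-up) table as CSV and as JSON records
# and compare payload sizes as a rough stand-in for token counts.
df = pd.DataFrame({
    "ad_id": [101, 102, 103],
    "ad_name": ["spring_sale", "retarget_v2", "brand_awareness"],
    "roas": [3.2, 1.8, 2.5],
    "spends": [1200.0, 800.0, 450.0],
})

csv_payload = df.to_csv(index=False)
json_payload = json.dumps(df.to_dict(orient="records"))

print(len(csv_payload), len(json_payload))  # the JSON payload is considerably larger
```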
Generating the Testing Dataset
For our test, we used a table containing columns such as ad_id, ad_name, roas, spends, cpm, ctr, and more. We then created 10,000 test questions from templates such as these (a small generation sketch follows the list):
1. “What is the {metric_name} value for ad {ad_name}?”
2. “Which ad has the {max/min} {metric_name} value?”
3. “Top n ads by {metric_name}, with their respective values.”
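Below is a hypothetical sketch of such a question generator; the column names match the test table above, the ad names are made up, and the real generator covered more templates while sampling the full 10,000 questions.

```python
import random

# Hypothetical sketch of the question generator.
METRICS = ["roas", "spends", "cpm", "ctr"]
TEMPLATES = [
    "What is the {metric} value for ad {ad_name}?",
    "Which ad has the max {metric} value?",
    "Which ad has the min {metric} value?",
    "Top {n} ads by {metric}, with their respective values.",
]

def make_question(ad_names: list[str]) -> str:
    # Extra keyword arguments are ignored by str.format, so one call covers all templates.
    return random.choice(TEMPLATES).format(
        metric=random.choice(METRICS),
        ad_name=random.choice(ad_names),
        n=random.randint(2, 5),
    )

questions = [make_question(["spring_sale", "retarget_v2"]) for _ in range(10)]
```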
Measuring the Results
For an average dataset with around 5,000 cells, we found that:
- CSV format achieved an accuracy of 95.45%
- JSON format achieved 87.50% accuracy
In addition to the accuracy improvement, CSV consumed 56.20% fewer tokens than JSON, leading to significant reductions in both cost and latency. Specifically, we observed:
- A cost reduction of approximately 56.20%
- A latency reduction of 52%
- An overall accuracy improvement of 8.32%
Other Learnings
We split the questions into categories based on where the answer lies in the dataset: for example, the r1 category covers the topmost 20% of the dataset, while r5 covers the bottommost 20%. Interestingly, the top and bottom sections showed nearly 100% accuracy across different cell counts, while the middle sections were relatively less accurate.
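For reference, the bucketing itself is simple; here's a sketch, where row_index (the row holding the answer) and n_rows are assumed to be known from the question generator.

```python
# Sketch of the r1-r5 bucketing: row_index is the row holding the answer,
# n_rows is the total number of rows in the dataset (both assumed known).
def position_bucket(row_index: int, n_rows: int) -> str:
    quintile = min(int(5 * row_index / n_rows), 4)  # 0..4
    return f"r{quintile + 1}"                       # r1 = top 20%, r5 = bottom 20%

assert position_bucket(0, 5000) == "r1"
assert position_bucket(4999, 5000) == "r5"
```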
What’s Next?
Here are some upcoming experiments we're working on:
1. LLM Accuracy Improvement: We’re training the LLM using the dataset mentioned above to improve its performance.
2. Building a RAG System: As the number of cells increases, we noticed a dip in accuracy. To address this, we’re developing a robust Retrieval-Augmented Generation (RAG) system to reduce the number of cells sent to the LLM, which we expect will help maintain higher accuracy.
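A rough idea of what that retrieval step could look like is sketched below; the keyword-overlap scorer is purely illustrative, and a real system would likely use embeddings or a proper retriever.

```python
import pandas as pd

# Illustrative sketch of the retrieval step: score each row against the
# question and send only the top-k rows (as CSV) instead of the full table.
def top_k_rows(df: pd.DataFrame, question: str, k: int = 50) -> pd.DataFrame:
    q_terms = set(question.lower().split())
    scores = df.astype(str).apply(
        lambda row: len(q_terms & set(" ".join(row).lower().split())), axis=1
    )
    return df.loc[scores.nlargest(k).index]

# The reduced slice is then serialized to CSV and sent to the LLM as before, e.g.:
# prompt = build_prompt(top_k_rows(df, question).to_csv(index=False), context, question)
```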