DataStore delivers significant performance improvements over pandas for many operations. This guide explains why and how to optimize your workloads.
Why DataStore Is Faster
- SQL Pushdown
Operations are pushed down to the data source:
# pandas: Loads ALL data, then filters in memory
df = pd.read_csv("huge.csv") # Load 10GB
df = df[df['year'] == 2024] # Filter in Python
# DataStore: Filter at source
ds = pd.read_csv("huge.csv") # Just metadata
ds = ds[ds['year'] == 2024] # Filter in SQL
df = ds.to_df() # Only load filtered data
- Column Pruning
Only needed columns are read:
# DataStore: Only reads name, age columns
ds = pd.read_parquet("wide_table.parquet")
result = ds.select('name', 'age').to_df()
# vs pandas: Reads all 100 columns, then selects
- Lazy Evaluation
Multiple operations compile to one query:
# DataStore: One optimized SQL query
result = (ds
.filter(ds['amount'] > 100)
.groupby('region')
.agg({'amount': 'sum'})
.sort('sum', ascending=False)
.head(10)
.to_df()
)
# Becomes:
# SELECT region, SUM(amount) FROM data
# WHERE amount > 100
# GROUP BY region ORDER BY sum DESC LIMIT 10
Benchmark: DataStore vs pandas
Test Environment
- Data: 10 million rows
- Hardware: Standard laptop
- File format: CSV
Results
| Operation | pandas (ms) | DataStore (ms) | Winner |
|---|
| GroupBy count | 347 | 17 | DataStore (19.93x) |
| Combined ops | 1,535 | 234 | DataStore (6.56x) |
| Complex pipeline | 2,047 | 380 | DataStore (5.39x) |
| MultiFilter+Sort+Head | 1,963 | 366 | DataStore (5.36x) |
| Filter+Sort+Head | 1,537 | 350 | DataStore (4.40x) |
| Head/Limit | 166 | 45 | DataStore (3.69x) |
| Ultra-complex (10+ ops) | 1,070 | 338 | DataStore (3.17x) |
| GroupBy agg | 406 | 141 | DataStore (2.88x) |
| Select+Filter+Sort | 1,217 | 443 | DataStore (2.75x) |
| Filter+GroupBy+Sort | 466 | 184 | DataStore (2.53x) |
| Filter+Select+Sort | 1,285 | 533 | DataStore (2.41x) |
| Sort (single) | 1,742 | 1,197 | DataStore (1.45x) |
| Filter (single) | 276 | 526 | Comparable |
| Sort (multiple) | 947 | 1,477 | Comparable |
Key Insights
- GroupBy operations: DataStore up to 19.93x faster
- Complex pipelines: DataStore 5-6x faster (SQL pushdown benefit)
- Simple slice operations: Performance comparable - difference negligible
- Best use case: Multi-step operations with groupby/aggregation
- Zero-copy:
to_df() has no data conversion overhead
When DataStore Wins
Heavy Aggregations
# DataStore excels: 19.93x faster
result = ds.groupby('category')['amount'].sum()
Complex Pipelines
# DataStore excels: 5-6x faster
result = (ds
.filter(ds['date'] >= '2024-01-01')
.filter(ds['amount'] > 100)
.groupby('region')
.agg({'amount': ['sum', 'mean', 'count']})
.sort('sum', ascending=False)
.head(20)
)
Large File Processing
# DataStore: Only loads what you need
ds = pd.read_parquet("huge_file.parquet")
result = ds.filter(ds['id'] == 12345).to_df() # Fast!
Multiple Column Operations
# DataStore: Combines into single SQL
ds['total'] = ds['price'] * ds['quantity']
ds['is_large'] = ds['total'] > 1000
ds = ds.filter(ds['is_large'])
When pandas Is Comparable
In most scenarios, DataStore matches or exceeds pandas performance. However, pandas may be slightly faster in these specific cases:
Small Datasets (<1,000 rows)
# For very small datasets, overhead is minimal for both
# Performance difference is negligible
small_df = pd.DataFrame({'x': range(100)})
Simple Slice Operations
# Single slice operations without aggregation
df = df[df['x'] > 10] # pandas slightly faster
ds = ds[ds['x'] > 10] # DataStore comparable
Custom Python Lambda Functions
# pandas required for custom Python code
def complex_function(row):
return custom_logic(row)
df['result'] = df.apply(complex_function, axis=1)
ImportantEven in scenarios where DataStore is “slower”, performance is typically on par with pandas - the difference is negligible for practical use. DataStore’s advantages in complex operations far outweigh these edge cases.For fine-grained control over execution, see Execution Engine Configuration.
Zero-Copy DataFrame Integration
DataStore uses zero-copy for reading and writing pandas DataFrames. This means:
# to_df() does NOT copy data - it's a zero-copy operation
result = ds.filter(ds['x'] > 10).to_df() # No data conversion overhead
# Same for creating DataStore from DataFrame
ds = DataStore(existing_df) # No data copy
Key implications:
to_df() is essentially free - no serialization or memory copying
- Creating DataStore from pandas DataFrame is instant
- Memory is shared between DataStore and pandas views
Optimization Tips
For aggregation-heavy workloads where you don’t need exact pandas output format (row order, MultiIndex columns, dtype corrections), enable performance mode for maximum throughput:
from chdb.datastore.config import config
config.use_performance_mode()
# Now all operations use SQL-first execution with no pandas overhead:
# - Parallel Parquet reading (no preserve_order)
# - Single-SQL aggregation (filter+groupby in one query)
# - No row-order preservation overhead
# - No MultiIndex, no dtype corrections
result = (ds
.filter(ds['amount'] > 100)
.groupby('region')
.agg({'amount': ['sum', 'mean', 'count']})
)
Expected improvement: Up to 2-8x faster for filter+groupby workloads, reduced memory usage for large Parquet files.
See Performance Mode for full details.
- Use Parquet Instead of CSV
# CSV: Slower, reads entire file
ds = pd.read_csv("data.csv")
# Parquet: Faster, columnar, compressed
ds = pd.read_parquet("data.parquet")
# Convert once, benefit forever
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet")
Expected improvement: 3-10x faster reads
- Filter Early
# Good: Filter first, then aggregate
result = (ds
.filter(ds['date'] >= '2024-01-01') # Reduce data early
.groupby('category')['amount'].sum()
)
# Less optimal: Process all data
result = (ds
.groupby('category')['amount'].sum()
.filter(ds['sum'] > 1000) # Filter too late
)
- Select Only Needed Columns
# Good: Column pruning
result = ds.select('name', 'amount').filter(ds['amount'] > 100)
# Less optimal: All columns loaded
result = ds.filter(ds['amount'] > 100) # Loads all columns
- Leverage SQL Aggregations
# GroupBy is where DataStore shines
# Up to 20x speedup!
result = ds.groupby('category').agg({
'amount': ['sum', 'mean', 'count', 'max'],
'quantity': 'sum'
})
- Use head() Instead of Full Queries
# Don't load entire result if you only need a sample
result = ds.filter(ds['type'] == 'A').head(100) # LIMIT 100
# Avoid this for large results
# result = ds.filter(ds['type'] == 'A').to_df() # Loads everything
- Batch Operations
# Good: Single execution
result = ds.filter(ds['x'] > 10).filter(ds['y'] < 100).to_df()
# Bad: Multiple executions
result1 = ds.filter(ds['x'] > 10).to_df() # Execute
result2 = result1[result1['y'] < 100] # Execute again
- Use explain() to Optimize
# View the query plan before executing
query = ds.filter(...).groupby(...).agg(...)
query.explain() # Check if operations are pushed down
# Then execute
result = query.to_df()
Profiling Your Workload
Enable Profiling
from chdb.datastore.config import config, get_profiler
config.enable_profiling()
# Run your workload
result = your_pipeline()
# View report
profiler = get_profiler()
profiler.report()
Identify Bottlenecks
Performance Report
==================
Step Duration % Total
---- -------- -------
SQL execution 2.5s 62.5% <- Bottleneck!
read_csv 1.2s 30.0%
Other 0.3s 7.5%
Compare Approaches
# Test approach 1
profiler.reset()
result1 = approach1()
time1 = profiler.get_steps()[-1]['duration_ms']
# Test approach 2
profiler.reset()
result2 = approach2()
time2 = profiler.get_steps()[-1]['duration_ms']
print(f"Approach 1: {time1:.0f}ms")
print(f"Approach 2: {time2:.0f}ms")
Best Practices Summary
| Practice | Impact |
|---|
| Enable performance mode | 2-8x faster for aggregation workloads |
| Use Parquet files | 3-10x faster reads |
| Filter early | Reduce data processing |
| Select needed columns | Reduce I/O and memory |
| Use GroupBy/aggregations | Up to 20x faster |
| Batch operations | Avoid repeated execution |
| Profile before optimizing | Find real bottlenecks |
| Use explain() | Verify query optimization |
| Use head() for samples | Avoid full table scans |
Quick Decision Guide
| Your Workload | Recommendation |
|---|
| GroupBy/aggregation | Use DataStore |
| Complex multi-step pipeline | Use DataStore |
| Large files with filters | Use DataStore |
| Simple slice operations | Either (comparable performance) |
| Custom Python lambda functions | Use pandas or convert late |
| Very small data (<1,000 rows) | Either (negligible difference) |
For automatic optimal engine selection, use config.set_execution_engine('auto') (default).
For maximum throughput on aggregation workloads, use config.use_performance_mode().
See Execution Engine and Performance Mode for details. Last modified on June 8, 2026