DataStore implements 209 pandas DataFrame methods for full API compatibility. Your existing pandas code works with minimal changes.
Compatibility Approach
# Typical migration - just change the import
- import pandas as pd
+ from chdb import datastore as pd
# Your code works unchanged
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()
Key principles:
- All 209 pandas DataFrame methods implemented
- Lazy evaluation for SQL optimization
- Automatic type wrapping (DataFrame → DataStore, Series → ColumnExpr)
- Immutable operations (no
inplace=True)
Attributes and Properties
| Property | Description | Triggers Execution |
|---|
shape | (rows, columns) tuple | Yes |
columns | Column names (Index) | Yes |
dtypes | Column data types | Yes |
values | NumPy array | Yes |
index | Row index | Yes |
size | Number of elements | Yes |
ndim | Number of dimensions | No |
empty | Is DataFrame empty | Yes |
T | Transpose | Yes |
axes | List of axes | Yes |
Examples:
from chdb import datastore as pd
ds = pd.read_csv("data.csv")
print(ds.shape) # (1000, 5)
print(ds.columns) # Index(['name', 'age', 'city', 'salary', 'dept'])
print(ds.dtypes) # name: object, age: int64, ...
print(ds.empty) # False
Indexing and Selection
| Method | Description | Example |
|---|
df['col'] | Select column | ds['age'] |
df[['col1', 'col2']] | Select columns | ds[['name', 'age']] |
df[condition] | Boolean indexing | ds[ds['age'] > 25] |
df.loc[...] | Label-based access | ds.loc[0:10, 'name'] |
df.iloc[...] | Integer-based access | ds.iloc[0:10, 0:3] |
df.at[...] | Single value by label | ds.at[0, 'name'] |
df.iat[...] | Single value by position | ds.iat[0, 0] |
df.head(n) | First n rows | ds.head(10) |
df.tail(n) | Last n rows | ds.tail(10) |
df.sample(n) | Random sample | ds.sample(100) |
df.select_dtypes() | Select by Dtype | ds.select_dtypes(include='number') |
df.query() | Query expression | ds.query('age > 25') |
df.where() | Conditional replace | ds.where(ds['age'] > 0, 0) |
df.mask() | Inverse where | ds.mask(ds['age'] < 0, 0) |
df.isin() | Value membership | ds['city'].isin(['NYC', 'LA']) |
df.get() | Safe column access | ds.get('col', default=None) |
df.xs() | Cross-section | ds.xs('key') |
df.pop() | Remove column | ds.pop('col') |
Statistical Methods
| Method | Description | SQL Equivalent |
|---|
mean() | Mean value | AVG() |
median() | Median value | MEDIAN() |
mode() | Mode value | - |
std() | Standard deviation | STDDEV() |
var() | Variance | VAR() |
min() | Minimum | MIN() |
max() | Maximum | MAX() |
sum() | Sum | SUM() |
prod() | Product | - |
count() | Non-null count | COUNT() |
nunique() | Unique count | UNIQ() |
value_counts() | Value frequencies | GROUP BY |
quantile() | Quantile | QUANTILE() |
describe() | Summary statistics | - |
corr() | Correlation matrix | CORR() |
cov() | Covariance matrix | COV() |
corrwith() | Pairwise correlation | - |
rank() | Rank values | RANK() |
abs() | Absolute values | ABS() |
round() | Round values | ROUND() |
clip() | Clip values | - |
cumsum() | Cumulative sum | Window function |
cumprod() | Cumulative product | Window function |
cummin() | Cumulative min | Window function |
cummax() | Cumulative max | Window function |
diff() | Difference | Window function |
pct_change() | Percent change | Window function |
skew() | Skewness | SKEW() |
kurt() | Kurtosis | KURT() |
sem() | Standard error | - |
all() | All true | - |
any() | Any true | - |
idxmin() | Index of min | - |
idxmax() | Index of max | - |
Examples:
ds = pd.read_csv("data.csv")
# Basic statistics
print(ds['salary'].mean())
print(ds['age'].std())
print(ds.describe())
# Group statistics
print(ds.groupby('department')['salary'].mean())
print(ds.groupby('city').agg({'salary': ['mean', 'std'], 'age': 'count'}))
Data Manipulation
| Method | Description |
|---|
drop() | Drop rows/columns |
drop_duplicates() | Remove duplicates |
duplicated() | Mark duplicates |
dropna() | Remove missing values |
fillna() | Fill missing values |
ffill() | Forward fill |
bfill() | Backward fill |
interpolate() | Interpolate values |
replace() | Replace values |
rename() | Rename columns/index |
rename_axis() | Rename axis |
assign() | Add new columns |
astype() | Convert types |
convert_dtypes() | Infer types |
copy() | Copy DataFrame |
Examples:
ds = pd.read_csv("data.csv")
# Drop operations
result = ds.drop(columns=['unused_col'])
result = ds.drop_duplicates(subset=['user_id'])
result = ds.dropna(subset=['email'])
# Fill operations
result = ds.fillna(0)
result = ds.fillna({'age': 0, 'name': 'Unknown'})
# Transform operations
result = ds.rename(columns={'old_name': 'new_name'})
result = ds.assign(
full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
age_group=lambda x: pd.cut(x['age'], bins=[0, 25, 50, 100])
)
Sorting and Ranking
| Method | Description |
|---|
sort_values() | Sort by values |
sort_index() | Sort by index |
nlargest() | N largest values |
nsmallest() | N smallest values |
Examples:
# Sort by single column
result = ds.sort_values('salary', ascending=False)
# Sort by multiple columns
result = ds.sort_values(['department', 'salary'], ascending=[True, False])
# Get top/bottom N
result = ds.nlargest(10, 'salary')
result = ds.nsmallest(5, 'age')
Reshaping
| Method | Description |
|---|
pivot() | Pivot table |
pivot_table() | Pivot with aggregation |
melt() | Unpivot |
stack() | Stack columns to index |
unstack() | Unstack index to columns |
transpose() / T | Transpose |
explode() | Explode lists to rows |
squeeze() | Reduce dimensions |
droplevel() | Drop index level |
swaplevel() | Swap index levels |
reorder_levels() | Reorder levels |
Examples:
# Pivot table
result = ds.pivot_table(
values='amount',
index='region',
columns='product',
aggfunc='sum'
)
# Melt (unpivot)
result = ds.melt(
id_vars=['name'],
value_vars=['score1', 'score2', 'score3'],
var_name='test',
value_name='score'
)
# Explode arrays
result = ds.explode('tags')
Combining / Joining
| Method | Description |
|---|
merge() | SQL-style merge |
join() | Join on index |
concat() | Concatenate |
append() | Append rows |
combine() | Combine with function |
combine_first() | Combine with priority |
update() | Update values |
compare() | Show differences |
Examples:
# Merge (join)
result = pd.merge(df1, df2, on='id', how='left')
result = df1.join(df2, on='id')
# Concatenate
result = pd.concat([df1, df2, df3])
result = pd.concat([df1, df2], axis=1)
Binary Operations
| Method | Description |
|---|
add() / radd() | Addition |
sub() / rsub() | Subtraction |
mul() / rmul() | Multiplication |
div() / rdiv() | Division |
truediv() / rtruediv() | True division |
floordiv() / rfloordiv() | Floor division |
mod() / rmod() | Modulo |
pow() / rpow() | Power |
dot() | Matrix multiplication |
Examples:
# Arithmetic operations
result = ds['col1'].add(ds['col2'])
result = ds['price'].mul(ds['quantity'])
# With fill_value for missing data
result = ds['col1'].add(ds['col2'], fill_value=0)
Comparison Operations
| Method | Description |
|---|
eq() | Equal |
ne() | Not equal |
lt() | Less than |
le() | Less than or equal |
gt() | Greater than |
ge() | Greater than or equal |
equals() | Test equality |
compare() | Show differences |
Function Application
| Method | Description |
|---|
apply() | Apply function |
applymap() | Apply element-wise |
map() | Map values |
agg() / aggregate() | Aggregate |
transform() | Transform |
pipe() | Pipe functions |
groupby() | Group by |
Examples:
# Apply function
result = ds['name'].apply(lambda x: x.upper())
result = ds.apply(lambda row: row['a'] + row['b'], axis=1)
# Aggregate
result = ds.agg({'col1': 'sum', 'col2': 'mean'})
result = ds.agg(['sum', 'mean', 'std'])
# Pipe
result = (ds
.pipe(filter_active)
.pipe(calculate_metrics)
.pipe(format_output)
)
Time Series
| Method | Description |
|---|
rolling() | Rolling window |
expanding() | Expanding window |
ewm() | Exponentially weighted |
resample() | Resample time series |
shift() | Shift values |
asfreq() | Convert frequency |
asof() | Latest value as of |
at_time() | Select at time |
between_time() | Select time range |
first() / last() | First/last periods |
to_period() | Convert to period |
to_timestamp() | Convert to timestamp |
tz_convert() | Convert timezone |
tz_localize() | Localize timezone |
Examples:
# Rolling window
result = ds['value'].rolling(window=7).mean()
# Expanding window
result = ds['value'].expanding().sum()
# Shift
result = ds['value'].shift(1) # Lag
result = ds['value'].shift(-1) # Lead
Missing Data
| Method | Description |
|---|
isna() / isnull() | Detect missing |
notna() / notnull() | Detect non-missing |
dropna() | Drop missing |
fillna() | Fill missing |
ffill() | Forward fill |
bfill() | Backward fill |
interpolate() | Interpolate |
replace() | Replace values |
I/O Methods
| Method | Description |
|---|
to_csv() | Export to CSV |
to_json() | Export to JSON |
to_excel() | Export to Excel |
to_parquet() | Export to Parquet |
to_feather() | Export to Feather |
to_sql() | Export to SQL database |
to_pickle() | Pickle |
to_html() | HTML table |
to_latex() | LaTeX table |
to_markdown() | Markdown table |
to_string() | String representation |
to_dict() | Dictionary |
to_records() | Records |
to_numpy() | NumPy array |
to_clipboard() | Clipboard |
See I/O Operations for detailed documentation.
Iteration
| Method | Description |
|---|
items() | Iterate (column, Series) |
iterrows() | Iterate (index, Series) |
itertuples() | Iterate as named tuples |
Key Differences from Pandas
- Return Types
# Pandas returns Series
pdf['col'] # → pd.Series
# DataStore returns ColumnExpr (lazy)
ds['col'] # → ColumnExpr
- Lazy Execution
# DataStore operations are lazy
result = ds.filter(ds['age'] > 25) # Not executed yet
df = result.to_df() # Executed here
- No inplace Parameter
# Pandas
df.drop(columns=['col'], inplace=True)
# DataStore (always returns new object)
ds = ds.drop(columns=['col'])
- Comparing Results
# Use to_pandas() for comparison
pd.testing.assert_frame_equal(
ds.to_pandas(),
expected_df
)
See Key Differences for complete details. Last modified on June 8, 2026