DataSplit Node
The DataSplit node splits your dataset into training and test sets. This is a critical step for evaluating model performance on unseen data.
Overview
| Property | Value |
|---|---|
| Type | Processing node |
| Inputs | DataFrame (from DataLoader) |
| Outputs | Split indices (train/test) |
| Library | scikit-learn |
Configuration
Test Split Ratio
The fraction of the data to hold out for testing, expressed as a value between 0.1 and 0.5.
| Value | Training | Testing | Use Case |
|---|---|---|---|
| 0.1 | 90% | 10% | Large datasets (100k+ rows) |
| 0.2 | 80% | 20% | Standard (default) |
| 0.3 | 70% | 30% | Small datasets, extra validation |
| 0.4 | 60% | 40% | Very small datasets |
| 0.5 | 50% | 50% | Maximum validation |
Adjust with the slider (10%-50% range, 5% increments).
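As a rough illustration of how the ratio maps to row counts, the sketch below calls scikit-learn's train_test_split on a made-up array of 1,000 row positions. The array and variable names are assumptions for the example and are not part of the node.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 row positions standing in for a DataFrame.
rows = np.arange(1000)

# Hold out 20% for testing (the default ratio in the table above).
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))  # 800 200
```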
Random State
Seed for reproducibility. Set a fixed value (e.g., 42) to get the same split every time.
- Default: 42
- Range: Any integer
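The effect of the seed can be checked directly with scikit-learn. This is a minimal sketch on hypothetical row positions: two calls with the same random_state return identical test sets, while a different seed will generally select different rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical row positions

# Two calls with the same seed produce identical splits.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)

# A different seed shuffles the rows differently before splitting.
_, test_c = train_test_split(rows, test_size=0.2, random_state=7)
different_seed_matches = np.array_equal(test_a, test_c)

print(same_seed_matches, different_seed_matches)
```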
Stratification
Enable stratified splitting to maintain class proportions in both train and test sets.
When to use stratification:
- Imbalanced classification problems
- When you have rare classes
- When class distribution matters
Configuration:
- Stratify — Toggle on/off
- Stratify Column — The target column to stratify by (only visible when enabled)
Example: If your dataset has 90% class A and 10% class B, stratification ensures both train and test sets maintain this 90/10 ratio.
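The 90/10 example can be reproduced with a small sketch. The labels and placeholder features below are invented for illustration; only the train_test_split call mirrors what the node is documented to use.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: 90 samples of class A and 10 of class B.
y = ["A"] * 90 + ["B"] * 10
X = list(range(len(y)))  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(Counter(y_train))  # class counts in the training set
print(Counter(y_test))   # class counts in the test set
```

With these inputs both sides keep the 90/10 ratio: 72 A and 8 B in training, 18 A and 2 B in testing.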
How It Works
- Reads data from the connected DataLoader
- Shuffles the data using the random state
- Splits into train/test indices
- Saves indices to split_indices.json in the work directory
- Downstream Trainer uses these indices for training
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # if stratification enabled
)
```

Connections
| Direction | Node Types |
|---|---|
| Input from | DataLoader |
| Output to | Trainer |
Required pipeline structure:
```
DataLoader → DataSplit → Trainer → Evaluator
```

Validation
The node validates:
- Must be connected to a DataLoader
- If stratification enabled, must specify a column name
- Split ratio must be between 0.1 and 0.5
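The three checks above could be expressed roughly as follows. This is only a sketch: the function name, parameters, and error messages are hypothetical and do not come from the node's actual implementation.

```python
def validate_split_config(has_dataloader, test_size, stratify=False, stratify_column=None):
    """Return a list of validation errors; an empty list means the config is valid.

    Hypothetical helper: names and messages are illustrative only.
    """
    errors = []
    if not has_dataloader:
        errors.append("DataSplit must be connected to a DataLoader")
    if stratify and not stratify_column:
        errors.append("Stratification is enabled but no column name is set")
    if not 0.1 <= test_size <= 0.5:
        errors.append("Split ratio must be between 0.1 and 0.5")
    return errors


print(validate_split_config(True, 0.2))                   # no errors
print(validate_split_config(False, 0.6, stratify=True))   # all three checks fail
```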
Output
The DataSplit node saves split indices to:

```
~/mlops-work/split_indices.json
```

This file contains:

```json
{
  "train_indices": [0, 1, 3, 5, ...],
  "test_indices": [2, 4, 6, ...],
  "random_state": 42,
  "test_size": 0.2,
  "stratified": true,
  "stratify_column": "target"
}
```

The Trainer node reads this file to use the exact same split.
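To make the file's role concrete, here is a minimal sketch of writing and reloading a file with the same keys. It writes to a temporary directory rather than the real work directory. Splitting row positions, sorting the indices, and storing null for an unused stratify column are assumptions made for the example, not confirmed behavior of the node.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the work directory.
work_dir = Path(tempfile.mkdtemp())
split_path = work_dir / "split_indices.json"

# Split row positions rather than the data itself.
train_idx, test_idx = train_test_split(np.arange(50), test_size=0.2, random_state=42)

payload = {
    "train_indices": sorted(train_idx.tolist()),
    "test_indices": sorted(test_idx.tolist()),
    "random_state": 42,
    "test_size": 0.2,
    "stratified": False,
    "stratify_column": None,  # assumption: null when stratification is off
}
split_path.write_text(json.dumps(payload, indent=2))

# A downstream consumer can reload the file and recover the same split.
loaded = json.loads(split_path.read_text())
print(len(loaded["train_indices"]), len(loaded["test_indices"]))
```

Storing indices instead of copies of the data keeps the file small and lets any downstream step select the same rows from the original DataFrame.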
Best Practices
Always Use DataSplit
While the Trainer node has a built-in test split, using DataSplit is preferred:
- Separates concerns (splitting vs training)
- Makes split configuration visible in the pipeline
- Enables stratification
- Keeps the split reproducible across reruns
Choose Appropriate Split Ratio
| Dataset Size | Recommended Split |
|---|---|
| < 1,000 rows | 0.2-0.3 |
| 1,000-10,000 | 0.2 |
| 10,000-100,000 | 0.15-0.2 |
| > 100,000 | 0.1-0.15 |
Smaller datasets need a larger test fraction so that the test set still contains enough rows for a reliable evaluation.
Use Stratification for Classification
Always enable stratification for classification tasks, especially with:
- Imbalanced classes
- Multi-class problems
- Small datasets
Common Issues
“Stratify column not found”
The column name must match exactly (case-sensitive). Check the DataLoader preview for correct column names.
“Not enough samples in each class”
If a class has very few samples, stratification may fail. Options:
- Disable stratification
- Combine rare classes
- Collect more data
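One way to spot this problem before running the pipeline is to count samples per class. The sketch below uses invented labels. It shows that scikit-learn's train_test_split raises a ValueError when a stratified class has only one member, which is the likely source of this error, although the node's exact message may differ.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: class "C" has a single sample.
y = ["A"] * 6 + ["B"] * 5 + ["C"]
X = list(range(len(y)))  # placeholder features

# A class with fewer than two samples cannot appear on both sides of a split.
rare = [label for label, count in Counter(y).items() if count < 2]
print(rare)

try:
    train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
    split_failed = False
except ValueError as exc:
    split_failed = True
    print(exc)
```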
Different results each run
Set a fixed random_state (e.g., 42) for reproducible splits.
Related Nodes
- DataLoader: Load data before splitting
- Trainer: Train on the split data