
DataSplit Node

The DataSplit node splits your dataset into training and test sets. This is a critical step for evaluating model performance on unseen data.

Property   Value
Type       Processing node
Inputs     DataFrame (from DataLoader)
Outputs    Split indices (train/test)
Library    scikit-learn

The test size is the fraction of the data to hold out for testing, given as a value between 0.1 (10%) and 0.5 (50%).

Value   Training   Testing   Use Case
0.1     90%        10%       Large datasets (100k+ rows)
0.2     80%        20%       Standard (default)
0.3     70%        30%       Small datasets, extra validation
0.4     60%        40%       Very small datasets
0.5     50%        50%       Maximum validation

Adjust with the slider (10%-50% range, 5% increments).
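
To see how a test size translates into row counts, the sketch below applies scikit-learn's train_test_split to a hypothetical 1,000-row dataset. The row count and variable names are illustrative assumptions and are not part of the node.

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 row positions stand in for real rows.
rows = list(range(1000))

# test_size=0.2 holds out 20% of the rows for testing.
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))  # 800 200
```

Every row lands in exactly one of the two sets, so the two counts always add up to the dataset size.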

The random state is the seed for the shuffle that precedes the split. Set a fixed value (e.g., 42) to get the same split every time.

  • Default: 42
  • Range: Any integer
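
The effect of a fixed seed can be checked directly. This is a minimal sketch using scikit-learn's train_test_split on hypothetical row positions; the seed value 7 is an arbitrary choice for comparison.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # hypothetical row positions

# Same seed twice: the shuffle, and therefore the split, is identical.
_, test_a = train_test_split(rows, test_size=0.2, random_state=42)
_, test_b = train_test_split(rows, test_size=0.2, random_state=42)

# A different seed shuffles the rows differently.
_, test_c = train_test_split(rows, test_size=0.2, random_state=7)

print(test_a == test_b)  # True
print(test_a == test_c)  # expected to differ in practice
```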

Enable stratified splitting to maintain class proportions in both train and test sets.

When to use stratification:

  • Imbalanced classification problems
  • When you have rare classes
  • When class distribution matters

Configuration:

  • Stratify — Toggle on/off
  • Stratify Column — The target column to stratify by (only visible when enabled)

Example: If your dataset has 90% class A and 10% class B, stratification ensures both train and test sets maintain this 90/10 ratio.
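
The 90/10 example can be reproduced with a short sketch. It assumes a hypothetical 100-row dataset and uses the stratify argument of scikit-learn's train_test_split; the label values and variable names are illustrative.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: 90% class A, 10% class B.
labels = ["A"] * 90 + ["B"] * 10
rows = list(range(len(labels)))

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(train_labels))  # Counter({'A': 72, 'B': 8})
print(Counter(test_labels))   # Counter({'A': 18, 'B': 2})
```

Both sets keep the 90/10 ratio. Without stratify, the number of class B rows in the test set would vary with the seed.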

  1. Reads data from the connected DataLoader
  2. Shuffles the data using the random state
  3. Splits into train/test indices
  4. Saves indices to split_indices.json in the work directory
  5. The downstream Trainer uses these indices for training

In scikit-learn terms, the split corresponds to:

from sklearn.model_selection import train_test_split

# X: feature columns, y: target column
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # pass None when stratification is disabled
)

Direction     Node Types
Input from    DataLoader
Output to     Trainer

Required pipeline structure:

DataLoader → DataSplit → Trainer → Evaluator

The node validates:

  • Must be connected to a DataLoader
  • If stratification enabled, must specify a column name
  • Split ratio must be between 0.1 and 0.5
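
The three checks can be expressed as a small function. This is a sketch only: validate_split_config and its parameter names are hypothetical and do not reflect the node's actual implementation or error messages.

```python
from typing import List, Optional


def validate_split_config(
    has_dataloader_input: bool,
    test_size: float,
    stratify: bool,
    stratify_column: Optional[str],
) -> List[str]:
    """Return validation errors; an empty list means the config is valid."""
    errors = []
    if not has_dataloader_input:
        errors.append("DataSplit must be connected to a DataLoader.")
    # A blank or missing column name counts as unspecified.
    if stratify and not (stratify_column or "").strip():
        errors.append("Stratification is enabled but no column name is set.")
    if not 0.1 <= test_size <= 0.5:
        errors.append("Split ratio must be between 0.1 and 0.5.")
    return errors


print(validate_split_config(True, 0.2, True, "target"))  # []
print(validate_split_config(True, 0.6, True, None))      # two errors
```

Collecting all errors before returning, rather than stopping at the first one, lets a user fix every problem in a single pass.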

The DataSplit node saves split indices to:

~/mlops-work/split_indices.json

This file contains:

{
  "train_indices": [0, 1, 3, 5, ...],
  "test_indices": [2, 4, 6, ...],
  "random_state": 42,
  "test_size": 0.2,
  "stratified": true,
  "stratify_column": "target"
}

The Trainer node reads this file to use the exact same split.
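
The round trip through the index file might look like the sketch below. It is not the node's code: a temporary directory stands in for the work directory, the 10-row dataset is hypothetical, and storing the indices sorted is an assumption based on the sample file.

```python
import json
import tempfile
from pathlib import Path

from sklearn.model_selection import train_test_split

# Assumption: a temporary directory stands in for the work directory.
work_dir = Path(tempfile.mkdtemp())
split_path = work_dir / "split_indices.json"

n_rows = 10  # hypothetical dataset size
train_idx, test_idx = train_test_split(
    list(range(n_rows)), test_size=0.2, random_state=42
)

# Writer side: store the indices with the settings that produced them.
split_path.write_text(json.dumps({
    "train_indices": sorted(int(i) for i in train_idx),
    "test_indices": sorted(int(i) for i in test_idx),
    "random_state": 42,
    "test_size": 0.2,
    "stratified": False,
    "stratify_column": None,
}, indent=2))

# Reader side: a downstream consumer reloads the file and selects rows.
split = json.loads(split_path.read_text())
data = [f"row-{i}" for i in range(n_rows)]  # stand-in for the loaded dataset
train_data = [data[i] for i in split["train_indices"]]
test_data = [data[i] for i in split["test_indices"]]

print(len(train_data), len(test_data))  # 8 2
```

Because the file stores row positions rather than rows, the consumer has to load the dataset in the same row order as the producer for the indices to select the intended rows.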

While the Trainer node has a built-in test split, using DataSplit is preferred:

  • Separates concerns (splitting vs training)
  • Makes split configuration visible in the pipeline
  • Enables stratification
  • Keeps the split reproducible across reruns

Dataset Size       Recommended Split
< 1,000 rows       0.2-0.3
1,000-10,000       0.2
10,000-100,000     0.15-0.2
> 100,000          0.1-0.15

Smaller datasets need a larger test fraction so that the test set still contains enough rows for a reliable evaluation.
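
The recommendations can be encoded as a lookup. This is a sketch: recommended_test_size is a hypothetical helper, and assigning exactly 10,000 and 100,000 rows to the lower bracket is an assumption, since the table leaves those boundaries open.

```python
from typing import Tuple


def recommended_test_size(n_rows: int) -> Tuple[float, float]:
    """Return the (low, high) recommended test-size range for a row count."""
    if n_rows < 1_000:
        return (0.2, 0.3)
    if n_rows <= 10_000:
        return (0.2, 0.2)
    if n_rows <= 100_000:
        return (0.15, 0.2)
    return (0.1, 0.15)


print(recommended_test_size(500))      # (0.2, 0.3)
print(recommended_test_size(250_000))  # (0.1, 0.15)
```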

Always enable stratification for classification tasks, especially with:

  • Imbalanced classes
  • Multi-class problems
  • Small datasets

The column name must match exactly (case-sensitive). Check the DataLoader preview for correct column names.
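
A case mismatch is easy to catch before running the pipeline. The helper below is a hypothetical sketch, not part of the node: it performs the exact, case-sensitive match and suggests a near miss when the name differs only by case or surrounding whitespace.

```python
from typing import List


def check_stratify_column(name: str, columns: List[str]) -> str:
    """Report whether `name` matches a column exactly, with a hint if not."""
    if name in columns:
        return f"OK: '{name}' found."
    # Look for a column that differs only by case or surrounding whitespace.
    wanted = name.strip().lower()
    near = [c for c in columns if c.strip().lower() == wanted]
    if near:
        return f"'{name}' not found. Did you mean '{near[0]}'?"
    return f"'{name}' not found. Available columns: {', '.join(columns)}"


columns = ["age", "income", "Target"]  # hypothetical column names
print(check_stratify_column("Target", columns))  # OK: 'Target' found.
print(check_stratify_column("target", columns))  # 'target' not found. Did you mean 'Target'?
```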

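If a class has very few samples, stratification may fail: scikit-learn needs at least two samples of every class so that each class can appear in both the train and test sets. Options: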

  • Disable stratification
  • Combine rare classes
  • Collect more data
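
One way to combine rare classes is sketched below. combine_rare_classes, the "other" label, and the default threshold of two samples are illustrative assumptions; the threshold reflects the minimum class size that scikit-learn's stratified split accepts.

```python
from collections import Counter
from typing import List


def combine_rare_classes(
    labels: List[str], min_count: int = 2, other_label: str = "other"
) -> List[str]:
    """Replace every label that occurs fewer than min_count times."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]


# Hypothetical labels: classes C and D have a single sample each.
labels = ["A"] * 6 + ["B"] * 3 + ["C"] + ["D"]
merged = combine_rare_classes(labels)

print(Counter(merged))  # Counter({'A': 6, 'B': 3, 'other': 2})
```

If only one class is rare, the merged class still has a single sample, so recheck the counts after merging and raise min_count or disable stratification if needed.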

Set a fixed random_state (e.g., 42) for reproducible splits.