# Script Node
The Script node lets you write custom Python code within your pipeline. Use it for data preprocessing, feature engineering, or custom analysis.
## Overview

| Property | Value |
|---|---|
| Type | Processing node |
| Inputs | DataFrame or Model (optional) |
| Outputs | DataFrame or Model |
| Editor | Monaco (VS Code-like) |
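As a minimal sketch of this input/output contract: a script that receives `df` from an upstream DataLoader, reports its shape, and passes it along unchanged (whatever `df` holds at the end flows to the next node):

```python
# Minimal pass-through: `df` arrives from the upstream node,
# and the final value of `df` is sent downstream.
print(f"Received {len(df)} rows, {len(df.columns)} columns")
df = df.copy()  # optional: avoid mutating the input in place
```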
## Editor Features

The Script node uses Monaco Editor with:
- Python syntax highlighting
- Auto-completion
- Error highlighting
- Line numbers
- Multiple cursors
## Available Variables

Inside your script, these variables are available:

| Variable | Type | Description |
|---|---|---|
| `df` | DataFrame | Input data (if connected to DataLoader) |
| `model` | sklearn model | Input model (if connected to Trainer) |
| `np` | module | NumPy (pre-imported) |
| `pd` | module | Pandas (pre-imported) |
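A quick way to confirm which inputs are actually connected is to check the script's global namespace before using them. A sketch, assuming inputs land as script globals (per the Common Errors table below, an unconnected input is simply undefined):

```python
# Report which optional inputs this run actually received.
for name in ('df', 'model'):
    obj = globals().get(name)
    print(f"{name}: {type(obj).__name__}" if obj is not None
          else f"{name} is not connected")

# The pre-imported modules are always available.
print(f"NumPy {np.__version__}, pandas {pd.__version__}")
```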
## Output Requirements

Your script must assign results to specific output variables:

### Outputting Data

```python
# Input: df (from DataLoader)
# Output: df (modified DataFrame)

# Filter rows
df = df[df['age'] > 18]

# Add new column
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'mid', 'senior'])

# The final value of `df` is passed to the next node
```

### Outputting a Model
```python
# Input: model (from Trainer)
# Output: model (modified or wrapped model)

# Access model properties
print(f"Model type: {type(model).__name__}")
print(f"Feature importances: {model.feature_importances_}")

# Pass model through unchanged
# model = model (implicit)
```

## Common Use Cases
### Data Preprocessing

```python
# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True)
df['category'].fillna('unknown', inplace=True)

# Or drop rows with any missing values
df = df.dropna()
```

```python
# Create derived features
df['total_spend'] = df['price'] * df['quantity']
df['log_income'] = np.log1p(df['income'])
df['name_length'] = df['name'].str.len()

# Date features
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
```

```python
# Label encoding (keeps a single numeric column)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-hot encoding (note: consumes the original columns,
# so run it after any step that still needs them)
df = pd.get_dummies(df, columns=['category', 'region'])
```

### Data Filtering
```python
# Filter by condition
df = df[df['status'] == 'active']

# Filter by date range
df['date'] = pd.to_datetime(df['date'])
df = df[(df['date'] >= '2024-01-01') & (df['date'] <= '2024-12-31')]

# Sample random rows
df = df.sample(frac=0.1, random_state=42)  # 10% sample
```

### Data Transformation
```python
# Normalize numeric columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = ['age', 'income', 'score']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Log transform
df['log_revenue'] = np.log1p(df['revenue'])
```

### Custom Analysis
```python
# Generate summary statistics
print("Dataset Summary:")
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print("\nNumeric stats:")
print(df.describe())

print("\nMissing values:")
print(df.isnull().sum())

print("\nClass distribution:")
print(df['target'].value_counts())
```

## Error Handling
Scripts run in an isolated Python subprocess. If your script fails:
- Error messages appear in the output panel
- The pipeline stops at this node
- Fix the error and re-run
### Common Errors

| Error | Cause | Fix |
|---|---|---|
| `NameError: name 'df' is not defined` | No DataLoader connected | Connect a DataLoader first |
| `KeyError: 'column_name'` | Column doesn’t exist | Check `df.columns` |
| `ModuleNotFoundError` | Package not installed | Install with `pip install package` |
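The `KeyError` case can be caught up front with a defensive check rather than mid-script. A sketch (the required column names here are illustrative):

```python
# Fail fast with a readable message if expected columns are missing.
required = ['age', 'income', 'target']  # illustrative names
missing = [col for col in required if col not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}. "
                     f"Available: {list(df.columns)}")
```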
## Print Output

All `print()` output appears in the node’s output panel:
print("Processing data...")print(f"Input shape: {df.shape}")
# Your processing code here
print(f"Output shape: {df.shape}")print("Done!")Output:
Processing data...Input shape: (1000, 10)Output shape: (950, 12)Done!Limitations
- No GUI: Can’t display matplotlib plots (use Evaluator for visualizations; a text-based workaround is sketched below)
- No input: Can’t read from stdin or prompt for user input
- Timeout: Scripts time out after 5 minutes by default
- Memory: Limited to available system memory
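Within these limits there are workarounds: distributions can be inspected as text instead of plots, and sampling keeps exploratory runs well under the timeout. A sketch (the column name is illustrative):

```python
# Text-based stand-in for a histogram: bucket a column and count.
print(pd.cut(df['age'], bins=5).value_counts().sort_index())

# While iterating on a slow script, work on a sample to stay
# under the 5-minute timeout; remove this line for the full run.
df = df.sample(frac=0.1, random_state=42)
```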
## External Packages

You can import any installed Python package:

```python
import numpy as np   # Pre-imported
import pandas as pd  # Pre-imported
from sklearn.preprocessing import StandardScaler
from scipy import stats
import re
```

## Best Practices
- Keep scripts focused — One transformation per script for clarity
- Add comments — Document what the script does
- Print progress — Use `print()` to show what’s happening
- Handle errors — Use try/except for risky operations (see the sketch below)
- Test incrementally — Run after each change
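For the error-handling item, a sketch of wrapping a risky operation so the message in the output panel points at the cause before the pipeline stops:

```python
# Wrap a fragile conversion so a failure reports context first.
try:
    df['date'] = pd.to_datetime(df['date'])
except Exception as exc:
    print(f"Date parsing failed: {exc}")
    print(f"Sample values: {df['date'].head().tolist()}")
    raise  # re-raise so the pipeline still stops at this node
```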
## Example: Complete Preprocessing Script

```python
"""
Data Preprocessing Script
- Clean missing values
- Encode categoricals
- Create features
- Scale numerics
"""

print("Starting preprocessing...")

# 1. Handle missing values
print(f"Missing before: {df.isnull().sum().sum()}")
df['age'].fillna(df['age'].median(), inplace=True)
df['income'].fillna(df['income'].mean(), inplace=True)
df = df.dropna(subset=['target'])
print(f"Missing after: {df.isnull().sum().sum()}")

# 2. Feature engineering
df['income_per_age'] = df['income'] / (df['age'] + 1)
df['is_high_income'] = (df['income'] > df['income'].median()).astype(int)

# 3. Encode categoricals
df = pd.get_dummies(df, columns=['region', 'category'], drop_first=True)

# 4. Scale numerics
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = ['age', 'income', 'score', 'income_per_age']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(f"Final shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("Preprocessing complete!")
```

## Related Nodes
- DataLoader — Load data to process
- Trainer — Train on processed data