Why not feed raw signal?
Imagine handing a friend a one-hour audio recording of a meeting and asking "so what did they decide?" They have to listen to all of it before answering. A raw EEG window is like that recording: thousands of samples per channel, mostly redundant, riddled with artifacts, and only a tiny fraction of it actually relates to your intent.
Feature extraction is the act of writing the summary. Instead of every sample, you compute a few numbers that capture the part that matters: how strong a rhythm is, or how big a bump appears at a certain moment. The classifier then works with that short summary, which is faster, more stable, and far easier to learn from.
Band-power features
For rhythm-based BCIs like motor imagery and SSVEP, the useful information lives in how much energy sits in a particular frequency band on a particular channel. That quantity is the band power. When you imagine moving your left hand, the mu rhythm (about 8 to 13 Hz) over the opposite side of your head drops; band power turns that drop into a single number the decoder can read.
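A simple recipe: keep only the frequencies you care about with a bandpass filter, then measure how much the signal wiggles. The variance of a bandpassed signal is a clean stand-in for its power: power is the average squared amplitude, and because the filter removes the constant offset, the signal's mean is essentially zero, so its variance and its mean square coincide.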
import numpy as np
from scipy.signal import butter, filtfilt
# x: one channel of a short EEG window; fs: sampling rate (Hz)
def band_power(x, fs, lo=8, hi=13):
    # design an 8-13 Hz bandpass (the mu band)
    b, a = butter(4, [lo, hi], btype="band", fs=fs)
    # filtfilt filters forward + backward (no phase shift)
    xf = filtfilt(b, a, x)
    # variance of the band = its power -> one number per window
    return np.var(xf)
ERP features
P300 and other event-related potential (ERP) BCIs work differently. Here the signal does not tell you its story through ongoing rhythm power; it tells you through a shape that appears at a fixed delay after a stimulus. The P300, for instance, is a positive voltage bump that peaks roughly 300 ms after something surprising or relevant flashes by.
So an ERP feature is not power in a band; it is the amplitude at chosen moments in time, measured from the stimulus. A common approach is to lock the window to each stimulus, then sample the voltage at a few time points (say every 50 ms from 0 to 600 ms) on every channel. Stack those values and you have your feature vector. The axis of the feature is time, not frequency.
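The time-locked sampling described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `erp_features`, the `(n_channels, n_samples)` array layout, and the `stim_samples` list of stimulus onsets (as sample indices) are all assumptions made for the example, and the EEG is assumed to be filtered already.

```python
import numpy as np

def erp_features(eeg, fs, stim_samples, t_start=0.0, t_stop=0.6, step=0.05):
    """Sample each channel at fixed delays after every stimulus.

    eeg: array of shape (n_channels, n_samples), assumed already filtered
    fs: sampling rate in Hz
    stim_samples: stimulus onsets as sample indices (hypothetical input)
    Returns an array of shape (n_stimuli_kept, n_channels * n_timepoints).
    """
    # delays after the stimulus in seconds: 0, 0.05, ..., 0.6
    times = np.arange(t_start, t_stop + step / 2, step)
    # convert the delays to sample offsets
    offsets = np.round(times * fs).astype(int)
    feats = []
    for s in stim_samples:
        idx = s + offsets
        if idx[-1] >= eeg.shape[1]:
            continue  # stimulus too close to the end of the recording
        epoch = eeg[:, idx]          # (n_channels, n_timepoints)
        feats.append(epoch.ravel())  # stack into one feature vector
    return np.array(feats)
```

With 8 channels and 13 time points, each stimulus yields a 104-number feature vector. Real pipelines often average a short stretch around each time point rather than taking a single sample; the single-sample version here is only the simplest form of the idea.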
Keeping it small
It is tempting to throw in every band on every channel at every time point. Resist it. With many features but only a few minutes of calibration data, the classifier starts memorizing noise that happened to line up with your labels: it looks brilliant in training and falls apart live. That trap is called overfitting, and the next guide is devoted to it.
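To see the trap in action, here is a small synthetic sketch. The data are pure random noise with random labels, so there is nothing real to learn; the trial and feature counts, and the choice of a plain least-squares linear classifier, are assumptions picked only to make the effect obvious.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 200, 500  # far more features than trials

# pure-noise features and random labels: no real signal anywhere
X_train = rng.standard_normal((n_train, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.standard_normal((n_test, n_features))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# least-squares linear classifier; with more features than trials
# it can fit the training labels exactly
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy comes out perfect even though the labels are random, while accuracy on fresh trials hovers around chance. That gap is the signature of a model that memorized noise.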
Two cures help. Feature selection keeps only the channels and bands that actually carry intent and drops the rest. Dimensionality reduction combines many raw channels into a few informative mixtures. For motor imagery this pairs naturally with common spatial patterns (CSP), which learns the channel mixtures that best separate two mental states before you ever compute band power.
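As a rough sketch of how CSP can be computed, the version below solves a generalized eigenvalue problem on the two classes' average covariance matrices. The function names `csp_filters` and `csp_features`, the `(n_trials, n_channels, n_samples)` layout, and the use of log-variance as the final feature are assumptions for illustration; the trials are assumed to be bandpassed already.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Learn spatial filters that separate two classes by variance.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples),
    assumed already bandpassed. Returns (2 * n_pairs, n_channels).
    """
    def mean_cov(trials):
        # average channel-by-channel covariance over trials
        return np.mean([np.cov(t) for t in trials], axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # generalized eigenproblem: ca @ w = lam * (ca + cb) @ w
    vals, vecs = eigh(ca, ca + cb)
    # eigenvalues come back ascending: the first filters have high variance
    # for class b, the last ones have high variance for class a
    pick = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return vecs[:, pick].T

def csp_features(trial, W):
    """Project one (n_channels, n_samples) trial and take log band power."""
    z = W @ trial                       # a few informative channel mixtures
    return np.log(np.var(z, axis=1))    # log-variance, a common convention
```

A typical use would be `W = csp_filters(left_trials, right_trials)` on calibration data, then `csp_features(trial, W)` on each new window, so that dozens of channels collapse into four numbers before the classifier sees them.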