Exploring EEG Data and Designing a PostgreSQL Schema for Machine Learning

Electroencephalography (EEG) is a fascinating window into the brain’s activity. By placing electrodes on the scalp, researchers and clinicians can capture the electrical signals produced by neurons firing in sync. These brainwave recordings have applications ranging from medical diagnosis to cutting-edge brain-computer interfaces. In this article, we’ll introduce what EEG is and the types of data it produces, then dive into how we can organize this data in a PostgreSQL database. We’ll outline a simple relational schema (just a few tables) optimized for machine learning projects, balancing accessibility (for general readers) with technical depth (for engineers). Finally, we’ll hint at how this structured data can be leveraged in a future machine learning project.

Introduction to EEG

What is EEG? Electroencephalography (EEG) is an electrophysiological technique for recording the brain’s electrical activity from the scalp. Essentially, the brain’s billions of neurons communicate via tiny electrical impulses. When thousands of neurons fire together, they create an electrical field strong enough to spread through brain tissue, the skull, and skin – eventually detectable on the scalp. EEG uses electrodes (often embedded in a cap) to measure these voltage fluctuations over time at various points on the head. Because changes in neural activity happen very fast, EEG offers excellent temporal resolution, capturing events on the scale of milliseconds. This makes EEG ideal for studying dynamic brain functions like perception, attention, and cognition in real time.

How does it work? In a typical EEG setup, multiple electrodes (e.g. 16, 32, or 64 channels) are placed in standard positions on the scalp (such as the 10–20 system). Each electrode picks up voltage differences between itself and a reference, reflecting underlying neural oscillations. The raw signals are very small (tens of microvolts), so they are amplified and digitized into a continuous stream of samples (often at 250–1000 Hz sampling rate). A session might last from a few minutes to hours, generating large amounts of data. For example, a research-grade EEG recording can include 64 active EEG channels plus additional sensors like EOG (eye movement) channels, producing millions of data points per session. Despite challenges with noise (blinks, muscle activity, electrical interference), EEG remains a non-invasive and widely used method due to its ability to capture brain activity as it happens.

What is it used for? Historically, EEG has been invaluable for clinical diagnostics like detecting epilepsy (seizures show characteristic EEG patterns). It’s also used to monitor anesthesia depth during surgeries, in sleep studies, and to diagnose other brain disorders. Beyond medicine, EEG is popular in cognitive neuroscience research and emerging areas like neurofeedback and brain-computer interfaces. Its high time resolution allows researchers to study when different parts of the brain activate during tasks. In recent years, there’s growing interest in applying machine learning to EEG data – for example, to recognize patterns that indicate fatigue, to control prosthetics via thought, or to classify mental states. This intersection of EEG and ML brings us to the need for effective data management: before we can train algorithms, we must organize and prepare the rich data EEG provides.

Types of EEG Data

EEG experiments produce several types of data. Understanding these will guide how we store and manage the information. Let’s break down the key data types:

Raw Voltage Signals (Time-Series Data)

The core EEG data is the raw voltage time series recorded from each electrode (channel). This is essentially a stream of voltage values over time for each sensor on the scalp. Think of it as multiple waveforms (one per channel) that represent brainwave activity. A typical EEG recording might look like a set of squiggly lines oscillating at different frequencies. These raw signals contain everything – the brain’s spontaneous rhythms, responses to stimuli, plus noise and artifacts.

Key characteristics of raw EEG signals:

  • Multi-channel: Modern EEG caps record from many sites on the scalp simultaneously (e.g., 32 channels). Each channel has a label (like Fz, C3, P4, etc. in the 10–20 system) corresponding to a scalp location.
  • Sampling rate: EEG is sampled at a fixed frequency (e.g., 256 Hz means 256 data points per second per channel). This yields a time series of data points. For instance, 10 seconds of data at 256 Hz from 32 channels would be 10 * 256 * 32 = 81,920 samples in total.
  • Units: The voltage is measured in microvolts (µV). Because these signals are tiny, they’re amplified and filtered by the EEG hardware before recording.
  • Continuous vs. epochs: Some experiments record continuous EEG for a period, while others segment data into trials or epochs (time-locked to events, see ERPs below). In storage, continuous data might be saved as one long sequence per channel, whereas epoched data might be saved as segments.

Raw EEG data is rich but often requires preprocessing (filtering out noise, artifact removal) before it’s directly useful. However, for completeness we usually want to store the raw signals, since they are the source from which all other EEG metrics are derived.
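The data-volume arithmetic above is worth making concrete. Here is a minimal Python sketch (the function name and parameters are our own, for illustration) that computes the total sample count for a recording:

```python
# Back-of-the-envelope data volume for an EEG recording.
# The numbers below are illustrative, matching the examples in the text.

def sample_count(duration_s: float, sampling_hz: int, n_channels: int) -> int:
    """Total number of stored samples for one recording."""
    return int(duration_s * sampling_hz * n_channels)

# 10 seconds at 256 Hz across 32 channels -> 81,920 samples, as computed above.
print(sample_count(10, 256, 32))     # 81920

# A one-hour, 64-channel session at 500 Hz is a different story:
print(sample_count(3600, 500, 64))   # 115200000
```

A single long session can therefore reach hundreds of millions of samples, which is why the storage considerations later in this article matter.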

Frequency Bands and Derived Features

One common way to analyze EEG is by looking at frequency bands – essentially breaking the signal into components by their oscillation speed. EEG waveforms are traditionally categorized into bands such as:

  • Delta – the slowest waves (<4 Hz), prominent in deep sleep.
  • Theta – 4–7 Hz, seen during drowsiness or light sleep.
  • Alpha – ~8–12 Hz, associated with relaxed wakefulness (for example, alpha waves are stronger when you close your eyes).
  • Beta – ~13–30 Hz, linked to active thinking or alertness.
  • Gamma – >30 Hz (up to ~100 Hz), a fast rhythm that has been linked to heightened focus or integration of information.

These frequency band definitions are somewhat conventional (exact cutoffs can vary), but they’re useful for summarizing EEG activity. Researchers often compute derived features from raw EEG such as:

  • Power spectral density or band power: measuring the strength of activity in each frequency band over a time window.
  • Ratios or indices: e.g., the ratio of alpha to theta power, which might correlate with certain cognitive states.
  • Spectrograms or time-frequency representations: showing how power in various bands changes over time.

In terms of data storage, frequency features can be considered derived data. We might store some of these if they’ll be used as features for machine learning. However, they can also be computed on the fly from raw signals when needed. If we do pre-compute them (to save computation later), we’d likely store them in a separate table or as part of a features table.
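To make band power concrete, here is a minimal sketch in Python using only NumPy. It uses a plain periodogram for brevity; a real pipeline would typically use Welch’s method (e.g. scipy.signal.welch) instead. The function and the synthetic 10 Hz signal are our own illustrations:

```python
import numpy as np

def band_power(signal, fs, low, high):
    """Average power of `signal` within [low, high) Hz, via a simple periodogram.
    A sketch only; production code would use Welch's method for stability."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= low) & (freqs < high)
    return psd[mask].mean()

# Synthetic example: a pure 10 Hz (alpha-range) sine wave.
fs = 256
t = np.arange(0, 4, 1.0 / fs)          # 4 seconds of data
x = np.sin(2 * np.pi * 10 * t)

alpha = band_power(x, fs, 8, 12)
beta = band_power(x, fs, 13, 30)
print(alpha > beta)                     # True: alpha power dominates a 10 Hz signal
```

The same function applied per channel and per time window yields exactly the kind of band-power features a features table would hold.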

Event-Related Potentials (ERPs)

Event-related potentials (ERPs) are specific brain responses time-locked to events or stimuli. In an experiment, you might present a stimulus (say, a flash of light or a word on screen) and record the EEG. If you repeat this many times and average the EEG segments time-locked to the stimulus, a characteristic waveform may emerge – that is the ERP. It represents neural activity consistently triggered by that event, with unrelated brain activity averaged out.

ERPs are typically named by their polarity (positive/negative) and timing. For example, the P300 is a well-known ERP component: a positive voltage peak around 300 milliseconds after a rare or significant stimulus. It often appears in “oddball” tasks when the subject detects an infrequent target item among frequent non-targets. The P300 is associated with attention and decision-making processes. Other examples include the N100 (a negative deflection ~100 ms after a sensory stimulus) or the N170 (related to face recognition).

In terms of data:

  • To analyze ERPs, EEG data is cut into epochs (time segments) around each event (e.g., from 100 ms before to 800 ms after a stimulus). Each epoch is usually labeled with the event type (stimulus category, etc.).
  • Multiple epochs for the same event type may be averaged to get a cleaner ERP waveform.
  • We also record event markers or triggers in the data – these are timestamps of when each stimulus or significant event occurred during the recording.

For database storage, raw EEG data might be continuous, but we’d store the event markers (with timestamps and labels) so we can later extract or compute ERPs. We might also store pre-processed ERP data or features (e.g., the peak amplitude of P300 for each trial) if needed for machine learning classification.
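The epoching-and-averaging procedure above can be sketched in a few lines of NumPy. This is a single-channel toy example on synthetic data (the function name, window lengths, and the injected “response” are our own illustrations, not a standard API):

```python
import numpy as np

def extract_epochs(signal, fs, event_samples, pre_s=0.1, post_s=0.5):
    """Cut fixed-length epochs around each event (events given as sample indices).
    Epochs that would run past the recording edges are skipped."""
    pre, post = int(pre_s * fs), int(post_s * fs)
    epochs = []
    for ev in event_samples:
        if ev - pre >= 0 and ev + post <= len(signal):
            epochs.append(signal[ev - pre: ev + post])
    return np.array(epochs)

# Synthetic single-channel recording: noise plus a consistent bump after each "stimulus".
rng = np.random.default_rng(0)
fs = 256
signal = rng.normal(0, 1, fs * 60)          # 60 s of noise
events = [fs * s for s in range(5, 55, 5)]  # one event every 5 s (10 events)
for ev in events:
    signal[ev: ev + 50] += 3.0              # add a stereotyped "response"

epochs = extract_epochs(signal, fs, events)
erp = epochs.mean(axis=0)                   # averaging reveals the time-locked response
print(epochs.shape)                         # (10, 153): 10 epochs of ~0.6 s each
```

With the schema described later, `event_samples` would come straight from a query on the events table, and `signal` from the signals table.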

Metadata and Experimental Context

Apart from the signal data, metadata is crucial for EEG research:

  • Subject details: Who is the subject or participant? This can include an anonymized subject ID, demographic info (age, gender, handedness), clinical conditions, etc. Such data is important since EEG characteristics can vary with age or other factors.
  • Experimental conditions: Information about the experiment setup for each recording session. For example, what task was the person doing? What are the conditions or stimuli (e.g., “eyes open resting”, “eyes closed resting”, “visual oddball task”)? Was the subject on any medication? All such contextual details help in analysis and are essential for machine learning models to avoid confounds.
  • Session info: Date and time of recording, equipment used (which EEG device, how many channels, sampling rate), electrode configuration, etc. These details help ensure data is comparable or to account for differences (e.g., one session might use 32 channels, another uses 64).
  • Data processing notes: Any preprocessing applied (filters, artifact removal, etc.) could also be documented as metadata. For example, one might note that a 60 Hz notch filter was applied to remove mains interference.

In a database, metadata would be stored in structured form. We typically have separate tables for subjects and sessions/experiments (with fields for condition, date, etc.). This metadata allows us to query and subset the EEG data easily (for instance, “give me all resting-state recordings from male subjects over 50 years old” or “fetch data from sessions where the task was an oddball paradigm”).

Why store all these types of data? For a machine learning project, having raw data, derived features, event markers, and metadata all in one place can be very powerful. Raw signals let you try new feature extractions; frequency bands and ERPs capture known meaningful patterns; event labels provide ground truth for supervised learning; and metadata lets you control for variables or filter the dataset. A well-designed database can accommodate all these interrelated pieces in an organized way.

Database Design Considerations

Storing EEG data for analysis poses some challenges. EEG datasets can be large and complex, so careful database design is needed to ensure the data is organized, efficient, and easily queryable for machine learning tasks. Here are key considerations:

  • Normalization (Organizing Data): We should structure the database to minimize redundancy and ambiguity. Database normalization is the process of splitting data into tables such that each type of information is stored once, and relationships are maintained via references. For example, we don’t want to repeat subject demographic info in every recording entry – instead, have a Subjects table and link to it. Normalization improves data integrity (consistency) and clarity of the schema. In practice, this means creating separate tables for distinct entities (subjects, sessions, channels, etc.) and linking them with foreign keys. (Note: highly normalized designs can require more joins to get data out, but they ensure a clean, unambiguous dataset structure.)
  • Efficient Querying & Indexing: EEG datasets, especially raw signals, can contain millions of rows. To avoid slow queries, we rely on indexing. An index in SQL is like an index in a book – it allows the database to locate specific rows quickly without scanning the entire table. For instance, we might index the time column on a signals table, or index foreign keys (like session_id) so we can rapidly retrieve all data from a given session. Proper indexing can dramatically speed up common queries, such as “fetch all signal points for session X between 10 and 20 seconds” or “get all events of type ‘stimulus_A’ for subject Y.” When designing for machine learning, think about the typical data access patterns: Will you load data per session? Per trial/event? By subject? Ensure those query paths are optimized with indexes. (Be mindful that too many indexes or very large indexes can slow down inserts – it’s a balance.)
  • Relational Structure and Joins: A relational database like PostgreSQL is great for structured, linked data. We will have relationships like one subject has many sessions, one session has many signal data points, one session has many events, etc. Designing the schema involves identifying these relationships and using primary keys (unique identifiers) and foreign keys (references to those identifiers) appropriately. This ensures referential integrity – e.g., every session_id in the signals table must actually exist in the Sessions table. It also allows us to perform JOINs in SQL to combine data (for example, joining signals with events and subjects to get a complete view for analysis). Good design means those joins will be logical and efficient.
  • Scalability and Volume: EEG data volume can be significant. If we store every sample of every channel, a single session can easily be millions of rows. Multiply that by dozens of subjects or sessions, and the database can become very large. There are a few ways to handle this:
      • Downsampling or summarizing: We might not need to store every sample if high-resolution data isn’t necessary for our ML task. For example, we could store signals downsampled to a lower rate for analysis, or store precomputed features per epoch instead of raw points.
      • Partitioning: PostgreSQL allows table partitioning, which could be useful to split a huge signals table by session or by subject, making queries on one session faster.
      • Using Binary Storage: Sometimes raw signals are stored in binary files (like EDF or HDF5) for efficiency, and the database stores only a reference or metadata about those files. For our simple design, we’ll assume we store the needed data in tables directly for simplicity, but it’s worth noting that extremely large raw data might live outside the SQL database.
  • Trade-offs: Fully normalized (split) schema vs. denormalized (fewer tables, some duplicate info) is a trade-off. Highly normalized is clean but requires joining tables frequently. In some cases, minor denormalization can speed up read performance (at the cost of some redundancy). Our design will aim for a balance but lean towards clarity (normal forms) given moderate data sizes.
  • Query Needs for ML: Think about how machine learning pipelines will use the data. Often, you’ll want to select a subset of data as a training set – e.g., “all epochs of stimulus X from all subjects” or “continuous data from minute 2 to 4 of each session labeled as condition Y”. The schema should make it easy to retrieve these slices. That means having clear ways to filter by subject, session, condition, or event type. If we plan to do supervised learning, we might have labels for certain time segments (e.g., “seizure vs non-seizure” or “eyes open vs eyes closed” segments). Storing those labels in an events table or a separate labels table is important. The schema should be extensible as well: we might add new derived features or additional metadata later, so designing with some flexibility (without hard-coding too many specifics) helps.

Indexing recap & other optimizations: We will likely set indexes on:

  • Primary keys (automatic in most SQL databases).
  • Foreign keys (to speed up joins on subject_id, session_id).
  • Time or sample index in the signals table (to quickly slice by time).
  • Event type or event time if we often query those (or use a composite index on session + event_type).

Additionally, ensure appropriate data types (e.g., use numeric types for continuous values, timestamp for time if using absolute times, etc.), and consider constraints (e.g., not allowing nulls where data is required, using CHECK constraints for valid value ranges if applicable).
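The constraint and index ideas above can be demonstrated in miniature. The sketch below uses Python’s built-in sqlite3 with an in-memory database as a stand-in for PostgreSQL (the CHECK and CREATE INDEX syntax shown is portable to both); table and index names mirror the schema proposed later:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE signals (
        session_id   INTEGER NOT NULL,
        channel_id   INTEGER NOT NULL,
        timestamp_ms INTEGER NOT NULL CHECK (timestamp_ms >= 0),
        voltage      REAL    NOT NULL
    )
""")
# Composite index matching the typical access pattern: slice one session by time.
con.execute("CREATE INDEX idx_signals_session_time ON signals(session_id, timestamp_ms)")

con.execute("INSERT INTO signals VALUES (1, 1, 0, 12.5)")   # valid row
try:
    con.execute("INSERT INTO signals VALUES (1, 1, -4, 9.9)")  # negative time: rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The NOT NULL and CHECK constraints catch malformed inserts at write time, which is far cheaper than discovering bad samples during model training.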

With these considerations in mind, we can design a schema that is both normalized for clarity and indexed for performance, making it suitable for interactive querying and serving as a solid foundation for machine learning analysis.

PostgreSQL Schema Design

Let’s propose a simple PostgreSQL schema to store EEG data, consisting of about five tables. This schema is designed to capture the raw data, events, and metadata we discussed, in a way that’s easy to query for analysis or machine learning. We’ll outline each table and its columns, along with the relationships between tables:

  • Subjects: Stores information about each participant/subject.
  • Sessions: Stores information about each recording session or experiment run.
  • Channels: (Optional but recommended) Stores information about EEG channels/electrodes.
  • Signals: Stores the raw EEG signal data points (this table can grow large).
  • Events: Stores markers for events or annotations (stimuli, responses, etc.) during each session.

Below is a simplified schema definition (in a SQL-like format) illustrating these tables and their key fields:

-- 1. Subjects table: one record per participant
CREATE TABLE subjects (
    subject_id SERIAL PRIMARY KEY,   -- unique ID for the subject
    subject_code TEXT,              -- e.g., "S01", an arbitrary code or name
    age INT,
    gender TEXT,
    notes TEXT                      -- any additional info (optional)
);

-- 2. Sessions table: one record per EEG recording session
CREATE TABLE sessions (
    session_id SERIAL PRIMARY KEY,
    subject_id INT REFERENCES subjects(subject_id),  -- link to who was recorded
    session_date DATE,
    experiment TEXT,        -- e.g., "Visual Oddball Task" or "Resting State"
    condition TEXT,         -- e.g., "medicated" vs "unmedicated", or any condition label
    notes TEXT
    /* We could also store fields like sampling_rate, EEG device used, etc. */
);

-- 3. Channels table: one record per EEG channel (electrode)
CREATE TABLE channels (
    channel_id SERIAL PRIMARY KEY,
    name TEXT,         -- e.g., "Fz" or "P4" (10-20 system labels)
    region TEXT        -- e.g., "Frontal" (optional descriptive region)
    /* If different sessions have different channel setups, you might have a link to session or a separate mapping table.
       Here we assume a standard set of channels across sessions for simplicity. */
);

-- 4. Signals table: stores raw EEG voltage readings
CREATE TABLE signals (
    signal_id SERIAL PRIMARY KEY,
    session_id INT REFERENCES sessions(session_id),
    channel_id INT REFERENCES channels(channel_id),
    timestamp_ms INT,    -- time of the sample within the session (e.g., milliseconds from start)
    voltage FLOAT        -- recorded voltage value at that time (in microvolts)
    /* In practice, we might use a TIMESTAMP type if we have absolute times.
       Using an integer (sample index or ms) is convenient for relative timing in each session. */
);
-- We add an index on (session_id, timestamp_ms) for fast slicing by session and time;
-- channel_id could be appended to the index if per-channel queries are common.
CREATE INDEX idx_signals_session_time ON signals(session_id, timestamp_ms);

-- 5. Events table: stores event markers (stimuli, responses, etc.)
CREATE TABLE events (
    event_id SERIAL PRIMARY KEY,
    session_id INT REFERENCES sessions(session_id),
    event_time_ms INT,    -- timestamp of the event within the session
    event_type TEXT,      -- label for the event, e.g., "stimulus_A" or "blink_artifact"
    event_value TEXT      -- additional info, e.g., stimulus ID or response result (optional)
);
-- Index events by session and time, and possibly by type if frequently queried by type:
CREATE INDEX idx_events_session_time ON events(session_id, event_time_ms);
CREATE INDEX idx_events_type ON events(event_type);

Let’s walk through this design:

  • Subjects: This table holds each subject’s info. We use a subject_id as the primary key (an integer that uniquely identifies each subject). We include some example columns like age and gender, which might be relevant for analysis. This table lets us avoid repeating subject details in every session or data entry – instead, we just reference the subject’s ID. If our project doesn’t need demographic details, this table could be as simple as an ID and maybe a code/name, but it’s good to have the structure in place for when such info is available.
  • Sessions: Each EEG recording session (or experiment trial) gets an entry. The session is linked to a subject_id (so we know who the data belongs to). We also store the date and perhaps the type of experiment or condition. For example, a subject might have two sessions: one labeled “Baseline (resting)” and another labeled “Post-training (resting)” on different dates – we can capture that here. The experiment and condition fields are free-text (they could also be coded as enums or separate lookup tables if we want to enforce consistency). Notes can include anything noteworthy about the session (e.g., “Subject was drowsy” or “Artifact noted at 5 min mark”). By querying the Sessions table, we can easily filter sessions by subject or by experiment type.
  • Channels: This table enumerates the EEG channels. If every session uses the same electrode layout (say a standard 32-channel cap), this table might list those 32 channels (with names like Fp1, Fp2, F3, etc.). We include a channel_id (PK) and the name. An optional region or description can be helpful (for example, channel Fz is in the frontal region). In a more complex scenario, different sessions might have different channel configurations – in that case we could have a separate mapping table to link sessions to the channels used. But to keep things simple, we assume a common set of channels. Storing channels separately is part of normalization (no need to repeat the channel name for every data point; we just reference the channel_id in the signals table). It also means we can easily filter or group by channel (e.g., “compare activity on frontal vs occipital channels” by using the region info).
  • Signals: This is the big table that holds the raw EEG measurements. Each row represents a single data point: a specific channel’s voltage at a specific time in a specific session. The columns are:
    • session_id (which session this reading is from),
    • channel_id (which electrode),
    • timestamp_ms (when in the session this reading occurred, here we use milliseconds from start as an example),
    • voltage (the value in microvolts).

We gave each row a unique signal_id as primary key (surrogate key), though we could also use a composite key of (session_id, channel_id, timestamp_ms) since no two data points should share all three. In practice, this table will have a lot of rows. For instance, a 10-second 32-channel recording at 256 Hz yields 81,920 rows as we calculated. Indexing this table on session_id and time will help make queries efficient. For example, if we want all data from session 5, the index quickly points to the section of this table for session 5. If we want a time window, the index can help find the start time quickly. We might also index by channel if we often fetch one channel’s data across time. Another consideration: storing time as an integer (milliseconds or sample number) is straightforward; using a true TIMESTAMP might be useful if aligning to real clock times or merging with other data streams, but it’s not necessary for analysis within one session.

Why store raw signals in the DB? – It makes the data self-contained and queryable (you can use SQL to filter or aggregate the signals). But note, if performance or size becomes an issue, one could store raw signals in external files and just keep file paths or summary in the database. Our schema here assumes we want the raw data accessible in SQL for maximum flexibility (e.g., maybe we’ll compute features with SQL queries or with Python after querying the DB).

  • Events: This table logs events or annotations that occur during a session. Each event has a time (event_time_ms), which should correspond to some point on the timeline of the session’s signals. The event_type can describe what happened – for example, “stimulus_A”, “stimulus_B”, “button_press”, “blink”, etc. This could include both experimental events (stimuli presented, responses made) and notes about artifacts or notable occurrences. We include an event_value as a generic field for any extra info (like the ID of the stimulus image, or the key pressed by the subject, etc., if applicable). By storing events in a table, we make it easy to join these with the signals. For instance, to extract epochs for an ERP, we can query for all signal data from 100 ms before to 500 ms after each stimulus_A event. Or, for machine learning, if we want to label segments of data, we can mark events like “seizure_start” and “seizure_end” or similar, then pull those segments as labeled data for training. Indexing events by session_id and time means we can quickly find all events in a session or the nearest event to a given time, etc. Indexing by event_type is useful if we want “find all occurrences of event X across all sessions” (e.g., all targets in an oddball paradigm for all subjects).

Relationships recap: In this schema, the relationships are:

  • Subjects ⟶ Sessions is a one-to-many (one subject can have many sessions). We enforce this with a foreign key sessions.subject_id ⟶ subjects.subject_id.
  • Sessions ⟶ Signals is one-to-many (a session has many signal data points). signals.session_id ⟶ sessions.session_id.
  • Sessions ⟶ Events is one-to-many. events.session_id ⟶ sessions.session_id.
  • Channels ⟶ Signals is one-to-many (each signal point is for one channel). signals.channel_id ⟶ channels.channel_id.

These relationships let us combine data as needed. For example, to get all data for a subject for a certain stimulus type, we could join Subjects ⟶ Sessions ⟶ Events ⟶ Signals (filter by subject and event type, then get corresponding signals). With indexes and proper keys, these joins and filters should be efficient.
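The Subjects ⟶ Sessions ⟶ Events join described above can be exercised end to end. The sketch below uses Python’s sqlite3 with an in-memory database as a lightweight stand-in for PostgreSQL; the table and column names match the article’s schema, while the sample rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, subject_code TEXT);
    CREATE TABLE sessions (session_id INTEGER PRIMARY KEY,
                           subject_id INTEGER REFERENCES subjects(subject_id),
                           experiment TEXT);
    CREATE TABLE events   (event_id INTEGER PRIMARY KEY,
                           session_id INTEGER REFERENCES sessions(session_id),
                           event_time_ms INTEGER, event_type TEXT);

    INSERT INTO subjects VALUES (1, 'S01'), (2, 'S02');
    INSERT INTO sessions VALUES (10, 1, 'Visual Oddball Task'),
                                (11, 2, 'Resting State');
    INSERT INTO events   VALUES (100, 10, 1500, 'stimulus_A'),
                                (101, 10, 3200, 'stimulus_B'),
                                (102, 11, 500,  'eyes_closed');
""")

# All stimulus_A events for subject S01, with session context:
rows = con.execute("""
    SELECT su.subject_code, se.experiment, ev.event_time_ms
    FROM subjects su
    JOIN sessions se ON se.subject_id = su.subject_id
    JOIN events   ev ON ev.session_id = se.session_id
    WHERE su.subject_code = 'S01' AND ev.event_type = 'stimulus_A'
""").fetchall()
print(rows)   # [('S01', 'Visual Oddball Task', 1500)]
```

Extending this query with one more join to the signals table (on session_id, filtered to a time window around event_time_ms) is exactly how epochs would be pulled out for ERP analysis or ML training.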

Potential extensions: Our simple schema can be extended based on needs. We could add a Features table that stores precomputed features or statistics per session or per event (for instance, average alpha power per session, or P300 amplitude per target stimulus). We could also have a Montages table if different sessions use different channel layouts (mapping channel names per session). For now, the above design covers the essentials. It’s in 3rd normal form (each table has a clear purpose and there’s no duplicate data across tables), and we’ve added indexes for performance. PostgreSQL will handle even millions of rows in the Signals table as long as we index and query properly (and we can always scale out or use partial loading if needed).

With our EEG data properly stored and organized, we can confidently move to the fun part – using this data to train and evaluate machine learning models!

Future Directions

Organizing EEG data is just the first step. With this foundation in place, we’re ready to dive into analysis and machine learning. In the next article, we’ll explore how to leverage the EEG data stored in our database for a machine learning project. This will include:

  • Data retrieval for ML: Using SQL (and maybe Python/pandas) to query the EEG data we need – for example, extracting segments around events to create a labeled dataset for training. We’ll show how our schema makes it straightforward to pull out, say, all 1-second epochs following a certain stimulus, along with their labels.
  • Feature engineering: Demonstrating how to compute features from raw EEG. This might involve filtering the signals, computing frequency band powers, or calculating ERP amplitudes for each trial. We could use PostgreSQL queries for simple aggregations or a scientific computing library after data extraction.
  • Machine learning models: Once we have features (or even raw time-series segments), we’ll feed them into ML models. We might start with something simple like logistic regression or SVM on summary features, and then tease more advanced approaches (maybe deep learning on EEG signals).
  • Use case example: We’ll likely center this on a concrete task – for instance, classifying whether a person saw a target or non-target stimulus from their EEG (a P300 detection task), or distinguishing between two mental states. This will show end-to-end how data from our database can train a predictive model.
  • Lessons on data handling: We’ll also discuss best practices in handling EEG data for ML – such as normalizing feature values, dealing with subject variability (maybe using subject as a random effect or using cross-subject evaluation), and avoiding overfitting given the often high dimensionality of EEG data.

Stay tuned for this next installment. By structuring the data well (as we’ve done in this article), we’ve set the stage for efficient experimentation. EEG data, with all its complexity, becomes far more approachable when organized in a relational database – and this opens the door to powerful analysis. In the upcoming article, we’ll see how those organized brainwave data can be turned into insights and predictive models using machine learning. The brain’s electrical whispers, once locked in a database, are ready to speak through data science!
