EDFZ: working with compressed EDFs
Luna supports reading and writing compressed EDFs, i.e. with
decompression performed on-the-fly. Historically we referred to these
as EDFZ files. In current Luna, the practical format is simply a
gzipped EDF, typically with a .edf.gz extension.
Two points are important:
-
Luna can read any gzipped EDF, whether or not it was created by Luna
-
gzipped EDFs are read as a whole stream, i.e. there is no indexed random-access scheme and no accompanying
.idxfile
For large projects, there can be considerable savings in terms of disk
space from working with compressed files. Because the format is just
gzip, a compressed EDF can be easily decompressed with the standard
gunzip tool, e.g.:
cat my.edf.gz | gunzip > my.edf
Here we consider two use-cases (PSGs from the NSRR, and a hdEEG
study), to consider the space/time trade-offs in working with gzipped EDFs
instead of EDF. We'll also consider how this plays with Luna's
RECORD-SIZE command, which
allows you to change the internals of the EDF format. Because Luna
reads standard EDFs one record at a time, different records sizes can have
different impacts on different workflows. Here we'll consider
streaming over the entire EDF. For gzipped EDFs, Luna reads the file
as a whole stream rather than using random access into the compressed
representation.
These tests were performed on an iMac with a 4 GHz Intel Core i7 processor and 32 GB RAM.
Info
Naturally, if you are routinely using other tools to work with the EDFs alongside Luna, there will be limited value in using compressed files, unless those other tools can read gzipped files directly. That is, you would not want to have to retain uncompressed copies of the original EDF too... On the other hand, if you are moving EDFs to a separate system (e.g. a Linux compute cluster) specifically for Luna analyses, this might be a really good use-case for creating and migrating gzipped EDFs only.
Use-case #1, NSRR PSG
- ~10 hours with 14 signals
- 54Mb file
- two EEG (125 Hz) channels, ECG, EMG, EOG, respiratory and other signals
- original EDF record size of 1 second
From the original EDF, we'll obtain three other files, using
compression and/or a different record-size, using the following
commands. To write a compressed EDF, add the edfz option
to WRITE:
luna t1.lst -s 'WRITE edfz edf-dir=test/ edf-tag=z1 sample-list=z1.lst'
As well as the original 1 second record size, we'll try working with a
30-second record size. Because epochs cannot be smaller than the EDF
record size, we do not want to set this to be any larger. To increase
the record size to 30 seconds, but keep an uncompressed EDF, we'd use
the RECORD-SIZE command
(which also expects the same parameters are WRITE, as it forces an
immediate write of the reformatted EDF):
luna t1.lst -s 'RECORD-SIZE dur=30 edf-dir=test/ edf-tag=t30 sample-list=t30.lst'
Finally, to change both record size and write as a compressed EDF:
luna t1.lst -s 'RECORD-SIZE dur=30 edfz edf-dir=test/ edf-tag=z30 sample-list=z30.lst'
Here are the resulting file sizes (in bytes) for EDF and gzipped EDF:
EDF EDF.GZ RATIO
RS = 1 54495840 24578092 45%
RS = 30 54495840 23660147 43%
Using gzip reduces file size by more than half: from 54Mb to around 24Mb. In general, PSG files compress quite well. As expected, changing EDF record size has no impact on the EDF file size, whereas there is a small difference in the efficiency of compression.
Note
Current Luna no longer creates .idx files for compressed EDF
output. The output is a plain gzipped EDF, typically with an
.edf.gz extension.
But how much of a price do we pay (if any) for having to decompress on-the-fly, in terms of speed? In theory, working with compressed files could actually be faster than uncompressed files, depending on disk speed, CPU speed and the achieved compression ratio, but in general we expect it will slow things down a little.
We'll use the STATS command to calculate
summary statistics for every channel in the EDF or EDFZ, which ensures
that all data from the EDF is read into memory:
luna t1.lst -s STATS
The times taken (in seconds) to load the data and complete this command are:
EDF EDF.GZ
RS = 1 2.8 10.5
RS = 30 1.9 2.3
In this case, we see that working with gzip and small record sizes (1 second) incurs noticeable overhead (in relative terms), taking 10.5 instead of 2.8 seconds. However, using the larger 30-second record size speeds things up in both cases: in fact, the compressed EDF is now faster than the original EDF, but only using 43% of the disk space. You do pay a 0.4 second cost per-file on loading, but for any non-trivial analysis, this will be negligible in comparison to the overall processing time.
Importantly, identical results (from the STATS command) were
obtained for all four analyses. That is, gzip is a lossless
compression format. We can also check that after using gunzip on
the compressed EDF, we obtain an EDF that is identical to the original one
(for the same record size). For example, if test/my-t30.edf was the
original EDF:
cat test/my-t30.edf.gz | gunzip > test/my-t30-v2.edf
diff -q test/my-t30.edf test/my-t30-v2.edf
Use-case #2, sleep hdEEG
For the second example we consider a significantly larger, different type of EDF file: a hdEEG sleep dataset more than 70 times the size of the previous PSG file.
- ~9 hours with 63 signals
- 4 Gb file
- EEG channels sampled at 1000 Hz
- original EDF record size of 1 second
Using the same approach as above, here are the file sizes (in bytes)
and the compression ratios, as a function of record size (RS):
EDF EDF.GZ RATIO
RS = 1 4022440384 2568741043 64%
RS = 30 4021936384 2532976251 63%
These EEG files tend not to compress quite as well as the PSG files: after all, the brain is a more complex organ and so EEG has a higher information content. Nonetheless, if we are saving around 1.45 Gb per EDF, in a project with 100s of recordings, these differences will quickly become non-trivial.
In terms of speed, naturally everything takes longer (this is a much larger dataset, to be fair):
EDF EDF.GZ
RS = 1 168 201
RS = 30 140 151
We see a similar pattern to above, however, in that larger EDF record sizes increase speed overall, as well as making the speed difference between EDF and gzipped EDF files relatively trivial.
Conclusion
Overall, the combination of large (i.e. epoch-length) record sizes and
gzip compression appears to result in a favourable combination of performance in
terms of both space and time. If working with large datasets, or
making copies of original EDFs that are used by Luna only, you may
want to consider the edfz option of WRITE and RECORD-SIZE.
Naturally, there may be other considerations that impact
performance. The potential drawbacks are a) using larger record size
currently precludes the ability to look at epochs smaller than the
record size, b) if using other tools on the same set of EDFs that do
not support on-the-fly compression, there is no point in compressing
EDFs (with the caveat that it is easy to decompress with gunzip,
e.g. if you only want to occasionally use another tool to look at one
or two EDFs).
Footnote
We have not performed any type of exhaustive test, but purely in terms of time to load an EDF, Luna seems to compare favorably to the couple of other tools we've used. Taking the default (EDF, 1-second record size) hdEEG dataset above:
- edfReader R package: 8 minutes
- EEGLAB Matlab toolbox: 8 minutes 30 seconds
- lunaC: 2 minutes 48 seconds
- lunaR: 2 minutes 45 seconds
Although not formally checked, EEGLAB (using the Biosig toolbox) also appeared to use more than double the amount of RAM compared to Luna and edfReader (peaking at ~45 GB).
Context for comparisons
These other tools likely have options (e.g. memory mapping files in EEGLAB) that might greatly enhance performance: we have not investigated any of those options. Naturally, EEGLAB is a wonderfully powerful tool that is quite possibly doing a number of other checks, etc, behind the scenes, not to mention the fact that it encompasses a whole suite of EEG-based methods that are not even part of Luna, which has a very different and focussed use-case. The point of these comparisons is merely to note that Luna's baseline performance is at least comparable with other tools that are designed to analyse EDF files, which can be useful if wanting to perform standard types of analyses on large numbers of EDFs.