Sample Sets

This example demonstrates how to define an entire sample set, which automatically creates an MSfile interface for each data file. It uses the MSfileSet and SampleMetadata from the previous example to create a SampleSet. The last section shows how to save all the SampleRun instances and the SampleMetadata from the SampleSet as new msAIr files (one for each SampleRun) and a single msAIm file for the SampleMetadata. The example concludes by creating the SampleSet again, this time by loading from the msAI data files. The advantages of this format are explained in that section.

Creating a sample set

Create the MSfileSet instance.

>>> mzml_dir = "./examples/data/mzML"
>>> ms_files = msData.MSfileSet(mzml_dir)

Create the SampleMetadata instance.

>>> csv_path = "./examples/data/metadata/coneflower_metadata.csv"
>>> cone_flower_metadata = SampleMetadata(csv_path)
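
As a rough illustration of what a metadata table keyed by filename looks like, the sketch below reads a small CSV into a dictionary using only the standard library. It is not how SampleMetadata is implemented; the `read_metadata` helper and the column layout of the CSV text are assumptions, with values taken from this example's sample table.

```python
import csv
import io

# Hypothetical CSV content. The values come from this example's sample table,
# but the column layout of the real metadata file is an assumption.
CSV_TEXT = """filename,site,treatment,plantID,tissue
EP0482,Rosemount,HIGH,P360,seed
EP2421,Rosemount,R1,P109,flower
EP2536,Rosemount,LOW,P134,root
"""


def read_metadata(handle, key="filename"):
    """Return {filename: {label: value}} from an open CSV file."""
    table = {}
    for row in csv.DictReader(handle):
        name = row.pop(key)  # the filename becomes the row key
        table[name] = row
    return table


metadata = read_metadata(io.StringIO(CSV_TEXT))
print(metadata["EP2421"]["tissue"])
```

Keying each row by filename is what later allows metadata to be matched to the MS files of the same name.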

Create the SampleSet. A set can be constructed from any MSfileSet along with zero or more SampleMetadata instances. Upon creation, a SampleRun instance is created for each MS file, but its MS data is not initialized until requested. This allows a cheap view of the entire set to exist without importing all the data into memory.

>>> sample_set = SampleSet(ms_files, cone_flower_metadata)
>>> sample_set
         file_type  file_size                            path   class sampleType       site block treatment plantID  tissue     siteblock   sitetreatment polarity                                                run
filename
EP0482        mzML  12.862821  examples/data/mzML/EP0482.mzML  sample     sample  Rosemount    B1      HIGH    P360    seed  Rosemount_B1  Rosemount_HIGH  unknown  <msAI.samples.SampleRun object at 0x7f063ff54f50>
EP2421        mzML  15.133800  examples/data/mzML/EP2421.mzML  sample     sample  Rosemount    B1        R1    P109  flower  Rosemount_B1    Rosemount_R1  unknown  <msAI.samples.SampleRun object at 0x7f063fed80d0>
EP2536        mzML  12.745723  examples/data/mzML/EP2536.mzML  sample     sample  Rosemount    B1       LOW    P134    root  Rosemount_B1   Rosemount_LOW  unknown  <msAI.samples.SampleRun object at 0x7f063ff35550>
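
The combined table above has the file columns followed by the metadata columns, matched on filename. A minimal sketch of that kind of join, assuming plain dictionaries in place of the library's dataframes, might look like the following. The `build_sample_table` function is hypothetical and only illustrates the "zero or more metadata tables" idea; it is not the SampleSet implementation.

```python
def build_sample_table(files, *metadata_tables):
    """Combine file info with zero or more metadata tables, matched on filename."""
    table = {}
    for name, info in files.items():
        row = dict(info)  # copy so the input table is not modified
        for metadata in metadata_tables:
            # Files without an entry in a metadata table keep only their file columns.
            row.update(metadata.get(name, {}))
        table[name] = row
    return table


files = {
    "EP0482": {"file_type": "mzML", "path": "examples/data/mzML/EP0482.mzML"},
    "EP2421": {"file_type": "mzML", "path": "examples/data/mzML/EP2421.mzML"},
}
metadata = {
    "EP0482": {"tissue": "seed", "treatment": "HIGH"},
    "EP2421": {"tissue": "flower", "treatment": "R1"},
}

without_metadata = build_sample_table(files)          # zero metadata tables
with_metadata = build_sample_table(files, metadata)   # one metadata table
print(with_metadata["EP2421"])
```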

Accessing sample MS data and metadata

Get a single sample with filename.

>>> sample_set.df.loc["EP2421"]
file_type                                                     mzML
file_size                                                  15.1338
path                                examples/data/mzML/EP2421.mzML
class                                                       sample
sampleType                                                  sample
site                                                     Rosemount
block                                                           B1
treatment                                                       R1
plantID                                                       P109
tissue                                                      flower
siteblock                                             Rosemount_B1
sitetreatment                                         Rosemount_R1
polarity                                                   unknown
run              <msAI.samples.SampleRun object at 0x7f063fed80d0>
Name: EP2421, dtype: object

Get metadata values with label names.

>>> sample_set.df.loc["EP2421"].plantID
'P109'
>>> sample_set.df.loc["EP2421"].tissue
'flower'
>>> sample_set.df.loc["EP2421"].site
'Rosemount'
>>> sample_set.df.loc["EP2421"].treatment
'R1'

Note that a SampleRun has been created for the sample.

>>> sample_set.df.loc["EP2421"].run
<msAI.samples.SampleRun object at 0x7f063fed80d0>

However, its MS data is not available until it has been initialized.

>>> sample_set.df.loc["EP2421"].run.ms.spectra
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'spectra'
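
The traceback above is what one would expect if the run's `ms` attribute is simply `None` until the data is loaded. The sketch below shows that deferred-initialization pattern in isolation. `LazyRun` and `init_ms` are hypothetical names chosen for illustration, and the placeholder data stands in for the real parsing step.

```python
from types import SimpleNamespace


class LazyRun:
    """Hypothetical stand-in for a sample run whose data loads on demand."""

    def __init__(self, path):
        self.path = path
        self.ms = None  # nothing is parsed at construction time

    def init_ms(self):
        # Placeholder for the expensive step of parsing the data file.
        self.ms = SimpleNamespace(source=self.path, spectra=[])


run = LazyRun("examples/data/mzML/EP2421.mzML")

try:
    run.ms.spectra  # ms is still None, so attribute access fails
except AttributeError as err:
    message = str(err)

run.init_ms()
print(message)
print(run.ms.spectra)
```

Creating the object stays cheap because the costly work happens only inside the explicit initialization call.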

Initialize all MS data.

>>> sample_set.init_all_ms()

Access MS data and metadata.

>>> sample_set.df.loc["EP2421"].run.ms.run_date
'2017-06-28T04:10:21Z'
>>> sample_set.df.loc["EP2421"].run.ms.spectra
             rt  peak_count          tic  ms_lvl                                    filters
299    3.018841        1745   46977344.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
301    3.039366        1836   48066048.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
303    3.060012        2060   47754260.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
305    3.080646        1828   46855808.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
307    3.101156        1847   48759696.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
         ...         ...          ...     ...                                        ...
1591  15.918533        3416  118047380.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1593  15.938479        3328  128021860.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1595  15.958450        3348  128402500.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1597  15.978360        3156  152132620.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1599  15.998312        3285  174533700.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
[651 rows x 5 columns]
>>> sample_set.df.loc["EP2421"].run.ms.peaks
                            rt         mz             i
spec_id peak_number
299     0             3.018841  115.03919  36447.125000
        1             3.018841  115.05045   2975.487549
        2             3.018841  115.07568   2015.634644
        3             3.018841  115.51699   1233.632690
        4             3.018841  115.96244   4875.453613
                        ...        ...           ...
1599    3280         15.998312  987.60944  12299.823242
        3281         15.998312  989.54504  39011.988281
        3282         15.998312  991.56219  57488.519531
        3283         15.998312  992.56891  21931.212891
        3284         15.998312  993.56921   7275.180176
[1430013 rows x 3 columns]

Saving and loading sample sets

So far in this workflow, the step requiring the most time and computational resources is initializing the MS data, where data stored in mzML files is loaded into memory and structured as dataframes. When working with large data sets, this step becomes expensive to repeat.

If SampleRun data will be needed again, it can be saved in an alternative format (an msAIr file) that enables faster access and a smaller storage size. This file type is created by serializing and compressing a SampleRun instance, saving the state of all its in-memory data attributes. While there is an upfront cost to creating an msAIr save, future SampleRun instantiations from an msAIr file will be much faster because the mzML file does not need to be parsed again. Additionally, since the entire SampleRun instance is saved, the results of any calculations performed and any new data attributes created will also persist.
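
The general idea of serializing and compressing an in-memory object can be sketched with the standard library. The example below uses `pickle` with `gzip` as stand-ins; the serializer and compressor actually used for msAIr files are not stated here, so treat these choices, the `save_run` and `load_run` helpers, and the `.bin` filename as assumptions.

```python
import gzip
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_run(run, path):
    """Serialize an object and write it as a gzip-compressed file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(run, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_run(path):
    """Read back an object written by save_run."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a run holding parsed data (one peak from this example).
run = SimpleNamespace(name="EP2421", peaks=[(3.018841, 115.03919, 36447.125)])
run.peak_total = len(run.peaks)  # an attribute added after creation

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "EP2421.bin"
    save_run(run, path)
    restored = load_run(path)

print(restored.peak_total)
```

Because the whole object is written, the attribute added after creation comes back with it. Note that unpickling executes code from the file, so this approach is only appropriate for files from a trusted source.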

Saving

Define the paths to the directories where data will be saved.

>>> msAIr_dir = "./examples/data/msAIr"
>>> msAIm_dir = "./examples/data/msAIm"

Save all the samples in the SampleSet as msAIr files to a directory. The same filenames are used with the .msAIr extension.

>>> sample_set.save_all_ms(msAIr_dir)

A sha256 hash value is calculated for each sample and added to the SampleSet metadata.

>>> sample_set.df['msAIr_hash']
filename
EP0482    67a004385a71045b787c5cdc318d78fee3d890bf287473...
EP2421    fcf4c386c7051b6c5228faa120575a492eddfebf2b9914...
EP2536    b82ef4ddeaab36d5c9d68e2e0e192b1731fc5674430e10...
Name: msAIr_hash, dtype: object
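
For reference, a sha256 digest of a file can be computed with `hashlib` as sketched below. Whether msAI hashes the saved file bytes or some other representation of the sample is not stated here, so the `sha256_of_file` helper is an assumption about the general approach, not the library's method. The second half shows how a recorded digest could be used to detect a changed file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.bin"
    path.write_bytes(b"abc")
    recorded = sha256_of_file(path)

    unchanged = sha256_of_file(path) == recorded  # same bytes, same digest
    path.write_bytes(b"abd")
    changed = sha256_of_file(path) != recorded    # one byte differs

print(recorded)
print(unchanged, changed)
```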

Save the SampleSet metadata as an msAIm file in a directory. A sha256 hash is returned.

>>> sample_set.save_metadata(msAIm_dir, "sample_set1")
'dc0714b6fe0d05e10ef902bbb45f40d79ff50a87528c305c1f8161e0a15aeb6a'

Loading

Use the same path to the directory where the msAIr files were saved previously.

>>> msAIr_dir = "./examples/data/msAIr"

Create an MSfileSet from the msAIr files. New mzML files can also be added and used in the same way.

>>> msAIr_set = msData.MSfileSet(msAIr_dir)
>>> msAIr_set
         file_type  file_size                              path
filename
EP0482       msAIr   7.870908  examples/data/msAIr/EP0482.msAIr
EP2421       msAIr   9.659162  examples/data/msAIr/EP2421.msAIr
EP2536       msAIr   7.881509  examples/data/msAIr/EP2536.msAIr
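
A table like the one above can be thought of as a directory scan that records each file's type, size, and path. The sketch below is a standard-library approximation, not the MSfileSet implementation: the `scan_directory` helper is hypothetical, the size unit (megabytes) is an assumption, and `EP9999.mzML` is a made-up file used only to show mixed file types in one directory.

```python
import tempfile
from pathlib import Path


def scan_directory(directory, extensions=(".mzML", ".msAIr")):
    """Return {filename stem: file info} for files with a recognized extension."""
    table = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix in extensions:
            table[path.stem] = {
                "file_type": path.suffix.lstrip("."),
                "file_size": path.stat().st_size / 1e6,  # megabytes (assumed unit)
                "path": str(path),
            }
    return table


with tempfile.TemporaryDirectory() as tmp:
    # Dummy files standing in for real data files.
    (Path(tmp) / "EP0482.msAIr").write_bytes(b"\0" * 2_000_000)
    (Path(tmp) / "EP9999.mzML").write_bytes(b"\0" * 1_000_000)
    (Path(tmp) / "notes.txt").write_text("ignored")
    file_table = scan_directory(tmp)

for name, info in file_table.items():
    print(name, info["file_type"], info["file_size"])
```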

Compare this set to the original mzML version created above - note the smaller sizes of the msAIr files.

>>> ms_files
         file_type  file_size                            path
filename
EP0482        mzML  12.862821  examples/data/mzML/EP0482.mzML
EP2421        mzML  15.133800  examples/data/mzML/EP2421.mzML
EP2536        mzML  12.745723  examples/data/mzML/EP2536.mzML
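
To put a number on the difference, the file_size values printed in the two tables can be compared directly. The short calculation below uses only those values; for these three samples, each msAIr file comes out at roughly 61 to 64 percent of the size of its mzML source.

```python
# file_size values copied from the two tables in this example.
mzml_sizes = {"EP0482": 12.862821, "EP2421": 15.133800, "EP2536": 12.745723}
msair_sizes = {"EP0482": 7.870908, "EP2421": 9.659162, "EP2536": 7.881509}

# Ratio of the msAIr size to the mzML size for each sample.
ratios = {name: msair_sizes[name] / mzml_sizes[name] for name in mzml_sizes}

for name, ratio in ratios.items():
    print(f"{name}: msAIr is {ratio:.0%} of the mzML size")
```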

Define the path to the msAIm file created above.

>>> sample_set1_msAIm_path = "./examples/data/msAIm/sample_set1.msAIm"

Load the SampleMetadata from the msAIm file - notice the msAIr_hash column has been added.

>>> msAIm = SampleMetadata(sample_set1_msAIm_path)
>>> msAIm
           class sampleType       site block treatment plantID  tissue     siteblock   sitetreatment polarity                                         msAIr_hash
filename
EP0482    sample     sample  Rosemount    B1      HIGH    P360    seed  Rosemount_B1  Rosemount_HIGH  unknown  67a004385a71045b787c5cdc318d78fee3d890bf287473...
EP2421    sample     sample  Rosemount    B1        R1    P109  flower  Rosemount_B1    Rosemount_R1  unknown  fcf4c386c7051b6c5228faa120575a492eddfebf2b9914...
EP2536    sample     sample  Rosemount    B1       LOW    P134    root  Rosemount_B1   Rosemount_LOW  unknown  b82ef4ddeaab36d5c9d68e2e0e192b1731fc5674430e10...

Create the SampleSet from the msAIr files and the loaded metadata, then initialize the MS data.

>>> sample_set1 = SampleSet(msAIr_set, msAIm)
>>> sample_set1.init_all_ms()
>>> sample_set1
         file_type  file_size                              path   class sampleType       site block treatment plantID  tissue     siteblock   sitetreatment polarity                                         msAIr_hash                                                run
filename
EP0482       msAIr   7.870908  examples/data/msAIr/EP0482.msAIr  sample     sample  Rosemount    B1      HIGH    P360    seed  Rosemount_B1  Rosemount_HIGH  unknown  67a004385a71045b787c5cdc318d78fee3d890bf287473...  <msAI.samples.SampleRun object at 0x7fda7adb02d0>
EP2421       msAIr   9.659162  examples/data/msAIr/EP2421.msAIr  sample     sample  Rosemount    B1        R1    P109  flower  Rosemount_B1    Rosemount_R1  unknown  fcf4c386c7051b6c5228faa120575a492eddfebf2b9914...  <msAI.samples.SampleRun object at 0x7fda6cf5b0d0>
EP2536       msAIr   7.881509  examples/data/msAIr/EP2536.msAIr  sample     sample  Rosemount    B1       LOW    P134    root  Rosemount_B1   Rosemount_LOW  unknown  b82ef4ddeaab36d5c9d68e2e0e192b1731fc5674430e10...  <msAI.samples.SampleRun object at 0x7fda7adb0750>

Access MS data and metadata in the same way as before.

>>> sample_set1.df.loc["EP2421"]
file_type                                                    msAIr
file_size                                                  9.65916
path                              examples/data/msAIr/EP2421.msAIr
class                                                       sample
sampleType                                                  sample
site                                                     Rosemount
block                                                           B1
treatment                                                       R1
plantID                                                       P109
tissue                                                      flower
siteblock                                             Rosemount_B1
sitetreatment                                         Rosemount_R1
polarity                                                   unknown
msAIr_hash       fcf4c386c7051b6c5228faa120575a492eddfebf2b9914...
run              <msAI.samples.SampleRun object at 0x7fda6cf5b0d0>
Name: EP2421, dtype: object
>>> sample_set1.df.loc["EP2421"].plantID
'P109'
>>> sample_set1.df.loc["EP2421"].tissue
'flower'
>>> sample_set1.df.loc["EP2421"].site
'Rosemount'
>>> sample_set1.df.loc["EP2421"].treatment
'R1'
>>> sample_set1.df.loc["EP2421"].run.ms.run_date
'2017-06-28T04:10:21Z'
>>> sample_set1.df.loc["EP2421"].run.ms.spectra
             rt  peak_count          tic  ms_lvl                                    filters
299    3.018841        1745   46977344.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
301    3.039366        1836   48066048.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
303    3.060012        2060   47754260.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
305    3.080646        1828   46855808.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
307    3.101156        1847   48759696.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
         ...         ...          ...     ...                                        ...
1591  15.918533        3416  118047380.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1593  15.938479        3328  128021860.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1595  15.958450        3348  128402500.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1597  15.978360        3156  152132620.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
1599  15.998312        3285  174533700.0       1  FTMS + p ESI Full ms [115.0000-1000.0000]
[651 rows x 5 columns]
>>> sample_set1.df.loc["EP2421"].run.ms.peaks
                            rt         mz             i
spec_id peak_number
299     0             3.018841  115.03919  36447.125000
        1             3.018841  115.05045   2975.487549
        2             3.018841  115.07568   2015.634644
        3             3.018841  115.51699   1233.632690
        4             3.018841  115.96244   4875.453613
                        ...        ...           ...
1599    3280         15.998312  987.60944  12299.823242
        3281         15.998312  989.54504  39011.988281
        3282         15.998312  991.56219  57488.519531
        3283         15.998312  992.56891  21931.212891
        3284         15.998312  993.56921   7275.180176
[1430013 rows x 3 columns]