Basic Usage
This example demonstrates how to import and access data from MS and metadata files.
Individual MSfile instances can be created manually for each sample run, but
it is typically best to define an MSfileSet and a SampleMetadata instance as shown below
and use these instances to create a SampleSet, as demonstrated in the next example.
Working with mass spectrometry data files
Accessing data from an MS sample run
Define the path to an MS data file, in this case an mzML file.
>>> sample1_mzml_path = "./examples/data/mzML/EP2421.mzML"
MS data is accessed using the MSfile implementation that matches the file type.
MZMLfile is used with mzML data. On creation, the mzML file is imported into memory.
>>> sample1_ms = msData.MZMLfile(sample1_mzml_path)
The MSfile interface provides several properties for accessing MS metadata.
>>> sample1_ms.run_id
'EP2421'
>>> sample1_ms.run_date
'2017-06-28T04:10:21Z'
>>> sample1_ms.ms_file_version
'1.1.0'
>>> sample1_ms.spectrum_count
651
>>> sample1_ms.peak_count
1430013
>>> sample1_ms.tic_sum
103151911964.0
MS data is structured in dataframes and
accessed by the spectra and peaks properties.
- Spectra dataframe structure
  - Index: spec_id
  - Columns: rt, peak_count, tic, ms_lvl, filters
>>> sample1_ms.spectra
rt peak_count tic ms_lvl filters
299 3.018841 1745 46977344.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
301 3.039366 1836 48066048.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
303 3.060012 2060 47754260.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
305 3.080646 1828 46855808.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
307 3.101156 1847 48759696.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
... ... ... ... ...
1591 15.918533 3416 118047380.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
1593 15.938479 3328 128021860.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
1595 15.958450 3348 128402500.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
1597 15.978360 3156 152132620.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
1599 15.998312 3285 174533700.0 1 FTMS + p ESI Full ms [115.0000-1000.0000]
[651 rows x 5 columns]
- Peaks dataframe structure
  - First Index Level: spec_id
  - Second Index Level: peak_number
  - Columns: rt, mz, i
>>> sample1_ms.peaks
rt mz i
spec_id peak_number
299 0 3.018841 115.03919 36447.125000
1 3.018841 115.05045 2975.487549
2 3.018841 115.07568 2015.634644
3 3.018841 115.51699 1233.632690
4 3.018841 115.96244 4875.453613
... ... ...
1599 3280 15.998312 987.60944 12299.823242
3281 15.998312 989.54504 39011.988281
3282 15.998312 991.56219 57488.519531
3283 15.998312 992.56891 21931.212891
3284 15.998312 993.56921 7275.180176
[1430013 rows x 3 columns]
Get an individual spectrum by its spec_id value.
>>> sample1_ms.spectra.loc[303]
rt 3.06001
peak_count 2060
tic 4.77543e+07
ms_lvl 1
filters FTMS + p ESI Full ms [115.0000-1000.0000]
Name: 303, dtype: object
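Assuming the spectra property returns a pandas dataframe, as the output above suggests, the usual pandas selection tools apply to it. The sketch below rebuilds the first five rows shown above as a stand-in for sample1_ms.spectra (the filters column is omitted for brevity) and shows a lookup by label, a retention-time window, and the spectrum with the highest tic. The variable names are illustrative and not part of the library.

```python
import pandas as pd

# Stand-in for sample1_ms.spectra, built from the first rows printed above.
spectra = pd.DataFrame(
    {
        "rt": [3.018841, 3.039366, 3.060012, 3.080646, 3.101156],
        "peak_count": [1745, 1836, 2060, 1828, 1847],
        "tic": [46977344.0, 48066048.0, 47754260.0, 46855808.0, 48759696.0],
        "ms_lvl": [1, 1, 1, 1, 1],
    },
    index=pd.Index([299, 301, 303, 305, 307], name="spec_id"),
)

# Label-based lookup of one spectrum by spec_id returns a Series.
one_spectrum = spectra.loc[303]

# Boolean mask: keep spectra whose rt falls inside a window (inclusive bounds).
rt_window = spectra[spectra["rt"].between(3.03, 3.09)]

# spec_id of the spectrum with the largest tic value.
top_spec_id = spectra["tic"].idxmax()
```

The same expressions should work unchanged on the full dataframe, since they rely only on the index and column names listed above.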
Get a summary / distribution of peak values.
>>> sample1_ms.peaks.describe().round(2)
rt mz i
count 1430013.00 1430013.00 1.430013e+06
mean 9.83 283.99 7.123009e+04
std 3.71 160.93 1.726981e+06
min 3.02 115.00 8.504700e+02
25% 6.79 167.07 5.720040e+03
50% 9.93 229.14 1.181848e+04
75% 12.90 349.25 3.166049e+04
max 16.00 999.95 9.182814e+08
Get all peaks in a spectrum by its spec_id value.
>>> sample1_ms.peaks.loc[303]
rt mz i
peak_number
0 3.060012 115.03925 41569.882812
1 3.060012 115.05054 2562.014648
2 3.060012 115.07562 1966.861328
3 3.060012 115.08680 2180.555420
4 3.060012 115.52079 1273.498047
... ... ...
2055 3.060012 717.65051 2805.519287
2056 3.060012 787.67346 2972.889648
2057 3.060012 896.67566 2859.390381
2058 3.060012 909.33502 3785.186035
2059 3.060012 926.53265 2564.230713
[2060 rows x 3 columns]
Get a single peak with spec_id and peak_number.
>>> sample1_ms.peaks.loc[303, 100]
rt 3.060012
mz 125.060060
i 10957.689453
Name: (303, 100), dtype: float64
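The two-level index is what makes the lookups above work: a single spec_id selects every peak in that spectrum, and a (spec_id, peak_number) pair selects one peak. The sketch below reproduces this on a small pandas frame built from a few of the peak values printed above, and adds a per-spectrum aggregation with groupby. It assumes the peaks property is a pandas dataframe with a MultiIndex; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for sample1_ms.peaks with the same two index levels.
index = pd.MultiIndex.from_tuples(
    [(299, 0), (299, 1), (299, 2), (303, 0), (303, 1)],
    names=["spec_id", "peak_number"],
)
peaks = pd.DataFrame(
    {
        "rt": [3.018841, 3.018841, 3.018841, 3.060012, 3.060012],
        "mz": [115.03919, 115.05045, 115.07568, 115.03925, 115.05054],
        "i": [36447.125, 2975.487549, 2015.634644, 41569.882812, 2562.014648],
    },
    index=index,
)

# First level only: all peaks of one spectrum, indexed by peak_number.
spectrum_peaks = peaks.loc[303]

# Both levels: a single peak, returned as a Series.
single_peak = peaks.loc[(303, 1)]

# Most intense peak of each spectrum (idxmax yields (spec_id, peak_number) tuples).
base_peak_keys = peaks.groupby(level="spec_id")["i"].idxmax().tolist()
base_peaks = peaks.loc[base_peak_keys]

# Summed intensity per spectrum.
intensity_sum = peaks.groupby(level="spec_id")["i"].sum()
```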
Creating a set of MS files from a data directory
Define the data directory path. By default, the contents of subdirectories are included recursively.
>>> mzml_dir = "./examples/data/mzML"
Create a set of the MS files in the data directory.
This set is structured as a dataframe.
Creating an MSfileSet does not import the MS data into memory.
Rather, it provides a quick view of the MS data files available for use.
The next example, Sample Sets, demonstrates how this MS file set is used to create a SampleSet
and access the underlying MS data.
>>> ms_files = msData.MSfileSet(mzml_dir)
>>> ms_files
         file_type  file_size                            path
filename
EP0482        mzML  12.862821  examples/data/mzML/EP0482.mzML
EP2421        mzML  15.133800  examples/data/mzML/EP2421.mzML
EP2536        mzML  12.745723  examples/data/mzML/EP2536.mzML
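To illustrate the idea of listing files without reading their contents, the following sketch builds a similar dataframe with pathlib and pandas. It is not the MSfileSet implementation: list_ms_files is a hypothetical helper, the file_size unit (megabytes) is an assumption, and the demo file names are made up so the block can run without the example data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def list_ms_files(data_dir, suffix=".mzML", recursive=True):
    """Hypothetical helper: index MS files by name without opening them."""
    root = Path(data_dir)
    pattern = f"*{suffix}"
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    rows = [
        {
            "filename": p.stem,                    # file name without extension
            "file_type": p.suffix.lstrip("."),
            "file_size": p.stat().st_size / 1e6,   # assumed unit: megabytes
            "path": str(p),
        }
        for p in paths
    ]
    columns = ["filename", "file_type", "file_size", "path"]
    return pd.DataFrame(rows, columns=columns).set_index("filename").sort_index()


# Demo on a throwaway directory containing one nested file and one non-MS file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "EP0001.mzML").write_bytes(b"x" * 2000)
    (Path(tmp) / "sub" / "EP0002.mzML").write_bytes(b"x" * 1000)
    (Path(tmp) / "notes.txt").write_text("ignored")
    demo = list_ms_files(tmp)
    demo_flat = list_ms_files(tmp, recursive=False)
```

Only file system metadata is touched here, which is why building such a listing stays fast even for large data directories.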
Sample metadata
Additional sample metadata can be imported and associated with MS data.
Define the path to the metadata file.
>>> csv_path = "./examples/data/metadata/coneflower_metadata.csv"
Import metadata by creating a SampleMetadata instance.
At creation, metadata contents are initially imported into a dataframe with a numerical index.
Metadata labels and values are analyzed and a new index is automatically assigned, if possible.
This index will be used by SampleSet to match this metadata with corresponding MS data in MSfileSet.
- Requirements to auto index metadata:
  - Has 1 or more entries/rows
  - Has 2 or more labels/columns
  - For one and only one label/column:
    - All label/column values are unique
    - All entries/rows have a value for this label/column
>>> cone_flower_metadata = SampleMetadata(csv_path)
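The auto index requirements listed above can be expressed as a short check in pandas. The sketch below is one possible reading of those rules, not the SampleMetadata implementation; find_index_label is a hypothetical name, and the small frames reuse a few values from the coneflower metadata for illustration.

```python
import pandas as pd


def find_index_label(df):
    """Return the one column usable as an index, or None if the rules fail."""
    # At least 1 entry/row and at least 2 labels/columns are required.
    if len(df) < 1 or len(df.columns) < 2:
        return None
    # A candidate has a value in every row and no repeated values.
    candidates = [
        col for col in df.columns
        if df[col].notna().all() and df[col].is_unique
    ]
    # Exactly one candidate is allowed; zero or several means no auto index.
    return candidates[0] if len(candidates) == 1 else None


metadata = pd.DataFrame(
    {
        "sampleMetadata": ["EP0045", "EP0046", "EP2848"],
        "site": ["Becker", "Becker", "Becker"],
        "tissue": ["leaf", "leaf", "root"],
    }
)
label = find_index_label(metadata)
indexed = metadata.set_index(label) if label is not None else metadata

# A second fully unique column makes the choice ambiguous.
ambiguous = metadata.assign(plantID=["P031", "P032", "P074"])
ambiguous_label = find_index_label(ambiguous)
```

Under this reading, a file with two fully unique columns would keep its numerical index, because no single column can be chosen unambiguously.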
Access the metadata dataframe with the df attribute.
>>> cone_flower_metadata.df
class sampleType site block treatment plantID tissue siteblock sitetreatment polarity
sampleMetadata
EP0045 sample sample Becker B1 HIGH P031 leaf Becker_B1 Becker_HIGH unknown
EP0046 sample sample Becker B1 HIGH P032 leaf Becker_B1 Becker_HIGH unknown
EP0047 sample sample Becker B1 HIGH P033 leaf Becker_B1 Becker_HIGH unknown
EP0048 sample sample Becker B1 HIGH P034 leaf Becker_B1 Becker_HIGH unknown
EP0049 sample sample Becker B1 HIGH P035 leaf Becker_B1 Becker_HIGH unknown
... ... ... ... ... ... ... ... ... ...
EP2848 sample sample Becker B3 R1 P074 root Becker_B3 Becker_R1 unknown
EP2849 sample sample Becker B3 R1 P075 root Becker_B3 Becker_R1 unknown
EP2850 sample sample Becker B3 R1 P076 root Becker_B3 Becker_R1 unknown
EP2851 sample sample Becker B3 R1 P077 root Becker_B3 Becker_R1 unknown
EP2852 sample sample Becker B3 R1 P078 root Becker_B3 Becker_R1 unknown
[984 rows x 10 columns]
Get a summary of metadata contents.
>>> cone_flower_metadata.describe()
class sampleType site block treatment plantID tissue siteblock sitetreatment polarity
count 984 984 984 984 984 984 984 984 984 984
unique 1 1 2 3 6 365 5 6 12 1
top sample sample Becker B2 LOW P102 flower Becker_B1 Becker_R6 unknown
freq 984 984 510 330 167 4 216 172 87 984
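Because the metadata index and the MS file set index both hold sample run names, matching the two amounts to an index join. The sketch below shows that idea with plain pandas; it is an illustration only and does not reflect how SampleSet performs the match. The file paths come from the listing above, while the tissue values are placeholders chosen for the demo.

```python
import pandas as pd

# Stand-in for the MS file set, indexed by file name.
ms_files = pd.DataFrame(
    {
        "path": [
            "examples/data/mzML/EP0482.mzML",
            "examples/data/mzML/EP2421.mzML",
            "examples/data/mzML/EP2536.mzML",
        ]
    },
    index=pd.Index(["EP0482", "EP2421", "EP2536"], name="filename"),
)

# Stand-in for the metadata; tissue values here are placeholders.
metadata = pd.DataFrame(
    {"tissue": ["leaf", "root", "flower"]},
    index=pd.Index(["EP0045", "EP2421", "EP2536"], name="sampleMetadata"),
)

# Inner join on the index keeps only runs present in both frames.
matched = ms_files.join(metadata, how="inner")

# Files with no metadata entry, and metadata entries with no file.
files_without_metadata = ms_files.index.difference(metadata.index)
metadata_without_files = metadata.index.difference(ms_files.index)
```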