miscUtils

Miscellaneous utilities used by msAI.

Todo
  • Add type info for funcs passed as arguments

msAI.miscUtils.logger = <Logger msAI.miscUtils (DEBUG)>

Module logger.

class msAI.miscUtils.FileGrabber[source]

Bases: object

Functions to grab files.

static multi_extensions(directory: str, *extensions: str, recursive: bool = True) → Iterable[pathlib.Path][source]

Creates an iterator of Path objects for all files in a directory that match the given extensions.

Use str(path_obj) to get the path as a string in the native form for the current platform. Subdirectories are searched recursively by default.

Parameters
  • directory – A string representation of the path to the directory. Path can be relative or absolute.

  • extensions – One or more file extensions, given as strings without the leading dot (for example, 'txt' rather than '.txt').

  • recursive – A boolean indicating if files in subdirectories are included. Defaults to True.

Returns

An iterator of path objects to all files found.
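As an illustration of the kind of search described above, here is a minimal sketch built on pathlib. It is not msAI's implementation: the helper name find_by_extensions is hypothetical, and matching simply follows the platform's glob rules without any special case handling.

```python
from pathlib import Path
from typing import Iterator


def find_by_extensions(directory: str, *extensions: str,
                       recursive: bool = True) -> Iterator[Path]:
    """Yield files in `directory` whose name ends in one of `extensions`.

    Hypothetical helper: extensions are given without the leading dot.
    """
    root = Path(directory)
    # rglob descends into subdirectories; glob stays at the top level.
    search = root.rglob if recursive else root.glob
    for ext in extensions:
        for path in search(f"*.{ext}"):
            if path.is_file():
                yield path
```

Because the function is a generator, nothing is read from disk until the result is iterated, for example with `for path in find_by_extensions("data", "txt", "csv"): print(str(path))`.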

static path_type(directory: str = '.') → str[source]

Get the path type of a directory.

The path type is identified by the class of the Path object created. This test determines which glob patterns to apply, based on path case sensitivity: Windows paths are case insensitive, while POSIX paths are case sensitive.

Parameters

directory – A string representation of the path to the directory. Path can be relative or absolute. Defaults to current directory.

Returns

A string of either 'posix' or 'windows', indicating the path type.

Raises

MiscUtilsError – For unknown path type.
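The class-based test described above can be sketched as follows. This is an assumption about the approach, not the library's code: the function name path_flavour is hypothetical, and a plain ValueError stands in for MiscUtilsError.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def path_flavour(directory: str = ".") -> str:
    """Return 'posix' or 'windows' based on the concrete Path class."""
    # Path() instantiates PosixPath or WindowsPath for the running platform.
    path = Path(directory)
    if isinstance(path, PurePosixPath):
        return "posix"
    if isinstance(path, PureWindowsPath):
        return "windows"
    # Stand-in for MiscUtilsError in this sketch.
    raise ValueError(f"Unknown path type: {type(path).__name__}")
```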

class msAI.miscUtils.Sizer[source]

Bases: object

Functions to measure memory / storage sizes.

static obj_mb(obj: object) → float[source]

Measures the memory size of a Python object in MB.

Parameters

obj – The python object to measure.

Returns

The Python object’s size in memory in MBs.
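The documentation does not say how the size is measured. A minimal sketch, assuming a shallow measurement with sys.getsizeof and binary megabytes (1024 * 1024 bytes), could look like this; the name shallow_size_mb is hypothetical.

```python
import sys

BYTES_PER_MB = 1024 ** 2  # assumption: binary megabytes


def shallow_size_mb(obj: object) -> float:
    """Return the shallow in-memory size of `obj` in MB.

    sys.getsizeof counts only the object itself, not the objects
    it refers to, so containers are under-reported.
    """
    return sys.getsizeof(obj) / BYTES_PER_MB
```

A deep measurement would need to walk the object's references, which this sketch deliberately leaves out.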

static print_obj_mb(obj: object)[source]

Prints the memory size of a Python object in MB, to 4 decimal places.

Parameters

obj – The python object to measure.

static file_mb(file: str)[source]

Measures the storage size of a file in MBs.

Parameters

file – A string representation of the path to the file to measure. Path can be relative or absolute.

Returns

The storage size of the file in MBs.
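For comparison, a file's size can be read from the file system without opening the file. The sketch below assumes binary megabytes and uses hypothetical names (file_size_mb, format_mb); it is not taken from msAI.

```python
import os

BYTES_PER_MB = 1024 ** 2  # assumption: binary megabytes


def file_size_mb(file: str) -> float:
    """Return the on-disk size of `file` in MB."""
    return os.path.getsize(file) / BYTES_PER_MB


def format_mb(size_mb: float) -> str:
    """Format a size in MB to 4 decimal places."""
    return f"{size_mb:.4f} MB"
```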

static print_file_mb(file: str)[source]

Prints the storage size of a file in MB, to 4 decimal places.

Parameters

file – A string representation of the path to the file to measure. Path can be relative or absolute.

class msAI.miscUtils.Saver[source]

Bases: object

Functions to save / load, serialize, and compress files and objects.

static save_obj(obj: object, file: str) → str[source]

Saves a Python object to the given path and filename.

Data is serialized with pickle and compressed via bzip2. A sha256 hash is also calculated.

Parameters
  • obj – The python object to save.

  • file – A string representation of the path to the file to save. Path can be relative or absolute.

Returns

A sha256 hash as a string.

static get_hash(file: str) → str[source]

Calculates the sha256 hash of a file.

Parameters

file – A string representation of the path to the file to calculate a hash for. Path can be relative or absolute.

Returns

A sha256 hash as a string.

static verify_hash(file: str, test_hash: str) → bool[source]

Verifies the sha256 hash of a file.

Parameters
  • file – A string representation of the path to the file to calculate and compare hash value for. Path can be relative or absolute.

  • test_hash – A sha256 hash as a string to test against.

Returns

A boolean indicating if the hash value is verified. True means the calculated hash matches the test hash.
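Hashing and verification of this kind can be sketched with hashlib. The names sha256_of_file and matches_hash are hypothetical, and reading in chunks is a design choice of this sketch so that large files do not need to fit in memory.

```python
import hashlib


def sha256_of_file(file: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(file, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_hash(file: str, test_hash: str) -> bool:
    """Return True if the file's digest equals `test_hash`."""
    # hexdigest() is lowercase, so normalize the expected value.
    return sha256_of_file(file) == test_hash.lower()
```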

static load_obj(file: str, test_hash: Optional[str] = None) → Tuple[object, Optional[bool]][source]

Loads a previously saved object.

The file will be tested against a sha256 hash, if provided. Data is decompressed via bzip2 and deserialized with pickle.

Parameters
  • file – A string representation of the path to the file to load the object from. Path can be relative or absolute.

  • test_hash – A sha256 hash as a string to test against.

Returns

A tuple of the object and an optional boolean indicating if the hash of the saved file was verified.
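The save and load steps described in this class can be sketched as a round trip with pickle, bz2, and hashlib. Several details are assumptions of this sketch rather than documented behavior: the hash is taken over the compressed bytes written to disk, the verification flag is None when no hash is supplied, and the object is still loaded when the hash does not match. The function names are hypothetical.

```python
import bz2
import hashlib
import pickle
from typing import Optional, Tuple


def save_compressed(obj: object, file: str) -> str:
    """Pickle and bzip2-compress `obj` into `file`; return a SHA-256 hash."""
    payload = bz2.compress(pickle.dumps(obj))
    with open(file, "wb") as fh:
        fh.write(payload)
    # Assumption: the hash covers the bytes as stored on disk.
    return hashlib.sha256(payload).hexdigest()


def load_compressed(file: str,
                    test_hash: Optional[str] = None
                    ) -> Tuple[object, Optional[bool]]:
    """Load an object saved by save_compressed, optionally checking a hash."""
    with open(file, "rb") as fh:
        payload = fh.read()
    verified = None  # stays None when no hash was supplied
    if test_hash is not None:
        verified = hashlib.sha256(payload).hexdigest() == test_hash
    # Only unpickle data from sources you trust.
    return pickle.loads(bz2.decompress(payload)), verified
```

A caller would typically keep the returned hash alongside the file, pass it back on load, and treat a False flag as a reason to discard the result.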

class msAI.miscUtils.MultiTaskDF[source]

Bases: object

Functions to parallelize work on dataframes through multiprocessing.

static _partition_by_rows(df_in: pandas.core.frame.DataFrame, subset_func) → pandas.core.frame.DataFrame[source]

Partitions a dataframe into subsets across rows and assigns a worker to each to apply a function.

Creates a process pool whose worker count defaults to the CPU count, and splits the dataframe df_in into as many subsets as there are workers. Each worker applies subset_func to one dataframe subset in parallel.

Parameters
  • df_in – The input dataframe.

  • subset_func – A partial object wrapping the function to apply to each dataframe subset. Its remaining call argument, the dataframe subset, is supplied after the dataframe is split.

Returns

A dataframe formed by concatenating all subset results.

static _run_on_subset_rows(func, df_subset: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Applies a function to each row in a dataframe subset.

Rows are passed to func as Series objects whose index is the dataframe’s columns.

Parameters
  • func – The function to apply to each row in the df_subset. This function must be a static method and must return the row with the results applied. The caller can pass additional arguments by wrapping the function in a partial object.

  • df_subset – A dataframe subset, to which a single worker applies func to all rows.

Returns

A dataframe reflecting the changes from the applied func.

static parallelize_on_rows(df: pandas.core.frame.DataFrame, func) → pandas.core.frame.DataFrame[source]

Applies a function to rows in a dataframe in parallel.

Parameters
  • df – The input dataframe.

  • func – The function to apply to each row in the df. This function must be a static method and must return the row with the results applied. The caller can pass additional arguments by wrapping the function in a partial object.

Returns

A new dataframe reflecting the changes from the applied func.
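The split, partial, and map pattern used by this class can be sketched without pandas. To keep the example dependency-free and runnable anywhere, it swaps in plain lists of dicts for dataframes and multiprocessing.pool.ThreadPool for a process pool; neither substitution reflects msAI's code, and all names here are hypothetical. With a real process pool, the functions would need to be picklable and defined at module level.

```python
from functools import partial
from multiprocessing.pool import ThreadPool
from typing import Callable, Dict, List

Row = Dict[str, float]


def run_on_subset(func: Callable[[Row], Row], subset: List[Row]) -> List[Row]:
    """Apply `func` to every row of one subset."""
    return [func(row) for row in subset]


def split_rows(rows: List[Row], n_parts: int) -> List[List[Row]]:
    """Split rows into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallelize_rows(rows: List[Row], func: Callable[[Row], Row],
                     workers: int = 4) -> List[Row]:
    """Apply `func` to all rows, one subset per worker, preserving order."""
    # The partial fixes `func`; the pool supplies each subset later.
    subset_func = partial(run_on_subset, func)
    with ThreadPool(workers) as pool:
        results = pool.map(subset_func, split_rows(rows, workers))
    # Flatten the per-subset results back into one list.
    return [row for subset in results for row in subset]


def add_total(row: Row, scale: float = 1.0) -> Row:
    """Example row function: returns the row with a new 'total' field."""
    return {**row, "total": (row["a"] + row["b"]) * scale}
```

Extra arguments reach the row function the same way the documentation describes, by wrapping it first: `parallelize_rows(rows, partial(add_total, scale=2.0))`.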

class msAI.miscUtils.EnvInfo[source]

Bases: object

Functions to get info about the environment running Python.

static platform() → str[source]

Get a string (multiline) describing the platform in use.

static os() → str[source]

Get a string (multiline) describing the operating system in use.

static python() → str[source]

Get a string (multiline) describing the Python interpreter in use.

static all() → str[source]

Get a string (multiline) describing the environment running Python.
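A multiline description of this kind can be assembled from the standard platform and sys modules. The sketch below is only an illustration: the function name describe_environment and the choice of fields are assumptions, not the output format of EnvInfo.

```python
import platform
import sys


def describe_environment() -> str:
    """Return a multiline summary of the platform, OS, and interpreter."""
    lines = [
        f"Platform: {platform.platform()}",
        f"System: {platform.system()} {platform.release()}",
        f"Machine: {platform.machine()}",
        f"Python: {platform.python_implementation()} {platform.python_version()}",
        f"Executable: {sys.executable}",
    ]
    return "\n".join(lines)
```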

static mp_method() → str[source]

Get a string describing the start method used by the multiprocessing module to create new processes.

Defaults depend on the OS type and the Python version:
POSIX = ‘fork’ (macOS has defaulted to ‘spawn’ since Python 3.8)
Windows = ‘spawn’

Use this function to check the start method and fall back to single processing if necessary, since certain functions will fail under the spawn start method.
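A check of this kind can be sketched with multiprocessing.get_start_method. The helper names and the rule of falling back to a single worker for any non-fork method are assumptions made for illustration, not msAI's behavior.

```python
import multiprocessing as mp


def start_method() -> str:
    """Return the multiprocessing start method in effect.

    Note: if no method has been set yet, this call fixes it
    to the platform default as a side effect.
    """
    return mp.get_start_method()


def choose_workers(requested: int) -> int:
    """Use the requested worker count under 'fork', otherwise one worker."""
    # Assumption of this sketch: only 'fork' is treated as safe.
    return requested if start_method() == "fork" else 1
```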