CSV Importer

class pruneabletree.csv_importer.CsvImporter(encoding='utf-8', sep=', ', dtype=None, na_values=None, class_index=-1, missing_threshold=0.75)

Transform a CSV document into a numpy matrix that is ready for use by decision tree classifiers. To that end, instances with missing values are removed and one-hot encoding is applied to all non-numeric columns. The class column is processed with a label encoder.
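The one-hot encoding step described above can be illustrated with a minimal pure-Python sketch. It is not the CsvImporter implementation: the helper name one_hot_encode is hypothetical, and the ordering of the generated columns and categories is an assumption made for the example.

```python
def one_hot_encode(rows):
    """One-hot encode every non-numeric column of a list of equal-length rows.

    Hypothetical helper for illustration only; the column order and category
    order produced by CsvImporter itself may differ.
    """
    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    n_columns = len(rows[0])
    encoded = [[] for _ in rows]
    for col in range(n_columns):
        values = [row[col] for row in rows]
        if all(is_number(v) for v in values):
            # Numeric column: keep it as a single float feature.
            for out, v in zip(encoded, values):
                out.append(float(v))
        else:
            # Non-numeric column: one 0/1 feature per distinct category
            # (categories sorted alphabetically, an assumption of this sketch).
            categories = sorted(set(values))
            for out, v in zip(encoded, values):
                out.extend(1.0 if v == c else 0.0 for c in categories)
    return encoded


# Second column is non-numeric, so it expands into "blue" and "red" features.
rows = [["5.1", "red"], ["4.9", "blue"], ["6.2", "red"]]
X = one_hot_encode(rows)
```

Each non-numeric column expands into as many features as it has distinct values, so n_features in the output can be larger than the number of columns in the CSV file.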

Parameters:
encoding : str, default ‘utf-8’

The encoding used to decode the input file.

sep : str, default ‘,’

Delimiter to use.

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. Use str or object together with suitable na_values settings to preserve and not interpret dtype.

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. The following values are always interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

class_index : int, default -1 (i.e., the last column)

Column index of the class attribute. This column is not part of the transform output; it is kept separately in the y attribute of this transformer. Multi-output scenarios are not supported.

missing_threshold : float (fraction between 0 and 1), default 0.75

Minimum fraction of the data that must remain after instances with missing values are removed. If a smaller fraction remains, a warning is raised.
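How na_values and missing_threshold interact can be sketched in plain Python. This is a simplified illustration, not the library's code: the helper drop_missing and the constant DEFAULT_NA are hypothetical, the wording of the warning is invented, and per-column dict values for na_values are not handled.

```python
import warnings

# Strings the parameter description lists as always interpreted as NaN.
DEFAULT_NA = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN",
              "-nan", "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN",
              "n/a", "nan", "null"}


def drop_missing(rows, na_values=None, missing_threshold=0.75):
    """Remove rows containing a missing value; warn if too few rows remain.

    Hypothetical helper: accepts a single string or a list-like of extra
    NA markers (the per-column dict form is not covered by this sketch).
    """
    if na_values is None:
        extra = set()
    elif isinstance(na_values, str):
        extra = {na_values}
    else:
        extra = set(na_values)
    markers = DEFAULT_NA | extra

    kept = [row for row in rows if not any(cell in markers for cell in row)]
    remaining = len(kept) / len(rows) if rows else 1.0
    if remaining < missing_threshold:
        # The message text is made up for this example.
        warnings.warn(
            f"only {remaining:.0%} of the rows remain after removing "
            f"missing values (threshold {missing_threshold:.0%})"
        )
    return kept


rows = [["1", "a"], ["?", "b"], ["3", "NA"], ["4", "d"]]

# Record warnings so the example can inspect them instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_missing(rows, na_values=["?"])
```

Here two of the four rows contain a missing marker, so only half of the data remains, which is below the default threshold of 0.75 and triggers the warning.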

Attributes:
y : numpy array, [n_samples]

Data extracted from the CSV based on the given class_index and then encoded. This data is not returned by transform, but saved here instead.

original_y : numpy array, [n_samples]

Same as y, but before encoding.
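The relationship between class_index, y and original_y can be sketched as follows. The helper split_class_column is hypothetical, and the integer codes assume labels are numbered in sorted order (as scikit-learn's LabelEncoder does); the documentation above only states that a label encoder is used.

```python
def split_class_column(rows, class_index=-1):
    """Separate the class column from the features and label-encode it.

    Hypothetical helper for illustration; returns (features, y, original_y).
    """
    # Class values before encoding, one per row.
    original_y = [row[class_index] for row in rows]

    # Normalise a negative index so the column can be skipped by position.
    skip = class_index % len(rows[0])
    features = [[v for i, v in enumerate(row) if i != skip] for row in rows]

    # Assumption: distinct labels receive consecutive codes in sorted order.
    classes = sorted(set(original_y))
    code = {label: i for i, label in enumerate(classes)}
    y = [code[label] for label in original_y]
    return features, y, original_y


rows = [["5.1", "yes"], ["4.9", "no"], ["6.2", "yes"]]
features, y, original_y = split_class_column(rows)
```

With the default class_index of -1, the last column is removed from the features, original_y keeps the raw labels, and y holds their integer codes.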

Methods

fit(csv_file[, y])              Extract data from the given CSV file.
fit_transform(csv_file[, y])    Extract data from the given CSV file and return it as a numpy matrix.
fit_transform_both(csv_file)    Extract data from the given CSV file and return it as a numpy matrix, together with the encoded class values.
get_params([deep])              Get parameters for this estimator.
set_params(**params)            Set the parameters of this estimator.
transform(csv_file)             Extract data from the given CSV file and return it as a numpy matrix.

fit(csv_file, y=None)

Extract data from the given CSV file.

Parameters:
csv_file : string

File path to CSV file.

Returns:
self

fit_transform(csv_file, y=None)

Extract data from the given CSV file and return it as a numpy matrix.

This is equivalent to calling fit followed by transform, but is implemented more efficiently.

Parameters:
csv_file : string

File path to CSV file.

Returns:
X : numpy matrix, [n_samples, n_features]

Extracted data.

fit_transform_both(csv_file)

Extract data from the given CSV file and return it as a numpy matrix, together with the encoded class values.

Parameters:
csv_file : string

File path to CSV file.

Returns:
X : numpy matrix, [n_samples, n_features]

Extracted data.

y : numpy array, [n_samples]

Data extracted from the CSV based on the given class_index and then encoded.

transform(csv_file)

Extract data from the given CSV file and return it as a numpy matrix.

Parameters:
csv_file : string

File path to CSV file.

Returns:
X : numpy matrix, [n_samples, n_features]

Extracted data.