Friesian Feature API

friesian.feature.table

class zoo.friesian.feature.table.FeatureTable(df)

Bases: zoo.friesian.feature.table.Table

encode_string(columns, indices)

Encode columns with the provided StringIndex or list of StringIndexes.

Parameters

columns – str or a list of str, target columns to be encoded.
indices – StringIndex or a list of StringIndex, the StringIndexes of the target columns. Each StringIndex must have at least two columns: id and the corresponding categorical column.

Returns

A new FeatureTable in which the categorical features are transformed into unique integer values using the provided StringIndexes.
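To make the lookup concrete, here is a minimal pure-Python sketch of the kind of mapping encode_string describes. It does not use Friesian or Spark. The helper name `encode_column`, the list-of-dicts row representation, and the choice to map values missing from the index to None are all assumptions made for illustration, not the library's behavior.

```python
def encode_column(rows, column, index):
    """Return new rows with `column` replaced by its integer id.

    `index` stands in for a StringIndex: a dict from category to id.
    Values absent from the index become None (an assumption).
    """
    return [{**row, column: index.get(row[column])} for row in rows]


rows = [
    {"user": "a", "city": "NY"},
    {"user": "b", "city": "LA"},
    {"user": "c", "city": "SF"},
]
city_index = {"NY": 1, "LA": 2}  # hypothetical category -> id pairs

# The input rows are left untouched; a new list is returned.
encoded = encode_column(rows, "city", city_index)
```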

gen_string_idx(columns, freq_limit)

Generate unique index values for categorical features.

Parameters

columns – str or a list of str, target columns to generate StringIndex.
freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. For instance, 15 or {'col_4': 10, 'col_5': 2}.

Returns

List of StringIndex
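The effect of freq_limit can be sketched in plain Python, without Friesian or Spark. The helper names `gen_index` and `resolve_limit` are hypothetical, and several details are assumptions rather than documented behavior: ids start at 1, follow descending frequency with alphabetical tie-breaking, and a limit of None (or a column missing from the dict) means no filtering.

```python
from collections import Counter


def gen_index(values, freq_limit=None):
    """Map each sufficiently frequent category to an integer id.

    Assumptions: ids start at 1 and follow descending frequency,
    ties are broken alphabetically, and None means no filtering.
    """
    counts = Counter(v for v in values if v is not None)
    kept = [
        (category, n)
        for category, n in counts.items()
        if freq_limit is None or n >= freq_limit
    ]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {category: i for i, (category, _) in enumerate(kept, start=1)}


def resolve_limit(freq_limit, column):
    """Pick the limit for one column from an int, dict or None."""
    if isinstance(freq_limit, dict):
        return freq_limit.get(column)  # missing column -> None (assumption)
    return freq_limit


data = {"col_4": ["x", "x", "x", "y", "z", "z"], "col_5": ["p", "q", "q"]}
limits = {"col_4": 2, "col_5": 2}
indices = {
    col: gen_index(vals, resolve_limit(limits, col)) for col, vals in data.items()
}
```

In this sketch "y" and "p" each occur once, below their column's limit of 2, so neither receives an id.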

classmethod read_parquet(paths)

Loads Parquet files, returning the result as a FeatureTable.

Parameters: paths – str or a list of str. The path/paths to Parquet file(s).
Returns: A FeatureTable

class zoo.friesian.feature.table.StringIndex(df, col_name)

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths, col_name=None)

Loads Parquet files, returning the result as a StringIndex.

Parameters

paths – str or a list of str. The path/paths to Parquet file(s).
col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.

Returns

A StringIndex.

write_parquet(path, mode='overwrite')

Write the StringIndex to a Parquet file.

Parameters

path – str. The path to the folder of the Parquet file. Note that col_name will be used as the basename of the Parquet file.
mode – str. One of append, overwrite, error or ignore. append: Append the contents of this StringIndex to existing data. overwrite: Overwrite existing data. error: Throw an exception if data already exists. ignore: Silently ignore this operation if data already exists.

class zoo.friesian.feature.table.Table(df)

Bases: object

broadcast()

Marks a Table as small enough for use in broadcast joins.

clip(columns, min=0)

Clips continuous values so that they are no less than a min bound. For instance, setting min to 0 replaces all negative values in columns with 0.

Parameters

columns – list of str, the target columns to be clipped.
min – int, the minimum value to clip values to: values less than this will be replaced with this value.

Returns

A new Table in which values less than min are replaced with the specified min.
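The clipping rule itself is simple enough to show on a plain Python list. This is only a sketch of the arithmetic, not the Spark-backed implementation. The helper name `clip_min` is hypothetical, and passing None entries through unchanged is an assumption.

```python
def clip_min(values, min_value=0):
    """Replace every value below `min_value` with `min_value`.

    None entries are passed through unchanged (an assumption).
    """
    return [v if v is None or v >= min_value else min_value for v in values]


# Negative values are raised to the bound; everything else is kept.
clipped = clip_min([-3, 0, 2.5, None, -0.1])
```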

compute()

Trigger computation of the Table.

count()

Returns the number of rows in this Table.

Returns: The number of rows in the current Table.

drop(*cols)

Returns a new Table that drops the specified column(s). This is a no-op if the schema doesn’t contain the given column name(s).

Parameters: cols – a string name of the column to drop, or a list of string names of the columns to drop.
Returns: A new Table that drops the specified column(s).

fillna(value, columns)

Replace null values.

Parameters

value – int, long, float, string, or boolean. Value to replace null values with.
columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, string or boolean, all columns will be filled.

Returns

A new Table in which null values are replaced with the specified value.
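The rule for which columns are filled when columns=None can be written out as a small pure-Python sketch. It covers only the column selection described above, not the fill itself. The helper name `columns_to_fill`, the schema dict of column name to type name, and treating only "int" as the integer type are assumptions for illustration.

```python
def columns_to_fill(schema, value, columns=None):
    """Choose which columns a fill value applies to.

    `schema` is a hypothetical dict of column name -> type name.
    """
    if columns is not None:
        return list(columns)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return [name for name, type_name in schema.items() if type_name == "int"]
    return list(schema)


schema = {"age": "int", "score": "float", "name": "string"}

int_targets = columns_to_fill(schema, 0)        # integer columns only
float_targets = columns_to_fill(schema, 0.0)    # every column
explicit = columns_to_fill(schema, "n/a", ["name"])
```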

log(columns, clipping=True)

Calculates the log of continuous columns.

Parameters

columns – list of str, the target columns to calculate log.
clipping – boolean, if clipping=True, the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If clipping=False, log(x) will be calculated.

Returns

A new Table in which the values in columns are replaced with their logged values.
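The two modes of the clipping flag correspond to the following arithmetic, shown here on a plain list rather than a Table. The helper name `log_values` is hypothetical, and the use of the natural logarithm is an assumption, since the text above does not state the base.

```python
import math


def log_values(values, clipping=True):
    """Apply log(max(x, 0) + 1) when clipping, else plain log(x).

    Uses the natural logarithm (an assumption about the base).
    """
    if clipping:
        return [math.log(max(v, 0) + 1) for v in values]
    return [math.log(v) for v in values]


# With clipping, negative inputs behave like 0, so log(0 + 1) is 0.
logged = log_values([-5, 0, math.e - 1])
```

Without clipping, zero or negative inputs would make `math.log` raise a ValueError in this sketch, which is one reason the clipped form is convenient for count-like features.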

merge_cols(columns, target)

Merge column values as a list into a new column.

Parameters

columns – list of str, the target columns to be merged.
target – str, the new column name of the merged column.

Returns

A new Table in which columns are replaced with a new target column holding the merged list values.
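As a rough illustration of the resulting row shape, the merge can be sketched on a list of dicts. This does not use Friesian or Spark. The helper name `merge_columns` and the row representation are assumptions. The source columns are dropped, following the Returns description above.

```python
def merge_columns(rows, columns, target):
    """Collect the values of `columns` into one list-valued `target` column."""
    merged = []
    for row in rows:
        # Keep every column that is not being merged.
        new_row = {k: v for k, v in row.items() if k not in columns}
        # Preserve the order given in `columns`.
        new_row[target] = [row[c] for c in columns]
        merged.append(new_row)
    return merged


rows = [
    {"id": 1, "f1": 0.2, "f2": 0.8},
    {"id": 2, "f1": 0.5, "f2": 0.1},
]
merged = merge_columns(rows, ["f1", "f2"], "features")
```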

rename(columns)

Rename columns with new column names.

Parameters: columns – dict. Name pairs. For instance, {'old_name1': 'new_name1', 'old_name2': 'new_name2'}.
Returns: A new Table with new column names.

show(n=20, truncate=True)

Prints the first n rows to the console.

Parameters

n – int, number of rows to show.
truncate – If set to True, truncates strings longer than 20 characters. If set to a number greater than one, truncates long strings to length truncate and right-aligns cells.

to_spark_df()

Convert the current Table to a Spark DataFrame.

Returns: The converted Spark DataFrame.