API Reference
setup
Initializes the Setup instance.
Parameters:
- df (pd.DataFrame): The test dataset.
Returns:
- events_names (list): Names of the dataset columns to keep. Two columns must be selected:
  - One that contains the user names (these are pseudonymized later).
  - One that contains the event information to be estimated.
- privacy_method (str): Name of the privacy method used to privatize the dataset (must be PCMeS or PHCMS).
- error_metric (str): Name of the metric used to calculate the error (must be MSE, Lp Norm or RMSE).
- error_value (float): Maximum error allowed in the data, as a decimal. For example, write 0.02 for a 2% maximum error.
- tolerance (float): Tolerance allowed on the error value.
Returns:
- pd.DataFrame: The filtered DataFrame.
Parameters:
- e (float): Privacy budget (epsilon).
- k (int): Number of hash functions.
- m (int): Number of buckets.
Returns:
- error_table (table): Table of all the error parameters.
- df_estimated (table): Table with the estimated frequency of the events.
Parameters:
- er (float, optional): Initial reference value for epsilon. Default is 150.
Returns:
- er (float): Privacy budget (epsilon) of reference for the next stage.
- k (int): Number of hash functions.
- m (int): Number of buckets.
epsilon) while satisfying the defined error constraint.
Parameters:
- k (int): Number of hash functions.
- m (int): Number of buckets.
Returns:
- e (int): Optimal epsilon value.
mask
Initializes the Mask instance by loading configuration parameters.
Parameters:
- privacy_level (str): Privacy level identifier.
- df (pd.DataFrame): The input dataset.
pd.DataFrame: The filtered and pseudonymized DataFrame.
- f_estimated (pd.DataFrame): Estimated frequency distribution.
- f_real (pd.DataFrame): Real frequency distribution.
Returns:
- Placeholder for metric value calculation.
Input:
- User's name (user_name).
Output:
- A 10-character pseudonymized hash of the user's name.
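A 10-character pseudonym can be obtained by truncating a hex digest. The sketch below is only an illustration of that idea: the function name `pseudonymize` and the choice of SHA-256 are assumptions and are not confirmed by this reference.

```python
import hashlib


def pseudonymize(user_name: str) -> str:
    """Hypothetical helper: map a user name to a 10-character hex pseudonym."""
    # SHA-256 is an assumed choice; the package may use a different hash.
    digest = hashlib.sha256(user_name.encode("utf-8")).hexdigest()
    return digest[:10]  # keep only the first 10 hex characters


alias = pseudonymize("alice")
print(alias)
```

Because a hash like this is unsalted, anyone who can guess a user name can recompute its pseudonym, so it hides names from casual inspection rather than from a determined attacker.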
- None (uses class attributes like privacy_level, error_value, and tolerance).
Output:
- The best optimized ϵ, privatized data, and associated coefficients.
aggregate
Updates the sketch matrix based on the privatized data using either the "PCMeS" or "PHCMS" privacy method. It processes the given data point and modifies the matrix accordingly.
Input:
- M: The sketch matrix.
- k: Parameter used in matrix updates.
- e: Privacy parameter.
- privacy_method: The privacy method to apply ("PCMeS" or "PHCMS").
- data_point: Data point used to update the matrix (could be vector, index, or weight).
Output:
- The updated sketch matrix (M).
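The update step can be sketched as follows, assuming that PCMeS follows the private count-mean sketch update and PHCMS its Hadamard variant as described in the local differential privacy literature. Whether the package matches these formulas exactly is an assumption, as are the function name `update_sketch_matrix` and the tuple layout of `data_point`.

```python
import math


def update_sketch_matrix(M, k, e, privacy_method, data_point):
    """Hypothetical update step; the package's own function may differ."""
    if privacy_method == "PCMeS":
        # Assumed layout: privatized vector of +1/-1 entries and the hash index j.
        v, j = data_point
        c_e = (math.exp(e / 2) + 1) / (math.exp(e / 2) - 1)
        for l, bit in enumerate(v):
            M[j][l] += k * ((c_e / 2) * bit + 0.5)
    elif privacy_method == "PHCMS":
        # Assumed layout: privatized +1/-1 weight, hash index j, coordinate l.
        # A Hadamard transform of M would still be needed before estimation.
        w, j, l = data_point
        c_e = (math.exp(e) + 1) / (math.exp(e) - 1)
        M[j][l] += k * c_e * w
    else:
        raise ValueError(f"Unknown privacy method: {privacy_method}")
    return M


k, m = 2, 4
M = [[0.0] * m for _ in range(k)]
M = update_sketch_matrix(M, k, 2.0, "PCMeS", ([1, -1, -1, -1], 0))
```

Only row `j` of the matrix changes for a given data point, which is why each report can be aggregated independently of the others.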
Output:
- Initializes the Agregate instance with privacy settings and an empty dictionary for user sketches.
Input:
user_data: Data related to a specific user.
Output:
- A tuple with the user's ID and the computed sketch matrix (M) along with the number of data points (N).
Output:
- A dictionary (sketch_by_user) mapping user IDs to their computed sketches.
estimate
Constructor for the Estimation class. It loads the necessary aggregated data (sketch_by_user) and privacy settings (k, m, epsilon, hashes, method) from JSON files to initialize the estimation process.
Output:
- Initializes the Estimation instance with user sketches and privacy settings.
M and the number of data points N. The estimation is based on a formula involving the privacy settings.
Input:
- d: The element whose frequency is to be estimated.
- M: The sketch matrix for a user.
- N: The number of data points for the user.
Output:
- The estimated frequency of the element d for the user.
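The reference does not spell out the formula. One plausible form is the standard count-mean sketch estimator, which averages the sketch cells that `d` hashes to and removes the expected contribution of collisions. The sketch below assumes that estimator; passing `hashes` and `m` as explicit arguments is a hypothetical simplification, since the documented class loads them from its configuration.

```python
def estimate_element(d, M, N, hashes, m):
    """Hypothetical estimator: debiased mean of the cells that d hashes to."""
    k = len(hashes)
    # Sum the cell selected by each hash function in its own row.
    total = sum(M[l][hashes[l](d)] for l in range(k))
    # Subtract the expected collision mass N/m and rescale.
    return (m / (m - 1)) * (total / k - N / m)


# Toy example with two hand-written hash functions over m = 4 buckets.
hashes = [lambda x: x % 4, lambda x: (3 * x + 1) % 4]
M = [[5, 1, 1, 1], [1, 5, 1, 1]]
estimate = estimate_element(0, M, N=8, hashes=hashes, m=4)
print(estimate)
```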
estimate_element method.
Input:
event: The event (element) whose frequency needs to be estimated.
Output:
- Prints the estimated frequency of the event for each user.
utils
Saves the initial setup configuration to a JSON file (setup_config.json), including parameters like k, m, epsilon, event names, privacy method, error metrics, tolerance, and p.
Input:
setup_instance: Object containing the setup parameters.
Output: JSON configuration file saved.
Loads the setup configuration from the previously saved JSON file.
Output: Tuple with k, m, epsilon, events_names, privacy_method, error_metric, error_value, tolerance and p.
mask_config.json and privatized_dataset.csv).
Input:
- mask_instance: Mask configuration instance.
- e: Epsilon value.
- coeffs: Hash function coefficients.
- privatized_dataset: Privatized dataset.
Output: JSON and CSV files saved.
Loads the mask configuration, rebuilds the hash functions, and loads the privatized dataset.
Output: Tuple with k, m, e, rebuilt hash functions, privacy_method, and the dataset as a DataFrame.
Input:
agregate_instance: Instance containing the user sketches.
Output: Loaded sketch_by_user object.
Generates a deterministic hash of an element using SHA-256, returning it as an integer.
Input:
x: Element to be hashed.
Output: Deterministic hash as an integer.
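A minimal sketch of such a hash is shown below. The function name `deterministic_hash` and the conversion of the element with `str()` before encoding are assumptions. Python's built-in `hash()` is randomized per process for strings, which is why a digest such as SHA-256 is the usual choice when values must be stable across runs.

```python
import hashlib


def deterministic_hash(x) -> int:
    """Hypothetical helper: stable integer hash of any element."""
    # str(x) is an assumed normalization step before hashing.
    digest = hashlib.sha256(str(x).encode("utf-8")).hexdigest()
    return int(digest, 16)  # interpret the 64 hex characters as one integer


value = deterministic_hash("login")
```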
Generates k hash functions based on random coefficients over a finite field defined by p, with outputs mapped to the range m.
Input:
- k: Number of hash functions.
- p: Large prime number for modular operations.
- c: Polynomial degree.
- m: Output range.
Output:
- List of hash functions.
- Dictionary with function parameters.
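Polynomial hashing over a prime field is one common way to build such a family. The sketch below assumes that approach: each function draws `c` random coefficients (a polynomial of degree `c - 1`; the package may count the degree differently), evaluates the polynomial modulo `p`, and reduces the result modulo `m`. The name `generate_hash_functions` and the `seed` argument are hypothetical.

```python
import random


def generate_hash_functions(k, p, c, m, seed=None):
    """Hypothetical generator of k polynomial hash functions into range(m)."""
    rng = random.Random(seed)
    coefficients = [[rng.randrange(1, p) for _ in range(c)] for _ in range(k)]

    def make_hash(coeffs):
        def h(x):
            # Horner evaluation of a0 + a1*x + ... modulo p, then reduce to m.
            acc = 0
            for a in reversed(coeffs):
                acc = (acc * x + a) % p
            return acc % m
        return h

    functions = [make_hash(coeffs) for coeffs in coefficients]
    params = {"coefficients": coefficients, "p": p, "m": m, "c": c}
    return functions, params


# Inputs are assumed to be integers (for example, the output of a SHA-256 hash).
functions, params = generate_hash_functions(k=3, p=2**31 - 1, c=2, m=16, seed=0)
buckets = [h(12345) for h in functions]
```

Returning the coefficients alongside the functions keeps everything JSON-serializable, so the same functions can be rebuilt later from the stored parameters.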
Input:
- functions_params: Dictionary with coefficients, p, m, and c.
Output:
- List of rebuilt hash functions.
Input:
- real_freq: DataFrame with real frequencies.
- estimated_freq: Dictionary with estimated frequencies.
Output:
- List of tabulated results with real count, real percentage, estimated count, estimated percentage, difference, and percent error.
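The comparison can be sketched as below. The function name `build_error_table`, the `Element` and `Frequency` column names for `real_freq`, and the exact definitions of the percentages (share of the total count, and absolute difference relative to the real count) are assumptions made for illustration.

```python
import pandas as pd


def build_error_table(real_freq: pd.DataFrame, estimated_freq: dict) -> list:
    """Hypothetical helper: one row per element comparing real and estimated counts."""
    total_real = float(real_freq["Frequency"].sum())
    total_est = float(sum(estimated_freq.values()))
    rows = []
    for element, real_count in zip(real_freq["Element"], real_freq["Frequency"]):
        est_count = float(estimated_freq.get(element, 0.0))
        diff = abs(est_count - real_count)
        rows.append([
            element,
            int(real_count),
            100 * real_count / total_real if total_real else 0.0,
            est_count,
            100 * est_count / total_est if total_est else 0.0,
            diff,
            100 * diff / real_count if real_count else float("nan"),
        ])
    return rows


real_freq = pd.DataFrame({"Element": ["a", "b"], "Frequency": [10, 30]})
estimated_freq = {"a": 12.0, "b": 27.0}
rows = build_error_table(real_freq, estimated_freq)
```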
Input:
df: DataFrame containing a value column.
Output:
- DataFrame with columns Element and Frequency.
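A frequency table of this shape can be produced with `value_counts`. In the sketch below the function name `frequency_table` is hypothetical, and the input is assumed to hold the events in a column literally named `value`.

```python
import pandas as pd


def frequency_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count occurrences of each entry in the value column."""
    counts = df["value"].value_counts()
    # Build the frame explicitly so the column names do not depend on the pandas version.
    return pd.DataFrame({"Element": counts.index, "Frequency": counts.values})


events = pd.DataFrame({"value": ["login", "logout", "login"]})
table = frequency_table(events)
```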
📎 Notes
- Make sure that your input file does not have incorrectly named columns (e.g., Unnamed: 0).
- Pseudonymization is applied using simple hashing.
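Columns named `Unnamed: 0` typically appear when a CSV was saved with its index and then read back with pandas. A minimal way to drop them before passing the data on is sketched below; the CSV content and the `user` and `event` column names are made up for the example.

```python
import io

import pandas as pd

# The empty first header field makes pandas name that column "Unnamed: 0".
csv_text = ",user,event\n0,alice,login\n1,bob,logout\n"
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
```

Writing the file with `df.to_csv(path, index=False)` in the first place avoids the extra column entirely.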