ptrail.preprocessing package

Submodules

ptrail.preprocessing.filters module

The filters module contains functions that filter trajectory data by criteria such as time, date, and proximity to a point.

Authors: Yaksh J Haranwala, Salman Haidri

class ptrail.preprocessing.filters.Filters[source]

Bases: object

static filter_by_bounding_box(dataframe: PTRAILDataFrame, bounding_box: tuple, inside: bool = True)[source]

Given a bounding box, select the points that lie inside or outside it, as chosen by inside, and return a dataframe containing those points.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe from which the data is to be filtered out.

  • bounding_box (tuple) – The bounding box which is to be used to filter the data.

  • inside (bool) – Whether to return the points inside the bounding box (True, the default) or the points outside it (False).

Returns:

The filtered dataframe.

Return type:

PTRAILDataFrame
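
As an illustration of the filtering logic only, the sketch below applies a bounding-box test to plain (lat, lon) tuples instead of a PTRAILDataFrame. The helper name filter_points_by_bbox and the (min_lat, min_lon, max_lat, max_lon) ordering of the tuple are assumptions made for this example; check the library source for the ordering that filter_by_bounding_box actually expects.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lat, lon)


def filter_points_by_bbox(points: List[Point],
                          bbox: Tuple[float, float, float, float],
                          inside: bool = True) -> List[Point]:
    """Keep the points inside the box (inside=True) or outside it (inside=False)."""
    # Assumed ordering of the bounding box tuple (hypothetical for this sketch).
    min_lat, min_lon, max_lat, max_lon = bbox

    def in_box(point: Point) -> bool:
        lat, lon = point
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

    return [p for p in points if in_box(p) == inside]


points = [(39.9, 116.3), (40.5, 117.0), (39.95, 116.4)]
bbox = (39.8, 116.2, 40.0, 116.5)
inside_pts = filter_points_by_bbox(points, bbox)
outside_pts = filter_points_by_bbox(points, bbox, inside=False)
```

Every point lands in exactly one of the two results, so the two calls partition the input.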

static filter_by_date(dataframe: PTRAILDataFrame, start_date: Optional[str] = None, end_date: Optional[str] = None)[source]

Filter the dataset by a user-given date range.

Note

The filtering behaves as follows:
1. If neither start_date nor end_date is given, the entire dataset is returned.
2. If only start_date is given, the trajectory data on or after the start date is returned.
3. If only end_date is given, the trajectory data on or before the end date is returned.
4. If both start_date and end_date are given, the data between start_date and end_date (both inclusive) is returned.
Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe that is to be filtered.

  • start_date (Optional[Text]) – The start date from which the points are to be filtered.

  • end_date (Optional[Text]) – The end date before which the points are to be filtered.

Returns:

The filtered dataframe containing the resultant data.

Return type:

PTRAILDataFrame

Raises:

ValueError: – When the start date is later than the end date.
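
The four cases listed in the note can be sketched with the standard library alone. This is not the library's implementation: the helper name filter_rows_by_date, the (traj_id, datetime) row layout, and the 'YYYY-MM-DD' date string format are all assumptions made for this example.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple

Row = Tuple[str, datetime]  # (traj_id, timestamp), a hypothetical row layout


def filter_rows_by_date(rows: List[Row],
                        start_date: Optional[str] = None,
                        end_date: Optional[str] = None) -> List[Row]:
    """Keep rows whose calendar date lies in [start_date, end_date]."""
    # Assumed input format: ISO 'YYYY-MM-DD' strings.
    start = date.fromisoformat(start_date) if start_date is not None else None
    end = date.fromisoformat(end_date) if end_date is not None else None

    if start is not None and end is not None and start > end:
        raise ValueError("start_date is later than end_date")

    # A missing bound places no restriction on that side of the range.
    return [
        row for row in rows
        if (start is None or row[1].date() >= start)
        and (end is None or row[1].date() <= end)
    ]


rows = [
    ("t1", datetime(2021, 1, 1, 8, 0)),
    ("t1", datetime(2021, 1, 2, 9, 30)),
    ("t1", datetime(2021, 1, 3, 23, 59)),
]
```

Comparing on the calendar date rather than the full timestamp is what makes both bounds inclusive of the whole day.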

static filter_by_datetime(dataframe: PTRAILDataFrame, start_dateTime: Optional[str] = None, end_dateTime: Optional[str] = None)[source]

Filter the dataset by a user-given datetime range.

Note

The filtering behaves as follows:
1. If neither start_dateTime nor end_dateTime is given, the entire dataset is returned.
2. If only start_dateTime is given, the trajectory data at or after the start datetime is returned.
3. If only end_dateTime is given, the trajectory data at or before the end datetime is returned.
4. If both start_dateTime and end_dateTime are given, the data between start_dateTime and end_dateTime (both inclusive) is returned.
Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe that is to be filtered.

  • start_dateTime (Optional[Text]) – The start dateTime from which the points are to be filtered.

  • end_dateTime (Optional[Text]) – The end dateTime before which the points are to be filtered.

Returns:

The filtered dataframe containing the resultant data.

Return type:

PTRAILDataFrame

Raises:

ValueError: – When the start datetime is later than the end datetime.

static filter_by_max_consecutive_distance(dataframe, max_distance: float)[source]

Remove the points for which the distance between 2 consecutive points is greater than a user-specified value.

Note

max_distance is given in metres.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • max_distance (float) – The consecutive distance threshold above which the points are to be removed.

Returns:

The filtered dataframe.

Return type:

PTRAILDataFrame
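
A rough sketch of a consecutive-distance filter follows. It assumes that distances are great-circle (haversine) distances on a sphere of radius 6,371 km and that each point is compared with the point preceding it in the original sequence; both are assumptions for illustration, and the helper names are hypothetical rather than part of the library.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_far_points(points: List[Tuple[float, float]],
                    max_distance: float) -> List[Tuple[float, float]]:
    """Drop every point lying more than max_distance metres from its predecessor."""
    kept = points[:1]  # the first point has no predecessor and is always kept
    for prev, cur in zip(points, points[1:]):
        if haversine_m(*prev, *cur) <= max_distance:
            kept.append(cur)
    return kept


track = [(0.0, 0.0), (0.0, 0.001), (0.0, 0.5)]
filtered = drop_far_points(track, max_distance=200.0)
```

The same structure covers the minimum-distance variant by reversing the comparison.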

static filter_by_max_distance_and_speed(dataframe, max_distance: float, max_speed: float)[source]

Filter out the points for which the distance between consecutive points is greater than a user-given distance and the speed between consecutive points is greater than a user-given speed.

Note

The max_distance is given in metres.

Note

The max_speed is given in metres/second (m/s).

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • max_distance (float) – The maximum distance between 2 consecutive points.

  • max_speed (float) – The maximum speed between 2 consecutive points.

Returns:

The filtered dataframe.

Return type:

pandas.DataFrame

static filter_by_max_speed(dataframe: PTRAILDataFrame, max_speed: float)[source]

Remove the data points whose speed is greater than a user-given speed.

Note

The max_speed is given in the units m/s (metres per second).

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • max_speed (float) – The speed threshold above which the points are to be removed.

Returns:

The dataframe containing the filtered data.

Return type:

PTRAILDataFrame

static filter_by_min_consecutive_distance(dataframe, min_distance: float)[source]

Remove the points for which the distance between 2 consecutive points is less than a user-specified value.

Note

min_distance is given in metres.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • min_distance (float) – The consecutive distance threshold below which the points are to be removed.

Returns:

The filtered dataframe.

Return type:

PTRAILDataFrame

static filter_by_min_distance_and_speed(dataframe, min_distance: float, min_speed: float)[source]

Filter out the points for which the distance between consecutive points is less than a user-given distance and the speed between consecutive points is less than a user-given speed.

Note

The min_distance is given in metres.

Note

The min_speed is given in metres/second (m/s).

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • min_distance (float) – The minimum distance between 2 consecutive points.

  • min_speed (float) – The minimum speed between 2 consecutive points.

Returns:

The filtered dataframe.

Return type:

PTRAILDataFrame

static filter_by_min_speed(dataframe, min_speed: float)[source]

Remove the data points whose speed is less than a user-given speed.

Note

The min_speed is given in the units m/s (metres per second).

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

  • min_speed (float) – The speed threshold below which the points are to be removed.

Returns:

The dataframe containing the filtered data.

Return type:

PTRAILDataFrame

static filter_by_traj_id(dataframe: PTRAILDataFrame, traj_id: str)[source]

Extract all the trajectory points of a particular trajectory specified by the trajectory’s ID.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe on which the filtering by ID is to be done.

  • traj_id (Text) – The ID of the trajectory which is to be extracted.

Returns:

The dataframe containing all the trajectory points of the specified trajectory.

Return type:

pandas.DataFrame

Raises:

MissingTrajIDException: – This exception is raised when the Trajectory ID given by the user does not exist in the dataset.

static filter_outliers_by_consecutive_distance(dataframe: PTRAILDataFrame)[source]

Remove outlier points based on the distance between 2 consecutive points. The outlier bounds are computed as:

IQR (interquartile range) = Q3 - Q1
Lower bound = Q1 - (1.5 * IQR)
Upper bound = Q3 + (1.5 * IQR)
Points whose consecutive distance lies between the lower and upper bounds are retained.
Parameters:

dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

Returns:

The dataframe which has been filtered.

Return type:

PTRAILDataFrame
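
The IQR rule described for this method can be sketched on a plain list of consecutive distances. The helper names are hypothetical, the quartile method ("inclusive" linear interpolation) is an assumption, and whether values exactly on a bound are kept is also an assumption; the library may differ on each of these.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(values: List[float]) -> Tuple[float, float]:
    """Return (lower, upper) outlier bounds using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three cut points [Q1, Q2, Q3].
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def keep_within_iqr(values: List[float]) -> List[float]:
    """Keep only the values lying between the lower and upper bounds."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if lower <= v <= upper]


# Consecutive distances in metres; 500.0 is an obvious outlier.
distances = [10.0, 12.0, 11.0, 13.0, 12.0, 500.0, 11.0, 10.0]
bounds = iqr_bounds(distances)
cleaned = keep_within_iqr(distances)
```

For this input Q1 is 10.75 and Q3 is 12.25, so the bounds are 8.5 and 14.5 and only the 500.0 value is dropped.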

static filter_outliers_by_consecutive_speed(dataframe)[source]

Remove outlier points based on the speed between 2 consecutive points. The outlier bounds are computed as:

IQR (interquartile range) = Q3 - Q1
Lower bound = Q1 - (1.5 * IQR)
Upper bound = Q3 + (1.5 * IQR)
Points whose consecutive speed lies between the lower and upper bounds are retained.
Parameters:

dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.

Returns:

The dataframe which has been filtered.

Return type:

PTRAILDataFrame

static get_bounding_box_by_radius(lat: float, lon: float, radius: float)[source]

Calculates bounding box from a point according to the given radius.

Parameters:
  • lat (float) – The latitude of centroid point of the bounding box.

  • lon (float) – The longitude of centroid point of the bounding box.

  • radius (float) – The max radius of the bounding box. The radius is given in metres.

Returns:

The bounding box of the user specified size.

Return type:

tuple

References

https://mathmesquita.dev/2017/01/16/filtrando-localizacao-em-um-raio.html
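
One common way to derive such a box is to convert the radius into angular offsets on a spherical Earth. The sketch below uses that approximation; the Earth radius constant, the function name, and the (min_lat, min_lon, max_lat, max_lon) ordering of the result are assumptions for this example and may not match what get_bounding_box_by_radius returns.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def bounding_box_by_radius(lat: float, lon: float,
                           radius: float) -> Tuple[float, float, float, float]:
    """Return an approximate box extending `radius` metres around (lat, lon)."""
    # Angular offset in latitude is independent of position.
    d_lat = math.degrees(radius / EARTH_RADIUS_M)
    # Meridians converge towards the poles, so the longitude offset grows
    # by 1 / cos(lat). This breaks down very close to the poles.
    d_lon = math.degrees(radius / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return (lat - d_lat, lon - d_lon, lat + d_lat, lon + d_lon)


box_equator = bounding_box_by_radius(0.0, 0.0, 1000.0)
box_60n = bounding_box_by_radius(60.0, 0.0, 1000.0)
```

At 60 degrees latitude the box spans about twice as many degrees of longitude as of latitude, because cos(60°) is 0.5.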

static hampel_outlier_detection(dataframe, column_name: str)[source]

Use the Hampel filter to remove outliers from the dataset on the basis of the column specified by the user.

Warning

Do not use Hampel filter outlier detection on the DateTime column. Doing so raises a NotImplementedError, because the original author of the Hampel filter has not implemented it for datetime values.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe from which the outliers are to be removed.

  • column_name (Text) – The column on the basis of which the outliers are to be detected.

Returns:

The dataframe with the outliers removed.

Return type:

PTRAILDataFrame

Raises:

KeyError: – The user-specified column is not present in the dataset.

References

Pedrido, M.O., “Hampel”, (2020), GitHub repository, https://github.com/MichaelisTrofficus/hampel_filter
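
For readers unfamiliar with the technique, a Hampel filter flags a value as an outlier when it deviates from the median of a sliding window by more than a multiple of the scaled median absolute deviation (MAD). The sketch below is a generic standard-library version; the function name, the window size, and the threshold of 3 are assumptions for illustration and are not taken from the library or from the referenced repository.

```python
import statistics
from typing import List

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation for normal data


def hampel_outlier_indices(values: List[float],
                           half_window: int = 3,
                           n_sigmas: float = 3.0) -> List[int]:
    """Return the indices of values flagged as outliers by a Hampel filter."""
    outliers = []
    for i, value in enumerate(values):
        # Window of up to 2 * half_window + 1 values, truncated at the edges.
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        median = statistics.median(window)
        mad = statistics.median([abs(v - median) for v in window])
        if abs(value - median) > n_sigmas * MAD_SCALE * mad:
            outliers.append(i)
    return outliers


speeds = [1.0, 1.1, 0.9, 1.0, 25.0, 1.2, 1.0, 0.8, 1.1]
flagged = hampel_outlier_indices(speeds)
cleaned = [v for i, v in enumerate(speeds) if i not in flagged]
```

Because the median and the MAD are robust statistics, the single spike of 25.0 does not inflate the threshold for its neighbours.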

static remove_duplicates(dataframe: PTRAILDataFrame)[source]

Drop duplicates based on the following four columns:
  1. Trajectory ID

  2. DateTime

  3. Latitude

  4. Longitude

Duplicates will be dropped only when all the values in the above mentioned four columns are the same.

Returns:

The dataframe with dropped duplicates.

Return type:

PTRAILDataFrame
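
The all-four-columns rule can be sketched on a list of dictionaries. The key names traj_id, DateTime, lat and lon are assumed here to stand for the four columns listed above, and the helper name is hypothetical.

```python
from typing import Any, Dict, List


def remove_duplicate_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each (traj_id, DateTime, lat, lon) combination."""
    seen = set()
    unique = []
    for row in rows:
        # A row is a duplicate only when all four key values match an earlier row.
        key = (row["traj_id"], row["DateTime"], row["lat"], row["lon"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"traj_id": "a", "DateTime": "2021-01-01 08:00:00", "lat": 1.0, "lon": 2.0, "note": "x"},
    {"traj_id": "a", "DateTime": "2021-01-01 08:00:00", "lat": 1.0, "lon": 2.0, "note": "y"},
    {"traj_id": "a", "DateTime": "2021-01-01 08:00:00", "lat": 1.0, "lon": 2.5, "note": "z"},
]
deduped = remove_duplicate_rows(rows)
```

The second row is dropped even though its note differs, while the third survives because its longitude differs.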

static remove_trajectories_with_less_points(dataframe, num_min_points: Optional[int] = 3)[source]

Remove the trajectories that have fewer than a minimum number of points from the dataframe.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe from which trajectories with few points are to be removed.

  • num_min_points (Optional[int], default = 3) – The minimum number of points that a trajectory should have if it is to be retained in the dataset.

Returns:

The filtered dataframe which does not contain the trajectories with few points anymore.

Return type:

PTRAILDataFrame
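
A minimal sketch of the same idea on (traj_id, lat, lon) tuples is shown below. The helper name and row layout are hypothetical, and the sketch assumes that a trajectory with exactly num_min_points points is retained.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, float, float]  # (traj_id, lat, lon), a hypothetical row layout


def drop_short_trajectories(rows: List[Row], num_min_points: int = 3) -> List[Row]:
    """Keep only rows whose trajectory has at least num_min_points points."""
    counts = Counter(row[0] for row in rows)
    return [row for row in rows if counts[row[0]] >= num_min_points]


rows = [
    ("a", 0.0, 0.0), ("a", 0.1, 0.1), ("a", 0.2, 0.2),
    ("b", 5.0, 5.0), ("b", 5.1, 5.1),
]
kept = drop_short_trajectories(rows)
```

Trajectory "b" has only two points and is removed in full, while the relative order of the remaining rows is preserved.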

ptrail.preprocessing.helpers module

Warning

1. None of the methods in this module should be used directly while performing operations on data.
2. These methods are helpers for the interpolation methods in the interpolation.py module, so they run linearly rather than in parallel, which results in slower execution.
3. All the methods in this module perform calculations on a single Trajectory ID, so they will produce wrong results on data with multiple trajectories. Instead, use the interpolation.py methods for faster and more reliable calculations.

The Helpers class provides the functionality to interpolate a point from the data given by the user. The class contains the following 4 interpolation calculators:

  1. Linear Interpolation

  2. Cubic Interpolation

  3. Random-Walk Interpolation

  4. Kinematic Interpolation

Besides the interpolation helpers, there are also general utilities which are used in splitting up dataframes for running the code in parallel.

Authors: Yaksh J Haranwala, Salman Haidri

class ptrail.preprocessing.helpers.Helpers[source]

Bases: object

static cubic_help(df: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]

This method takes a dataframe and uses cubic interpolation to determine the location coordinates at a DateTime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.

Warning

This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.

Parameters:
  • df (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.

  • id_ (Text) – The Trajectory ID of the points in the dataframe.

  • sampling_rate (float) – The maximum time difference between 2 points, above which a point is inserted between them.

Returns:

The dataframe containing the trajectory enhanced with interpolated points.

Return type:

pandas.DataFrame

static filt_df_by_date(dataframe, start_date, end_date)[source]

static hampel_help(df, column_name)[source]

This function is the helper function for the hampel_outlier_detection() function present in the filters module. The purpose of the function is to run the hampel filter on a single trajectory ID, remove the outliers and return the smaller dataframe.

Warning

This function should not be used directly as it will result in a slower execution of the function and might result in removal of points that are actually not outliers.

Warning

Do not use Hampel filter outlier detection on the DateTime column. Doing so raises a NotImplementedError, because the original author of the Hampel filter has not implemented it for datetime values.

Parameters:
  • df (PTRAILDataFrame/pandas.DataFrame) – The dataframe from which the outliers are to be removed.

  • column_name (Text) – The column based on which the outliers are to be removed.

Returns:

The dataframe where the outlier points are removed.

Return type:

pandas.DataFrame

static kinematic_help(dataframe: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]

This method takes a dataframe and uses kinematic interpolation to determine the location coordinates at a DateTime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.

Warning

This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.

Parameters:
  • dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.

  • id_ (Text) – The Trajectory ID of the points in the dataframe.

  • sampling_rate (float) – The maximum time difference between 2 points, above which a point is inserted between them.

Returns:

The dataframe containing the trajectory enhanced with interpolated points.

Return type:

pandas.DataFrame

References

Nogueira, T.O., “kinematic_interpolation.py”, (2016), GitHub repository, https://gist.github.com/talespaiva/128980e3608f9bc5083b.js

static linear_help(dataframe: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]

This method takes a dataframe and uses linear interpolation to determine the location coordinates at a DateTime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.

Warning

This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.

Parameters:
  • dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.

  • id_ (Text) – The Trajectory ID of the points in the dataframe.

  • sampling_rate (float) – The maximum time difference between 2 points, above which a point is inserted between them.

Returns:

The dataframe containing the trajectory enhanced with interpolated points.

Return type:

pandas.DataFrame

static random_walk_help(dataframe: PTRAILDataFrame, id_: str, sampling_rate: float, class_label_col)[source]

This method takes a dataframe and uses random-walk interpolation to determine the location coordinates at a DateTime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.

Warning

This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.

Parameters:
  • dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.

  • id_ (Text) – The Trajectory ID of the points in the dataframe.

  • sampling_rate (float) – The maximum time difference between 2 points, above which a point is inserted between them.

Returns:

The dataframe containing the trajectory enhanced with interpolated points.

Return type:

pandas.DataFrame

References

Etemad, M., Soares, A., Etemad, E. et al. SWS: an unsupervised trajectory segmentation algorithm based on change detection with interpolation kernels. Geoinformatica (2020)

static split_traj_helper(df, num_days)[source]

static stats_helper(df, target_col_name, segmented)[source]

Generate the stats of the kinematic features present in the Dataframe.

Parameters:
  • df (pandas.DataFrame) – The dataframe containing the trajectory data and their features.

  • target_col_name (str) – The ‘y’ value used for ML tasks. It is required so that the target column can be appended back at the end.

  • segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.

Returns:

A dataframe containing the stats of the given trajectory.

Return type:

pandas.DataFrame

ptrail.preprocessing.interpolation module

This class interpolates dataframe positions based on DateTime. It supports linear, cubic, kinematic, and random-walk interpolation. The user passes the dataframe, the time jump, and the interpolation type, and the call is mapped to the matching interpolation function. Wherever the time difference between 2 consecutive points exceeds the time jump, an interpolated point is computed at the earlier point's time plus the time jump, and the new row is added to the dataframe.

Authors: Yaksh J Haranwala, Salman Haidri

class ptrail.preprocessing.interpolation.Interpolation[source]

Bases: object

static interpolate_position(dataframe: PTRAILDataFrame, sampling_rate: float, ip_type: Optional[str] = 'linear', class_label_col: Optional[str] = '')[source]

Interpolate the position of an object and create new points using one of the interpolation methods provided by the Library. Currently, the library supports the following 4 interpolation methods:

  1. Linear Interpolation

  2. Cubic-Spline Interpolation

  3. Kinematic Interpolation

  4. Random Walk Interpolation

Warning

The Interpolation methods will only return the 4 mandatory library columns because it is not possible to interpolate other data that may or may not be present in the dataset apart from latitude, longitude and datetime. As a result, other columns are dropped.

Note

The time jump (the sampling_rate parameter) specifies where new points are to be inserted, based on the time difference between 2 consecutive points. However, it does not guarantee that the time difference between every 2 consecutive points in the resulting dataset will be less than or equal to the user-specified time jump.

Note

The time jump is specified in seconds. If the user-specified time jump is not sensible, the execution of the method will take a very long time.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe containing the original dataset.

  • sampling_rate (float) – The maximum time difference between 2 consecutive points.

  • ip_type (Optional[Text], default = linear) – The type of interpolation that is to be used.

  • class_label_col (Optional[Text], default = '') – The column header which contains the class label of the point.

Returns:

The dataframe containing the interpolated trajectory points.

Return type:

PTRAILDataFrame
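
To make the insertion rule concrete, the sketch below performs linear interpolation on plain (t_seconds, lat, lon) tuples for a single trajectory. It assumes that exactly one point is inserted per oversized gap, placed sampling_rate seconds after the earlier point; that reading of the module description, the tuple layout, and the helper name are assumptions for this example, not the library's code.

```python
from typing import List, Tuple

Fix = Tuple[float, float, float]  # (t_seconds, lat, lon), a hypothetical layout


def linear_insert(points: List[Fix], sampling_rate: float) -> List[Fix]:
    """Insert one linearly interpolated point into every gap above sampling_rate."""
    if not points:
        return []
    result = [points[0]]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        if t1 - t0 > sampling_rate:
            # New point sits sampling_rate seconds after the earlier point.
            t_new = t0 + sampling_rate
            frac = (t_new - t0) / (t1 - t0)
            result.append((t_new,
                           lat0 + frac * (lat1 - lat0),
                           lon0 + frac * (lon1 - lon0)))
        result.append((t1, lat1, lon1))
    return result


track = [(0.0, 10.0, 20.0), (10.0, 10.1, 20.1), (50.0, 10.5, 20.5)]
dense = linear_insert(track, sampling_rate=20.0)
```

Only the 40-second gap receives a new point, at t = 30. With a gap much larger than sampling_rate, the remaining interval can still exceed it, which matches the note that the resulting spacing is not guaranteed.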

ptrail.preprocessing.statistics module

The statistics module has several functionalities that calculate kinematic statistics of the trajectory, split trajectories, pivot dataframes etc. The main purpose of this module is to get the dataframe ready for Machine Learning tasks such as clustering, classification etc.

Author: Yaksh J Haranwala

class ptrail.preprocessing.statistics.Statistics[source]

Bases: object

static generate_kinematic_stats(dataframe: PTRAILDataFrame, target_col_name: str, segmented: Optional[bool] = False)[source]

Generate the statistics of kinematic features for each unique trajectory in the dataframe.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe containing the trajectory data.

  • target_col_name (str) – The ‘y’ value used for ML tasks. It is required so that the target column can be appended back at the end.

  • segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.

Returns:

A pandas dataframe containing stats for all kinematic features for each unique trajectory in the dataframe.

Return type:

pandas.DataFrame
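
The general shape of per-trajectory statistics can be sketched as a group-then-aggregate step. The statistics chosen here (min, max, mean, median), the single speed feature, and the helper name are all hypothetical; the set of statistics and features that generate_kinematic_stats actually produces is not stated in this documentation.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def speed_stats_by_trajectory(
        rows: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (traj_id, speed) rows by trajectory and summarise each group."""
    grouped = defaultdict(list)
    for traj_id, speed in rows:
        grouped[traj_id].append(speed)
    return {
        traj_id: {
            "min": min(speeds),
            "max": max(speeds),
            "mean": statistics.fmean(speeds),
            "median": statistics.median(speeds),
        }
        for traj_id, speeds in grouped.items()
    }


rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("a", 2.0)]
stats = speed_stats_by_trajectory(rows)
```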

static pivot_stats_df(dataframe, target_col_name: str, segmented: Optional[bool] = False)[source]

Given a dataframe with stats present in it, melt the dataframe to make it ready for ML tasks. This is specifically for melting the type of dataframe generated by the generate_kinematic_stats() function of this module.

See generate_kinematic_stats() for further details about the dataframe expected.

Parameters:
  • dataframe (pandas.DataFrame) – The dataframe containing stats.

  • target_col_name (str) – The ‘y’ value used for ML tasks. It is required so that the target column can be appended back at the end.

  • segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.

Returns:

The pivoted dataframe, with rows converted to columns.

Return type:

pandas.DataFrame
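
Converting rows to columns in this sense means going from one row per (trajectory, statistic, feature) to one record per trajectory. The sketch below shows that reshaping on tuples; the input layout and the "<statistic>_<feature>" column naming are hypothetical and may not match the columns that pivot_stats_df produces.

```python
from typing import Dict, Iterable, Tuple

LongRow = Tuple[str, str, str, float]  # (traj_id, statistic, feature, value)


def pivot_stats(long_rows: Iterable[LongRow]) -> Dict[str, Dict[str, float]]:
    """Collapse long-format stats into one wide record per trajectory."""
    wide: Dict[str, Dict[str, float]] = {}
    for traj_id, statistic, feature, value in long_rows:
        # Each (statistic, feature) pair becomes its own column.
        wide.setdefault(traj_id, {})[f"{statistic}_{feature}"] = value
    return wide


long_rows = [
    ("a", "mean", "speed", 2.0),
    ("a", "max", "speed", 3.0),
    ("b", "mean", "speed", 1.5),
    ("b", "max", "speed", 1.8),
]
wide = pivot_stats(long_rows)
```

A wide layout with one row per trajectory is the form most ML estimators expect as a feature matrix.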

static segment_traj_by_days(dataframe: PTRAILDataFrame, num_days)[source]

Given a dataframe containing trajectory data, split every trajectory into segments that each span num_days days.

Parameters:
  • dataframe (PTRAILDataFrame) – The dataframe containing trajectory data.

  • num_days (int) – The number of days that each segment is supposed to have.

Returns:

The dataframe containing segmented trajectories with a new column added called segment_id.

Return type:

pandas.DataFrame
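
One plausible way to assign segment IDs is to count whole days elapsed since a trajectory's first point and divide by num_days. The sketch below does that for a single trajectory; the zero-based numbering, the choice of the first timestamp as the origin, and the helper name are assumptions and may differ from how segment_traj_by_days numbers its segments.

```python
from datetime import datetime
from typing import List


def segment_ids_by_days(timestamps: List[datetime], num_days: int) -> List[int]:
    """Return a zero-based segment id for every timestamp of one trajectory."""
    if not timestamps:
        return []
    start = min(timestamps)
    # Whole days elapsed since the first point, bucketed into num_days blocks.
    return [(ts - start).days // num_days for ts in timestamps]


timestamps = [
    datetime(2021, 1, 1, 0, 0),
    datetime(2021, 1, 3, 12, 0),
    datetime(2021, 1, 8, 0, 0),
    datetime(2021, 1, 15, 0, 0),
]
segment_ids = segment_ids_by_days(timestamps, num_days=7)
```

With num_days set to 7 the segments correspond to consecutive weeks measured from the first observation.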

Module contents