Repository: alteryx/featuretools Branch: main Commit: 938a0f6ccb98 Files: 501 Total size: 2.3 MB Directory structure: gitextract_b07mgx0i/ ├── .codecov.yml ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── blank_issue.md │ │ ├── bug_report.md │ │ ├── config.yml │ │ ├── documentation_improvement.md │ │ └── feature_request.md │ ├── auto_assign.yml │ └── workflows/ │ ├── auto_approve_dependency_PRs.yaml │ ├── broken_link_check.yaml │ ├── build_docs.yaml │ ├── create_feedstock_pr.yaml │ ├── install_test.yaml │ ├── kickoff_evalml_unit_tests.yaml │ ├── latest_dependency_checker.yaml │ ├── lint_check.yaml │ ├── minimum_dependency_checker.yaml │ ├── performance-check.yaml │ ├── pull_request_check.yaml │ ├── release.yaml │ ├── release_notes_updated.yaml │ ├── test_without_test_dependencies.yaml │ ├── tests_with_latest_deps.yaml │ ├── tests_with_minimum_deps.yaml │ └── tests_with_woodwork_main_branch.yaml ├── .gitignore ├── .pre-commit-config.yaml ├── .readthedocs.yaml ├── LICENSE ├── Makefile ├── README.md ├── contributing.md ├── docs/ │ ├── Makefile │ ├── backport_release.md │ ├── make.bat │ ├── notebook_version_standardizer.py │ ├── pull_request_template.md │ └── source/ │ ├── _static/ │ │ └── style.css │ ├── api_reference.rst │ ├── conf.py │ ├── getting_started/ │ │ ├── afe.ipynb │ │ ├── getting_started_index.rst │ │ ├── handling_time.ipynb │ │ ├── primitives.ipynb │ │ ├── using_entitysets.ipynb │ │ └── woodwork_types.ipynb │ ├── guides/ │ │ ├── advanced_custom_primitives.ipynb │ │ ├── deployment.ipynb │ │ ├── feature_descriptions.ipynb │ │ ├── feature_selection.ipynb │ │ ├── guides_index.rst │ │ ├── performance.ipynb │ │ ├── specifying_primitive_options.ipynb │ │ ├── sql_database_integration.ipynb │ │ ├── time_series.ipynb │ │ └── tuning_dfs.ipynb │ ├── index.ipynb │ ├── install.md │ ├── release_notes.rst │ ├── resources/ │ │ ├── ecosystem.rst │ │ ├── frequently_asked_questions.ipynb │ │ ├── help.rst │ │ ├── resources_index.rst │ │ ├── transition_to_ft_v1.0.ipynb │ │ └── 
usage_tips/ │ │ ├── glossary.rst │ │ └── limitations.rst │ ├── set-headers.py │ ├── setup.py │ └── templates/ │ └── layout.html ├── featuretools/ │ ├── __init__.py │ ├── __main__.py │ ├── computational_backends/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── calculate_feature_matrix.py │ │ ├── feature_set.py │ │ ├── feature_set_calculator.py │ │ └── utils.py │ ├── config_init.py │ ├── demo/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── flight.py │ │ ├── mock_customer.py │ │ ├── retail.py │ │ └── weather.py │ ├── entityset/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── deserialize.py │ │ ├── entityset.py │ │ ├── relationship.py │ │ ├── serialize.py │ │ └── timedelta.py │ ├── exceptions.py │ ├── feature_base/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── cache.py │ │ ├── feature_base.py │ │ ├── feature_descriptions.py │ │ ├── feature_visualizer.py │ │ ├── features_deserializer.py │ │ ├── features_serializer.py │ │ └── utils.py │ ├── feature_discovery/ │ │ ├── FeatureCollection.py │ │ ├── LiteFeature.py │ │ ├── __init__.py │ │ ├── convertors.py │ │ ├── feature_discovery.py │ │ ├── type_defs.py │ │ └── utils.py │ ├── primitives/ │ │ ├── __init__.py │ │ ├── base/ │ │ │ ├── __init__.py │ │ │ ├── aggregation_primitive_base.py │ │ │ ├── primitive_base.py │ │ │ └── transform_primitive_base.py │ │ ├── options_utils.py │ │ ├── standard/ │ │ │ ├── __init__.py │ │ │ ├── aggregation/ │ │ │ │ ├── __init__.py │ │ │ │ ├── all_primitive.py │ │ │ │ ├── any_primitive.py │ │ │ │ ├── average_count_per_unique.py │ │ │ │ ├── avg_time_between.py │ │ │ │ ├── count.py │ │ │ │ ├── count_above_mean.py │ │ │ │ ├── count_below_mean.py │ │ │ │ ├── count_greater_than.py │ │ │ │ ├── count_inside_nth_std.py │ │ │ │ ├── count_inside_range.py │ │ │ │ ├── count_less_than.py │ │ │ │ ├── count_outside_nth_std.py │ │ │ │ ├── count_outside_range.py │ │ │ │ ├── date_first_event.py │ │ │ │ ├── entropy.py │ │ │ │ ├── first.py │ │ │ │ ├── first_last_time_delta.py │ │ │ │ ├── has_no_duplicates.py │ │ │ │ ├── 
is_monotonically_decreasing.py │ │ │ │ ├── is_monotonically_increasing.py │ │ │ │ ├── is_unique.py │ │ │ │ ├── kurtosis.py │ │ │ │ ├── last.py │ │ │ │ ├── max_consecutive_false.py │ │ │ │ ├── max_consecutive_negatives.py │ │ │ │ ├── max_consecutive_positives.py │ │ │ │ ├── max_consecutive_true.py │ │ │ │ ├── max_consecutive_zeros.py │ │ │ │ ├── max_count.py │ │ │ │ ├── max_min_delta.py │ │ │ │ ├── max_primitive.py │ │ │ │ ├── mean.py │ │ │ │ ├── median.py │ │ │ │ ├── median_count.py │ │ │ │ ├── min_count.py │ │ │ │ ├── min_primitive.py │ │ │ │ ├── mode.py │ │ │ │ ├── n_most_common.py │ │ │ │ ├── n_most_common_frequency.py │ │ │ │ ├── n_unique_days.py │ │ │ │ ├── n_unique_days_of_calendar_year.py │ │ │ │ ├── n_unique_days_of_month.py │ │ │ │ ├── n_unique_months.py │ │ │ │ ├── n_unique_weeks.py │ │ │ │ ├── num_consecutive_greater_mean.py │ │ │ │ ├── num_consecutive_less_mean.py │ │ │ │ ├── num_false_since_last_true.py │ │ │ │ ├── num_peaks.py │ │ │ │ ├── num_true.py │ │ │ │ ├── num_true_since_last_false.py │ │ │ │ ├── num_unique.py │ │ │ │ ├── num_zero_crossings.py │ │ │ │ ├── percent_true.py │ │ │ │ ├── percent_unique.py │ │ │ │ ├── skew.py │ │ │ │ ├── std.py │ │ │ │ ├── sum_primitive.py │ │ │ │ ├── time_since_first.py │ │ │ │ ├── time_since_last.py │ │ │ │ ├── time_since_last_false.py │ │ │ │ ├── time_since_last_max.py │ │ │ │ ├── time_since_last_min.py │ │ │ │ ├── time_since_last_true.py │ │ │ │ ├── trend.py │ │ │ │ └── variance.py │ │ │ └── transform/ │ │ │ ├── __init__.py │ │ │ ├── absolute_diff.py │ │ │ ├── binary/ │ │ │ │ ├── __init__.py │ │ │ │ ├── add_numeric.py │ │ │ │ ├── add_numeric_scalar.py │ │ │ │ ├── and_primitive.py │ │ │ │ ├── divide_by_feature.py │ │ │ │ ├── divide_numeric.py │ │ │ │ ├── divide_numeric_scalar.py │ │ │ │ ├── equal.py │ │ │ │ ├── equal_scalar.py │ │ │ │ ├── greater_than.py │ │ │ │ ├── greater_than_equal_to.py │ │ │ │ ├── greater_than_equal_to_scalar.py │ │ │ │ ├── greater_than_scalar.py │ │ │ │ ├── less_than.py │ │ │ │ ├── 
less_than_equal_to.py │ │ │ │ ├── less_than_equal_to_scalar.py │ │ │ │ ├── less_than_scalar.py │ │ │ │ ├── modulo_by_feature.py │ │ │ │ ├── modulo_numeric.py │ │ │ │ ├── modulo_numeric_scalar.py │ │ │ │ ├── multiply_boolean.py │ │ │ │ ├── multiply_numeric.py │ │ │ │ ├── multiply_numeric_boolean.py │ │ │ │ ├── multiply_numeric_scalar.py │ │ │ │ ├── not_equal.py │ │ │ │ ├── not_equal_scalar.py │ │ │ │ ├── or_primitive.py │ │ │ │ ├── scalar_subtract_numeric_feature.py │ │ │ │ ├── subtract_numeric.py │ │ │ │ └── subtract_numeric_scalar.py │ │ │ ├── cumulative/ │ │ │ │ ├── __init__.py │ │ │ │ ├── cum_count.py │ │ │ │ ├── cum_max.py │ │ │ │ ├── cum_mean.py │ │ │ │ ├── cum_min.py │ │ │ │ ├── cum_sum.py │ │ │ │ ├── cumulative_time_since_last_false.py │ │ │ │ └── cumulative_time_since_last_true.py │ │ │ ├── datetime/ │ │ │ │ ├── __init__.py │ │ │ │ ├── age.py │ │ │ │ ├── date_to_holiday.py │ │ │ │ ├── date_to_timezone.py │ │ │ │ ├── day.py │ │ │ │ ├── day_of_year.py │ │ │ │ ├── days_in_month.py │ │ │ │ ├── diff_datetime.py │ │ │ │ ├── distance_to_holiday.py │ │ │ │ ├── hour.py │ │ │ │ ├── is_federal_holiday.py │ │ │ │ ├── is_first_week_of_month.py │ │ │ │ ├── is_leap_year.py │ │ │ │ ├── is_lunch_time.py │ │ │ │ ├── is_month_end.py │ │ │ │ ├── is_month_start.py │ │ │ │ ├── is_quarter_end.py │ │ │ │ ├── is_quarter_start.py │ │ │ │ ├── is_weekend.py │ │ │ │ ├── is_working_hours.py │ │ │ │ ├── is_year_end.py │ │ │ │ ├── is_year_start.py │ │ │ │ ├── minute.py │ │ │ │ ├── month.py │ │ │ │ ├── part_of_day.py │ │ │ │ ├── quarter.py │ │ │ │ ├── season.py │ │ │ │ ├── second.py │ │ │ │ ├── time_since.py │ │ │ │ ├── time_since_previous.py │ │ │ │ ├── utils.py │ │ │ │ ├── week.py │ │ │ │ ├── weekday.py │ │ │ │ └── year.py │ │ │ ├── email/ │ │ │ │ ├── __init__.py │ │ │ │ ├── email_address_to_domain.py │ │ │ │ └── is_free_email_domain.py │ │ │ ├── exponential/ │ │ │ │ ├── __init__.py │ │ │ │ ├── exponential_weighted_average.py │ │ │ │ ├── exponential_weighted_std.py │ │ │ │ └── 
exponential_weighted_variance.py │ │ │ ├── file_extension.py │ │ │ ├── full_name_to_first_name.py │ │ │ ├── full_name_to_last_name.py │ │ │ ├── full_name_to_title.py │ │ │ ├── is_in.py │ │ │ ├── is_null.py │ │ │ ├── latlong/ │ │ │ │ ├── __init__.py │ │ │ │ ├── cityblock_distance.py │ │ │ │ ├── geomidpoint.py │ │ │ │ ├── haversine.py │ │ │ │ ├── is_in_geobox.py │ │ │ │ ├── latitude.py │ │ │ │ ├── longitude.py │ │ │ │ └── utils.py │ │ │ ├── natural_language/ │ │ │ │ ├── __init__.py │ │ │ │ ├── constants.py │ │ │ │ ├── count_string.py │ │ │ │ ├── mean_characters_per_word.py │ │ │ │ ├── median_word_length.py │ │ │ │ ├── num_characters.py │ │ │ │ ├── num_unique_separators.py │ │ │ │ ├── num_words.py │ │ │ │ ├── number_of_common_words.py │ │ │ │ ├── number_of_hashtags.py │ │ │ │ ├── number_of_mentions.py │ │ │ │ ├── number_of_unique_words.py │ │ │ │ ├── number_of_words_in_quotes.py │ │ │ │ ├── punctuation_count.py │ │ │ │ ├── title_word_count.py │ │ │ │ ├── total_word_length.py │ │ │ │ ├── upper_case_count.py │ │ │ │ ├── upper_case_word_count.py │ │ │ │ └── whitespace_count.py │ │ │ ├── not_primitive.py │ │ │ ├── nth_week_of_month.py │ │ │ ├── numeric/ │ │ │ │ ├── __init__.py │ │ │ │ ├── absolute.py │ │ │ │ ├── cosine.py │ │ │ │ ├── diff.py │ │ │ │ ├── natural_logarithm.py │ │ │ │ ├── negate.py │ │ │ │ ├── percentile.py │ │ │ │ ├── rate_of_change.py │ │ │ │ ├── same_as_previous.py │ │ │ │ ├── sine.py │ │ │ │ ├── square_root.py │ │ │ │ └── tangent.py │ │ │ ├── percent_change.py │ │ │ ├── postal/ │ │ │ │ ├── __init__.py │ │ │ │ ├── one_digit_postal_code.py │ │ │ │ └── two_digit_postal_code.py │ │ │ ├── savgol_filter.py │ │ │ ├── time_series/ │ │ │ │ ├── __init__.py │ │ │ │ ├── expanding/ │ │ │ │ │ ├── __init__.py │ │ │ │ │ ├── expanding_count.py │ │ │ │ │ ├── expanding_max.py │ │ │ │ │ ├── expanding_mean.py │ │ │ │ │ ├── expanding_min.py │ │ │ │ │ ├── expanding_std.py │ │ │ │ │ └── expanding_trend.py │ │ │ │ ├── lag.py │ │ │ │ ├── numeric_lag.py │ │ │ │ ├── 
rolling_count.py │ │ │ │ ├── rolling_max.py │ │ │ │ ├── rolling_mean.py │ │ │ │ ├── rolling_min.py │ │ │ │ ├── rolling_outlier_count.py │ │ │ │ ├── rolling_std.py │ │ │ │ ├── rolling_trend.py │ │ │ │ └── utils.py │ │ │ └── url/ │ │ │ ├── __init__.py │ │ │ ├── url_to_domain.py │ │ │ ├── url_to_protocol.py │ │ │ └── url_to_tld.py │ │ └── utils.py │ ├── selection/ │ │ ├── __init__.py │ │ ├── api.py │ │ └── selection.py │ ├── synthesis/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── deep_feature_synthesis.py │ │ ├── dfs.py │ │ ├── encode_features.py │ │ ├── get_valid_primitives.py │ │ └── utils.py │ ├── tests/ │ │ ├── __init__.py │ │ ├── computational_backend/ │ │ │ ├── __init__.py │ │ │ ├── test_calculate_feature_matrix.py │ │ │ ├── test_feature_set.py │ │ │ ├── test_feature_set_calculator.py │ │ │ └── test_utils.py │ │ ├── config_tests/ │ │ │ ├── __init__.py │ │ │ └── test_config.py │ │ ├── conftest.py │ │ ├── demo_tests/ │ │ │ ├── __init__.py │ │ │ └── test_demo_data.py │ │ ├── entityset_tests/ │ │ │ ├── __init__.py │ │ │ ├── test_es.py │ │ │ ├── test_es_metadata.py │ │ │ ├── test_last_time_index.py │ │ │ ├── test_plotting.py │ │ │ ├── test_relationship.py │ │ │ ├── test_serialization.py │ │ │ ├── test_timedelta.py │ │ │ └── test_ww_es.py │ │ ├── entry_point_tests/ │ │ │ ├── __init__.py │ │ │ ├── add-ons/ │ │ │ │ ├── __init__.py │ │ │ │ ├── featuretools_plugin/ │ │ │ │ │ ├── __init__.py │ │ │ │ │ ├── featuretools_plugin/ │ │ │ │ │ │ └── __init__.py │ │ │ │ │ └── setup.py │ │ │ │ └── featuretools_primitives/ │ │ │ │ ├── __init__.py │ │ │ │ ├── featuretools_primitives/ │ │ │ │ │ ├── __init__.py │ │ │ │ │ ├── existing_primitive.py │ │ │ │ │ ├── invalid_primitive.py │ │ │ │ │ └── new_primitive.py │ │ │ │ └── setup.py │ │ │ ├── test_plugin.py │ │ │ ├── test_primitives.py │ │ │ └── utils.py │ │ ├── feature_discovery/ │ │ │ ├── __init__.py │ │ │ ├── test_convertors.py │ │ │ ├── test_feature_collection.py │ │ │ ├── test_feature_discovery.py │ │ │ └── test_type_defs.py │ │ ├── 
primitive_tests/ │ │ │ ├── __init__.py │ │ │ ├── aggregation_primitive_tests/ │ │ │ │ ├── __init__.py │ │ │ │ ├── test_agg_primitives.py │ │ │ │ ├── test_count_aggregation_primitives.py │ │ │ │ ├── test_max_consecutive.py │ │ │ │ ├── test_num_consecutive.py │ │ │ │ ├── test_percent_true.py │ │ │ │ ├── test_rolling_primitive.py │ │ │ │ └── test_time_since.py │ │ │ ├── bad_primitive_files/ │ │ │ │ ├── __init__.py │ │ │ │ ├── multiple_primitives.py │ │ │ │ └── no_primitives.py │ │ │ ├── natural_language_primitives_tests/ │ │ │ │ ├── __init__.py │ │ │ │ ├── test_count_string.py │ │ │ │ ├── test_mean_characters_per_word.py │ │ │ │ ├── test_median_word_length.py │ │ │ │ ├── test_natural_language_primitives_terminate.py │ │ │ │ ├── test_num_characters.py │ │ │ │ ├── test_num_unique_separators.py │ │ │ │ ├── test_num_words.py │ │ │ │ ├── test_number_of_common_words.py │ │ │ │ ├── test_number_of_hashtags.py │ │ │ │ ├── test_number_of_mentions.py │ │ │ │ ├── test_number_of_unique_words.py │ │ │ │ ├── test_number_of_words_in_quotes.py │ │ │ │ ├── test_punctuation_count.py │ │ │ │ ├── test_title_word_count.py │ │ │ │ ├── test_total_word_length.py │ │ │ │ ├── test_upper_case_count.py │ │ │ │ ├── test_upper_case_word_count.py │ │ │ │ └── test_whitespace_count.py │ │ │ ├── primitives_to_install/ │ │ │ │ ├── __init__.py │ │ │ │ ├── custom_max.py │ │ │ │ ├── custom_mean.py │ │ │ │ └── custom_sum.py │ │ │ ├── test_absolute_diff.py │ │ │ ├── test_agg_feats.py │ │ │ ├── test_all_primitive_docstrings.py │ │ │ ├── test_direct_features.py │ │ │ ├── test_feature_base.py │ │ │ ├── test_feature_descriptions.py │ │ │ ├── test_feature_serialization.py │ │ │ ├── test_feature_utils.py │ │ │ ├── test_feature_visualizer.py │ │ │ ├── test_features_deserializer.py │ │ │ ├── test_features_serializer.py │ │ │ ├── test_groupby_transform_primitives.py │ │ │ ├── test_identity_features.py │ │ │ ├── test_overrides.py │ │ │ ├── test_primitive_base.py │ │ │ ├── test_primitive_utils.py │ │ │ ├── 
test_rolling_primitive_utils.py │ │ │ ├── test_transform_features.py │ │ │ ├── transform_primitive_tests/ │ │ │ │ ├── __init__.py │ │ │ │ ├── test_cumulative_time_since.py │ │ │ │ ├── test_datetoholiday_primitive.py │ │ │ │ ├── test_distancetoholiday_primitive.py │ │ │ │ ├── test_expanding_primitives.py │ │ │ │ ├── test_exponential_primitives.py │ │ │ │ ├── test_full_name_primitives.py │ │ │ │ ├── test_is_federal_holiday.py │ │ │ │ ├── test_latlong_primitives.py │ │ │ │ ├── test_percent_change.py │ │ │ │ ├── test_percent_unique.py │ │ │ │ ├── test_postal_primitives.py │ │ │ │ ├── test_same_as_previous.py │ │ │ │ ├── test_savgol_filter.py │ │ │ │ ├── test_season.py │ │ │ │ └── test_transform_primitive.py │ │ │ └── utils.py │ │ ├── profiling/ │ │ │ ├── __init__.py │ │ │ └── dfs_profile.py │ │ ├── requirement_files/ │ │ │ ├── latest_requirements.txt │ │ │ ├── minimum_core_requirements.txt │ │ │ ├── minimum_dask_requirements.txt │ │ │ └── minimum_test_requirements.txt │ │ ├── selection/ │ │ │ ├── __init__.py │ │ │ └── test_selection.py │ │ ├── synthesis/ │ │ │ ├── __init__.py │ │ │ ├── test_deep_feature_synthesis.py │ │ │ ├── test_dfs_method.py │ │ │ ├── test_encode_features.py │ │ │ └── test_get_valid_primitives.py │ │ ├── test_version.py │ │ ├── testing_utils/ │ │ │ ├── __init__.py │ │ │ ├── cluster.py │ │ │ ├── es_utils.py │ │ │ ├── features.py │ │ │ ├── generate_fake_dataframe.py │ │ │ └── mock_ds.py │ │ └── utils_tests/ │ │ ├── __init__.py │ │ ├── test_config.py │ │ ├── test_description_utils.py │ │ ├── test_entry_point.py │ │ ├── test_gen_utils.py │ │ ├── test_recommend_primitives.py │ │ ├── test_time_utils.py │ │ ├── test_trie.py │ │ └── test_utils_info.py │ ├── utils/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── common_tld_utils.py │ │ ├── description_utils.py │ │ ├── entry_point.py │ │ ├── gen_utils.py │ │ ├── plot_utils.py │ │ ├── recommend_primitives.py │ │ ├── s3_utils.py │ │ ├── schema_utils.py │ │ ├── time_utils.py │ │ ├── trie.py │ │ ├── utils_info.py │ │ 
└── wrangle.py │ └── version.py ├── pyproject.toml └── release.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .codecov.yml ================================================ codecov: notify: after_n_builds: 5 ================================================ FILE: .github/ISSUE_TEMPLATE/blank_issue.md ================================================ --- name: Blank Issue about: Create a blank issue title: '' labels: '' assignees: '' --- ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.md ================================================ --- name: Bug Report about: Create a bug report to help us improve Featuretools title: '' labels: 'bug' assignees: '' --- [A clear and concise description of what the bug is.] #### Code Sample, a copy-pastable example to reproduce your bug. ```python # Your code here ``` #### Output of ``featuretools.show_info()``
[paste the output of ``featuretools.show_info()`` here below this line]
================================================ FILE: .github/ISSUE_TEMPLATE/config.yml ================================================ blank_issues_enabled: true contact_links: - name: General Technical Question about: "If you have a question like *How should I create my EntitySet?* you can ask on StackOverflow using the #featuretools tag." url: https://stackoverflow.com/questions/tagged/featuretools - name: Real-time chat url: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA about: "If you want to meet others in the community and chat about all things Alteryx OSS then check out our Slack." ================================================ FILE: .github/ISSUE_TEMPLATE/documentation_improvement.md ================================================ --- name: Documentation Improvement about: Suggest an idea for improving the documentation title: '' labels: 'documentation' assignees: '' --- [a description of what documentation you believe needs to be fixed/improved] ================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.md ================================================ --- name: Feature Request about: Suggest an idea for this project title: '' labels: 'new feature' assignees: '' --- - As a [user/developer], I wish I could use Featuretools to ... 
#### Code Example ```python # Your code here, if applicable ``` ================================================ FILE: .github/auto_assign.yml ================================================ # Set to author to set pr creator as assignee addAssignees: author ================================================ FILE: .github/workflows/auto_approve_dependency_PRs.yaml ================================================ name: Auto Approve Dependency PRs on: schedule: - cron: '*/30 * * * *' workflow_dispatch: workflow_run: workflows: ["Unit Tests - Latest Dependencies", "Unit Tests - 3.9 Minimum Dependencies"] branches: - 'latest-dep-update-[a-f0-9]+' - 'min-dep-update-[a-f0-9]+' types: - completed jobs: build: if: ${{ github.repository_owner == 'alteryx' }} runs-on: ubuntu-latest steps: - name: Find dependency PRs id: find_prs run: | gh auth status gh pr list --repo "${{ github.repository }}" --assignee "machineFL" --base main --state open --search "status:success review:required" --limit 1 --json number > dep_PRs_waiting_approval.json dep_pull_request=$(cat dep_PRs_waiting_approval.json | grep -Eo "[0-9]*") echo ::set-output name=dep_pull_request::${dep_pull_request} env: GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }} - name: Approve dependency PRs and enable auto-merge if: ${{ steps.find_prs.outputs.dep_pull_request > 1 }} run: | gh pr review --repo "${{ github.repository }}" --comment --body "auto approve" ${{ steps.find_prs.outputs.dep_pull_request }} gh pr review --repo "${{ github.repository }}" --approve ${{ steps.find_prs.outputs.dep_pull_request }} gh pr merge --repo "${{ github.repository }}" --auto --squash --delete-branch ${{ steps.find_prs.outputs.dep_pull_request }} env: GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }} ================================================ FILE: .github/workflows/broken_link_check.yaml ================================================ name: Broken link check on: workflow_dispatch: schedule: - cron: "* * * * 1" jobs: 
my-broken-link-checker: name: Check for broken links runs-on: ubuntu-latest strategy: fail-fast: false steps: - name: Check for broken links uses: ruzickap/action-my-broken-link-checker@v2 with: url: https://featuretools.alteryx.com/en/latest/ cmd_params: '--max-connections=10 --color=always --ignore-fragments --buffer-size=8192 --skip-tls-verification --exclude="(twitter|github|cloudflare|featuretools\\.alteryx\\.com\\/en\\/(stable|main|v.+).*)"' - name: Add to job output run: echo "${{steps.link-report.outputs.result}}" >> $GITHUB_STEP_SUMMARY ================================================ FILE: .github/workflows/build_docs.yaml ================================================ name: Build Docs on: pull_request: types: [opened, synchronize] push: branches: - main workflow_dispatch: env: PYARROW_IGNORE_TIMEZONE: 1 JAVA_HOME: "/usr/lib/jvm/java-11-openjdk-amd64" jobs: build_docs: name: ${{ matrix.python_version }} build docs runs-on: ubuntu-latest strategy: fail-fast: false matrix: python_version: ["3.9", "3.10", "3.11", "3.12"] steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Set up python ${{ matrix.python_version }} uses: actions/setup-python@v4 with: python-version: ${{ matrix.python_version }} cache: 'pip' cache-dependency-path: 'pyproject.toml' - uses: actions/cache@v3 id: cache with: path: ${{ env.pythonLocation }} key: ${{ matrix.python_version }}-docs-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01 - name: Build featuretools package run: | make package - name: Install complete version of featuretools from sdist (not using cache) if: steps.cache.outputs.cache-hit != 'true' run: | python -m pip install "unpacked_sdist/[dev]" - name: Install complete version of featuretools from sdist (using cache) if: steps.cache.outputs.cache-hit == 'true' run: | python -m pip install "unpacked_sdist/[dev]" --no-deps 
- name: Install apt packages run: | sudo apt update sudo apt install -y pandoc sudo apt install -y graphviz python -m pip check - name: Build docs run: make -C docs/ -e "SPHINXOPTS=-W -j auto" clean html ================================================ FILE: .github/workflows/create_feedstock_pr.yaml ================================================ on: workflow_dispatch: inputs: version: description: 'released PyPI version to use (ex - v1.11.1)' required: true name: Create Feedstock PR jobs: create_feedstock_pr: name: Create Feedstock PR runs-on: ubuntu-latest steps: - name: Checkout inputted version uses: actions/checkout@v3 with: repository: ${{ github.event.pull_request.head.repo.full_name }} ref: ${{ github.event.inputs.version }} path: "./featuretools" - name: Pull latest from upstream for user forked feedstock run: | gh auth status gh repo sync alteryx/featuretools-feedstock --branch main --source conda-forge/featuretools-feedstock --force env: GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }} - uses: actions/checkout@v3 with: repository: alteryx/featuretools-feedstock ref: main path: "./featuretools-feedstock" fetch-depth: '0' - name: Run Create Feedstock meta YAML id: create-feedstock-meta uses: alteryx/create-feedstock-meta-yaml@v4 with: project: "featuretools" pypi_version: ${{ github.event.inputs.version }} project_metadata_filepath: "featuretools/pyproject.toml" meta_yaml_filepath: "featuretools-feedstock/recipe/meta.yaml" add_to_test_requirements: "graphviz !=2.47.2" - name: View updated meta yaml run: cat featuretools-feedstock/recipe/meta.yaml - name: Push updated yaml run: | cd featuretools-feedstock git config --unset-all http.https://github.com/.extraheader git config --global user.email "machineOSS@alteryx.com" git config --global user.name "machineAYX Bot" git remote set-url origin https://${{ secrets.AUTO_APPROVE_TOKEN }}@github.com/alteryx/featuretools-feedstock git checkout -b ${{ github.event.inputs.version }} git add recipe/meta.yaml git 
commit -m "${{ github.event.inputs.version }}" git push origin ${{ github.event.inputs.version }} - name: Adding URL to job output run: | echo "Conda Feedstock Pull Request: https://github.com/alteryx/featuretools-feedstock/pull/new/${{ github.event.inputs.version }}" >> $GITHUB_STEP_SUMMARY ================================================ FILE: .github/workflows/install_test.yaml ================================================ name: Install Test on: pull_request: types: [opened, synchronize] push: branches: - main env: ALTERYX_OPEN_SRC_UPDATE_CHECKER: False jobs: install_ft_complete: name: ${{ matrix.os }} - ${{ matrix.python_version }} install featuretools complete strategy: fail-fast: false matrix: os: [ubuntu-latest, macos-latest, windows-latest] python_version: ["3.9", "3.10", "3.11", "3.12"] runs-on: ${{ matrix.os }} steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Set up python ${{ matrix.python_version }} uses: actions/setup-python@v4 with: python-version: ${{ matrix.python_version }} cache: 'pip' cache-dependency-path: 'pyproject.toml' - name: Build featuretools package run: | make package - name: Install complete version of featuretools from sdist run: | python -m pip install "unpacked_sdist/[complete]" - name: Test by importing packages run: | python -c "import premium_primitives" python -c "from nlp_primitives import PolarityScore" - name: Check package conflicts run: | python -m pip check - name: Verify extra_requires commands run: | python -m pip install "unpacked_sdist/[nlp]" ================================================ FILE: .github/workflows/kickoff_evalml_unit_tests.yaml ================================================ name: Kickoff EvalML Unit Tests on: push: branches: - main workflow_dispatch: jobs: kickoff: name: Run EvalML unit tests if: github.repository_owner == 'alteryx' runs-on: ubuntu-latest steps: 
- name: Run workflow for EvalML unit tests run: gh workflow run unit_tests_with_featuretools_main_branch.yaml --repo "alteryx/evalml" env: GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }} ================================================ FILE: .github/workflows/latest_dependency_checker.yaml ================================================ # This workflow will install dependencies and if any critical dependencies have changed a pull request # will be created which will trigger a CI run with the new dependencies. name: Latest Dependency Checker on: schedule: - cron: '0 * * * *' workflow_dispatch: jobs: build: if: ${{ github.repository_owner == 'alteryx' }} runs-on: ubuntu-latest timeout-minutes: 5 steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - uses: actions/setup-python@v4 with: python-version: 3.9 - name: Update dependencies run: | python -m pip install --upgrade pip python -m pip install -e ".[dask,test]" make checkdeps OUTPUT_PATH=featuretools/tests/requirement_files/latest_requirements.txt - name: Create pull request uses: peter-evans/create-pull-request@v3 with: token: ${{ secrets.REPO_SCOPED_TOKEN }} commit-message: Update latest dependencies title: Automated Latest Dependency Updates author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> body: "This is an auto-generated PR with **latest** dependency updates. Please do not delete the `latest-dep-update` branch because it's needed by the auto-dependency bot." 
branch: latest-dep-update branch-suffix: short-commit-hash base: main assignees: machineFL reviewers: machineAYX ================================================ FILE: .github/workflows/lint_check.yaml ================================================ name: Lint Check on: pull_request: types: [opened, synchronize] push: branches: - main jobs: lint_check: name: ${{ matrix.python_version }} lint check runs-on: ubuntu-latest strategy: fail-fast: false matrix: python_version: ["3.12"] steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Set up python ${{ matrix.python_version }} uses: actions/setup-python@v4 with: python-version: ${{ matrix.python_version }} cache: 'pip' cache-dependency-path: 'pyproject.toml' - uses: actions/cache@v3 id: cache with: path: ${{ env.pythonLocation }} key: ${{ matrix.python_version }}-lint-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01 - name: Install featuretools with optional, dev, and test requirements (not using cache) if: steps.cache.outputs.cache-hit != 'true' run: | python -m pip install -e .[dev] - name: Install featuretools with no requirements (using cache) if: steps.cache.outputs.cache-hit == 'true' run: | python -m pip install -e .[dev] --no-deps - name: Run lint test run: make lint ================================================ FILE: .github/workflows/minimum_dependency_checker.yaml ================================================ name: Minimum Dependency Checker on: workflow_dispatch: push: branches: - main paths: - 'pyproject.toml' jobs: build: runs-on: ubuntu-latest steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Run min dep generator - test reqs id: min_dep_gen_test uses: alteryx/minimum-dependency-generator@v3 with: paths: 
'pyproject.toml' options: 'dependencies' extras_require: 'test' output_filepath: featuretools/tests/requirement_files/minimum_test_requirements.txt - name: Run min dep generator - core reqs id: min_dep_gen_core uses: alteryx/minimum-dependency-generator@v3 with: paths: 'pyproject.toml' options: 'dependencies' output_filepath: featuretools/tests/requirement_files/minimum_core_requirements.txt - name: Run min dep generator - dask id: min_dep_gen_dask uses: alteryx/minimum-dependency-generator@v3 with: paths: 'pyproject.toml' options: 'dependencies' extras_require: 'dask' output_filepath: featuretools/tests/requirement_files/minimum_dask_requirements.txt - name: Create Pull Request uses: peter-evans/create-pull-request@v3 with: token: ${{ secrets.REPO_SCOPED_TOKEN }} commit-message: Update minimum dependencies title: Automated Minimum Dependency Updates author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> body: "This is an auto-generated PR with **minimum** dependency updates. Please do not delete the `min-dep-update` branch because it's needed by the auto-dependency bot." 
branch: min-dep-update branch-suffix: short-commit-hash base: main assignees: machineFL reviewers: machineAYX ================================================ FILE: .github/workflows/performance-check.yaml ================================================ name: performance-check on: push: branches: - main workflow_dispatch: jobs: run-performance-analysis: runs-on: ubuntu-latest steps: - name: Configure AWS Credentials uses: aws-actions/configure-aws-credentials@v1 with: aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} aws-region: ${{ secrets.AWS_REGION }} - name: Run Lambda env: lambda_function: ${{ secrets.LAMBDA_FUNC }} run: | echo "{\"TestCommit\": \"$GITHUB_SHA\", \"Flags\": \"--upload-slack\"}" | base64 > payload.b64 aws lambda invoke --function-name $lambda_function --payload file://payload.b64 --invocation-type Event /dev/stdout 1>/dev/null ================================================ FILE: .github/workflows/pull_request_check.yaml ================================================ name: Pull Request Check on: pull_request: types: [opened, edited, reopened, synchronize] jobs: pull_request_check: name: pull request check runs-on: ubuntu-latest steps: - uses: nearform-actions/github-action-check-linked-issues@v1.4.5 id: check-linked-issues with: exclude-branches: "release_v**, backport_v**, main, latest-dep-update-**, min-dep-update-**, dependabot/**" github-token: ${{ secrets.REPO_SCOPED_TOKEN }} ================================================ FILE: .github/workflows/release.yaml ================================================ on: release: types: [published] name: Release jobs: pypi-publish: name: PyPI Release runs-on: ubuntu-latest permissions: id-token: write steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 - name: Install deps run: | python -m pip install --quiet --upgrade pip python -m pip install --quiet --upgrade build python -m pip install --quiet --upgrade setuptools - 
name: Remove build artifacts and docs run: | rm -rf .eggs/ dist/ build/ docs/ - name: Build distribution run: python -m build - name: Publish package distributions to PyPI uses: pypa/gh-action-pypi-publish@release/v1 - name: Run workflow to create feedstock pull request run: | gh workflow run create_feedstock_pr.yaml --repo "alteryx/featuretools" -f version=${{ github.event.release.tag_name }} env: GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }} ================================================ FILE: .github/workflows/release_notes_updated.yaml ================================================ name: Release Notes Updated on: pull_request: types: [opened, synchronize] jobs: release_notes_updated: name: release notes updated runs-on: ubuntu-latest steps: - name: Check for development branch id: branch shell: python env: REF: ${{ github.event.pull_request.head.ref }} run: | from re import compile import os main = '^main$' release = '^release_v\d+\.\d+\.\d+$' backport = '^backport_v\d+\.\d+\.\d+$' dep_update = '^latest-dep-update-[a-f0-9]{7}$' min_dep_update = '^min-dep-update-[a-f0-9]{7}$' regex = main, release, backport, dep_update, min_dep_update patterns = list(map(compile, regex)) ref = os.environ["REF"] is_dev = not any(pattern.match(ref) for pattern in patterns) print('::set-output name=is_dev::' + str(is_dev)) - if: ${{ steps.branch.outputs.is_dev == 'true' }} name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - if: ${{ steps.branch.outputs.is_dev == 'true' }} name: Check if release notes were updated run: cat docs/source/release_notes.rst | grep ":pr:\`${{ github.event.number }}\`" ================================================ FILE: .github/workflows/test_without_test_dependencies.yaml ================================================ name: Test without Test Dependencies on: pull_request: types: [opened, synchronize] push: branches: - main 
workflow_dispatch: jobs: use_featuretools_without_test_dependencies: name: Test featuretools without Test Dependencies runs-on: ubuntu-latest strategy: fail-fast: false steps: - name: Set up python 3.10 uses: actions/setup-python@v4 with: python-version: "3.10" - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Build featuretools and install run: | make package python -m pip install unpacked_sdist/ - name: Run simple featuretools usage run: | import featuretools as ft es = ft.demo.load_mock_customer(return_entityset=True) ft.dfs( entityset=es, target_dataframe_name="customers", agg_primitives=["count"], trans_primitives=["month"], max_depth=1, ) from featuretools.primitives import IsFreeEmailDomain is_free_email_domain = IsFreeEmailDomain() is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist() shell: python ================================================ FILE: .github/workflows/tests_with_latest_deps.yaml ================================================ name: Tests on: pull_request: types: [opened, synchronize] push: branches: - main workflow_dispatch: jobs: tests: name: ${{ matrix.python_version }} unit tests runs-on: ubuntu-latest strategy: fail-fast: false matrix: python_version: ["3.9", "3.10", "3.11", "3.12"] steps: - uses: actions/setup-python@v4 with: python-version: ${{ matrix.python_version }} - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Build featuretools package run: make package - name: Set up pip and graphviz run: | pip config --site set global.progress_bar off python -m pip install --upgrade pip sudo apt update && sudo apt install -y graphviz - name: Install featuretools with test requirements run: | python -m pip install -e unpacked_sdist/ python -m pip install -e 
unpacked_sdist/[test,dask] - if: ${{ matrix.python_version == 3.9 }} name: Generate coverage args run: echo "coverage_args=--cov=featuretools --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml" >> $GITHUB_ENV - if: ${{ env.coverage_args }} name: Erase coverage files run: | cd unpacked_sdist coverage erase - name: Run unit tests run: | cd unpacked_sdist pytest featuretools/ -n auto ${{ env.coverage_args }} - if: ${{ env.coverage_args }} name: Upload coverage to Codecov uses: codecov/codecov-action@v3 with: token: ${{ secrets.CODECOV_TOKEN }} fail_ci_if_error: true files: ${{ github.workspace }}/coverage.xml verbose: true win_unit_tests: name: ${{ matrix.python_version }} windows unit tests runs-on: windows-latest strategy: fail-fast: false matrix: python_version: ["3.9", "3.10", "3.11", "3.12"] steps: - name: Download miniconda shell: pwsh run: | $File = "Miniconda3-latest-Windows-x86_64.exe" $Uri = "https://repo.anaconda.com/miniconda/$File" $ProgressPreference = "silentlyContinue" Invoke-WebRequest -Uri $Uri -Outfile "$env:USERPROFILE/$File" $hashFromFile = Get-FileHash "$env:USERPROFILE/$File" -Algorithm SHA256 $hashFromUrl = "f4d6147b40ea6822255c2dcec8bb0d357c09e230976213f70d7b8c4a10d86bb0" if ($hashFromFile.Hash -ne "$hashFromUrl") { Throw "$File hashes do not match" } - name: Install miniconda shell: cmd run: start /wait "" %UserProfile%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\Miniconda3 - name: Create python ${{ matrix.python_version }} environment shell: pwsh run: | . $env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1 conda create -n featuretools python=${{ matrix.python_version }} - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - name: Install featuretools with test requirements shell: pwsh run: | . 
$env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1 conda activate featuretools conda config --add channels conda-forge conda install -q -y -c conda-forge python-graphviz graphviz python -m pip install --upgrade pip python -m pip install .[test,dask] - name: Run unit tests run: | . $env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1 conda activate featuretools pytest featuretools\ -n auto ================================================ FILE: .github/workflows/tests_with_minimum_deps.yaml ================================================ name: Tests - Minimum Dependencies on: pull_request: types: [opened, synchronize] push: branches: - main workflow_dispatch: jobs: py39_tests_minimum_dependencies: name: Tests - 3.9 Minimum Dependencies runs-on: ubuntu-latest strategy: fail-fast: false matrix: python_version: ["3.9"] steps: - name: Checkout repository uses: actions/checkout@v3 with: ref: ${{ github.event.pull_request.head.ref }} repository: ${{ github.event.pull_request.head.repo.full_name }} - uses: actions/setup-python@v4 with: python-version: 3.9 - name: Config pip, upgrade pip, and install graphviz run: | sudo apt update sudo apt install -y graphviz pip config --site set global.progress_bar off python -m pip install --upgrade pip python -m pip install wheel - name: Install featuretools with no dependencies run: | python -m pip install -e . 
--no-dependencies - name: Install featuretools - minimum tests dependencies run: | python -m pip install -r featuretools/tests/requirement_files/minimum_test_requirements.txt - name: Install featuretools - minimum core dependencies run: | python -m pip install -r featuretools/tests/requirement_files/minimum_core_requirements.txt - name: Install featuretools - minimum Dask dependencies run: | python -m pip install -r featuretools/tests/requirement_files/minimum_dask_requirements.txt - name: Run unit tests without code coverage run: python -m pytest -x -n auto featuretools/tests/ ================================================ FILE: .github/workflows/tests_with_woodwork_main_branch.yaml ================================================ name: Tests - Featuretools with Woodwork main branch on: workflow_dispatch: jobs: tests_woodwork_main: if: ${{ github.repository_owner == 'alteryx' }} name: ${{ matrix.python_version }} tests ${{ matrix.libraries }} runs-on: ubuntu-latest strategy: fail-fast: true matrix: python_version: ["3.9", "3.10", "3.11", "3.12"] steps: - uses: actions/setup-python@v4 with: python-version: ${{ matrix.python_version }} - name: Checkout repository uses: actions/checkout@v3 - name: Build featuretools package run: make package - name: Set up pip and graphviz run: | pip config --site set global.progress_bar off python -m pip install -U pip sudo apt update && sudo apt install -y graphviz - name: Install Woodwork & Featuretools - test requirements run: | python -m pip install -e unpacked_sdist/[test,dask] python -m pip uninstall -y woodwork python -m pip install https://github.com/alteryx/woodwork/archive/main.zip - name: Log test run info run: | echo "Run unit tests without code coverage for ${{ matrix.python_version }}" echo "Testing with woodwork version:" `python -c "import woodwork; print(woodwork.__version__)"` - name: Run unit tests without code coverage run: pytest featuretools/ -n auto slack_alert_failure: name: Send Slack alert if failure 
needs: tests_woodwork_main runs-on: ubuntu-latest if: ${{ always() }} steps: - name: Send Slack alert if failure if: ${{ needs.tests_woodwork_main.result != 'success' }} id: slack uses: slackapi/slack-github-action@v1 with: payload: | { "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" } env: SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} ================================================ FILE: .gitignore ================================================ # docs/source/generated/ docs/source/getting_started/graphs venv/ data/ installed/ output.csv htmlcov/ .idea/ featuretools/tests/integration_data/*.csv featuretools/tests/integration_data/*.gzip featuretools/tests/integration_data/customers.gzip featuretools/tests/integration_data/log-0.gzip featuretools/tests/integration_data/log-1.gzip featuretools/tests/integration_data/log.gzip featuretools/tests/integration_data/products.gzip featuretools/tests/integration_data/regions.gzip featuretools/tests/integration_data/sessions.gzip featuretools/tests/integration_data/stores.gzip **/dask-worker-space/* *.dirlock *.~lock* unpacked_sdist/ # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class **/.DS_Store .DS_Store # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover .hypothesis/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # pyenv .python-version # celery beat schedule file celerybeat-schedule # SageMath parsed files *.sage.py # dotenv .env # virtualenv .venv venv/ ENV/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ # pickle files *.p *.pickle .pytest_cache #IDE .vscode .devcontainer *.stats Dockerfile.arm .dockerignore ================================================ FILE: .pre-commit-config.yaml ================================================ exclude: | (?x) .html$|.csv$|.svg$|.md$|.txt$|.json$|.xml$|.pickle$|^.github/| (LICENSE.*|README.*) repos: - repo: https://github.com/kynan/nbstripout rev: 0.5.0 hooks: - id: nbstripout entry: nbstripout language: python types: [jupyter] - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.3.0 hooks: - id: end-of-file-fixer - id: trailing-whitespace - repo: https://github.com/MarcoGorelli/absolufy-imports rev: v0.3.1 hooks: - id: absolufy-imports files: ^featuretools/ - repo: https://github.com/asottile/add-trailing-comma rev: v2.2.3 hooks: - id: add-trailing-comma name: Add trailing comma - repo: https://github.com/charliermarsh/ruff-pre-commit rev: 'v0.3.3' hooks: - id: ruff types_or: [ python, pyi, jupyter ] args: - --fix - --config=./pyproject.toml - id: ruff-format types_or: [ python, pyi, jupyter ] args: - --config=./pyproject.toml ================================================ FILE: .readthedocs.yaml ================================================ # .readthedocs.yaml # Read the Docs configuration file # See 
https://docs.readthedocs.io/en/stable/config-file/v2.html for details # Required version: 2 # Build documentation in the docs/ directory with Sphinx sphinx: configuration: docs/source/conf.py # Optionally build your docs in additional formats such as PDF and ePub formats: [] build: os: "ubuntu-22.04" tools: python: "3.9" apt_packages: - graphviz - openjdk-11-jre-headless jobs: post_build: - export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64" python: install: - method: pip path: . extra_requirements: - docs ================================================ FILE: LICENSE ================================================ BSD 3-Clause License Copyright (c) 2017, Feature Labs, Inc. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ================================================ FILE: Makefile ================================================ .PHONY: clean clean: find . -name '*.pyo' -delete find . -name '*.pyc' -delete find . -name __pycache__ -delete find . -name '*~' -delete find . -name '.coverage.*' -delete .PHONY: lint lint: python docs/notebook_version_standardizer.py check-execution ruff check . --config=./pyproject.toml ruff format . --check --config=./pyproject.toml .PHONY: lint-fix lint-fix: python docs/notebook_version_standardizer.py standardize ruff check . --fix --config=./pyproject.toml ruff format . --config=./pyproject.toml .PHONY: test test: python -m pytest featuretools/ -n auto .PHONY: testcoverage testcoverage: python -m pytest featuretools/ --cov=featuretools -n auto .PHONY: installdeps installdeps: upgradepip pip install -e . 
.PHONY: installdeps-dev installdeps-dev: upgradepip pip install -e ".[dev]" pre-commit install .PHONY: installdeps-test installdeps-test: upgradepip pip install -e ".[test]" .PHONY: checkdeps checkdeps: $(eval allow_list='holidays|scipy|numpy|pandas|tqdm|cloudpickle|distributed|dask|psutil|woodwork') pip freeze | grep -v "alteryx/featuretools.git" | grep -E $(allow_list) > $(OUTPUT_PATH) .PHONY: upgradepip upgradepip: python -m pip install --upgrade pip .PHONY: upgradebuild upgradebuild: python -m pip install --upgrade build .PHONY: upgradesetuptools upgradesetuptools: python -m pip install --upgrade setuptools .PHONY: package package: upgradepip upgradebuild upgradesetuptools python -m build $(eval PACKAGE=$(shell python -c 'import setuptools; setuptools.setup()' --version)) tar -zxvf "dist/featuretools-${PACKAGE}.tar.gz" mv "featuretools-${PACKAGE}" unpacked_sdist ================================================ FILE: README.md ================================================

Featuretools

"One of the holy grails of machine learning is to automate more and more of the feature engineering process." ― Pedro Domingos, A Few Useful Things to Know about Machine Learning



[Featuretools](https://www.featuretools.com) is a Python library for automated feature engineering. See the [documentation](https://docs.featuretools.com) for more information.

## Installation

Install with pip:

```
python -m pip install featuretools
```

or from the conda-forge channel on [conda](https://anaconda.org/conda-forge/featuretools):

```
conda install -c conda-forge featuretools
```

### Add-ons

You can install add-ons individually or all at once by running:

```
python -m pip install "featuretools[complete]"
```

**Premium Primitives** - Use Premium Primitives from the premium-primitives repo

```
python -m pip install "featuretools[premium]"
```

**NLP Primitives** - Use Natural Language Primitives from the nlp-primitives repo

```
python -m pip install "featuretools[nlp]"
```

**Dask Support** - Use Dask to run DFS with `n_jobs` > 1

```
python -m pip install "featuretools[dask]"
```

## Example

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

```python
>> import featuretools as ft
>> es = ft.demo.load_mock_customer(return_entityset=True)
>> es.plot()
```

Featuretools can automatically create a single table of features for any "target dataframe":

```python
>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
>> feature_matrix.head(5)
```

```
             zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount)  MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)  ...  SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id  ...
1  60091  131  10  10236.77  desktop  5.60  149.95  2008  0.070041  1  ...  169.77  0.610052  41.95  791.976505  175.939423  9.299023  -0.377150  5.857976  1  -0.395358
2  02139  122  8  9118.81  mobile  5.81  149.15  2008  0.028647  20  ...  114.85  0.492531  42.96  596.243506  230.333502  10.925037  0.962350  7.420480  1  -0.470007
3  02139  78  5  5758.24  desktop  6.78  147.73  2008  0.070814  10  ...  64.98  0.645728  21.77  369.770121  471.048551  9.819148  -0.244976  12.537259  1  -0.630425
4  60091  111  8  8205.28  desktop  5.73  149.56  2008  0.087986  30  ...  83.53  0.516262  17.27  584.673126  322.883448  13.065436  -0.548969  12.738488  1  -0.497169
5  02139  58  4  4571.37  tablet  5.91  148.17  2008  0.085883  19  ...  73.09  0.830112  27.46  313.448942  198.522508  8.950528  0.098885  5.599228  1  -0.396571

[5 rows x 69 columns]
```

We now have a feature vector for each customer that can be used for machine learning. See the [documentation on Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html) for more examples.

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to [define your own custom primitives](https://featuretools.alteryx.com/en/stable/getting_started/primitives.html#defining-custom-primitives).
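
To make the feature names in the output above easier to read, here is a minimal, library-free sketch of what names such as `COUNT(transactions)`, `SUM(transactions.amount)`, and the stacked `MAX(sessions.MIN(transactions.amount))` describe. It is only an illustration under stated assumptions: the toy rows and the variable names are hypothetical, and this is not how Featuretools computes features internally.

```python
from collections import defaultdict

# Hypothetical toy rows; the real mock-customer dataset is much larger.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 5.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 7.5},
]

# Each transaction reaches its customer through its session.
session_to_customer = {s["session_id"]: s["customer_id"] for s in sessions}

amounts_by_session = defaultdict(list)
amounts_by_customer = defaultdict(list)
for t in transactions:
    amounts_by_session[t["session_id"]].append(t["amount"])
    amounts_by_customer[session_to_customer[t["session_id"]]].append(t["amount"])

# Single aggregations over all of a customer's transactions:
# COUNT(transactions) and SUM(transactions.amount).
count_transactions = {c: len(a) for c, a in amounts_by_customer.items()}
sum_amount = {c: sum(a) for c, a in amounts_by_customer.items()}

# Stacked aggregation: first MIN(transactions.amount) per session,
# then MAX of those per-session values for each customer.
min_per_session = {s: min(a) for s, a in amounts_by_session.items()}
max_of_session_min = {}
for s, m in min_per_session.items():
    c = session_to_customer[s]
    max_of_session_min[c] = max(m, max_of_session_min.get(c, m))

print(count_transactions)
print(sum_amount)
print(max_of_session_min)
```

The stacked case shows why the synthesis is called "deep": the output of one aggregation becomes the input of another one level up in the relationship chain.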
## Demos

**Predict Next Purchase**

[Repository](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/) | [Notebook](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb)

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our [demos](https://www.featuretools.com/demos) page.

## Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available [here](https://featuretools.alteryx.com/en/stable/install.html#development).

## Support

The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

1. For usage questions, use [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools) with the `featuretools` tag.
2. For bugs, issues, or feature requests, start a [GitHub issue](https://github.com/alteryx/featuretools/issues).
3. For discussion regarding development on the core library, use [Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA).
4. For everything else, the core developers can be reached by email at open_source_support@alteryx.com.

## Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. [Deep feature synthesis: Towards automating data science endeavors.](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) *IEEE DSAA 2015*.

BibTeX entry:

```bibtex
@inproceedings{kanter2015deep,
  author = {James Max Kanter and Kalyan Veeramachaneni},
  title = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages = {1--10},
  year = {2015},
  organization={IEEE}
}
```

## Built at Alteryx

**Featuretools** is an open source project maintained by [Alteryx](https://www.alteryx.com). To see the other open source projects we’re working on visit [Alteryx Open Source](https://www.alteryx.com/open-source). If building impactful data science pipelines is important to you or your business, please get in touch.


================================================
FILE: contributing.md
================================================
# Contributing to Featuretools

:+1::tada: First off, thank you for taking the time to contribute! :tada::+1:

Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!

There are many ways to contribute to Featuretools, with the most common ones being contribution of code or documentation to the project.

**To contribute, you can:**

1. Help users on our [Slack channel](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). Answer questions under the featuretools tag on [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools).

2. Submit a pull request for one of the [Good First Issues](https://github.com/alteryx/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+First+Issue%22).

3. Make changes to the codebase, see [Contributing to the codebase](#Contributing-to-the-Codebase).

4. Improve our documentation, which can be found under the [docs](docs/) directory or at https://docs.featuretools.com

5. [Report issues](#Report-issues) you're facing, and give a "thumbs up" on issues that others reported and that are relevant to you. Issues should be used for bugs and feature requests only.

6. Spread the word: reference Featuretools from your blog and articles, link to it from your website, or simply star it on GitHub to say "I use it".
   * If you would like to be featured on the [ecosystem page](https://featuretools.alteryx.com/en/stable/resources/ecosystem.html), you can submit a [pull request](https://github.com/alteryx/featuretools).

## Contributing to the Codebase

Before starting major work, you should touch base with the maintainers of Featuretools by filing an issue on GitHub or posting a message in the [#development channel on Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA).
This will increase the likelihood that your pull request will eventually get merged.

#### 1. Fork and clone repo

* The code is hosted on GitHub, so you will need to use Git to fork the project and make changes to the codebase. To start, go to the [Featuretools GitHub page](https://github.com/alteryx/featuretools) and click the `Fork` button.
* After you have created the fork, you will want to clone the fork to your machine and connect your version of the project to the upstream Featuretools repo.

  ```bash
  git clone https://github.com/your-user-name/featuretools.git
  cd featuretools
  git remote add upstream https://github.com/alteryx/featuretools
  ```

* Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. You can run the following steps to create a separate virtual environment and install Featuretools in editable mode.

  ```bash
  python -m venv venv
  source venv/bin/activate
  make installdeps
  git checkout -b issue####-branch_name
  ```

* You will need to install GraphViz and Pandoc to run all unit tests and build the docs:

  > Pandoc is only needed to build the documentation locally.

  **macOS (Intel)** (use [Homebrew](https://brew.sh/)):

  ```console
  brew install graphviz pandoc
  ```

  **macOS (M1)** (use [Homebrew](https://brew.sh/)):

  ```console
  brew install graphviz pandoc
  ```

  **Ubuntu**:

  ```console
  sudo apt install graphviz pandoc -y
  ```

#### 2. Implement your Pull Request

* Implement your pull request. If needed, add new tests or update the documentation.
* Before submitting to GitHub, verify that the tests run and the code lints properly:

  ```bash
  # runs linting
  make lint

  # will fix some common linting issues automatically
  make lint-fix

  # runs test
  make test
  ```

* If you made changes to the documentation, build the documentation locally.
  ```bash
  # go to docs and build
  cd docs
  make html

  # view docs locally
  open build/html/index.html
  ```

* Before you commit, a few lint fixing hooks will run. You can also manually run these.

  ```bash
  # run linting hooks only on changed files
  pre-commit run

  # run linting hooks on all files
  pre-commit run --all-files
  ```

#### 3. Submit your Pull Request

* Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request.

* If you need to update your code with the latest changes from the main Featuretools repo, you can do that by running the commands below, which will merge the latest changes from the Featuretools `main` branch into your current local branch. You may need to resolve merge conflicts if there are conflicts between your changes and the upstream changes. After the merge, you will need to push the updates to your forked repo.

  ```bash
  git fetch upstream
  git merge upstream/main
  ```

* Create a pull request to merge the changes from your forked repo branch into the Featuretools `main` branch. Creating the pull request will automatically run our continuous integration.

* If this is your first contribution, you will need to sign the Contributor License Agreement as directed.

* Update the "Future Release" section of the release notes (`docs/source/release_notes.rst`) to include your pull request and add your GitHub username to the list of contributors. Add a description of your PR to the subsection that most closely matches your contribution:
  * Enhancements: new features or additions to Featuretools.
  * Fixes: things like bugfixes or adding more descriptive error messages.
  * Changes: modifications to an existing part of Featuretools.
  * Documentation Changes
  * Testing Changes

  Documentation or testing changes rarely warrant an individual release notes entry; the PR number can be added to their respective "Miscellaneous changes" entries.

* We will review your changes, and you will most likely be asked to make additional changes before they are finally ready to merge. However, once your pull request has been reviewed by a maintainer of Featuretools and passes continuous integration, we will merge it, and you will have successfully contributed to Featuretools!

## Report issues

When reporting issues, please include as much detail as possible about your operating system, Featuretools version, and Python version. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem.

================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
GENDIR = source/generated

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don\'t have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source .PHONY: help help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " applehelp to make an Apple Help Book" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " epub3 to make an epub3" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" @echo " coverage to run coverage check of the documentation (if enabled)" @echo " dummy to check syntax errors of document sources" .PHONY: clean clean: rm -rf $(BUILDDIR)/* rm -rf $(GENDIR)/* .PHONY: html html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html $(SPHINXOPTS) @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 
.PHONY: dirhtml dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." .PHONY: singlehtml singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." .PHONY: pickle pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." .PHONY: json json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." .PHONY: htmlhelp htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." .PHONY: qthelp qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/featuretools.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/featuretools.qhc" .PHONY: applehelp applehelp: $(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp @echo @echo "Build finished. The help book is in $(BUILDDIR)/applehelp." @echo "N.B. You won't be able to view it unless you put it in" \ "~/Library/Documentation/Help or install it in your application" \ "bundle." .PHONY: devhelp devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/featuretools" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/featuretools" @echo "# devhelp" .PHONY: epub epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 
.PHONY: epub3 epub3: $(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3 @echo @echo "Build finished. The epub3 file is in $(BUILDDIR)/epub3." .PHONY: latex latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." .PHONY: latexpdf latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." .PHONY: latexpdfja latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." .PHONY: text text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." .PHONY: man man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." .PHONY: texinfo texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." .PHONY: info info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." .PHONY: gettext gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
.PHONY: changes changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." .PHONY: linkcheck linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." .PHONY: doctest doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." .PHONY: coverage coverage: $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage @echo "Testing of coverage in the sources finished, look at the " \ "results in $(BUILDDIR)/coverage/python.txt." .PHONY: xml xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." .PHONY: pseudoxml pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." .PHONY: dummy dummy: $(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy @echo @echo "Build finished. Dummy builder generates no files." ================================================ FILE: docs/backport_release.md ================================================ # Backport Release Process In situations where we need to backport commits to earlier versions of our software, we'll need to perform the release process slightly differently than a normal release.


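As a rough illustration of the naming conventions this process relies on, the sketch below derives the backport branch, the new version, the working branches, and the tag from the tag being targeted. The helper `backport_names` and the `BackportNames` type are hypothetical and not part of Featuretools. The sketch also assumes the target tag is the latest patch release in its minor series, so the new version is simply the next patch number.

```python
from typing import NamedTuple


class BackportNames(NamedTuple):
    """Names used during a backport release (hypothetical container)."""

    backport_branch: str  # long-lived target branch, e.g. "0.11.x"
    version: str          # new patch version, e.g. "0.11.2"
    feature_branch: str   # branch for cherry-picked commits, e.g. "backport_v0.11.2"
    release_branch: str   # branch for the release PR, e.g. "release_v0.11.2"
    tag: str              # tag for the GitHub release, e.g. "v0.11.2"


def backport_names(target_tag: str) -> BackportNames:
    """Derive backport names from the tag being targeted.

    Assumption: ``target_tag`` looks like "vX.Y.Z" and is the latest patch
    release in the X.Y series, so the backport version is X.Y.(Z + 1).
    """
    major, minor, patch = (int(part) for part in target_tag.lstrip("v").split("."))
    version = f"{major}.{minor}.{patch + 1}"
    return BackportNames(
        backport_branch=f"{major}.{minor}.x",
        version=version,
        feature_branch=f"backport_v{version}",
        release_branch=f"release_v{version}",
        tag=f"v{version}",
    )


names = backport_names("v0.11.1")
print(names.backport_branch, names.feature_branch, names.release_branch, names.tag)
```

If the agreed version is not the next patch number (for example, when an intermediate patch already exists), the version would need to be passed in explicitly rather than derived.
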
This document outlines the differences between a normal release and a backport release. It uses the same outline as the [Release Guide](../release.md).

## 0. Pre-Release Checklist

Before starting the backport release process, verify the following:

- Get agreement on the latest commit to use for targeting the release. A backport release will be targeted on some commit other than the latest on main. Many times the new target will be an old release, which will have a tag that can be referenced--for example `v0.11.1`.
- Get agreement on the commits to port over for the backport release.
- Get agreement on the version number to use for the backport release.

#### Version Numbering for Backport Releases

Featuretools uses [semantic versioning](https://semver.org/). Every release has a major, minor, and patch version number, displayed like so: `<major>.<minor>.<patch>`.

**A backport release will increment the patch version.** This may be an intermediate number between two preexisting releases--for example a new `0.11.2` to be added between existing `0.11.1` and `0.12.0` releases. It can also be a new latest release--so `0.12.1` in the same situation--using only some of the commits that are present in the Future Release section of the release notes.

## 0.5. Create target branch for backport release

#### Checkout intended target commit

1. Check out the agreed-upon latest commit for targeting the release. If this is a previous release, you may check out its tag with `git checkout v0.11.1`.

#### Create backport branch

1. Branch off of the target commit. For the branch name, please use the most recent major and minor versions to this commit (in this example `0` and `11` respectively), leaving the patch number as an `x`. This means that we would create `0.11.x` in the working example. This is necessary so that if any further backport releases are needed, we could continue to use this branch as the target. This branch is to be treated as `main` is treated in a normal release. It will be the target for our release. This branch will be automatically protected (unless the version exceeds 9.Y.x or X.99.x, in which case contact the repo team about expanding the protection rules) to avoid unintended commits from making their way into the release undetected.

#### Port over desired commits

1. Create a feature branch off the backport branch. For the branch name, please use "backport_vX.Y.Z" as the naming scheme (e.g. "backport_v0.11.2"). Doing so will bypass our release notes checkin test which requires all other PRs to add a release note entry.
2. Cherry-pick the desired commits onto `backport_v0.11.2`.
3. Create a pull request with the backport `0.11.x` branch as its target, get confirmation that the desired changes were added, and confirm that the CI checks pass.
4. Under the "Future Release" section in the release notes, include the ported over commits' release notes (don't remove them from their original location back on `main`), indicating that they are a backport of the original PR.

    ```
    Future Release
    ==============
        * Enhancements
        * Fixes
            * Fix bug (backport of :pr:`1110`)
        * Changes
        * Documentation Changes
        * Testing Changes

        Thanks to the following people for contributing to this release:
    ```

5. Merge the PR into the `0.11.x` backport branch.

## 1. Create Featuretools Backport release on Github

With our backport branch `0.11.x` as our target, we now proceed with the release of `0.11.2`.

#### Create release branch

1. **Branch off of the backport branch `0.11.x`.** For the branch name, please use "release_vX.Y.Z" as the naming scheme (e.g. "release_v0.11.2"). Doing so will bypass our release notes checkin test which requires all other PRs to add a release note entry.

#### Bump version number

1. Bump `__version__` in `setup.py`, `featuretools/version.py`, and `featuretools/tests/test_version.py`.

#### Update Release Notes

1. Replace **"Future Release"** in `docs/source/release_notes.rst` with the version number and the current date

    ```
    v0.11.2 Sep 28, 2020
    ====================
    ```

2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)
3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**
4. The release PR does not need to be mentioned in the list of changes
5. Add a commented out "Future Release" section with all of the Release Notes sections above the current section

    ```
    .. Future Release
      ==============
        * Enhancements
        * Fixes
        * Changes
        * Documentation Changes
        * Testing Changes

    .. Thanks to the following people for contributing to this release:
    ```

#### Create Release PR

A [release pr](https://github.com/alteryx/featuretools/pull/1915) should have the version number as the title and the release notes for that release as the PR body text. The contributors list is not necessary. The special sphinx docs syntax (:pr:\`547\`) needs to be changed to GitHub link syntax (#547).

Checklist before merging:

- All tests are currently green on checkin and on `0.11.x`.
- The ReadtheDocs build for the release PR branch has passed, and the resulting docs contain the expected release notes.
- PR has been reviewed and approved.
- Confirm with the team that `0.11.x` will be frozen until step 2 (Github Release) is complete.

## 2. Create Github Release

After the release pull request has been merged into the `0.11.x` branch, it is time to draft the GitHub release. [Example release](https://github.com/alteryx/featuretools/releases/tag/v1.6.0)

- **The target should be the `0.11.x` backport branch**
- The tag should be the version number with a v prefix (e.g. v0.11.2)
- Release title is the same as the tag
- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\`gsheni\`) to GitHub syntax (@gsheni)
- This is not a pre-release
- Publishing the release will automatically upload the package to PyPI

Note that this backported release will show up on the repository's front page as the latest release even if there is technically a later `0.12.0` release.

## Release on conda-forge

If a later release exists, conda-forge will not automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls). Instead a PR will need to be manually created. You can do either of the following:

- Branch off of the 0.11.1 meta.yaml update commit for the 0.11.2 meta.yaml changes. This is "cleaner" and sometimes easier, but if migration files (like py310) have been added between 0.11.1 and 0.12.0 you will have to add them in and re-render yourself.
- Tack the 0.11.2 changes on after the 0.12.0 update commit in the feedstock repo. This means that if any of the boilerplate has changed, you do not have to manually re-add it yourself. An example of this can be seen from a Woodwork backport release [here](https://github.com/conda-forge/woodwork-feedstock/pull/32).

Once the PR is created:

1. Update requirements changes in `recipe/meta.yaml` - you may need to handle the version, source links, and SHA256 if you had to open the PR yourself. You will also need to update the requirements.
2.
After tests pass, a maintainer will merge the PR in ================================================ FILE: docs/make.bat ================================================ @ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source set I18NSPHINXOPTS=%SPHINXOPTS% source if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^<target^>` where ^<target^> is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. epub3 to make an epub3 echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. xml to make Docutils-native XML files echo. pseudoxml to make pseudoxml-XML files for display purposes echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled echo. coverage to run coverage check of the documentation if enabled echo.
dummy to check syntax errors of document sources goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) REM Check if sphinx-build is available and fallback to Python version if any %SPHINXBUILD% 1>NUL 2>NUL if errorlevel 9009 goto sphinx_python goto sphinx_ok :sphinx_python set SPHINXBUILD=python -m sphinx.__init__ %SPHINXBUILD% 2> nul if errorlevel 9009 ( echo. echo.The 'sphinx-build' command was not found. Make sure you have Sphinx echo.installed, then set the SPHINXBUILD environment variable to point echo.to the full path of the 'sphinx-build' executable. Alternatively you echo.may add the Sphinx directory to PATH. echo. echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) :sphinx_ok if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. 
goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\featuretools.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\featuretools.qhc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "epub3" ( %SPHINXBUILD% -b epub3 %ALLSPHINXOPTS% %BUILDDIR%/epub3 if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub3 file is in %BUILDDIR%/epub3. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdf" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdfja" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf-ja cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished.
The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) if "%1" == "coverage" ( %SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage if errorlevel 1 exit /b 1 echo. echo.Testing of coverage in the sources finished, look at the ^ results in %BUILDDIR%/coverage/python.txt. goto end ) if "%1" == "xml" ( %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml if errorlevel 1 exit /b 1 echo. echo.Build finished. The XML files are in %BUILDDIR%/xml. goto end ) if "%1" == "pseudoxml" ( %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml if errorlevel 1 exit /b 1 echo. echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. goto end ) if "%1" == "dummy" ( %SPHINXBUILD% -b dummy %ALLSPHINXOPTS% %BUILDDIR%/dummy if errorlevel 1 exit /b 1 echo. echo.Build finished. Dummy builder generates no files. 
goto end ) :end ================================================ FILE: docs/notebook_version_standardizer.py ================================================ import json import os import click DOCS_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "source") def _get_ipython_notebooks(docs_source): directories_to_skip = ["_templates", "generated", ".ipynb_checkpoints"] notebooks = [] for root, _, filenames in os.walk(docs_source): if any(dir_ in root for dir_ in directories_to_skip): continue for filename in filenames: if filename.endswith(".ipynb"): notebooks.append(os.path.join(root, filename)) return notebooks def _check_delete_empty_cell(notebook, delete=True): with open(notebook, "r") as f: source = json.load(f) cell = source["cells"][-1] if cell["cell_type"] == "code" and cell["source"] == []: # this is an empty cell, which we should delete if delete: source["cells"] = source["cells"][:-1] else: return False if delete: with open(notebook, "w") as f: json.dump(source, f, ensure_ascii=False, indent=1) else: return True def _check_execution_and_output(notebook): with open(notebook, "r") as f: source = json.load(f) for cells in source["cells"]: if cells["cell_type"] == "code" and ( cells["execution_count"] is not None or cells["outputs"] != [] ): return False return True def _check_python_version(notebook, default_version): with open(notebook, "r") as f: source = json.load(f) if source["metadata"]["language_info"]["version"] != default_version: return False return True def _fix_python_version(notebook, default_version): with open(notebook, "r") as f: source = json.load(f) source["metadata"]["language_info"]["version"] = default_version with open(notebook, "w") as f: json.dump(source, f, ensure_ascii=False, indent=1) def _fix_execution_and_output(notebook): with open(notebook, "r") as f: source = json.load(f) for cells in source["cells"]: if cells["cell_type"] == "code" and cells["execution_count"] is not None: cells["execution_count"] = None 
cells["outputs"] = [] source["metadata"]["kernelspec"]["display_name"] = "Python 3" source["metadata"]["kernelspec"]["name"] = "python3" with open(notebook, "w") as f: json.dump(source, f, ensure_ascii=False, indent=1) def _get_notebooks_with_executions_and_empty(notebooks, default_version="3.9.2"): executed = [] empty_last_cell = [] versions = [] for notebook in notebooks: if not _check_execution_and_output(notebook): executed.append(notebook) if not _check_delete_empty_cell(notebook, delete=False): empty_last_cell.append(notebook) if not _check_python_version(notebook, default_version): versions.append(notebook) return (executed, empty_last_cell, versions) def _fix_versions(notebooks, default_version="3.9.2"): for notebook in notebooks: _fix_python_version(notebook, default_version) def _remove_notebook_empty_last_cell(notebooks): for notebook in notebooks: _check_delete_empty_cell(notebook, delete=True) def _standardize_outputs(notebooks): for notebook in notebooks: _fix_execution_and_output(notebook) @click.group() def cli(): """no-op""" @cli.command() def standardize(): notebooks = _get_ipython_notebooks(DOCS_PATH) ( executed_notebooks, empty_cells, versions, ) = _get_notebooks_with_executions_and_empty(notebooks) if executed_notebooks: _standardize_outputs(executed_notebooks) executed_notebooks = ["\t" + notebook for notebook in executed_notebooks] executed_notebooks = "\n".join(executed_notebooks) click.echo(f"Removed the outputs for:\n {executed_notebooks}") if empty_cells: _remove_notebook_empty_last_cell(empty_cells) empty_cells = ["\t" + notebook for notebook in empty_cells] empty_cells = "\n".join(empty_cells) click.echo(f"Removed the empty cells for:\n {empty_cells}") if versions: _fix_versions(versions) versions = ["\t" + notebook for notebook in versions] versions = "\n".join(versions) click.echo(f"Fixed python versions for:\n {versions}") @cli.command() def check_execution(): notebooks = _get_ipython_notebooks(DOCS_PATH) ( executed_notebooks, 
empty_cells, versions, ) = _get_notebooks_with_executions_and_empty(notebooks) if executed_notebooks: executed_notebooks = ["\t" + notebook for notebook in executed_notebooks] executed_notebooks = "\n".join(executed_notebooks) raise SystemExit( f"The following notebooks have executed outputs:\n {executed_notebooks}\n" "Please run make lint-fix to fix this.", ) if empty_cells: empty_cells = ["\t" + notebook for notebook in empty_cells] empty_cells = "\n".join(empty_cells) raise SystemExit( f"The following notebooks have empty cells at the end:\n {empty_cells}\n" "Please run make lint-fix to fix this.", ) if versions: versions = ["\t" + notebook for notebook in versions] versions = "\n".join(versions) raise SystemExit( f"The following notebooks have the wrong Python version: \n {versions}\n" "Please run make lint-fix to fix this.", ) if __name__ == "__main__": cli() ================================================ FILE: docs/pull_request_template.md ================================================ ### Pull Request Description (replace this text with your description) ----- *After creating the pull request: in order to pass the **release_notes_updated** check you will need to update the "Future Release" section of* `docs/source/release_notes.rst` *to include this pull request.* ================================================ FILE: docs/source/_static/style.css ================================================ .footer { background-color: #0D2345; padding-bottom: 40px; padding-top: 40px; width: 100%; } .footer-cell-1 { grid-row: 1; grid-column: 1 / 3; } .footer-cell-2 { grid-row: 1; grid-column: 4; margin-bottom: 15px; text-align: right; } .footer-cell-3 { grid-row: 2; grid-column: 1 / 5; } .footer-cell-4 { grid-row: 3; grid-column: 1 / 3; } .footer-container { display: grid; margin-left: 10%; margin-right: 10%; } .footer-image-alteryx { padding-top: 22px; width: 270px; } .footer-image-copyright { width: 180px; } .footer-image-github { width: 50px; } 
.footer-image-twitter { width: 60px; } .footer-line { border-top: 2px solid white; margin-left: 7px; margin-right: 15px; } ================================================ FILE: docs/source/api_reference.rst ================================================ .. _api_ref: API Reference ============= .. currentmodule:: featuretools Demo Datasets ~~~~~~~~~~~~~ .. currentmodule:: featuretools.demo .. autosummary:: :toctree: generated/ load_retail load_mock_customer load_flight load_weather Deep Feature Synthesis ~~~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ dfs get_valid_primitives Timedelta ~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ Timedelta Time utils ~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ make_temporal_cutoffs Feature Primitives ~~~~~~~~~~~~~~~~~~ Primitive Types --------------- .. currentmodule:: featuretools.primitives .. autosummary:: :toctree: generated/ TransformPrimitive AggregationPrimitive .. _api_ref.aggregation_features: Aggregation Primitives ---------------------- .. 
autosummary:: :toctree: generated/ All Any AverageCountPerUnique AvgTimeBetween Count CountAboveMean CountBelowMean CountGreaterThan CountInsideNthSTD CountInsideRange CountLessThan CountOutsideNthSTD CountOutsideRange DateFirstEvent Entropy First FirstLastTimeDelta HasNoDuplicates IsMonotonicallyDecreasing IsMonotonicallyIncreasing IsUnique Kurtosis Last Max MaxConsecutiveFalse MaxConsecutiveNegatives MaxConsecutivePositives MaxConsecutiveTrue MaxConsecutiveZeros MaxCount MaxMinDelta Mean Median MedianCount Min MinCount Mode NMostCommon NMostCommonFrequency NUniqueDays NUniqueDaysOfCalendarYear NUniqueMonths NUniqueWeeks NumConsecutiveGreaterMean NumConsecutiveLessMean NumFalseSinceLastTrue NumPeaks NumTrue NumTrueSinceLastFalse NumUnique NumZeroCrossings PercentTrue PercentUnique Skew Std Sum TimeSinceFirst TimeSinceLast TimeSinceLastFalse TimeSinceLastMax TimeSinceLastMin TimeSinceLastTrue Trend Variance Transform Primitives -------------------- Binary Transform Primitives *************************** .. autosummary:: :toctree: generated/ AddNumeric AddNumericScalar DivideByFeature DivideNumeric DivideNumericScalar Equal EqualScalar GreaterThan GreaterThanEqualTo GreaterThanEqualToScalar GreaterThanScalar LessThan LessThanEqualTo LessThanEqualToScalar LessThanScalar ModuloByFeature ModuloNumeric ModuloNumericScalar MultiplyBoolean MultiplyNumeric MultiplyNumericBoolean MultiplyNumericScalar NotEqual NotEqualScalar ScalarSubtractNumericFeature SubtractNumeric SubtractNumericScalar Combine features **************** .. autosummary:: :toctree: generated/ IsIn And Or Not .. _api_ref.cumulative_features: Cumulative Transform Primitives ******************************* .. autosummary:: :toctree: generated/ Diff DiffDatetime TimeSincePrevious CumCount CumSum CumMean CumMin CumMax CumulativeTimeSinceLastFalse CumulativeTimeSinceLastTrue Datetime Transform Primitives ***************************** .. 
autosummary:: :toctree: generated/ Age DateToHoliday DateToTimeZone Day DayOfYear DaysInMonth DistanceToHoliday Hour IsFederalHoliday IsFirstWeekOfMonth IsLeapYear IsLunchTime IsMonthEnd IsMonthStart IsQuarterEnd IsQuarterStart IsWeekend IsWorkingHours IsYearEnd IsYearStart Minute Month NthWeekOfMonth PartOfDay Quarter Season Second TimeSince Week Weekday Year Email, URL and File Transform Primitives **************************************** .. autosummary:: :toctree: generated/ EmailAddressToDomain FileExtension IsFreeEmailDomain URLToDomain URLToProtocol URLToTLD Exponential Transform Primitives ******************************** .. autosummary:: :toctree: generated/ ExponentialWeightedAverage ExponentialWeightedSTD ExponentialWeightedVariance General Transform Primitives **************************** .. autosummary:: :toctree: generated/ AbsoluteDiff Absolute Cosine IsNull NaturalLogarithm Negate Percentile PercentChange RateOfChange SameAsPrevious SavgolFilter Sine SquareRoot Tangent Variance Location Transform Primitives ***************************** .. autosummary:: :toctree: generated/ CityblockDistance GeoMidpoint Haversine IsInGeoBox Latitude Longitude Name Transform Primitives ************************* .. autosummary:: :toctree: generated/ FullNameToFirstName FullNameToLastName FullNameToTitle NaturalLanguage Transform Primitives ************************************ .. autosummary:: :toctree: generated/ CountString MeanCharactersPerWord MedianWordLength NumCharacters NumUniqueSeparators NumWords NumberOfCommonWords NumberOfHashtags NumberOfMentions NumberOfUniqueWords NumberOfWordsInQuotes PunctuationCount TitleWordCount TotalWordLength UpperCaseCount UpperCaseWordCount WhitespaceCount Postal Code Primitives ********************** .. autosummary:: :toctree: generated/ OneDigitPostalCode TwoDigitPostalCode Time Series Transform Primitives ******************************** .. 
autosummary:: :toctree: generated/ ExpandingCount ExpandingMax ExpandingMean ExpandingMin ExpandingSTD ExpandingTrend Lag RollingCount RollingMax RollingMean RollingMin RollingOutlierCount RollingSTD RollingTrend Feature methods --------------- .. currentmodule:: featuretools.feature_base .. autosummary:: :toctree: generated/ FeatureBase.rename FeatureBase.get_depth Feature calculation ~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ calculate_feature_matrix .. approximate_features Feature descriptions ~~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ describe_feature Feature visualization ~~~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ graph_feature Feature encoding ~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ encode_features Feature Selection ~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools.selection .. autosummary:: :toctree: generated/ remove_low_information_features remove_highly_correlated_features remove_highly_null_features remove_single_value_features Feature Matrix utils ~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools.computational_backends .. autosummary:: :toctree: generated/ replace_inf_values Saving and Loading Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ save_features load_features .. _api_ref.dataset: EntitySet, Relationship ~~~~~~~~~~~~~~~~~~~~~~~ Constructors ------------ .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ EntitySet Relationship EntitySet load and prepare data ------------------------------- .. 
autosummary:: :toctree: generated/ EntitySet.add_dataframe EntitySet.add_interesting_values EntitySet.add_last_time_indexes EntitySet.add_relationship EntitySet.add_relationships EntitySet.concat EntitySet.normalize_dataframe EntitySet.set_secondary_time_index EntitySet.replace_dataframe EntitySet serialization ------------------------------- .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ read_entityset .. currentmodule:: featuretools.entityset .. autosummary:: :toctree: generated/ EntitySet.to_csv EntitySet.to_pickle EntitySet.to_parquet EntitySet query methods ----------------------- .. autosummary:: :toctree: generated/ EntitySet.__getitem__ EntitySet.find_backward_paths EntitySet.find_forward_paths EntitySet.get_forward_dataframes EntitySet.get_backward_dataframes EntitySet.query_by_values EntitySet visualization ----------------------- .. autosummary:: :toctree: generated/ EntitySet.plot Relationship attributes ----------------------- .. autosummary:: :toctree: generated/ Relationship.parent_column Relationship.child_column Relationship.parent_dataframe Relationship.child_dataframe Data Type Util Methods ---------------------- .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ list_logical_types list_semantic_tags Primitive Util Methods ---------------------- .. currentmodule:: featuretools .. autosummary:: :toctree: generated/ get_recommended_primitives list_primitives summarize_primitives ================================================ FILE: docs/source/conf.py ================================================ # -*- coding: utf-8 -*- # # featuretools documentation build configuration file, created by # sphinx-quickstart on Thu May 19 20:40:30 2016. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. 
# # All configuration values have a default; values that are commented out # serve to show the default. import os import shutil import subprocess import sys from pathlib import Path import featuretools # run setup script path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "setup.py") subprocess.check_call([sys.executable, path]) # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. sys.path.insert(0, os.path.abspath("../featuretools")) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ "sphinx.ext.autodoc", "sphinx.ext.autosummary", "sphinx.ext.napoleon", "sphinx.ext.ifconfig", "sphinx.ext.githubpages", "nbsphinx", "IPython.sphinxext.ipython_console_highlighting", "IPython.sphinxext.ipython_directive", "sphinx.ext.extlinks", "sphinx.ext.viewcode", "sphinx.ext.graphviz", "sphinx_inline_tabs", "sphinx_copybutton", "myst_parser", ] # ipython_mplbackend = None ipython_execlines = ["import pandas as pd", "pd.set_option('display.width', 1000000)"] # autosummary_generate=True autosummary_generate = ["api_reference.rst"] # Add any paths that contain templates here, relative to this directory. templates_path = ["templates"] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # source_suffix = ['.rst', '.md'] # The encoding of source files. # source_encoding = 'utf-8-sig' # The master toctree document. master_doc = "index" # General information about the project. project = "Featuretools" copyright = "2019, Feature Labs. BSD License" author = "Feature Labs, Inc." 
latex_documents = [ (master_doc, "featuretools.tex", "test Documentation", "test", "manual"), ] latex_elements = { "preamble": r""" \usepackage[utf8]{inputenc} """, } # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = featuretools.__version__ # The full version, including alpha/beta/rc tags. release = featuretools.__version__ # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = "en" # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: # today = '' # Else, today_fmt is used as the format for a strftime call. # today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This patterns also effect to html_static_path and html_extra_path exclude_patterns = ["**.ipynb_checkpoints"] # The reST default role (used for this markup: `text`) to use for all # documents. # default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. # add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). # add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. # show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = "sphinx" # A list of ignored prefixes for module index sorting. # modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. 
# keep_warnings = False # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. html_theme = "pydata_sphinx_theme" # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. html_theme_options = { "pygment_light_style": "tango", "pygment_dark_style": "native", "icon_links": [ { "name": "GitHub", "url": "https://github.com/alteryx/featuretools", "icon": "fab fa-github-square", "type": "fontawesome", }, { "name": "Twitter", "url": "https://twitter.com/AlteryxOSS", "icon": "fab fa-twitter-square", "type": "fontawesome", }, { "name": "Slack", "url": "https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA", "icon": "fab fa-slack", "type": "fontawesome", }, { "name": "StackOverflow", "url": "https://stackoverflow.com/questions/tagged/featuretools", "icon": "fab fa-stack-overflow", "type": "fontawesome", }, ], "collapse_navigation": False, "navigation_depth": 2, } # Add any paths that contain custom themes here, relative to this directory. # html_theme_path = [] # The name for this set of Sphinx documents. # " v documentation" by default. # html_title = u'featuretools v0.1' # A shorter title for the navigation bar. Default is the same as html_title. # html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. html_logo = "_static/images/featuretools_nav2.svg" # The name of an image file (relative to this directory) to use as a favicon of # the docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. 
html_favicon = "_static/images/favicon.ico" # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ["_static"] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. # html_extra_path = [] # If not None, a 'Last updated on:' timestamp is inserted at every page # bottom, using the given strftime format. # The empty string is equivalent to '%b %d, %Y'. # html_last_updated_fmt = None # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. # html_use_smartypants = True # Custom sidebar templates, maps document names to template names. html_sidebars = { "**": ["globaltoc.html", "relations.html", "sourcelink.html", "searchbox.html"], } # Additional templates that should be rendered to pages, maps page names to # template names. # html_additional_pages = {} # If false, no module index is generated. # html_domain_indices = True # If false, no index is generated. # html_use_index = True # If true, the index is split into individual pages for each letter. # html_split_index = False # If true, links to the reST sources are added to the pages. # html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. html_show_sphinx = False # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. # html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. # html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). 
# html_file_suffix = None # Language to be used for generating the HTML full-text search index. # Sphinx supports the following languages: # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh' # html_search_language = 'en' # A dictionary with options for the search language support, empty by default. # 'ja' uses this config value. # 'zh' user can custom change `jieba` dictionary path. # html_search_options = {'type': 'default'} # The name of a javascript file (relative to the configuration directory) that # implements a search results scorer. If empty, the default will be used. # html_search_scorer = 'scorer.js' # Output file base name for HTML help builder. htmlhelp_basename = "featuretoolsdoc" # -- Options for Markdown files ---------------------------------------------- myst_admonition_enable = True myst_deflist_enable = True myst_heading_anchors = 3 # -- Options for Sphinx Copy Button ------------------------------------------ copybutton_prompt_text = "myinputprompt" copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: " copybutton_prompt_is_regexp = True # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', # Latex figure (float) alignment #'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ( master_doc, "featuretools.tex", "Featuretools Documentation", "Feature Labs, Inc.", "manual", ), ] # The name of an image file (relative to this directory) to place at the top of # the title page. 
# latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. # latex_use_parts = False # If true, show page references after internal links. # latex_show_pagerefs = False # If true, show URL addresses after external links. # latex_show_urls = False # Documents to append as an appendix to all manuals. # latex_appendices = [] # If false, no module index is generated. # latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [(master_doc, "featuretools", "featuretools Documentation", [author], 1)] # If true, show URL addresses after external links. # man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ( master_doc, "featuretools", "featuretools Documentation", author, "featuretools", "One line description of project.", "Miscellaneous", ), ] # Documents to append as an appendix to all manuals. # texinfo_appendices = [] # If false, no module index is generated. # texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. # texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. 
# texinfo_no_detailmenu = False nbsphinx_execute = "auto" extlinks = { "issue": ("https://github.com/alteryx/featuretools/issues/%s", "GH#%s"), "pr": ("https://github.com/alteryx/featuretools/pull/%s", "GH#%s"), "user": ("https://github.com/%s", "@%s"), } # Napoleon settings napoleon_google_docstring = True napoleon_numpy_docstring = True napoleon_include_init_with_doc = False napoleon_include_private_with_doc = False napoleon_include_special_with_doc = True napoleon_use_admonition_for_examples = False napoleon_use_admonition_for_notes = False napoleon_use_admonition_for_references = False napoleon_use_ivar = False napoleon_use_param = True napoleon_use_rtype = True def setup(app): home_dir = os.environ.get("HOME", "/") ipython_p = Path(home_dir + "/.ipython/profile_default/startup") ipython_p.mkdir(parents=True, exist_ok=True) file_p = os.path.abspath(os.path.dirname(__file__)) shutil.copy( file_p + "/set-headers.py", home_dir + "/.ipython/profile_default/startup", ) app.add_css_file("style.css") ================================================ FILE: docs/source/getting_started/afe.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Deep Feature Synthesis\n", "\n", "Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.\n", "\n", "## Input Data\n", "\n", "Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.\n" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note ::\n", "\n", " Before using DFS, it is recommended that you prepare your data as an :class:`EntitySet`. See :doc:`using_entitysets` to learn how." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once data is prepared as an `EntitySet`, we are ready to automatically generate features for a target dataframe, e.g. `customers`.\n", "\n", "## Running DFS\n", "\n", "Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer, and apply different statistical functions resulting in features quantifying the customer's behavior. In this example, an expert might be interested in features such as: *total number of sessions* or *month the customer signed up*.\n", "\n", "These features can be generated by DFS when we set `target_dataframe_name` to `\"customers\"` and pass `\"count\"` and `\"month\"` as primitives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"count\"],\n", " trans_primitives=[\"month\"],\n", " max_depth=1,\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, `\"count\"` is an **aggregation primitive** because it computes a single value based on many sessions related to one customer. `\"month\"` is called a **transform primitive** because it takes one value for a customer and transforms it into another." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note ::\n", "\n", " Feature primitives are a fundamental component of Featuretools. To learn more, read :doc:`primitives`."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating \"Deep Features\"\n", "\n", "The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive we increase the \"depth\" of a feature. The `max_depth` parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with `max_depth=2`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\", \"sum\", \"mode\"],\n", " trans_primitives=[\"month\", \"hour\"],\n", " max_depth=2,\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this [paper](https://www.jmaxkanter.com/papers/DSAA_DSM_2015.pdf). Let us examine one of the depth 2 features in the returned feature matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix[[\"MEAN(sessions.SUM(transactions.amount))\"]]" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "For each customer, this feature\n", "\n", "1. calculates the ``sum`` of all transaction amounts per session to get the total amount per session,\n", "2. 
then applies the ``mean`` to the total amounts across multiple sessions to identify the *average amount spent per session*.\n", "\n", "We call this feature a \"deep feature\" with a depth of 2.\n", "\n", "Let's look at another depth 2 feature that calculates, for every customer, *the most common hour of the day when they start a session*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix[[\"MODE(sessions.HOUR(session_start))\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each customer, this feature\n", "\n", "1. calculates the `hour` of the day at which each of the customer's sessions started, then\n", "2. uses the statistical function `mode` to identify the most common hour at which the customer started a session.\n", "\n", "Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note ::\n", " You can graphically visualize the lineage of a feature by calling :func:`featuretools.graph_feature` on it. You can also generate an English description of the feature with :func:`featuretools.describe_feature`. See :doc:`/guides/feature_descriptions` for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing Target DataFrame\n", "\n", "DFS is powerful because we can create a feature matrix for any dataframe in our dataset. If we switch our target dataframe to \"sessions\", we can synthesize features for each session instead of each customer. Now, we can use these features to predict the outcome of a session."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[\"mean\", \"sum\", \"mode\"],\n", " trans_primitives=[\"month\", \"hour\"],\n", " max_depth=2,\n", ")\n", "feature_matrix.head(5)" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "As we can see, DFS will also build deep features based on a parent dataframe, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the customer of the session." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix[[\"customers.MEAN(transactions.amount)\"]].head(5)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Improve feature output\n", "~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "To learn about the parameters you can change in DFS, read :doc:`/guides/tuning_dfs`.\n", "\n", "\n", ".. here it may be nice to have a table that shows the number of features generated for Airbnb and other Kaggle datasets once we have them. We can also give the user access to it." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: docs/source/getting_started/getting_started_index.rst ================================================ Getting Started --------------- For a quick introduction to Featuretools, check out our :ref:`5 minute quick start guide `. 
How to start working with Featuretools; the main concepts: .. toctree:: :maxdepth: 1 using_entitysets afe primitives woodwork_types handling_time ================================================ FILE: docs/source/getting_started/handling_time.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "a8104f18", "metadata": {}, "source": [ "# Handling Time\n", "\n", "\n", "When performing feature engineering with temporal data, carefully selecting the data that is used for any calculation is paramount. By annotating dataframes with a Woodwork **time index** column and providing a **cutoff time** during feature calculation, Featuretools will automatically filter out any data after the cutoff time before running any calculations." ] }, { "cell_type": "raw", "id": "9cd9cb82", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " This guide focuses on performing feature engineering on temporal data, but it is not specific to feature engineering for time series problems, which are their own class of machine learning problems. A guide on **using Featuretools for time series feature engineering** can be found `here <../guides/time_series.ipynb>`_." ] }, { "cell_type": "markdown", "id": "32c2ae4d", "metadata": {}, "source": [ "## What is the Time Index?\n", "\n", "\n", "The time index is the column in the data that specifies when the data in each row became known. 
For example, let's examine a table of customer transactions:" ] }, { "cell_type": "code", "execution_count": null, "id": "ebbcb40b", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.options.display.max_columns = 200" ] }, { "cell_type": "code", "execution_count": null, "id": "8202f11a", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)\n", "es[\"transactions\"].head()" ] }, { "cell_type": "markdown", "id": "cd26087b", "metadata": {}, "source": [ "In this table, there is one row for every transaction and a ``transaction_time`` column that specifies when the transaction took place. This means that ``transaction_time`` is the time index because it indicates when the information in each row became known and available for feature calculations. For now, ignore the ``_ft_last_time`` column. That is a featuretools-generated column that will be discussed later on.\n", "\n", "However, not every datetime column is a time index. Consider the ``customers`` dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "87dd0a0d", "metadata": {}, "outputs": [], "source": [ "es[\"customers\"]" ] }, { "cell_type": "markdown", "id": "c89d548d", "metadata": {}, "source": [ "Here, we have two time columns, ``join_date`` and ``birthday``. While either column might be useful for making features, the ``join_date`` should be used as the time index because it indicates when that customer first became available in the dataset." ] }, { "cell_type": "raw", "id": "85b51512", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. important::\n", "\n", " The **time index** is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored." 
] }, { "cell_type": "markdown", "id": "00e3c365", "metadata": {}, "source": [ "## What is the Cutoff Time?\n", "\n", "The **cutoff_time** specifies the last point in time that a row’s data can be used for a feature calculation. Any data after this point in time will be filtered out before calculating features.\n", "\n", "For example, let's consider a dataset of timestamped customer transactions, where we want to predict whether customers ``1``, ``2`` and ``3`` will spend $500 between ``04:00`` on January 1 and the end of the day. When building features for this prediction problem, we need to ensure that no data after ``04:00`` is used in our calculations.\n", "\n", "[Figure: retail]" ] }, { "cell_type": "raw", "id": "19855e77", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "We pass the cutoff time to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix` using the ``cutoff_time`` argument like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "a0717f7d", "metadata": {}, "outputs": [], "source": [ "fm, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=pd.Timestamp(\"2014-1-1 04:00\"),\n", " instance_ids=[1, 2, 3],\n", " cutoff_time_in_index=True,\n", ")\n", "fm" ] }, { "cell_type": "markdown", "id": "feafa08d", "metadata": {}, "source": [ "Even though the entityset contains the complete transaction history for each customer, only data with a time index up to and including the cutoff time was used to calculate the features above.\n", "\n", "## Using a Cutoff Time DataFrame\n", "\n", "\n", "Oftentimes, the training examples for machine learning will come from different points in time. To specify a unique cutoff time for each row of the resulting feature matrix, we can pass a dataframe which includes one column for the instance id and another column for the corresponding cutoff time. These columns can be in any order, but they must be named properly. 
The column with the instance ids must either be named ``instance_id`` or have the same name as the target dataframe ``index``. The column with the cutoff time values must either be named ``time`` or have the same name as the target dataframe ``time_index``.\n", "\n", "The column names for the instance ids and the cutoff time values should be unambiguous. Passing a dataframe that contains both a column with the same name as the target dataframe ``index`` and a column named ``instance_id`` will result in an error. Similarly, if the cutoff time dataframe contains both a column with the same name as the target dataframe ``time_index`` and a column named ``time`` an error will be raised." ] }, { "cell_type": "raw", "id": "6ffaffd0", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " Only the columns corresponding to the instance ids and the cutoff times are used to calculate features. Any additional columns passed through are appended to the resulting feature matrix. This is typically used to pass through machine learning labels to ensure that they stay aligned with the feature matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "fa5cc115", "metadata": {}, "outputs": [], "source": [ "cutoff_times = pd.DataFrame()\n", "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n", "cutoff_times[\"time\"] = pd.to_datetime(\n", " [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n", ")\n", "cutoff_times[\"label\"] = [True, True, False, True]\n", "cutoff_times\n", "fm, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", ")\n", "fm" ] }, { "cell_type": "markdown", "id": "6185bb0d", "metadata": {}, "source": [ "We can now see that every row of the feature matrix is calculated at the corresponding time in the cutoff time dataframe. 
Because we calculate each row at a different time, it is possible to have a repeat customer. In this case, we calculated the feature vector for customer 1 at both ``04:00`` and ``08:00``.\n", "\n", "Training Window\n", "---------------\n", "\n", "By default, all data up to and including the cutoff time is used. We can restrict the amount of historical data that is selected for calculations using a \"training window.\"\n", "\n", "Here's an example of using a two-hour training window:" ] }, { "cell_type": "code", "execution_count": null, "id": "e321d463", "metadata": {}, "outputs": [], "source": [ "window_fm, window_features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", " training_window=\"2 hour\",\n", ")\n", "\n", "window_fm" ] }, { "cell_type": "markdown", "id": "4ee67c4d", "metadata": {}, "source": [ "We can see that the counts for the same feature are lower after we shorten the training window:" ] }, { "cell_type": "code", "execution_count": null, "id": "93d6b9ae", "metadata": {}, "outputs": [], "source": [ "fm[[\"COUNT(transactions)\"]]\n", "\n", "window_fm[[\"COUNT(transactions)\"]]" ] }, { "cell_type": "markdown", "id": "ad7c73c4", "metadata": {}, "source": [ "## Setting a Last Time Index\n", "\n", "The training window in Featuretools limits the amount of past data that can be used while calculating a particular feature vector. A row in the dataframe is filtered out if the value of its time index is either before or after the training window. This works for dataframes where a row occurs at a single point in time. However, a row can sometimes exist for a duration.\n", "\n", "For example, a customer's session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had *any* transaction during the training window. 
To accomplish this, we need to know not only when a session starts, but also when it ends. The last time that an instance appears in the data is stored in the `_ft_last_time` column on the dataframe. We can compare the time index and the last time index of the ``sessions`` dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "493c8193", "metadata": {}, "outputs": [], "source": [ "last_time_index_col = es[\"sessions\"].ww.metadata.get(\"last_time_index\")\n", "es[\"sessions\"][[\"session_start\", last_time_index_col]].head()" ] }, { "cell_type": "markdown", "id": "b7f1c5cb", "metadata": {}, "source": [ "Featuretools can automatically add last time indexes to every DataFrame in an ``EntitySet`` by running ``EntitySet.add_last_time_indexes()``. When using a training window, if a `last_time_index` has been set, Featuretools will check to see if the `last_time_index` is after the start of the training window. That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window.\n", "\n", "\n", "## Excluding data at cutoff times" ] }, { "cell_type": "raw", "id": "b44bee57", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The ``cutoff_time`` is the last point in time where data can be used for feature\n", "calculation. If you don't want to use the data at the cutoff time in feature\n", "calculation, you can exclude that data by setting ``include_cutoff_time`` to\n", "``False`` in :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`.\n", "If you set it to ``True`` (the default behavior), data from the cutoff time point\n", "will be used." ] }, { "cell_type": "markdown", "id": "2e92d895", "metadata": {}, "source": [ "Setting ``include_cutoff_time`` to ``False`` also impacts how data at the edges\n", "of training windows are included or excluded. 
Take this slice of data as an example:" ] }, { "cell_type": "code", "execution_count": null, "id": "76f9676f", "metadata": {}, "outputs": [], "source": [ "df = es[\"transactions\"]\n", "df[df[\"session_id\"] == 1].head()" ] }, { "cell_type": "markdown", "id": "ce77f6fd", "metadata": {}, "source": [ "Looking at the data, transactions occur every 65 seconds. To check how ``include_cutoff_time``\n", "affects training windows, we can calculate features at the time of a transaction\n", "while using a 65 second training window. This creates a training window with a\n", "transaction at both endpoints of the window. For this example, we'll find the sum\n", "of all transactions for session id 1 that are in the training window." ] }, { "cell_type": "code", "execution_count": null, "id": "1841d78b", "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import Sum\n", "\n", "sum_log = ft.Feature(\n", " es[\"transactions\"].ww[\"amount\"],\n", " parent_dataframe_name=\"sessions\",\n", " primitive=Sum,\n", ")\n", "cutoff_time = pd.DataFrame(\n", " {\n", " \"session_id\": [1],\n", " \"time\": [\"2014-01-01 00:04:20\"],\n", " }\n", ").astype({\"time\": \"datetime64[ns]\"})" ] }, { "cell_type": "markdown", "id": "3c15be10", "metadata": {}, "source": [ "With ``include_cutoff_time=True``, the oldest point in the training window\n", "(``2014-01-01 00:03:15``) is excluded and the cutoff time point is included. This\n", "means only transaction 371 is in the training window, so the sum of all transaction\n", "amounts is 31.54." ] }, { "cell_type": "code", "execution_count": null, "id": "f782683a", "metadata": {}, "outputs": [], "source": [ "# Case1. 
include_cutoff_time = True\n", "actual = ft.calculate_feature_matrix(\n", " features=[sum_log],\n", " entityset=es,\n", " cutoff_time=cutoff_time,\n", " cutoff_time_in_index=True,\n", " training_window=\"65 seconds\",\n", " include_cutoff_time=True,\n", ")\n", "actual" ] }, { "cell_type": "markdown", "id": "324246db", "metadata": {}, "source": [ "Whereas with ``include_cutoff_time=False``, the oldest point in the window is\n", "included and the cutoff time point is excluded. So in this case transaction 116\n", "is included and transaction 371 is excluded, and the sum is 78.92.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9b63bc68", "metadata": {}, "outputs": [], "source": [ "# Case2. include_cutoff_time = False\n", "actual = ft.calculate_feature_matrix(\n", " features=[sum_log],\n", " entityset=es,\n", " cutoff_time=cutoff_time,\n", " cutoff_time_in_index=True,\n", " training_window=\"65 seconds\",\n", " include_cutoff_time=False,\n", ")\n", "actual" ] }, { "cell_type": "markdown", "id": "4329314f", "metadata": {}, "source": [ "Approximating Features by Rounding Cutoff Times\n", "-----------------------------------------------\n", "\n", "For each unique cutoff time, Featuretools must perform operations to select the data that’s valid for computations. If there are a large number of unique cutoff times relative to the number of instances for which we are calculating features, the time spent filtering data can add up. By reducing the number of unique cutoff times, we minimize the overhead from searching for and extracting data for feature calculations.\n", "\n", "One way to decrease the number of unique cutoff times is to round cutoff times to an earlier point in time. An earlier cutoff time is always valid for predictive modeling — it just means we’re not using some of the data we could potentially use while calculating that feature. 
So, we gain computational speed by losing a small amount of information.\n", "\n", "To understand when an approximation is useful, consider calculating features for a model to predict fraudulent credit card transactions. In this case, an important feature might be, \"the average transaction amount for this card in the past\". While this value can change every time there is a new transaction, updating it less frequently might not impact accuracy." ] }, { "cell_type": "raw", "id": "3628cc1c", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " The bank BBVA used approximation when building a predictive model for credit card fraud using Featuretools. For more details, see the \"Real-time deployment considerations\" section of the `white paper `_ describing the work involved.\n" ] }, { "cell_type": "raw", "id": "4bf10090", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The frequency of approximation is controlled using the ``approximate`` parameter to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`. For example, the following code would approximate aggregation features at 1 day intervals::" ] }, { "cell_type": "markdown", "id": "641981d0", "metadata": {}, "source": [ " fm = ft.calculate_feature_matrix(features=features,\n", " entityset=es_transactions,\n", " cutoff_time=ct_transactions,\n", " approximate=\"1 day\")\n", "\n", "In this computation, features that can be approximated will be calculated at 1 day intervals, while features that cannot be approximated (e.g \"where did this transaction occur?\") will be calculated at the exact cutoff time.\n", "\n", "\n", "## Secondary Time Index\n", "\n", "It is sometimes the case that information in a dataset is updated or added after a row has been created. This means that certain columns may actually become known after the time index for a row. 
Rather than drop those columns to avoid leaking information, we can create a secondary time index to indicate when those columns become known." ] }, { "cell_type": "raw", "id": "6f8197f9", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The :func:`Flights ` entityset is a good example of a dataset where column values in a row become known at different times. Each trip is recorded in the ``trip_logs`` dataframe, and has many times associated with it." ] }, { "cell_type": "code", "execution_count": null, "id": "d6043477", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import urllib.request as urllib2\n", "\n", "opener = urllib2.build_opener()\n", "opener.addheaders = [(\"Testing\", \"True\")]\n", "urllib2.install_opener(opener)" ] }, { "cell_type": "code", "execution_count": null, "id": "abf92463", "metadata": {}, "outputs": [], "source": [ "es_flight = ft.demo.load_flight(nrows=100)\n", "es_flight\n", "es_flight[\"trip_logs\"].head(3)" ] }, { "cell_type": "markdown", "id": "36827ff9", "metadata": {}, "source": [ "For every trip log, the time index is ``date_scheduled``, which is when the airline decided on the scheduled departure and arrival times, as well as what route will be flown. We don't know the rest of the information about the actual departure/arrival times and the details of any delay at this time. 
However, it is possible to know everything about how a trip went after it has arrived, so we can use that information at any time after the flight lands.\n", "\n", "Using a secondary time index, we can indicate to Featuretools which columns in our flight logs are known at the time the flight is scheduled, plus which are known at the time the flight lands.\n", "\n", "\"flight\n", "\n", "In Featuretools, when adding the dataframe to the ``EntitySet``, we set the secondary time index to be the arrival time like this:\n", "\n", " es = ft.EntitySet('Flight Data')\n", " arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\n", " 'national_airspace_delay', 'security_delay',\n", " 'late_aircraft_delay', 'canceled', 'diverted',\n", " 'taxi_in', 'taxi_out', 'air_time', 'dep_time']\n", "\n", " es.add_dataframe(\n", " dataframe_name='trip_logs',\n", " dataframe=data,\n", " index='trip_log_id',\n", " make_index=True,\n", " time_index='date_scheduled',\n", " secondary_time_index={'arr_time': arr_time_columns})\n", "\n", "By setting a secondary time index, we can still use the delay information from a row, but only when it becomes known." ] }, { "cell_type": "raw", "id": "eaef7ec8", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. hint::\n", "\n", " It's often a good idea to use a secondary time index if your entityset has inline labels. If you know when the label would be valid for use, it's possible to automatically create very predictive features using historical labels." ] }, { "cell_type": "markdown", "id": "03448def", "metadata": {}, "source": [ "## Flight Predictions\n", "\n", "Let's make some features at varying times using the flight example described above. Trip ``14`` is a flight from CLT to PHX on January 31, 2017 and trip ``92`` is a flight from PIT to DFW on January 1. 
We can set any cutoff time before the flight is scheduled to depart, emulating how we would make the prediction at that point in time.\n", "\n", "We set two cutoff times for trip ``14`` at two different times: one which is more than a month before the flight and another which is only 5 days before. For trip ``92``, we'll only set one cutoff time, three days before it is scheduled to leave.\n", "\n", "\"flight\n", "\n", "Our cutoff time dataframe looks like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "c338105b", "metadata": {}, "outputs": [], "source": [ "ct_flight = pd.DataFrame()\n", "ct_flight[\"trip_log_id\"] = [14, 14, 92]\n", "ct_flight[\"time\"] = pd.to_datetime([\"2016-12-28\", \"2017-1-25\", \"2016-12-28\"])\n", "ct_flight[\"label\"] = [True, True, False]\n", "ct_flight" ] }, { "cell_type": "markdown", "id": "f26db5dd", "metadata": {}, "source": [ "Now, let's calculate the feature matrix:" ] }, { "cell_type": "code", "execution_count": null, "id": "bd56c24e", "metadata": {}, "outputs": [], "source": [ "fm, features = ft.dfs(\n", " entityset=es_flight,\n", " target_dataframe_name=\"trip_logs\",\n", " cutoff_time=ct_flight,\n", " cutoff_time_in_index=True,\n", " agg_primitives=[\"max\"],\n", " trans_primitives=[\"month\"],\n", ")\n", "fm[\n", " [\n", " \"flights.origin\",\n", " \"flights.dest\",\n", " \"label\",\n", " \"flights.MAX(trip_logs.arr_delay)\",\n", " \"MONTH(scheduled_dep_time)\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "id": "f367279c", "metadata": {}, "source": [ "Let's understand the output:\n", "\n", "1. A row was made for every id-time pair in ``ct_flight``, which is returned as the index of the feature matrix.\n", "\n", "2. The output was sorted by cutoff time. Because of the sorting, it's often helpful to pass in a label with the cutoff time dataframe so that it will remain sorted in the same fashion as the feature matrix. 
Any additional columns beyond ``id`` and ``cutoff_time`` will not be used for making features.\n", "\n", "3. The column ``flights.MAX(trip_logs.arr_delay)`` is not always defined. It can only have any real values when there are historical flights to aggregate. Notice that, for trip ``14``, there wasn't any historical data when we made the feature a month in advance, but there **were** flights to aggregate when we shortened it to 5 days. These are powerful features that are often excluded in manual processes because of how hard they are to make.\n", "\n", "\n", "Creating and Flattening a Feature Tensor\n", "----------------------------------------" ] }, { "cell_type": "raw", "id": "3d5f23cc", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The :func:`featuretools.make_temporal_cutoffs` function generates a series of equally spaced cutoff times from a given set of cutoff times and instance ids." ] }, { "cell_type": "markdown", "id": "a7b677e7", "metadata": {}, "source": [ "This function can be paired with DFS to create and flatten a feature tensor rather than making multiple feature matrices at different delays.\n", "\n", "The function\n", "takes in the following parameters:\n", "\n", " * ``instance_ids (list, pd.Series, or np.ndarray)``: A list of instances.\n", " * ``cutoffs (list, pd.Series, or np.ndarray)``: An associated list of cutoff times.\n", " * ``window_size (str or pandas.DateOffset)``: The amount of time between each cutoff time in the created time series.\n", " * ``start (datetime.datetime or pd.Timestamp)``: The first cutoff time in the created time series.\n", " * ``num_windows (int)``: The number of cutoff times to create in the created time series.\n", "\n", "Only two of the three options ``window_size``, ``start``, and ``num_windows`` need to be specified to uniquely determine an equally-spaced set of cutoff times at which to compute each instance.\n", "\n", "If your cutoff times are the ones used above:\n" ] }, { 
"cell_type": "code", "execution_count": null, "id": "e7648a9d", "metadata": {}, "outputs": [], "source": [ "cutoff_times" ] }, { "cell_type": "markdown", "id": "9bda6ff4", "metadata": {}, "source": [ "Then passing in ``window_size='1h'`` and ``num_windows=2`` makes one row an hour over the last two hours to produce the following new dataframe. The result can be directly passed into DFS to make features at the different time points." ] }, { "cell_type": "code", "execution_count": null, "id": "b4204f47", "metadata": {}, "outputs": [], "source": [ "temporal_cutoffs = ft.make_temporal_cutoffs(\n", " cutoff_times[\"customer_id\"], cutoff_times[\"time\"], window_size=\"1h\", num_windows=2\n", ")\n", "temporal_cutoffs\n", "fm, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=temporal_cutoffs,\n", " cutoff_time_in_index=True,\n", ")\n", "fm" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/getting_started/primitives.ipynb ================================================ { "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _primitives:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature primitives\n", "Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. 
Because a primitive only constrains the input and output data types, they can be applied across datasets and can stack to create new calculations.\n", "\n", "## Why primitives?\n", "The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.\n", "\n", "A primitive only constrains the input and output data types. This means they can be used to transfer calculations known in one domain to another. Consider a feature which is often calculated by data scientists for transactional or event logs data: *average time between events*. This feature is incredibly valuable in predicting fraudulent behavior or future customer engagement.\n", "\n", "DFS achieves the same feature by stacking two primitives `\"time_since_previous\"` and `\"mean\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\"],\n", " trans_primitives=[\"time_since_previous\"],\n", " features_only=True,\n", ")\n", "\n", "feature_defs" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: \n", "\n", " The primitive arguments to DFS (eg. ``agg_primitives`` and ``trans_primitives`` in the example above) accept ``snake_case``, ``camelCase``, or ``TitleCase`` strings of included Featuretools primitives (ie. ``time_since_previous``, ``timeSincePrevious``, and ``TimeSincePrevious`` are all acceptable inputs).\n", "\n", ".. note::\n", "\n", " When ``dfs`` is called with ``features_only=True``, only feature definitions are returned as output. By default this parameter is set to ``False``. 
This parameter is used to quickly inspect the feature definitions before spending time calculating the feature matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. This is used by Deep Feature Synthesis to get several different ways of summarizing the time since the previous event." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\", \"max\", \"min\", \"std\", \"skew\"],\n", " trans_primitives=[\"time_since_previous\"],\n", ")\n", "\n", "feature_matrix[\n", " [\n", " \"MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"MAX(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"MIN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"STD(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation vs Transform Primitive\n", "\n", "In the example above, we use two types of primitives.\n", "\n", "**Aggregation primitives:** These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g: `\"count\"`, `\"sum\"`, `\"avg_time_between\"`." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. graphviz:: graphs/agg_feat.dot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Transform primitives:** These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. They are applied to a single dataframe. E.g: `\"hour\"`, `\"time_since_previous\"`, `\"absolute\"`." 
] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. graphviz:: graphs/trans_feat.dot\n", "\n", "\n", "The above graphs were generated using the :func:`graph_feature ` function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a DataFrame that lists and describes each built-in primitive in Featuretools, call `ft.list_primitives()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ft.list_primitives().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call `ft.summarize_primitives()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ft.summarize_primitives()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Custom Primitives\n", "\n", "The library of primitives in Featuretools is constantly expanding. Users can define their own primitive using the APIs below. To define a primitive, a user will\n", "\n", "\n", " * Specify the type of primitive `Aggregation` or `Transform`\n", " * Define the input and output data types\n", " * Write a function in Python to do the calculation\n", " * Annotate with attributes to constrain how it is applied\n", "\n", "\n", "Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to automatically be transferred to another." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from woodwork.column_schema import ColumnSchema\n", "from woodwork.logical_types import Datetime, NaturalLanguage\n", "\n", "from featuretools.primitives import AggregationPrimitive, TransformPrimitive\n", "from featuretools.tests.testing_utils import make_ecommerce_entityset" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "### Simple Custom Primitives" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Absolute(TransformPrimitive):\n", " name = \"absolute\"\n", " input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " def get_function(self):\n", " def absolute(column):\n", " return abs(column)\n", "\n", " return absolute" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using `TransformPrimitive` as a base and overriding `get_function` to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork [Understanding Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide.\n", "\n", "Similarly, we can make a new aggregation primitive using `AggregationPrimitive`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Maximum(AggregationPrimitive):\n", " name = \"maximum\"\n", " input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " def get_function(self):\n", " def maximum(column):\n", " return max(column)\n", "\n", " return maximum" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "Because we defined an aggregation primitive, the function takes in a list of values but only returns one.\n", "\n", "Now that we've defined two primitives, we can use them with the dfs function as if they were built-in primitives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[Maximum],\n", " trans_primitives=[Absolute],\n", " max_depth=2,\n", ")\n", "\n", "feature_matrix.head(5)[\n", " [\n", " \"customers.MAXIMUM(transactions.amount)\",\n", " \"MAXIMUM(transactions.ABSOLUTE(amount))\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "### Word Count Example\n", "\n", "Here we define a transform primitive, `WordCount`, which counts the number of words in each row of an input and returns a list of the counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class WordCount(TransformPrimitive):\n", " \"\"\"\n", " Counts the number of words in each row of the column. 
Returns a list\n", " of the counts for each row.\n", " \"\"\"\n", "\n", " name = \"word_count\"\n", " input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " def get_function(self):\n", " def word_count(column):\n", " word_counts = []\n", " for value in column:\n", " words = value.split(None)\n", " word_counts.append(len(words))\n", " return word_counts\n", "\n", " return word_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = make_ecommerce_entityset()\n", "\n", "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[\"sum\", \"mean\", \"std\"],\n", " trans_primitives=[WordCount],\n", ")\n", "\n", "feature_matrix[\n", " [\n", " \"customers.WORD_COUNT(favorite_quote)\",\n", " \"STD(log.WORD_COUNT(comments))\",\n", " \"SUM(log.WORD_COUNT(comments))\",\n", " \"MEAN(log.WORD_COUNT(comments))\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.\n", "\n", "### Multiple Input Types\n", "\n", "If a primitive requires multiple features as input, `input_types` has multiple elements, eg `[ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})]` would mean the primitive requires two columns with the semantic tag `numeric` as input. Below is an example of a primitive that has multiple input features." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MeanSunday(AggregationPrimitive):\n", " \"\"\"\n", " Finds the mean of non-null values of a feature that occurred on Sundays\n", " \"\"\"\n", "\n", " name = \"mean_sunday\"\n", " input_types = [\n", " ColumnSchema(semantic_tags={\"numeric\"}),\n", " ColumnSchema(logical_type=Datetime),\n", " ]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " def get_function(self):\n", " def mean_sunday(numeric, datetime):\n", " days = pd.DatetimeIndex(datetime).weekday.values\n", " df = pd.DataFrame({\"numeric\": numeric, \"time\": days})\n", " return df[df[\"time\"] == 6][\"numeric\"].mean()\n", "\n", " return mean_sunday" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[MeanSunday],\n", " trans_primitives=[],\n", " max_depth=1,\n", ")\n", "\n", "feature_matrix[\n", " [\n", " \"MEAN_SUNDAY(log.value, datetime)\",\n", " \"MEAN_SUNDAY(log.value_2, datetime)\",\n", " ]\n", "]" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: docs/source/getting_started/using_entitysets.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Representing Data with EntitySets\n", "\n", "An ``EntitySet`` is a collection of dataframes and the relationships between them. 
They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take ``dataframes`` and ``relationships`` as separate arguments, it is recommended to create an ``EntitySet``, so you can more easily manipulate your data as needed.\n", "\n", "## The Raw Data\n", "\n", "Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "data = ft.demo.load_mock_customer()\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "\n", "transactions_df.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the second dataframe is a list of products involved in those transactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_df = data[\"products\"]\n", "products_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an EntitySet\n", "\n", "First, we initialize an ``EntitySet``. If you'd like to give it a name, you can optionally provide an ``id`` to the constructor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.EntitySet(id=\"customer_data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding dataframes\n", "\n", "To get started, we add the transactions dataframe to the `EntitySet`. 
In the call to ``add_dataframe``, we specify three important parameters:\n", "\n", "* The ``index`` parameter specifies the column that uniquely identifies rows in the dataframe.\n", "* The ``time_index`` parameter tells Featuretools when the data was created.\n", "* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from woodwork.logical_types import Categorical, PostalCode\n", "\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\",\n", " dataframe=transactions_df,\n", " index=\"transaction_id\",\n", " time_index=\"transaction_time\",\n", " logical_types={\n", " \"product_id\": Categorical,\n", " \"zip_code\": PostalCode,\n", " },\n", ")\n", "\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use a setter on the ``EntitySet`` object to add dataframes" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. currentmodule:: featuretools\n", "\n", "\n", ".. note ::\n", "\n", " You can also use a setter on the ``EntitySet`` object to add dataframes\n", "\n", " ``es[\"transactions\"] = transactions_df``\n", "\n", " Note that this will use the default implementation of `add_dataframe`, notably the following:\n", "\n", " * if the DataFrame does not have `Woodwork `_ initialized, the first column will be the index column\n", " * if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n", " * if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe.\n", "\n", ".. note ::\n", "\n", " You can also display your `EntitySet` structure graphically by calling :meth:`.EntitySet.plot`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method associates each column in the dataframe to a [Woodwork](https://woodwork.alteryx.com/) logical type. Each logical type can have an associated standard semantic tag that helps define the column data type. If you don't specify the logical type for a column, it gets inferred based on the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the [Woodwork documentation](https://woodwork.alteryx.com/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].ww.schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can do the same thing with our products dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.add_dataframe(\n", " dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n", ")\n", "\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With two dataframes in our `EntitySet`, we can add a relationship between them.\n", "\n", "## Adding a Relationship\n", "\n", "We want to relate these two dataframes by the columns called \"product_id\" in each dataframe. Each product has multiple transactions associated with it, so it is called the **parent dataframe**, while the transactions dataframe is known as the **child dataframe**. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we see the relationship has been added to our `EntitySet`.\n", "\n", "## Creating a dataframe from an existing table\n", "\n", "When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we \"normalize\" the transaction dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.normalize_dataframe(\n", " base_dataframe_name=\"transactions\",\n", " new_dataframe_name=\"sessions\",\n", " index=\"session_id\",\n", " make_time_index=\"session_start\",\n", " additional_columns=[\n", " \"device\",\n", " \"customer_id\",\n", " \"zip_code\",\n", " \"session_start\",\n", " \"join_date\",\n", " ],\n", ")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the output above, we see this method did two operations:\n", "\n", "1. It created a new dataframe called \"sessions\" based on the \"session_id\" and \"session_start\" columns in \"transactions\"\n", "2. It added a relationship connecting \"transactions\" and \"sessions\"\n", "\n", "If we look at the schema from the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].ww.schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"sessions\"].ww.schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 
It removed \"device\", \"customer_id\", \"zip_code\" and \"join_date\" from \"transactions\" and created new columns in the sessions dataframe. This reduces redundant information, as those properties of a session don't change between transactions.\n", "2. It copied \"session_start\" into the new sessions dataframe and marked it as the time index column to indicate the beginning of a session. If the base dataframe has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new dataframe. In this case it would create a new time index called \"first_transactions_time\" using the time of the first transaction of each session. If we don't want this time index to be created, we can set ``make_time_index=False``.\n", "\n", "If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"sessions\"].head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To finish preparing this dataset, create a \"customers\" dataframe using the same method call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.normalize_dataframe(\n", " base_dataframe_name=\"sessions\",\n", " new_dataframe_name=\"customers\",\n", " index=\"customer_id\",\n", " make_time_index=\"join_date\",\n", " additional_columns=[\"zip_code\", \"join_date\"],\n", ")\n", "\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the EntitySet\n", "\n", "Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name=\"products\")\n", "\n", "feature_matrix" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "vscode": { "languageId": "raw" } }, "source": [ "As we can see, the features from DFS use the relational structure of our `EntitySet`. Therefore it is important to think carefully about the dataframes that we create." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: docs/source/getting_started/woodwork_types.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "b95b28c1", "metadata": {}, "source": [ "# Woodwork Typing in Featuretools\n", "\n", "Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. 
Now and moving forward, Featuretools will use an external data typing library for its typing: [Woodwork](https://woodwork.alteryx.com/en/stable/index.html).\n", "\n", "Understanding the Woodwork types that exist and how Featuretools uses Woodwork's type system will allow users to:\n", " - build EntitySets that best represent their data\n", " - understand the possible input and return types for Featuretools' Primitives\n", " - understand what features will get generated from a given set of data and primitives.\n", "\n", "Read the [Understanding Woodwork Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide for an in-depth walkthrough of the available Woodwork types that are outlined below.\n", "\n", "For users that are familiar with the old `Variable` objects, the [Transitioning to Featuretools Version 1.0](../resources/transition_to_ft_v1.0.ipynb) guide will be useful for converting Variable types to Woodwork types.\n", "\n", "## Physical Types \n", "Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s `dtype`.\n", "\n", "Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. Each Woodwork `LogicalType` class has a single physical type associated with it.\n", "\n", "## Logical Types\n", "Logical types add additional information about how data should be interpreted or parsed beyond what can be contained in a physical type. 
In fact, multiple logical types have the same physical type, each imparting a different meaning that's not contained in the physical type alone.\n", "\n", "In Featuretools, a column's logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis.\n", "\n", "Woodwork provides many different logical types, which can be seen with the `list_logical_types` function." ] }, { "cell_type": "code", "execution_count": null, "id": "497712b0", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "ft.list_logical_types()" ] }, { "cell_type": "markdown", "id": "cfe99d0f", "metadata": {}, "source": [ "Featuretools will perform type inference to assign logical types to the data in EntitySets if none are provided, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type).\n", "\n", "To learn more about how logical types are used in EntitySets, see the [Creating EntitySets](using_entitysets.ipynb) guide.\n", "\n", "To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on [working with Logical Types](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Logical-Types). \n", "\n", "## Semantic Tags\n", "Semantic tags provide additional information to columns about the meaning or potential uses of data. Columns can have many or no semantic tags. Some tags are added by Woodwork, some are added by Featuretools, and users can add additional tags as they see fit.\n", "\n", "To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on [working with Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Semantic-Tags). \n", "\n", "### Woodwork-defined Semantic Tags\n", "\n", "Woodwork will add certain semantic tags to columns at initialization. 
These can be standard tags that may be associated with different sets of logical types or index tags. There are also tags that users can add to confer a suggested meaning to columns in Woodwork.\n", "\n", "To get a list of these tags, you can use the `list_semantic_tags` function." ] }, { "cell_type": "code", "execution_count": null, "id": "11f25bd9", "metadata": {}, "outputs": [], "source": [ "ft.list_semantic_tags()" ] }, { "cell_type": "markdown", "id": "29222810", "metadata": {}, "source": [ "Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data, an example of which can be seen in the `Age` primitive, which requires that the `date_of_birth` semantic tag be present on a column.\n", "\n", "The `date_of_birth` tag will not get automatically added by Woodwork, so in order for Featuretools to be able to use the `Age` primitive, the `date_of_birth` tag must be manually added to any columns to which it applies.\n", "\n", "### Featuretools-defined Semantic Tags\n", "\n", "Just like Woodwork specifies semantic tags internally, Featuretools also defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.\n", "\n", "- `'last_time_index'` - added by Featuretools to the last time index column of a DataFrame. 
Indicates that this column has been created by Featuretools.\n", "- `'foreign_key'` - used to indicate that this column is the child column of a relationship, meaning that this column is related to a corresponding index column of another dataframe in the EntitySet.\n", "\n", "\n", "## Woodwork Throughout Featuretools\n", "\n", "Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.\n", "\n", "### Woodwork in EntitySets\n", "For more information on building EntitySets using Woodwork, see the [EntitySet guide](using_entitysets.ipynb).\n", "\n", "Let's look at the Woodwork typing information as it's stored in a demo EntitySet of retail data:" ] }, { "cell_type": "code", "execution_count": null, "id": "bd9c1ec9", "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_retail()\n", "es" ] }, { "cell_type": "markdown", "id": "267880c4", "metadata": {}, "source": [ "Woodwork typing information is not stored in the EntitySet object, but rather is stored in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the `ww` namespace:" ] }, { "cell_type": "code", "execution_count": null, "id": "aa1966fd", "metadata": {}, "outputs": [], "source": [ "df = es[\"products\"]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "164b1138", "metadata": {}, "outputs": [], "source": [ "df.ww" ] }, { "cell_type": "markdown", "id": "4bffac54", "metadata": {}, "source": [ "Notice how the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide.
To reiterate: By defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined a DataFrame's Woodwork schema, and with it, we can gain an understanding of the contents of each column.\n", "\n", "This column-specific typing information that exists for every column in every DataFrame in an EntitySet is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.\n", "\n", "### Woodwork in DFS\n", "As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the [Feature Primitives](primitives.ipynb) guide. Here, we'll look at how the Woodwork types come together into a `ColumnSchema` object to describe Primitive input and return types.\n", "\n", "Below is a Woodwork `ColumnSchema` that we've obtained from the `'product_id'` column in the `products` DataFrame in the retail EntitySet." ] }, { "cell_type": "code", "execution_count": null, "id": "349e5274", "metadata": {}, "outputs": [], "source": [ "products_df = es[\"products\"]\n", "product_ids_series = products_df.ww[\"product_id\"]\n", "column_schema = product_ids_series.ww.schema\n", "column_schema" ] }, { "cell_type": "markdown", "id": "8e8c0ccf", "metadata": {}, "source": [ "This combination of logical type and semantic tag typing information is a `ColumnSchema`. In the case above, the `ColumnSchema` describes the **type definition** for a single column of data. \n", "\n", "Notice that there is no physical type in a `ColumnSchema`. This is because a `ColumnSchema` is a collection of Woodwork types that doesn't have any data tied to it and therefore has no physical representation. 
Because a `ColumnSchema` object is not tied to any data, it can also be used to describe a **type space** into which other columns may or may not fall.\n", "\n", "This flexibility of the `ColumnSchema` class allows `ColumnSchema` objects to be used both as type definitions for every column in an EntitySet as well as input and return type spaces for every Primitive in Featuretools.\n", "\n", "Let's look at a different column in a different DataFrame to see how this works:" ] }, { "cell_type": "code", "execution_count": null, "id": "f3bb3ffe", "metadata": {}, "outputs": [], "source": [ "order_products_df = es[\"order_products\"]\n", "order_products_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "1aae3378", "metadata": {}, "outputs": [], "source": [ "quantity_series = order_products_df.ww[\"quantity\"]\n", "column_schema = quantity_series.ww.schema\n", "column_schema" ] }, { "cell_type": "markdown", "id": "f067db9a", "metadata": {}, "source": [ "The `ColumnSchema` above has been pulled from the `'quantity'` column in the `order_products` DataFrame in the retail EntitySet. This is a **type definition**. \n", "\n", "If we look at the Woodwork typing information for the `order_products` DataFrame, we can see that there are several columns that will have similar `ColumnSchema` type definitions. If we wanted to describe subsets of those columns, we could define several `ColumnSchema` **type spaces**" ] }, { "cell_type": "code", "execution_count": null, "id": "bc2bfae6", "metadata": {}, "outputs": [], "source": [ "es[\"order_products\"].ww" ] }, { "cell_type": "markdown", "id": "73257dcf", "metadata": {}, "source": [ "Below are several `ColumnSchema`s that all would include our `quantity` column, but each of them describes a different type space. These `ColumnSchema`s get more restrictive as we go down:\n", "\n", "##### Entire DataFrame\n", "No restrictions have been placed; any column falls into this definition. 
This would include the whole DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "f6614c98", "metadata": {}, "outputs": [], "source": [ "from woodwork.column_schema import ColumnSchema\n", "\n", "ColumnSchema()" ] }, { "cell_type": "markdown", "id": "299fc7d2", "metadata": {}, "source": [ "An example of a Primitive with this `ColumnSchema` as its input type is the `IsNull` transform primitive.\n", "\n", "##### By Semantic Tag\n", "Only columns with the `numeric` tag qualify. This can include Double, Integer, and Age logical type columns as well. It will not include the `index` column which, despite containing integers, has had its standard tags replaced by the `'index'` tag." ] }, { "cell_type": "code", "execution_count": null, "id": "16c1a5a9", "metadata": {}, "outputs": [], "source": [ "ColumnSchema(semantic_tags={\"numeric\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "0932d05d", "metadata": {}, "outputs": [], "source": [ "df = es[\"order_products\"].ww.select(include=\"numeric\")\n", "df.ww" ] }, { "cell_type": "markdown", "id": "a5ec95c8", "metadata": {}, "source": [ "An example of a Primitive with this `ColumnSchema` as its input type is the `Mean` aggregation primitive.\n", "\n", "##### By Logical Type\n", "Only columns with logical type of `Integer` are included in this definition. This does not require the `numeric` tag, so an index column (which has its standard tags removed) would still qualify."
] }, { "cell_type": "code", "execution_count": null, "id": "79bd3d4f", "metadata": {}, "outputs": [], "source": [ "from woodwork.logical_types import Integer\n", "\n", "ColumnSchema(logical_type=Integer)" ] }, { "cell_type": "code", "execution_count": null, "id": "e905229e", "metadata": {}, "outputs": [], "source": [ "df = es[\"order_products\"].ww.select(include=\"Integer\")\n", "df.ww" ] }, { "cell_type": "markdown", "id": "2f752200", "metadata": {}, "source": [ "##### By Logical Type and Semantic Tag\n", "The column must have logical type `Integer` and have the `numeric` semantic tag, excluding index columns." ] }, { "cell_type": "code", "execution_count": null, "id": "6da51b75", "metadata": {}, "outputs": [], "source": [ "ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "a96d92f6", "metadata": {}, "outputs": [], "source": [ "df = es[\"order_products\"].ww.select(include=\"numeric\")\n", "df = df.ww.select(include=\"Integer\")\n", "df.ww" ] }, { "cell_type": "markdown", "id": "71e0359b", "metadata": {}, "source": [ "In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive in building Features during DFS.\n", "\n", "Each Primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. 
We then get a Feature that has a `column_schema` property that indicates what that Feature's typing definition is in a way that lets DFS stack features on top of one another.\n", "\n", "In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the `ColumnSchema`, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/advanced_custom_primitives.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Custom Primitives Guide" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "import numpy as np\n", "from woodwork.column_schema import ColumnSchema\n", "from woodwork.logical_types import Datetime, NaturalLanguage\n", "\n", "import featuretools as ft\n", "from featuretools.primitives import TransformPrimitive\n", "from featuretools.tests.testing_utils import make_ecommerce_entityset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Primitives with Additional Arguments\n", "\n", "Some features require more advanced calculations than others. Advanced features usually entail additional arguments to help output the desired value. With custom primitives, you can use primitive arguments to help you create advanced features.\n", "\n", "### String Count Example\n", "\n", "In this example, you will learn how to make custom primitives that take in additional arguments. 
You will create a primitive to count the number of times a specific string value occurs inside a text.\n", "\n", "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return a numeric column as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with the semantic tag `'numeric'`. The specific string value is the additional argument, so define it as a *keyword* argument inside `__init__`. Then, override `get_function` to return a primitive function that will calculate the feature.\n", "\n", "Featuretools' primitives use Woodwork's `ColumnSchema` to control the input and return types of columns for the primitive. For more information about using the Woodwork typing system in Featuretools, see the [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb) guide." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class StringCount(TransformPrimitive):\n", " \"\"\"Count the number of times the string value occurs.\"\"\"\n", "\n", " name = \"string_count\"\n", " input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " def __init__(self, string=None):\n", " self.string = string\n", "\n", " def get_function(self):\n", " def string_count(column):\n", " assert self.string is not None, \"string to count needs to be defined\"\n", " # this is a naive implementation used for clarity\n", " counts = [text.lower().count(self.string) for text in column]\n", " return counts\n", "\n", " return string_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have a primitive that is reusable for different string values. For example, you can create features based on the number of times the word \"the\" appears in a text. 
Create an instance of the primitive where the string value is \"the\" and pass the primitive into DFS to generate the features. The feature name will automatically reflect the string value of the primitive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = make_ecommerce_entityset()\n", "\n", "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[\"sum\", \"mean\", \"std\"],\n", " trans_primitives=[StringCount(string=\"the\")],\n", ")\n", "\n", "feature_matrix[\n", " [\n", " \"STD(log.STRING_COUNT(comments, string=the))\",\n", " \"SUM(log.STRING_COUNT(comments, string=the))\",\n", " \"MEAN(log.STRING_COUNT(comments, string=the))\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Features with Multiple Outputs\n", "\n", "Some calculations output more than a single value. With custom primitives, you can make the most of these calculations by creating a feature for each output value.\n", "\n", "### Case Count Example\n", "\n", "In this example, you will learn how to make custom primitives that output multiple features. You will create a primitive that outputs the count of upper case and lower case letters of a text.\n", "\n", "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return two numeric columns as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'`. Since this primitive returns two columns, also set `number_output_features` to two. Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class CaseCount(TransformPrimitive):\n", " \"\"\"Return the count of upper case and lower case letters of a text.\"\"\"\n", "\n", " name = \"case_count\"\n", " input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", " number_output_features = 2\n", "\n", " def get_function(self):\n", " def case_count(array):\n", " # this is a naive implementation used for clarity\n", " upper = np.array([len(re.findall(\"[A-Z]\", i)) for i in array])\n", " lower = np.array([len(re.findall(\"[a-z]\", i)) for i in array])\n", " return upper, lower\n", "\n", " return case_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have a primitive that outputs two columns. One column contains the count for the upper case letters. The other column contains the count for the lower case letters. Pass the primitive into DFS to generate features. By default, the feature name will reflect the index of the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[],\n", " trans_primitives=[CaseCount],\n", ")\n", "\n", "feature_matrix[\n", " [\n", " \"customers.CASE_COUNT(favorite_quote)[0]\",\n", " \"customers.CASE_COUNT(favorite_quote)[1]\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom Naming for Multiple Outputs\n", "\n", "When you create a primitive that outputs multiple features, you can also define custom naming for each of those features.\n", "\n", "### Hourly Sine and Cosine Example\n", "\n", "In this example, you will learn how to apply custom naming for multiple outputs. 
You will create a primitive that outputs the sine and cosine of the hour.\n", "\n", "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in the time index as the input and return two numeric columns as the output. Set the input type to a Woodwork `ColumnSchema` with a logical type of `Datetime` and the semantic tag `'time_index'`. Next, set the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'` and set `number_output_features` to two. Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns. Also, override `generate_names` to return a list of the feature names that you define." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class HourlySineAndCosine(TransformPrimitive):\n", " \"\"\"Returns the sine and cosine of the hour.\"\"\"\n", "\n", " name = \"hourly_sine_and_cosine\"\n", " input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n", " return_type = ColumnSchema(semantic_tags={\"numeric\"})\n", "\n", " number_output_features = 2\n", "\n", " def get_function(self):\n", " def hourly_sine_and_cosine(column):\n", " sine = np.sin(column.dt.hour)\n", " cosine = np.cos(column.dt.hour)\n", " return sine, cosine\n", "\n", " return hourly_sine_and_cosine\n", "\n", " def generate_names(self, base_feature_names):\n", " name = self.generate_name(base_feature_names)\n", " return f\"{name}[sine]\", f\"{name}[cosine]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have a primitive that outputs two columns. One column contains the sine of the hour. The other column contains the cosine of the hour. Pass the primitive into DFS to generate features. The feature name will reflect the custom naming you defined." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"log\",\n", " agg_primitives=[],\n", " trans_primitives=[HourlySineAndCosine],\n", ")\n", "\n", "feature_matrix.head()[\n", " [\n", " \"HOURLY_SINE_AND_COSINE(datetime)[sine]\",\n", " \"HOURLY_SINE_AND_COSINE(datetime)[cosine]\",\n", " ]\n", "]" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: docs/source/guides/deployment.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "92a0dab5", "metadata": {}, "source": [ "# Deployment\n", "\n", "Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.\n", "\n", "## Saving Features\n", "\n", "First, let's generate some training and test data in the same format. We use a different random seed to generate different data for the test set." ] }, { "cell_type": "raw", "id": "129c8011", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note ::\n", "\n", " Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools."
] }, { "cell_type": "code", "execution_count": null, "id": "01c19e97", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es_train = ft.demo.load_mock_customer(return_entityset=True)\n", "es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)" ] }, { "cell_type": "markdown", "id": "042f8c02", "metadata": {}, "source": [ "Now let's build some feature definitions using DFS. Because we have categorical features, we also encode them with one-hot encoding based on the values in the training data." ] }, { "cell_type": "code", "execution_count": null, "id": "6bcc87a0", "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es_train, target_dataframe_name=\"customers\"\n", ")\n", "\n", "feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\n", "feature_matrix_enc" ] }, { "cell_type": "markdown", "id": "03ffe00a", "metadata": {}, "source": [ "Now, we can use [featuretools.save_features](../generated/featuretools.save_features.rst#featuretools.save_features) to save a list of features to a JSON file." ] }, { "cell_type": "code", "execution_count": null, "id": "79d4ff65", "metadata": {}, "outputs": [], "source": [ "ft.save_features(features_enc, \"feature_definitions.json\")" ] }, { "cell_type": "markdown", "id": "67723f25", "metadata": {}, "source": [ "## Calculating Feature Matrix for New Data\n", "\n", "We can use [featuretools.load_features](../generated/featuretools.load_features.rst#featuretools.load_features) to read in a list of saved features to calculate for our new entity set." ] }, { "cell_type": "code", "execution_count": null, "id": "a8f728c0", "metadata": {}, "outputs": [], "source": [ "saved_features = ft.load_features(\"feature_definitions.json\")" ] }, { "cell_type": "markdown", "id": "1624ea4d", "metadata": {}, "source": [ "After we load the features back in, we can calculate the feature matrix."
] }, { "cell_type": "code", "execution_count": null, "id": "f37f61e0", "metadata": {}, "outputs": [], "source": [ "feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "c9f39b54", "metadata": {}, "source": [ "As you can see above, we have the exact same features as before, but calculated using the test data." ] }, { "cell_type": "markdown", "id": "42a47ad9", "metadata": {}, "source": [ "## Exporting Feature Matrix\n", "\n", "### Save as CSV\n", "\n", "The feature matrix is a pandas DataFrame that we can save to disk." ] }, { "cell_type": "code", "execution_count": null, "id": "570c69fa", "metadata": {}, "outputs": [], "source": [ "feature_matrix.to_csv(\"feature_matrix.csv\")" ] }, { "cell_type": "markdown", "id": "f0fc5342", "metadata": {}, "source": [ "We can also read it back in as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "297db0a6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "saved_fm = pd.read_csv(\"feature_matrix.csv\", index_col=\"customer_id\")\n", "saved_fm" ] }, { "cell_type": "code", "execution_count": null, "id": "1b84dc51", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import os\n", "\n", "os.remove(\"feature_definitions.json\")\n", "os.remove(\"feature_matrix.csv\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/feature_descriptions.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "1557274d", "metadata": {}, "source": [ "# Generating Feature 
Descriptions\n", "\n", "As features become more complicated, their names can become harder to understand. Both the [describe_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.describe_feature.html) function and the [graph_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.graph_feature.html) function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the ``describe_feature`` function can be augmented by providing custom definitions and templates to improve the resulting descriptions. " ] }, { "cell_type": "code", "execution_count": null, "id": "cdb8b3eb", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\", \"sum\", \"mode\", \"n_most_common\"],\n", " trans_primitives=[\"month\", \"hour\"],\n", " max_depth=2,\n", " features_only=True,\n", ")" ] }, { "cell_type": "markdown", "id": "01f8209c", "metadata": {}, "source": [ "By default, ``describe_feature`` uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "35b86722", "metadata": {}, "outputs": [], "source": [ "feature_defs[9]" ] }, { "cell_type": "code", "execution_count": null, "id": "e24bee8d", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[9])" ] }, { "cell_type": "code", "execution_count": null, "id": "5402e848", "metadata": {}, "outputs": [], "source": [ "feature_defs[14]" ] }, { "cell_type": "code", "execution_count": null, "id": "ac22c09c", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[14])" ] }, { "cell_type": "markdown", "id": "ff9b7b35", "metadata": {}, "source": [ "## Improving Descriptions\n", "\n", "While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions. \n", "\n", "#### Feature Descriptions\n", "Custom feature definitions will get used in the description in place of the automatically generated description. This can be used to better explain what a `ColumnSchema` or feature is, or to provide descriptions that take advantage of a user's existing knowledge about the data or domain. " ] }, { "cell_type": "code", "execution_count": null, "id": "33b2f8e5", "metadata": {}, "outputs": [], "source": [ "feature_descriptions = {\"customers: join_date\": \"the date the customer joined\"}\n", "\n", "ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)" ] }, { "cell_type": "markdown", "id": "218147f4", "metadata": {}, "source": [ "For example, the above replaces the column name, ``\"join_date\"``, with a more descriptive definition of what that column represents in the dataset. 
Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the ``description`` attribute present on each `ColumnSchema`:" ] }, { "cell_type": "code", "execution_count": null, "id": "597e20a6", "metadata": {}, "outputs": [], "source": [ "join_date_column_schema = es[\"customers\"].ww.columns[\"join_date\"]\n", "join_date_column_schema.description = \"the date the customer joined\"\n", "\n", "es[\"customers\"].ww.columns[\"join_date\"].description" ] }, { "cell_type": "code", "execution_count": null, "id": "6c013615", "metadata": {}, "outputs": [], "source": [ "feature = ft.TransformFeature(es[\"customers\"].ww[\"join_date\"], ft.primitives.Hour)\n", "feature" ] }, { "cell_type": "code", "execution_count": null, "id": "03e828b4", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature)" ] }, { "cell_type": "raw", "id": "689cbd98", "metadata": {}, "source": [ ".. note::\n", "\n", " When setting a description on a column in a DataFrame as described above, be careful to avoid setting the description via ``df.ww[col_name].ww.description``. The use of ``df.ww[col_name]`` creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Therefore, setting the description in any way other than going through the ``columns`` attribute will not set the column's description in a way that will be propagated to the feature description. " ] }, { "cell_type": "markdown", "id": "10e779f5", "metadata": {}, "source": [ "Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to ``describe_feature`` with ``feature_descriptions``, the description in the `feature_descriptions` parameter will take precedence.\n", "\n", "Feature descriptions can also be provided for generated features."
] }, { "cell_type": "code", "execution_count": null, "id": "5d1f8667", "metadata": {}, "outputs": [], "source": [ "feature_descriptions = {\n", " \"sessions: SUM(transactions.amount)\": \"the total transaction amount for a session\"\n", "}\n", "\n", "feature_defs[14]" ] }, { "cell_type": "code", "execution_count": null, "id": "b90b8e4e", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)" ] }, { "cell_type": "markdown", "id": "83217b19", "metadata": {}, "source": [ "Here, we create and pass in a custom description of the intermediate feature ``SUM(transactions.amount)``. The description for ``MEAN(sessions.SUM(transactions.amount))``, which is built on top of ``SUM(transactions.amount)``, uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps the custom descriptions to either the feature object itself or the unique feature name in the form ``\"[dataframe_name]: [feature_name]\"``, as shown above.\n", "\n", "#### Primitive Templates\n", "Primitive descriptions are generated using primitive templates. By default, these are defined using the ``description_template`` attribute on the primitive. Primitives without a template default to using the ``name`` attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overwritten by mapping primitive instances or primitive names to custom templates and passing them into ``describe_feature`` through the ``primitive_templates`` argument. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "50f1bfb8", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\"sum\": \"the total of {}\"}\n", "\n", "feature_defs[6]" ] }, { "cell_type": "code", "execution_count": null, "id": "c1fb53a3", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "9b9cceca", "metadata": {}, "source": [ "In this example, we override the default template of ``'the sum of {}'`` with our custom template ``'the total of {}'``. The description uses our custom template instead of the default.\n", "\n", "Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the \"nth\" form is available through the ``nth_slice`` keyword." ] }, { "cell_type": "code", "execution_count": null, "id": "15ed472c", "metadata": {}, "outputs": [], "source": [ "feature = feature_defs[5]\n", "feature" ] }, { "cell_type": "code", "execution_count": null, "id": "54a5a6fd", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\n", " \"n_most_common\": [\n", " \"the 3 most common elements of {}\", # generic multi-output feature\n", " \"the {nth_slice} most common element of {}\", # template for each slice\n", " ]\n", "}\n", "\n", "ft.describe_feature(feature, primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "49aae7d2", "metadata": {}, "source": [ "Notice how the multi-output feature uses the first template for its description. 
Each slice of this feature will use the second slice template:" ] }, { "cell_type": "code", "execution_count": null, "id": "1bd3a3cf", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[0], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "607299ff", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[1], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "30f4235f", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[2], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "17953d54", "metadata": {}, "source": [ "Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template." ] }, { "cell_type": "code", "execution_count": null, "id": "bad05646", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\n", " \"n_most_common\": [\n", " \"the 3 most common elements of {}\",\n", " \"the most common element of {}\",\n", " \"the second most common element of {}\",\n", " \"the third most common element of {}\",\n", " ]\n", "}\n", "\n", "ft.describe_feature(feature, primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "fdad1868", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[0], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "90a85bd0", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[1], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "b63d47a7", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[2], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "1942ea49", "metadata": {}, 
"source": [ "Custom feature descriptions and primitive templates can also be separately defined in a JSON file and passed to the ``describe_feature`` function using the ``metadata_file`` keyword argument. Descriptions passed in directly through the ``feature_descriptions`` and ``primitive_templates`` keyword arguments will take precedence over any descriptions provided in the JSON metadata file." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/feature_selection.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "\n", "Featuretools provides users with the ability to remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both produce better results in the model and reduce the computational cost involved in prediction.\n", "\n", "Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:\n", "\n", "- `ft.selection.remove_highly_null_features`\n", "- `ft.selection.remove_single_value_features`\n", "- `ft.selection.remove_highly_correlated_features`\n", "\n", "We will describe each of these functions in depth, but first we must create an entity set with which we can run `ft.dfs`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import featuretools as ft\n", "from featuretools.demo.flight import load_flight\n", "from featuretools.selection import (\n", " remove_highly_correlated_features,\n", " remove_highly_null_features,\n", " remove_single_value_features,\n", ")\n", "\n", "es = load_flight(nrows=50)\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove Highly Null Features\n", "\n", "We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose null values pass a certain threshold. Below is our feature matrix with such a case:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fm, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"trip_logs\",\n", " cutoff_time=pd.DataFrame(\n", " {\n", " \"trip_log_id\": [30, 1, 2, 3, 4],\n", " \"time\": pd.to_datetime([\"2016-09-22 00:00:00\"] * 5),\n", " }\n", " ),\n", " trans_primitives=[],\n", " agg_primitives=[],\n", " max_depth=2,\n", ")\n", "fm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We look at the above feature matrix and decide to remove the highly null features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ft.selection.remove_highly_null_features(fm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that calling `remove_highly_null_features` didn't remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is above 95%. If we want to lower that threshold, we can set the `pct_null_threshold` parameter ourselves."
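, "\n",
"\n",
"To make the threshold concrete, the check can be approximated in plain pandas. This is only a sketch of the idea, not the Featuretools implementation: `drop_highly_null` is a hypothetical helper, and the exact comparison Featuretools applies at the threshold boundary may differ.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def drop_highly_null(df, pct_null_threshold=0.95):\n",
"    # Hypothetical helper: fraction of null values in each column\n",
"    null_fraction = df.isnull().mean()\n",
"    # Keep only the columns whose null fraction is below the threshold\n",
"    return df.loc[:, null_fraction < pct_null_threshold]\n",
"\n",
"\n",
"toy = pd.DataFrame(\n",
"    {\"a\": [1, 2, 3, 4], \"b\": [None, None, None, 1.0], \"c\": [None] * 4}\n",
")\n",
"# Column \"b\" is 75% null and column \"c\" is 100% null\n",
"kept = drop_highly_null(toy, pct_null_threshold=0.5).columns.tolist()\n",
"```\n",
"\n",
"With Featuretools itself, the same kind of filtering is a single call:"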
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "remove_highly_null_features(fm, pct_null_threshold=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove Single Value Features\n", "\n", "Another situation we might run into is one where our calculated features don't have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use `remove_single_value_features`.\n", "\n", "Let's see what happens when we remove the single value features of the feature matrix below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fm" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note ::\n", " A list of feature definitions such as those created by `dfs` can be provided to the feature selection functions.\n", " Doing this will change the outputs to include an updated list of feature definitions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_single_value_features(fm, features=features)\n", "new_fm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the feature definitions for the updated feature matrix, we can see that the features that were removed are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the function used as it is above, null values are not considered when counting a feature's unique values. If we'd like to consider `NaN` its own value, we can set `count_nan_as_value` to `True` and we'll see `flights.carrier` and `flights.flight_num` back in the matrix."
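, "\n",
"\n",
"The effect of counting `NaN` as a value can be illustrated with pandas' `Series.nunique`, whose `dropna` argument controls whether `NaN` is treated as a distinct value. The snippet below is a sketch on a toy Series (the `carrier` values are made up), not the Featuretools implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"carrier = pd.Series([\"AA\", \"AA\", np.nan])\n",
"\n",
"# NaN ignored: only one unique value, so the column looks single-valued\n",
"unique_without_nan = carrier.nunique(dropna=True)\n",
"# NaN counted as its own value: two unique values, so the column would be kept\n",
"unique_with_nan = carrier.nunique(dropna=False)\n",
"```\n",
"\n",
"Passing `count_nan_as_value=True` to `remove_single_value_features` applies the same idea to the feature matrix:"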
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_single_value_features(\n", " fm, features=features, count_nan_as_value=True\n", ")\n", "new_fm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features that were removed are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove Highly Correlated Features\n", "\n", "The last feature selection function we have allows us to remove features that would likely be redundant to the model we're attempting to build by considering the correlation between pairs of calculated features.\n", "\n", "When two features are determined to be highly correlated, we remove the more complex of the two. For example, say we have two features: `col` and `-(col)`.\n", "\n", "We can see that `-(col)` is just the negation of `col`, and so we can guess those features are going to be highly correlated. `-(col)` has the `Negate` primitive applied to it, so it is more complex than the identity feature `col`. Therefore, if we only want one of `col` and `-(col)`, we should keep the identity feature. For features that don't have an obvious difference in complexity, we discard the feature that comes later in the feature matrix. 
\n", "\n", "Let's try this out on our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fm, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"trip_logs\",\n", " trans_primitives=[\"negate\"],\n", " agg_primitives=[],\n", " max_depth=3,\n", ")\n", "fm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have some pretty clear correlations here between all the features and their negations.\n", "\n", "Now, using `remove_highly_correlated_features` with its default correlation threshold of 95%, all of the obviously correlated features are removed, leaving just the less complex features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_highly_correlated_features(fm, features=features)\n", "new_fm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features that were removed are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Change the correlation threshold\n", "\n", "We can lower the threshold at which to remove correlated features if we'd like to be more restrictive by using the `pct_corr_threshold` parameter."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_highly_correlated_features(\n", " fm, features=features, pct_corr_threshold=0.9\n", ")\n", "new_fm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features that were removed are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Check a Subset of Features\n", "\n", "If we only want to check a subset of features, we can set `features_to_check` to the list of features whose correlation we'd like to check, and no features outside of that list will be removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_highly_correlated_features(\n", " fm,\n", " features=features,\n", " features_to_check=[\"air_time\", \"distance\", \"flights.distance_group\"],\n", ")\n", "new_fm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features that were removed are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Protect Features from Removal\n", "\n", "To protect specific features from being removed from the feature matrix, we can include a list of `features_to_keep`, and these features will not be removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_fm, new_features = remove_highly_correlated_features(\n", " fm,\n", " features=features,\n", " features_to_keep=[\"air_time\", \"distance\", \"flights.distance_group\"],\n", ")\n", "new_fm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features that were removed are:" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "set(features) - set(new_features)" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "interpreter": { "hash": "eadebc3a8a3dd54e52de25d3077ea0e41c7a462ff73c567da199d6de4c02ed7d" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: docs/source/guides/guides_index.rst ================================================ Guides --------------- Guides on more advanced Featuretools functionality .. toctree:: :maxdepth: 1 tuning_dfs specifying_primitive_options performance deployment advanced_custom_primitives feature_descriptions feature_selection time_series sql_database_integration ================================================ FILE: docs/source/guides/performance.ipynb ================================================ { "cells": [ { "cell_type": "raw", "id": "2c5291f3", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _performance:" ] }, { "cell_type": "markdown", "id": "9dab133a", "metadata": {}, "source": [ "# Improving Computational Performance\n", "\n", "Feature engineering is a computationally expensive task. While Featuretools comes with reasonable default settings for feature calculation, there are a number of built-in approaches to improve computational performance based on dataset and problem specific considerations.\n", "\n", "## Reduce number of unique cutoff times\n", "Each row in a feature matrix created by Featuretools is calculated at a specific cutoff time that represents the last point in time that data from any dataframe in an entityset can be used to calculate the feature. 
As a result, calculations incur an overhead in finding the subset of allowed data for each distinct time in the calculation." ] }, { "cell_type": "raw", "id": "6ab1a83a", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " Featuretools is very precise in how it deals with time. For more information, see :doc:`/getting_started/handling_time`." ] }, { "cell_type": "markdown", "id": "051fbaba", "metadata": {}, "source": [ "If there are many unique cutoff times, it is often worthwhile to figure out how to have fewer. This can be done manually by figuring out which unique times are necessary for the prediction problem or automatically using [approximate](../getting_started/handling_time.ipynb#Approximating-Features-by-Rounding-Cutoff-Times).\n", "\n", "## Parallel Feature Computation\n", "\n", "Computational performance can often be improved by parallelizing the feature calculation process. There are several different approaches that can be used to perform parallel feature computation with Featuretools. An overview of the most commonly used approaches is provided below." ] }, { "cell_type": "markdown", "id": "b47e770f", "metadata": {}, "source": [ "\n", "### Simple Parallel Feature Computation\n", "If using a pandas `EntitySet`, Featuretools can optionally compute features on multiple cores. The simplest way to control the amount of parallelism is to specify the `n_jobs` parameter:\n", "\n", "```python3\n", "fm = ft.calculate_feature_matrix(features=features,\n", " entityset=entityset,\n", " cutoff_time=cutoff_time,\n", " n_jobs=2,\n", " verbose=True)\n", "```\n", "The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entityset, so memory use will be proportional to the number of parallel processes. Because the entityset has to be copied to each process, there is overhead to perform this operation before calculation can begin. 
To avoid this overhead on successive calls to `calculate_feature_matrix`, read the section below on using a persistent cluster.\n", "\n", "#### Adjust chunk size\n", "By default, Featuretools calculates rows with the same cutoff time simultaneously. The *chunk_size* parameter limits the maximum number of rows that will be grouped and then calculated together. If calculation is done using parallel processing, the default chunk size is set to be `1 / n_jobs` to ensure the computation can be spread across available workers. Normally, this behavior works well, but if there are only a few unique cutoff times it can lead to higher peak memory usage (due to more intermediate calculations stored in memory) or limited parallelism (if the number of chunks is less than *n_jobs*).\n", "\n", "By setting `chunk_size`, we can limit the maximum number of rows in each group to a specific number or a percentage of the overall data when calling `ft.dfs` or `ft.calculate_feature_matrix`:\n", "\n", "```python3\n", "# use maximum 100 rows per chunk\n", "feature_matrix, features_list = ft.dfs(entityset=es,\n", " target_dataframe_name=\"customers\",\n", " chunk_size=100)\n", "```\n", "\n", "We can also set chunk size to be a percentage of total rows:\n", "\n", "```python3\n", "# use maximum 5% of all rows per chunk\n", "feature_matrix, features_list = ft.dfs(entityset=es,\n", " target_dataframe_name=\"customers\",\n", " chunk_size=.05)\n", "```\n", "\n", "#### Using persistent cluster\n", "Behind the scenes, Featuretools uses [Dask's](http://dask.pydata.org/) distributed scheduler to implement multiprocessing. When you only specify the `n_jobs` parameter, a cluster will be created for that specific feature matrix calculation and destroyed once calculations have finished. A drawback of this is that each time a feature matrix is calculated, the entityset has to be transmitted to the workers again. To avoid this, we would like to reuse the same cluster between calls. 
The way to do this is by creating a cluster first and telling featuretools to use it with the `dask_kwargs` parameter:\n", "\n", "```python3\n", "import featuretools as ft\n", "from dask.distributed import LocalCluster\n", "\n", "cluster = LocalCluster()\n", "fm_1 = ft.calculate_feature_matrix(features=features_1,\n", " entityset=entityset,\n", " cutoff_time=cutoff_time,\n", " dask_kwargs={'cluster': cluster},\n", " verbose=True)\n", "```\n", "\n", "The 'cluster' value can either be the actual cluster object or a string of the address the cluster's scheduler can be reached at. The call below would also work. This second feature matrix calculation will not need to resend the entityset data to the workers because it has already been saved on the cluster.\n", "\n", "```python3\n", "fm_2 = ft.calculate_feature_matrix(features=features_2,\n", " entityset=entityset,\n", " cutoff_time=cutoff_time,\n", " dask_kwargs={'cluster': cluster.scheduler.address},\n", " verbose=True)\n", "```" ] }, { "cell_type": "raw", "id": "57aaa835", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " When using a persistent cluster, Featuretools publishes a copy of the ``EntitySet`` to the cluster the first time it calculates a feature matrix. Based on the ``EntitySet``'s metadata the cluster will reuse it for successive computations. This means if two ``EntitySets`` have the same metadata but different row values (e.g. new data is added to the ``EntitySet``), Featuretools won’t recopy the second ``EntitySet`` in later calls. A simple way to avoid this scenario is to use a unique ``EntitySet`` id." ] }, { "cell_type": "markdown", "id": "cdecad1d", "metadata": {}, "source": [ "#### Using the distributed dashboard\n", "Dask.distributed has a web-based diagnostics dashboard that can be used to analyze the state of the workers and tasks. It can also be useful for tracking memory use or visualizing task run-times. 
An in-depth description of the web interface can be found [here](https://distributed.readthedocs.io/en/latest/web.html).\n", "\n", "![Distributed dashboard image](../_static/images/dashboard.png)\n", "\n", "The dashboard requires an additional python package, bokeh, to work. Once bokeh is installed, the web interface will be launched by default when a LocalCluster is created. The cluster created by featuretools when using `n_jobs` does not enable the web interface automatically. To do so, the port to launch the main web interface on must be specified in `dask_kwargs`:\n", "\n", "```python3\n", "fm = ft.calculate_feature_matrix(features=features,\n", " entityset=entityset,\n", " cutoff_time=cutoff_time,\n", " n_jobs=2,\n", " dask_kwargs={'diagnostics_port': 8787},\n", " verbose=True)\n", "```\n", "\n", "### Parallel Computation by Partitioning Data\n", "As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large pandas `EntitySet` because the current parallel implementation sends the entire `EntitySet` to each worker which may exhaust the worker memory. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster." ] }, { "cell_type": "markdown", "id": "795cc323", "metadata": {}, "source": [ "When an entire dataset is not required to calculate the features for a given set of instances, we can split the data into independent partitions and calculate on each partition. For example, imagine we are calculating features for customers and the features are \"number of other customers in this zip code\" or \"average age of other customers in this zip code\". In this case, we can load in data partitioned by zip code. 
As long as we have all of the data for a zip code when calculating, we can calculate all features for a subset of customers.\n", "\n", "An example of this approach can be seen in the [Predict Next Purchase demo notebook](https://github.com/featuretools/predict_next_purchase). In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using [Dask](https://dask.pydata.org/), which could also be used to scale the computation to a cluster of computers. A framework like [Spark](https://spark.apache.org/) could be used similarly.\n", "\n", "An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the [Featuretools on Dask notebook](https://github.com/Featuretools/Automated-Manual-Comparison/blob/main/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb). This approach is detailed in the [Parallelizing Feature Engineering with Dask article](https://medium.com/feature-labs-engineering/scaling-featuretools-with-dask-ce46f9774c7d) on the Feature Labs engineering blog. Dask allows for simple scaling to multiple cores on a single computer or multiple machines on a cluster.\n", "\n", "For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the [Feature Engineering on Spark notebook](https://github.com/Featuretools/predict-customer-churn/blob/main/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb). This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework." 
] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/specifying_primitive_options.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "ba92172a", "metadata": {}, "source": [ "# Specifying Primitive Options\n", "\n", "By default, DFS will apply primitives across all dataframes and columns. This behavior can be altered through a few different parameters. Dataframes and columns can be optionally ignored or included for an entire DFS run or on a per-primitive basis, enabling greater control over features and less run time overhead." ] }, { "cell_type": "code", "execution_count": null, "id": "106d36a3", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "from featuretools.tests.testing_utils import make_ecommerce_entityset\n", "\n", "es = make_ecommerce_entityset()\n", "\n", "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"weekday\"],\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "29ae225d", "metadata": {}, "source": [ "## Specifying Options for an Entire Run\n", "\n", "The `ignore_dataframes` and `ignore_columns` parameters of DFS control dataframes and columns that should be ignored for all primitives. This is useful for ignoring columns or dataframes that don't relate to the problem or otherwise shouldn't be included in the DFS run." 
] }, { "cell_type": "code", "execution_count": null, "id": "2d481527", "metadata": {}, "outputs": [], "source": [ "# ignore the 'log' and 'cohorts' dataframes entirely\n", "# ignore the 'birthday' column in 'customers' and the 'device_name' column in 'sessions'\n", "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"weekday\"],\n", " ignore_dataframes=[\"log\", \"cohorts\"],\n", " ignore_columns={\"sessions\": [\"device_name\"], \"customers\": [\"birthday\"]},\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "4a9bd7e2", "metadata": {}, "source": [ "DFS completely ignores the `log` and `cohorts` dataframes when creating features. It also ignores the columns `device_name` and `birthday` in `sessions` and `customers` respectively. However, both of these options can be overridden by individual primitive options in the `primitive_options` parameter.\n", "\n", "## Specifying for Individual Primitives\n", "Options for individual primitives or groups of primitives are set by the `primitive_options` parameter of DFS. This parameter maps any desired options to specific primitives. In the case of conflicting options, options set at this level will override options set at the entire DFS run level, and the include options will always take priority over their ignore counterparts.\n", "\n", "Using the string primitive name or the primitive type will apply the options to all primitives of the same name. You can also set options for a specific instance of a primitive by using the primitive instance as a key in the `primitive_options` dictionary. Note, however, that specifying options for a specific instance will result in that instance ignoring any options set for the generic primitive through options with the primitive name or class as the key. 
\n", "\n", "### Specifying Dataframes for Individual Primitives\n", "Which dataframes to include/ignore can also be specified for a single primitive or a group of primitives. Dataframes can be ignored using the `ignore_dataframes` option in `primitive_options`, while dataframes to explicitly include are set by the ``include_dataframes`` option. When ``include_dataframes`` is given, all dataframes not listed are ignored by the primitive. No columns from any excluded dataframe will be used to generate features with the given primitive." ] }, { "cell_type": "code", "execution_count": null, "id": "8bcbf11a", "metadata": {}, "outputs": [], "source": [ "# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'\n", "# include only the 'customers' dataframe for the primitives 'weekday' and 'day'\n", "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"weekday\", \"day\"],\n", " primitive_options={\n", " \"mode\": {\"ignore_dataframes\": [\"cohorts\", \"log\"]},\n", " (\"weekday\", \"day\"): {\"include_dataframes\": [\"customers\"]},\n", " },\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "b5cbbff0", "metadata": {}, "source": [ "In this example, DFS would only use the `customers` dataframe for both `weekday` and `day`, and would use all dataframes except `cohorts` and `log` for `mode`.\n", "\n", "### Specifying Columns for Individual Primitives\n", "\n", "Specific columns can also be explicitly included/ignored for a primitive or group of primitives. Columns to\n", "ignore are set by the `ignore_columns` option, while columns to include are set by `include_columns`. When the\n", "`include_columns` option is set, no other columns from that dataframe will be used to make features with the given primitive."
] }, { "cell_type": "code", "execution_count": null, "id": "f9e42358", "metadata": {}, "outputs": [], "source": [ "# Include the columns 'product_id', 'zipcode', 'device_type', and 'cancel_reason' for 'mode'\n", "# Ignore the columns 'signup_date' and 'cancel_date' for 'weekday'\n", "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"weekday\"],\n", " primitive_options={\n", " \"mode\": {\n", " \"include_columns\": {\n", " \"log\": [\"product_id\", \"zipcode\"],\n", " \"sessions\": [\"device_type\"],\n", " \"customers\": [\"cancel_reason\"],\n", " }\n", " },\n", " \"weekday\": {\"ignore_columns\": {\"customers\": [\"signup_date\", \"cancel_date\"]}},\n", " },\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "88ea7094", "metadata": {}, "source": [ "Here, `mode` will only use the columns `product_id` and `zipcode` from the dataframe `log`, `device_type`\n", "from the dataframe `sessions`, and `cancel_reason` from `customers`. For any other dataframe, `mode` will use all\n", "columns. The `weekday` primitive will use all columns in all dataframes except for `signup_date` and `cancel_date`\n", "from the `customers` dataframe.\n", "\n", "\n", "### Specifying GroupBy Options\n", "\n", "GroupBy Transform Primitives also have the additional options `include_groupby_dataframes`, `ignore_groupby_dataframes`, `include_groupby_columns`, and `ignore_groupby_columns`. These options are used to specify dataframes and columns to include/ignore as groupings for inputs. By default, DFS only groups by foreign key columns. Specifying `include_groupby_columns` overrides this default, and will only group by columns given. On the other hand, `ignore_groupby_columns` will continue to use only the foreign key columns, ignoring any columns specified that are also foreign key columns. 
Note that if including non-foreign key columns to group by, the included columns must be categorical columns. " ] }, { "cell_type": "code", "execution_count": null, "id": "1c1046b5", "metadata": {}, "outputs": [], "source": [ "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"log\",\n", " agg_primitives=[],\n", " trans_primitives=[],\n", " groupby_trans_primitives=[\"cum_sum\", \"cum_count\"],\n", " primitive_options={\n", " \"cum_sum\": {\"ignore_groupby_columns\": {\"log\": [\"product_id\"]}},\n", " \"cum_count\": {\n", " \"include_groupby_columns\": {\"log\": [\"product_id\", \"priority_level\"]},\n", " \"ignore_groupby_dataframes\": [\"sessions\"],\n", " },\n", " },\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "10616725", "metadata": {}, "source": [ "We ignore `product_id` as a groupby for `cum_sum` but still use any other foreign key columns in that or any other dataframe. For `cum_count`, we use only `product_id` and `priority_level` as groupbys. Note that `cum_sum` doesn't use\n", "`priority_level` because it's not a foreign key column, but we explicitly include it for `cum_count`. Finally, note that specifying groupby options doesn't affect what features the primitive is applied to. For example, `cum_count` ignores the dataframe `sessions` for groupbys, but the feature `` is still made. The groupby is from the target dataframe `log`, so the feature is valid given the associated options. To ignore the `sessions` dataframe for `cum_count`, the `ignore_dataframes` option for `cum_count` would need to include `sessions`.\n", "\n", "\n", "## Specifying for each Input for Multiple Input Primitives\n", "\n", "For primitives that take multiple columns as input, such as `Trend`, the above options can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. 
The length of the list provided must match the number of inputs the primitive takes." ] }, { "cell_type": "code", "execution_count": null, "id": "2e808749", "metadata": {}, "outputs": [], "source": [ "features_list = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"trend\"],\n", " trans_primitives=[],\n", " primitive_options={\n", " \"trend\": [\n", " {\"ignore_columns\": {\"log\": [\"value_many_nans\"]}},\n", " {\"include_columns\": {\"customers\": [\"signup_date\"], \"log\": [\"datetime\"]}},\n", " ]\n", " },\n", " features_only=True,\n", ")\n", "features_list" ] }, { "cell_type": "markdown", "id": "53d5d207", "metadata": {}, "source": [ "Here, we pass in a list of primitive options for trend. We ignore the column `value_many_nans` for the first input\n", "to `trend`, and include the column `signup_date` from `customers` for the second input." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/sql_database_integration.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SQL Database Integration \n", "\n", "`featuretools_sql` is an add-on library that supports automatic `EntitySet` creation from a relational database.\n", "\n", "Currently, `featuretools_sql` is compatible with the following systems:\n", "\n", "* `MySQL` \n", "* `PostgreSQL`\n", "* `Snowflake`\n", "\n", "The `DBConnector` object exposed by the `featuretools_sql` library provides the interface to connecting to the DBMS." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installing featuretools_sql \n", "\n", "Install with pip\n", "\n", "```\n", "python -m pip install \"featuretools[sql]\" \n", "``` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connecting to your database instance \n", "\n", "Depending on your choice of DBMS, you may have to provide different pieces of information to the `DBConnector` object.\n", "\n", "If you want to connect to a `MySQL` instance, you must pass the string `\"mysql\"` into the `system_name` argument.\n", "\n", "If you want to connect to a `PostgreSQL` instance, you must pass the string `\"postgresql\"` into the `system_name` argument.\n", "\n", "If you want to connect to a `Snowflake` instance, you must pass the string `\"snowflake\"` into the `system_name` argument.\n", "\n", "Here is an example call to the constructor of the object, connecting to a `PostgreSQL` database:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python \n", "from featuretools_sql.connector import DBConnector\n", "\n", "connector_object = DBConnector(\n", " system_name=\"postgresql\",\n", " user=\"postgres\",\n", " host=\"localhost\",\n", " port=\"5432\",\n", " database=\"postgres\",\n", " schema=\"public\",\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the choice of RDBMS does affect the required arguments -- for example, if you were connecting to a `MySQL` instance, you would not need a `schema` argument. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting to an EntitySet \n", "\n", "You can call the `get_entityset` method to instruct the `DBConnector` object to build an EntitySet. \n", "\n", "This method will loop through all the tables in the database and copy them into dataframes. Then it will populate the relationships data structure. It will finally pass those two arguments into the EntitySet constructor in Featuretools, and return the object." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python \n", "es = connector_object.get_entityset()\n", "``` \n", "\n", "Optionally, you can pass in table names to the `select_only` parameter if you only want to include a subset of the tables in the database. \n", "\n", "```python \n", "es = connector_object.get_entityset(select_only=[\"Products\", \"Transactions\"])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining the EntitySet's member data \n", "\n", "You can examine the member data of the `DBConnector` object to ensure that it imported data correctly.\n", "\n", "To access the dataframes it imported, access the `.dataframes` attribute. To access the relationships data structure, access the `.relationships` attribute.\n", "\n", "If you would like to visualize the EntitySet as a graph, you can call `es.plot()`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calling DFS \n", "\n", "The EntitySet object is ready to be passed into Featuretools' `DFS` algorithm! Read more about `DFS` [here](https://featuretools.alteryx.com/en/stable/getting_started/afe.html#running-dfs). 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.12 64-bit ('venv_x86')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "vscode": { "interpreter": { "hash": "3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514" } } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: docs/source/guides/time_series.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": null, "id": "17f894b5", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "import pandas as pd\n", "\n", "import featuretools as ft\n", "from featuretools.demo.weather import load_weather\n", "from featuretools.primitives import Lag, RollingMean, RollingMin" ] }, { "cell_type": "markdown", "id": "a8104f18", "metadata": {}, "source": [ "# Feature Engineering for Time Series Problems" ] }, { "cell_type": "raw", "id": "9cd9cb82", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " This guide focuses on feature engineering for single-table time series problems; it does not cover how to handle temporal multi-table data for other machine learning problem types. A more general guide on handling time in Featuretools can be found `here <../getting_started/handling_time.ipynb>`_." ] }, { "cell_type": "markdown", "id": "0cf3cebc", "metadata": {}, "source": [ "Time series forecasting consists of predicting future values of a target using earlier observations. 
In datasets that are used in time series problems, there is an inherent temporal ordering to the data (determined by a time index), and the sequential target values we're predicting are highly dependent on one another. Feature engineering for time series problems exploits the fact that more recent observations are more predictive than more distant ones.\n", "\n", "This guide will explore how to use Featuretools for automating feature engineering for univariate time series problems, or problems in which only the time index and target column are included.\n", " \n", "We'll be working with a temperature demo EntitySet that contains one DataFrame, `temperatures`. The `temperatures` dataframe contains the minimum daily temperatures that we will be predicting. In total, it has three columns: `id`, `Temp`, and `Date`. The `id` column is the index that is necessary for Featuretools' purposes. The other two are important for univariate time series problems: `Date` is our time index, and `Temp` is our target column. The engineered features will be built from these two columns." ] }, { "cell_type": "code", "execution_count": null, "id": "862e46da", "metadata": {}, "outputs": [], "source": [ "es = load_weather()\n", "\n", "es[\"temperatures\"].head(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "90242e31", "metadata": {}, "outputs": [], "source": [ "es[\"temperatures\"][\"Temp\"].plot(ylabel=\"Temp (C)\")" ] }, { "cell_type": "markdown", "id": "060eb035", "metadata": {}, "source": [ "## Understanding The Feature Engineering Window\n", "\n", "In multi-table datasets, a feature engineering window for a single row in the target DataFrame extends forward in time over observations in child DataFrames starting at the time index and ending when either the cutoff time or last time index is reached. 
\n", "\n", "![Multi Table Timeline](../_static/images/multi_table_FE_timeline.png)\n", "\n", "In single-table time series datasets, the feature engineering window for a single value extends backwards in time within the same column. Because of this, the concepts of cutoff time and last time index are not relevant in the same way.\n", "\n", "For example, the cutoff time for a single-table time series dataset would create the training and test data split. During DFS, features would not be calculated after the cutoff time. This same behavior can often be achieved more simply by splitting the data prior to creating the EntitySet, since filtering the data at feature matrix calculation is more computationally intensive than splitting the data ahead of time.\n", "\n", "```\n", "split_point = int(df.shape[0] * 0.7)\n", "\n", "training_data = df[:split_point]\n", "test_data = df[split_point:]\n", "```\n", "\n", "So, since we can't use the existing parameters for defining each observation's feature engineering window, we'll need to define the new concepts of `gap` and `window_length`. These will allow us to set a feature engineering window that exists prior to each observation.\n", "\n", "## Gap and Window Length\n", "\n", "Note that we will be using integers when defining the gap and window length. This implies that our data occurs at evenly spaced intervals--in this case daily--so a number `n` corresponds to `n` days. Support for unevenly spaced intervals is ongoing and can be explored with the Woodwork method [df.ww.infer_temporal_frequencies](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies.html#woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies).\n", "\n", "If we are at a point in time `t`, we have access to information from times less than `t` (past values), and we do not have information from times greater than `t` (future values). 
Our limitations in feature engineering, then, will come from when exactly before `t` we have access to the data. \n", "\n", "Consider an example where we're recording data that takes a week to ingest; the earliest data we have access to is from seven days ago, or `t - 7`. We'll call this our `gap`. A `gap` of 0 would include the instance itself, which we must be careful to avoid in time series problems, as this exposes our target.\n", "\n", "We also need to determine how far back in time before `t - 7` we can go. Too far back, and we may lose the potency of our recent observations, but too recent, and we may not capture the full spectrum of behaviors displayed by the data. In this example, let's say that we only want to look at 5 days worth of data at a time. We'll call this our `window_length`. \n", "\n", "![Time Series Timeline](../_static/images/time_series_FE_timeline.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "a90799f1", "metadata": {}, "outputs": [], "source": [ "gap = 7\n", "window_length = 5" ] }, { "cell_type": "markdown", "id": "460b4c49", "metadata": {}, "source": [ "With these two parameters (`gap` and `window_length`) set, we have defined our feature engineering window. Now, we can move onto defining our feature primitives.\n", "\n", "## Time Series Primitives\n", "\n", "There are three types of primitives we'll focus on for time series problems. One of them will extract features from the time index, and the other two types will extract features from our target column. \n", "\n", "### Datetime Transform Primitives\n", "\n", "We need a way of implicating time in our time series features. Yes, using recent temperatures is incredibly predictive in determining future temperatures, but there is also a whole host of historical data suggesting that the month of the year is a pretty good indicator for the temperature outside. 
However, if we look at the data, we'll see that, though the day changes, the observations are always taken at the same hour, so the `Hour` primitive is unlikely to be useful. Of course, in a dataset that is measured at an hourly or finer frequency, `Hour` may be incredibly predictive. " ] }, { "cell_type": "code", "execution_count": null, "id": "65246092", "metadata": {}, "outputs": [], "source": [ "datetime_primitives = [\"Day\", \"Year\", \"Weekday\", \"Month\"]" ] }, { "cell_type": "markdown", "id": "95d8c86a", "metadata": {}, "source": [ "The full list of datetime transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#datetime-transform-primitives).\n", "\n", "### Delaying Primitives\n", "\n", "The simplest thing we can do with our target column is to build features that are delayed (or lagging) versions of the target column. We'll make one feature per observation in our feature engineering windows, so we'll range over time from `t - gap - window_length` to `t - gap`. \n", "\n", "For this purpose, we can use our `Lag` primitive and create one primitive for each instance in our window. " ] }, { "cell_type": "code", "execution_count": null, "id": "b9e1fa8f", "metadata": {}, "outputs": [], "source": [ "delaying_primitives = [Lag(periods=i + gap) for i in range(window_length)]" ] }, { "cell_type": "markdown", "id": "03cd4474", "metadata": {}, "source": [ "### Rolling Transform Primitives\n", "\n", "Since we have access to the entire feature engineering window, we can aggregate over that window. Featuretools has several rolling primitives with which we can achieve this. Here, we'll use the `RollingMean` and `RollingMin` primitives, setting the `gap` and `window_length` accordingly. 
The gap is especially important here: when the gap is zero, the current observation's target value is present in the window, which exposes our target.\n", "\n", "This concern also exists for other primitives that reference earlier values in the dataframe. Because of this, when using primitives for time series feature engineering, one must be incredibly careful to not use primitives on the target column that incorporate the current observation when calculating a feature value." ] }, { "cell_type": "code", "execution_count": null, "id": "ed6cc722", "metadata": {}, "outputs": [], "source": [ "rolling_mean_primitive = RollingMean(\n", " window_length=window_length, gap=gap, min_periods=window_length\n", ")\n", "\n", "rolling_min_primitive = RollingMin(\n", " window_length=window_length, gap=gap, min_periods=window_length\n", ")" ] }, { "cell_type": "markdown", "id": "1eb2a6e1", "metadata": {}, "source": [ "The full list of rolling transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#rolling-transform-primitives).\n", "\n", "## Run DFS\n", "\n", "Now that we've defined our time series primitives, we can pass them into DFS and get our feature matrix! \n", "\n", "Let's take a look at an actual feature engineering window as we defined with `gap` and `window_length` above. Below is an example of how we can extract many features using the same feature engineering window without exposing our target value.\n", "\n", "![FE Window](../_static/images/window_calculations.png)\n", "\n", "With the image above, we see how all of our defined primitives get used to create many features from just the two columns we have access to."
] }, { "cell_type": "code", "execution_count": null, "id": "42f52b73", "metadata": {}, "outputs": [], "source": [ "fm, f = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"temperatures\",\n", " trans_primitives=(\n", " datetime_primitives\n", " + delaying_primitives\n", " + [rolling_mean_primitive, rolling_min_primitive]\n", " ),\n", " cutoff_time=pd.Timestamp(\"1987-1-30\"),\n", ")\n", "\n", "f" ] }, { "cell_type": "code", "execution_count": null, "id": "9e8ce29d", "metadata": {}, "outputs": [], "source": [ "fm.iloc[:, [0, 2, 6, 7, 8, 9]].head(15)" ] }, { "cell_type": "markdown", "id": "b984ff57", "metadata": {}, "source": [ "Above is our time series feature matrix! The rolling and delayed features are built from our target column, but do not expose it. We can now use the feature matrix to create a machine learning model that predicts future minimum daily temperatures." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/guides/tuning_dfs.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "a4329c7d", "metadata": {}, "source": [ "# Tuning Deep Feature Synthesis\n", "\n", "There are several parameters that can be tuned to change the output of DFS. We'll explore these parameters using the following `transactions` EntitySet." 
] }, { "cell_type": "code", "execution_count": null, "id": "12607fd8", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es" ] }, { "cell_type": "markdown", "id": "6ef15160", "metadata": {}, "source": [ "## Using \"Seed Features\"\n", "\n", "Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.\n", "\n", "By using seed features, we can include domain-specific knowledge in feature engineering automation. For the seed feature below, the domain knowledge may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase." ] }, { "cell_type": "code", "execution_count": null, "id": "b35f388e", "metadata": {}, "outputs": [], "source": [ "expensive_purchase = ft.Feature(es[\"transactions\"].ww[\"amount\"]) > 125\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"percent_true\"],\n", " seed_features=[expensive_purchase],\n", ")\n", "feature_matrix[[\"PERCENT_TRUE(transactions.amount > 125)\"]]" ] }, { "cell_type": "markdown", "id": "8703d4b3", "metadata": {}, "source": [ "We can now see that the ``PERCENT_TRUE`` primitive was automatically applied to the boolean `expensive_purchase` feature from the `transactions` table. The feature produced as a result can be understood as the percentage of transactions for a customer that are considered expensive.\n", "\n", "## Add \"interesting\" values to columns\n", "\n", "Sometimes we want to create features that are conditioned on a second value before calculations are performed. We call this extra filter a \"where clause\". 
Where clauses are used in Deep Feature Synthesis by including primitives in the `where_primitives` parameter to DFS.\n", "\n", "By default, where clauses are built using the ``interesting_values`` of a column.\n", "\n", "Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling `es.add_interesting_values()`." ] }, { "cell_type": "code", "execution_count": null, "id": "b6e88923", "metadata": {}, "outputs": [], "source": [ "values_dict = {\"device\": [\"desktop\", \"mobile\", \"tablet\"]}\n", "es.add_interesting_values(dataframe_name=\"sessions\", values=values_dict)" ] }, { "cell_type": "markdown", "id": "beee9073", "metadata": {}, "source": [ "Interesting values are stored in the DataFrame's Woodwork typing information." ] }, { "cell_type": "code", "execution_count": null, "id": "c70ff02e", "metadata": {}, "outputs": [], "source": [ "es[\"sessions\"].ww.columns[\"device\"].metadata" ] }, { "cell_type": "markdown", "id": "ddec8e5a", "metadata": {}, "source": [ "Now that interesting values are set for the `device` column in the `sessions` table, we can specify the aggregation primitives for which we want where clauses using the ``where_primitives`` parameter to DFS." ] }, { "cell_type": "code", "execution_count": null, "id": "6eaabad8", "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"count\", \"avg_time_between\"],\n", " where_primitives=[\"count\", \"avg_time_between\"],\n", " trans_primitives=[],\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "681a19db", "metadata": {}, "source": [ "Now, we have several new potentially useful features. 
Here are two of them that are built off of the where clause \"where the device used was a tablet\":" ] }, { "cell_type": "code", "execution_count": null, "id": "31a2a94e", "metadata": {}, "outputs": [], "source": [ "feature_matrix[\n", " [\n", " \"COUNT(sessions WHERE device = tablet)\",\n", " \"AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "id": "7b43a4a5", "metadata": {}, "source": [ "The first feature, `COUNT(sessions WHERE device = tablet)`, can be understood as indicating *how many sessions a customer completed on a tablet*.\n", "\n", "The second feature, `AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)`, calculates *the average time between those sessions*.\n", "\n", "We can see that customers who had only 0 or 1 sessions on a tablet have ``NaN`` values for the average time between such sessions.\n", "\n", "\n", "## Encoding categorical features\n", "\n", "Machine learning algorithms typically expect all numeric data or data that has defined numeric representations, like boolean values corresponding to `0` and `1`. When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools." ] }, { "cell_type": "code", "execution_count": null, "id": "a2ccb27b", "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"time_since\"],\n", " max_depth=1,\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "a50adb54", "metadata": {}, "source": [ "This feature matrix contains 2 columns that are categorical in nature, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools offers functionality to apply one-hot encoding to the output of DFS." 
] }, { "cell_type": "code", "execution_count": null, "id": "088672ac", "metadata": {}, "outputs": [], "source": [ "feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\n", "feature_matrix_enc" ] }, { "cell_type": "markdown", "id": "54076098", "metadata": {}, "source": [ "The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values." ] }, { "cell_type": "code", "execution_count": null, "id": "db8dd84b", "metadata": {}, "outputs": [], "source": [ "features_enc" ] }, { "cell_type": "markdown", "id": "b4bda3a2", "metadata": {}, "source": [ "These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the [Deployment](deployment.ipynb) guide." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/index.ipynb ================================================ { "cells": [ { "cell_type": "raw", "id": "25bd9564", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _quick-start:" ] }, { "cell_type": "markdown", "id": "4746904c", "metadata": {}, "source": [ "# What is Featuretools?\n", "\"Featuretools\"\n", "\n", "\n", "**Featuretools** is a framework to perform automated feature engineering. 
It excels at transforming temporal and relational datasets into feature matrices for machine learning.\n", "\n", "\n", "## 5 Minute Quick Start\n", "\n", "Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions." ] }, { "cell_type": "code", "execution_count": null, "id": "2ed1924f", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft" ] }, { "cell_type": "markdown", "id": "3bc51d89", "metadata": {}, "source": [ "#### Load Mock Data" ] }, { "cell_type": "code", "execution_count": null, "id": "be39a49a", "metadata": {}, "outputs": [], "source": [ "data = ft.demo.load_mock_customer()" ] }, { "cell_type": "markdown", "id": "eb2552f2", "metadata": {}, "source": [ "#### Prepare data\n", "\n", "In this toy dataset, there are 3 DataFrames.\n", "\n", "- **customers**: unique customers who had sessions\n", "- **sessions**: unique sessions and associated attributes\n", "- **transactions**: list of events in this session\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9bb55d86", "metadata": {}, "outputs": [], "source": [ "customers_df = data[\"customers\"]\n", "customers_df" ] }, { "cell_type": "code", "execution_count": null, "id": "2054eb2a", "metadata": {}, "outputs": [], "source": [ "sessions_df = data[\"sessions\"]\n", "sessions_df.sample(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "348e7614", "metadata": {}, "outputs": [], "source": [ "transactions_df = data[\"transactions\"]\n", "transactions_df.sample(5)" ] }, { "cell_type": "markdown", "id": "59fc2126", "metadata": {}, "source": [ "First, we specify a dictionary with all the DataFrames in our dataset. The DataFrames are passed in with their index column and time index column if one exists for the DataFrame." 
] }, { "cell_type": "code", "execution_count": null, "id": "b3fdc96a", "metadata": {}, "outputs": [], "source": [ "dataframes = {\n", " \"customers\": (customers_df, \"customer_id\"),\n", " \"sessions\": (sessions_df, \"session_id\", \"session_start\"),\n", " \"transactions\": (transactions_df, \"transaction_id\", \"transaction_time\"),\n", "}" ] }, { "cell_type": "markdown", "id": "e0d84890", "metadata": {}, "source": [ "Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the \"one\" DataFrame the \"parent DataFrame\". A relationship between a parent and child is defined like this:\n", " \n", " (parent_dataframe, parent_column, child_dataframe, child_column)\n", "\n", "In this dataset we have two relationships:" ] }, { "cell_type": "code", "execution_count": null, "id": "fc4366dc", "metadata": {}, "outputs": [], "source": [ "relationships = [\n", " (\"sessions\", \"session_id\", \"transactions\", \"session_id\"),\n", " (\"customers\", \"customer_id\", \"sessions\", \"customer_id\"),\n", "]" ] }, { "cell_type": "raw", "id": "758f8fd4", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " To manage setting up DataFrames and relationships, we recommend using the :class:`EntitySet ` class which offers convenient APIs for managing data like this. See :doc:`getting_started/using_entitysets` for more information." ] }, { "cell_type": "markdown", "id": "330d66b0", "metadata": {}, "source": [ "#### Run Deep Feature Synthesis\n", "\n", "A minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. 
The output of DFS is a feature matrix and the corresponding list of feature definitions.\n", "\n", "Let's first create a feature matrix for each customer in the data:" ] }, { "cell_type": "code", "execution_count": null, "id": "13cae382", "metadata": {}, "outputs": [], "source": [ "feature_matrix_customers, features_defs = ft.dfs(\n", " dataframes=dataframes,\n", " relationships=relationships,\n", " target_dataframe_name=\"customers\",\n", ")\n", "feature_matrix_customers" ] }, { "cell_type": "markdown", "id": "71628a1c", "metadata": {}, "source": [ "We now have dozens of new features to describe a customer's behavior.\n", "\n", "#### Change target DataFrame\n", "One of the reasons DFS is so powerful is that it can create a feature matrix for *any* DataFrame in our EntitySet. For example, we can build features for sessions instead:" ] }, { "cell_type": "code", "execution_count": null, "id": "4cfe1aca", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "dataframes = {\n", " \"customers\": (customers_df.copy(), \"customer_id\"),\n", " \"sessions\": (sessions_df.copy(), \"session_id\", \"session_start\"),\n", " \"transactions\": (transactions_df.copy(), \"transaction_id\", \"transaction_time\"),\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "84fec203", "metadata": {}, "outputs": [], "source": [ "feature_matrix_sessions, features_defs = ft.dfs(\n", " dataframes=dataframes, relationships=relationships, target_dataframe_name=\"sessions\"\n", ")\n", "feature_matrix_sessions.head(5)" ] }, { "cell_type": "raw", "id": "a67d574e", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Understanding Feature Output\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "In general, Featuretools references generated features through the feature name. 
In order to make features easier to understand, Featuretools offers two additional tools, :func:`featuretools.graph_feature` and :func:`featuretools.describe_feature`, to help explain what a feature is and the steps Featuretools took to generate it. Let's look at this example feature:" ] }, { "cell_type": "code", "execution_count": null, "id": "9c791dda", "metadata": {}, "outputs": [], "source": [ "feature = features_defs[18]\n", "feature" ] }, { "cell_type": "markdown", "id": "84b5be0f", "metadata": {}, "source": [ "##### Feature lineage graphs\n", "\n", "Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature." ] }, { "cell_type": "code", "execution_count": null, "id": "0cd93f3d", "metadata": {}, "outputs": [], "source": [ "ft.graph_feature(feature)" ] }, { "cell_type": "raw", "id": "d6e5e0a1", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. graphviz:: getting_started/graphs/demo_feat.dot\n", "\n", "Feature descriptions\n", "\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\n", "\n", "Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See :doc:`/guides/feature_descriptions` for more details on how to customize automatically generated feature descriptions." 
] }, { "cell_type": "code", "execution_count": null, "id": "3bdbe1c0", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature)" ] }, { "cell_type": "markdown", "id": "44635e1f", "metadata": {}, "source": [ "## What's next?\n", "\n", "\n", "* Learn about [Representing Data with EntitySets](getting_started/using_entitysets.ipynb)\n", "* Apply automated feature engineering with [Deep Feature Synthesis](getting_started/afe.ipynb)\n", "* Explore [runnable demos](https://www.featuretools.com/demos) based on real world use cases\n", "* Can't find what you're looking for? Ask for [help](resources/help.rst)" ] }, { "cell_type": "raw", "id": "cb2d443c", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Table of contents\n", "-----------------\n", "\n", ".. toctree::\n", " :maxdepth: 1\n", "\n", " install\n", "\n", ".. toctree::\n", " :maxdepth: 2\n", "\n", " getting_started/getting_started_index\n", " guides/guides_index\n", "\n", ".. toctree::\n", " :maxdepth: 1\n", " :caption: Resources and References\n", "\n", " resources/resources_index\n", " api_reference\n", " release_notes\n", "\n", "Other links\n", "------------\n", "* :ref:`genindex`\n", "* :ref:`search`\n" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/install.md ================================================ # Install Featuretools is available for Python 3.9 - 3.12. 
It can be installed from [pypi](https://pypi.org/project/featuretools/), [conda-forge](https://anaconda.org/conda-forge/featuretools), or from [source](https://github.com/alteryx/featuretools). To install Featuretools, run the following command: ````{tab} PyPI ```console $ python -m pip install featuretools ``` ```` ````{tab} Conda ```console $ conda install -c conda-forge featuretools ``` ```` ## Add-ons Featuretools allows users to install add-ons individually or all at once: ````{tab} PyPI ```{tab} All Add-ons ```console $ python -m pip install "featuretools[complete]" ``` ```{tab} Dask ```console $ python -m pip install "featuretools[dask]" ``` ```{tab} NLP Primitives ```console $ python -m pip install "featuretools[nlp]" ``` ```{tab} Premium Primitives ```console $ python -m pip install "featuretools[premium]" ``` ```` ````{tab} Conda ```{tab} All Add-ons ```console $ conda install -c conda-forge nlp-primitives dask distributed ``` ```{tab} NLP Primitives ```console $ conda install -c conda-forge nlp-primitives ``` ```{tab} Dask ```console $ conda install -c conda-forge dask distributed ``` ```` - **NLP Primitives**: Use Natural Language Processing Primitives in Featuretools - **Premium Primitives**: Use primitives from Premium Primitives in Featuretools - **Dask**: Use to run `calculate_feature_matrix` in parallel with `n_jobs` ## Installing Graphviz In order to use `EntitySet.plot` or `featuretools.graph_feature` you will need to install the graphviz library. 
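Because plotting involves both the Python `graphviz` package and the Graphviz executable itself, it can help to check what is already present before following the platform-specific steps. The snippet below is a minimal sketch that uses only the standard library. `graphviz_status` is a hypothetical helper name, not part of Featuretools, and the sketch assumes the Graphviz executable is discoverable as `dot` on your `PATH`.

```python
import importlib.util
import shutil


def graphviz_status():
    """Report which Graphviz pieces are visible to this Python environment.

    Hypothetical helper for illustration; not part of Featuretools.
    """
    return {
        # The Python bindings, importable as `graphviz` once installed.
        "python_package": importlib.util.find_spec("graphviz") is not None,
        # The Graphviz `dot` executable (assumption: it is looked up on PATH).
        "dot_executable": shutil.which("dot") is not None,
    }


if __name__ == "__main__":
    for name, found in graphviz_status().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, the install commands for your platform apply.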
````{tab} macOS (Intel, M1) :new-set: ```{tab} pip ```console $ brew install graphviz $ python -m pip install graphviz ``` ```{tab} conda ```console $ brew install graphviz $ conda install -c conda-forge python-graphviz ``` ```` ````{tab} Ubuntu ```{tab} pip ```console $ sudo apt install graphviz $ python -m pip install graphviz ``` ```{tab} conda ```console $ sudo apt install graphviz $ conda install -c conda-forge python-graphviz ``` ```` ````{tab} Windows ```{tab} pip ```console $ python -m pip install graphviz ``` ```{tab} conda ```console $ conda install -c conda-forge python-graphviz ``` ```` If you installed graphviz for **Windows** with `pip`, install graphviz.exe from the [official source](https://graphviz.org/download/#windows). ## Source To install Featuretools from source, clone the repository from [GitHub](https://github.com/alteryx/featuretools), and install the dependencies. ```bash git clone https://github.com/alteryx/featuretools.git cd featuretools python -m pip install . ``` ## Docker It is also possible to run Featuretools inside a Docker container. You can do so by installing it as a package inside a container (following the normal install guide) or creating a new image with Featuretools pre-installed, using the following commands in your `Dockerfile`: ```dockerfile FROM --platform=linux/x86_64 python:3.9-slim-buster RUN apt update && apt -y update RUN apt install -y build-essential RUN pip3 install --upgrade --quiet pip RUN pip3 install featuretools ``` # Development To make contributions to the codebase, please follow the guidelines [here](https://github.com/alteryx/featuretools/blob/main/contributing.md). ================================================ FILE: docs/source/release_notes.rst ================================================ .. 
_release_notes: Release Notes ------------- Future Release ============== * Enhancements * Fixes * Changes * Restrict numpy to <2.0.0 (:pr:`2743`) * Documentation Changes * Update API Docs to include previously missing primitives (:pr:`2737`) * Testing Changes Thanks to the following people for contributing to this release: :user:`thehomebrewnerd` v1.31.0 May 14, 2024 ==================== * Enhancements * Add support for Python 3.12 (:pr:`2713`) * Fixes * Move ``flatten_list`` util function into ``feature_discovery`` module to fix import bug (:pr:`2702`) * Changes * Temporarily restrict Dask version (:pr:`2694`) * Remove support for creating ``EntitySets`` from Dask or Pyspark dataframes (:pr:`2705`) * Bump minimum versions of ``tqdm`` and ``pip`` in requirements files (:pr:`2716`) * Use ``filter`` arg in call to ``tarfile.extractall`` to safely deserialize EntitySets (:pr:`2722`) * Testing Changes * Fix serialization test to work with pytest 8.1.1 (:pr:`2694`) * Update to allow minimum dependency checker to run properly (:pr:`2709`) * Update pull request check CI action (:pr:`2720`) * Update release notes updated check CI action (:pr:`2726`) Thanks to the following people for contributing to this release: :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * With this release of Featuretools, EntitySets can no longer be created from Dask or Pyspark dataframes. The behavior when using pandas dataframes to create EntitySets remains unchanged. v1.30.0 Feb 26, 2024 ==================== * Changes * Update min requirements for numpy, pandas and Woodwork (:pr:`2681`) * Update release notes version for release(:pr:`2689`) * Testing Changes * Update ``make_ecommerce_entityset`` to work without Dask (:pr:`2677`) Thanks to the following people for contributing to this release: :user:`tamargrey`, :user:`thehomebrewnerd` v1.29.0 Feb 16, 2024 ==================== .. 
warning:: This release of Featuretools will not support Python 3.8 * Fixes * Fix dependency issues (:pr:`2644`, :pr:`2656`) * Add workaround for pandas 2.2.0 bug with nunique and unpin pandas deps (:pr:`2657`) * Changes * Fix deprecation warnings with is_categorical_dtype (:pr:`2641`) * Remove woodwork, pyarrow, numpy, and pandas pins for spark installation (:pr:`2661`) * Documentation Changes * Update Featuretools logo to display properly in dark mode (:pr:`2632`) * Remove references to premium primitives while release isn't possible (:pr:`2674`) * Testing Changes * Update tests for compatibility with new versions of ``holidays`` (:pr:`2636`) * Update ruff to 0.1.6 and use ruff linter/formatter (:pr:`2639`) * Update ``release.yaml`` to use trusted publisher for PyPI releases (:pr:`2646`, :pr:`2653`, :pr:`2654`) * Update dependency checkers and tests to include Dask (:pr:`2658`) * Fix the tests that run with Woodwork main so they can be triggered (:pr:`2657`) * Fix minimum dependency checker action (:pr:`2664`) * Fix Slack alert for tests with Woodwork main branch (:pr:`2668`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`thehomebrewnerd`, :user:`tamargrey`, :user:`LakshmanKishore` v1.28.0 Oct 26, 2023 ==================== * Fixes * Fix bug with default value in ``PercentTrue`` primitive (:pr:`2627`) * Changes * Refactor ``featuretools/tests/primitive_tests/utils.py`` to leverage list comprehensions for improved Pythonic quality (:pr:`2607`) * Refactor ``can_stack_primitive_on_inputs`` (:pr:`2522`) * Update s3 bucket for docs image (:pr:`2593`) * Temporarily restrict pandas max version to ``<2.1.0`` and pyarrow to ``<13.0.0`` (:pr:`2609`) * Update for compatibility with pandas version ``2.1.0`` and remove pandas upper version restriction (:pr:`2616`) * Documentation Changes * Fix badge on README for tests (:pr:`2598`) * Update readthedocs config to use build.os (:pr:`2601`) * Testing Changes * Update airflow looking glass 
performance tests workflow (:pr:`2615`) * Removed old performance testing workflow (:pr:`2620`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`petejanuszewski1`, :user:`thehomebrewnerd`, :user:`tosemml` v1.27.0 Jul 24, 2023 ==================== * Enhancements * Add support for Python 3.11 (:pr:`2583`) * Add support for ``pandas`` v2.0 (:pr:`2585`) * Changes * Remove natural language primitives add-on (:pr:`2570`) * Updates to address various warnings (:pr:`2589`) * Testing Changes * Run looking glass performance tests on merge via Airflow (:pr:`2575`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`petejanuszewski1`, :user:`sbadithe`, :user:`thehomebrewnerd` v1.26.0 Apr 27, 2023 ==================== * Enhancements * Introduce New Single-Table DFS Algorithm (:pr:`2516`). This includes **experimental** functionality and is not officially supported. * Add premium primitives install command (:pr:`2545`) * Fixes * Fix Description of ``DaysInMonth`` (:pr:`2547`) * Changes * Make Dask an optional dependency (:pr:`2560`) Thanks to the following people for contributing to this release: :user:`dvreed77`, :user:`gsheni`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * Dask is now an optional dependency of Featuretools. Users that run ``calculate_feature_matrix`` with ``n_jobs`` set to anything other than 1, will now need to install Dask prior to running ``calculate_feature_matrix``. The required Dask dependencies can be installed with ``pip install "featuretools[dask]"``. 
v1.25.0 Apr 13, 2023 ==================== * Enhancements * Add ``MaxCount``, ``MedianCount``, ``MaxMinDelta``, ``NUniqueDays``, ``NMostCommonFrequency``, ``NUniqueDaysOfCalendarYear``, ``NUniqueDaysOfMonth``, ``NUniqueMonths``, ``NUniqueWeeks``, ``IsFirstWeekOfMonth`` (:pr:`2533`) * Add ``HasNoDuplicates``, ``NthWeekOfMonth``, ``IsMonotonicallyDecreasing``, ``IsMonotonicallyIncreasing``, ``IsUnique`` (:pr:`2537`) * Fixes * Fix release notes header version (:pr:`2544`) * Changes * Restrict pandas to < 2.0.0 (:pr:`2533`) * Upgrade minimum pandas to 1.5.0 (:pr:`2537`) * Removed the ``Correlation`` and ``AutoCorrelation`` primitive as these could lead to data leakage (:pr:`2537`) * Remove IntegerNullable support for ``Kurtosis`` primitive (:pr:`2537`) Thanks to the following people for contributing to this release: :user:`gsheni` v1.24.0 Mar 28, 2023 ==================== * Enhancements * Add ``AverageCountPerUnique``, ``CountryCodeToContinent``, ``FileExtension``, ``FirstLastTimeDelta``, ``SavgolFilter``, ``CumulativeTimeSinceLastFalse``, ``CumulativeTimeSinceLastTrue``, ``PercentChange``, ``PercentUnique`` (:pr:`2485`) * Add ``FullNameToFirstName``, ``FullNameToLastName``, ``FullNameToTitle``, ``AutoCorrelation``, ``Correlation``, ``DateFirstEvent`` (:pr:`2507`) * Add ``Kurtosis``, ``MinCount``, ``NumFalseSinceLastTrue``, ``NumPeaks``, ``NumTrueSinceLastFalse``, ``NumZeroCrossings`` (:pr:`2514`) * Fixes * Pin github-action-check-linked-issues to 1.4.5 (:pr:`2497`) * Support Woodwork's update numeric inference (integers as strings) (:pr:`2505`) * Update ``SubtractNumeric`` Primitive with commutative class property (:pr:`2527`) * Changes * Separate Makefile command for core requirements, test requirements and dev requirements (:pr:`2518`) Thanks to the following people for contributing to this release: :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD` v1.23.0 Feb 15, 2023 ==================== * Changes * Change ``TotalWordLength`` and ``UpperCaseWordCount`` to return 
``IntegerNullable`` (:pr:`2474`) * Testing Changes * Add GitHub Actions cache to speed up workflows (:pr:`2475`) * Fix latest dependency checker install command (:pr:`2476`) * Add pull request check for linked issues to CI workflow (:pr:`2477`, :pr:`2481`) * Remove make package from lint workflow (:pr:`2479`) Thanks to the following people for contributing to this release: :user:`dvreed77`, :user:`gsheni`, :user:`sbadithe` v1.22.0 Jan 31, 2023 ==================== * Enhancements * Add ``AbsoluteDiff``, ``SameAsPrevious``, ``Variance``, ``Season``, ``UpperCaseWordCount`` transform primitives (:pr:`2460`) * Fixes * Fix bug with consecutive spaces in ``NumWords`` (:pr:`2459`) * Fix for compatibility with ``holidays`` v0.19.0 (:pr:`2471`) * Changes * Specify black and ruff config arguments in pre-commit-config (:pr:`2456`) * ``NumCharacters`` returns null given null input (:pr:`2463`) * Documentation Changes * Update ``release.md`` with instructions for launching Looking Glass performance test runs (:pr:`2461`) * Pin ``jupyter-client==7.4.9`` to fix broken documentation build (:pr:`2463`) * Unpin jupyter-client documentation requirement (:pr:`2468`) * Testing Changes * Add test suites for ``NumWords`` and ``NumCharacters`` primitives (:pr:`2459`, :pr:`2463`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd` v1.21.0 Jan 18, 2023 ==================== * Enhancements * Add `get_recommended_primitives` function to featuretools (:pr:`2398`) * Changes * Update build_docs workflow to only run for Python 3.8 and Python 3.10 (:pr:`2447`) * Documentation Changes * Minor fix to release notes (:pr:`2444`) * Testing Changes * Add test that checks for Natural Language primitives timing out against edge-case input (:pr:`2429`) * Fix test compatibility with composeml 0.10 (:pr:`2439`) * Minimum dependency unit test jobs do not abort if one job fails (:pr:`2437`) * Run Looking Glass performance 
tests on merge to main (:pr:`2440`, :pr:`2441`) * Add ruff for linting and replace isort/flake8 (:pr:`2448`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd` v1.20.0 Jan 5, 2023 =================== * Enhancements * Add ``TimeSinceLastFalse``, ``TimeSinceLastMax``, ``TimeSinceLastMin``, and ``TimeSinceLastTrue`` primitives (:pr:`2418`) * Add ``MaxConsecutiveFalse``, ``MaxConsecutiveNegatives``, ``MaxConsecutivePositives``, ``MaxConsecutiveTrue``, ``MaxConsecutiveZeros``, ``NumConsecutiveGreaterMean``, ``NumConsecutiveLessMean`` (:pr:`2420`) * Fixes * Fix typo in ``_handle_binary_comparison`` function name and update ``set_feature_names`` docstring (:pr:`2388`) * Only allow Datetime time index as input to ``RateOfChange`` primitive (:pr:`2408`) * Prevent catastrophic backtracking in regex for ``NumberOfWordsInQuotes`` (:pr:`2413`) * Fix to eliminate fragmentation ``PerformanceWarning`` in ``feature_set_calculator.py`` (:pr:`2424`) * Fix serialization of ``NumberOfCommonWords`` feature with custom word_set (:pr:`2432`) * Improve edge case handling in NaturalLanguage primitives by standardizing delimiter regex (:pr:`2423`) * Remove support for ``Datetime`` and ``Ordinal`` inputs in several primitives to prevent creation of Features that cannot be calculated (:pr:`2434`) * Changes * Refactor ``_all_direct_and_same_path`` by deleting call to ``_features_have_same_path`` (:pr:`2400`) * Refactor ``_build_transform_features`` by iterating over ``input_features`` once (:pr:`2400`) * Iterate only once over ``ignore_columns`` in ``DeepFeatureSynthesis`` init (:pr:`2397`) * Resolve empty Pandas series warnings (:pr:`2403`) * Initialize Woodwork with ``init_with_partial_schema`` instead of ``init`` in ``EntitySet.add_last_time_indexes`` (:pr:`2409`) * Updates for compatibility with numpy 1.24.0 (:pr:`2414`) * The ``delimiter_regex`` parameter for ``TotalWordLength`` has been 
renamed to ``do_not_count`` (:pr:`2423`) * Documentation Changes * Remove unused sections from 1.19.0 notes (:pr:`2396`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * The ``delimiter_regex`` parameter for ``TotalWordLength`` has been renamed to ``do_not_count``. Old saved features that had a non-default value for the parameter will no longer load. * Support for ``Datetime`` and ``Ordinal`` inputs has been removed from the ``LessThanScalar``, ``GreaterThanScalar``, ``LessThanEqualToScalar`` and ``GreaterThanEqualToScalar`` primitives. v1.19.0 Dec 9, 2022 =================== * Enhancements * Add ``OneDigitPostalCode`` and ``TwoDigitPostalCode`` primitives (:pr:`2365`) * Add ``ExpandingCount``, ``ExpandingMin``, ``ExpandingMean``, ``ExpandingMax``, ``ExpandingSTD``, and ``ExpandingTrend`` primitives (:pr:`2343`) * Fixes * Fix DeepFeatureSynthesis to consider the ``base_of_exclude`` family of attributes when creating transform features(:pr:`2380`) * Fix bug with negative version numbers in ``test_version`` (:pr:`2389`) * Fix bug in ``MultiplyNumericBoolean`` primitive that can cause an error with certain input dtype combinations (:pr:`2393`) * Testing Changes * Fix version comparison in ``test_holiday_out_of_range`` (:pr:`2382`) Thanks to the following people for contributing to this release: :user:`sbadithe`, :user:`thehomebrewnerd` v1.18.0 Nov 15, 2022 ==================== * Enhancements * Add ``RollingOutlierCount`` primitive (:pr:`2129`) * Add ``RateOfChange`` primitive (:pr:`2359`) * Fixes * Sets ``uses_full_dataframe`` for ``Rolling*`` and ``Exponential*`` primitives (:pr:`2354`) * Updates for compatibility with upcoming Woodwork release 0.21.0 (:pr:`2363`) * Updates demo dataset location to use new links (:pr:`2366`) * Fix ``test_holiday_out_of_range`` after ``holidays`` release 0.17 (:pr:`2373`) * Changes * Remove click and CLI 
functions (``list-primitives``, ``info``) (:pr:`2353`, :pr:`2358`) * Documentation Changes * Build docs in parallel with Sphinx (:pr:`2351`) * Use non-editable install to allow local docs build (:pr:`2367`) * Remove primitives.featurelabs.com website from documentation (:pr:`2369`) * Testing Changes * Replace use of pytest's tmpdir fixture with tmp_path (:pr:`2344`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * The featuretools CLI has been completely removed. v1.17.0 Oct 31, 2022 ==================== * Enhancements * Add featuretools-sklearn-transformer as an extra installation option (:pr:`2335`) * Add CountAboveMean, CountBelowMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange (:pr:`2336`) * Changes * Restructure primitives directory to use individual primitives files (:pr:`2331`) * Restrict 2022.10.1 for dask and distributed (:pr:`2347`) * Documentation Changes * Add Featuretools-SQL to Install page on documentation (:pr:`2337`) * Fixes broken link in Featuretools documentation (:pr:`2339`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd` v1.16.0 Oct 24, 2022 ==================== * Enhancements * Add ExponentialWeighted primitives and DateToTimeZone primitive (:pr:`2318`) * Add 14 natural language primitives from ``nlp_primitives`` library (:pr:`2328`) * Documentation Changes * Fix typos in ``aggregation_primitive_base.py`` and ``features_deserializer.py`` (:pr:`2317`) (:pr:`2324`) * Update SQL integration documentation to reflect Snowflake compatibility (:pr:`2313`) * Testing Changes * Add Windows install test (:pr:`2330`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`sbadithe`, :user:`thehomebrewnerd` v1.15.0 Oct 6, 2022 
=================== * Enhancements * Add ``series_library`` attribute to ``EntitySet`` dictionary (:pr:`2257`) * Leverage ``Library`` Enum inheriting from ``str`` (:pr:`2275`) * Changes * Change default gap for Rolling* primitives from 0 to 1 to prevent accidental leakage (:pr:`2282`) * Updates for pandas 1.5.0 compatibility (:pr:`2290`, :pr:`2291`, :pr:`2308`) * Exclude documentation files from release workflow (:pr:`2295`) * Bump requirements for optional pyspark dependency (:pr:`2299`) * Bump ``scipy`` and ``woodwork[spark]`` dependencies (:pr:`2306`) * Documentation Changes * Add documentation describing how to use ``featuretools_sql`` with ``featuretools`` (:pr:`2262`) * Remove ``featuretools_sql`` as a docs requirement (:pr:`2302`) * Fix typo in ``DiffDatetime`` doctest (:pr:`2314`) * Fix typo in ``EntitySet`` documentation (:pr:`2315`) * Testing Changes * Remove graphviz version restrictions in Windows CI tests (:pr:`2285`) * Run CI tests with ``pytest -n auto`` (:pr:`2298`, :pr:`2310`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * The ``EntitySet`` schema has been updated to include a ``series_library`` attribute * The default behavior of the ``Rolling*`` primitives has changed in this release. If this primitive was used without defining the ``gap`` value, the feature values returned with this release will be different than feature values from prior releases. 
v1.14.0 Sep 1, 2022
===================
    * Enhancements
        * Replace ``NumericLag`` with ``Lag`` primitive (:pr:`2252`)
        * Refactor build_features to speed up long running DFS calls by 50% (:pr:`2224`)
    * Fixes
        * Fix compatibility issues with holidays 0.15 (:pr:`2254`)
    * Changes
        * Update release notes to make clear conda release portion (:pr:`2249`)
        * Use pyproject.toml only (move away from setup.cfg) (:pr:`2260`, :pr:`2263`, :pr:`2265`)
        * Add entry point instructions for pyproject.toml project (:pr:`2272`)
    * Documentation Changes
        * Fix to remove warning from Using Spark EntitySets Guide (:pr:`2258`)
    * Testing Changes
        * Add tests/profiling/dfs_profile.py (:pr:`2224`)
        * Add workflow to test featuretools without test dependencies (:pr:`2274`)

    Thanks to the following people for contributing to this release:
    :user:`cp2boston`, :user:`gsheni`, :user:`ozzieD`, :user:`stefaniesmith`, :user:`thehomebrewnerd`

v1.13.0 Aug 18, 2022
====================
    * Fixes
        * Allow boolean columns to be included in remove_highly_correlated_features (:pr:`2231`)
    * Changes
        * Refactor schema version checking to use ``packaging`` method (:pr:`2230`)
        * Extract duplicated logic for Rolling primitives into a general utility function (:pr:`2218`)
        * Set pandas version to >=1.4.0 (:pr:`2246`)
        * Remove workaround in ``roll_series_with_gap`` caused by pandas version < 1.4.0 (:pr:`2246`)
    * Documentation Changes
        * Add line breaks between sections of IsFederalHoliday primitive docstring (:pr:`2235`)
    * Testing Changes
        * Update create feedstock PR forked repo to use (:pr:`2223`, :pr:`2237`)
        * Update development requirements and use latest for documentation (:pr:`2225`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`sbadithe`, :user:`tamargrey`

v1.12.1 Aug 4, 2022
===================
    * Fixes
        * Update ``Trend`` and ``RollingTrend`` primitives to work with ``IntegerNullable`` inputs (:pr:`2204`)
        * ``camel_and_title_to_snake`` handles snake case strings with
          numbers (:pr:`2220`)
        * Change ``_get_description`` to split on blank lines to avoid truncating primitive descriptions (:pr:`2219`)
    * Documentation Changes
        * Add instructions to add new users to featuretools feedstock (:pr:`2215`)
    * Testing Changes
        * Add create feedstock PR workflow (:pr:`2181`)
        * Add performance tests for python 3.9 and 3.10 (:pr:`2198`, :pr:`2208`)
        * Add test to ensure primitive docstrings use standardized verbs (:pr:`2200`)
        * Configure codecov to avoid premature PR comments (:pr:`2209`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.12.0 Jul 19, 2022
====================
    .. warning::
        This release of Featuretools will not support Python 3.7

    * Enhancements
        * Add ``IsWorkingHours`` and ``IsLunchTime`` transform primitives (:pr:`2130`)
        * Add periods parameter to ``Diff`` and add ``DiffDatetime`` primitive (:pr:`2155`)
        * Add ``RollingTrend`` primitive (:pr:`2170`)
    * Fixes
        * Resolves Woodwork integration test failure and removes Python version check for codecov (:pr:`2182`)
    * Changes
        * Drop Python 3.7 support (:pr:`2169`, :pr:`2186`)
        * Add pre-commit hooks for linting (:pr:`2177`)
    * Documentation Changes
        * Augment single table entry in DFS to include information about passing in a dictionary for ``dataframes`` argument (:pr:`2160`)
    * Testing Changes
        * Standardize imports across test files to simplify accessing featuretools functions (:pr:`2166`)
        * Split spark tests into multiple CI jobs to speed up runtime (:pr:`2183`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`

v1.11.1 Jul 5, 2022
===================
    * Fixes
        * Remove 24th hour from PartOfDay primitive and add 0th hour (:pr:`2167`)

    Thanks to the following people for contributing to this release:
    :user:`tamargrey`

v1.11.0 Jun 30, 2022
====================
    * Enhancements
        * Add datetime and
          string types as valid arguments to dfs ``cutoff_time`` (:pr:`2147`)
        * Add ``PartOfDay`` transform primitive (:pr:`2128`)
        * Add ``IsYearEnd``, ``IsYearStart`` transform primitives (:pr:`2124`)
        * Add ``Feature.set_feature_names`` method to directly set output column names for multi-output features (:pr:`2142`)
        * Include np.nan testing for ``DayOfYear`` and ``DaysInMonth`` primitives (:pr:`2146`)
        * Allow dfs kwargs to be passed into ``get_valid_primitives`` (:pr:`2157`)
    * Changes
        * Improve serialization and deserialization to reduce storage of duplicate primitive information (:pr:`2136`, :pr:`2127`, :pr:`2144`)
        * Sort core requirements and test requirements in setup cfg (:pr:`2152`)
    * Testing Changes
        * Fix pandas warning and reduce dask .apply warnings (:pr:`2145`)
        * Pin graphviz version used in windows tests (:pr:`2159`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.10.0 Jun 23, 2022
====================
    * Enhancements
        * Add ``DayOfYear``, ``DaysInMonth``, ``Quarter``, ``IsLeapYear``, ``IsQuarterEnd``, ``IsQuarterStart`` transform primitives (:pr:`2110`, :pr:`2117`)
        * Add ``IsMonthEnd``, ``IsMonthStart`` transform primitives (:pr:`2121`)
        * Move ``Quarter`` test cases (:pr:`2123`)
        * Add ``summarize_primitives`` function for getting metrics about available primitives (:pr:`2099`)
    * Changes
        * Changes for compatibility with numpy 1.23.0 (:pr:`2135`, :pr:`2137`)
    * Documentation Changes
        * Update contributing.md to add pandoc (:pr:`2103`, :pr:`2104`)
        * Update NLP primitives section of API reference (:pr:`2109`)
        * Fixing release notes formatting (:pr:`2139`)
    * Testing Changes
        * Latest dependency checker installs spark dependencies (:pr:`2112`)
        * Fix test failures with pyspark v3.3.0 (:pr:`2114`, :pr:`2120`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`,
    :user:`thehomebrewnerd`

v1.9.2 Jun 10, 2022
===================
    * Fixes
        * Add feature origin information to all multi-output feature columns (:pr:`2102`)
    * Documentation Changes
        * Update contributing.md to add pandoc (:pr:`2103`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`thehomebrewnerd`

v1.9.1 May 27, 2022
===================
    * Enhancements
        * Update ``DateToHoliday`` and ``DistanceToHoliday`` primitives to work with timezone-aware inputs (:pr:`2056`)
    * Changes
        * Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (:pr:`2046`)
    * Documentation Changes
        * Update slack invite link to new (:pr:`2044`)
        * Add slack and stackoverflow icon to footer (:pr:`2087`)
        * Update dead links in docs and docstrings (:pr:`2092`, :pr:`2095`)
    * Testing Changes
        * Skip test for ``normalize_dataframe`` due to different error coming from Woodwork in 0.16.3 (:pr:`2052`)
        * Fix Woodwork install in test with Woodwork main branch (:pr:`2055`)
        * Use codecov action v3 (:pr:`2039`)
        * Add workflow to kickoff EvalML unit tests with Featuretools main (:pr:`2072`)
        * Rename yml to yaml for GitHub Actions workflows (:pr:`2073`, :pr:`2077`)
        * Update Dask test fixtures to prevent flaky behavior (:pr:`2079`)
        * Update Makefile with better pkg command (:pr:`2081`)
        * Add scheduled workflow that checks for broken links in documentation (:pr:`2084`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v1.9.0 Apr 27, 2022
===================
    * Enhancements
        * Improve ``UnusedPrimitiveWarning`` with additional information (:pr:`2003`)
        * Update DFS primitive matching to use all inputs defined in primitive ``input_types`` (:pr:`2019`)
        * Add ``MultiplyNumericBoolean`` primitive (:pr:`2035`)
    * Fixes
        * Fix issue with Ordinal inputs to binary comparison primitives (:pr:`2024`, :pr:`2025`)
    * Changes
        * Updated autonormalize version requirement (:pr:`2002`)
        * Remove extra NaN checking in LatLong
          primitives (:pr:`1924`)
        * Normalize LatLong NaN values during EntitySet creation (:pr:`1924`)
        * Pass primitive dictionaries into ``check_primitive`` to avoid repetitive calls (:pr:`2016`)
        * Remove ``Boolean`` and ``BooleanNullable`` from ``MultiplyNumeric`` primitive inputs (:pr:`2022`)
        * Update serialization for compatibility with Woodwork version 0.16.1 (:pr:`2030`)
    * Documentation Changes
        * Update README text to Alteryx (:pr:`2010`, :pr:`2015`)
    * Testing Changes
        * Update unit tests with Woodwork main branch workflow name (:pr:`2033`)
        * Add slack alert for failing unit tests with Woodwork main branch (:pr:`2040`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`thehomebrewnerd`

Note
++++
* The update to the DFS algorithm in this release may cause the number of features returned by ``ft.dfs`` to increase in some cases.

v1.8.0 Mar 31, 2022
===================
    * Changes
        * Removed ``make_trans_primitive`` and ``make_agg_primitive`` utility functions (:pr:`1970`)
    * Documentation Changes
        * Update project urls in setup cfg to include Twitter and Slack (:pr:`1981`)
        * Update nbconvert to version 6.4.5 to fix docs build issue (:pr:`1984`)
        * Update ReadMe to have centered badges and add docs badge (:pr:`1993`)
        * Add M1 installation instructions to docs and contributing (:pr:`1997`)
    * Testing Changes
        * Updated scheduled workflows to only run on Alteryx owned repos (:pr:`1973`)
        * Updated minimum dependency checker to use new version with write file support (:pr:`1975`, :pr:`1976`)
        * Add black linting package and remove autopep8 (:pr:`1978`)
        * Update tests for compatibility with Woodwork version 0.15.0 (:pr:`1984`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* The utility functions ``make_trans_primitive`` and ``make_agg_primitive`` have been removed.
  To create custom primitives, define the primitive class directly.

v1.7.0 Mar 16, 2022
===================
    * Enhancements
        * Add support for Python 3.10 (:pr:`1940`)
        * Added the SquareRoot, NaturalLogarithm, Sine, Cosine and Tangent primitives (:pr:`1948`)
    * Fixes
        * Updated the conda install commands to specify the channel (:pr:`1917`)
    * Changes
        * Update error message when DFS returns an empty list of features (:pr:`1919`)
        * Remove ``list_variable_types`` and related directories (:pr:`1929`)
        * Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (:pr:`1941`, :pr:`1950`, :pr:`1952`, :pr:`1954`, :pr:`1957`, :pr:`1964`)
        * Replace Koalas with pandas API on Spark (:pr:`1949`)
    * Documentation Changes
        * Add time series guide (:pr:`1896`)
        * Update minimum nlp_primitives requirement for docs (:pr:`1925`)
        * Add GitHub URL for PyPi (:pr:`1928`)
        * Add backport release support (:pr:`1932`)
        * Update instructions in ``release.md`` (:pr:`1963`)
    * Testing Changes
        * Update test cases to cover __main__.py file (:pr:`1927`)
        * Upgrade moto requirement (:pr:`1929`, :pr:`1938`)
        * Add Python 3.9 linting, install complete, and docs build CI tests (:pr:`1934`)
        * Add CI workflow to test with latest woodwork main branch (:pr:`1936`)
        * Add lower bound for wheel for minimum dependency checker and limit lint CI tests to Python 3.10 (:pr:`1945`)
        * Fix non-deterministic test in ``test_es.py`` (:pr:`1961`)

    Thanks to the following people for contributing to this release:
    :user:`andriyor`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kushal-gopal`, :user:`mingdavidqi`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tvdboom`

Breaking Changes
++++++++++++++++
* The deprecated utility ``list_variable_types`` has been removed from Featuretools.
v1.6.0 Feb 17, 2022
===================
    * Enhancements
        * Add ``IsFederalHoliday`` transform primitive (:pr:`1912`)
    * Fixes
        * Fix to catch new ``NotImplementedError`` raised by ``holidays`` library for unknown country (:pr:`1907`)
    * Changes
        * Remove outdated pandas workaround code (:pr:`1906`)
    * Documentation Changes
        * Add in-line tabs and copy-paste functionality to docs (:pr:`1905`)
    * Testing Changes
        * Fix URL deserialization file (:pr:`1909`)

    Thanks to the following people for contributing to this release:
    :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`

v1.5.0 Feb 14, 2022
===================
    .. warning::
        Featuretools may not support Python 3.7 in next non-bugfix release.

    * Enhancements
        * Add ability to use offset alias strings as inputs to rolling primitives (:pr:`1809`)
        * Update to add support for pandas version 1.4.0 (:pr:`1881`, :pr:`1895`)
    * Fixes
        * Fix ``featuretools_primitives`` entry point (:pr:`1891`)
    * Changes
        * Allow only snake, camel, and title case for primitives (:pr:`1854`)
        * Add autonormalize as an add-on library (:pr:`1840`)
        * Add DateToHoliday Transform Primitive (:pr:`1848`)
        * Add DistanceToHoliday Transform Primitive (:pr:`1853`)
        * Temporarily restrict pandas and koalas max versions (:pr:`1863`)
        * Add ``__setitem__`` method to overload ``add_dataframe`` method on EntitySet (:pr:`1862`)
        * Add support for woodwork 0.12.0 (:pr:`1872`, :pr:`1897`)
        * Split Datetime and LatLong primitives into separate files (:pr:`1861`)
        * Null values will not be included in index of normalized dataframe (:pr:`1897`)
    * Documentation Changes
        * Bump ipython version (:pr:`1857`)
        * Update README.md with Alteryx link (:pr:`1886`)
    * Testing Changes
        * Add check for package conflicts with install workflow (:pr:`1843`)
        * Change auto approve workflow to use assignee (:pr:`1843`)
        * Update auto approve workflow to delete branch and change on trigger (:pr:`1852`)
        * Upgrade tests to use compose version 0.8.0 (:pr:`1856`)
        * Updated deep feature synthesis and feature
          serialization tests to use new primitive files (:pr:`1861`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`jacobboney`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`

Breaking Changes
++++++++++++++++
* When using ``normalize_dataframe`` to create a new dataframe, the new dataframe's index will not include a null value.

v1.4.0 Jan 10, 2022
===================
    * Enhancements
        * Add LatLong transform primitives - GeoMidpoint, IsInGeoBox, CityblockDistance (:pr:`1814`)
        * Add issue templates for bugs, feature requests and documentation improvements (:pr:`1834`)
    * Fixes
        * Fix bug where Woodwork initialization could fail on feature matrix if cutoff times caused null values to be introduced (:pr:`1810`)
    * Changes
        * Skip code coverage for specific dask usage lines (:pr:`1829`)
        * Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (:pr:`1833`)
        * Remove pyyaml as a requirement (:pr:`1833`)
    * Documentation Changes
        * Remove testing on conda forge in release.md (:pr:`1811`)
    * Testing Changes
        * Enable auto-merge for minimum and latest dependency merge requests (:pr:`1818`, :pr:`1821`, :pr:`1822`)
        * Change auto approve workflow to use PR number and run every 30 minutes (:pr:`1827`)
        * Add auto approve workflow to run when unit tests complete (:pr:`1837`)
        * Test deserializing from S3 with mocked S3 fixtures only (:pr:`1825`)
        * Remove fastparquet as a test requirement (:pr:`1833`)

    Thanks to the following people for contributing to this release:
    :user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v1.3.0 Dec 2, 2021
==================
    * Enhancements
        * Add ``NumericLag`` transform primitive (:pr:`1797`)
    * Changes
        * Update pip to 21.3.1 for test requirements (:pr:`1789`)
    * Documentation Changes
        * Add Docker install instructions and documentation on the install page.
          (:pr:`1785`)
        * Update install page on documentation with correct python version (:pr:`1784`)
        * Fix formatting in Improving Computational Performance guide (:pr:`1786`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`HenryRocha`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.2.0 Nov 15, 2021
===================
    * Enhancements
        * Add Rolling Transform primitives with integer parameters (:pr:`1770`)
    * Fixes
        * Handle new graphviz FORMATS import (:pr:`1770`)
    * Changes
        * Add new version of featuretools_tsfresh_primitives as an add-on library (:pr:`1772`)
        * Add ``load_weather`` as demo dataset for time series (:pr:`1777`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`tamargrey`

v1.1.0 Nov 2, 2021
==================
    * Fixes
        * Check ``base_of_exclude`` attribute on primitive instead of feature class (:pr:`1749`)
        * Pin upper bound for pyspark (:pr:`1748`)
        * Fix ``get_unused_primitives`` only recognizing lowercase primitive strings (:pr:`1733`)
        * Require newer versions of dask and distributed (:pr:`1762`)
        * Fix bug with pass-through columns of cutoff_time df when n_jobs > 1 (:pr:`1765`)
    * Changes
        * Add new version of nlp_primitives as an add-on library (:pr:`1743`)
        * Change name of date_of_birth (column name) to birthday in mock dataset (:pr:`1754`)
    * Documentation Changes
        * Upgrade Sphinx and fix docs configuration error (:pr:`1760`)
    * Testing Changes
        * Modify CI to run unit test with latest dependencies on python 3.9 (:pr:`1738`)
        * Added Python version standardizer to Jupyter notebook linting (:pr:`1741`)

    Thanks to the following people for contributing to this release:
    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`ridicolos`, :user:`rwedge`

v1.0.0 Oct 12, 2021
===================
    * Enhancements
        * Add support for creating EntitySets from Woodwork DataTables (:pr:`1277`)
        * Add ``EntitySet.__deepcopy__`` that retains Woodwork typing information (:pr:`1465`)
        * Add
          ``EntitySet.__getstate__`` and ``EntitySet.__setstate__`` to preserve typing when pickling (:pr:`1581`)
        * Returned feature matrix has woodwork typing information (:pr:`1664`)
    * Fixes
        * Fix ``DFSTransformer`` Documentation for Featuretools 1.0 (:pr:`1605`)
        * Fix ``calculate_feature_matrix`` time type check and ``encode_features`` for synthesis tests (:pr:`1580`)
        * Revert reordering of categories in ``Equal`` and ``NotEqual`` primitives (:pr:`1640`)
        * Fix bug in ``EntitySet.add_relationship`` that caused ``foreign_key`` tag to be lost (:pr:`1675`)
        * Update DFS to not build features on last time index columns in dataframes (:pr:`1695`)
    * Changes
        * Remove ``add_interesting_values`` from ``Entity`` (:pr:`1269`)
        * Move ``set_secondary_time_index`` method from ``Entity`` to ``EntitySet`` (:pr:`1280`)
        * Refactor Relationship creation process (:pr:`1370`)
        * Replaced ``Entity.update_data`` with ``EntitySet.update_dataframe`` (:pr:`1398`)
        * Move validation check for uniform time index to ``EntitySet`` (:pr:`1400`)
        * Replace ``Entity`` objects in ``EntitySet`` with Woodwork dataframes (:pr:`1405`)
        * Refactor ``EntitySet.plot`` to work with Woodwork dataframes (:pr:`1468`)
        * Move ``last_time_index`` to be a column on the DataFrame (:pr:`1456`)
        * Update serialization/deserialization to work with Woodwork (:pr:`1452`)
        * Refactor ``EntitySet.query_by_values`` to work with Woodwork dataframes (:pr:`1467`)
        * Replace ``list_variable_types`` with ``list_logical_types`` (:pr:`1477`)
        * Allow deep EntitySet equality check (:pr:`1480`)
        * Update ``EntitySet.concat`` to work with Woodwork DataFrames (:pr:`1490`)
        * Add function to list semantic tags (:pr:`1486`)
        * Initialize Woodwork on feature matrix in ``remove_highly_correlated_features`` if necessary (:pr:`1618`)
        * Remove categorical-encoding as an add-on library (will be added back later) (:pr:`1632`)
        * Remove autonormalize as an add-on library (will be added back later) (:pr:`1636`)
        * Remove tsfresh, nlp_primitives, sklearn_transformer
          as an add-on library (will be added back later) (:pr:`1638`)
        * Update input and return types for ``CumCount`` primitive (:pr:`1651`)
        * Standardize imports of Woodwork (:pr:`1526`)
        * Rename target entity to target dataframe (:pr:`1506`)
        * Replace ``entity_from_dataframe`` with ``add_dataframe`` (:pr:`1504`)
        * Create features from Woodwork columns (:pr:`1582`)
        * Move default variable description logic to ``generate_description`` (:pr:`1403`)
        * Update Woodwork to version 0.4.0 with ``LogicalType.transform`` and LogicalType instances (:pr:`1451`)
        * Update Woodwork to version 0.4.1 with Ordinal order values and whitespace serialization fix (:pr:`1478`)
        * Use ``ColumnSchema`` for primitive input and return types (:pr:`1411`)
        * Update features to use Woodwork and remove ``Entity`` and ``Variable`` classes (:pr:`1501`)
        * Re-add ``make_index`` functionality to EntitySet (:pr:`1507`)
        * Use ``ColumnSchema`` in DFS primitive matching (:pr:`1523`)
        * Updates from Featuretools v0.26.0 (:pr:`1539`)
        * Leverage Woodwork better in ``add_interesting_values`` (:pr:`1550`)
        * Update ``calculate_feature_matrix`` to use Woodwork (:pr:`1533`)
        * Update Woodwork to version 0.6.0 with changed categorical inference (:pr:`1597`)
        * Update ``nlp-primitives`` requirement for Featuretools 1.0 (:pr:`1609`)
        * Remove remaining references to ``Entity`` and ``Variable`` in code (:pr:`1612`)
        * Update Woodwork to version 0.7.1 with changed initialization (:pr:`1648`)
        * Removes outdated workaround code related to a since-resolved pandas issue (:pr:`1677`)
        * Remove unused ``_dataframes_equal`` and ``camel_to_snake`` functions (:pr:`1683`)
        * Update Woodwork to version 0.8.0 for improved performance (:pr:`1689`)
        * Remove redundant typecasting in ``encode_features`` (:pr:`1694`)
        * Speed up ``encode_features`` if not inplace, some space cost (:pr:`1699`)
        * Clean up comments and commented out code (:pr:`1701`)
        * Update Woodwork to version 0.8.1 for improved performance (:pr:`1702`)
    * Documentation Changes
        * Add a
          Woodwork Typing in Featuretools guide (:pr:`1589`)
        * Add a resource guide for transitioning to Featuretools 1.0 (:pr:`1627`)
        * Update ``using_entitysets`` page to use Woodwork (:pr:`1532`)
        * Update FAQ page to use Woodwork integration (:pr:`1649`)
        * Update DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1557`)
        * Update Feature Primitives page to be Jupyter notebook and use Woodwork integration (:pr:`1556`)
        * Update Handling Time page to be Jupyter notebook and use Woodwork integration (:pr:`1552`)
        * Update Advanced Custom Primitives page to be Jupyter notebook and use Woodwork integration (:pr:`1587`)
        * Update Deployment page to use Woodwork integration (:pr:`1588`)
        * Update Using Dask EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1590`)
        * Update Specifying Primitive Options page to be Jupyter notebook and use Woodwork integration (:pr:`1593`)
        * Update API Reference to match Featuretools 1.0 API (:pr:`1600`)
        * Update Index page to be Jupyter notebook and use Woodwork integration (:pr:`1602`)
        * Update Feature Descriptions page to be Jupyter notebook and use Woodwork integration (:pr:`1603`)
        * Update Using Koalas EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1604`)
        * Update Glossary to use Woodwork integration (:pr:`1608`)
        * Update Tuning DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1610`)
        * Fix small formatting issues in Documentation (:pr:`1607`)
        * Remove Variables page and more references to variables (:pr:`1629`)
        * Update Feature Selection page to use Woodwork integration (:pr:`1618`)
        * Update Improving Performance page to be Jupyter notebook and use Woodwork integration (:pr:`1591`)
        * Fix typos in transition guide (:pr:`1672`)
        * Update installation instructions for 1.0.0rc1 announcement in docs (:pr:`1707`, :pr:`1708`, :pr:`1713`, :pr:`1716`)
        * Fixed broken link for Demo notebook in README.md (:pr:`1728`)
        * Update ``contributing.md`` to improve instructions for
          external contributors (:pr:`1723`)
        * Manually revert changes made by :pr:`1677` and :pr:`1679`. The related bug in pandas still exists. (:pr:`1731`)
    * Testing Changes
        * Remove entity tests (:pr:`1521`)
        * Fix broken ``EntitySet`` tests (:pr:`1548`)
        * Fix broken primitive tests (:pr:`1568`)
        * Added Jupyter notebook cleaner to the linters (:pr:`1719`)
        * Update reviewers for minimum and latest dependency checkers (:pr:`1715`)
        * Full coverage for EntitySet.__eq__ method (:pr:`1725`)
        * Add tests to verify all primitives can be initialized without parameter values (:pr:`1726`)

    Thanks to the following people for contributing to this release:
    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`VaishnaviNandakumar`

Breaking Changes
++++++++++++++++
* ``Entity.add_interesting_values`` has been removed. To add interesting values for a single entity, call ``EntitySet.add_interesting_values`` and pass the name of the dataframe for which to add interesting values in the ``dataframe_name`` parameter (:pr:`1405`, :pr:`1370`).
* ``Entity.set_secondary_time_index`` has been removed and replaced by ``EntitySet.set_secondary_time_index`` with an added ``dataframe_name`` parameter to specify the dataframe on which to set the secondary time index (:pr:`1405`, :pr:`1370`).
* ``Relationship`` initialization has been updated to accept four name values for the parent dataframe, parent column, child dataframe and child column instead of accepting two ``Variable`` objects (:pr:`1405`, :pr:`1370`).
* ``EntitySet.add_relationship`` has been updated to accept dataframe and column name values or a ``Relationship`` object. Adding a relationship from a ``Relationship`` object now requires passing the relationship as a keyword argument (:pr:`1405`, :pr:`1370`).
* ``Entity.update_data`` has been removed.
  To update the dataframe, call ``EntitySet.replace_dataframe`` and use the ``dataframe_name`` parameter (:pr:`1630`, :pr:`1522`).
* The data in an ``EntitySet`` is no longer stored in ``Entity`` objects. Instead, dataframes with Woodwork typing information are used. Accordingly, most language referring to “entities” will now refer to “dataframes”, references to “variables” will now refer to “columns”, and “variable types” will use the Woodwork type system’s “logical types” and “semantic tags” (:pr:`1405`).
* The dictionary of tuples passed to ``EntitySet.__init__`` has replaced the ``variable_types`` element with separate ``logical_types`` and ``semantic_tags`` dictionaries (:pr:`1405`).
* ``EntitySet.entity_from_dataframe`` no longer exists. To add new tables to an entityset, use ``EntitySet.add_dataframe`` (:pr:`1405`).
* ``EntitySet.normalize_entity`` has been renamed to ``EntitySet.normalize_dataframe`` (:pr:`1405`).
* Instead of raising an error at ``EntitySet.add_relationship`` when the dtypes of parent and child columns do not match, Featuretools will now check whether the Woodwork logical type of the parent and child columns match. If they do not match, there will now be a warning raised, and Featuretools will attempt to update the logical type of the child column to match the parent’s (:pr:`1405`).
* If no index is specified at ``EntitySet.add_dataframe``, the first column will only be used as index if Woodwork has not been initialized on the DataFrame. When adding a dataframe that already has Woodwork initialized, if there is no index set, an error will be raised (:pr:`1405`).
* Featuretools will no longer re-order columns in DataFrames so that the index column is the first column of the DataFrame (:pr:`1405`).
* Type inference can now be performed on Dask and Koalas dataframes, though a warning will be issued indicating that this may be computationally intensive (:pr:`1405`).
* ``EntitySet.time_type`` is no longer stored as ``Variable`` objects.
Instead, Woodwork typing is used, and a numeric time type will be indicated by the ``'numeric'`` semantic tag string, and a datetime time type will be indicated by the ``Datetime`` logical type (:pr:`1405`). * ``last_time_index``, ``secondary_time_index``, and ``interesting_values`` are no longer attributes of an entityset’s tables that can be accessed directly. Now they must be accessed through the metadata of the Woodwork DataFrame, which is a dictionary (:pr:`1405`). * The helper function ``list_variable_types`` will be removed in a future release and replaced by ``list_logical_types``. In the meantime, ``list_variable_types`` will return the same output as ``list_logical_types`` (:pr:`1447`). What's New in this Release ++++++++++++++++++++++++++ **Adding Interesting Values** To add interesting values for a single entity, call ``EntitySet.add_interesting_values`` passing the id of the dataframe for which interesting values should be added. .. code-block:: python >>> es.add_interesting_values(dataframe_name='log') **Setting a Secondary Time Index** To set a secondary time index for a specific dataframe, call ``EntitySet.set_secondary_time_index`` passing the dataframe name for which to set the secondary time index along with the dictionary mapping the secondary time index column to the for which the secondary time index applies. .. code-block:: python >>> customers_secondary_time_index = {'cancel_date': ['cancel_reason']} >>> es.set_secondary_time_index(dataframe_name='customers', customers_secondary_time_index) **Creating a Relationship and Adding to an EntitySet** Relationships are now created by passing parameters identifying the entityset along with four string values specifying the parent dataframe, parent column, child dataframe and child column. Specifying parameter names is optional. .. code-block:: python >>> new_relationship = Relationship( ... entityset=es, ... parent_dataframe_name='customers', ... parent_column_name='id', ... 
child_dataframe_name='sessions', ... child_column_name='customer_id' ... ) Relationships can now be added to EntitySets in one of two ways. The first approach is to pass in name values for the parent dataframe, parent column, child dataframe and child column. Specifying parameter names is optional with this approach. .. code-block:: python >>> es.add_relationship( ... parent_dataframe_name='customers', ... parent_column_name='id', ... child_dataframe_name='sessions', ... child_column_name='customer_id' ... ) Relationships can also be added by passing in a previously created ``Relationship`` object. When using this approach the ``relationship`` parameter name must be included. .. code-block:: python >>> es.add_relationship(relationship=new_relationship) **Replace DataFrame** To replace a dataframe in an EntitySet with a new dataframe, call ``EntitySet.replace_dataframe`` and pass in the name of the dataframe to replace along with the new data. .. code-block:: python >>> es.replace_dataframe(dataframe_name='log', df=df) **List Logical Types and Semantic Tags** Logical types and semantic tags have replaced variable types to parse and interpret columns. You can list all the available logical types by calling ``featuretools.list_logical_types``. .. code-block:: python >>> ft.list_logical_types() You can list all the available semantic tags by calling ``featuretools.list_semantic_tags``. .. 
code-block:: python >>> ft.list_semantic_tags() v0.27.1 Sep 2, 2021 =================== * Documentation Changes * Add banner to docs about upcoming Featuretools 1.0 release (:pr:`1669`) Thanks to the following people for contributing to this release: :user:`thehomebrewnerd` v0.27.0 Aug 31, 2021 ==================== * Changes * Remove autonormalize, tsfresh, nlp_primitives, sklearn_transformer, caegorical_encoding as an add-on libraries (will be added back later) (:pr:`1644`) * Emit a warning message when a ``featuretools_primitives`` entrypoint throws an exception (:pr:`1662`) * Throw a ``RuntimeError`` when two primitives with the same name are encountered during ``featuretools_primitives`` entrypoint handling (:pr:`1662`) * Prevent the ``featuretools_primitives`` entrypoint loader from loading non-class objects as well as the ``AggregationPrimitive`` and ``TransformPrimitive`` base classes (:pr:`1662`) * Testing Changes * Update latest dependency checker with proper install command (:pr:`1652`) * Update isort dependency (:pr:`1654`) Thanks to the following people for contributing to this release: :user:`davesque`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge` v0.26.2 Aug 17, 2021 ==================== * Documentation Changes * Specify conda channel and Windows exe in graphviz installation instructions (:pr:`1611`) * Remove GA token from the layout html (:pr:`1622`) * Testing Changes * Add additional reviewers to minimum and latest dependency checkers (:pr:`1558`, :pr:`1562`, :pr:`1564`, :pr:`1567`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`simha104` v0.26.1 Jul 23, 2021 ==================== * Fixes * Set ``name`` attribute for ``EmailAddressToDomain`` primitive (:pr:`1543`) * Documentation Changes * Remove and ignore unnecessary graph files (:pr:`1544`) Thanks to the following people for contributing to this release: :user:`davesque`, :user:`rwedge` v0.26.0 Jul 15, 2021 ==================== * Enhancements * 
Add ``replace_inf_values`` utility function for replacing ``inf`` values in a feature matrix (:pr:`1505`) * Add URLToProtocol, URLToDomain, URLToTLD, EmailAddressToDomain, IsFreeEmailDomain as transform primitives (:pr:`1508`, :pr:`1531`) * Fixes * ``include_entities`` correctly overrides ``exclude_entities`` in ``primitive_options`` (:pr:`1518`) * Documentation Changes * Prevent logging on build (:pr:`1498`) * Testing Changes * Test featuretools on pandas 1.3.0 release candidate and make fixes (:pr:`1492`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999` v0.25.0 Jun 11, 2021 ==================== * Enhancements * Add ``get_valid_primitives`` function (:pr:`1462`) * Add ``EntitySet.dataframe_type`` attribute (:pr:`1473`) * Changes * Upgrade minimum alteryx open source update checker to 2.0.0 (:pr:`1460`) * Testing Changes * Upgrade minimum pip requirement for testing to 21.1.2 (:pr:`1475`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge` v0.24.1 May 26, 2021 ==================== * Fixes * Update minimum pyyaml requirement to 5.4 (:pr:`1433`) * Update minimum psutil requirement to 5.6.6 (:pr:`1438`) * Documentation Changes * Update nbsphinx version to fix docs build issue (:pr:`1436`) * Testing Changes * Create separate workflows for each CI job (:pr:`1422`) * Add minimum dependency checker to generate minimum requirement files (:pr:`1428`) * Add unit tests against minimum dependencies for python 3.7 on PRs and main (:pr:`1432`, :pr:`1445`) * Update minimum urllib3 requirement to 1.26.5 (:pr:`1457`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd` v0.24.0 Apr 30, 2021 ==================== * Changes * Add auto assign bot on GitHub (:pr:`1380`) * Reduce DFS max_depth to 1 if single entity in 
entityset (:pr:`1412`) * Drop Python 3.6 support (:pr:`1413`) * Documentation Changes * Improve formatting of release notes (:pr:`1396`) * Testing Changes * Update Dask/Koalas test fixtures (:pr:`1382`) * Update Spark config in test fixtures and docs (:pr:`1387`, :pr:`1389`) * Don't cancel other CI jobs if one fails (:pr:`1386`) * Update boto3 and urllib3 version requirements (:pr:`1394`) * Update token for dependency checker PR creation (:pr:`1402`, :pr:`1407`, :pr:`1409`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd` v0.23.3 Mar 31, 2021 ==================== .. warning:: The next non-bugfix release of Featuretools will not support Python 3.6 * Changes * Minor updates to work with Koalas version 1.7.0 (:pr:`1351`) * Explicitly mention Python 3.8 support in setup.py classifiers (:pr:`1371`) * Fix issue with smart-open version 5.0.0 (:pr:`1372`, :pr:`1376`) * Testing Changes * Make release notes updated check separate from unit tests (:pr:`1347`) * Performance tests now specify which commit to check (:pr:`1354`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd` v0.23.2 Feb 26, 2021 ==================== .. 
warning:: The next non-bugfix release of Featuretools will not support Python 3.6 * Enhancements * The ``list_primitives`` function returns valid input types and the return type (:pr:`1341`) * Fixes * Restrict numpy version when installing koalas (:pr:`1329`) * Changes * Warn python 3.6 users support will be dropped in future release (:pr:`1344`) * Documentation Changes * Update docs for defining custom primitives (:pr:`1332`) * Update featuretools release instructions (:pr:`1345`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge` v0.23.1 Jan 29, 2021 ==================== * Fixes * Calculate direct features uses default value if parent missing (:pr:`1312`) * Fix bug and improve tests for ``EntitySet.__eq__`` and ``Entity.__eq__`` (:pr:`1323`) * Documentation Changes * Update Twitter link to documentation toolbar (:pr:`1322`) * Testing Changes * Unpin python-graphviz package on Windows (:pr:`1296`) * Reorganize and clean up tests (:pr:`1294`, :pr:`1303`, :pr:`1306`) * Trigger tests on pull request events (:pr:`1304`, :pr:`1315`) * Remove unnecessary test skips on Windows (:pr:`1320`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`seriallazer`, :user:`thehomebrewnerd` v0.23.0 Dec 31, 2020 ==================== * Fixes * Fix logic for inferring variable type from unusual dtype (:pr:`1273`) * Allow passing entities without relationships to ``calculate_feature_matrix`` (:pr:`1290`) * Changes * Move ``query_by_values`` method from ``Entity`` to ``EntitySet`` (:pr:`1251`) * Move ``_handle_time`` method from ``Entity`` to ``EntitySet`` (:pr:`1276`) * Remove usage of ``ravel`` to resolve unexpected warning with pandas 1.2.0 (:pr:`1286`) * Documentation Changes * Fix installation command for Add-ons (:pr:`1279`) * Fix various broken links in documentation (:pr:`1313`) * Testing Changes * Use repository-scoped token for 
dependency check (:pr:`1245`, :pr:`1248`) * Fix install error during docs CI test (:pr:`1250`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * ``Entity.query_by_values`` has been removed and replaced by ``EntitySet.query_by_values`` with an added ``entity_id`` parameter to specify which entity in the entityset should be used for the query. v0.22.0 Nov 30, 2020 ==================== * Enhancements * Allow variable descriptions to be set directly on variable (:pr:`1207`) * Add ability to add feature description captions to feature lineage graphs (:pr:`1212`) * Add support for local tar file in read_entityset (:pr:`1228`) * Fixes * Updates to fix unit test errors from koalas 1.4 (:pr:`1230`, :pr:`1232`) * Documentation Changes * Removed link to unused feedback board (:pr:`1220`) * Update footer with Alteryx Innovation Labs (:pr:`1221`) * Update links to repo in documentation to use alteryx org url (:pr:`1224`) * Testing Changes * Update release notes check to use new repo url (:pr:`1222`) * Use new version of pull request Github Action (:pr:`1234`) * Upgrade pip during featuretools[complete] test (:pr:`1236`) * Migrated CI tests to github actions (:pr:`1226`, :pr:`1237`, :pr:`1239`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd` v0.21.0 Oct 30, 2020 ==================== * Enhancements * Add ``describe_feature`` to generate an English language feature description for a given feature (:pr:`1201`) * Fixes * Update ``EntitySet.add_last_time_indexes`` to work with Koalas 1.3.0 (:pr:`1192`, :pr:`1202`) * Changes * Keep koalas requirements in separate file (:pr:`1195`) * Documentation Changes * Added footer to the documentation (:pr:`1189`) * Add guide for feature selection functions (:pr:`1184`) * Fix README.md 
badge with correct link (:pr:`1200`) * Testing Changes * Add ``pyspark`` and ``koalas`` to automated dependency checks (:pr:`1191`) * Add DockerHub credentials to CI testing environment (:pr:`1204`) * Update premium primitives job name on CI (:pr:`1205`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd` v0.20.0 Sep 30, 2020 ==================== .. warning:: The Text variable type has been deprecated and been replaced with the NaturalLanguage variable type. The Text variable type will be removed in a future release. * Fixes * Allow FeatureOutputSlice features to be serialized (:pr:`1150`) * Fix duplicate label column generation when labels are passed in cutoff times and approximate is being used (:pr:`1160`) * Determine calculate_feature_matrix behavior with approximate and a cutoff df that is a subclass of a pandas DataFrame (:pr:`1166`) * Changes * Text variable type has been replaced with NaturalLanguage (:pr:`1159`) * Documentation Changes * Update release doc for clarity and to add Future Release template (:pr:`1151`) * Use the PyData Sphinx theme (:pr:`1169`) * Testing Changes * Stop requiring single-threaded dask scheduler in tests (:pr:`1163`, :pr:`1170`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`tuethan1999` v0.19.0 Sep 8, 2020 =================== * Enhancements * Support use of Koalas DataFrames in entitysets (:pr:`1031`) * Add feature selection functions for null, correlated, and single value features (:pr:`1126`) * Fixes * Fix ``encode_features`` converting excluded feature columns to a numeric dtype (:pr:`1123`) * Improve performance of unused primitive check in dfs (:pr:`1140`) * Changes * Remove the ability to stack transform primitives (:pr:`1119`, :pr:`1145`) * Sort primitives passed to ``dfs`` to get consistent ordering of features\* 
(:pr:`1119`) * Documentation Changes * Added return values to dfs and calculate_feature_matrix (:pr:`1125`) * Testing Changes * Better test case for normalizing from no time index to time index (:pr:`1113`) \* When passing multiple instances of a primitive built with ``make_trans_primitive`` or ``make_agg_primitive``, those instances must have the same relative order when passed to ``dfs`` to ensure a consistent ordering of features. Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999` Breaking Changes ++++++++++++++++ * ``ft.dfs`` will no longer build features from Transform primitives where one of the inputs is a Transform feature, a GroupByTransform feature, or a Direct Feature of a Transform / GroupByTransform feature. This will make some features that would previously be generated by ``ft.dfs`` only possible if explicitly specified in ``seed_features``. v0.18.1 Aug 12, 2020 ==================== * Fixes * Fix ``EntitySet.plot()`` when given a dask entityset (:pr:`1086`) * Changes * Use ``nlp-primitives[complete]`` install for ``nlp_primitives`` extra in ``setup.py`` (:pr:`1103`) * Documentation Changes * Fix broken downloads badge in README.md (:pr:`1107`) * Testing Changes * Use CircleCI matrix jobs in config to trigger multiple runs of same job with different parameters (:pr:`1105`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`systemshift`, :user:`thehomebrewnerd` v0.18.0 Jul 31, 2020 ==================== * Enhancements * Warn user if supplied primitives are not used during dfs (:pr:`1073`) * Fixes * Use more consistent and uniform warnings (:pr:`1040`) * Fix issue with missing instance ids and categorical entity index (:pr:`1050`) * Remove warnings.simplefilter in feature_set_calculator to un-silence warnings (:pr:`1053`) * Fix feature visualization for features with '>' or '<' in name 
(:pr:`1055`) * Fix boolean dtype mismatch between encode_features and dfs and calculate_feature_matrix (:pr:`1082`) * Update primitive options to check reversed inputs if primitive is commutative (:pr:`1085`) * Fix inconsistent ordering of features between kernel restarts (:pr:`1088`) * Changes * Make DFS match ``TimeSince`` primitive with all ``Datetime`` types (:pr:`1048`) * Change default branch to ``main`` (:pr:`1038`) * Raise TypeError if improper input is supplied to ``Entity.delete_variables()`` (:pr:`1064`) * Updates for compatibility with pandas 1.1.0 (:pr:`1079`, :pr:`1089`) * Set pandas version to pandas>=0.24.1,<2.0.0. Filter pandas deprecation warning in Week primitive. (:pr:`1094`) * Documentation Changes * Remove benchmarks folder (:pr:`1049`) * Add custom variables types section to variables page (:pr:`1066`) * Testing Changes * Add fixture for ``ft.demo.load_mock_customer`` (:pr:`1036`) * Refactor Dask test units (:pr:`1052`) * Implement automated process for checking critical dependencies (:pr:`1045`, :pr:`1054`, :pr:`1081`) * Don't run changelog check for release PRs or automated dependency PRs (:pr:`1057`) * Fix non-deterministic behavior in Dask test causing codecov issues (:pr:`1070`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`monti-python`, :user:`rwedge`, :user:`systemshift`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`wsankey` v0.17.0 Jun 30, 2020 ==================== * Enhancements * Add ``list_variable_types`` and ``graph_variable_types`` for Variable Types (:pr:`1013`) * Add ``graph_feature`` to generate a feature lineage graph for a given feature (:pr:`1032`) * Fixes * Improve warnings when using a Dask dataframe for cutoff times (:pr:`1026`) * Error if attempting to add entityset relationship where child variable is also child index (:pr:`1034`) * Changes * Remove ``Feature.get_names`` (:pr:`1021`) * Remove unnecessary ``pd.Series`` and ``pd.DatetimeIndex`` calls 
from primitives (:pr:`1020`, :pr:`1024`) * Improve cutoff time handling when a single value or no value is passed (:pr:`1028`) * Moved ``find_variable_types`` to Variable utils (:pr:`1013`) * Documentation Changes * Add page on Variable Types to describe some Variable Types, and util functions (:pr:`1013`) * Remove featuretools enterprise from documentation (:pr:`1022`) * Add development install instructions to contributing.md (:pr:`1030`) * Testing Changes * Add ``required`` flag to CircleCI codecov upload command (:pr:`1035`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`tuethan1999` Breaking Changes ++++++++++++++++ * Removed ``Feature.get_names``, ``Feature.get_feature_names`` should be used instead v0.16.0 Jun 5, 2020 =================== * Enhancements * Support use of Dask DataFrames in entitysets (:pr:`783`) * Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`) * Add ability to use primitive classes and instances as keys in primitive_options dictionary (:pr:`993`) * Fixes * Cleanly close tqdm instance (:pr:`1018`) * Resolve issue with ``NaN`` values in ``LatLong`` columns (:pr:`1007`) * Testing Changes * Update tests for numpy v1.19.0 compatibility (:pr:`1016`) Thanks to the following people for contributing to this release: :user:`Alex-Monahan`, :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd` v0.15.0 May 29, 2020 ==================== * Enhancements * Add ``get_default_aggregation_primitives`` and ``get_default_transform_primitives`` (:pr:`945`) * Allow cutoff time dataframe columns to be in any order (:pr:`969`, :pr:`995`) * Add Age primitive, and make it a default transform primitive for DFS (:pr:`987`) * Add ``include_cutoff_time`` arg - control whether data at cutoff times are included in feature calculations (:pr:`959`) * Allow ``variable_types`` to be 
referenced by their ``type_string`` for the ``entity_from_dataframe`` function (:pr:`988`) * Fixes * Fix errors with Equals and NotEquals primitives when comparing categoricals or different dtypes (:pr:`968`) * Normalized type_strings of ``Variable`` classes so that the ``find_variable_types`` function produces a dictionary with a clear key to name transition (:pr:`982`, :pr:`996`) * Remove pandas.datetime in test_calculate_feature_matrix due to deprecation (:pr:`998`) * Documentation Changes * Add python 3.8 support for docs (:pr:`983`) * Adds consistent Entityset Docstrings (:pr:`986`) * Testing Changes * Add automated tests for python 3.8 environment (:pr:`847`) * Update testing dependencies (:pr:`976`) Thanks to the following people for contributing to this release: :user:`ctduffy`, :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rightx2`, :user:`rwedge`, :user:`sebrahimi1988`, :user:`thehomebrewnerd`, :user:`tuethan1999` Breaking Changes ++++++++++++++++ * Calls to ``featuretools.dfs`` or ``featuretools.calculate_feature_matrix`` that use a cutoff time dataframe, but do not label the time column with either the target entity time index variable name or as ``time``, will now result in an ``AttributeError``. Previously, the time column was selected to be the first column that was not the instance id column. With this update, the position of the column in the dataframe is no longer used to determine the time column. Now, both instance id columns and time columns in a cutoff time dataframe can be in any order as long as they are named properly. * The ``type_string`` attributes of all ``Variable`` subclasses are now a snake case conversion of their class names. This changes the ``type_string`` of the ``Unknown``, ``IPAddress``, ``EmailAddress``, ``SubRegionCode``, ``FilePath``, ``LatLong``, and ``ZIPcode`` classes. Old saved entitysets that used these variables may load incorrectly. 
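The snake case naming described in the last breaking change above can be illustrated with a small, self-contained sketch. The helper name ``camel_to_snake`` and its regular expressions are assumptions made for illustration only; they are not taken from the Featuretools source, and the library's own conversion may differ in edge cases.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase class name to snake_case (hypothetical helper)."""
    # First pass: split before a capitalized word that follows any character,
    # e.g. "IPAddress" -> "IP_Address".
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Second pass: split between a lowercase letter or digit and an uppercase
    # letter, which catches boundaries the first pass skipped.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


# Class names are taken from the breaking change note above.
for class_name in ["Unknown", "IPAddress", "EmailAddress", "SubRegionCode", "FilePath", "LatLong"]:
    print(class_name, "->", camel_to_snake(class_name))
```

Under this sketch, a saved entityset that stored an older ``type_string`` would no longer match the converted name, which is one way to picture why old files may load incorrectly.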
v0.14.0 Apr 30, 2020 ==================== * Enhancements * ft.encode_features - use less memory for one-hot encoded columns (:pr:`876`) * Fixes * Use logger.warning to fix deprecated logger.warn (:pr:`871`) * Add dtype to interesting_values to fix deprecated empty Series with no dtype (:pr:`933`) * Remove overlap in training windows (:pr:`930`) * Fix progress bar in notebook (:pr:`932`) * Changes * Change premium primitives CI test to Python 3.6 (:pr:`916`) * Remove Python 3.5 support (:pr:`917`) * Documentation Changes * Fix README links to docs (:pr:`872`) * Fix Github links with correct organizations (:pr:`908`) * Fix hyperlinks in docs and docstrings with updated address (:pr:`910`) * Remove unused script for uploading docs to AWS (:pr:`911`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge` Breaking Changes ++++++++++++++++ * Using training windows in feature calculations can result in different values than previous versions. This was done to prevent consecutive training windows from overlapping by excluding data at the oldest point in time. For example, if we use a cutoff time at the first minute of the hour with a one hour training window, the first minute of the previous hour will no longer be included in the feature calculation. v0.13.4 Mar 27, 2020 ==================== .. 
warning:: The next non-bugfix release of Featuretools will not support Python 3.5 * Fixes * Fix ft.show_info() not displaying in Jupyter notebooks (:pr:`863`) * Changes * Added Plugin Warnings at Entry Point (:pr:`850`, :pr:`869`) * Documentation Changes * Add links to primitives.featurelabs.com (:pr:`860`) * Add source code links to API reference (:pr:`862`) * Update links for testing Dask/Spark integrations (:pr:`867`) * Update release documentation for featuretools (:pr:`868`) * Testing Changes * Miscellaneous changes (:pr:`861`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`FreshLeaf8865`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd` v0.13.3 Feb 28, 2020 ==================== * Fixes * Fix a connection closed error when using n_jobs (:pr:`853`) * Changes * Pin msgpack dependency for Python 3.5; remove dataframe from Dask dependency (:pr:`851`) * Documentation Changes * Update link to help documentation page in Github issue template (:pr:`855`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`rwedge` v0.13.2 Jan 31, 2020 ==================== * Enhancements * Support for Pandas 1.0.0 (:pr:`844`) * Changes * Remove dependency on s3fs library for anonymous downloads from S3 (:pr:`825`) * Testing Changes * Added GitHub Action to automatically run performance tests (:pr:`840`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`rwedge` v0.13.1 Dec 28, 2019 ==================== * Fixes * Raise error when given wrong input for ignore_variables (:pr:`826`) * Fix multi-output features not created when there is no child data (:pr:`834`) * Removing type casting in Equals and NotEquals primitives (:pr:`504`) * Changes * Replace pd.timedelta time units that were deprecated (:pr:`822`) * Move sklearn wrapper to separate library (:pr:`835`, :pr:`837`) * Testing Changes * Run unit tests in windows environment (:pr:`790`) * Update boto3 
version requirement for tests (:pr:`838`) Thanks to the following people for contributing to this release: :user:`jeffzi`, :user:`kmax12`, :user:`rwedge`, :user:`systemshift` v0.13.0 Nov 30, 2019 ==================== * Enhancements * Added GitHub Action to auto upload releases to PyPI (:pr:`816`) * Fixes * Fix issue where some primitive options would not be applied (:pr:`807`) * Fix issue with converting to pickle or parquet after adding interesting features (:pr:`798`, :pr:`823`) * Diff primitive now calculates using all available data (:pr:`824`) * Prevent DFS from creating Identity Features of globally ignored variables (:pr:`819`) * Changes * Remove python 2.7 support from serialize.py (:pr:`812`) * Make smart_open, boto3, and s3fs optional dependencies (:pr:`827`) * Documentation Changes * Remove python 2.7 support and add 3.7 in install.rst (:pr:`805`) * Fix import error in docs (:pr:`803`) * Fix release title formatting in changelog (:pr:`806`) * Testing Changes * Use multiple CPUs to run tests on CI (:pr:`811`) * Refactor test entityset creation to avoid saving to disk (:pr:`813`, :pr:`821`) * Remove get_values() from test_es.py to remove warnings (:pr:`820`) Thanks to the following people for contributing to this release: :user:`frances-h`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`systemshift` Breaking Changes ++++++++++++++++ * The libraries used for downloading or uploading from S3 or URLs are now optional and will no longer be installed by default. To use this functionality they will need to be installed separately. * The fix to how the Diff primitive is calculated may slow down the overall calculation time of feature lists that use this primitive. 
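Because the S3 and URL libraries are no longer installed by default, a script that relies on them may want to check for them up front. The following is a minimal sketch using only the Python standard library; the helper names ``has_module`` and ``missing_optional_modules`` are hypothetical and not part of Featuretools, and the module names are the ones listed in the v0.13.0 notes above.

```python
import importlib.util


def has_module(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def missing_optional_modules(module_names):
    """Return the subset of module names that are not installed."""
    return [name for name in module_names if not has_module(name)]


# Module names come from the v0.13.0 release notes; later releases may differ.
missing = missing_optional_modules(["smart_open", "boto3", "s3fs"])
if missing:
    print("Install before using S3 or URL paths:", ", ".join(missing))
```

Checking with ``find_spec`` avoids importing the packages just to test for their presence, which keeps the check cheap when they are installed.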
v0.12.0 Oct 31, 2019 ==================== * Enhancements * Added First primitive (:pr:`770`) * Added Entropy aggregation primitive (:pr:`779`) * Allow custom naming for multi-output primitives (:pr:`780`) * Fixes * Prevents user from removing base entity time index using additional_variables (:pr:`768`) * Fixes error when a multioutput primitive was supplied to dfs as a groupby trans primitive (:pr:`786`) * Changes * Drop Python 2 support (:pr:`759`) * Add unit parameter to AvgTimeBetween (:pr:`771`) * Require Pandas 0.24.1 or higher (:pr:`787`) * Documentation Changes * Update featuretools slack link (:pr:`765`) * Set up repo to use Read the Docs (:pr:`776`) * Add First primitive to API reference docs (:pr:`782`) * Testing Changes * CircleCI fixes (:pr:`774`) * Disable PIP progress bars (:pr:`775`) Thanks to the following people for contributing to this release: :user:`ablacke-ayx`, :user:`BoopBoopBeepBoop`, :user:`jeffzi`, :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`twdobson` v0.11.0 Sep 30, 2019 ==================== .. 
warning:: The next non-bugfix release of Featuretools will not support Python 2 * Enhancements * Improve how files are copied and written (:pr:`721`) * Add number of rows to graph in entityset.plot (:pr:`727`) * Added support for pandas DateOffsets in DFS and Timedelta (:pr:`732`) * Enable feature-specific top_n value using a dictionary in encode_features (:pr:`735`) * Added progress_callback parameter to dfs() and calculate_feature_matrix() (:pr:`739`, :pr:`745`) * Enable specifying primitives on a per column or per entity basis (:pr:`748`) * Fixes * Fixed entity set deserialization (:pr:`720`) * Added error message when DateTimeIndex is a variable but not set as the time_index (:pr:`723`) * Fixed CumCount and other group-by transform primitives that take ID as input (:pr:`733`, :pr:`754`) * Fix progress bar undercounting (:pr:`743`) * Updated training_window error assertion to only check against observations (:pr:`728`) * Don't delete the whole destination folder while saving entityset (:pr:`717`) * Changes * Raise warning and not error on schema version mismatch (:pr:`718`) * Change feature calculation to return in order of instance ids provided (:pr:`676`) * Removed time remaining from displayed progress bar in dfs() and calculate_feature_matrix() (:pr:`739`) * Raise warning in normalize_entity() when time_index of base_entity has an invalid type (:pr:`749`) * Remove toolz as a direct dependency (:pr:`755`) * Allow boolean variable types to be used in the Multiply primitive (:pr:`756`) * Documentation Changes * Updated URL for Compose (:pr:`716`) * Testing Changes * Update dependencies (:pr:`738`, :pr:`741`, :pr:`747`) Thanks to the following people for contributing to this release: :user:`angela97lin`, :user:`chidauri`, :user:`christopherbunn`, :user:`frances-h`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`MarcoGorelli`, :user:`rwedge`, :user:`thehomebrewnerd` Breaking Changes ++++++++++++++++ * Feature calculations will return in the order of instance ids 
provided instead of the order of time points instances are calculated at. v0.10.1 Aug 25, 2019 ==================== * Fixes * Fix serialized LatLong data being loaded as strings (:pr:`712`) * Documentation Changes * Fixed FAQ cell output (:pr:`710`) Thanks to the following people for contributing to this release: :user:`gsheni`, :user:`rwedge` v0.10.0 Aug 19, 2019 ==================== .. warning:: The next non-bugfix release of Featuretools will not support Python 2 * Enhancements * Give more frequent progress bar updates and update chunk size behavior (:pr:`631`, :pr:`696`) * Added drop_first as param in encode_features (:pr:`647`) * Added support for stacking multi-output primitives (:pr:`679`) * Generate transform features of direct features (:pr:`623`) * Added serializing and deserializing from S3 and deserializing from URLs (:pr:`685`) * Added nlp_primitives as an add-on library (:pr:`704`) * Added AutoNormalize to Featuretools plugins (:pr:`699`) * Added functionality for relative units (month/year) in Timedelta (:pr:`692`) * Added categorical-encoding as an add-on library (:pr:`700`) * Fixes * Fix performance regression in DFS (:pr:`637`) * Fix deserialization of feature relationship path (:pr:`665`) * Set index after adding ancestor relationship variables (:pr:`668`) * Fix user-supplied variable_types modification in Entity init (:pr:`675`) * Don't calculate dependencies of unnecessary features (:pr:`667`) * Prevent normalize entity's new entity having same index as base entity (:pr:`681`) * Update variable type inference to better check for string values (:pr:`683`) * Changes * Moved dask, distributed imports (:pr:`634`) * Documentation Changes * Miscellaneous changes (:pr:`641`, :pr:`658`) * Modified doc_string of top_n in encoding (:pr:`648`) * Hyperlinked ComposeML (:pr:`653`) * Added FAQ (:pr:`620`, :pr:`677`) * Fixed FAQ question with multiple question marks (:pr:`673`) * Testing Changes * Add master, and release tests for premium primitives 
(:pr:`660`, :pr:`669`) * Miscellaneous changes (:pr:`672`, :pr:`674`) Thanks to the following people for contributing to this release: :user:`alexjwang`, :user:`allisonportis`, :user:`ayushpatidar`, :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`jeremyliweishih`, :user:`kmax12`, :user:`rwedge`, :user:`zhxt95` v0.9.1 Jul 3, 2019 ================== * Enhancements * Speedup groupby transform calculations (:pr:`609`) * Generate features along all paths when there are multiple paths between entities (:pr:`600`, :pr:`608`) * Fixes * Select columns of dataframe using a list (:pr:`615`) * Change type of features calculated on Index features to Categorical (:pr:`602`) * Filter dataframes through forward relationships (:pr:`625`) * Specify Dask version in requirements for python 2 (:pr:`627`) * Keep dataframe sorted by time during feature calculation (:pr:`626`) * Fix bug in encode_features that created duplicate columns of features with multiple outputs (:pr:`622`) * Changes * Remove unused variance_selection.py file (:pr:`613`) * Remove Timedelta data param (:pr:`619`) * Remove DaysSince primitive (:pr:`628`) * Documentation Changes * Add installation instructions for add-on libraries (:pr:`617`) * Clarification of Multi Output Feature Creation (:pr:`638`) * Miscellaneous changes (:pr:`632`, :pr:`639`) * Testing Changes * Miscellaneous changes (:pr:`595`, :pr:`612`) Thanks to the following people for contributing to this release: :user:`CJStadler`, :user:`kmax12`, :user:`rwedge`, :user:`gsheni`, :user:`kkleidal`, :user:`ctduffy` v0.9.0 Jun 19, 2019 =================== * Enhancements * Add unit parameter to timesince primitives (:pr:`558`) * Add ability to install optional add on libraries (:pr:`551`) * Load and save features from open files and strings (:pr:`566`) * Support custom variable types (:pr:`571`) * Support entitysets which have multiple paths between two entities (:pr:`572`, :pr:`544`) * Added show_info function, more 
output information added to CLI `featuretools info` (:pr:`525`) * Fixes * Normalize_entity specifies error when 'make_time_index' is an invalid string (:pr:`550`) * Schema version added for entityset serialization (:pr:`586`) * Renamed features have names correctly serialized (:pr:`585`) * Improved error message for index/time_index being the same column in normalize_entity and entity_from_dataframe (:pr:`583`) * Removed all mentions of allow_where (:pr:`587`, :pr:`588`) * Removed unused variable in normalize entity (:pr:`589`) * Change time since return type to numeric (:pr:`606`) * Changes * Refactor get_pandas_data_slice to take single entity (:pr:`547`) * Updates TimeSincePrevious and Diff Primitives (:pr:`561`) * Remove unnecessary time_last variable (:pr:`546`) * Documentation Changes * Add Featuretools Enterprise to documentation (:pr:`563`) * Miscellaneous changes (:pr:`552`, :pr:`573`, :pr:`577`, :pr:`599`) * Testing Changes * Miscellaneous changes (:pr:`559`, :pr:`569`, :pr:`570`, :pr:`574`, :pr:`584`, :pr:`590`) Thanks to the following people for contributing to this release: :user:`alexjwang`, :user:`allisonportis`, :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge` v0.8.0 May 17, 2019 =================== * Rename NUnique to NumUnique (:pr:`510`) * Serialize features as JSON (:pr:`532`) * Drop all variables at once in normalize_entity (:pr:`533`) * Remove unnecessary sorting from normalize_entity (:pr:`535`) * Features cache their names (:pr:`536`) * Only calculate features for instances before cutoff (:pr:`523`) * Remove all relative imports (:pr:`530`) * Added FullName Variable Type (:pr:`506`) * Add error message when target entity does not exist (:pr:`520`) * New demo links (:pr:`542`) * Remove duplicate features check in DFS (:pr:`538`) * featuretools_primitives entry point expects list of primitive classes (:pr:`529`) * Update ALL_VARIABLE_TYPES list (:pr:`526`) * More Informative N Jobs Prints and Warnings 
(:pr:`511`) * Update sklearn version requirements (:pr:`541`) * Update Makefile (:pr:`519`) * Remove unused parameter in Entity._handle_time (:pr:`524`) * Remove build_ext code from setup.py (:pr:`513`) * Documentation updates (:pr:`512`, :pr:`514`, :pr:`515`, :pr:`521`, :pr:`522`, :pr:`527`, :pr:`545`) * Testing updates (:pr:`509`, :pr:`516`, :pr:`517`, :pr:`539`) Thanks to the following people for contributing to this release: :user:`bphi`, :user:`CharlesBradshaw`, :user:`CJStadler`, :user:`glentennis`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge` Breaking Changes ++++++++++++++++ * ``NUnique`` has been renamed to ``NumUnique``. Previous behavior .. code-block:: python from featuretools.primitives import NUnique New behavior .. code-block:: python from featuretools.primitives import NumUnique v0.7.1 Apr 24, 2019 =================== * Automatically generate feature name for controllable primitives (:pr:`481`) * Primitive docstring updates (:pr:`489`, :pr:`492`, :pr:`494`, :pr:`495`) * Change primitive functions that returned strings to return functions (:pr:`499`) * CLI customizable via entrypoints (:pr:`493`) * Improve calculation of aggregation features on grandchildren (:pr:`479`) * Refactor entrypoints to use decorator (:pr:`483`) * Include doctests in testing suite (:pr:`491`) * Documentation updates (:pr:`490`) * Update how standard primitives are imported internally (:pr:`482`) Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`CharlesBradshaw`, :user:`glentennis`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`minkvsky`, :user:`rwedge`, :user:`thehomebrewnerd` v0.7.0 Mar 29, 2019 =================== * Improve Entity Set Serialization (:pr:`361`) * Support calling a primitive instance's function directly (:pr:`461`, :pr:`468`) * Support other libraries extending featuretools functionality via entrypoints (:pr:`452`) * Remove featuretools install command (:pr:`475`) * Add GroupByTransformFeature 
(:pr:`455`, :pr:`472`, :pr:`476`) * Update Haversine Primitive (:pr:`435`, :pr:`462`) * Add commutative argument to SubtractNumeric and DivideNumeric primitives (:pr:`457`) * Add FilePath variable_type (:pr:`470`) * Add PhoneNumber, DateOfBirth, URL variable types (:pr:`447`) * Generalize infer_variable_type, convert_variable_data and convert_all_variable_data methods (:pr:`423`) * Documentation updates (:pr:`438`, :pr:`446`, :pr:`458`, :pr:`469`) * Testing updates (:pr:`440`, :pr:`444`, :pr:`445`, :pr:`459`) Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`CharlesBradshaw`, :user:`ColCarroll`, :user:`glentennis`, :user:`grayskripko`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`jrkinley`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge` Breaking Changes ++++++++++++++++ * ``ft.dfs`` now has a ``groupby_trans_primitives`` parameter that DFS uses to automatically construct features that group by an ID column and then apply a transform primitive to each group. This change applies to the following primitives: ``CumSum``, ``CumCount``, ``CumMean``, ``CumMin``, and ``CumMax``. Previous behavior .. code-block:: python ft.dfs(entityset=es, target_entity='customers', trans_primitives=["cum_mean"]) New behavior .. code-block:: python ft.dfs(entityset=es, target_entity='customers', groupby_trans_primitives=["cum_mean"]) * Related to the above change, cumulative transform features are now defined using a new feature class, ``GroupByTransformFeature``. Previous behavior .. code-block:: python ft.Feature([base_feature, groupby_feature], primitive=CumulativePrimitive) New behavior .. 
code-block:: python ft.Feature(base_feature, groupby=groupby_feature, primitive=CumulativePrimitive) v0.6.1 Feb 15, 2019 =================== * Cumulative primitives (:pr:`410`) * Entity.query_by_values now preserves row order of underlying data (:pr:`428`) * Implementing Country Code and Sub Region Codes as variable types (:pr:`430`) * Added IPAddress and EmailAddress variable types (:pr:`426`) * Install data and dependencies (:pr:`403`) * Add TimeSinceFirst, fix TimeSinceLast (:pr:`388`) * Allow user to pass in desired feature return types (:pr:`372`) * Add new configuration object (:pr:`401`) * Replace NUnique get_function (:pr:`434`) * _calculate_idenity_features now only returns the features asked for, instead of the entire entity (:pr:`429`) * Primitive function name uniqueness (:pr:`424`) * Update NumCharacters and NumWords primitives (:pr:`419`) * Removed Variable.dtype (:pr:`416`, :pr:`433`) * Change to zipcode rep, str for pandas (:pr:`418`) * Remove pandas version upper bound (:pr:`408`) * Make S3 dependencies optional (:pr:`404`) * Check that agg_primitives and trans_primitives are right primitive type (:pr:`397`) * Mean primitive changes (:pr:`395`) * Fix transform stacking on multi-output aggregation (:pr:`394`) * Fix list_primitives (:pr:`391`) * Handle graphviz dependency (:pr:`389`, :pr:`396`, :pr:`398`) * Testing updates (:pr:`402`, :pr:`417`, :pr:`433`) * Documentation updates (:pr:`400`, :pr:`409`, :pr:`415`, :pr:`417`, :pr:`420`, :pr:`421`, :pr:`422`, :pr:`431`) Thanks to the following people for contributing to this release: :user:`CharlesBradshaw`, :user:`csala`, :user:`floscha`, :user:`gsheni`, :user:`jxwolstenholme`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge` v0.6.0 Jan 30, 2019 =================== * Primitive refactor (:pr:`364`) * Mean ignore NaNs (:pr:`379`) * Plotting entitysets (:pr:`382`) * Add seed features later in DFS process (:pr:`357`) * Multiple output column features (:pr:`376`) * Add ZipCode Variable Type (:pr:`367`) * 
Add `primitive.get_filepath` and example of primitive loading data from external files (:pr:`380`) * Transform primitives take series as input (:pr:`385`) * Update dependency requirements (:pr:`378`, :pr:`383`, :pr:`386`) * Add modulo to override tests (:pr:`384`) * Update documentation (:pr:`368`, :pr:`377`) * Update README.md (:pr:`366`, :pr:`373`) * Update CI tests (:pr:`359`, :pr:`360`, :pr:`375`) Thanks to the following people for contributing to this release: :user:`floscha`, :user:`gsheni`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge` v0.5.1 Dec 17, 2018 =================== * Add missing dependencies (:pr:`353`) * Move comment to note in documentation (:pr:`352`) v0.5.0 Dec 17, 2018 =================== * Add specific error for duplicate additional/copy_variables in normalize_entity (:pr:`348`) * Removed EntitySet._import_from_dataframe (:pr:`346`) * Removed time_index_reduce parameter (:pr:`344`) * Allow installation of additional primitives (:pr:`326`) * Fix DatetimeIndex variable conversion (:pr:`342`) * Update Sklearn DFS Transformer (:pr:`343`) * Clean up entity creation logic (:pr:`336`) * remove casting to list in transform feature calculation (:pr:`330`) * Fix sklearn wrapper (:pr:`335`) * Add readme to pypi * Update conda docs after move to conda-forge (:pr:`334`) * Add wrapper for scikit-learn Pipelines (:pr:`323`) * Remove parse_date_cols parameter from EntitySet._import_from_dataframe (:pr:`333`) Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`georgewambold`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, and :user:`rwedge`. 
v0.4.1 Nov 29, 2018 =================== * Resolve bug preventing using first column as index by default (:pr:`308`) * Handle return type when creating features from Id variables (:pr:`318`) * Make id an optional parameter of EntitySet constructor (:pr:`324`) * Handle primitives with same function being applied to same column (:pr:`321`) * Update requirements (:pr:`328`) * Clean up DFS arguments (:pr:`319`) * Clean up Pandas Backend (:pr:`302`) * Update properties of cumulative transform primitives (:pr:`320`) * Feature stability between versions documentation (:pr:`316`) * Add download count to GitHub readme (:pr:`310`) * Fixed #297 update tests to check error strings (:pr:`303`) * Remove usage of fixtures in agg primitive tests (:pr:`325`) v0.4.0 Oct 31, 2018 =================== * Remove ft.utils.gen_utils.getsize and make pympler a test requirement (:pr:`299`) * Update requirements.txt (:pr:`298`) * Refactor EntitySet.find_path(...) (:pr:`295`) * Clean up unused methods (:pr:`293`) * Remove unused parents property of Entity (:pr:`283`) * Removed relationships parameter (:pr:`284`) * Improve time index validation (:pr:`285`) * Encode features with "unknown" class in categorical (:pr:`287`) * Allow where clauses on direct features in Deep Feature Synthesis (:pr:`279`) * Change to fullargsspec (:pr:`288`) * Parallel verbose fixes (:pr:`282`) * Update tests for python 3.7 (:pr:`277`) * Check duplicate rows cutoff times (:pr:`276`) * Load retail demo data using compressed file (:pr:`271`) v0.3.1 Sep 28, 2018 =================== * Handling time rewrite (:pr:`245`) * Update deep_feature_synthesis.py (:pr:`249`) * Handling return type when creating features from DatetimeTimeIndex (:pr:`266`) * Update retail.py (:pr:`259`) * Improve Consistency of Transform Primitives (:pr:`236`) * Update demo docstrings (:pr:`268`) * Handle non-string column names (:pr:`255`) * Clean up merging of aggregation primitives (:pr:`250`) * Add tests for Entity methods (:pr:`262`) * Handle no 
child data when calculating aggregation features with multiple arguments (:pr:`264`) * Add `is_string` utils function (:pr:`260`) * Update python versions to match docker container (:pr:`261`) * Handle where clause when no child data (:pr:`258`) * No longer cache demo csvs, remove config file (:pr:`257`) * Avoid stacking "expanding" primitives (:pr:`238`) * Use randomly generated names in retail csv (:pr:`233`) * Update README.md (:pr:`243`) v0.3.0 Aug 27, 2018 =================== * Improve performance of all feature calculations (:pr:`224`) * Update agg primitives to use more efficient functions (:pr:`215`) * Optimize metadata calculation (:pr:`229`) * More robust handling when no data at a cutoff time (:pr:`234`) * Workaround categorical merge (:pr:`231`) * Switch which CSV is associated with which variable (:pr:`228`) * Remove unused kwargs from query_by_values, filter_and_sort (:pr:`225`) * Remove convert_links_to_integers (:pr:`219`) * Add conda install instructions (:pr:`223`, :pr:`227`) * Add example of using Dask to parallelize to docs (:pr:`221`) v0.2.2 Aug 20, 2018 =================== * Remove unnecessary check no related instances call and refactor (:pr:`209`) * Improve memory usage through support for pandas categorical types (:pr:`196`) * Bump minimum pandas version from 0.20.3 to 0.23.0 (:pr:`216`) * Better parallel memory warnings (:pr:`208`, :pr:`214`) * Update demo datasets (:pr:`187`, :pr:`201`, :pr:`207`) * Make primitive lookup case insensitive (:pr:`213`) * Use capital name (:pr:`211`) * Set class name for Min (:pr:`206`) * Remove ``variable_types`` from normalize entity (:pr:`205`) * Handle parquet serialization with last time index (:pr:`204`) * Reset index of cutoff times in calculate feature matrix (:pr:`198`) * Check argument types for .normalize_entity (:pr:`195`) * Type checking ignore entities. 
(:pr:`193`) v0.2.1 Jul 2, 2018 ================== * Cpu count fix (:pr:`176`) * Update flight (:pr:`175`) * Move feature matrix calculation helper functions to separate file (:pr:`177`) v0.2.0 Jun 22, 2018 =================== * Multiprocessing (:pr:`170`) * Handle unicode encoding in repr throughout Featuretools (:pr:`161`) * Clean up EntitySet class (:pr:`145`) * Add support for building and uploading conda package (:pr:`167`) * Parquet serialization (:pr:`152`) * Remove variable stats (:pr:`171`) * Make sure index variable comes first (:pr:`168`) * No last time index update on normalize (:pr:`169`) * Remove list of times as an option for `cutoff_time` in `calculate_feature_matrix` (:pr:`165`) * Config does error checking to see if it can write to disk (:pr:`162`) v0.1.21 May 30, 2018 ==================== * Support Pandas 0.23.0 (:pr:`153`, :pr:`154`, :pr:`155`, :pr:`159`) * No EntitySet required in loading/saving features (:pr:`141`) * Use s3 demo csv with better column names (:pr:`139`) * more reasonable start parameter (:pr:`149`) * add issue template (:pr:`133`) * Improve tests (:pr:`136`, :pr:`137`, :pr:`144`, :pr:`147`) * Remove unused functions (:pr:`140`, :pr:`143`, :pr:`146`) * Update documentation after recent changes / removals (:pr:`157`) * Rename demo retail csv file (:pr:`148`) * Add names for binary (:pr:`142`) * EntitySet repr to use get_name rather than id (:pr:`134`) * Ensure config dir is writable (:pr:`135`) v0.1.20 Apr 13, 2018 ==================== * Primitives as strings in DFS parameters (:pr:`129`) * Integer time index bugfixes (:pr:`128`) * Add make_temporal_cutoffs utility function (:pr:`126`) * Show all entities, switch shape display to row/col (:pr:`124`) * Improved chunking when calculating feature matrices (:pr:`121`) * fixed num characters nan fix (:pr:`118`) * modify ignore_variables docstring (:pr:`117`) v0.1.19 Mar 21, 2018 ==================== * More descriptive DFS progress bar (:pr:`69`) * Convert text variable to string before 
NumWords (:pr:`106`) * EntitySet.concat() reindexes relationships (:pr:`96`) * Keep non-feature columns when encoding feature matrix (:pr:`111`) * Uses full entity update for dependencies of uses_full_entity features (:pr:`110`) * Update column names in retail demo (:pr:`104`) * Handle Transform features that need access to all values of entity (:pr:`91`) v0.1.18 Feb 27, 2018 ==================== * fixes related instances bug (:pr:`97`) * Adding non-feature columns to calculated feature matrix (:pr:`78`) * Relax numpy version req (:pr:`82`) * Remove `entity_from_csv`, tests, and lint (:pr:`71`) v0.1.17 Jan 18, 2018 ==================== * LatLong type (:pr:`57`) * Last time index fixes (:pr:`70`) * Make median agg primitives ignore nans by default (:pr:`61`) * Remove Python 3.4 support (:pr:`64`) * Change `normalize_entity` to update `secondary_time_index` (:pr:`59`) * Unpin requirements (:pr:`53`) * associative -> commutative (:pr:`56`) * Add Words and Chars primitives (:pr:`51`) v0.1.16 Dec 19, 2017 ==================== * fix EntitySet.combine_variables and standardize encode_features (:pr:`47`) * Python 3 compatibility (:pr:`16`) v0.1.15 Dec 18, 2017 ==================== * Fix variable type in demo data (:pr:`37`) * Custom primitive kwarg fix (:pr:`38`) * Changed order and text of arguments in make_trans_primitive docstring (:pr:`42`) v0.1.14 Nov 20, 2017 ==================== * Last time index (:pr:`33`) * Update Scipy version to 1.0.0 (:pr:`31`) v0.1.13 Nov 1, 2017 =================== * Add MANIFEST.in (:pr:`26`) v0.1.11 Oct 31, 2017 ==================== * Package linting (:pr:`7`) * Custom primitive creation functions (:pr:`13`) * Split requirements to separate files and pin to latest versions (:pr:`15`) * Select low information features (:pr:`18`) * Fix docs typos (:pr:`19`) * Fixed Diff primitive for rare nan case (:pr:`21`) * added some missing doc strings (:pr:`23`) * Trend fix (:pr:`22`) * Remove as_dir=False option from EntitySet.to_pickle() (:pr:`20`) * 
Entity Normalization Preserves Types of Copy & Additional Variables (:pr:`25`) v0.1.10 Oct 12, 2017 ==================== * NumTrue primitive added and docstring of other primitives updated (:pr:`11`) * fixed hash issue with same base features (:pr:`8`) * Head fix (:pr:`9`) * Fix training window (:pr:`10`) * Add associative attribute to primitives (:pr:`3`) * Add status badges, fix license in setup.py (:pr:`1`) * fixed head printout and flight demo index (:pr:`2`) v0.1.9 Sep 8, 2017 ================== * Documentation improvements * New ``featuretools.demo.load_mock_customer`` function v0.1.8 Sep 1, 2017 ================== * Bug fixes * Added ``Percentile`` transform primitive v0.1.7 Aug 17, 2017 =================== * Performance improvements for approximate in ``calculate_feature_matrix`` and ``dfs`` * Added ``Week`` transform primitive v0.1.6 Jul 26, 2017 =================== * Added ``load_features`` and ``save_features`` to persist and reload features * Added save_progress argument to ``calculate_feature_matrix`` * Added approximate parameter to ``calculate_feature_matrix`` and ``dfs`` * Added ``load_flight`` to ft.demo v0.1.5 Jul 11, 2017 =================== * Windows support v0.1.3 Jul 10, 2017 =================== * Renamed feature submodule to primitives * Renamed prediction_entity arguments to target_entity * Added training_window parameter to ``calculate_feature_matrix`` v0.1.2 Jul 3rd, 2017 ==================== * Initial release .. command .. git log --pretty=oneline --abbrev-commit ================================================ FILE: docs/source/resources/ecosystem.rst ================================================ :description: A list of libraries, use cases / demos, and tutorials that leverage Featuretools =============================== Featuretools External Ecosystem =============================== New projects are regularly being built on top of Featuretools, highlighting the importance of automated feature engineering. 
On this page, we have a list of libraries, use cases / demos, and tutorials that leverage Featuretools. If you would like to add a project, please contact us or submit a pull request on `GitHub`_. .. _`GitHub`: https://github.com/alteryx/featuretools .. note:: We are proud and excited to share the work of people using Featuretools, but we cannot endorse or provide support for the tools on this page. --------- Libraries --------- `MLBlocks`_ =========== - MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any python library with a simple, common and uniform interface. MLBlocks contains a primitive that uses Featuretools. .. _`MLBlocks`: https://github.com/HDI-Project/MLBlocks `Cardea`_ ========= - Cardea is a machine learning library built on top of the FHIR data schema. It uses a number of **automl** tools, including Featuretools. .. _`Cardea`: https://github.com/D3-AI/Cardea ----------------- Demos & Use Cases ----------------- `Predict customer lifetime value`_ ================================== - A common use case for machine learning is to predict customer lifetime value. This article walks through the importance of this prediction problem using Featuretools in the process. .. _`Predict customer lifetime value`: https://towardsdatascience.com/automating-interpretable-feature-engineering-for-predicting-clv-87ece7da9b36 `Predict NHL playoff matches`_ ============================== - Many users of `Kaggle`_ are eager to use Featuretools to improve their model performance. In this blog post, a Kaggle user takes a dataset of plays from National Hockey League games and creates a model to predict if a game is a playoff match. .. _`Predict NHL playoff matches`: https://towardsdatascience.com/automated-feature-engineering-for-predictive-modeling-d8c9fa4e478b .. 
_`Kaggle`: https://www.kaggle.com/ `Predict poverty of households in Costa Rica`_ ============================================== - Social programs have a difficult time determining the right people to give aid to. Using a dataset of Costa Rican household characteristics, this Kaggle kernel predicts the poverty of households. .. _`Predict poverty of households in Costa Rica`: https://www.kaggle.com/willkoehrsen/featuretools-for-good `Predicting Functional Threshold Power (FTP)`_ ============================================== - This notebook and accompanying report evaluate the use of machine learning for predicting a cyclist’s FTP using data collected from previous training sessions. Featuretools is used to generate a set of independent variables that capture changes in performance over time. .. _`Predicting Functional Threshold Power (FTP)`: https://github.com/jrkinley/ftp_proba .. note:: For more demos written by `Feature Labs `_, see `featuretools.com/demos `_ --------- Tutorials --------- `Automated Feature Engineering in Python`_ ========================================== - This article provides a walk-through of how to use a retail dataset with DFS. .. _`Automated Feature Engineering in Python`: https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219 `A Hands-On Guide to Automated Feature Engineering`_ ==================================================== - An **in-depth** tutorial that works through using Featuretools to predict future product sales at "BigMart". .. _`A Hands-On Guide to Automated Feature Engineering`: https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/ `Introduction to Automated Feature Engineering Using DFS`_ ========================================================== - This article demonstrates how using Featuretools helps automate the manual process of feature engineering on a dataset of home loans. .. 
_`Introduction to Automated Feature Engineering Using DFS`: https://heartbeat.fritz.ai/introduction-to-automated-feature-engineering-using-deep-feature-synthesis-dfs-3feb69a7c00b `Automated Feature Engineering Workshop`_ ========================================= - An automated feature engineering workshop using Featuretools hosted at the 2017 Data Summer Conference. .. _`Automated Feature Engineering Workshop`: https://github.com/fred-navruzov/featuretools-workshop `Tutorial in Japanese`_ ======================= - A tutorial of Featuretools that demonstrates integrating with the feature selection library `Boruta`_ and the hyper parameter tuning library `Optuna`_. .. _`Tutorial in Japanese`: https://dev.classmethod.jp/machine-learning/yoshim-featuretools-boruta-optuna/ .. _`Optuna`: https://github.com/pfnet/optuna .. _`Boruta`: https://github.com/scikit-learn-contrib/boruta_py `Building a Churn Prediction Model using Featuretools`_ ======================================================= - A video tutorial that shows how to build a churn prediction model using Featuretools along with `Spark`_, `XGBoost`_, and `Google Cloud Platform`_. .. _`Building a Churn Prediction Model using Featuretools`: https://youtu.be/ZwwneZ6iU3Y .. _`Spark`: https://spark.apache.org/ .. _`XGBoost`: https://github.com/dmlc/xgboost .. _`Google Cloud Platform`: https://cloud.google.com/ `Automated Feature Engineering Workshop in Russian`_ ==================================================== - A video tutorial that shows how to predict if an applicant is capable of repaying a loan using Featuretools. .. 
_`Automated Feature Engineering Workshop in Russian`: https://youtu.be/R0-mnamKxqY ================================================ FILE: docs/source/resources/frequently_asked_questions.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frequently Asked Questions\n", "\n", "Here we answer some commonly asked questions that appear on GitHub and Stack Overflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import woodwork as ww\n", "\n", "import featuretools as ft" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EntitySet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I get a list of column names and types in an `EntitySet`?\n", "\n", "After you create your `EntitySet`, you may wish to view the column names. An `EntitySet` contains multiple DataFrames, one for each table in the `EntitySet`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to view the underlying DataFrame, you can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to view the columns and types for the \"transactions\" DataFrame, you can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].ww" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the difference between `copy_columns` and `additional_columns`?\n", "The function `normalize_dataframe` creates a new DataFrame and a relationship from unique values of an existing DataFrame. 
It takes 2 similar arguments:\n", "\n", "- `additional_columns` removes columns from the base DataFrame and moves them to the new DataFrame. \n", "- `copy_columns` keeps the given columns in the base DataFrame, but also copies them to the new DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ft.demo.load_mock_customer()\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "products_df = data[\"products\"]\n", "\n", "es = ft.EntitySet(id=\"customer_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\",\n", " dataframe=transactions_df,\n", " index=\"transaction_id\",\n", " time_index=\"transaction_time\",\n", ")\n", "\n", "es = es.add_dataframe(\n", " dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n", ")\n", "\n", "es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we normalize to create a new DataFrame, let's look at the base DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the columns `session_id`, `session_start`, `join_date`, `device`, `customer_id`, and `zip_code`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.normalize_dataframe(\n", " base_dataframe_name=\"transactions\",\n", " new_dataframe_name=\"sessions\",\n", " index=\"session_id\",\n", " make_time_index=\"session_start\",\n", " additional_columns=[\"join_date\"],\n", " copy_columns=[\"device\", \"customer_id\", \"zip_code\", \"session_start\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we normalized the columns to create a new DataFrame. 
\n", "\n", "- For `additional_columns`, the following column `['join_date']` will be removed from the `transactions` DataFrame, and moved to the new `sessions` DataFrame. \n", "\n", "- For `copy_columns`, the following columns `['device', 'customer_id', 'zip_code','session_start']` will be copied from the `transactions` DataFrame to the new `sessions` DataFrame. \n", "\n", "Let's see this in the actual `EntitySet`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"transactions\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice above how `['device', 'customer_id', 'zip_code','session_start']` are still in the `transactions` DataFrame, while `['join_date']` is not. All of these columns are now present in the `sessions` DataFrame, as seen below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es[\"sessions\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why did my columns get new semantic tags?\n", "\n", "During the creation of your `EntitySet`, you might be wondering why the semantic tags in your columns change." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ft.demo.load_mock_customer()\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "products_df = data[\"products\"]\n", "\n", "es = ft.EntitySet(id=\"customer_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\",\n", " dataframe=transactions_df,\n", " index=\"transaction_id\",\n", " time_index=\"transaction_time\",\n", ")\n", "es.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a column contains semantic tags, they will appear on the right side of a semicolon in the plot above. 
Notice how `session_id` and `session_start` do not have any semantic tags currently associated with them.\n", "\n", "Now, let's normalize the transactions DataFrame to create a new DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.normalize_dataframe(\n", " base_dataframe_name=\"transactions\",\n", " new_dataframe_name=\"sessions\",\n", " index=\"session_id\",\n", " make_time_index=\"session_start\",\n", " additional_columns=[\"session_start\"],\n", ")\n", "es.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `session_id` column now has the semantic tag `foreign_key` in the `transactions` DataFrame, and `index` in the new DataFrame, `sessions`. This is because normalizing the DataFrame creates a new relationship between `transactions` and `sessions`. There is a one-to-many relationship between the parent DataFrame, `sessions`, and the child DataFrame, `transactions`.\n", "\n", "Therefore, `session_id` has the semantic tag `foreign_key` in `transactions` because it represents an `index` in another DataFrame. There would be a similar effect if we added another DataFrame using `add_dataframe` and `add_relationship`. \n", "\n", "In addition, when we created the new DataFrame, we set `session_start` as the `time_index`. This added the semantic tag `time_index` to the `session_start` column in the new `sessions` DataFrame because it now represents a `time_index`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I update a column's description or metadata?\n", "\n", "You can directly update the description or metadata attributes of the column schema. However, you must specifically use the column schema returned by `DataFrame.ww.columns['col_name']`, **not** `DataFrame.ww['col_name'].ww.schema`. 
The column schema from `DataFrame.ww.columns['col_name']` is still associated with the EntitySet and propagates any attribute updates, whereas the other does not. As an example, this is how you can update a column's description or metadata:\n", "\n", "```python\n", "column_schema = df.ww.columns['col_name']\n", "column_schema.description = 'my description'\n", "column_schema.metadata.update(key='value')\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I combine two or more interesting values?\n", "\n", "You might want to create features that are conditioned on multiple values before they are calculated. This would require the use of `interesting_values`. However, since we are trying to create the feature with multiple conditions, we will need to modify the DataFrame before we create the `EntitySet`.\n", "\n", "Let's look at how you might accomplish this. \n", "\n", "First, let's create our DataFrames." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ft.demo.load_mock_customer()\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "products_df = data[\"products\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transactions_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's modify our `transactions` DataFrame to create the additional column that represents multiple conditions for our feature." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transactions_df[\"product_id_device\"] = (\n", " transactions_df[\"product_id\"].astype(str) + \" and \" + transactions_df[\"device\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we created a new column called `product_id_device`, which just combines the `product_id` column, and the `device` column.\n", "\n", "Now let's create our `EntitySet`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.EntitySet(id=\"customer_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\",\n", " dataframe=transactions_df,\n", " index=\"transaction_id\",\n", " time_index=\"transaction_time\",\n", " logical_types={\n", " \"product_id\": ww.logical_types.Categorical,\n", " \"product_id_device\": ww.logical_types.Categorical,\n", " \"zip_code\": ww.logical_types.PostalCode,\n", " },\n", ")\n", "\n", "es = es.add_dataframe(\n", " dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n", ")\n", "\n", "es = es.normalize_dataframe(\n", " base_dataframe_name=\"transactions\",\n", " new_dataframe_name=\"sessions\",\n", " index=\"session_id\",\n", " additional_columns=[\"device\", \"product_id_device\", \"customer_id\"],\n", ")\n", "\n", "es = es.normalize_dataframe(\n", " base_dataframe_name=\"sessions\", new_dataframe_name=\"customers\", index=\"customer_id\"\n", ")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are ready to add our interesting values. \n", "\n", "First, let's view our options for what the interesting values could be." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interesting_values = transactions_df[\"product_id_device\"].unique().tolist()\n", "interesting_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wanted to, you could pick a subset of these, and the `where` features created would only use those conditions. In our example, we will use all the possible interesting values.\n", "\n", "Here, we set all of these values as our interesting values for this specific DataFrame and column. If we wanted to, we could make interesting values in the same way for more than one column, but we will just stick with this one for this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "values = {\"product_id_device\": interesting_values}\n", "es.add_interesting_values(dataframe_name=\"sessions\", values=values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run DFS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"count\"],\n", " where_primitives=[\"count\"],\n", " trans_primitives=[],\n", ")\n", "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To better understand the `where` clause features, let's examine one of those features. \n", "The feature `COUNT(sessions WHERE product_id_device = 5 and tablet)` tells us in how many sessions the customer purchased `product_id` 5 while on a tablet. Notice how the feature depends on multiple conditions **(product_id = 5 & device = tablet)**." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix[[\"COUNT(sessions WHERE product_id_device = 5 and tablet)\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DFS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why is DFS not creating aggregation features?\n", "You may have created your `EntitySet`, and then applied DFS to create features. However, you may be puzzled as to why no aggregation features were created. \n", "\n", "- **This is most likely because you have a single DataFrame in your EntitySet, and DFS is not capable of creating aggregation features with fewer than 2 DataFrames. Featuretools looks for a relationship, and aggregates based on that relationship.**\n", "\n", "Let's look at a simple example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ft.demo.load_mock_customer()\n", "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", "\n", "es = ft.EntitySet(id=\"customer_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\", dataframe=transactions_df, index=\"transaction_id\"\n", ")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we only have 1 DataFrame in our `EntitySet`. If we try to create aggregation features on this `EntitySet`, it will not be possible because DFS needs 2 DataFrames to generate aggregation features. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es, target_dataframe_name=\"transactions\"\n", ")\n", "feature_defs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of the above features are aggregation features. 
To fix this issue, you can add another DataFrame to your `EntitySet`.\n", "\n", "**Solution #1 - You can add a new DataFrame if you have additional data.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_df = data[\"products\"]\n", "es = es.add_dataframe(\n", " dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n", ")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we now have an additional DataFrame in our `EntitySet`, called `products`.\n", "\n", "**Solution #2 - You can normalize an existing DataFrame.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.normalize_dataframe(\n", " base_dataframe_name=\"transactions\",\n", " new_dataframe_name=\"sessions\",\n", " index=\"session_id\",\n", " make_time_index=\"session_start\",\n", " additional_columns=[\"device\", \"customer_id\", \"zip_code\", \"join_date\"],\n", " copy_columns=[\"session_start\"],\n", ")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we now have an additional DataFrame in our `EntitySet`, called `sessions`. Here, the normalization created a relationship between `transactions` and `sessions`. Alternatively, we could have specified a relationship between `transactions` and `products` if we had used only Solution \\#1.\n", "\n", "Now, we can generate aggregation features."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es, target_dataframe_name=\"transactions\"\n", ")\n", "feature_defs[:-10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few of the aggregation features are:\n", "\n", "- ``\n", "- ``\n", "- ``\n", "- ``\n", "- ``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I speed up the runtime of DFS?\n", "\n", "One issue you may encounter while running `ft.dfs` is slow performance. While Featuretools has generally optimal default settings for calculating features, you may want to speed up performance when you are calculating a large number of features. \n", "\n", "One quick way to speed up performance is by adjusting the `n_jobs` setting of `ft.dfs` or `ft.calculate_feature_matrix`.\n", "\n", "```python\n", "# setting n_jobs to -1 will use all cores\n", "\n", "feature_matrix, feature_defs = ft.dfs(entityset=es,\n", " target_dataframe_name=\"customers\",\n", " n_jobs=-1)\n", "\n", " \n", "# calculate_feature_matrix returns only the feature matrix\n", "feature_matrix = ft.calculate_feature_matrix(entityset=es,\n", " features=feature_defs,\n", " n_jobs=-1)\n", "```\n", "\n", "\n", "**For more ways to speed up performance, please visit:**\n", "\n", "- [Improving Computational Performance](../guides/performance.ipynb#improving-computational-performance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I include only certain features when running DFS?\n", "\n", "When using DFS to generate features, you may wish to include only certain features. There are multiple ways that you can do this:\n", "\n", "- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features. 
It is a dictionary mapping DataFrame names to a list of column names to ignore.\n", "\n", "- Use `drop_contains` to drop features that contain any of the strings listed in this parameter.\n", "\n", "- Use `drop_exact` to drop features that exactly match any of the strings listed in this parameter.\n", "\n", "Here is an example of using all three parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " ignore_columns={\n", " \"transactions\": [\"amount\"],\n", " \"customers\": [\"age\", \"gender\", \"birthday\"],\n", " }, # ignore these columns\n", " drop_contains=[\"customers.SUM(\"], # drop features that contain these strings\n", " drop_exact=[\"STD(transactions.quantity)\"],\n", ") # drop features that exactly match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I specify primitives on a per column or per DataFrame basis?\n", "\n", "When using DFS to generate features, you may wish to use only certain columns or DataFrames for specific primitives. This can be done through the `primitive_options` parameter. The `primitive_options` parameter is a dictionary that maps a primitive or a tuple of primitives to a dictionary containing options for the primitive(s). A primitive or tuple of primitives can also be mapped to a list of option dictionaries if the primitive(s) \n", "take multiple inputs. The primitive keys can be the string names of the primitive, the primitive class, or specific instances of the primitive. Each dictionary supplies options for its respective input column. There are multiple ways to control how primitives get applied through these options:\n", "\n", "- Use `ignore_dataframes` to specify DataFrames that should not be used to create features for that primitive. 
It is a list of DataFrame names to ignore.\n", "\n", "- Use `include_dataframes` to specify the only DataFrames to be included to create features for that primitive. It is a list of DataFrame names to include.\n", "\n", "- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to ignore.\n", "\n", "- Use `include_columns` to specify the only columns in a DataFrame that should be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to include.\n", "\n", "You can also use `primitive_options` to specify which DataFrames or columns you wish to use as groupbys for groupby transformation primitives:\n", "\n", "- Use `ignore_groupby_dataframes` to specify DataFrames that should not be used to get groupbys for that primitive. It is a list of DataFrame names to ignore.\n", "\n", "- Use `include_groupby_dataframes` to specify the only DataFrames that should be used to get groupbys for that primitive. It is a list of DataFrame names to include.\n", "\n", "- Use `ignore_groupby_columns` to specify columns in a DataFrame that should not be used as groupbys for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to ignore.\n", "\n", "- Use `include_groupby_columns` to specify the only columns in a DataFrame that should be used as groupbys for that primitive. 
It is a dictionary mapping a DataFrame name to a list of column names to include.\n", "\n", "Here is an example of using some of these options:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " primitive_options={\n", " \"mode\": {\n", " \"ignore_dataframes\": [\"sessions\"],\n", " \"ignore_columns\": {\"products\": [\"brand\"], \"transactions\": [\"product_id\"]},\n", " },\n", " # For mode, ignore the \"sessions\" DataFrame, the \"brand\" column in the\n", " # \"products\" DataFrame, and the \"product_id\" column in the \"transactions\" DataFrame\n", " (\"count\", \"mean\"): {\"include_dataframes\": [\"sessions\", \"transactions\"]},\n", " # For count and mean, only include the DataFrames \"sessions\" and \"transactions\"\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if options are given for a specific instance of a primitive and for the primitive generally (either by string name or class), the instances with their own options will not use the generic options. For example, in this case:\n", "```\n", "special_mean = Mean()\n", "options = {\n", " special_mean: {'include_dataframes': ['customers']},\n", " 'mean': {'include_dataframes': ['sessions']}\n", "}\n", "```\n", "the primitive `special_mean` will not use the DataFrame `sessions` because its options have it only include `customers`. Every other instance of the `Mean` primitive will use the `'mean'` options. 
\n", "\n", "**For more examples of specifying options for DFS, please visit:**\n", "\n", "- [Specifying Primitive Options](../guides/specifying_primitive_options.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If I didn't specify the **cutoff_time**, what date will be used for the feature calculations?\n", "\n", "The cutoff time will be set to the current time using `cutoff_time = datetime.now()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I select a certain amount of past data when calculating features?\n", "\n", "You may encounter a situation where you wish to make predictions using only a certain amount of historical data. You can accomplish this using the `training_window` parameter in `ft.dfs`. When you use the `training_window`, Featuretools will use the historical data between `cutoff_time - training_window` and the `cutoff_time`.\n", "\n", "In order to make the calculation, Featuretools will check the time in the `time_index` column of the `target_dataframe`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es[\"customers\"].ww.time_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our target DataFrame has a `time_index`, which is needed for the `training_window` calculation. Here, we are creating a cutoff time DataFrame so that we can have a unique training window for each customer."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cutoff_times = pd.DataFrame()\n", "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n", "cutoff_times[\"time\"] = pd.to_datetime(\n", " [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n", ")\n", "cutoff_times[\"label\"] = [True, True, False, True]\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", " training_window=\"1 hour\",\n", ")\n", "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we ran DFS with a `training_window` argument of `1 hour` to create features that only used customer data collected in the hour before each cutoff time we provided." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Can I run DFS on a single table? \n", "\n", "Although possible, running DFS on a single table doesn't make full use of DFS's capabilities. For one, DFS will not be able to use any aggregation primitives, which require at least two tables. You will only be able to use transform primitives. This limits the complexity of the features that DFS can generate through feature stacking. Additionally, in certain situations, running single table DFS on data with time columns could risk label leakage. With data split across multiple tables, Featuretools can filter data based on the cutoff time instead of assuming data was flattened appropriately, but it cannot do this with only a single table. \n", "\n", "If you only have a single table of data, DFS can certainly still be of use. There are two main ways to pass in a single table to DFS. \n", "\n", "The first is to simply create an EntitySet with one table. 
\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transactions_df = ft.demo.load_mock_customer(return_single_table=True)\n", "\n", "es = ft.EntitySet(id=\"customer_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"transactions\",\n", " dataframe=transactions_df,\n", " index=\"transaction_id\",\n", " time_index=\"transaction_time\",\n", ")\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"transactions\",\n", " trans_primitives=[\n", " \"time_since\",\n", " \"day\",\n", " \"is_weekend\",\n", " \"cum_min\",\n", " \"minute\",\n", " \"weekday\",\n", " \"percentile\",\n", " \"year\",\n", " \"week\",\n", " \"cum_mean\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second way is to insert the dataframe into a dictionary mapping its name to a tuple containing specific dataframe information. We then pass in that dictionary to the `dataframes` argument in DFS.\n", "\n", "In this scenario, for the value in our dictionary, we pass in a tuple containing the dataframe, its index column, and its time index. 
More information about the possible parameters can be found in the [DFS documentation](https://featuretools.alteryx.com/en/stable/generated/featuretools.dfs.html#featuretools.dfs).\n", "\n", "For example: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transactions_df = ft.demo.load_mock_customer(return_single_table=True)\n", "\n", "dataframes = {\"transactions\": (transactions_df, \"transaction_id\", \"transaction_time\")}\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " dataframes=dataframes,\n", " target_dataframe_name=\"transactions\",\n", " trans_primitives=[\n", " \"time_since\",\n", " \"day\",\n", " \"is_weekend\",\n", " \"cum_min\",\n", " \"minute\",\n", " \"weekday\",\n", " \"percentile\",\n", " \"year\",\n", " \"week\",\n", " \"cum_mean\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we examine the output, let's look at our original single table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transactions_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the transformations that Featuretools was able to apply to this single DataFrame to create the feature matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I prevent label leakage with DFS?\n", "\n", "One concern you might have with using DFS is label leakage. You want to make sure that labels in your data aren't used incorrectly to create features and the feature matrix.\n", "\n", "**Featuretools is particularly focused on helping users avoid label leakage.**\n", "\n", "There are two ways to prevent label leakage, depending on whether your data has timestamps.\n", "\n", "#### 1. 
Data without timestamps\n", "In the case where you do not have timestamps, you can create one `EntitySet` using only the training data and then run `ft.dfs`. This will create a feature matrix using only the training data, but also return a list of feature definitions. Next, you can create an `EntitySet` using the test data and recalculate the same features by calling `ft.calculate_feature_matrix` with the list of feature definitions from before. \n", "\n", "Here is what that flow would look like:\n", "\n", "First, let's create our training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"age\": [40, 50, 10, 20, 30],\n", " \"gender\": [\"m\", \"f\", \"m\", \"f\", \"f\"],\n", " \"signup_date\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " \"labels\": [True, False, True, False, True],\n", " }\n", ")\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can create an entityset for our training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es_train_data = ft.EntitySet(id=\"customer_train_data\")\n", "es_train_data = es_train_data.add_dataframe(\n", " dataframe_name=\"customers\", dataframe=train_data, index=\"customer_id\"\n", ")\n", "es_train_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are ready to create our features, and feature matrix for the training data. We don't want Featuretools to use the labels column to build new features, so we will use the ``ignore_columns`` option to exclude it. This would also remove the labels column from the feature matrix, so we will tell DFS to include it as a seed feature." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels_feature = ft.Feature(es_train_data[\"customers\"].ww[\"labels\"])\n", "feature_matrix_train, feature_defs = ft.dfs(\n", " entityset=es_train_data,\n", " target_dataframe_name=\"customers\",\n", " ignore_columns={\"customers\": [\"labels\"]},\n", " seed_features=[labels_feature],\n", ")\n", "feature_matrix_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also encode our feature matrix to make machine learning compatible features. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix_train_enc, features_enc = ft.encode_features(\n", " feature_matrix_train, feature_defs\n", ")\n", "feature_matrix_train_enc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the whole feature matrix only includes numeric and boolean values now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the feature definitions to calculate our feature matrix for the test data, and avoid label leakage." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_train = pd.DataFrame(\n", " {\n", " \"customer_id\": [6, 7, 8, 9, 10],\n", " \"age\": [20, 25, 55, 22, 35],\n", " \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n", " \"signup_date\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " \"labels\": [True, False, False, True, True],\n", " }\n", ")\n", "\n", "es_test_data = ft.EntitySet(id=\"customer_test_data\")\n", "es_test_data = es_test_data.add_dataframe(\n", " dataframe_name=\"customers\",\n", " dataframe=test_train,\n", " index=\"customer_id\",\n", " time_index=\"signup_date\",\n", ")\n", "\n", "# Use the feature definitions from earlier\n", "feature_matrix_enc_test = ft.calculate_feature_matrix(\n", " features=features_enc, entityset=es_test_data\n", ")\n", "\n", "feature_matrix_enc_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the [Modeling](frequently_asked_questions.ipynb#Modeling) section for an example of using the encoded matrix with sklearn." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Data with timestamps\n", "\n", "If your data has timestamps, the best way to prevent label leakage is to use a list of **cutoff times**, which specify the last point in time data is allowed to be used for each row in the resulting feature matrix. 
To use **cutoff times**, you need to set a time index for each time sensitive DataFrame in your entity set.\n", "\n", "> **Tip: Even if your data doesn’t have time stamps, you could add a column with dummy timestamps that can be used by Featuretools as time index.**\n", "\n", "When you call `ft.dfs`, you can provide a DataFrame of cutoff times like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cutoff_times = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"time\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " }\n", ")\n", "cutoff_times.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_test_data = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"age\": [20, 25, 55, 22, 35],\n", " \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n", " \"signup_date\": pd.date_range(\"2010-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " }\n", ")\n", "\n", "es_train_test_data = ft.EntitySet(id=\"customer_train_test_data\")\n", "es_train_test_data = es_train_test_data.add_dataframe(\n", " dataframe_name=\"customers\",\n", " dataframe=train_test_data,\n", " index=\"customer_id\",\n", " time_index=\"signup_date\",\n", ")\n", "\n", "feature_matrix_train_test, features = ft.dfs(\n", " entityset=es_train_test_data,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", ")\n", "feature_matrix_train_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we have created a feature matrix that uses cutoff times to avoid label leakage. We could also encode this feature matrix using `ft.encode_features`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the difference between passing a primitive object versus a string to DFS? 
\n", "\n", "There are 2 ways to pass primitives to DFS: the primitive object, or a string of the primitive name. \n", "\n", "We will use the Transform primitive called `TimeSincePrevious` to illustrate the differences.\n", "\n", "First, let's use the string of primitive name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[],\n", " trans_primitives=[\"time_since_previous\"],\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's use the primitive object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import TimeSincePrevious\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[],\n", " trans_primitives=[TimeSincePrevious],\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see above, the feature matrix is the same.\n", "\n", "However, if we need to modify controllable parameters in the primitive, we should use the primitive object. \n", "For instance, let's make TimeSincePrevious return units of hours (the default is in seconds)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import TimeSincePrevious\n", "\n", "time_since_previous_in_hours = TimeSincePrevious(unit=\"hours\")\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[],\n", " trans_primitives=[time_since_previous_in_hours],\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How can I select features based on some attributes (a specific string, an explicit primitive type, a return type, a given depth)?\n", "\n", "You may wish to select a subset of your features based on some attributes. \n", "\n", "Let's say you wanted to select features that have the string `amount` in their names. You can check for this by using the `get_name` function on the feature definitions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_defs = ft.dfs(\n", " entityset=es, target_dataframe_name=\"customers\", features_only=True\n", ")\n", "\n", "features_with_amount = []\n", "for x in feature_defs:\n", "    if \"amount\" in x.get_name():\n", "        features_with_amount.append(x)\n", "features_with_amount[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might also want to only select features that are aggregation features."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools import AggregationFeature\n", "\n", "features_only_aggregations = []\n", "for x in feature_defs:\n", "    if isinstance(x, AggregationFeature):\n", "        features_only_aggregations.append(x)\n", "features_only_aggregations[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, you might only want to select features that are calculated at a certain depth. You can do this by using the `get_depth` function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_only_depth_2 = []\n", "for x in feature_defs:\n", "    if x.get_depth() == 2:\n", "        features_only_depth_2.append(x)\n", "features_only_depth_2[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you might only want features that return a certain type. You can do this by using the `column_schema` attribute. For more information on working with column schemas, take a look at [Transitioning from Variables to Woodwork](transition_to_ft_v1.0.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_only_numeric = []\n", "for x in feature_defs:\n", "    if \"numeric\" in x.column_schema.semantic_tags:\n", "        features_only_numeric.append(x)\n", "features_only_numeric[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have your specific feature list, you can use `ft.calculate_feature_matrix` to generate a feature matrix for only those features.\n", "\n", "For our example, let's use only the features with the string `amount` in their names."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix = ft.calculate_feature_matrix(\n", " entityset=es, features=features_with_amount\n", ") # change to your specific feature list\n", "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, notice how all the column names for our feature matrix contain the string `amount`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I create **where** features?\n", "\n", "Sometimes, you might want to create features that are conditioned on a second value before they are calculated. This extra filter is called a “where clause”. You can create these features using the `interesting_values` of a column.\n", "\n", "If you have categorical columns in your `EntitySet`, you can use `add_interesting_values`. This function will find interesting values for your categorical columns, which can then be used to generate “where” clauses.\n", "\n", "First, let's create our `EntitySet`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can add the interesting values for the categorical columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es.add_interesting_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run DFS with the `where_primitives` argument to define which primitives to apply with where clauses. In this case, let's use the primitive `count`. For this to work, the primitive `count` must be present in both `agg_primitives` and `where_primitives`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"count\"],\n", " where_primitives=[\"count\"],\n", " trans_primitives=[],\n", ")\n", "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have now created some useful features. One example of a useful feature is `COUNT(sessions WHERE device = tablet)`. This feature tells us how many sessions a customer completed on a tablet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix[[\"COUNT(sessions WHERE device = tablet)\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Primitives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the difference between the primitive types (Transform, GroupBy Transform, & Aggregation)?\n", "\n", "You might be curious to know the difference between the primitive groups.\n", "Let's review the differences between transform, groupby transform, and aggregation primitives.\n", "\n", "First, let's create a simple `EntitySet`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import featuretools as ft\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"id\": [1, 2, 3, 4, 5, 6],\n", " \"time_index\": pd.date_range(\"1/1/2019\", periods=6, freq=\"D\"),\n", " \"group\": [\"a\", \"a\", \"a\", \"a\", \"a\", \"a\"],\n", " \"val\": [5, 1, 10, 20, 6, 23],\n", " }\n", ")\n", "es = ft.EntitySet()\n", "es = es.add_dataframe(\n", " dataframe_name=\"observations\", dataframe=df, index=\"id\", time_index=\"time_index\"\n", ")\n", "\n", "es = es.normalize_dataframe(\n", " base_dataframe_name=\"observations\", new_dataframe_name=\"groups\", index=\"group\"\n", ")\n", "\n", "es.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After calling `normalize_dataframe`, the column \"group\" has the semantic tag \"foreign_key\" because it identifies another DataFrame. Alternatively, it could be set using the `semantic_tags` parameter when we first call `es.add_dataframe()`.\n", "\n", "#### Transform Primitive\n", "\n", "The `cum_sum` primitive calculates the running sum in a list of numbers."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import CumSum\n", "\n", "cum_sum = CumSum()\n", "cum_sum([1, 2, 3, 4, 5]).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we apply it using the `trans_primitives` argument it will calculate it over the entire observations DataFrame like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " target_dataframe_name=\"observations\",\n", " entityset=es,\n", " agg_primitives=[],\n", " trans_primitives=[\"cum_sum\"],\n", " groupby_trans_primitives=[],\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Groupby Transform Primitive\n", "\n", "If we apply it using `groupby_trans_primitives`, then DFS will first group by any foreign key columns before applying the transform primitive. As a result, we get the cumulative sum by group." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " target_dataframe_name=\"observations\",\n", " entityset=es,\n", " agg_primitives=[],\n", " trans_primitives=[],\n", " groupby_trans_primitives=[\"cum_sum\"],\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aggregation Primitive\n", "\n", "Finally, there is also the aggregation primitive \"sum\". If we use sum, it will calculate the sum for the group at the cutoff time for each row. Because we didn't specify a cutoff time it will use all the data for each group for each row." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " target_dataframe_name=\"observations\",\n", " entityset=es,\n", " agg_primitives=[\"sum\"],\n", " trans_primitives=[],\n", " cutoff_time_in_index=True,\n", " groupby_trans_primitives=[],\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we set the cutoff time of each row to be the time index, then use sum as an aggregation primitive, the result is the same as cum_sum (though the order is different in the displayed dataframe)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cutoff_time = df[[\"id\", \"time_index\"]]\n", "cutoff_time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " target_dataframe_name=\"observations\",\n", " entityset=es,\n", " agg_primitives=[\"sum\"],\n", " trans_primitives=[],\n", " groupby_trans_primitives=[],\n", " cutoff_time_in_index=True,\n", " cutoff_time=cutoff_time,\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I get a list of all Aggregation and Transform primitives?\n", "\n", "You can call `featuretools.list_primitives()` to get all the primitives in Featuretools. It will return a DataFrame with the name, type, and description of each primitive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_primitives = ft.list_primitives()\n", "df_primitives.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_primitives.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How do I change the units for a TimeSince primitive?\n", "There are a few primitives in Featuretools that perform time-based calculations. 
These include `TimeSince`, `TimeSincePrevious`, `TimeSinceLast`, and `TimeSinceFirst`.\n", "\n", "You can change the units from the default of seconds to any valid time unit by doing the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import (\n", " TimeSince,\n", " TimeSinceFirst,\n", " TimeSinceLast,\n", " TimeSincePrevious,\n", ")\n", "\n", "time_since = TimeSince(unit=\"minutes\")\n", "time_since_previous = TimeSincePrevious(unit=\"hours\")\n", "time_since_last = TimeSinceLast(unit=\"days\")\n", "time_since_first = TimeSinceFirst(unit=\"years\")\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[time_since_last, time_since_first],\n", " trans_primitives=[time_since, time_since_previous],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we changed the units to the following:\n", "- minutes for `TimeSince`\n", "- hours for `TimeSincePrevious`\n", "- days for `TimeSinceLast`\n", "- years for `TimeSinceFirst`.\n", "\n", "\n", "Now we can see that our feature matrix contains multiple features where the units for the TimeSince primitives are changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are now features where the time unit differs from the default of seconds, such as `TIME_SINCE_LAST(sessions.session_start, unit=days)` and `TIME_SINCE_FIRST(sessions.session_start, unit=years)`."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does my train & test data work with Featuretools and sklearn's **train_test_split**?\n", "\n", "You might be wondering how to properly use your train & test data with Featuretools and sklearn's **train_test_split**. There are a few things you must do to ensure accuracy with this workflow.\n", "\n", "Let's imagine we have a DataFrame for our train data that includes the labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"age\": [20, 25, 55, 22, 35],\n", " \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n", " \"signup_date\": pd.date_range(\"2010-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " \"labels\": [False, True, True, False, False],\n", " }\n", ")\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create our `EntitySet` for the train data, and create our features. To prevent label leakage, we will use cutoff times (see [earlier question](#How-do-I-prevent-label-leakage-with-DFS?))."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es_train_data = ft.EntitySet(id=\"customer_data\")\n", "es_train_data = es_train_data.add_dataframe(\n", " dataframe_name=\"customers\", dataframe=train_data, index=\"customer_id\"\n", ")\n", "\n", "cutoff_times = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"time\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n", " }\n", ")\n", "\n", "feature_matrix_train, features = ft.dfs(\n", " entityset=es_train_data,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", ")\n", "feature_matrix_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also encode our feature matrix to make it compatible with machine learning algorithms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix_train_enc, feature_enc = ft.encode_features(\n", " feature_matrix_train, features\n", ")\n", "feature_matrix_train_enc.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = feature_matrix_train_enc.drop([\"labels\"], axis=1)\n", "y = feature_matrix_train_enc[\"labels\"]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can use the encoded feature matrix with sklearn's **train_test_split**. This will allow you to train your model, and tune your parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How are categorical columns encoded when splitting training and testing data?\n", "\n", "You might be wondering what happens when categorical columns are encoded with your training and testing data. 
You might be curious to know what happens if the train data has a categorical value that is not present in the testing data. \n", "\n", "Let's explore a simple example to see what happens during the encoding process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = pd.DataFrame(\n", " {\n", " \"customer_id\": [1, 2, 3, 4, 5],\n", " \"product_purchased\": [\"coke zero\", \"car\", \"toothpaste\", \"coke zero\", \"car\"],\n", " }\n", ")\n", "es_train = ft.EntitySet(id=\"customer_data\")\n", "es_train = es_train.add_dataframe(\n", " dataframe_name=\"customers\",\n", " dataframe=train_data,\n", " index=\"customer_id\",\n", " logical_types={\"product_purchased\": ww.logical_types.Categorical},\n", ")\n", "feature_matrix_train, features = ft.dfs(\n", " entityset=es_train, target_dataframe_name=\"customers\"\n", ")\n", "feature_matrix_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use `ft.encode_features` to properly encode the `product_purchased` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix_train_encoded, features_encoded = ft.encode_features(\n", " feature_matrix_train, features\n", ")\n", "feature_matrix_train_encoded.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's imagine we have some test data that doesn't have one of the categorical values (**toothpaste**). Also, the test data has a value that wasn't present in the train data (**water**)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data = pd.DataFrame(\n", " {\n", " \"customer_id\": [6, 7, 8, 9, 10],\n", " \"product_purchased\": [\"coke zero\", \"car\", \"coke zero\", \"coke zero\", \"water\"],\n", " }\n", ")\n", "\n", "es_test = ft.EntitySet(id=\"customer_data\")\n", "es_test = es_test.add_dataframe(\n", " dataframe_name=\"customers\", dataframe=test_data, index=\"customer_id\"\n", ")\n", "\n", "feature_matrix_test = ft.calculate_feature_matrix(\n", " entityset=es_test, features=features_encoded\n", ")\n", "feature_matrix_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen above, we were able to successfully handle the encoding, and deal with the following complications: \n", "- **toothpaste** was present in the training data but not present in the testing data \n", "- **water** was present in the test data but not present in the training data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Errors & Warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why am I getting this error 'Index is not unique on dataframe'?\n", "You may be trying to create your `EntitySet`, and run into this error. \n", "```python\n", "IndexError: Index column must be unique\n", "```\n", "**This is because each dataframe in your EntitySet needs a unique index.**\n", "\n", "Let's look at a simple example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 4], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n", "product_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the `id` column has a duplicate index of `4`. If you try to add this dataframe to the EntitySet, you will run into the following error." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "es = ft.EntitySet(id=\"product_data\")\n", "es = es.add_dataframe(dataframe_name=\"products\",\n", " dataframe=product_df,\n", " index=\"id\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "---------------------------------------------------------------------------\n", "IndexError Traceback (most recent call last)\n", " in \n", " 1 es = ft.EntitySet(id=\"product_data\")\n", "----> 2 es = es.add_dataframe(dataframe_name=\"products\",\n", " 3 dataframe=product_df,\n", " 4 index=\"id\")\n", "\n", "~/Code/featuretools/featuretools/entityset/entityset.py in add_dataframe(self, dataframe, dataframe_name, index, logical_types, semantic_tags, make_index, time_index, secondary_time_index, already_sorted)\n", " 625 index_was_created, index, dataframe = _get_or_create_index(index, make_index, dataframe)\n", " 626 \n", "--> 627 dataframe.ww.init(name=dataframe_name,\n", " 628 index=index,\n", " 629 time_index=time_index,\n", "\n", "/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in init(self, index, time_index, logical_types, already_sorted, schema, validate, use_standard_tags, **kwargs)\n", " 94 \"\"\"\n", " 95 if validate:\n", "---> 96 _validate_accessor_params(self._dataframe, index, time_index, logical_types, schema, use_standard_tags)\n", " 97 if schema is not None:\n", " 98 self._schema = schema\n", "\n", "/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _validate_accessor_params(dataframe, index, time_index, logical_types, schema, use_standard_tags)\n", " 877 # We ignore these parameters if a schema is passed\n", " 878 if index is not None:\n", "--> 879 _check_index(dataframe, index)\n", " 880 if logical_types:\n", " 881 _check_logical_types(dataframe.columns, logical_types)\n", "\n", 
"/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _check_index(dataframe, index)\n", " 903 # User specifies an index that is in the dataframe but not unique\n", "--> 904 raise IndexError('Index column must be unique')\n", " 905 \n", " 906 \n", "\n", "IndexError: Index column must be unique\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fix the above error, you can do one of the following solutions:\n", "\n", "**Solution #1 - You can create a unique index on your Dataframe.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n", "product_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we now have a unique index column called `id`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = es.add_dataframe(dataframe_name=\"products\", dataframe=product_df, index=\"id\")\n", "es" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen above, we can now create our DataFrame for our `EntitySet` without an error by creating a unique index in our Dataframe.\n", "\n", "**Solution #2 - Set make_index to True in your call to add_dataframe to create a new index on that data**\n", "- `make_index` creates a unique index for each row by just looking at what number the row is, in relation to all the other rows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 4], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n", "\n", "es = ft.EntitySet(id=\"product_data\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"products\", dataframe=product_df, index=\"product_id\", make_index=True\n", ")\n", "\n", "es[\"products\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen above, we created our dataframe for our `EntitySet` without an error using the `make_index` argument." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why am I getting the following warning 'Using training_window but last_time_index is not set'?\n", "\n", "If you are using a training window, and you haven't set a `last_time_index` for your dataframe, you will get this warning.\n", "The training window attribute in Featuretools limits the amount of past data that can be used while calculating a particular feature vector.\n", "\n", "You can add the `last_time_index` to all dataframes automatically by calling `your_entityset.add_last_time_indexes()` after you create your `EntitySet`. This will remove the warning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es.add_last_time_indexes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run DFS without getting the warning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cutoff_times = pd.DataFrame()\n", "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n", "cutoff_times[\"time\"] = pd.to_datetime(\n", " [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n", ")\n", "cutoff_times[\"label\"] = [True, True, False, True]\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " cutoff_time=cutoff_times,\n", " cutoff_time_in_index=True,\n", " training_window=\"1 hour\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### last_time_index vs. time_index\n", "\n", "- The `time_index` is when the instance was first known.\n", "- The `last_time_index` is when the instance appears for the last time.\n", "- For example, a customer’s session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had any transaction during the training window. To accomplish this, we need to not only know when a session starts (**time_index**), but also when it ends (**last_time_index**). The last time that an instance appears in the data is stored as the `last_time_index` of a dataframe. \n", "- Once the last_time_index has been set, Featuretools will check to see if the last_time_index is after the start of the training window. That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why am I getting errors with Featuretools on [Google Colab](https://colab.research.google.com/)?\n", "\n", "[Google Colab](https://colab.research.google.com/), by default, has Featuretools `0.4.1` installed. 
You may run into issues following our newest guides or latest documentation while using an older version of Featuretools. Therefore, we suggest you upgrade to the latest Featuretools version by running the following in your Google Colab notebook:\n", "```shell\n", "!pip install -U featuretools\n", "```\n", "\n", "You may need to restart the runtime by doing **Runtime** -> **Restart Runtime**.\n", "You can check the installed Featuretools version by doing the following:\n", "```python\n", "import featuretools as ft\n", "print(ft.__version__)\n", "```\n", "You should see a version greater than `0.4.1`." ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3, "vscode": { "interpreter": { "hash": "3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514" } } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: docs/source/resources/help.rst ================================================ Help ==== Couldn't find what you were looking for? The Featuretools community is happy to provide support to users of Featuretools. Discussion ---------- Conversation happens in the following places: 1. **General usage questions** are directed to `StackOverflow`_ with the #featuretools tag. 2. **Bug reports** are managed on the `GitHub issue tracker`_. 3. **Chat** and collaboration within the community occurs on `Slack`_. For general usage questions, please post on Stack Overflow where answers are more searchable by other users. .. _`StackOverflow`: http://stackoverflow.com/questions/tagged/featuretools .. 
_`GitHub issue tracker`: https://github.com/alteryx/featuretools/issues .. _`Slack`: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA Asking for help --------------- Users of all levels, including beginners, should feel free to ask questions and report bugs when using Featuretools. You can get better answers if you follow a few simple guidelines: 1. **Use the right resource**: We suggest using GitHub or StackOverflow. Questions asked at these locations will be more searchable for other users. - Slack should be used for community discussion and collaboration. - For general questions on how something should work or tips, use StackOverflow. - Bugs should be reported on GitHub. 2. **Ask in one place only**: Please post your question in one place (StackOverflow or GitHub). 3. **Use examples**: Make `minimal, complete, verifiable examples `_. You will get much better answers if you provide code that people can use to reproduce your problem. ================================================ FILE: docs/source/resources/resources_index.rst ================================================ Resources --------- Frequently asked questions and additional resources .. toctree:: :maxdepth: 1 transition_to_ft_v1.0 frequently_asked_questions help usage_tips/limitations usage_tips/glossary ecosystem ================================================ FILE: docs/source/resources/transition_to_ft_v1.0.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "6004844f", "metadata": {}, "source": [ "# Transitioning to Featuretools Version 1.0\n", "\n", "Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. 
This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.\n", "\n", "## Background and Introduction\n", "\n", "### Why make these changes?\n", "The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of [Woodwork](https://woodwork.alteryx.com/en/stable/). Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, [EvalML](https://evalml.alteryx.com/en/stable/), which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.\n", "\n", "Other benefits of using Woodwork for managing typing in Featuretools include:\n", "\n", "- Simplified code - custom type management code has been removed\n", "- Seamless integration of new types and improvements to type integration as Woodwork improves\n", "- Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data." 
] }, { "cell_type": "markdown", "id": "4a9bfede", "metadata": {}, "source": [ "### What has changed?\n", "- The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types\n", "- Both the `Entity` and `Variable` classes have been removed from Featuretools\n", "- Several key Featuretools methods have been moved or updated\n", "\n", "#### Comparison between the legacy typing system and the Woodwork typing system\n", "| Featuretools < 1.0 | Featuretools 1.0 | Description |\n", "| ---- | ---- | ---- |\n", "| Entity | Woodwork DataFrame | stores typing information for all columns |\n", "| Variable | ColumnSchema | stores typing information for a single column |\n", "| Variable subclass | LogicalType and semantic_tags | elements used to define a column type |\n", "\n", "#### Summary of significant method changes\n", "\n", "The table below outlines the most significant changes that have occurred. In some cases, the method arguments have also changed, and those changes are outlined in more detail throughout this document.\n", "\n", "| Older Versions | Featuretools 1.0 |\n", "| ---- | ---- |\n", "| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |\n", "| EntitySet.normalize_entity | EntitySet.normalize_dataframe |\n", "| EntitySet.update_data | EntitySet.replace_dataframe |\n", "| Entity.variable_types | es['dataframe_name'].ww |\n", "| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |\n", "| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |\n", "| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', ...) |\n", "| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', ...) |\n", "| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |\n", "| dfs(target_entity='entity_id', ...) | dfs(target_dataframe_name='dataframe_name', ...) 
|" ] }, { "cell_type": "markdown", "id": "c3b1e217", "metadata": {}, "source": [ "For more information on how Woodwork manages typing information, refer to the [Woodwork Understanding Types and Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide." ] }, { "cell_type": "markdown", "id": "a8453248", "metadata": {}, "source": [ "### What do these changes mean for users?\n", "Removing these classes required moving several methods from the `Entity` to the `EntitySet` object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.\n", "\n", "All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible." ] }, { "cell_type": "markdown", "id": "de402e3b", "metadata": {}, "source": [ "## Removal of `Entity` Class and Updates to `EntitySet`\n", "\n", "In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.\n", "\n", "### Adding dataframes to an EntitySet\n", "\n", "When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. 
If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.\n", "\n", "Below are some examples to illustrate this process. First we will create two small dataframes to use for the example." ] }, { "cell_type": "code", "execution_count": null, "id": "5bea1bd4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import featuretools as ft" ] }, { "cell_type": "code", "execution_count": null, "id": "b094ca23", "metadata": {}, "outputs": [], "source": [ "orders_df = pd.DataFrame(\n", " {\"order_id\": [0, 1, 2], \"order_date\": [\"2021-01-02\", \"2021-01-03\", \"2021-01-04\"]}\n", ")\n", "items_df = pd.DataFrame(\n", " {\n", " \"id\": [0, 1, 2, 3, 4],\n", " \"order_id\": [0, 1, 1, 2, 2],\n", " \"item_price\": [29.95, 4.99, 10.25, 20.50, 15.99],\n", " \"on_sale\": [False, True, False, True, False],\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "db705814", "metadata": {}, "source": [ "With older versions of Featuretools, users would first create an EntitySet object, and then add dataframes to the EntitySet, by calling `entity_from_dataframe` as shown below.\n", "\n", "```python\n", "es = ft.EntitySet('old_es')\n", "\n", "es.entity_from_dataframe(dataframe=orders_df,\n", " entity_id='orders',\n", " index='order_id',\n", " time_index='order_date')\n", "es.entity_from_dataframe(dataframe=items_df,\n", " entity_id='items',\n", " index='id')\n", "```\n", "\n", "```\n", "Entityset: old_es\n", " Entities:\n", " orders [Rows: 3, Columns: 2]\n", " items [Rows: 5, Columns: 3]\n", " Relationships:\n", " No relationships\n", "```" ] }, { "cell_type": "markdown", "id": "f6f95f35", "metadata": {}, "source": [ "With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. 
To add the dataframe call `EntitySet.add_dataframe` in place of the previous `EntitySet.entity_from_dataframe` call. Note that the name of the dataframe is specified in the `dataframe_name` argument, which was previously called `entity_id`." ] }, { "cell_type": "code", "execution_count": null, "id": "b1fdffe4", "metadata": {}, "outputs": [], "source": [ "es = ft.EntitySet(\"new_es\")\n", "\n", "es.add_dataframe(\n", " dataframe=orders_df,\n", " dataframe_name=\"orders\",\n", " index=\"order_id\",\n", " time_index=\"order_date\",\n", ")" ] }, { "cell_type": "markdown", "id": "1c983744", "metadata": {}, "source": [ "You can also define the name, index, and time index by first [initializing Woodwork](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.init.html#woodwork.table_accessor.WoodworkTableAccessor.init) on the dataframe and then passing the Woodwork initialized dataframe directly to the `add_dataframe` call. For this example we will initialize Woodwork on `items_df`, setting the dataframe name as `items` and specifying that the index should be the `id` column." ] }, { "cell_type": "code", "execution_count": null, "id": "0d5ad8e5", "metadata": {}, "outputs": [], "source": [ "items_df.ww.init(name=\"items\", index=\"id\")\n", "items_df.ww" ] }, { "cell_type": "markdown", "id": "07f5f27c", "metadata": {}, "source": [ "With Woodwork initialized, we no longer need to specify values for the `dataframe_name` or `index` arguments when calling `add_dataframe` as Featuretools will simply use the values that were already specified when Woodwork was initialized." 
] }, { "cell_type": "code", "execution_count": null, "id": "5f4ab39a", "metadata": {}, "outputs": [], "source": [ "es.add_dataframe(dataframe=items_df)" ] }, { "cell_type": "markdown", "id": "93814387", "metadata": {}, "source": [ "### Accessing column typing information\n", "\n", "Previously, column variable type information could be accessed for an entire Entity through `Entity.variable_types` or for an individual column by selecting the individual column first through `es['entity_id']['col_id']`.\n", "\n", "```python\n", "es['items'].variable_types\n", "```\n", "```\n", "{'id': featuretools.variable_types.variable.Index,\n", " 'order_id': featuretools.variable_types.variable.Numeric,\n", " 'item_price': featuretools.variable_types.variable.Numeric}\n", "```\n", "```python\n", "es['items']['item_price']\n", "```\n", "```\n", "\n", "```\n", "\n", "With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the `.ww` namespace on the dataframe. First, select the dataframe from the EntitySet with `es['dataframe_name']` and then access the typing information by chaining a `.ww` call on the end as shown below." 
] }, { "cell_type": "code", "execution_count": null, "id": "6abb9b10", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww" ] }, { "cell_type": "markdown", "id": "72775903", "metadata": {}, "source": [ "The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, returning a Woodwork `ColumnSchema` object that stores the typing information:" ] }, { "cell_type": "code", "execution_count": null, "id": "da516642", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.columns[\"item_price\"]" ] }, { "cell_type": "markdown", "id": "50f9f70a", "metadata": {}, "source": [ "### Type inference and updating column types\n", "\n", "Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the `Entity.variable_types` dictionary.\n", "\n", "Starting in Featuretools 1.0, column type inference is handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling `EntitySet.add_dataframe`.\n", "\n", "As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user's full name as the Woodwork `PersonFullName` logical type."
] }, { "cell_type": "code", "execution_count": null, "id": "a34016b5", "metadata": {}, "outputs": [], "source": [ "users_df = pd.DataFrame(\n", " {\"id\": [0, 1, 2], \"name\": [\"John Doe\", \"Rita Book\", \"Teri Dactyl\"]}\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "d999e022", "metadata": {}, "outputs": [], "source": [ "es.add_dataframe(\n", " dataframe=users_df,\n", " dataframe_name=\"users\",\n", " index=\"id\",\n", " logical_types={\"name\": \"PersonFullName\"},\n", ")\n", "\n", "es[\"users\"].ww" ] }, { "cell_type": "markdown", "id": "d2eff5e1", "metadata": {}, "source": [ "Looking at the typing information above, we can see that the logical type for the `name` column was set to `PersonFullName` as we specified.\n", "\n", "Situations will occur where type inference identifies a column as having the incorrect logical type. In these situations, the logical type can be updated using the Woodwork `set_types` method. Let's say we want the `order_id` column of the `items` dataframe to have a `Categorical` logical type instead of the `Integer` type that was inferred. Previously, this would have been accomplished through the `Entity.convert_variable_type` method.\n", "\n", "```python\n", "from featuretools.variable_types import Categorical\n", "\n", "es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)\n", "```\n", "\n", "Now, we can perform this same update using Woodwork:" ] }, { "cell_type": "code", "execution_count": null, "id": "a6c095b5", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Categorical\"})\n", "es[\"items\"].ww" ] }, { "cell_type": "markdown", "id": "d9d84e08", "metadata": {}, "source": [ "For additional information on Woodwork typing and how it is used in Featuretools, refer to [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb)."
] }, { "cell_type": "markdown", "id": "bf3dfea2", "metadata": {}, "source": [ "### Adding interesting values\n", "\n", "Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.\n", "\n", "To add interesting values for all of the dataframes in an EntitySet, simply call `EntitySet.add_interesting_values`, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.\n", "\n", "Adding values for a single dataframe or for a single column has changed. Previously to add interesting values for an Entity, users would call `Entity.add_interesting_values()`:\n", "```python\n", "es['items'].add_interesting_values()\n", "```\n", "\n", "Now, in order to specify interesting values for a single dataframe, you call `add_interesting_values` on the EntitySet, and pass the name of the dataframe for which you want interesting values added:" ] }, { "cell_type": "code", "execution_count": null, "id": "c058d2ed", "metadata": {}, "outputs": [], "source": [ "es.add_interesting_values(dataframe_name=\"items\")" ] }, { "cell_type": "markdown", "id": "c3e0a247", "metadata": {}, "source": [ "Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:\n", "\n", "```python\n", "es['items']['order_id'].interesting_values = [1, 2]\n", "```\n", "\n", "Now, this is done through `EntitySet.add_interesting_values`, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. 
For example, to assign the interesting values of `[1, 2]` to the `order_id` column of the `items` dataframe, use the following approach:" ] }, { "cell_type": "code", "execution_count": null, "id": "8276114b", "metadata": {}, "outputs": [], "source": [ "es.add_interesting_values(dataframe_name=\"items\", values={\"order_id\": [1, 2]})" ] }, { "cell_type": "markdown", "id": "22e70b84", "metadata": {}, "source": [ "Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the `values` parameter.\n", "\n", "Accessing interesting values has changed as well. Previously interesting values could be viewed from the variable:\n", "```python\n", "es['items']['order_id'].interesting_values\n", "```\n", "\n", "Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "8461c4f7", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.columns[\"order_id\"].metadata[\"interesting_values\"]" ] }, { "cell_type": "markdown", "id": "cb23501f", "metadata": {}, "source": [ "### Setting a secondary time index\n", "\n", "In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling `Entity.set_secondary_time_index`. 
\n", "```python\n", "es_flight = ft.demo.load_flight(nrows=100)\n", "\n", "arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\n", " 'national_airspace_delay', 'security_delay',\n", " 'late_aircraft_delay', 'canceled', 'diverted',\n", " 'taxi_in', 'taxi_out', 'air_time', 'dep_time']\n", "es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})\n", "```\n", "\n", "Since the `Entity` class has been removed in Featuretools 1.0, this now needs to be done through the `EntitySet` instead:" ] }, { "cell_type": "code", "execution_count": null, "id": "b80b1f6a", "metadata": {}, "outputs": [], "source": [ "es_flight = ft.demo.load_flight(nrows=100)\n", "\n", "arr_time_columns = [\n", " \"arr_delay\",\n", " \"dep_delay\",\n", " \"carrier_delay\",\n", " \"weather_delay\",\n", " \"national_airspace_delay\",\n", " \"security_delay\",\n", " \"late_aircraft_delay\",\n", " \"canceled\",\n", " \"diverted\",\n", " \"taxi_in\",\n", " \"taxi_out\",\n", " \"air_time\",\n", " \"dep_time\",\n", "]\n", "es_flight.set_secondary_time_index(\n", " dataframe_name=\"trip_logs\", secondary_time_index={\"arr_time\": arr_time_columns}\n", ")" ] }, { "cell_type": "markdown", "id": "2ebee2e6", "metadata": {}, "source": [ "Previously, the secondary time index could be accessed directly from the Entity with `es_flight['trip_logs'].secondary_time_index`. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "3ea95fdb", "metadata": {}, "outputs": [], "source": [ "es_flight[\"trip_logs\"].ww.metadata[\"secondary_time_index\"]" ] }, { "cell_type": "markdown", "id": "f2f9b64c", "metadata": {}, "source": [ "### Normalizing Entities/DataFrames\n", "\n", "`EntitySet.normalize_entity` has been renamed to `EntitySet.normalize_dataframe` in Featuretools 1.0. 
The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.\n", "\n", "| Old Parameter Name | New Parameter Name |\n", "| --- | --- |\n", "| base_entity_id | base_dataframe_name |\n", "| new_entity_id | new_dataframe_name |\n", "| additional_variables | additional_columns |\n", "| copy_variables | copy_columns |\n", "| new_entity_time_index | new_dataframe_time_index |\n", "| new_entity_secondary_time_index | new_dataframe_secondary_time_index |" ] }, { "cell_type": "markdown", "id": "ca81708b", "metadata": {}, "source": [ "### Defining and adding relationships\n", "\n", "In earlier versions of Featuretools, relationships were defined by creating a `Relationship` object, which took two `Variables` as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a `Relationship` and then add it to the EntitySet:\n", "\n", "```python\n", "relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])\n", "es.add_relationship(relationship)\n", "```\n", "\n", "With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to `EntitySet.add_relationship`, and another is to pass a previously created `Relationship` object to the `relationship` keyword argument. Both approaches are demonstrated below." 
] }, { "cell_type": "code", "execution_count": null, "id": "7d738807", "metadata": { "nbshpinx": "hidden" }, "outputs": [], "source": [ "# Undo change from above and change child column logical type to match parent and prevent warning\n", "# NOTE: This cell is hidden in the docs build\n", "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Integer\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "97c04dd4", "metadata": {}, "outputs": [], "source": [ "es.add_relationship(\n", " parent_dataframe_name=\"orders\",\n", " parent_column_name=\"order_id\",\n", " child_dataframe_name=\"items\",\n", " child_column_name=\"order_id\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "26643d04", "metadata": { "nbshpinx": "hidden" }, "outputs": [], "source": [ "# Reset the relationship so we can add it again\n", "# NOTE: This cell is hidden in the docs build\n", "es.relationships = []" ] }, { "cell_type": "markdown", "id": "317e5657", "metadata": {}, "source": [ "Alternatively, we can first create a `Relationship` and pass that to `EntitySet.add_relationship`. When defining a `Relationship` we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column." ] }, { "cell_type": "code", "execution_count": null, "id": "47e54c72", "metadata": {}, "outputs": [], "source": [ "relationship = ft.Relationship(\n", " entityset=es,\n", " parent_dataframe_name=\"orders\",\n", " parent_column_name=\"order_id\",\n", " child_dataframe_name=\"items\",\n", " child_column_name=\"order_id\",\n", ")\n", "es.add_relationship(relationship=relationship)" ] }, { "cell_type": "markdown", "id": "7a49ba91", "metadata": {}, "source": [ "### Updating data for a dataframe in an EntitySet\n", "\n", "Previously to update (replace) the data associated with an Entity, users could call `Entity.update_data` and pass in the new dataframe. 
As an example, let's update the data in our `users` Entity:\n", "```python\n", "new_users_df = pd.DataFrame({\n", " 'id': [3, 4],\n", " 'name': ['Anne Teak', 'Art Decco']\n", "})\n", "\n", "es['users'].update_data(df=new_users_df)\n", "```\n", "\n", "To accomplish this task with Featuretools 1.0, we will use the `EntitySet.replace_dataframe` method instead:" ] }, { "cell_type": "code", "execution_count": null, "id": "b45a81d5", "metadata": {}, "outputs": [], "source": [ "new_users_df = pd.DataFrame({\"id\": [0, 1], \"name\": [\"Anne Teak\", \"Art Decco\"]})\n", "\n", "es.replace_dataframe(dataframe_name=\"users\", df=new_users_df)\n", "es[\"users\"]" ] }, { "cell_type": "markdown", "id": "679af861", "metadata": {}, "source": [ "## Defining features\n", "\n", "The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.\n", "\n", "```python\n", "feature = ft.Feature(es['items']['item_price'])\n", "```\n", "\n", "Starting with Featuretools 1.0, a similar syntax can be used, but because `es['items']` will now return a Woodwork dataframe instead of an `Entity`, we need to update the syntax slightly to access the Woodwork column. To update, simply add `.ww` between the dataframe name selector and the column selector as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "88902f6b", "metadata": {}, "outputs": [], "source": [ "feature = ft.Feature(es[\"items\"].ww[\"item_price\"])" ] }, { "cell_type": "markdown", "id": "0faf41e4", "metadata": {}, "source": [ "## Defining primitives\n", "\n", "In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate `Variable` class. Starting in version 1.0, the input and return types are defined by Woodwork `ColumnSchema` objects. \n", "\n", "To illustrate this change, let's look closer at the `Age` transform primitive. 
This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the `DateOfBirth` variable type and the return type was specified by the `Numeric` variable type:\n", "\n", "```python\n", "input_types = [DateOfBirth]\n", "return_type = Numeric\n", "```\n", "\n", "Woodwork does not have a specific `DateOfBirth` logical type, but rather identifies a column as a date of birth column by specifying the logical type as `Datetime` with a semantic tag of `date_of_birth`. There is also no `Numeric` logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of `numeric`. Furthermore, we know the `Age` primitive will return a floating point number, which would correspond to a Woodwork logical type of `Double`. With these items in mind, we can redefine the `Age` input types and return types with `ColumnSchema` objects as follows:\n", "\n", "```python\n", "input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]\n", "return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})\n", "```\n", "\n", "Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged." ] }, { "cell_type": "markdown", "id": "ebcd6d9e", "metadata": {}, "source": [ "### Mapping from old Featuretools variable types to Woodwork ColumnSchemas\n", "\n", "Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. 
While there is not a direct mapping from the old variable types to the new Woodwork types defined by `ColumnSchema` objects, the approximate mapping is shown below.\n", "\n", "\n", "| Featuretools Variable | Woodwork Column Schema |\n", "| --- | --- |\n", "| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |\n", "| Categorical | ColumnSchema(logical_type=Categorical) |\n", "| CountryCode | ColumnSchema(logical_type=CountryCode) |\n", "| Datetime | ColumnSchema(logical_type=Datetime) |\n", "| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |\n", "| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |\n", "| Discrete | ColumnSchema(semantic_tags={'category'}) |\n", "| EmailAddress | ColumnSchema(logical_type=EmailAddress) |\n", "| FilePath | ColumnSchema(logical_type=Filepath) |\n", "| FullName | ColumnSchema(logical_type=PersonFullName) |\n", "| Id | ColumnSchema(semantic_tags={'foreign_key'}) |\n", "| Index | ColumnSchema(semantic_tags={'index'}) |\n", "| IPAddress | ColumnSchema(logical_type=IPAddress) |\n", "| LatLong | ColumnSchema(logical_type=LatLong) |\n", "| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |\n", "| Numeric | ColumnSchema(semantic_tags={'numeric'}) |\n", "| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |\n", "| Ordinal | ColumnSchema(logical_type=Ordinal) |\n", "| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |\n", "| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |\n", "| Timedelta | ColumnSchema(logical_type=Timedelta) |\n", "| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |\n", "| URL | ColumnSchema(logical_type=URL) |\n", "| Unknown | ColumnSchema(logical_type=Unknown) |\n", "| ZIPCode | ColumnSchema(logical_type=PostalCode) |" ] }, { "cell_type": "markdown", "id": "fec87370", "metadata": {}, "source": [ "## Changes to Deep Feature Synthesis and Calculate Feature 
Matrix\n", "\n", "The argument names for both `featuretools.dfs` and `featuretools.calculate_feature_matrix` have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:\n", "\n", "```python\n", "features = ft.dfs(entityset=es,\n", " target_entity='items',\n", " features_only=True)\n", "```\n", "\n", "In Featuretools 1.0, the `target_entity` argument has been renamed to `target_dataframe_name`, but otherwise this basic call remains the same.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5428949c", "metadata": {}, "outputs": [], "source": [ "features = ft.dfs(entityset=es, target_dataframe_name=\"items\", features_only=True)\n", "features" ] }, { "cell_type": "markdown", "id": "3154734d", "metadata": {}, "source": [ "In addition, the `dfs` argument `ignore_entities` was renamed to `ignore_dataframes` and `ignore_variables` was renamed to `ignore_columns`. Similarly, if specifying primitive options, all references to `entities` should be replaced with `dataframes` and references to `variables` should be replaced with `columns`. For example, the primitive option of `include_groupby_entities` is now `include_groupby_dataframes` and `include_variables` is now `include_columns`.\n", "\n", "The basic call to `featuretools.calculate_feature_matrix` remains unchanged if passing in an EntitySet along with a list of features to calculate. However, users calling `calculate_feature_matrix` by passing in a list of `entities` and `relationships` should note that the `entities` argument has been renamed to `dataframes` and the dictionary values should now include Woodwork logical types instead of Featuretools `Variable` classes."
] }, { "cell_type": "code", "execution_count": null, "id": "456da22e", "metadata": {}, "outputs": [], "source": [ "feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "b87489cf", "metadata": {}, "source": [ "In addition to the changes in argument names, there are a couple other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously `Id`) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as `Integer`, features will not be generated from these columns. Manually converting foreign key columns to `Categorical` as shown above will result in features much closer to those achieved with previous versions.\n", "\n", "Also, because Woodwork's type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.\n", "\n", "Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows." 
] }, { "cell_type": "code", "execution_count": null, "id": "cdb45cc9", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww" ] }, { "cell_type": "markdown", "id": "68910d73", "metadata": {}, "source": [ "Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork `origin` attribute for the column. Columns that were in the original data will be labeled with `base` and features that were created by Featuretools will be labeled with `engineered`.\n", "\n", "As a demonstration of how to access this information, let's compare two features in the feature matrix: `item_price` and `orders.MEAN(items.item_price)`. `item_price` was present in the original data, and `orders.MEAN(items.item_price)` was created by Featuretools." ] }, { "cell_type": "code", "execution_count": null, "id": "f3e143fe", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww[\"item_price\"].ww.origin" ] }, { "cell_type": "code", "execution_count": null, "id": "12cf8260", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww[\"orders.MEAN(items.item_price)\"].ww.origin" ] }, { "cell_type": "markdown", "id": "4c429c75", "metadata": {}, "source": [ "## Other changes\n", "\n", "In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.\n", "\n", "- Column ordering of a dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. 
Now the index column will remain in the position it was when the dataframe was added to the EntitySet.\n", "\n", "- For `LatLong` columns, older versions of Featuretools would replace single `nan` values in the columns with a tuple `(nan, nan)`. This is no longer the case, and single `nan` values will now remain in the `LatLong` column. Based on the behavior in Woodwork, any values of `(nan, nan)` in a `LatLong` column will be replaced with a single `nan` value.\n", "\n", "- Since Featuretools no longer defines `Variable` objects with relationships between them, the `featuretools.variable_types.graph_variable_types` function has been removed.\n", "\n", "- The `featuretools.variable_types.list_variable_types` utility function has been removed and replaced with two corresponding Woodwork functions: `woodwork.list_logical_types` and `woodwork.list_semantic_tags`. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: docs/source/resources/usage_tips/glossary.rst ================================================ .. _glossary: .. currentmodule:: featuretools Glossary ======== .. glossary:: :sorted: feature A transformation of data used for machine learning. Featuretools has a custom language for defining features as described :ref:`here `. All features are represented by subclasses of :class:`FeatureBase`. feature engineering The process of transforming data into representations that are better for machine learning. 
cutoff time The last point in time at which data is allowed to be used when calculating a feature. EntitySet A collection of dataframes and the relationships between them. Represented by the :class:`.EntitySet` class. instance Equivalent to a row in a relational database. Each dataframe has many instances, and each instance has a value for each column and feature defined on the dataframe. target dataframe The dataframe for which we will be making features. parent dataframe A dataframe that is referenced by another dataframe via relationship. The "one" in a one-to-many relationship. child dataframe A dataframe that references another dataframe via relationship. The "many" in a one-to-many relationship. relationship A mapping between a parent dataframe and a child dataframe. The child dataframe must contain a column referencing the index column on the parent dataframe. Represented by the :class:`.Relationship` class. logical type Additional information about how a column should be interpreted or parsed beyond how the data is stored on disk or in memory. Used to determine which primitives can be applied to a column to generate features. semantic tag Optional additional information on the column about the meaning or potential uses of data. Used to determine which primitives can be applied to a column to generate features. ColumnSchema All of a Woodwork column's type information including the logical type and any semantic tags. ================================================ FILE: docs/source/resources/usage_tips/limitations.rst ================================================ Limitations ----------- In-memory ********* Featuretools is intended to be run on datasets that can fit in memory on one machine. For advice on handling large datasets refer to :ref:`Improving Computational Performance `. Bring your own labels ********************* If you are doing supervised machine learning, you must supply your own labels and cutoff times. 
To structure this process, you can use `Compose `_, which is an open source project for automatically generating labels with cutoff times. ================================================ FILE: docs/source/set-headers.py ================================================ import urllib.request opener = urllib.request.build_opener() opener.addheaders = [("Testing", "True")] urllib.request.install_opener(opener) ================================================ FILE: docs/source/setup.py ================================================ import os import featuretools as ft def load_feature_plots(): es = ft.demo.load_mock_customer(return_entityset=True) path = os.path.join( os.path.dirname(os.path.abspath(__file__)), "getting_started/graphs/", ) agg_feat = ft.AggregationFeature( ft.IdentityFeature(es["sessions"].ww["session_id"]), "customers", ft.primitives.Count, ) trans_feat = ft.TransformFeature( ft.IdentityFeature(es["customers"].ww["join_date"]), ft.primitives.TimeSincePrevious, ) demo_feat = ft.AggregationFeature( ft.TransformFeature( ft.IdentityFeature(es["transactions"].ww["transaction_time"]), ft.primitives.Weekday, ), "sessions", ft.primitives.Mode, ) ft.graph_feature(agg_feat, to_file=os.path.join(path, "agg_feat.dot")) ft.graph_feature(trans_feat, to_file=os.path.join(path, "trans_feat.dot")) ft.graph_feature(demo_feat, to_file=os.path.join(path, "demo_feat.dot")) if __name__ == "__main__": load_feature_plots() ================================================ FILE: docs/source/templates/layout.html ================================================ {% extends "!layout.html" %} {%- block extrahead %} {% set image = 'https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_OpenGraph_1200x630px-featuretools.png' %} {% set description = 'Automated feature engineering in Python' %} {% if meta is defined %} {% if meta.description is defined %} {% set description = meta.description %} {% endif %} {% endif %} {% endblock %} {%- block footer %} {% endblock %} 
================================================ FILE: featuretools/__init__.py ================================================ # flake8: noqa from featuretools.version import __version__ from featuretools.config_init import config from featuretools.entityset.api import * from featuretools import primitives from featuretools.synthesis.api import * from featuretools.primitives import list_primitives, summarize_primitives from featuretools.computational_backends.api import * from featuretools import tests from featuretools.utils.recommend_primitives import get_recommended_primitives from featuretools.utils.time_utils import * from featuretools.utils.utils_info import show_info import featuretools.demo from featuretools import feature_base from featuretools import selection from featuretools.feature_base import ( AggregationFeature, DirectFeature, Feature, FeatureBase, GroupByTransformFeature, IdentityFeature, TransformFeature, graph_feature, describe_feature, save_features, load_features, ) import logging import pkg_resources import sys import traceback import warnings from woodwork import list_logical_types, list_semantic_tags logger = logging.getLogger("featuretools") # Call functions registered by other libraries when featuretools is imported for entry_point in pkg_resources.iter_entry_points("featuretools_initialize"): try: method = entry_point.load() if callable(method): method() except Exception: pass for entry_point in pkg_resources.iter_entry_points("alteryx_open_src_initialize"): try: method = entry_point.load() if callable(method): method("featuretools") except Exception: pass # Load in submodules registered by other libraries into Featuretools namespace for entry_point in pkg_resources.iter_entry_points("featuretools_plugin"): try: sys.modules["featuretools." + entry_point.name] = entry_point.load() except Exception: message = "Featuretools failed to load plugin {} from library {}. " message += "For a full stack trace, set logging to debug." 
logger.warning(message.format(entry_point.name, entry_point.module_name)) logger.debug(traceback.format_exc()) ================================================ FILE: featuretools/__main__.py ================================================ ================================================ FILE: featuretools/computational_backends/__init__.py ================================================ # flake8: noqa from featuretools.computational_backends.api import * ================================================ FILE: featuretools/computational_backends/api.py ================================================ # flake8: noqa from featuretools.computational_backends.calculate_feature_matrix import ( approximate_features, calculate_feature_matrix, ) from featuretools.computational_backends.utils import ( bin_cutoff_times, create_client_and_cluster, replace_inf_values, ) ================================================ FILE: featuretools/computational_backends/calculate_feature_matrix.py ================================================ import logging import math import os import shutil import time import warnings from datetime import datetime import cloudpickle import numpy as np import pandas as pd from woodwork.logical_types import ( Age, AgeNullable, Boolean, BooleanNullable, Integer, IntegerNullable, ) from featuretools.computational_backends.feature_set import FeatureSet from featuretools.computational_backends.feature_set_calculator import ( FeatureSetCalculator, ) from featuretools.computational_backends.utils import ( _check_cutoff_time_type, _validate_cutoff_time, bin_cutoff_times, create_client_and_cluster, gather_approximate_features, gen_empty_approx_features_df, get_ww_types_from_features, save_csv_decorator, ) from featuretools.entityset.relationship import RelationshipPath from featuretools.feature_base import AggregationFeature, FeatureBase from featuretools.utils import Trie from featuretools.utils.gen_utils import ( import_or_raise, make_tqdm_iterator, ) 
logger = logging.getLogger("featuretools.computational_backend") PBAR_FORMAT = "Elapsed: {elapsed} | Progress: {l_bar}{bar}" FEATURE_CALCULATION_PERCENTAGE = ( 0.95 # make total 5% higher to allot time for wrapping up at end ) def calculate_feature_matrix( features, entityset=None, cutoff_time=None, instance_ids=None, dataframes=None, relationships=None, cutoff_time_in_index=False, training_window=None, approximate=None, save_progress=None, verbose=False, chunk_size=None, n_jobs=1, dask_kwargs=None, progress_callback=None, include_cutoff_time=True, ): """Calculates a matrix for a given set of instance ids and calculation times. Args: features (list[:class:`.FeatureBase`]): Feature definitions to be calculated. entityset (EntitySet): An already initialized entityset. Required if `dataframes` and `relationships` not provided cutoff_time (pd.DataFrame or Datetime): Specifies times at which to calculate the features for each instance. The resulting feature matrix will use data up to and including the cutoff_time. Can either be a DataFrame or a single value. If a DataFrame is passed the instance ids for which to calculate features must be in a column with the same name as the target dataframe index or a column named `instance_id`. The cutoff time values in the DataFrame must be in a column with the same name as the target dataframe time index or a column named `time`. If the DataFrame has more than two columns, any additional columns will be added to the resulting feature matrix. If a single value is passed, this value will be used for all instances. instance_ids (list): List of instances to calculate features on. Only used if cutoff_time is a single datetime. dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]): Dictionary of DataFrames. Entries take the format {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}. Note that only the dataframe is required. 
If a Woodwork DataFrame is supplied, any other parameters will be ignored. relationships (list[(str, str, str, str)]): list of relationships between dataframes. Each list item is a tuple with the format (parent dataframe name, parent column, child dataframe name, child column). cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex where the second index is the cutoff time (first is instance id). DataFrame will be sorted by (time, instance_id). training_window (Timedelta or str, optional): Window defining how much time before the cutoff time data can be used when calculating features. If ``None``, all data before cutoff time is used. Defaults to ``None``. approximate (Timedelta or str): Frequency to group instances with similar cutoff times by for features with costly calculations. For example, if the bucket is 24 hours, all instances with cutoff times on the same day will use the same calculation for expensive features. verbose (bool, optional): Print progress info. The time granularity is per chunk. chunk_size (int or float or None): maximum number of rows of output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that fraction of all rows. If None and n_jobs > 1, it will be set to 1/n_jobs. n_jobs (int, optional): number of parallel processes to use when calculating feature matrix. Requires Dask if not equal to 1. dask_kwargs (dict, optional): Dictionary of keyword arguments to be passed when creating the dask client and scheduler. Even if n_jobs is not set, using `dask_kwargs` will enable multiprocessing. Main parameters: cluster (str or dask.distributed.LocalCluster): cluster or address of cluster to send tasks to. If unspecified, a cluster will be created. diagnostics port (int): port number to use for web dashboard. If left unspecified, web interface will not be enabled.
Valid keyword arguments for LocalCluster will also be accepted. save_progress (str, optional): path to save intermediate computational results. progress_callback (callable): function to be called with incremental progress updates. Has the following parameters: update: percentage change (float between 0 and 100) in progress since last call progress_percent: percentage (float between 0 and 100) of total computation completed time_elapsed: total time in seconds that has elapsed since start of call include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``. Returns: pd.DataFrame: The feature matrix. """ assert ( isinstance(features, list) and features != [] and all([isinstance(feature, FeatureBase) for feature in features]) ), "features must be a non-empty list of features" # handle loading entityset from featuretools.entityset.entityset import EntitySet if not isinstance(entityset, EntitySet): if dataframes is not None: entityset = EntitySet("entityset", dataframes, relationships) else: raise TypeError("No dataframes or valid EntitySet provided") target_dataframe = entityset[features[0].dataframe_name] cutoff_time = _validate_cutoff_time(cutoff_time, target_dataframe) entityset._check_time_indexes() if isinstance(cutoff_time, pd.DataFrame): if instance_ids: msg = "Passing 'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring" warnings.warn(msg) pass_columns = [ col for col in cutoff_time.columns if col not in ["instance_id", "time"] ] # make sure dtype of instance_id in cutoff time # is same as column it references target_dataframe = features[0].dataframe ltype = target_dataframe.ww.logical_types[target_dataframe.ww.index] cutoff_time.ww.init(logical_types={"instance_id": ltype}) else: pass_columns = [] if cutoff_time is None: if entityset.time_type == "numeric": cutoff_time = np.inf else: cutoff_time = datetime.now() if instance_ids is None: index_col = target_dataframe.ww.index df = 
entityset._handle_time( dataframe_name=target_dataframe.ww.name, df=target_dataframe, time_last=cutoff_time, training_window=training_window, include_cutoff_time=include_cutoff_time, ) instance_ids = df[index_col] # convert list or range object into series if not isinstance(instance_ids, pd.Series): instance_ids = pd.Series(instance_ids) cutoff_time = (cutoff_time, instance_ids) _check_cutoff_time_type(cutoff_time, entityset.time_type) # Approximate provides no benefit with a single cutoff time, so ignore it if isinstance(cutoff_time, tuple) and approximate is not None: msg = ( "Using approximate with a single cutoff_time value or no cutoff_time " "provides no computational efficiency benefit" ) warnings.warn(msg) cutoff_time = pd.DataFrame( { "instance_id": cutoff_time[1], "time": [cutoff_time[0]] * len(cutoff_time[1]), }, ) target_dataframe = features[0].dataframe ltype = target_dataframe.ww.logical_types[target_dataframe.ww.index] cutoff_time.ww.init(logical_types={"instance_id": ltype}) feature_set = FeatureSet(features) # Get features to approximate if approximate is not None: approximate_feature_trie = gather_approximate_features(feature_set) # Make a new FeatureSet that ignores approximated features feature_set = FeatureSet( features, approximate_feature_trie=approximate_feature_trie, ) # Check if there are any non-approximated aggregation features no_unapproximated_aggs = True for feature in features: if isinstance(feature, AggregationFeature): # do not need to check if feature is in to_approximate since # only base features of direct features can be in to_approximate no_unapproximated_aggs = False break if approximate is not None: all_approx_features = { f for _, feats in feature_set.approximate_feature_trie for f in feats } else: all_approx_features = set() deps = feature.get_dependencies(deep=True, ignored=all_approx_features) for dependency in deps: if isinstance(dependency, AggregationFeature): no_unapproximated_aggs = False break cutoff_df_time_col = 
"time" target_time = "_original_time" if approximate is not None: # If there are approximated aggs, bin times binned_cutoff_time = bin_cutoff_times(cutoff_time, approximate) # Think about collisions: what if original time is a feature binned_cutoff_time.ww[target_time] = cutoff_time[cutoff_df_time_col] cutoff_time_to_pass = binned_cutoff_time else: cutoff_time_to_pass = cutoff_time if isinstance(cutoff_time, pd.DataFrame): cutoff_time_len = cutoff_time.shape[0] else: cutoff_time_len = len(cutoff_time[1]) chunk_size = _handle_chunk_size(chunk_size, cutoff_time_len) tqdm_options = { "total": (cutoff_time_len / FEATURE_CALCULATION_PERCENTAGE), "bar_format": PBAR_FORMAT, "disable": True, } if verbose: tqdm_options.update({"disable": False}) elif progress_callback is not None: # allows us to utilize progress_bar updates without printing to anywhere tqdm_options.update({"file": open(os.devnull, "w"), "disable": False}) with make_tqdm_iterator(**tqdm_options) as progress_bar: if n_jobs != 1 or dask_kwargs is not None: feature_matrix = parallel_calculate_chunks( cutoff_time=cutoff_time_to_pass, chunk_size=chunk_size, feature_set=feature_set, approximate=approximate, training_window=training_window, save_progress=save_progress, entityset=entityset, n_jobs=n_jobs, no_unapproximated_aggs=no_unapproximated_aggs, cutoff_df_time_col=cutoff_df_time_col, target_time=target_time, pass_columns=pass_columns, progress_bar=progress_bar, dask_kwargs=dask_kwargs or {}, progress_callback=progress_callback, include_cutoff_time=include_cutoff_time, ) else: feature_matrix = calculate_chunk( cutoff_time=cutoff_time_to_pass, chunk_size=chunk_size, feature_set=feature_set, approximate=approximate, training_window=training_window, save_progress=save_progress, entityset=entityset, no_unapproximated_aggs=no_unapproximated_aggs, cutoff_df_time_col=cutoff_df_time_col, target_time=target_time, pass_columns=pass_columns, progress_bar=progress_bar, progress_callback=progress_callback, 
include_cutoff_time=include_cutoff_time, ) # ensure rows are sorted by input order if isinstance(cutoff_time, pd.DataFrame): feature_matrix = feature_matrix.ww.reindex( pd.MultiIndex.from_frame( cutoff_time[["instance_id", "time"]], names=feature_matrix.index.names, ), ) else: # Maintain index dtype index_dtype = feature_matrix.index.get_level_values(0).dtype feature_matrix = feature_matrix.ww.reindex( cutoff_time[1].astype(index_dtype), level=0, ) if not cutoff_time_in_index: feature_matrix.ww.reset_index(level="time", drop=True, inplace=True) if save_progress and os.path.exists(os.path.join(save_progress, "temp")): shutil.rmtree(os.path.join(save_progress, "temp")) # force to 100% since we saved last 5 percent previous_progress = progress_bar.n progress_bar.update(progress_bar.total - progress_bar.n) if progress_callback is not None: ( update, progress_percent, time_elapsed, ) = update_progress_callback_parameters(progress_bar, previous_progress) progress_callback(update, progress_percent, time_elapsed) progress_bar.refresh() return feature_matrix def calculate_chunk( cutoff_time, chunk_size, feature_set, entityset, approximate, training_window, save_progress, no_unapproximated_aggs, cutoff_df_time_col, target_time, pass_columns, progress_bar=None, progress_callback=None, include_cutoff_time=True, schema=None, ): if not isinstance(feature_set, FeatureSet): feature_set = cloudpickle.loads(feature_set) # pragma: no cover feature_matrix = [] if no_unapproximated_aggs and approximate is not None: if entityset.time_type == "numeric": group_time = np.inf else: group_time = datetime.now() if isinstance(cutoff_time, tuple): update_progress_callback = None if progress_bar is not None: def update_progress_callback(done): previous_progress = progress_bar.n progress_bar.update(done * len(cutoff_time[1])) if progress_callback is not None: ( update, progress_percent, time_elapsed, ) = update_progress_callback_parameters( progress_bar, previous_progress, ) 
progress_callback(update, progress_percent, time_elapsed) time_last = cutoff_time[0] ids = cutoff_time[1] calculator = FeatureSetCalculator( entityset, feature_set, time_last, training_window=training_window, ) _feature_matrix = calculator.run( ids, progress_callback=update_progress_callback, include_cutoff_time=include_cutoff_time, ) time_index = pd.Index([time_last] * len(ids), name="time") _feature_matrix = _feature_matrix.set_index(time_index, append=True) feature_matrix.append(_feature_matrix) else: if schema: cutoff_time.ww.init_with_full_schema(schema=schema) # pragma: no cover for _, group in cutoff_time.groupby(cutoff_df_time_col): # if approximating, calculate the approximate features if approximate is not None: group.ww.init(schema=cutoff_time.ww.schema, validate=False) precalculated_features_trie = approximate_features( feature_set, group, window=approximate, entityset=entityset, training_window=training_window, include_cutoff_time=include_cutoff_time, ) else: precalculated_features_trie = None @save_csv_decorator(save_progress) def calc_results( time_last, ids, precalculated_features=None, training_window=None, include_cutoff_time=True, ): update_progress_callback = None if progress_bar is not None: def update_progress_callback(done): previous_progress = progress_bar.n progress_bar.update(done * group.shape[0]) if progress_callback is not None: ( update, progress_percent, time_elapsed, ) = update_progress_callback_parameters( progress_bar, previous_progress, ) progress_callback(update, progress_percent, time_elapsed) calculator = FeatureSetCalculator( entityset, feature_set, time_last, training_window=training_window, precalculated_features=precalculated_features, ) matrix = calculator.run( ids, progress_callback=update_progress_callback, include_cutoff_time=include_cutoff_time, ) return matrix # if all aggregations have been approximated, can calculate all together if no_unapproximated_aggs and approximate is not None: inner_grouped = [[group_time, 
group]] else: # if approximated features, set cutoff_time to unbinned time if precalculated_features_trie is not None: group[cutoff_df_time_col] = group[target_time] inner_grouped = group.groupby(cutoff_df_time_col, sort=True) if chunk_size is not None: inner_grouped = _chunk_dataframe_groups(inner_grouped, chunk_size) for time_last, group in inner_grouped: # sort group by instance id ids = group["instance_id"].sort_values().values if no_unapproximated_aggs and approximate is not None: window = None else: window = training_window # calculate values for those instances at time time_last _feature_matrix = calc_results( time_last, ids, precalculated_features=precalculated_features_trie, training_window=window, include_cutoff_time=include_cutoff_time, ) id_name = _feature_matrix.index.name # if approximate, merge feature matrix with group frame to get original # cutoff times and passed columns if approximate: cols = [c for c in _feature_matrix.columns if c not in pass_columns] indexer = group[["instance_id", target_time] + pass_columns] _feature_matrix = _feature_matrix[cols].merge( indexer, right_on=["instance_id"], left_index=True, how="right", ) _feature_matrix.set_index( ["instance_id", target_time], inplace=True, ) _feature_matrix.index.set_names([id_name, "time"], inplace=True) _feature_matrix.sort_index(level=1, kind="mergesort", inplace=True) else: # all rows have same cutoff time. 
set time and add passed columns num_rows = len(ids) if len(pass_columns) > 0: pass_through = group[ ["instance_id", cutoff_df_time_col] + pass_columns ] pass_through.rename( columns={ "instance_id": id_name, cutoff_df_time_col: "time", }, inplace=True, ) time_index = pd.Index([time_last] * num_rows, name="time") _feature_matrix = _feature_matrix.set_index( time_index, append=True, ) if len(pass_columns) > 0: pass_through.set_index([id_name, "time"], inplace=True) for col in pass_columns: _feature_matrix[col] = pass_through[col] feature_matrix.append(_feature_matrix) ww_init_kwargs = get_ww_types_from_features( feature_set.target_features, entityset, pass_columns, cutoff_time, ) feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs) return feature_matrix def approximate_features( feature_set, cutoff_time, window, entityset, training_window=None, include_cutoff_time=True, ): """Given a set of features and cutoff_times to be passed to calculate_feature_matrix, calculates approximate values of some features to speed up calculations. Cutoff times are sorted into window-sized buckets and the approximate feature values are only calculated at one cutoff time for each bucket. .. note:: this only approximates DirectFeatures of AggregationFeatures on the target dataframe. In future versions, it may also be possible to approximate these features on other top-level dataframes. Args: cutoff_time (pd.DataFrame): specifies the time at which to calculate the features for each instance. The resulting feature matrix will use data up to and including the cutoff_time. A DataFrame with 'instance_id' and 'time' columns. window (Timedelta or str): frequency to group instances with similar cutoff times by for features with costly calculations. For example, if the bucket is 24 hours, all instances with cutoff times on the same day will use the same calculation for expensive features. entityset (:class:`.EntitySet`): An already initialized entityset.
feature_set (:class:`.FeatureSet`): The features to be calculated. training_window (`Timedelta`, optional): Window defining how much older than the cutoff time data can be to be included when calculating the feature. If None, all older data is used. include_cutoff_time (bool): If True, data at cutoff times are included in feature calculations. """ approx_fms_trie = Trie(path_constructor=RelationshipPath) target_time_colname = "target_time" cutoff_time.ww[target_time_colname] = cutoff_time["time"] approx_cutoffs = bin_cutoff_times(cutoff_time, window) cutoff_df_time_col = "time" cutoff_df_instance_col = "instance_id" # should this order be by dependencies so that calculate_feature_matrix # doesn't skip approximating something? for relationship_path, approx_feature_names in feature_set.approximate_feature_trie: if not approx_feature_names: continue ( cutoffs_with_approx_e_ids, new_approx_dataframe_index_col, ) = _add_approx_dataframe_index_col( entityset, feature_set.target_df_name, approx_cutoffs.copy(), relationship_path, ) # Select only columns we care about columns_we_want = [ new_approx_dataframe_index_col, cutoff_df_time_col, target_time_colname, ] cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids[columns_we_want] cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids.drop_duplicates() cutoffs_with_approx_e_ids.dropna( subset=[new_approx_dataframe_index_col], inplace=True, ) approx_features = [ feature_set.features_by_name[name] for name in approx_feature_names ] if cutoffs_with_approx_e_ids.empty: approx_fm = gen_empty_approx_features_df(approx_features) else: cutoffs_with_approx_e_ids.sort_values( [cutoff_df_time_col, new_approx_dataframe_index_col], inplace=True, ) # CFM assumes specific column names for cutoff_time argument rename = {new_approx_dataframe_index_col: cutoff_df_instance_col} cutoff_time_to_pass = cutoffs_with_approx_e_ids.rename(columns=rename) cutoff_time_to_pass = cutoff_time_to_pass[ [cutoff_df_instance_col, cutoff_df_time_col] ] 
cutoff_time_to_pass.drop_duplicates(inplace=True) approx_fm = calculate_feature_matrix( approx_features, entityset, cutoff_time=cutoff_time_to_pass, training_window=training_window, approximate=None, cutoff_time_in_index=False, chunk_size=cutoff_time_to_pass.shape[0], include_cutoff_time=include_cutoff_time, ) approx_fms_trie.get_node(relationship_path).value = approx_fm return approx_fms_trie def scatter_warning(num_scattered_workers, num_workers): if num_scattered_workers != num_workers: scatter_warning = "EntitySet was only scattered to {} out of {} workers" logger.warning(scatter_warning.format(num_scattered_workers, num_workers)) def parallel_calculate_chunks( cutoff_time, chunk_size, feature_set, approximate, training_window, save_progress, entityset, n_jobs, no_unapproximated_aggs, cutoff_df_time_col, target_time, pass_columns, progress_bar, dask_kwargs=None, progress_callback=None, include_cutoff_time=True, ): import_or_raise( "distributed", "Dask must be installed to calculate feature matrix with n_jobs set to anything but 1", ) from dask.base import tokenize from distributed import Future, as_completed client = None cluster = None try: client, cluster = create_client_and_cluster( n_jobs=n_jobs, dask_kwargs=dask_kwargs, entityset_size=entityset.__sizeof__(), ) # scatter the entityset # denote future with leading underscore start = time.time() es_token = "EntitySet-{}".format(tokenize(entityset)) if es_token in client.list_datasets(): msg = "Using EntitySet persisted on the cluster as dataset {}" progress_bar.write(msg.format(es_token)) _es = client.get_dataset(es_token) else: _es = client.scatter([entityset])[0] client.publish_dataset(**{_es.key: _es}) # save features to a tempfile and scatter it pickled_feats = cloudpickle.dumps(feature_set) _saved_features = client.scatter(pickled_feats) client.replicate([_es, _saved_features]) num_scattered_workers = len( client.who_has([Future(es_token)]).get(es_token, []), ) num_workers = 
len(client.scheduler_info()["workers"].values()) schema = None if isinstance(cutoff_time, pd.DataFrame): schema = cutoff_time.ww.schema chunks = cutoff_time.groupby(cutoff_df_time_col) cutoff_time_len = cutoff_time.shape[0] else: chunks = cutoff_time cutoff_time_len = len(cutoff_time[1]) if not chunk_size: chunk_size = _handle_chunk_size(1.0 / num_workers, cutoff_time_len) chunks = _chunk_dataframe_groups(chunks, chunk_size) chunks = [df for _, df in chunks] if len(chunks) < num_workers: # pragma: no cover chunk_warning = ( "Fewer chunks ({}), than workers ({}) consider reducing the chunk size" ) warning_string = chunk_warning.format(len(chunks), num_workers) progress_bar.write(warning_string) scatter_warning(num_scattered_workers, num_workers) end = time.time() scatter_time = round(end - start) # if enabled, reset timer after scatter for better time remaining estimates if not progress_bar.disable: progress_bar.reset() scatter_string = "EntitySet scattered to {} workers in {} seconds" progress_bar.write(scatter_string.format(num_scattered_workers, scatter_time)) # map chunks # TODO: consider handling task submission dask kwargs _chunks = client.map( calculate_chunk, chunks, feature_set=_saved_features, chunk_size=None, entityset=_es, approximate=approximate, training_window=training_window, save_progress=save_progress, no_unapproximated_aggs=no_unapproximated_aggs, cutoff_df_time_col=cutoff_df_time_col, target_time=target_time, pass_columns=pass_columns, progress_bar=None, progress_callback=progress_callback, include_cutoff_time=include_cutoff_time, schema=schema, ) feature_matrix = [] iterator = as_completed(_chunks).batches() for batch in iterator: results = client.gather(batch) for result in results: feature_matrix.append(result) previous_progress = progress_bar.n progress_bar.update(result.shape[0]) if progress_callback is not None: ( update, progress_percent, time_elapsed, ) = update_progress_callback_parameters( progress_bar, previous_progress, ) 
progress_callback(update, progress_percent, time_elapsed) except Exception: raise finally: if client is not None: client.close() if "cluster" not in dask_kwargs and cluster is not None: cluster.close() # pragma: no cover ww_init_kwargs = get_ww_types_from_features( feature_set.target_features, entityset, pass_columns, cutoff_time, ) feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs) return feature_matrix def _add_approx_dataframe_index_col(es, target_dataframe_name, cutoffs, path): """ Add a column to the cutoff df linking it to the dataframe at the end of the path. Return the updated cutoff df and the name of this column. The name will consist of the columns which were joined through. """ last_child_col = "instance_id" last_parent_col = es[target_dataframe_name].ww.index for _, relationship in path: child_cols = [last_parent_col, relationship._child_column_name] child_df = es[relationship.child_name][child_cols] # Rename relationship.child_column to include the columns we have # joined through. new_col_name = "%s.%s" % (last_child_col, relationship._child_column_name) to_rename = {relationship._child_column_name: new_col_name} child_df = child_df.rename(columns=to_rename) cutoffs = cutoffs.merge( child_df, left_on=last_child_col, right_on=last_parent_col, ) # These will be used in the next iteration. 
last_child_col = new_col_name last_parent_col = relationship._parent_column_name return cutoffs, new_col_name def _chunk_dataframe_groups(grouped, chunk_size): """chunks a grouped dataframe into groups no larger than chunk_size""" if isinstance(grouped, tuple): for i in range(0, len(grouped[1]), chunk_size): yield None, (grouped[0], grouped[1].iloc[i : i + chunk_size]) else: for group_key, group_df in grouped: for i in range(0, len(group_df), chunk_size): yield group_key, group_df.iloc[i : i + chunk_size] def _handle_chunk_size(chunk_size, total_size): if chunk_size is not None: assert chunk_size > 0, "Chunk size must be greater than 0" if chunk_size < 1: chunk_size = math.ceil(chunk_size * total_size) chunk_size = int(chunk_size) return chunk_size def update_progress_callback_parameters(progress_bar, previous_progress): update = (progress_bar.n - previous_progress) / progress_bar.total * 100 progress_percent = (progress_bar.n / progress_bar.total) * 100 time_elapsed = progress_bar.format_dict["elapsed"] return (update, progress_percent, time_elapsed) def init_ww_and_concat_fm(feature_matrix, ww_init_kwargs): cols_to_check = { col for col, ltype in ww_init_kwargs["logical_types"].items() if isinstance(ltype, (Age, Boolean, Integer)) } replacement_type = { "age": AgeNullable(), "boolean": BooleanNullable(), "integer": IntegerNullable(), } for fm in feature_matrix: updated_cols = set() for col in cols_to_check: # Only convert types if null values are present if fm[col].isnull().any(): current_type = ww_init_kwargs["logical_types"][col].type_string ww_init_kwargs["logical_types"][col] = replacement_type[current_type] updated_cols.add(col) cols_to_check = cols_to_check - updated_cols fm.ww.init(**ww_init_kwargs) feature_matrix = pd.concat(feature_matrix) feature_matrix.ww.init(**ww_init_kwargs) return feature_matrix ================================================ FILE: featuretools/computational_backends/feature_set.py ================================================ 
import itertools import logging from collections import defaultdict from featuretools.entityset.relationship import RelationshipPath from featuretools.feature_base import ( AggregationFeature, FeatureOutputSlice, GroupByTransformFeature, TransformFeature, ) from featuretools.utils import Trie logger = logging.getLogger("featuretools.computational_backend") class FeatureSet(object): """ Represents an immutable set of features to be calculated for a single dataframe, and their dependencies. """ def __init__(self, features, approximate_feature_trie=None): """ Args: features (list[Feature]): Features of the target dataframe. approximate_feature_trie (Trie[RelationshipPath, set[str]], optional): Dependency features to ignore because they have already been approximated. For example, if one of the target features is a direct feature of a feature A and A is included in approximate_feature_trie then neither A nor its dependencies will appear in FeatureSet.feature_trie. """ self.target_df_name = features[0].dataframe_name self.target_features = features self.target_feature_names = {f.unique_name() for f in features} if not approximate_feature_trie: approximate_feature_trie = Trie( default=list, path_constructor=RelationshipPath, ) self.approximate_feature_trie = approximate_feature_trie # Maps the unique name of each feature to the actual feature. This is necessary # because features do not support equality and so cannot be used as # dictionary keys. The equality operator on features produces a new # feature (which will always be truthy). 
self.features_by_name = {f.unique_name(): f for f in features} feature_dependents = defaultdict(set) for f in features: deps = f.get_dependencies(deep=True) for dep in deps: feature_dependents[dep.unique_name()].add(f.unique_name()) self.features_by_name[dep.unique_name()] = dep subdeps = dep.get_dependencies(deep=True) for sd in subdeps: feature_dependents[sd.unique_name()].add(dep.unique_name()) # feature names (keys) and the features that rely on them (values). self.feature_dependents = { fname: [self.features_by_name[dname] for dname in feature_dependents[fname]] for fname, f in self.features_by_name.items() } self._feature_trie = None @property def feature_trie(self): """ The target features and their dependencies organized into a trie by relationship path. This is built once when it is first called (to avoid building it if it is not needed) and then used for all subsequent calls. The edges of the trie are RelationshipPaths and the values are tuples of (bool, set[str], set[str]). The bool represents whether the full dataframe is needed at that node, the first set contains the names of features which are needed on the full dataframe, and the second set contains the names of the rest of the features Returns: Trie[RelationshipPath, (bool, set[str], set[str])] """ if not self._feature_trie: self._feature_trie = self._build_feature_trie() return self._feature_trie def _build_feature_trie(self): """ Build the feature trie by adding the target features and their dependencies recursively. """ feature_trie = Trie( default=lambda: (False, set(), set()), path_constructor=RelationshipPath, ) for f in self.target_features: self._add_feature_to_trie(feature_trie, f, self.approximate_feature_trie) return feature_trie def _add_feature_to_trie( self, trie, feature, approximate_feature_trie, ancestor_needs_full_dataframe=False, ): """ Add the given feature to the root of the trie, and recurse on its dependencies. 
If it is in approximate_feature_trie then it will not be added and we will not recurse on its dependencies. """ node_needs_full_dataframe, full_features, not_full_features = trie.value needs_full_dataframe = ( ancestor_needs_full_dataframe or self.uses_full_dataframe(feature) ) name = feature.unique_name() # If this feature is ignored then don't add it or any of its dependencies. if name in approximate_feature_trie.value: return # Add the feature to one of the sets, depending on whether it needs the full dataframe. if needs_full_dataframe: full_features.add(name) if name in not_full_features: not_full_features.remove(name) # Update needs_full_dataframe for this node. trie.value = (True, full_features, not_full_features) # Set every node in relationship path to needs_full_dataframe. sub_trie = trie for edge in feature.relationship_path: sub_trie = sub_trie.get_node([edge]) (_, f1, f2) = sub_trie.value sub_trie.value = (True, f1, f2) else: if name not in full_features: not_full_features.add(name) sub_trie = trie.get_node(feature.relationship_path) sub_ignored_trie = approximate_feature_trie.get_node(feature.relationship_path) for dep_feat in feature.get_dependencies(): if isinstance(dep_feat, FeatureOutputSlice): dep_feat = dep_feat.base_feature self._add_feature_to_trie( sub_trie, dep_feat, sub_ignored_trie, ancestor_needs_full_dataframe=needs_full_dataframe, ) def group_features(self, feature_names): """ Topologically sort the given features, then group by path, feature type, use_previous, and where. 
""" features = [self.features_by_name[name] for name in feature_names] depths = self._get_feature_depths(features) def key_func(f): return ( depths[f.unique_name()], f.relationship_path_name(), str(f.__class__), _get_use_previous(f), _get_where(f), self.uses_full_dataframe(f), _get_groupby(f), ) # Sort the list of features by the complex key function above, then # group them by the same key sort_feats = sorted(features, key=key_func) feature_groups = [ list(g) for _, g in itertools.groupby(sort_feats, key=key_func) ] return feature_groups def _get_feature_depths(self, features): """ Generate and return a mapping of {feature name -> depth} in the feature DAG for the given dataframe. """ order = defaultdict(int) depths = {} queue = features[:] while queue: # Get the next feature. f = queue.pop(0) depths[f.unique_name()] = order[f.unique_name()] # Only look at dependencies if they are on the same dataframe. if not f.relationship_path: dependencies = f.get_dependencies() for dep in dependencies: order[dep.unique_name()] = min( order[f.unique_name()] - 1, order[dep.unique_name()], ) queue.append(dep) return depths def uses_full_dataframe(self, feature, check_dependents=False): if ( isinstance(feature, TransformFeature) and feature.primitive.uses_full_dataframe ): return True return check_dependents and self._dependent_uses_full_dataframe(feature) def _dependent_uses_full_dataframe(self, feature): for d in self.feature_dependents[feature.unique_name()]: if isinstance(d, TransformFeature) and d.primitive.uses_full_dataframe: return True return False # These functions are used for sorting and grouping features def _get_use_previous( f, ): # TODO Sort and group features for DateOffset with two different temporal values if isinstance(f, AggregationFeature) and f.use_previous is not None: if len(f.use_previous.times.keys()) > 1: return ("", -1) else: unit = list(f.use_previous.times.keys())[0] value = f.use_previous.times[unit] return (unit, value) else: return ("", -1) def 
_get_where(f): if isinstance(f, AggregationFeature) and f.where is not None: return f.where.unique_name() else: return "" def _get_groupby(f): if isinstance(f, GroupByTransformFeature): return f.groupby.unique_name() else: return "" ================================================ FILE: featuretools/computational_backends/feature_set_calculator.py ================================================ from datetime import datetime from functools import partial import numpy as np import pandas as pd import pandas.api.types as pdtypes from featuretools.entityset.relationship import RelationshipPath from featuretools.exceptions import UnknownFeature from featuretools.feature_base import ( AggregationFeature, DirectFeature, GroupByTransformFeature, IdentityFeature, TransformFeature, ) from featuretools.utils import Trie from featuretools.utils.gen_utils import get_relationship_column_id class FeatureSetCalculator(object): """ Calculates the values of a set of features for given instance ids. """ def __init__( self, entityset, feature_set, time_last=None, training_window=None, precalculated_features=None, ): """ Args: feature_set (FeatureSet): The features to calculate values for. time_last (pd.Timestamp, optional): Last allowed time. Whether data from exactly this time is used depends on the `include_cutoff_time` argument of `run`. training_window (Timedelta, optional): Window defining how much time before the cutoff time data can be used when calculating features. If None, all data before cutoff time is used.
            precalculated_features (Trie[RelationshipPath -> pd.DataFrame]):
                Maps RelationshipPaths to dataframes of precalculated_features

        """
        self.entityset = entityset
        self.feature_set = feature_set
        self.training_window = training_window

        if time_last is None:
            time_last = datetime.now()

        self.time_last = time_last

        if precalculated_features is None:
            precalculated_features = Trie(path_constructor=RelationshipPath)

        self.precalculated_features = precalculated_features

        # total number of features (including dependencies) to be calculated
        self.num_features = sum(
            len(features1) + len(features2)
            for _, (_, features1, features2) in self.feature_set.feature_trie
        )

    def run(self, instance_ids, progress_callback=None, include_cutoff_time=True):
        """
        Calculate values of features for the given instances of the target
        dataframe.

        Summary of algorithm:
        1. Construct a trie where the edges are relationships and each node
            contains a set of features for a single dataframe. See
            FeatureSet._build_feature_trie.
        2. Initialize a trie for storing dataframes.
        3. Traverse the trie using depth first search. At each node calculate
            the features and store the resulting dataframe in the dataframe
            trie (so that its values can be used by features which depend on
            these features). See _calculate_features_for_dataframe.
        4. Get the dataframe at the root of the trie (for the target dataframe) and
            return the columns corresponding to the requested features.

        Args:
            instance_ids (np.ndarray or pd.Categorical): Instance ids for which
                to build features.

            progress_callback (callable): function to be called with incremental
                progress updates

            include_cutoff_time (bool): If True, data at cutoff time are included in
                calculating features.

        Returns:
            pd.DataFrame : Pandas DataFrame of calculated feature values.
                Indexed by instance_ids. Columns in same order as features
                passed in.

        """
        assert len(instance_ids) > 0, "0 instance ids provided"

        if progress_callback is None:
            # do nothing for the progress callback if not provided
            def progress_callback(*args):
                pass

        feature_trie = self.feature_set.feature_trie

        df_trie = Trie(path_constructor=RelationshipPath)
        full_dataframe_trie = Trie(path_constructor=RelationshipPath)

        target_dataframe = self.entityset[self.feature_set.target_df_name]
        self._calculate_features_for_dataframe(
            dataframe_name=self.feature_set.target_df_name,
            feature_trie=feature_trie,
            df_trie=df_trie,
            full_dataframe_trie=full_dataframe_trie,
            precalculated_trie=self.precalculated_features,
            filter_column=target_dataframe.ww.index,
            filter_values=instance_ids,
            progress_callback=progress_callback,
            include_cutoff_time=include_cutoff_time,
        )

        # The dataframe for the target dataframe should be stored at the root of
        # df_trie.
        df = df_trie.value

        # Fill in empty rows with default values.
        index_dtype = df.index.dtype.name
        if df.empty:
            return self.generate_default_df(instance_ids=instance_ids)

        missing_ids = [
            i for i in instance_ids if i not in df[target_dataframe.ww.index]
        ]
        if missing_ids:
            default_df = self.generate_default_df(
                instance_ids=missing_ids,
                extra_columns=df.columns,
            )
            df = pd.concat([df, default_df], sort=True)

        df.index.name = self.entityset[self.feature_set.target_df_name].ww.index

        # Order by instance_ids
        unique_instance_ids = pd.unique(instance_ids)
        unique_instance_ids = unique_instance_ids.astype(instance_ids.dtype)
        df = df.reindex(unique_instance_ids)

        # Keep categorical index if original index was categorical
        if index_dtype == "category":
            df.index = df.index.astype("category")

        column_list = []
        for feat in self.feature_set.target_features:
            column_list.extend(feat.get_feature_names())

        return df[column_list]

    def _calculate_features_for_dataframe(
        self,
        dataframe_name,
        feature_trie,
        df_trie,
        full_dataframe_trie,
        precalculated_trie,
        filter_column,
        filter_values,
        parent_data=None,
        progress_callback=None,
        include_cutoff_time=True,
    ):
        """
        Generate dataframes with features calculated for this node of the trie,
        and all descendant nodes. The dataframes will be stored in df_trie.

        Args:
            dataframe_name (str): The name of the dataframe to calculate features for.

            feature_trie (Trie): the trie with sets of features to calculate.
                The root contains features for the given dataframe.

            df_trie (Trie): a parallel trie for storing dataframes. The
                dataframe with features calculated will be placed in the root.

            full_dataframe_trie (Trie): a trie storing dataframes with all dataframe
                rows, for features whose primitives set uses_full_dataframe.

            precalculated_trie (Trie): a parallel trie containing dataframes
                with precalculated features. The dataframe specified by dataframe_name
                will be at the root.

            filter_column (str): The name of the column to filter this
                dataframe by.

            filter_values (pd.Series): The values to filter the filter_column
                to.

            parent_data (tuple[Relationship, list[str], pd.DataFrame]): Data
                related to the parent of this trie. This will only be present if
                the relationship points from this dataframe to the parent dataframe. A
                3 tuple of (parent_relationship,
                ancestor_relationship_columns, parent_df).
                ancestor_relationship_columns is the names of columns which
                link the parent dataframe to its ancestors.

            include_cutoff_time (bool): If True, data at cutoff time are included in
                calculating features.

        """

        # Step 1: Get a dataframe for the given dataframe name, filtered by the given
        # conditions.
        (
            need_full_dataframe,
            full_dataframe_features,
            not_full_dataframe_features,
        ) = feature_trie.value

        all_features = full_dataframe_features | not_full_dataframe_features
        columns = self._necessary_columns(dataframe_name, all_features)

        # If we need the full dataframe then don't filter by filter_values.
if need_full_dataframe: query_column = None query_values = None else: query_column = filter_column query_values = filter_values df = self.entityset.query_by_values( dataframe_name=dataframe_name, instance_vals=query_values, column_name=query_column, columns=columns, time_last=self.time_last, training_window=self.training_window, include_cutoff_time=include_cutoff_time, ) # call to update timer progress_callback(0) # Step 2: Add columns to the dataframe linking it to all ancestors. new_ancestor_relationship_columns = [] if parent_data: parent_relationship, ancestor_relationship_columns, parent_df = parent_data if ancestor_relationship_columns: ( df, new_ancestor_relationship_columns, ) = self._add_ancestor_relationship_columns( df, parent_df, ancestor_relationship_columns, parent_relationship, ) # Add the column linking this dataframe to its parent, so that # descendants get linked to the parent. new_ancestor_relationship_columns.append( parent_relationship._child_column_name, ) # call to update timer progress_callback(0) # Step 3: Recurse on children. # Pass filtered values, even if we are using a full df. 
if need_full_dataframe: filtered_df = df[df[filter_column].isin(filter_values)] else: filtered_df = df for edge, sub_trie in feature_trie.children(): is_forward, relationship = edge if is_forward: sub_dataframe_name = relationship.parent_dataframe.ww.name sub_filter_column = relationship._parent_column_name sub_filter_values = filtered_df[relationship._child_column_name] parent_data = None else: sub_dataframe_name = relationship.child_dataframe.ww.name sub_filter_column = relationship._child_column_name sub_filter_values = filtered_df[relationship._parent_column_name] parent_data = (relationship, new_ancestor_relationship_columns, df) sub_df_trie = df_trie.get_node([edge]) sub_full_dataframe_trie = full_dataframe_trie.get_node([edge]) sub_precalc_trie = precalculated_trie.get_node([edge]) self._calculate_features_for_dataframe( dataframe_name=sub_dataframe_name, feature_trie=sub_trie, df_trie=sub_df_trie, full_dataframe_trie=sub_full_dataframe_trie, precalculated_trie=sub_precalc_trie, filter_column=sub_filter_column, filter_values=sub_filter_values, parent_data=parent_data, progress_callback=progress_callback, include_cutoff_time=include_cutoff_time, ) # Step 4: Calculate the features for this dataframe. # # All dependencies of the features for this dataframe have been calculated # by the above recursive calls, and their results stored in df_trie. # Add any precalculated features. precalculated_features_df = precalculated_trie.value if precalculated_features_df is not None: # Left outer merge to keep all rows of df. df = df.merge( precalculated_features_df, how="left", left_index=True, right_index=True, suffixes=("", "_precalculated"), ) # call to update timer progress_callback(0) # First, calculate any features that require the full dataframe. These can # be calculated first because all of their dependents are included in # full_dataframe_features. 
if need_full_dataframe: df = self._calculate_features( df, full_dataframe_trie, full_dataframe_features, progress_callback, ) # Store full dataframe full_dataframe_trie.value = df # Filter df so that features that don't require the full dataframe are # only calculated on the necessary instances. df = df[df[filter_column].isin(filter_values)] # Calculate all features that don't require the full dataframe. df = self._calculate_features( df, df_trie, not_full_dataframe_features, progress_callback, ) # Step 5: Store the dataframe for this dataframe at the root of df_trie, so # that it can be accessed by the caller. df_trie.value = df def _calculate_features(self, df, df_trie, features, progress_callback): # Group the features so that each group can be calculated together. # The groups must also be in topological order (if A is a transform of B # then B must be in a group before A). feature_groups = self.feature_set.group_features(features) for group in feature_groups: representative_feature = group[0] handler = self._feature_type_handler(representative_feature) df = handler(group, df, df_trie, progress_callback) return df def _add_ancestor_relationship_columns( self, child_df, parent_df, ancestor_relationship_columns, relationship, ): """ Merge ancestor_relationship_columns from parent_df into child_df, adding a prefix to each column name specifying the relationship. Return the updated df and the new relationship column names. Args: child_df (pd.DataFrame): The dataframe to add relationship columns to. parent_df (pd.DataFrame): The dataframe to copy relationship columns from. ancestor_relationship_columns (list[str]): The names of relationship columns in the parent_df to copy into child_df. relationship (Relationship): the relationship through which the child is connected to the parent. 
""" relationship_name = relationship.parent_name new_relationship_columns = [ "%s.%s" % (relationship_name, col) for col in ancestor_relationship_columns ] # create an intermediate dataframe which shares a column # with the child dataframe and has a column with the # original parent's id. col_map = {relationship._parent_column_name: relationship._child_column_name} for child_column, parent_column in zip( new_relationship_columns, ancestor_relationship_columns, ): col_map[parent_column] = child_column merge_df = parent_df[list(col_map.keys())].rename(columns=col_map) merge_df.index.name = None # change index name for merge # Merge the dataframe, adding the relationship columns to the child. # Left outer join so that all rows in child are kept (if it contains # all rows of the dataframe then there may not be corresponding rows in the # parent_df). df = child_df.merge( merge_df, how="left", left_on=relationship._child_column_name, right_on=relationship._child_column_name, ) # ensure index is maintained df.set_index( relationship.child_dataframe.ww.index, drop=False, inplace=True, ) return df, new_relationship_columns def generate_default_df(self, instance_ids, extra_columns=None): default_row = [] default_cols = [] for f in self.feature_set.target_features: for name in f.get_feature_names(): default_cols.append(name) default_row.append(f.default_value) default_matrix = [default_row] * len(instance_ids) default_df = pd.DataFrame( default_matrix, columns=default_cols, index=instance_ids, dtype="object", ) index_name = self.entityset[self.feature_set.target_df_name].ww.index default_df.index.name = index_name if extra_columns is not None: for c in extra_columns: if c not in default_df.columns: default_df[c] = [np.nan] * len(instance_ids) return default_df def _feature_type_handler(self, f): if type(f) == TransformFeature: return self._calculate_transform_features elif type(f) == GroupByTransformFeature: return self._calculate_groupby_features elif type(f) == 
DirectFeature:
            return self._calculate_direct_features
        elif type(f) == AggregationFeature:
            return self._calculate_agg_features
        elif type(f) == IdentityFeature:
            return self._calculate_identity_features
        else:
            raise UnknownFeature("{} feature unknown".format(f.__class__))

    def _calculate_identity_features(self, features, df, _df_trie, progress_callback):
        for f in features:
            assert f.get_name() in df.columns, (
                'Column "%s" missing from dataframe' % f.get_name()
            )

        progress_callback(len(features) / float(self.num_features))

        return df

    def _calculate_transform_features(
        self,
        features,
        frame,
        _df_trie,
        progress_callback,
    ):
        frame_empty = frame.empty
        feature_values = []
        for f in features:
            # handle when no data
            if frame_empty:
                # Even though we are adding the default values here, when these new
                # features are added to the dataframe in update_feature_columns, they
                # are added as empty columns since the dataframe itself is empty.
                feature_values.append(
                    (f, [f.default_value for _ in range(f.number_output_features)]),
                )
                progress_callback(1 / float(self.num_features))
                continue

            # collect only the columns we need for this transformation
            column_data = [frame[bf.get_name()] for bf in f.base_features]

            feature_func = f.get_function()
            # apply the function to the relevant dataframe slice and add the
            # feature row to the results dataframe.
if f.primitive.uses_calc_time: values = feature_func(*column_data, time=self.time_last) else: values = feature_func(*column_data) # if we don't get just the values, the assignment breaks when indexes don't match if f.number_output_features > 1: values = [strip_values_if_series(value) for value in values] else: values = [strip_values_if_series(values)] feature_values.append((f, values)) progress_callback(1 / float(self.num_features)) frame = update_feature_columns(feature_values, frame) return frame def _calculate_groupby_features(self, features, frame, _df_trie, progress_callback): # set default values to handle the null group default_values = {} for f in features: for name in f.get_feature_names(): default_values[name] = f.default_value frame = pd.concat( [frame, pd.DataFrame(default_values, index=frame.index)], axis=1, ) # handle when no data if frame.shape[0] == 0: progress_callback(len(features) / float(self.num_features)) return frame groupby = features[0].groupby.get_name() grouped = frame.groupby(groupby) groups = frame[ groupby ].unique() # get all the unique group name to iterate over later for f in features: feature_vals = [] for _ in range(f.number_output_features): feature_vals.append([]) for group in groups: # skip null key if it exists if pd.isnull(group): continue column_names = [bf.get_name() for bf in f.base_features] # exclude the groupby column from being passed to the function column_data = [ grouped[name].get_group(group) for name in column_names[:-1] ] feature_func = f.get_function() # apply the function to the relevant dataframe slice and add the # feature row to the results dataframe. 
if f.primitive.uses_calc_time: values = feature_func(*column_data, time=self.time_last) else: values = feature_func(*column_data) if f.number_output_features == 1: values = [values] # make sure index is aligned for i, value in enumerate(values): if isinstance(value, pd.Series): value.index = column_data[0].index else: value = pd.Series(value, index=column_data[0].index) feature_vals[i].append(value) if any(feature_vals): assert len(feature_vals) == len(f.get_feature_names()) for col_vals, name in zip(feature_vals, f.get_feature_names()): frame[name].update(pd.concat(col_vals)) progress_callback(1 / float(self.num_features)) return frame def _calculate_direct_features( self, features, child_df, df_trie, progress_callback, ): path = features[0].relationship_path assert len(path) == 1, "Error calculating DirectFeatures, len(path) != 1" parent_df = df_trie.get_node([path[0]]).value _is_forward, relationship = path[0] merge_col = relationship._child_column_name # generate a mapping of old column names (in the parent dataframe) to # new column names (in the child dataframe) for the merge col_map = {relationship._parent_column_name: merge_col} index_as_feature = None fillna_dict = {} for f in features: feature_defaults = { name: f.default_value for name in f.get_feature_names() if not pd.isna(f.default_value) } fillna_dict.update(feature_defaults) if f.base_features[0].get_name() == relationship._parent_column_name: index_as_feature = f base_names = f.base_features[0].get_feature_names() for name, base_name in zip(f.get_feature_names(), base_names): if name in child_df.columns: continue col_map[base_name] = name # merge the identity feature from the parent dataframe into the child merge_df = parent_df[list(col_map.keys())].rename(columns=col_map) if index_as_feature is not None: merge_df.set_index( index_as_feature.get_name(), inplace=True, drop=False, ) else: merge_df.set_index(merge_col, inplace=True) new_df = child_df.merge( merge_df, left_on=merge_col, 
right_index=True, how="left", ) progress_callback(len(features) / float(self.num_features)) return new_df.fillna(fillna_dict) def _calculate_agg_features(self, features, frame, df_trie, progress_callback): test_feature = features[0] child_dataframe = test_feature.base_features[0].dataframe base_frame = df_trie.get_node(test_feature.relationship_path).value # Sometimes approximate features get computed in a previous filter frame # and put in the current one dynamically, # so there may be existing features here fl = [] for f in features: for ind in f.get_feature_names(): if ind not in frame.columns: fl.append(f) break features = fl if not len(features): progress_callback(len(features) / float(self.num_features)) return frame # handle where base_frame_empty = base_frame.empty where = test_feature.where if where is not None and not base_frame_empty: base_frame = base_frame.loc[base_frame[where.get_name()]] # when no child data, just add all the features to frame with nan base_frame_empty = base_frame.empty if base_frame_empty: feature_values = [] for f in features: feature_values.append((f, np.full(f.number_output_features, np.nan))) progress_callback(1 / float(self.num_features)) frame = update_feature_columns(feature_values, frame) else: relationship_path = test_feature.relationship_path groupby_col = get_relationship_column_id(relationship_path) # if the use_previous property exists on this feature, include only the # instances from the child dataframe included in that Timedelta use_previous = test_feature.use_previous if use_previous: # Filter by use_previous values time_last = self.time_last if use_previous.has_no_observations(): time_first = time_last - use_previous ti = child_dataframe.ww.time_index if ti is not None: base_frame = base_frame[base_frame[ti] >= time_first] else: n = use_previous.get_value("o") def last_n(df): return df.iloc[-n:] base_frame = base_frame.groupby( groupby_col, observed=True, sort=False, group_keys=False, ).apply(last_n) to_agg = {} 
            agg_rename = {}
            to_apply = set()
            # apply multi-column and time-dependent features as we find them, and
            # save aggregable features for later
            for f in features:
                if _can_agg(f):
                    column_id = f.base_features[0].get_name()
                    if column_id not in to_agg:
                        to_agg[column_id] = []
                    func = f.get_function()

                    # for some reason, using the string count is significantly
                    # faster than any method a primitive can return
                    # https://stackoverflow.com/questions/55731149/use-a-function-instead-of-string-in-pandas-groupby-agg
                    if func == pd.Series.count:
                        func = "count"

                    funcname = func
                    if callable(func):
                        # if the same function is being applied to the same
                        # column twice, wrap it in a partial to avoid
                        # duplicate functions
                        funcname = str(id(func))
                        if "{}-{}".format(column_id, funcname) in agg_rename:
                            func = partial(func)
                            funcname = str(id(func))

                        func.__name__ = funcname

                    to_agg[column_id].append(func)
                    # this is used below to rename columns that pandas names for us
                    agg_rename["{}-{}".format(column_id, funcname)] = f.get_name()
                    continue

                to_apply.add(f)

            # Apply the non-aggregable functions to generate a new dataframe, and
            # merge it with the existing one
            if len(to_apply):
                wrap = agg_wrapper(to_apply, self.time_last)
                # groupby_col can be both the name of the index and a column,
                # to silence pandas warning about ambiguity we explicitly pass
                # the column (in actuality grouping by both index and group would
                # work)
                to_merge = base_frame.groupby(
                    base_frame[groupby_col],
                    observed=True,
                    sort=False,
                    group_keys=False,
                ).apply(wrap)
                frame = pd.merge(
                    left=frame,
                    right=to_merge,
                    left_index=True,
                    right_index=True,
                    how="left",
                )

                progress_callback(len(to_apply) / float(self.num_features))

            # Apply the aggregate functions to generate a new dataframe, and merge
            # it with the existing one
            if len(to_agg):
                # groupby_col can be both the name of the index and a column,
                # to silence pandas warning about ambiguity we explicitly pass
                # the column (in actuality grouping by both index and group would
                # work)
                to_merge = base_frame.groupby(
                    base_frame[groupby_col],
                    observed=True,
                    sort=False,
                ).agg(to_agg)
                # rename columns to the correct feature names
                to_merge.columns = [agg_rename["-".join(x)] for x in to_merge.columns]
                to_merge = to_merge[list(agg_rename.values())]

                # Workaround for pandas bug where categories are in the wrong order
                # see: https://github.com/pandas-dev/pandas/issues/22501
                #
                # Pandas claims that bug is fixed but it still shows up in some
                # cases. More investigation needed.
                #
                # The check is made on the index dtype: an Index object is never
                # itself an instance of CategoricalDtype.
                if isinstance(frame.index.dtype, pd.CategoricalDtype):
                    categories = pdtypes.CategoricalDtype(
                        categories=frame.index.categories,
                    )
                    to_merge.index = to_merge.index.astype(object).astype(categories)

                frame = pd.merge(
                    left=frame,
                    right=to_merge,
                    left_index=True,
                    right_index=True,
                    how="left",
                )

                # determine number of features that were just merged
                progress_callback(len(to_merge.columns) / float(self.num_features))

        # Handle default values
        fillna_dict = {}
        for f in features:
            feature_defaults = {name: f.default_value for name in f.get_feature_names()}
            fillna_dict.update(feature_defaults)

        frame = frame.fillna(fillna_dict)

        return frame

    def _necessary_columns(self, dataframe_name, feature_names):
        # We have to keep all index and foreign key columns because we don't know
        # what forward relationships will come from this node.
df = self.entityset[dataframe_name] index_columns = { col for col in df.columns if {"index", "foreign_key", "time_index"} & df.ww.semantic_tags[col] } features = (self.feature_set.features_by_name[name] for name in feature_names) feature_columns = { f.column_name for f in features if isinstance(f, IdentityFeature) } return list(index_columns | feature_columns) def _can_agg(feature): assert isinstance(feature, AggregationFeature) base_features = feature.base_features if feature.where is not None: base_features = [ bf.get_name() for bf in base_features if bf.get_name() != feature.where.get_name() ] if feature.primitive.uses_calc_time: return False single_output = feature.primitive.number_output_features == 1 return len(base_features) == 1 and single_output def agg_wrapper(feats, time_last): def wrap(df): d = {} feature_values = [] for f in feats: func = f.get_function() column_ids = [bf.get_name() for bf in f.base_features] args = [df[v] for v in column_ids] if f.primitive.uses_calc_time: values = func(*args, time=time_last) else: values = func(*args) if f.number_output_features == 1: values = [values] feature_values.append((f, values)) d = update_feature_columns(feature_values, d) return pd.Series(d) return wrap def update_feature_columns(feature_data, data): new_cols = {} for item in feature_data: names = item[0].get_feature_names() values = item[1] assert len(names) == len(values) for name, value in zip(names, values): new_cols[name] = value # Handle the case where a dict is being updated if isinstance(data, dict): data.update(new_cols) return data return pd.concat([data, pd.DataFrame(new_cols, index=data.index)], axis=1) def strip_values_if_series(values): if isinstance(values, pd.Series): values = values.values return values ================================================ FILE: featuretools/computational_backends/utils.py ================================================ import logging import os import typing import warnings from datetime import datetime from 
functools import wraps

import numpy as np
import pandas as pd
import psutil
from woodwork.logical_types import Datetime, Double

from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import AggregationFeature, DirectFeature
from featuretools.utils import Trie
from featuretools.utils.gen_utils import import_or_none
from featuretools.utils.wrangle import _check_time_type, _check_timedelta

logger = logging.getLogger("featuretools.computational_backend")


def bin_cutoff_times(cutoff_time, bin_size):
    binned_cutoff_time = cutoff_time.ww.copy()
    if isinstance(bin_size, int):
        # Floor each numeric time down to a multiple of bin_size. True division
        # followed by multiplication would hand back the original value.
        binned_cutoff_time["time"] = binned_cutoff_time["time"].apply(
            lambda x: x // bin_size * bin_size,
        )
    else:
        bin_size = _check_timedelta(bin_size)
        binned_cutoff_time["time"] = datetime_round(
            binned_cutoff_time["time"],
            bin_size,
        )
    return binned_cutoff_time


def save_csv_decorator(save_progress=None):
    def inner_decorator(method):
        @wraps(method)
        def wrapped(*args, **kwargs):
            if save_progress is None:
                r = method(*args, **kwargs)
            else:
                time = args[0].to_pydatetime()
                file_name = "ft_" + time.strftime("%Y_%m_%d_%I-%M-%S-%f") + ".csv"
                file_path = os.path.join(save_progress, file_name)
                temp_dir = os.path.join(save_progress, "temp")
                if not os.path.exists(temp_dir):
                    os.makedirs(temp_dir)
                temp_file_path = os.path.join(temp_dir, file_name)
                r = method(*args, **kwargs)
                # write to a temp file first, then move it into place
                r.to_csv(temp_file_path)
                os.rename(temp_file_path, file_path)
            return r

        return wrapped

    return inner_decorator


def datetime_round(dt, freq):
    """
    round down Timestamp series to a specified freq
    """
    if not freq.is_absolute():
        raise ValueError("Unit is relative")

    # TODO: multitemporal units
    all_units = list(freq.times.keys())
    if len(all_units) == 1:
        unit = all_units[0]
        value = freq.times[unit]
        if unit == "m":
            unit = "t"
        # No support for weeks in datetime.datetime
        if unit == "w":
            unit = "d"
            value = value * 7
        freq = str(value) + unit
        return dt.dt.floor(freq)
    else:
        raise ValueError("Frequency cannot have multiple temporal parameters")


def gather_approximate_features(feature_set):
    """
    Find features which can be approximated. Returned as a trie where the values
    are sets of feature names.

    Args:
        feature_set (FeatureSet): Features to search the dependencies of for
            features to approximate.

    Returns:
        Trie[RelationshipPath, set[str]]
    """
    approximate_feature_trie = Trie(default=set, path_constructor=RelationshipPath)

    for feature in feature_set.target_features:
        if feature_set.uses_full_dataframe(feature, check_dependents=True):
            continue

        if isinstance(feature, DirectFeature):
            path = feature.relationship_path
            base_feature = feature.base_features[0]

            while isinstance(base_feature, DirectFeature):
                path = path + base_feature.relationship_path
                base_feature = base_feature.base_features[0]

            if isinstance(base_feature, AggregationFeature):
                node_feature_set = approximate_feature_trie.get_node(path).value
                node_feature_set.add(base_feature.unique_name())

    return approximate_feature_trie


def gen_empty_approx_features_df(approx_features):
    df = pd.DataFrame(columns=[f.get_name() for f in approx_features])
    df.index.name = approx_features[0].dataframe.ww.index
    return df


def n_jobs_to_workers(n_jobs):
    try:
        cpus = len(psutil.Process().cpu_affinity())
    except AttributeError:
        cpus = psutil.cpu_count()

    # Taken from sklearn parallel_backends code
    # https://github.com/scikit-learn/scikit-learn/blob/27bbdb570bac062c71b3bb21b0876fd78adc9f7e/sklearn/externals/joblib/_parallel_backends.py#L120
    if n_jobs < 0:
        workers = max(cpus + 1 + n_jobs, 1)
    else:
        workers = min(n_jobs, cpus)

    assert workers > 0, "Need at least one worker"
    return workers


def create_client_and_cluster(n_jobs, dask_kwargs, entityset_size):
    Client, LocalCluster = get_client_cluster()

    cluster = None
    if "cluster" in dask_kwargs:
        cluster = dask_kwargs["cluster"]
    else:
        # diagnostics_port sets the default port to launch bokeh web interface
        # if it is set to None web interface will not be launched
        diagnostics_port = None
        if "diagnostics_port" in dask_kwargs:
            diagnostics_port = dask_kwargs["diagnostics_port"]
            del dask_kwargs["diagnostics_port"]

        workers = n_jobs_to_workers(n_jobs)
        if n_jobs != -1 and workers < n_jobs:
            warning_string = "{} workers requested, but only {} workers created."
            warning_string = warning_string.format(n_jobs, workers)
            warnings.warn(warning_string)

        # Distributed default memory_limit for worker is 'auto'. It calculates worker
        # memory limit as total virtual memory divided by the number
        # of cores available to the workers (always 1 for featuretools setup).
        # This means reducing the number of workers does not increase the memory
        # limit for other workers. Featuretools default is to calculate memory limit
        # as total virtual memory divided by number of workers. To use distributed
        # default memory limit, set dask_kwargs['memory_limit']='auto'
        if "memory_limit" in dask_kwargs:
            memory_limit = dask_kwargs["memory_limit"]
            del dask_kwargs["memory_limit"]
        else:
            total_memory = psutil.virtual_memory().total
            memory_limit = int(total_memory / float(workers))

        cluster = LocalCluster(
            n_workers=workers,
            threads_per_worker=1,
            diagnostics_port=diagnostics_port,
            memory_limit=memory_limit,
            **dask_kwargs,
        )

        # if cluster has bokeh port, notify user if unexpected port number
        if diagnostics_port is not None:
            if hasattr(cluster, "scheduler") and cluster.scheduler:
                info = cluster.scheduler.identity()
                if "bokeh" in info["services"]:
                    msg = "Dashboard started on port {}"
                    print(msg.format(info["services"]["bokeh"]))

    client = Client(cluster)

    warned_of_memory = False
    for worker in list(client.scheduler_info()["workers"].values()):
        worker_limit = worker["memory_limit"]
        if worker_limit < entityset_size:
            raise ValueError("Insufficient memory to use this many workers")
        elif worker_limit < 2 * entityset_size and not warned_of_memory:
            logger.warning(
                "Worker memory is between 1 to 2 times the memory"
                " size of the EntitySet. If errors occur that do"
                " not occur with n_jobs equals 1, this may be the "
                "cause.
See https://featuretools.alteryx.com/en/stable/guides/performance.html#parallel-feature-computation" " for more information.", ) warned_of_memory = True return client, cluster def get_client_cluster(): """ Separated out the imports to make it easier to mock during testing """ distributed = import_or_none("distributed") Client = distributed.Client LocalCluster = distributed.LocalCluster return Client, LocalCluster CutoffTimeType = typing.Union[pd.DataFrame, str, datetime] def _validate_cutoff_time( cutoff_time: CutoffTimeType, target_dataframe, ): """ Verify that the cutoff time is a single value or a pandas dataframe with the proper columns containing no duplicate rows """ if isinstance(cutoff_time, pd.DataFrame): cutoff_time = cutoff_time.reset_index(drop=True) if "instance_id" not in cutoff_time.columns: if target_dataframe.ww.index not in cutoff_time.columns: raise AttributeError( "Cutoff time DataFrame must contain a column with either the same name" ' as the target dataframe index or a column named "instance_id"', ) # rename to instance_id cutoff_time.rename( columns={target_dataframe.ww.index: "instance_id"}, inplace=True, ) if "time" not in cutoff_time.columns: if ( target_dataframe.ww.time_index and target_dataframe.ww.time_index not in cutoff_time.columns ): raise AttributeError( "Cutoff time DataFrame must contain a column with either the same name" ' as the target dataframe time_index or a column named "time"', ) # rename to time cutoff_time.rename( columns={target_dataframe.ww.time_index: "time"}, inplace=True, ) # Make sure user supplies only one valid name for instance id and time columns if ( "instance_id" in cutoff_time.columns and target_dataframe.ww.index in cutoff_time.columns and "instance_id" != target_dataframe.ww.index ): raise AttributeError( 'Cutoff time DataFrame cannot contain both a column named "instance_id" and a column' " with the same name as the target dataframe index", ) if ( "time" in cutoff_time.columns and 
target_dataframe.ww.time_index in cutoff_time.columns and "time" != target_dataframe.ww.time_index ): raise AttributeError( 'Cutoff time DataFrame cannot contain both a column named "time" and a column' " with the same name as the target dataframe time index", ) assert ( cutoff_time[["instance_id", "time"]].duplicated().sum() == 0 ), "Duplicated rows in cutoff time dataframe." if isinstance(cutoff_time, str): try: cutoff_time = pd.to_datetime(cutoff_time) except ValueError as e: raise ValueError(f"While parsing cutoff_time: {str(e)}") except OverflowError as e: raise OverflowError(f"While parsing cutoff_time: {str(e)}") else: if isinstance(cutoff_time, list): raise TypeError("cutoff_time must be a single value or DataFrame") return cutoff_time def _check_cutoff_time_type(cutoff_time, es_time_type): """ Check that the cutoff time values are of the proper type given the entityset time type """ # Check that cutoff_time time type matches entityset time type if isinstance(cutoff_time, tuple): cutoff_time_value = cutoff_time[0] time_type = _check_time_type(cutoff_time_value) is_numeric = time_type == "numeric" is_datetime = time_type == Datetime else: cutoff_time_col = cutoff_time.ww["time"] is_numeric = cutoff_time_col.ww.schema.is_numeric is_datetime = cutoff_time_col.ww.schema.is_datetime if es_time_type == "numeric" and not is_numeric: raise TypeError( "cutoff_time times must be numeric: try casting " "via pd.to_numeric()", ) if es_time_type == Datetime and not is_datetime: raise TypeError( "cutoff_time times must be datetime type: try casting " "via pd.to_datetime()", ) def replace_inf_values(feature_matrix, replacement_value=np.nan, columns=None): """Replace all ``np.inf`` values in a feature matrix with the specified replacement value. 
Args: feature_matrix (DataFrame): DataFrame whose columns are feature names and rows are instances replacement_value (int, float, str, optional): Value with which ``np.inf`` values will be replaced columns (list[str], optional): A list specifying which columns should have values replaced. If None, values will be replaced for all columns. Returns: feature_matrix """ if columns is None: feature_matrix = feature_matrix.replace([np.inf, -np.inf], replacement_value) else: feature_matrix[columns] = feature_matrix[columns].replace( [np.inf, -np.inf], replacement_value, ) return feature_matrix def get_ww_types_from_features( features, entityset, pass_columns=None, cutoff_time=None, ): """Given a list of features and entityset (and optionally a list of pass through columns and the cutoff time dataframe), returns the logical types, semantic tags, and origin of each column in the feature matrix. Both pass_columns and cutoff_time will need to be supplied in order to get the type information for the pass through columns """ if pass_columns is None: pass_columns = [] logical_types = {} semantic_tags = {} origins = {} for feature in features: names = feature.get_feature_names() for name in names: logical_types[name] = feature.column_schema.logical_type semantic_tags[name] = feature.column_schema.semantic_tags.copy() semantic_tags[name] -= {"index", "time_index"} if logical_types[name] is None and "numeric" in semantic_tags[name]: logical_types[name] = Double if all([f.primitive is None for f in feature.get_dependencies(deep=True)]): origins[name] = "base" else: origins[name] = "engineered" if pass_columns: cutoff_schema = cutoff_time.ww.schema for column in pass_columns: logical_types[column] = cutoff_schema.logical_types[column] semantic_tags[column] = cutoff_schema.semantic_tags[column] origins[column] = "base" ww_init = { "logical_types": logical_types, "semantic_tags": semantic_tags, "column_origins": origins, } return ww_init ================================================ 
FILE: featuretools/config_init.py ================================================ import copy import logging import os import sys def initialize_logging(): loggers = {} # Check for environmental variables logger_env_vars = { "FEATURETOOLS_LOG_LEVEL": "featuretools", "FEATURETOOLS_ES_LOG_LEVEL": "featuretools.entityset", "FEATURETOOLS_BACKEND_LOG_LEVEL": "featuretools.computation_backend", } for logger_env, logger in logger_env_vars.items(): log_level = os.environ.get(logger_env, None) if log_level is not None: loggers[logger] = log_level # Set log level to info if not otherwise specified. loggers.setdefault("featuretools", "info") loggers.setdefault("featuretools.computation_backend", "info") loggers.setdefault("featuretools.entityset", "info") fmt = "%(asctime)-15s %(name)s - %(levelname)s %(message)s" out_handler = logging.StreamHandler(sys.stdout) err_handler = logging.StreamHandler(sys.stdout) out_handler.setFormatter(logging.Formatter(fmt)) err_handler.setFormatter(logging.Formatter(fmt)) err_levels = ["WARNING", "ERROR", "CRITICAL"] for name, level in list(loggers.items()): LEVEL = getattr(logging, level.upper()) logger = logging.getLogger(name) logger.setLevel(LEVEL) for _handler in logger.handlers: logger.removeHandler(_handler) if level in err_levels: logger.addHandler(err_handler) else: logger.addHandler(out_handler) logger.propagate = False initialize_logging() class Config: def __init__(self): self._data = {} self.set_to_default() def set_to_default(self): PWD = os.path.dirname(__file__) primitive_data_folder = os.path.join(PWD, "primitives/data") self._data = { "primitive_data_folder": primitive_data_folder, } def get(self, key): return copy.deepcopy(self._data[key]) def get_all(self): return copy.deepcopy(self._data) def set(self, values): self._data.update(values) config = Config() ================================================ FILE: featuretools/demo/__init__.py ================================================ # flake8: noqa from 
featuretools.demo.api import * ================================================ FILE: featuretools/demo/api.py ================================================ # flake8: noqa from featuretools.demo.flight import load_flight from featuretools.demo.mock_customer import load_mock_customer from featuretools.demo.retail import load_retail from featuretools.demo.weather import load_weather ================================================ FILE: featuretools/demo/flight.py ================================================ import math import re import pandas as pd from tqdm import tqdm from woodwork.logical_types import Boolean, Categorical, Ordinal import featuretools as ft def load_flight( month_filter=None, categorical_filter=None, nrows=None, demo=True, return_single_table=False, verbose=False, ): """ Download, clean, and filter flight data from 2017. The original dataset can be found `here `_. Args: month_filter (list[int]): Only use data from these months (example is ``[1, 2]``). To skip, set to None. categorical_filter (dict[str->str]): Use only specified categorical values. Example is ``{'dest_city': ['Boston, MA'], 'origin_city': ['Boston, MA']}`` which returns all flights in OR out of Boston. To skip, set to None. nrows (int): Passed to nrows in ``pd.read_csv``. Used before filtering. demo (bool): Use only two months of data. If False, use the whole year. return_single_table (bool): Exit the function early and return a dataframe. verbose (bool): Show a progress bar while loading the data. Examples: .. 
ipython:: :verbatim: In [1]: import featuretools as ft In [2]: es = ft.demo.load_flight(verbose=True, ...: month_filter=[1], ...: categorical_filter={'origin_city':['Boston, MA']}) 100%|xxxxxxxxxxxxxxxxxxxxxxxxx| 100/100 [01:16<00:00, 1.31it/s] In [3]: es Out[3]: Entityset: Flight Data DataFrames: airports [Rows: 55, Columns: 3] flights [Rows: 613, Columns: 9] trip_logs [Rows: 9456, Columns: 22] airlines [Rows: 10, Columns: 1] Relationships: trip_logs.flight_id -> flights.flight_id flights.carrier -> airlines.carrier flights.dest -> airports.dest """ filename, csv_length = get_flight_filename(demo=demo) print("Downloading data ...") url = "https://oss.alteryx.com/datasets/{}?library=featuretools&version={}".format( filename, ft.__version__, ) chunksize = math.ceil(csv_length / 99) pd.options.display.max_columns = 200 iter_csv = pd.read_csv( url, compression="zip", iterator=True, nrows=nrows, chunksize=chunksize, ) if verbose: iter_csv = tqdm(iter_csv, total=100) partial_df_list = [] for chunk in iter_csv: df = filter_data( _clean_data(chunk), month_filter=month_filter, categorical_filter=categorical_filter, ) partial_df_list.append(df) data = pd.concat(partial_df_list) if return_single_table: return data es = make_es(data) return es def make_es(data): es = ft.EntitySet("Flight Data") arr_time_columns = [ "arr_delay", "dep_delay", "carrier_delay", "weather_delay", "national_airspace_delay", "security_delay", "late_aircraft_delay", "canceled", "diverted", "taxi_in", "taxi_out", "air_time", "dep_time", ] logical_types = { "flight_num": Categorical, "distance_group": Ordinal(order=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]), "canceled": Boolean, "diverted": Boolean, } es.add_dataframe( data, dataframe_name="trip_logs", index="trip_log_id", make_index=True, time_index="date_scheduled", secondary_time_index={"arr_time": arr_time_columns}, logical_types=logical_types, ) es.normalize_dataframe( "trip_logs", "flights", "flight_id", additional_columns=[ "origin", "origin_city", 
"origin_state", "dest", "dest_city", "dest_state", "distance_group", "carrier", "flight_num", ], ) es.normalize_dataframe("flights", "airlines", "carrier", make_time_index=False) es.normalize_dataframe( "flights", "airports", "dest", additional_columns=["dest_city", "dest_state"], make_time_index=False, ) return es def _clean_data(data): # Make column names snake case clean_data = data.rename(columns={col: convert(col) for col in data}) # Chance crs -> "scheduled" and other minor clarifications clean_data = clean_data.rename( columns={ "crs_arr_time": "scheduled_arr_time", "crs_dep_time": "scheduled_dep_time", "crs_elapsed_time": "scheduled_elapsed_time", "nas_delay": "national_airspace_delay", "origin_city_name": "origin_city", "dest_city_name": "dest_city", "cancelled": "canceled", }, ) # Combine strings like 0130 (1:30 AM) with dates (2017-01-01) clean_data["scheduled_dep_time"] = clean_data["scheduled_dep_time"].apply( lambda x: str(x), ) + clean_data["flight_date"].astype("str") # Parse combined string as a date clean_data.loc[:, "scheduled_dep_time"] = pd.to_datetime( clean_data["scheduled_dep_time"], format="%H%M%Y-%m-%d", errors="coerce", ) clean_data["scheduled_elapsed_time"] = pd.to_timedelta( clean_data["scheduled_elapsed_time"], unit="m", ) clean_data = _reconstruct_times(clean_data) # Create a time index 6 months before scheduled_dep clean_data.loc[:, "date_scheduled"] = pd.to_datetime( clean_data["scheduled_dep_time"], ).dt.date - pd.Timedelta("120d") # A null entry for a delay means no delay clean_data = _fill_labels(clean_data) # Nulls for scheduled values are too problematic. Remove them. clean_data = clean_data.dropna( axis="rows", subset=["scheduled_dep_time", "scheduled_arr_time"], ) # Make a flight id. Define a flight as a combination of: # 1. carrier 2. flight number 3. origin airport 4. 
dest airport clean_data.loc[:, "flight_id"] = ( clean_data["carrier"] + "-" + clean_data["flight_num"].apply(lambda x: str(x)) + ":" + clean_data["origin"] + "->" + clean_data["dest"] ) column_order = [ "flight_id", "flight_num", "date_scheduled", "scheduled_dep_time", "scheduled_arr_time", "carrier", "origin", "origin_city", "origin_state", "dest", "dest_city", "dest_state", "distance_group", "dep_time", "arr_time", "dep_delay", "taxi_out", "taxi_in", "arr_delay", "diverted", "scheduled_elapsed_time", "air_time", "distance", "carrier_delay", "weather_delay", "national_airspace_delay", "security_delay", "late_aircraft_delay", "canceled", ] clean_data = clean_data[column_order] return clean_data def _fill_labels(clean_data): labely_columns = [ "arr_delay", "dep_delay", "carrier_delay", "weather_delay", "national_airspace_delay", "security_delay", "late_aircraft_delay", "canceled", "diverted", "taxi_in", "taxi_out", "air_time", ] for col in labely_columns: clean_data.loc[:, col] = clean_data[col].fillna(0) return clean_data def _reconstruct_times(clean_data): """Reconstruct departure_time, scheduled_dep_time, arrival_time and scheduled_arr_time by adding known delays to known times. 
We do: - dep_time is scheduled_dep + dep_delay - arr_time is dep_time + taxiing and air_time - scheduled arrival is scheduled_dep + scheduled_elapsed """ clean_data.loc[:, "dep_time"] = clean_data["scheduled_dep_time"] + pd.to_timedelta( clean_data["dep_delay"], unit="m", ) clean_data.loc[:, "arr_time"] = clean_data["dep_time"] + pd.to_timedelta( clean_data["taxi_out"] + clean_data["air_time"] + clean_data["taxi_in"], unit="m", ) clean_data.loc[:, "scheduled_arr_time"] = ( clean_data["scheduled_dep_time"] + clean_data["scheduled_elapsed_time"] ) return clean_data def filter_data(clean_data, month_filter=None, categorical_filter=None): if month_filter is not None: tmp = pd.to_datetime(clean_data["scheduled_dep_time"]).dt.month.isin( month_filter, ) clean_data = clean_data[tmp] if categorical_filter is not None: tmp = False for key, values in categorical_filter.items(): tmp = tmp | clean_data[key].isin(values) clean_data = clean_data[tmp] return clean_data def convert(name): # Rename columns to underscore # Code via SO https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name) return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower() def get_flight_filename(demo=True): if demo: filename = SMALL_FLIGHT_CSV rows = 860457 else: filename = BIG_FLIGHT_CSV rows = 5162742 return filename, rows SMALL_FLIGHT_CSV = "data_2017_jan_feb.csv.zip" BIG_FLIGHT_CSV = "data_all_2017.csv.zip" ================================================ FILE: featuretools/demo/mock_customer.py ================================================ import pandas as pd from numpy import random from numpy.random import choice from woodwork.logical_types import Categorical, PostalCode import featuretools as ft def load_mock_customer( n_customers=5, n_products=5, n_sessions=35, n_transactions=500, random_seed=0, return_single_table=False, return_entityset=False, ): """Return dataframes of mock customer data""" 
random.seed(random_seed) last_date = pd.to_datetime("12/31/2013") first_date = pd.to_datetime("1/1/2008") first_bday = pd.to_datetime("1/1/1970") join_dates = [ random.uniform(0, 1) * (last_date - first_date) + first_date for _ in range(n_customers) ] birth_dates = [ random.uniform(0, 1) * (first_date - first_bday) + first_bday for _ in range(n_customers) ] customers_df = pd.DataFrame({"customer_id": range(1, n_customers + 1)}) customers_df["zip_code"] = choice( ["60091", "13244"], n_customers, ) customers_df["join_date"] = pd.Series(join_dates).dt.round("1s") customers_df["birthday"] = pd.Series(birth_dates).dt.round("1d") products_df = pd.DataFrame({"product_id": pd.Categorical(range(1, n_products + 1))}) products_df["brand"] = choice(["A", "B", "C"], n_products) sessions_df = pd.DataFrame({"session_id": range(1, n_sessions + 1)}) sessions_df["customer_id"] = choice(customers_df["customer_id"], n_sessions) sessions_df["device"] = choice(["desktop", "mobile", "tablet"], n_sessions) transactions_df = pd.DataFrame({"transaction_id": range(1, n_transactions + 1)}) transactions_df["session_id"] = choice(sessions_df["session_id"], n_transactions) transactions_df = transactions_df.sort_values("session_id").reset_index(drop=True) transactions_df["transaction_time"] = pd.date_range( "1/1/2014", periods=n_transactions, freq="65s", ) # todo make these less regular transactions_df["product_id"] = pd.Categorical( choice(products_df["product_id"], n_transactions), ) transactions_df["amount"] = random.randint(500, 15000, n_transactions) / 100 # calculate and merge in session start # based on the times we came up with for transactions session_starts = transactions_df.drop_duplicates("session_id")[ ["session_id", "transaction_time"] ].rename(columns={"transaction_time": "session_start"}) sessions_df = sessions_df.merge(session_starts) if return_single_table: return ( transactions_df.merge(sessions_df) .merge(customers_df) .merge(products_df) .reset_index(drop=True) ) elif 
return_entityset: es = ft.EntitySet(id="transactions") es = es.add_dataframe( dataframe_name="transactions", dataframe=transactions_df, index="transaction_id", time_index="transaction_time", logical_types={"product_id": Categorical}, ) es = es.add_dataframe( dataframe_name="products", dataframe=products_df, index="product_id", ) es = es.add_dataframe( dataframe_name="sessions", dataframe=sessions_df, index="session_id", time_index="session_start", ) es = es.add_dataframe( dataframe_name="customers", dataframe=customers_df, index="customer_id", time_index="join_date", logical_types={"zip_code": PostalCode}, ) rels = [ ("products", "product_id", "transactions", "product_id"), ("sessions", "session_id", "transactions", "session_id"), ("customers", "customer_id", "sessions", "customer_id"), ] es = es.add_relationships(rels) es.add_last_time_indexes() return es return { "customers": customers_df, "sessions": sessions_df, "transactions": transactions_df, "products": products_df, } ================================================ FILE: featuretools/demo/retail.py ================================================ import pandas as pd from woodwork.logical_types import NaturalLanguage import featuretools as ft def load_retail(id="demo_retail_data", nrows=None, return_single_table=False): """Returns the retail entityset example. The original dataset can be found `here `_. We have also made some modifications to the data. We changed the column names, converted the ``customer_id`` to a unique fake ``customer_name``, dropped duplicates, added columns for ``total`` and ``cancelled`` and converted amounts from GBP to USD. You can download the modified CSV in gz `compressed (7 MB) `_ or `uncompressed (43 MB) `_ formats. Args: id (str): Id to assign to EntitySet. nrows (int): Number of rows to load of the underlying CSV. If None, load all. return_single_table (bool): If True, return a single DataFrame rather than an EntitySet. Default is False. Examples: .. 
ipython:: :verbatim: In [1]: import featuretools as ft In [2]: es = ft.demo.load_retail() In [3]: es Out[3]: Entityset: demo_retail_data DataFrames: orders (shape = [22190, 3]) products (shape = [3684, 3]) customers (shape = [4372, 2]) order_products (shape = [401704, 7]) Load in subset of data .. ipython:: :verbatim: In [4]: es = ft.demo.load_retail(nrows=1000) In [5]: es Out[5]: Entityset: demo_retail_data DataFrames: orders (shape = [67, 5]) products (shape = [606, 3]) customers (shape = [50, 2]) order_products (shape = [1000, 7]) """ es = ft.EntitySet(id) csv_s3_gz = ( "https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv.gz?library=featuretools&version=" + ft.__version__ ) csv_s3 = ( "https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv?library=featuretools&version=" + ft.__version__ ) # Try to read in gz compressed file try: df = pd.read_csv(csv_s3_gz, nrows=nrows, parse_dates=["order_date"]) # Fall back to uncompressed except Exception: df = pd.read_csv(csv_s3, nrows=nrows, parse_dates=["order_date"]) if return_single_table: return df es.add_dataframe( dataframe_name="order_products", dataframe=df, index="order_product_id", make_index=True, time_index="order_date", logical_types={"description": NaturalLanguage}, ) es.normalize_dataframe( new_dataframe_name="products", base_dataframe_name="order_products", index="product_id", additional_columns=["description"], ) es.normalize_dataframe( new_dataframe_name="orders", base_dataframe_name="order_products", index="order_id", additional_columns=["customer_name", "country", "cancelled"], ) es.normalize_dataframe( new_dataframe_name="customers", base_dataframe_name="orders", index="customer_name", ) es.add_last_time_indexes() return es ================================================ FILE: featuretools/demo/weather.py ================================================ import pandas as pd import featuretools as ft def load_weather(nrows=None, return_single_table=False): """ Load the 
Australian daily-min-temperatures weather dataset. Args: nrows (int): Passed to nrows in ``pd.read_csv``. return_single_table (bool): Exit the function early and return a dataframe. """ filename = "daily-min-temperatures.csv" print("Downloading data ...") url = "https://oss.alteryx.com/datasets/{}?library=featuretools&version={}".format( filename, ft.__version__, ) data = pd.read_csv(url, index_col=None, nrows=nrows) if return_single_table: return data es = make_es(data) return es def make_es(data): es = ft.EntitySet("Weather Data") es.add_dataframe( data, dataframe_name="temperatures", index="id", make_index=True, time_index="Date", ) return es ================================================ FILE: featuretools/entityset/__init__.py ================================================ # flake8: noqa from featuretools.entityset.api import * ================================================ FILE: featuretools/entityset/api.py ================================================ # flake8: noqa from featuretools.entityset.deserialize import read_entityset from featuretools.entityset.entityset import EntitySet from featuretools.entityset.relationship import Relationship from featuretools.entityset.timedelta import Timedelta ================================================ FILE: featuretools/entityset/deserialize.py ================================================ import json import os import tarfile import tempfile from inspect import getfullargspec import pandas as pd import woodwork.type_sys.type_system as ww_type_system from woodwork.deserialize import read_woodwork_table from featuretools.entityset.relationship import Relationship from featuretools.utils.s3_utils import get_transport_params, use_smartopen_es from featuretools.utils.schema_utils import check_schema_version from featuretools.utils.wrangle import _is_local_tar, _is_s3, _is_url def description_to_entityset(description, **kwargs): """Deserialize entityset from data description. 
Args: description (dict) : Description of an :class:`.EntitySet`. Likely generated using :meth:`.serialize.entityset_to_description` kwargs (keywords): Additional keyword arguments to pass as keywords arguments to the underlying deserialization method. Returns: entityset (EntitySet) : Instance of :class:`.EntitySet`. """ check_schema_version(description, "entityset") from featuretools.entityset import EntitySet # If data description was not read from disk, path is None. path = description.get("path") entityset = EntitySet(description["id"]) for df in description["dataframes"].values(): if path is not None: data_path = os.path.join(path, "data", df["name"]) format = description.get("format") if format is not None: kwargs["format"] = format if format == "parquet" and df["loading_info"]["table_type"] == "pandas": kwargs["filename"] = df["name"] + ".parquet" dataframe = read_woodwork_table(data_path, validate=False, **kwargs) else: dataframe = empty_dataframe(df) entityset.add_dataframe(dataframe) for relationship in description["relationships"]: rel = Relationship.from_dictionary(relationship, entityset) entityset.add_relationship(relationship=rel) return entityset def empty_dataframe(description): """Deserialize empty dataframe from dataframe description. Args: description (dict) : Description of dataframe. Returns: df (DataFrame) : Empty dataframe with Woodwork initialized. """ # TODO: Can we update Woodwork to return an empty initialized dataframe from a description # instead of using this function? Or otherwise eliminate? 
Issue #1476 logical_types = {} semantic_tags = {} column_descriptions = {} column_metadata = {} use_standard_tags = {} category_dtypes = {} columns = [] for col in description["column_typing_info"]: col_name = col["name"] columns.append(col_name) ltype_metadata = col["logical_type"] ltype = ww_type_system.str_to_logical_type( ltype_metadata["type"], params=ltype_metadata["parameters"], ) tags = col["semantic_tags"] if "index" in tags: tags.remove("index") elif "time_index" in tags: tags.remove("time_index") logical_types[col_name] = ltype semantic_tags[col_name] = tags column_descriptions[col_name] = col["description"] column_metadata[col_name] = col["metadata"] use_standard_tags[col_name] = col["use_standard_tags"] if col["physical_type"]["type"] == "category": # Make sure categories are recreated properly cat_values = col["physical_type"]["cat_values"] cat_dtype = col["physical_type"]["cat_dtype"] cat_object = pd.CategoricalDtype(pd.Index(cat_values, dtype=cat_dtype)) category_dtypes[col_name] = cat_object dataframe = pd.DataFrame(columns=columns).astype(category_dtypes) dataframe.ww.init( name=description.get("name"), index=description.get("index"), time_index=description.get("time_index"), logical_types=logical_types, semantic_tags=semantic_tags, use_standard_tags=use_standard_tags, table_metadata=description.get("table_metadata"), column_metadata=column_metadata, column_descriptions=column_descriptions, validate=False, ) return dataframe def read_data_description(path): """Read data description from disk, S3 path, or URL. Args: path (str): Location on disk, S3 path, or URL to read `data_description.json`. Returns: description (dict) : Description of :class:`.EntitySet`. 
""" path = os.path.abspath(path) assert os.path.exists(path), '"{}" does not exist'.format(path) filepath = os.path.join(path, "data_description.json") with open(filepath, "r") as file: description = json.load(file) description["path"] = path return description def read_entityset(path, profile_name=None, **kwargs): """Read entityset from disk, S3 path, or URL. NOTE: Never attempt to read an archived EntitySet from an untrusted source. Args: path (str): Directory on disk, S3 path, or URL to read `data_description.json`. profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials. Set to False to use an anonymous profile. kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method. """ if _is_url(path) or _is_s3(path) or _is_local_tar(str(path)): with tempfile.TemporaryDirectory() as tmpdir: local_path = path transport_params = None if _is_s3(path): transport_params = get_transport_params(profile_name) if _is_s3(path) or _is_url(path): local_path = os.path.join(tmpdir, "temporary_es") use_smartopen_es(local_path, path, transport_params) with tarfile.open(str(local_path)) as tar: if "filter" in getfullargspec(tar.extractall).kwonlyargs: tar.extractall(path=tmpdir, filter="data") else: raise RuntimeError( "Please upgrade your Python version to the latest patch release to allow for safe extraction of the EntitySet archive.", ) data_description = read_data_description(tmpdir) return description_to_entityset(data_description, **kwargs) else: data_description = read_data_description(path) return description_to_entityset(data_description, **kwargs) ================================================ FILE: featuretools/entityset/entityset.py ================================================ import copy import logging import warnings from collections import defaultdict import numpy as np import pandas as pd from woodwork import init_series from 
woodwork.logical_types import Datetime, LatLong from featuretools.entityset import deserialize, serialize from featuretools.entityset.relationship import Relationship, RelationshipPath from featuretools.feature_base.feature_base import _ES_REF from featuretools.utils.plot_utils import ( check_graphviz, get_graphviz_format, save_graph, ) from featuretools.utils.wrangle import _check_timedelta pd.options.mode.chained_assignment = None # default='warn' logger = logging.getLogger("featuretools.entityset") LTI_COLUMN_NAME = "_ft_last_time" WW_SCHEMA_KEY = "_ww__getstate__schemas" class EntitySet(object): """ Stores all actual data and typing information for an entityset Attributes: id dataframe_dict relationships time_type Properties: metadata """ def __init__(self, id=None, dataframes=None, relationships=None): """Creates EntitySet Args: id (str) : Unique identifier to associate with this instance dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]): Dictionary of DataFrames. Entries take the format {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}. Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters will be ignored. relationships (list[(str, str, str, str)]): List of relationships between dataframes. List items are a tuple with the format (parent dataframe name, parent column, child dataframe name, child column). Example: .. 
code-block:: python dataframes = { "cards" : (card_df, "id"), "transactions" : (transactions_df, "id", "transaction_time") } relationships = [("cards", "id", "transactions", "card_id")] ft.EntitySet("my-entity-set", dataframes, relationships) """ self.id = id self.dataframe_dict = {} self.relationships = [] self.time_type = None dataframes = dataframes or {} relationships = relationships or [] for df_name in dataframes: df = dataframes[df_name][0] if df.ww.schema is not None and df.ww.name != df_name: raise ValueError( f"Naming conflict in dataframes dictionary: dictionary key '{df_name}' does not match dataframe name '{df.ww.name}'", ) index_column = None time_index = None make_index = False semantic_tags = None logical_types = None if len(dataframes[df_name]) > 1: index_column = dataframes[df_name][1] if len(dataframes[df_name]) > 2: time_index = dataframes[df_name][2] if len(dataframes[df_name]) > 3: logical_types = dataframes[df_name][3] if len(dataframes[df_name]) > 4: semantic_tags = dataframes[df_name][4] if len(dataframes[df_name]) > 5: make_index = dataframes[df_name][5] self.add_dataframe( dataframe_name=df_name, dataframe=df, index=index_column, time_index=time_index, logical_types=logical_types, semantic_tags=semantic_tags, make_index=make_index, ) for relationship in relationships: parent_df, parent_column, child_df, child_column = relationship self.add_relationship(parent_df, parent_column, child_df, child_column) self.reset_data_description() _ES_REF[self.id] = self def __sizeof__(self): return sum([df.__sizeof__() for df in self.dataframes]) def __dask_tokenize__(self): return (EntitySet, serialize.entityset_to_description(self.metadata)) def __eq__(self, other, deep=False): if self.id != other.id: return False if self.time_type != other.time_type: return False if len(self.dataframe_dict) != len(other.dataframe_dict): return False for df_name, df in self.dataframe_dict.items(): if df_name not in other.dataframe_dict: return False if not 
df.ww.__eq__(other[df_name].ww, deep=deep): return False if not len(self.relationships) == len(other.relationships): return False for r in self.relationships: if r not in other.relationships: return False return True def __ne__(self, other, deep=False): return not self.__eq__(other, deep=deep) def __getitem__(self, dataframe_name): """Get dataframe instance from entityset Args: dataframe_name (str): Name of dataframe. Returns: :class:`.DataFrame` : Instance of dataframe with Woodwork typing information. Raises: KeyError: If the dataframe doesn't exist on the entityset. """ if dataframe_name in self.dataframe_dict: return self.dataframe_dict[dataframe_name] name = self.id or "entity set" raise KeyError("DataFrame %s does not exist in %s" % (dataframe_name, name)) def __deepcopy__(self, memo): cls = self.__class__ result = cls.__new__(cls) memo[id(self)] = result for k, v in self.__dict__.items(): if k == "dataframe_dict": # Copy the DataFrames, retaining Woodwork typing information copied_attr = copy.copy(v) for df_name, df in copied_attr.items(): copied_attr[df_name] = df.ww.copy() else: copied_attr = copy.deepcopy(v, memo) setattr(result, k, copied_attr) for df in result.dataframe_dict.values(): result._add_references_to_metadata(df) return result @property def dataframes(self): return list(self.dataframe_dict.values()) @property def metadata(self): """Returns the metadata for this EntitySet. The metadata will be recomputed if it does not exist.""" if self._data_description is None: description = serialize.entityset_to_description(self) self._data_description = deserialize.description_to_entityset(description) return self._data_description def reset_data_description(self): self._data_description = None def to_pickle(self, path, compression=None, profile_name=None): """Write entityset in the pickle format, location specified by `path`. Path could be a local path or an S3 path. If writing to S3 a tar archive of files will be written. 
Args: path (str): Location on disk to write to (will be created as a directory) compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}. profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None. """ serialize.write_data_description( self, path, format="pickle", compression=compression, profile_name=profile_name, ) return self def to_parquet(self, path, engine="auto", compression=None, profile_name=None): """Write entityset to disk in the parquet format, location specified by `path`. Path could be a local path or an S3 path. If writing to S3, a tar archive of files will be written. Args: path (str): Location on disk to write to (will be created as a directory) engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow'}. compression (str) : Name of the compression to use. Possible values are: {'snappy', 'gzip', 'brotli', None}. profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None. """ serialize.write_data_description( self, path, format="parquet", engine=engine, compression=compression, profile_name=profile_name, ) return self def to_csv( self, path, sep=",", encoding="utf-8", engine="python", compression=None, profile_name=None, ): """Write entityset to disk in the csv format, location specified by `path`. Path could be a local path or an S3 path. If writing to S3, a tar archive of files will be written. Args: path (str) : Location on disk to write to (will be created as a directory) sep (str) : String of length 1. Field delimiter for the output file. encoding (str) : A string representing the encoding to use in the output file, defaults to 'utf-8'. engine (str) : Name of the engine to use. Possible values are: {'c', 'python'}. compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None. """ serialize.write_data_description( self, path, format="csv", index=False, sep=sep, encoding=encoding, engine=engine, compression=compression, profile_name=profile_name, ) return self def to_dictionary(self): return serialize.entityset_to_description(self) ########################################################################### # Public getter/setter methods ######################################### ########################################################################### def __repr__(self): repr_out = "Entityset: {}\n".format(self.id) repr_out += " DataFrames:" for df in self.dataframes: if df.shape: repr_out += "\n {} [Rows: {}, Columns: {}]".format( df.ww.name, df.shape[0], df.shape[1], ) else: repr_out += "\n {} [Rows: None, Columns: None]".format(df.ww.name) repr_out += "\n Relationships:" if len(self.relationships) == 0: repr_out += "\n No relationships" for r in self.relationships: repr_out += "\n %s.%s -> %s.%s" % ( r._child_dataframe_name, r._child_column_name, r._parent_dataframe_name, r._parent_column_name, ) return repr_out def add_relationships(self, relationships): """Add multiple new relationships to an entityset Args: relationships (list[tuple(str, str, str, str)] or list[Relationship]) : List of new relationships to add. Relationships are specified either as a :class:`.Relationship` object or a four-element tuple identifying the parent and child columns: (parent_dataframe_name, parent_column_name, child_dataframe_name, child_column_name) """ for rel in relationships: if isinstance(rel, Relationship): self.add_relationship(relationship=rel) else: self.add_relationship(*rel) return self def add_relationship( self, parent_dataframe_name=None, parent_column_name=None, child_dataframe_name=None, child_column_name=None, relationship=None, ): """Add a new relationship between dataframes in the entityset.
Relationships can be specified by passing dataframe and column names or by passing a :class:`.Relationship` object. Args: parent_dataframe_name (str): Name of the parent dataframe in the EntitySet. Must be specified if relationship is not. parent_column_name (str): Name of the parent column. Must be specified if relationship is not. child_dataframe_name (str): Name of the child dataframe in the EntitySet. Must be specified if relationship is not. child_column_name (str): Name of the child column. Must be specified if relationship is not. relationship (Relationship): Instance of new relationship to be added. Must be specified if dataframe and column names are not supplied. """ if relationship and ( parent_dataframe_name or parent_column_name or child_dataframe_name or child_column_name ): raise ValueError( "Cannot specify dataframe and column name values and also supply a Relationship", ) if not relationship: relationship = Relationship( self, parent_dataframe_name, parent_column_name, child_dataframe_name, child_column_name, ) if relationship in self.relationships: warnings.warn("Not adding duplicate relationship: " + str(relationship)) return self # _operations? # this is a new pair of dataframes child_df = relationship.child_dataframe child_column = relationship._child_column_name if child_df.ww.index == child_column: msg = "Unable to add relationship because child column '{}' in '{}' is also its index" raise ValueError(msg.format(child_column, child_df.ww.name)) parent_df = relationship.parent_dataframe parent_column = relationship._parent_column_name if parent_df.ww.index != parent_column: parent_df.ww.set_index(parent_column) # Empty dataframes (as a result of accessing metadata) # default to object dtypes for categorical columns, but # indexes/foreign keys default to ints.
In this case, we convert # the empty column's type to int if ( child_df.empty and child_df[child_column].dtype == object and parent_df.ww.columns[parent_column].is_numeric ): child_df.ww[child_column] = pd.Series(name=child_column, dtype=np.int64) parent_ltype = parent_df.ww.logical_types[parent_column] child_ltype = child_df.ww.logical_types[child_column] if parent_ltype != child_ltype: difference_msg = "" if str(parent_ltype) == str(child_ltype): difference_msg = "There is a conflict between the parameters. " warnings.warn( f"Logical type {child_ltype} for child column {child_column} does not match " f"parent column {parent_column} logical type {parent_ltype}. {difference_msg}" "Changing child logical type to match parent.", ) child_df.ww.set_types(logical_types={child_column: parent_ltype}) if "foreign_key" not in child_df.ww.semantic_tags[child_column]: child_df.ww.add_semantic_tags({child_column: "foreign_key"}) self.relationships.append(relationship) self.reset_data_description() return self def set_secondary_time_index(self, dataframe_name, secondary_time_index): """ Set the secondary time index for a dataframe in the EntitySet using its dataframe name. Args: dataframe_name (str) : name of the dataframe for which to set the secondary time index. secondary_time_index (dict[str-> list[str]]): Name of column containing time data to be used as a secondary time index mapped to a list of the columns in the dataframe associated with that secondary time index. 
""" dataframe = self[dataframe_name] self._set_secondary_time_index(dataframe, secondary_time_index) def _set_secondary_time_index(self, dataframe, secondary_time_index): """Sets the secondary time index for a Woodwork dataframe passed in""" assert ( dataframe.ww.schema is not None ), "Cannot set secondary time index if Woodwork is not initialized" self._check_secondary_time_index(dataframe, secondary_time_index) if secondary_time_index is not None: dataframe.ww.metadata["secondary_time_index"] = secondary_time_index ########################################################################### # Relationship access/helper methods ################################### ########################################################################### def find_forward_paths(self, start_dataframe_name, goal_dataframe_name): """ Generator which yields all forward paths between a start and goal dataframe. Does not include paths which contain cycles. Args: start_dataframe_name (str) : name of dataframe to start the search from goal_dataframe_name (str) : name of dataframe to find forward path to See Also: :func:`BaseEntitySet.find_backward_paths` """ for sub_dataframe_name, path in self._forward_dataframe_paths( start_dataframe_name, ): if sub_dataframe_name == goal_dataframe_name: yield path def find_backward_paths(self, start_dataframe_name, goal_dataframe_name): """ Generator which yields all backward paths between a start and goal dataframe. Does not include paths which contain cycles. Args: start_dataframe_name (str) : Name of dataframe to start the search from. goal_dataframe_name (str) : Name of dataframe to find backward path to. 
See Also: :func:`EntitySet.find_forward_paths` """ for path in self.find_forward_paths(goal_dataframe_name, start_dataframe_name): # Reverse path yield path[::-1] def _forward_dataframe_paths(self, start_dataframe_name, seen_dataframes=None): """ Generator which yields the names of all dataframes connected through forward relationships, and the path taken to each. A dataframe will be yielded multiple times if there are multiple paths to it. Implemented using depth-first search. """ if seen_dataframes is None: seen_dataframes = set() if start_dataframe_name in seen_dataframes: return seen_dataframes.add(start_dataframe_name) yield start_dataframe_name, [] for relationship in self.get_forward_relationships(start_dataframe_name): next_dataframe = relationship._parent_dataframe_name # Copy seen dataframes for each next node to allow multiple paths (but # not cycles). descendants = self._forward_dataframe_paths( next_dataframe, seen_dataframes.copy(), ) for sub_dataframe_name, sub_path in descendants: yield sub_dataframe_name, [relationship] + sub_path def get_forward_dataframes(self, dataframe_name, deep=False): """ Get dataframes that are in a forward relationship with dataframe Args: dataframe_name (str): Name of dataframe to search from. deep (bool): If True, recursively find forward dataframes. Yields a tuple of (descendant_name, path from dataframe_name to descendant).
""" for relationship in self.get_forward_relationships(dataframe_name): parent_dataframe_name = relationship._parent_dataframe_name direct_path = RelationshipPath([(True, relationship)]) yield parent_dataframe_name, direct_path if deep: sub_dataframes = self.get_forward_dataframes( parent_dataframe_name, deep=True, ) for sub_dataframe_name, path in sub_dataframes: yield sub_dataframe_name, direct_path + path def get_backward_dataframes(self, dataframe_name, deep=False): """ Get dataframes that are in a backward relationship with dataframe Args: dataframe_name (str): Name of dataframe to search from. deep (bool): if True, recursively find backward dataframes. Yields a tuple of (descendent_name, path from dataframe_name to descendant). """ for relationship in self.get_backward_relationships(dataframe_name): child_dataframe_name = relationship._child_dataframe_name direct_path = RelationshipPath([(False, relationship)]) yield child_dataframe_name, direct_path if deep: sub_dataframes = self.get_backward_dataframes( child_dataframe_name, deep=True, ) for sub_dataframe_name, path in sub_dataframes: yield sub_dataframe_name, direct_path + path def get_forward_relationships(self, dataframe_name): """Get relationships where dataframe "dataframe_name" is the child Args: dataframe_name (str): Name of dataframe to get relationships for. Returns: list[:class:`.Relationship`]: List of forward relationships. """ return [ r for r in self.relationships if r._child_dataframe_name == dataframe_name ] def get_backward_relationships(self, dataframe_name): """ get relationships where dataframe "dataframe_name" is the parent. Args: dataframe_name (str): Name of dataframe to get relationships for. Returns: list[:class:`.Relationship`]: list of backward relationships """ return [ r for r in self.relationships if r._parent_dataframe_name == dataframe_name ] def has_unique_forward_path(self, start_dataframe_name, end_dataframe_name): """ Is the forward path from start to end unique? 
This will raise if there is no such path. """ paths = self.find_forward_paths(start_dataframe_name, end_dataframe_name) next(paths) second_path = next(paths, None) return not second_path ########################################################################### # DataFrame creation methods ############################################## ########################################################################### def add_dataframe( self, dataframe, dataframe_name=None, index=None, logical_types=None, semantic_tags=None, make_index=False, time_index=None, secondary_time_index=None, already_sorted=False, ): """ Add a DataFrame to the EntitySet with Woodwork typing information. Args: dataframe (pandas.DataFrame) : Dataframe containing the data. dataframe_name (str, optional) : Unique name to associate with this dataframe. Must be provided if Woodwork is not initialized on the input DataFrame. index (str, optional): Name of the column used to index the dataframe. Must be unique. If None, take the first column. logical_types (dict[str -> Woodwork.LogicalTypes/str, optional]): Keys are column names and values are logical types. Will be inferred if not specified. semantic_tags (dict[str -> str/set], optional): Keys are column names and values are semantic tags. make_index (bool, optional) : If True, assume index does not exist as a column in dataframe, and create a new column of that name using integers. Otherwise, assume index exists. time_index (str, optional): Name of the column containing time data. Type must be numeric or datetime in nature. secondary_time_index (dict[str -> list[str]]): Name of column containing time data to be used as a secondary time index mapped to a list of the columns in the dataframe associated with that secondary time index. already_sorted (bool, optional) : If True, assumes that input dataframe is already sorted by time. Defaults to False. Notes: Will infer logical types from the data. Example: .. 
ipython:: python import featuretools as ft import pandas as pd transactions_df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "session_id": [1, 2, 1, 3, 4, 5], "amount": [100.40, 20.63, 33.32, 13.12, 67.22, 1.00], "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s"), "fraud": [True, False, True, False, True, True]}) es = ft.EntitySet("example") es.add_dataframe(dataframe_name="transactions", index="id", time_index="transaction_time", dataframe=transactions_df) es["transactions"] """ logical_types = logical_types or {} semantic_tags = semantic_tags or {} if len(self.dataframes) > 0: if not isinstance(dataframe, type(self.dataframes[0])): raise ValueError( "All dataframes must be of the same type. " "Cannot add dataframe of type {} to an entityset with existing dataframes " "of type {}".format(type(dataframe), type(self.dataframes[0])), ) # Only allow string column names non_string_names = [ name for name in dataframe.columns if not isinstance(name, str) ] if non_string_names: raise ValueError( "All column names must be strings (Columns {} " "are not strings)".format(non_string_names), ) if dataframe.ww.schema is None: if dataframe_name is None: raise ValueError( "Cannot add dataframe to EntitySet without a name. 
" "Please provide a value for the dataframe_name parameter.", ) index_was_created, index, dataframe = _get_or_create_index( index, make_index, dataframe, ) dataframe.ww.init( name=dataframe_name, index=index, time_index=time_index, logical_types=logical_types, semantic_tags=semantic_tags, already_sorted=already_sorted, ) if index_was_created: dataframe.ww.metadata["created_index"] = index else: if dataframe.ww.name is None: raise ValueError( "Cannot add a Woodwork DataFrame to EntitySet without a name", ) if dataframe.ww.index is None: raise ValueError( "Cannot add Woodwork DataFrame to EntitySet without index", ) extra_params = [] if index is not None: extra_params.append("index") if time_index is not None: extra_params.append("time_index") if logical_types: extra_params.append("logical_types") if make_index: extra_params.append("make_index") if semantic_tags: extra_params.append("semantic_tags") if already_sorted: extra_params.append("already_sorted") if dataframe_name is not None and dataframe_name != dataframe.ww.name: extra_params.append("dataframe_name") if extra_params: warnings.warn( "A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: " + ", ".join(extra_params), ) if dataframe.ww.time_index is not None: self._check_uniform_time_index(dataframe) self._check_secondary_time_index(dataframe) if secondary_time_index: self._set_secondary_time_index( dataframe, secondary_time_index=secondary_time_index, ) dataframe = self._normalize_values(dataframe) self.dataframe_dict[dataframe.ww.name] = dataframe self.reset_data_description() self._add_references_to_metadata(dataframe) return self def __setitem__(self, key, value): self.add_dataframe(dataframe=value, dataframe_name=key) def normalize_dataframe( self, base_dataframe_name, new_dataframe_name, index, additional_columns=None, copy_columns=None, make_time_index=None, make_secondary_time_index=None, new_dataframe_time_index=None, new_dataframe_secondary_time_index=None, ): 
"""Create a new dataframe and relationship from unique values of an existing column. Args: base_dataframe_name (str) : Dataframe name from which to split. new_dataframe_name (str): Name of the new dataframe. index (str): Column in old dataframe that will become index of new dataframe. Relationship will be created across this column. additional_columns (list[str]): List of column names to remove from base_dataframe and move to new dataframe. copy_columns (list[str]): List of column names to copy from old dataframe and move to new dataframe. make_time_index (bool or str, optional): Create time index for new dataframe based on time index in base_dataframe, optionally specifying which column in base_dataframe to use for time_index. If specified as True without a specific column name, uses the primary time index. Defaults to True if base dataframe has a time index. make_secondary_time_index (dict[str -> list[str]], optional): Create a secondary time index from key. Values of dictionary are the columns to associate with a secondary time index. Only one secondary time index is allowed. If None, only associate the time index. new_dataframe_time_index (str, optional): Rename new dataframe time index. new_dataframe_secondary_time_index (str, optional): Rename new dataframe secondary time index. """ base_dataframe = self.dataframe_dict[base_dataframe_name] additional_columns = additional_columns or [] copy_columns = copy_columns or [] for list_name, col_list in { "copy_columns": copy_columns, "additional_columns": additional_columns, }.items(): if not isinstance(col_list, list): raise TypeError( "'{}' must be a list, but received type {}".format( list_name, type(col_list), ), ) if len(col_list) != len(set(col_list)): raise ValueError( f"'{list_name}' contains duplicate columns. 
All columns must be unique.", ) for col_name in col_list: if col_name == index: raise ValueError( "Not adding {} as both index and column in {}".format( col_name, list_name, ), ) for col in additional_columns: if col == base_dataframe.ww.time_index: raise ValueError( "Not moving {} as it is the base time index column. Perhaps, move the column to the copy_columns.".format( col, ), ) if isinstance(make_time_index, str): if make_time_index not in base_dataframe.columns: raise ValueError( "'make_time_index' must be a column in the base dataframe", ) elif make_time_index not in additional_columns + copy_columns: raise ValueError( "'make_time_index' must be specified in 'additional_columns' or 'copy_columns'", ) if index == base_dataframe.ww.index: raise ValueError( "'index' must be different from the index column of the base dataframe", ) transfer_types = {} # Types will be a tuple of (logical_type, semantic_tags, column_metadata, column_description) transfer_types[index] = ( base_dataframe.ww.logical_types[index], base_dataframe.ww.semantic_tags[index], base_dataframe.ww.columns[index].metadata, base_dataframe.ww.columns[index].description, ) for col_name in additional_columns + copy_columns: # Remove any existing time index tags transfer_types[col_name] = ( base_dataframe.ww.logical_types[col_name], (base_dataframe.ww.semantic_tags[col_name] - {"time_index"}), base_dataframe.ww.columns[col_name].metadata, base_dataframe.ww.columns[col_name].description, ) # create and add new dataframe new_dataframe = self[base_dataframe_name].copy() if make_time_index is None and base_dataframe.ww.time_index is not None: make_time_index = True if isinstance(make_time_index, str): # Set the new time index to make_time_index. base_time_index = make_time_index new_dataframe_time_index = make_time_index already_sorted = new_dataframe_time_index == base_dataframe.ww.time_index elif make_time_index: # Create a new time index based on the base dataframe time index. 
base_time_index = base_dataframe.ww.time_index if new_dataframe_time_index is None: new_dataframe_time_index = "first_%s_time" % (base_dataframe.ww.name) already_sorted = True assert ( base_dataframe.ww.time_index is not None ), "Base dataframe doesn't have time_index defined" if base_time_index not in [col for col in copy_columns]: copy_columns.append(base_time_index) time_index_types = ( base_dataframe.ww.logical_types[base_dataframe.ww.time_index], base_dataframe.ww.semantic_tags[base_dataframe.ww.time_index], base_dataframe.ww.columns[base_dataframe.ww.time_index].metadata, base_dataframe.ww.columns[base_dataframe.ww.time_index].description, ) else: # If base_time_index is in copy_columns then we've already added the transfer types # but since we're changing the name, we have to remove it time_index_types = transfer_types[base_dataframe.ww.time_index] del transfer_types[base_dataframe.ww.time_index] transfer_types[new_dataframe_time_index] = time_index_types else: new_dataframe_time_index = None already_sorted = False if new_dataframe_time_index is not None and new_dataframe_time_index == index: raise ValueError( "time_index and index cannot be the same value, %s" % (new_dataframe_time_index), ) selected_columns = ( [index] + [col for col in additional_columns] + [col for col in copy_columns] ) new_dataframe = new_dataframe.dropna(subset=[index]) new_dataframe2 = new_dataframe.drop_duplicates(index, keep="first")[ selected_columns ] if make_time_index: new_dataframe2 = new_dataframe2.rename( columns={base_time_index: new_dataframe_time_index}, ) if make_secondary_time_index: assert ( len(make_secondary_time_index) == 1 ), "Can only provide 1 secondary time index" secondary_time_index = list(make_secondary_time_index.keys())[0] secondary_columns = [index, secondary_time_index] + list( make_secondary_time_index.values(), )[0] secondary_df = new_dataframe.drop_duplicates(index, keep="last")[ secondary_columns ] if new_dataframe_secondary_time_index: secondary_df = 
secondary_df.rename( columns={secondary_time_index: new_dataframe_secondary_time_index}, ) secondary_time_index = new_dataframe_secondary_time_index else: new_dataframe_secondary_time_index = secondary_time_index secondary_df = secondary_df.set_index(index) new_dataframe = new_dataframe2.join(secondary_df, on=index) else: new_dataframe = new_dataframe2 base_dataframe_index = index if make_secondary_time_index: old_ti_name = list(make_secondary_time_index.keys())[0] ti_cols = list(make_secondary_time_index.values())[0] ti_cols = [c if c != old_ti_name else secondary_time_index for c in ti_cols] make_secondary_time_index = {secondary_time_index: ti_cols} # will initialize Woodwork on this DataFrame logical_types = {} semantic_tags = {} column_metadata = {} column_descriptions = {} for col_name, (ltype, tags, metadata, description) in transfer_types.items(): logical_types[col_name] = ltype semantic_tags[col_name] = tags - {"time_index"} column_metadata[col_name] = copy.deepcopy(metadata) column_descriptions[col_name] = description new_dataframe.ww.init( name=new_dataframe_name, index=index, already_sorted=already_sorted, time_index=new_dataframe_time_index, logical_types=logical_types, semantic_tags=semantic_tags, column_metadata=column_metadata, column_descriptions=column_descriptions, ) self.add_dataframe( new_dataframe, secondary_time_index=make_secondary_time_index, ) self.dataframe_dict[base_dataframe_name] = self.dataframe_dict[ base_dataframe_name ].ww.drop(additional_columns) self.dataframe_dict[base_dataframe_name].ww.add_semantic_tags( {base_dataframe_index: "foreign_key"}, ) self.add_relationship( new_dataframe_name, index, base_dataframe_name, base_dataframe_index, ) self.reset_data_description() return self # ########################################################################### # # Data wrangling methods ############################################### # ########################################################################### def concat(self, 
other, inplace=False): """Combine entityset with another to create a new entityset with the combined data of both entitysets. """ if not self.__eq__(other): raise ValueError( "Entitysets must have the same dataframes, relationships" ", and column names", ) if inplace: combined_es = self else: combined_es = copy.deepcopy(self) has_last_time_index = [] for df in self.dataframes: self_df = df other_df = other[df.ww.name] combined_df = pd.concat([self_df, other_df]) # If either DataFrame has a created index, there will likely # be overlap in the index column, so we deduplicate on the # other columns (excluding the index and time index) if self_df.ww.metadata.get("created_index") or other_df.ww.metadata.get( "created_index", ): columns = [ col for col in combined_df.columns if col != df.ww.index and col != df.ww.time_index ] else: columns = [df.ww.index] combined_df.drop_duplicates(columns, inplace=True) self_lti_col = df.ww.metadata.get("last_time_index") other_lti_col = other[df.ww.name].ww.metadata.get("last_time_index") if self_lti_col is not None or other_lti_col is not None: has_last_time_index.append(df.ww.name) combined_es.replace_dataframe( dataframe_name=df.ww.name, df=combined_df, recalculate_last_time_indexes=False, already_sorted=False, ) if has_last_time_index: combined_es.add_last_time_indexes(updated_dataframes=has_last_time_index) combined_es.reset_data_description() return combined_es ########################################################################### # Indexing methods ############################################### ########################################################################### def add_last_time_indexes(self, updated_dataframes=None): """ Calculates the last time index values for each dataframe (the last time an instance or children of that instance were observed). Used when calculating features using training windows. Adds the last time index as a series named _ft_last_time on the dataframe.
Args: updated_dataframes (list[str]): List of dataframe names to update last_time_index for (will update all parents of those dataframes as well) """ # Generate graph of dataframes to find leaf dataframes children = defaultdict(list) # parent --> child mapping child_cols = defaultdict(dict) for r in self.relationships: children[r._parent_dataframe_name].append(r.child_dataframe) child_cols[r._parent_dataframe_name][r._child_dataframe_name] = ( r.child_column ) updated_dataframes = updated_dataframes or [] if updated_dataframes: # find parents of updated_dataframes parent_queue = updated_dataframes[:] parents = set() while len(parent_queue): df_name = parent_queue.pop(0) if df_name in parents: continue parents.add(df_name) for parent_name, _ in self.get_forward_dataframes(df_name): parent_queue.append(parent_name) queue = [self[p] for p in parents] to_explore = parents else: to_explore = set(self.dataframe_dict.keys()) queue = self.dataframes[:] explored = set() # Store the last time indexes for the entire entityset in a dictionary to update es_lti_dict = {} for df in self.dataframes: lti_col = df.ww.metadata.get("last_time_index") if lti_col is not None: lti_col = df[lti_col] es_lti_dict[df.ww.name] = lti_col for df in queue: es_lti_dict[df.ww.name] = None # We will explore children of dataframes on the queue, # which may not be in the to_explore set. 
Therefore, # we check whether all elements of to_explore are in # explored, rather than just comparing length while not to_explore.issubset(explored): dataframe = queue.pop(0) if es_lti_dict[dataframe.ww.name] is None: if dataframe.ww.time_index is not None: lti = dataframe[dataframe.ww.time_index].copy() else: lti = dataframe.ww[dataframe.ww.index].copy() # Cannot have a category dtype with nans when calculating last time index lti = lti.astype("object") lti[:] = None es_lti_dict[dataframe.ww.name] = lti if dataframe.ww.name in children: child_dataframes = children[dataframe.ww.name] # if all children not explored, skip for now if not set([df.ww.name for df in child_dataframes]).issubset(explored): # Now there is a possibility that a child dataframe # was not explicitly provided in updated_dataframes, # and never made it onto the queue. If updated_dataframes # is None then we just load all dataframes onto the queue # so we didn't need this logic for df in child_dataframes: if df.ww.name not in explored and df.ww.name not in [ q.ww.name for q in queue ]: # must also reset last time index here es_lti_dict[df.ww.name] = None queue.append(df) queue.append(dataframe) continue # updated last time from all children for child_df in child_dataframes: if es_lti_dict[child_df.ww.name] is None: continue link_col = child_cols[dataframe.ww.name][child_df.ww.name].name lti_df = pd.DataFrame( { "last_time": es_lti_dict[child_df.ww.name], dataframe.ww.index: child_df[link_col], }, ) # sort by time and keep only the most recent lti_df.sort_values( ["last_time", dataframe.ww.index], kind="mergesort", inplace=True, ) lti_df.drop_duplicates( dataframe.ww.index, keep="last", inplace=True, ) lti_df.set_index(dataframe.ww.index, inplace=True) lti_df = lti_df.reindex(es_lti_dict[dataframe.ww.name].index) lti_df["last_time_old"] = es_lti_dict[dataframe.ww.name] if lti_df.empty: # Pandas errors out if it tries to do fillna and then max on an empty dataframe lti_df = pd.Series([], 
dtype="object") else: lti_df["last_time"] = lti_df["last_time"].astype( "datetime64[ns]", ) lti_df["last_time_old"] = lti_df["last_time_old"].astype( "datetime64[ns]", ) lti_df = lti_df.fillna( pd.to_datetime("1800-01-01 00:00"), ).max(axis=1) lti_df = lti_df.replace( pd.to_datetime("1800-01-01 00:00"), pd.NaT, ) es_lti_dict[dataframe.ww.name] = lti_df es_lti_dict[dataframe.ww.name].name = "last_time" explored.add(dataframe.ww.name) # Store the last time index on the DataFrames dfs_to_update = {} for df in self.dataframes: lti = es_lti_dict[df.ww.name] if lti is not None: if self.time_type == "numeric": if lti.dtype == "datetime64[ns]": # Woodwork cannot convert from datetime to numeric lti = lti.apply(lambda x: x.value) lti = init_series(lti, logical_type="Double") else: lti = init_series(lti, logical_type="Datetime") lti.name = LTI_COLUMN_NAME if LTI_COLUMN_NAME in df.columns: if "last_time_index" in df.ww.semantic_tags[LTI_COLUMN_NAME]: # Remove any previous last time index placed by featuretools df.ww.pop(LTI_COLUMN_NAME) else: raise ValueError( "Cannot add a last time index on DataFrame with an existing " f"'{LTI_COLUMN_NAME}' column. 
Please rename '{LTI_COLUMN_NAME}'.", ) # Add the new column to the DataFrame df.ww[LTI_COLUMN_NAME] = lti if "last_time_index" not in df.ww.semantic_tags[LTI_COLUMN_NAME]: df.ww.add_semantic_tags({LTI_COLUMN_NAME: "last_time_index"}) df.ww.metadata["last_time_index"] = LTI_COLUMN_NAME for df in dfs_to_update.values(): df.ww.add_semantic_tags({LTI_COLUMN_NAME: "last_time_index"}) df.ww.metadata["last_time_index"] = LTI_COLUMN_NAME self.dataframe_dict[df.ww.name] = df self.reset_data_description() for df in self.dataframes: self._add_references_to_metadata(df) # ########################################################################### # # Pickling ############################################### # ########################################################################### def __getstate__(self): return { **self.__dict__, WW_SCHEMA_KEY: { df_name: df.ww.schema for df_name, df in self.dataframe_dict.items() }, } def __setstate__(self, state): ww_schemas = state.pop(WW_SCHEMA_KEY) for df_name, df in state.get("dataframe_dict", {}).items(): if ww_schemas[df_name] is not None: df.ww.init(schema=ww_schemas[df_name], validate=False) self.__dict__.update(state) # ########################################################################### # # Other ############################################### # ########################################################################### def add_interesting_values( self, max_values=5, verbose=False, dataframe_name=None, values=None, ): """Find or set interesting values for categorical columns, to be used to generate "where" clauses Args: max_values (int) : Maximum number of values per column to add. verbose (bool) : If True, print summary of interesting values found. dataframe_name (str) : The dataframe in the EntitySet for which to add interesting values. If not specified interesting values will be added for all dataframes. values (dict): A dictionary mapping column names to the interesting values to set for the column. 
If specified, a corresponding dataframe_name must also be provided. If not specified, interesting values will be set for all eligible columns. If values are specified, max_values and verbose parameters will be ignored. Returns: None """ if dataframe_name is None and values is not None: raise ValueError("dataframe_name must be specified if values are provided") if dataframe_name is not None and values is not None: for column, vals in values.items(): self[dataframe_name].ww.columns[column].metadata[ "interesting_values" ] = vals return if dataframe_name: dataframes = [self[dataframe_name]] else: dataframes = self.dataframes def add_value(df, col, val, verbose): if verbose: msg = "Column {}: Marking {} as an interesting value" logger.info(msg.format(col, val)) interesting_vals = df.ww.columns[col].metadata.get("interesting_values", []) interesting_vals.append(val) df.ww.columns[col].metadata["interesting_values"] = interesting_vals for df in dataframes: value_counts = df.ww.value_counts(top_n=max(25, max_values), dropna=True) total_count = len(df) for col, counts in value_counts.items(): if {"index", "foreign_key"}.intersection(df.ww.semantic_tags[col]): continue for i in range(min(max_values, len(counts))): # Categorical columns will include counts of 0 for all values # in categories. Stop when we encounter a 0 count. if counts[i]["count"] == 0: break if len(counts) < 25: value = counts[i]["value"] add_value(df, col, value, verbose) else: fraction = counts[i]["count"] / total_count if fraction > 0.05 and fraction < 0.95: value = counts[i]["value"] add_value(df, col, value, verbose) else: break self.reset_data_description() def plot(self, to_file=None): """ Create a UML diagram-ish graph of the EntitySet. Args: to_file (str, optional) : Path to where the plot should be saved. If set to None (as by default), the plot will not be saved. Returns: graphviz.Digraph : Graph object that can directly be displayed in Jupyter notebooks. 
Nodes of the graph correspond to the DataFrames in the EntitySet, showing the typing information for each column. Note: The typing information displayed for each column is based off of the Woodwork ColumnSchema for that column and is represented as ``LogicalType; semantic_tags``, but the standard semantic tags have been removed for brevity. """ graphviz = check_graphviz() format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file) # Initialize a new directed graph graph = graphviz.Digraph( self.id, format=format_, graph_attr={"splines": "ortho"}, ) # Draw dataframes for df in self.dataframes: column_typing_info = [] for col_name, col_schema in df.ww.columns.items(): col_string = col_name + " : " + str(col_schema.logical_type) tags = col_schema.semantic_tags - col_schema.logical_type.standard_tags if tags: col_string += "; " col_string += ", ".join(tags) column_typing_info.append(col_string) columns_string = "\l".join(column_typing_info) # noqa: W605 nrows = df.shape[0] label = "{%s (%d row%s)|%s\l}" % ( # noqa: W605 df.ww.name, nrows, "s" * (nrows > 1), columns_string, ) graph.node(df.ww.name, shape="record", label=label) # Draw relationships for rel in self.relationships: # Display the key only once if is the same for both related dataframes if rel._parent_column_name == rel._child_column_name: label = rel._parent_column_name else: label = "%s -> %s" % (rel._parent_column_name, rel._child_column_name) graph.edge( rel._child_dataframe_name, rel._parent_dataframe_name, xlabel=label, ) if to_file: save_graph(graph, to_file, format_) return graph def _handle_time( self, dataframe_name, df, time_last=None, training_window=None, include_cutoff_time=True, ): """ Filter a dataframe for all instances before time_last. If the dataframe does not have a time index, return the original dataframe. 
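The filtering rules can be restated per row. The following is a minimal, standard-library sketch and not the library's implementation: `keep_row` and its `last_time` argument are hypothetical names introduced for illustration, and it assumes datetime-typed time indexes.

```python
from datetime import datetime, timedelta


def keep_row(time, time_last, training_window=None,
             include_cutoff_time=True, last_time=None):
    # Hypothetical helper: decides whether one row survives the filter.
    # Rule 1: drop rows after the cutoff (or at it, if the cutoff is excluded).
    if include_cutoff_time:
        if time > time_last:
            return False
    elif time >= time_last:
        return False

    if training_window is None:
        return True

    # Rule 2: keep rows whose time index OR last time index falls in the window.
    window_start = time_last - training_window
    if include_cutoff_time:
        in_window = time > window_start
        lti_in_window = last_time is not None and last_time > window_start
    else:
        in_window = time >= window_start
        lti_in_window = last_time is not None and last_time >= window_start
    return in_window or lti_in_window


cutoff = datetime(2020, 1, 10)
window = timedelta(days=3)
print(keep_row(datetime(2020, 1, 9), cutoff, window))
print(keep_row(datetime(2020, 1, 1), cutoff, window,
               last_time=datetime(2020, 1, 9)))
```

Note that the window boundary is exclusive when the cutoff time is included and inclusive when it is excluded, so the window always spans the same duration.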
""" schema = self[dataframe_name].ww.schema if schema.time_index: df_empty = df.empty if time_last is not None and not df_empty: if include_cutoff_time: df = df[df[schema.time_index] <= time_last] else: df = df[df[schema.time_index] < time_last] if training_window is not None: training_window = _check_timedelta(training_window) if include_cutoff_time: mask = df[schema.time_index] > time_last - training_window else: mask = df[schema.time_index] >= time_last - training_window lti_col = schema.metadata.get("last_time_index") if lti_col is not None: if include_cutoff_time: lti_mask = df[lti_col] > time_last - training_window else: lti_mask = df[lti_col] >= time_last - training_window mask = mask | lti_mask else: warnings.warn( "Using training_window but last_time_index is " "not set for dataframe %s" % (dataframe_name), ) df = df[mask] secondary_time_indexes = schema.metadata.get("secondary_time_index") or {} for secondary_time_index, columns in secondary_time_indexes.items(): # should we use ignore time last here? if time_last is not None and not df.empty: mask = df[secondary_time_index] >= time_last df.loc[mask, columns] = np.nan return df def query_by_values( self, dataframe_name, instance_vals, column_name=None, columns=None, time_last=None, training_window=None, include_cutoff_time=True, ): """Query instances that have column with given value Args: dataframe_name (str): The id of the dataframe to query instance_vals (pd.Dataframe, pd.Series, list[str] or str) : Instance(s) to match. column_name (str) : Column to query on. If None, query on index. columns (list[str]) : Columns to return. Return all columns if None. time_last (pd.TimeStamp) : Query data up to and including this time. Only applies if dataframe has a time index. training_window (Timedelta, optional): Window defining how much time before the cutoff time data can be used when calculating features. If None, all data before cutoff time is used. 
include_cutoff_time (bool): If True, data at cutoff time are included in calculating features Returns: pd.DataFrame : instances that match constraints with ids in order of underlying dataframe """ dataframe = self[dataframe_name] if not column_name: column_name = dataframe.ww.index instance_vals = _vals_to_series(instance_vals, column_name) training_window = _check_timedelta(training_window) if training_window is not None: assert ( training_window.has_no_observations() ), "Training window cannot be in observations" if instance_vals is None: df = dataframe.copy() elif isinstance(instance_vals, pd.Series) and instance_vals.empty: df = dataframe.head(0) else: df = dataframe[dataframe[column_name].isin(instance_vals)] df = df.set_index(dataframe.ww.index, drop=False) # ensure filtered df has same categories as original # workaround for issue below # github.com/pandas-dev/pandas/issues/22501#issuecomment-415982538 # # Pandas claims that bug is fixed but it still shows up in some # cases. More investigation needed. if dataframe.ww.columns[column_name].is_categorical: categories = pd.api.types.CategoricalDtype( categories=dataframe[column_name].cat.categories, ) df[column_name] = df[column_name].astype(categories) df = self._handle_time( dataframe_name=dataframe_name, df=df, time_last=time_last, training_window=training_window, include_cutoff_time=include_cutoff_time, ) if columns is not None: df = df[columns] return df def replace_dataframe( self, dataframe_name, df, already_sorted=False, recalculate_last_time_indexes=True, ): """Replace the internal dataframe of an EntitySet table, keeping Woodwork typing information the same. Optionally makes sure that data is sorted, that reference indexes to other dataframes are consistent, and that last_time_indexes are updated to reflect the new data. If an index was created for the original dataframe and is not present on the new dataframe, an index column of the same name will be added to the new dataframe. 
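The column bookkeeping described above can be sketched without pandas. This is an illustrative sketch under stated assumptions, not the method itself: `reconcile_columns` is a hypothetical helper that works on plain lists of column names.

```python
def reconcile_columns(old_columns, new_columns, created_index=None):
    # Hypothetical helper mirroring the checks on column names only.
    new_columns = list(new_columns)

    # Re-create a previously generated index column if the new data lacks it.
    if created_index is not None and created_index not in new_columns:
        new_columns.insert(0, created_index)

    # The replacement must carry exactly the same set of columns.
    if len(new_columns) != len(old_columns):
        raise ValueError(
            "New dataframe contains {} columns, expecting {}".format(
                len(new_columns), len(old_columns),
            ),
        )
    for name in old_columns:
        if name not in new_columns:
            raise ValueError("New dataframe is missing column {}".format(name))

    # Column order always follows the original dataframe.
    return list(old_columns)


print(reconcile_columns(["id", "a", "b"], ["b", "a"], created_index="id"))
```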
""" if not isinstance(df, type(self[dataframe_name])): raise TypeError("Incorrect DataFrame type used") # If the original DataFrame has a last time index column and the new one doesnt # remove the column and the reference to last time index from that dataframe last_time_index_column = self[dataframe_name].ww.metadata.get("last_time_index") if ( last_time_index_column is not None and last_time_index_column not in df.columns ): self[dataframe_name].ww.pop(last_time_index_column) del self[dataframe_name].ww.metadata["last_time_index"] # If the original DataFrame had an index created via make_index, # we may need to remake the index if it's not in the new DataFrame created_index = self[dataframe_name].ww.metadata.get("created_index") if created_index is not None and created_index not in df.columns: df = _create_index(df, created_index) old_column_names = list(self[dataframe_name].columns) if len(df.columns) != len(old_column_names): raise ValueError( "New dataframe contains {} columns, expecting {}".format( len(df.columns), len(old_column_names), ), ) for col_name in old_column_names: if col_name not in df.columns: raise ValueError( "New dataframe is missing new {} column".format(col_name), ) if df.ww.schema is not None: warnings.warn( "Woodwork typing information on new dataframe will be replaced " f"with existing typing information from {dataframe_name}", ) df.ww.init( schema=self[dataframe_name].ww._schema, already_sorted=already_sorted, ) # Make sure column ordering matches original ordering df = df.ww[old_column_names] df = self._normalize_values(df) self.dataframe_dict[dataframe_name] = df if self[dataframe_name].ww.time_index is not None: self._check_uniform_time_index(self[dataframe_name]) df_metadata = self[dataframe_name].ww.metadata self.set_secondary_time_index( dataframe_name, df_metadata.get("secondary_time_index"), ) if recalculate_last_time_indexes and last_time_index_column is not None: self.add_last_time_indexes(updated_dataframes=[dataframe_name]) 
self.reset_data_description() self._add_references_to_metadata(df) def _check_time_indexes(self): for dataframe in self.dataframe_dict.values(): self._check_uniform_time_index(dataframe) self._check_secondary_time_index(dataframe) def _check_secondary_time_index(self, dataframe, secondary_time_index=None): secondary_time_index = secondary_time_index or dataframe.ww.metadata.get( "secondary_time_index", {}, ) if secondary_time_index and dataframe.ww.time_index is None: raise ValueError( "Cannot set secondary time index on a DataFrame that has no primary time index.", ) for time_index, columns in secondary_time_index.items(): self._check_uniform_time_index(dataframe, column_name=time_index) if time_index not in columns: columns.append(time_index) def _check_uniform_time_index(self, dataframe, column_name=None): column_name = column_name or dataframe.ww.time_index if column_name is None: return time_type = self._get_time_type(dataframe, column_name) if self.time_type is None: self.time_type = time_type elif self.time_type != time_type: info = "%s time index is %s type which differs from other entityset time indexes" raise TypeError(info % (dataframe.ww.name, time_type)) def _get_time_type(self, dataframe, column_name=None): column_name = column_name or dataframe.ww.time_index column_schema = dataframe.ww.columns[column_name] time_type = None if column_schema.is_numeric: time_type = "numeric" elif column_schema.is_datetime: time_type = Datetime if time_type is None: info = "%s time index not recognized as numeric or datetime" raise TypeError(info % dataframe.ww.name) return time_type def _add_references_to_metadata(self, dataframe): dataframe.ww.metadata.update(entityset_id=self.id) for column in dataframe.columns: metadata = dataframe.ww._schema.columns[column].metadata metadata.update(dataframe_name=dataframe.ww.name) metadata.update(entityset_id=self.id) _ES_REF[self.id] = self def _normalize_values(self, dataframe): def replace(x): if not isinstance(x, (list, 
tuple, np.ndarray)) and pd.isna(x): return (np.nan, np.nan) else: return x for column, logical_type in dataframe.ww.logical_types.items(): if isinstance(logical_type, LatLong): dataframe[column] = dataframe[column].apply(replace) return dataframe def _vals_to_series(instance_vals, column_id): """ instance_vals may be a pd.DataFrame, a pd.Series, a list, a single value, or None. This function always returns a Series or None. """ if instance_vals is None: return None # If this is a single value, make it a list if not hasattr(instance_vals, "__iter__"): instance_vals = [instance_vals] # convert iterable to pd.Series if isinstance(instance_vals, pd.DataFrame): out_vals = instance_vals[column_id] else: out_vals = pd.Series(instance_vals) # no duplicates or NaN values out_vals = out_vals.drop_duplicates().dropna() # want index to have no name for the merge in query_by_values out_vals.index.name = None return out_vals def _get_or_create_index(index, make_index, df): """Handles index creation logic based on user input""" index_was_created = False if index is None: # Case 1: user wanted to make index but did not specify column name assert not make_index, "Must specify an index name if make_index is True" # Case 2: make_index not specified but no index supplied, use first column warnings.warn( ( "Using first column as index. " "To change this, specify the index parameter" ), ) index = df.columns[0] elif make_index and index in df.columns: # Case 3: user wanted to make index but column already exists raise RuntimeError( f"Cannot make index: column with name {index} already present", ) elif index not in df.columns: if not make_index: # Case 4: user names index, it is not in df. does not specify # make_index. 
Make new index column and warn warnings.warn( "index {} not found in dataframe, creating new " "integer column".format(index), ) # Case 5: make_index with no errors or warnings # (Case 4 also uses this code path) df = _create_index(df, index) index_was_created = True # Case 6: user specified index, which is already in df. No action needed. return index_was_created, index, df def _create_index(df, index): df.insert(0, index, range(len(df))) return df ================================================ FILE: featuretools/entityset/relationship.py ================================================ class Relationship(object): """Class to represent a relationship between dataframes See Also: :class:`.EntitySet` """ def __init__( self, entityset, parent_dataframe_name, parent_column_name, child_dataframe_name, child_column_name, ): """Create a relationship Args: entityset (:class:`.EntitySet`): EntitySet to which the relationship belongs parent_dataframe_name (str): Name of the parent dataframe in the EntitySet parent_column_name (str): Name of the parent column child_dataframe_name (str): Name of the child dataframe in the EntitySet child_column_name (str): Name of the child column """ self.entityset = entityset self._parent_dataframe_name = parent_dataframe_name self._child_dataframe_name = child_dataframe_name self._parent_column_name = parent_column_name self._child_column_name = child_column_name if ( self.parent_dataframe.ww.index is not None and self._parent_column_name != self.parent_dataframe.ww.index ): raise AttributeError( f"Parent column '{self._parent_column_name}' is not the index of " f"dataframe {self._parent_dataframe_name}", ) @classmethod def from_dictionary(cls, arguments, es): parent_dataframe = arguments["parent_dataframe_name"] child_dataframe = arguments["child_dataframe_name"] parent_column = arguments["parent_column_name"] child_column = arguments["child_column_name"] return cls(es, parent_dataframe, parent_column, child_dataframe, child_column) def 
__repr__(self): ret = "<Relationship: %s.%s -> %s.%s>" % ( self._child_dataframe_name, self._child_column_name, self._parent_dataframe_name, self._parent_column_name, ) return ret def __eq__(self, other): if not isinstance(other, self.__class__): return False return ( self._parent_dataframe_name == other._parent_dataframe_name and self._child_dataframe_name == other._child_dataframe_name and self._parent_column_name == other._parent_column_name and self._child_column_name == other._child_column_name ) def __hash__(self): return hash( ( self._parent_dataframe_name, self._child_dataframe_name, self._parent_column_name, self._child_column_name, ), ) @property def parent_dataframe(self): """Parent dataframe object""" return self.entityset[self._parent_dataframe_name] @property def child_dataframe(self): """Child dataframe object""" return self.entityset[self._child_dataframe_name] @property def parent_column(self): """Column in parent dataframe""" return self.parent_dataframe.ww[self._parent_column_name] @property def child_column(self): """Column in child dataframe""" return self.child_dataframe.ww[self._child_column_name] @property def parent_name(self): """The name of the parent, relative to the child.""" if self._is_unique(): return self._parent_dataframe_name else: return "%s[%s]" % (self._parent_dataframe_name, self._child_column_name) @property def child_name(self): """The name of the child, relative to the parent.""" if self._is_unique(): return self._child_dataframe_name else: return "%s[%s]" % (self._child_dataframe_name, self._child_column_name) def to_dictionary(self): return { "parent_dataframe_name": self._parent_dataframe_name, "child_dataframe_name": self._child_dataframe_name, "parent_column_name": self._parent_column_name, "child_column_name": self._child_column_name, } def _is_unique(self): """Is there any other relationship with same parent and child dataframes?""" es = self.entityset relationships = es.get_forward_relationships(self._child_dataframe_name) n = len( [ r 
for r in relationships if r._parent_dataframe_name == self._parent_dataframe_name ], ) assert n > 0, "This relationship is missing from the entityset" return n == 1 class RelationshipPath(object): def __init__(self, relationships_with_direction): self._relationships_with_direction = relationships_with_direction @property def name(self): relationship_names = [ _direction_name(is_forward, r) for is_forward, r in self._relationships_with_direction ] return ".".join(relationship_names) def dataframes(self): if self: # Yield first dataframe. is_forward, relationship = self[0] if is_forward: yield relationship._child_dataframe_name else: yield relationship._parent_dataframe_name # Yield the dataframe pointed to by each relationship. for is_forward, relationship in self: if is_forward: yield relationship._parent_dataframe_name else: yield relationship._child_dataframe_name def __add__(self, other): return RelationshipPath( self._relationships_with_direction + other._relationships_with_direction, ) def __getitem__(self, index): return self._relationships_with_direction[index] def __iter__(self): for is_forward, relationship in self._relationships_with_direction: yield is_forward, relationship def __len__(self): return len(self._relationships_with_direction) def __eq__(self, other): return ( isinstance(other, RelationshipPath) and self._relationships_with_direction == other._relationships_with_direction ) def __ne__(self, other): return not self == other def __repr__(self): if self._relationships_with_direction: path = "%s.%s" % (next(self.dataframes()), self.name) else: path = "[]" return "<RelationshipPath %s>" % path def _direction_name(is_forward, relationship): if is_forward: return relationship.parent_name else: return relationship.child_name ================================================ FILE: featuretools/entityset/serialize.py ================================================ import datetime import json import os import tarfile import tempfile from woodwork.serializers.serializer_base 
import typing_info_to_dict from featuretools.utils.s3_utils import get_transport_params, use_smartopen_es from featuretools.utils.wrangle import _is_s3, _is_url from featuretools.version import ENTITYSET_SCHEMA_VERSION FORMATS = ["csv", "pickle", "parquet"] def entityset_to_description(entityset, format=None): """Serialize entityset to data description. Args: entityset (EntitySet) : Instance of :class:`.EntitySet`. Returns: description (dict) : Description of :class:`.EntitySet`. """ dataframes = { dataframe.ww.name: typing_info_to_dict(dataframe) for dataframe in entityset.dataframes } relationships = [ relationship.to_dictionary() for relationship in entityset.relationships ] data_description = { "schema_version": ENTITYSET_SCHEMA_VERSION, "id": entityset.id, "dataframes": dataframes, "relationships": relationships, "format": format, } return data_description def write_data_description(entityset, path, profile_name=None, **kwargs): """Serialize entityset to data description and write to disk or S3 path. Args: entityset (EntitySet) : Instance of :class:`.EntitySet`. path (str) : Location on disk or S3 path to write `data_description.json` and dataframe data. profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials. Set to False to use an anonymous profile. kwargs (keywords) : Additional keyword arguments to pass as keywords arguments to the underlying serialization method or to specify AWS profile. 
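The layout this function produces (a `data_description.json` file next to a `data` directory, bundled into a tar archive for S3) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `write_layout` and `archive_layout` are hypothetical helpers, the description values are placeholders, and no dataframe data or upload step is included.

```python
import json
import os
import tarfile
import tempfile


def write_layout(root, description):
    # Hypothetical helper: create the directory layout and the JSON description.
    os.makedirs(os.path.join(root, "data"), exist_ok=True)
    with open(os.path.join(root, "data_description.json"), "w") as f:
        json.dump(description, f)


def archive_layout(root, archive_path):
    # Hypothetical helper: bundle the description and data directory into a tar.
    with tarfile.open(archive_path, "w") as tar:
        tar.add(
            os.path.join(root, "data_description.json"),
            arcname="data_description.json",
        )
        tar.add(os.path.join(root, "data"), arcname="data")
    return archive_path


# Placeholder description using the same top-level keys as the serializer.
description = {
    "schema_version": "0.0.0",
    "id": "example",
    "dataframes": {},
    "relationships": [],
    "format": "csv",
}

with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as out:
    write_layout(root, description)
    archive = archive_layout(root, os.path.join(out, "es.tar"))
    with tarfile.open(archive) as tar:
        print(sorted(tar.getnames()))
```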
""" if _is_s3(path): with tempfile.TemporaryDirectory() as tmpdir: os.makedirs(os.path.join(tmpdir, "data")) dump_data_description(entityset, tmpdir, **kwargs) file_path = create_archive(tmpdir) transport_params = get_transport_params(profile_name) use_smartopen_es( file_path, path, read=False, transport_params=transport_params, ) elif _is_url(path): raise ValueError("Writing to URLs is not supported") else: path = os.path.abspath(path) os.makedirs(os.path.join(path, "data"), exist_ok=True) dump_data_description(entityset, path, **kwargs) def dump_data_description(entityset, path, **kwargs): format = kwargs.get("format") description = entityset_to_description(entityset, format) for df in entityset.dataframes: data_path = os.path.join(path, "data", df.ww.name) os.makedirs(os.path.join(data_path, "data"), exist_ok=True) df.ww.to_disk(data_path, **kwargs) file = os.path.join(path, "data_description.json") with open(file, "w") as file: json.dump(description, file) def create_archive(tmpdir): file_name = "es-{date:%Y-%m-%d_%H%M%S}.tar".format(date=datetime.datetime.now()) file_path = os.path.join(tmpdir, file_name) tar = tarfile.open(str(file_path), "w") tar.add(str(tmpdir) + "/data_description.json", arcname="/data_description.json") tar.add(str(tmpdir) + "/data", arcname="/data") tar.close() return file_path ================================================ FILE: featuretools/entityset/timedelta.py ================================================ import pandas as pd from dateutil.relativedelta import relativedelta class Timedelta(object): """Represents differences in time. Timedeltas can be defined in multiple units. Supported units: - "ms" : milliseconds - "s" : seconds - "h" : hours - "m" : minutes - "d" : days - "o"/"observations" : number of individual events - "mo" : months - "Y" : years Timedeltas can also be defined in terms of observations. In this case, the Timedelta represents the period spanned by `value`. 
For observation timedeltas: >>> three_observations_log = Timedelta(3, "observations") >>> three_observations_log.get_name() '3 Observations' """ _Observations = "o" # units for absolute times _absolute_units = ["ms", "s", "h", "m", "d", "w"] _relative_units = ["mo", "Y"] _readable_units = { "ms": "Milliseconds", "s": "Seconds", "h": "Hours", "m": "Minutes", "d": "Days", "o": "Observations", "w": "Weeks", "Y": "Years", "mo": "Months", } _readable_to_unit = {v.lower(): k for k, v in _readable_units.items()} def __init__(self, value, unit=None, delta_obj=None): """ Args: value (float, str, dict) : Value of timedelta, string providing both unit and value, or a dictionary of units and times. unit (str) : Unit of time delta. delta_obj (pd.Timedelta or pd.DateOffset) : A time object used internally to do time operations. If None is provided, one will be created using the provided value and unit. """ self.check_value(value, unit) self.times = self.fix_units() if delta_obj is not None: self.delta_obj = delta_obj else: self.delta_obj = self.get_unit_type() @classmethod def from_dictionary(cls, dictionary): dict_units = dictionary["unit"] dict_values = dictionary["value"] if isinstance(dict_units, str) and isinstance(dict_values, (int, float)): return cls({dict_units: dict_values}) else: all_units = dict() for i in range(len(dict_units)): all_units[dict_units[i]] = dict_values[i] return cls(all_units) @classmethod def make_singular(cls, s): if len(s) > 1 and s.endswith("s"): return s[:-1] return s @classmethod def _check_unit_plural(cls, s): if len(s) > 2 and not s.endswith("s"): return (s + "s").lower() elif len(s) > 1: return s.lower() return s def get_value(self, unit=None): if unit is not None: return self.times[unit] elif len(self.times.values()) == 1: return list(self.times.values())[0] else: return self.times def get_units(self): return list(self.times.keys()) def get_unit_type(self): all_units = self.get_units() if self._Observations in all_units: return None elif 
self.is_absolute() and self.has_multiple_units() is False: return pd.Timedelta(self.times[all_units[0]], all_units[0]) else: readable_times = self.lower_readable_times() return relativedelta(**readable_times) def check_value(self, value, unit): if isinstance(value, str): from featuretools.utils.wrangle import _check_timedelta td = _check_timedelta(value) self.times = td.times elif isinstance(value, dict): self.times = value else: self.times = {unit: value} def fix_units(self): fixed_units = dict() for unit, value in self.times.items(): unit = self._check_unit_plural(unit) if unit in self._readable_to_unit: unit = self._readable_to_unit[unit] fixed_units[unit] = value return fixed_units def lower_readable_times(self): readable_times = dict() for unit, value in self.times.items(): readable_unit = self._readable_units[unit].lower() readable_times[readable_unit] = value return readable_times def get_name(self): all_units = self.get_units() if self.has_multiple_units() is False: return "{} {}".format( self.times[all_units[0]], self._readable_units[all_units[0]], ) final_str = "" for unit, value in self.times.items(): if value == 1: unit = self.make_singular(unit) final_str += "{} {} ".format(value, self._readable_units[unit]) return final_str[:-1] def get_arguments(self): units = list() values = list() for unit, value in self.times.items(): units.append(unit) values.append(value) if len(units) == 1: return {"unit": units[0], "value": values[0]} else: return {"unit": units, "value": values} def is_absolute(self): for unit in self.get_units(): if unit not in self._absolute_units: return False return True def has_no_observations(self): for unit in self.get_units(): if unit in self._Observations: return False return True def has_multiple_units(self): if len(self.get_units()) > 1: return True else: return False def __eq__(self, other): if not isinstance(other, Timedelta): return False return self.times == other.times def __neg__(self): """Negate the timedelta""" new_times = 
dict() for unit, value in self.times.items(): new_times[unit] = -value if self.delta_obj is not None: return Timedelta(new_times, delta_obj=-self.delta_obj) else: return Timedelta(new_times) def __radd__(self, time): """Add the Timedelta to a timestamp value""" if self._Observations not in self.get_units(): return time + self.delta_obj else: raise Exception("Invalid unit") def __rsub__(self, time): """Subtract the Timedelta from a timestamp value""" if self._Observations not in self.get_units(): return time - self.delta_obj else: raise Exception("Invalid unit") ================================================ FILE: featuretools/exceptions.py ================================================ class UnknownFeature(Exception): def __init__(self, *args, **kwargs): Exception.__init__(self, *args, **kwargs) class UnusedPrimitiveWarning(UserWarning): pass ================================================ FILE: featuretools/feature_base/__init__.py ================================================ # flake8: noqa from featuretools.feature_base.api import * ================================================ FILE: featuretools/feature_base/api.py ================================================ # flake8: noqa from featuretools.feature_base.feature_base import ( AggregationFeature, DirectFeature, Feature, FeatureBase, FeatureOutputSlice, GroupByTransformFeature, IdentityFeature, TransformFeature, ) from featuretools.feature_base.feature_descriptions import describe_feature from featuretools.feature_base.feature_visualizer import graph_feature from featuretools.feature_base.features_deserializer import load_features from featuretools.feature_base.features_serializer import save_features ================================================ FILE: featuretools/feature_base/cache.py ================================================ """ cache.py Custom caching class, currently used for FeatureBase """ # needed for defaultdict annotation if < python 3.9 from __future__ import annotations from 
collections import defaultdict from dataclasses import dataclass, field from enum import Enum from typing import Any, List, Optional, Union class CacheType(Enum): """Enumerates the supported cache types""" DEPENDENCY = 1 DEPTH = 2 @dataclass() class FeatureCache: """Provides caching for the defined types""" enabled: bool = False cache: defaultdict[dict] = field(default_factory=lambda: defaultdict(dict)) def get( self, cache_type: CacheType, hashkey: int, ) -> Optional[Union[List[Any], Any]]: """Gets the cache entry, if enabled and defined Args: cache_type (CacheType): type of cache hashkey (int): hash key Returns: Optional[Union[List[Any], Any]]: payload assigned to the hashkey """ if not self.enabled or cache_type not in self.cache: return None return self.cache[cache_type].get(hashkey, None) def add(self, cache_type: CacheType, hashkey: int, payload: Any): """Adds an entry to the cache, if enabled Args: cache_type (CacheType): type of cache hashkey (int): hash key payload (Any): payload to assign """ if self.enabled: self.cache[cache_type][hashkey] = payload def clear_all(self): """Clears the cache collections""" self.cache.clear() feature_cache = FeatureCache() ================================================ FILE: featuretools/feature_base/feature_base.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools import primitives from featuretools.entityset.relationship import Relationship, RelationshipPath from featuretools.entityset.timedelta import Timedelta from featuretools.feature_base.utils import is_valid_input from featuretools.primitives.base import ( AggregationPrimitive, PrimitiveBase, TransformPrimitive, ) from featuretools.utils.wrangle import _check_time_against_column, _check_timedelta _ES_REF = {} class FeatureBase(object): def __init__( self, dataframe, base_features, relationship_path, primitive, name=None, names=None, ): 
"""Base class for all features Args: entityset (EntitySet): entityset this feature is being calculated for dataframe (DataFrame): dataframe for calculating this feature base_features (list[FeatureBase]): list of base features for primitive relationship_path (RelationshipPath): path from this dataframe to the dataframe of the base features. primitive (:class:`.PrimitiveBase`): primitive to calculate. if not initialized when passed, gets initialized with no arguments """ assert all( isinstance(f, FeatureBase) for f in base_features ), "All base features must be features" self.dataframe_name = dataframe.ww.name self.entityset = _ES_REF[dataframe.ww.metadata["entityset_id"]] self.base_features = base_features # initialize if not already initialized if not isinstance(primitive, PrimitiveBase): primitive = primitive() self.primitive = primitive self.relationship_path = relationship_path self._name = name self._names = names assert self._check_input_types(), ( "Provided inputs don't match input " "type requirements" ) def __getitem__(self, key): assert ( self.number_output_features > 1 ), "can only access slice of multi-output feature" assert ( self.number_output_features > key ), "index is higher than the number of outputs" return FeatureOutputSlice(self, key) @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): raise NotImplementedError("Must define from_dictionary on FeatureBase subclass") def rename(self, name): """Rename Feature, returns copy. 
Will reset any custom feature column names to their default value.""" feature_copy = self.copy() feature_copy._name = name feature_copy._names = None return feature_copy def copy(self): raise NotImplementedError("Must define copy on FeatureBase subclass") def get_name(self): if not self._name: self._name = self.generate_name() return self._name def get_feature_names(self): if not self._names: if self.number_output_features == 1: self._names = [self.get_name()] else: self._names = self.generate_names() if self.get_name() != self.generate_name(): self._names = [ self.get_name() + "[{}]".format(i) for i in range(len(self._names)) ] return self._names def set_feature_names(self, names): """Set new values for the feature column names, overriding the default values. Number of names provided must match the number of output columns defined for the feature, and all provided names should be unique. Only works for features that have more than one output column. Use ``Feature.rename`` to change the column name for single output features. Args: names (list[str]): List of names to use for the output feature columns. Provided names must be unique. 
""" if self.number_output_features == 1: raise ValueError( "The set_feature_names can only be used on features that have more than one output column.", ) num_new_names = len(names) if self.number_output_features != num_new_names: raise ValueError( "Number of names provided must match the number of output features:" f" {num_new_names} name(s) provided, {self.number_output_features} expected.", ) if len(set(names)) != num_new_names: raise ValueError("Provided output feature names must be unique.") self._names = names def get_function(self, **kwargs): return self.primitive.get_function(**kwargs) def get_dependencies(self, deep=False, ignored=None, copy=True): """Returns features that are used to calculate this feature ..note:: If you only want the features that make up the input to the feature function use the base_features attribute instead. """ deps = [] for d in self.base_features[:]: deps += [d] if hasattr(self, "where") and self.where: deps += [self.where] if ignored is None: ignored = set([]) deps = [d for d in deps if d.unique_name() not in ignored] if deep: for dep in deps[:]: # copy so we don't modify list we iterate over deep_deps = dep.get_dependencies(deep, ignored) deps += deep_deps return deps def get_depth(self, stop_at=None): """Returns depth of feature""" max_depth = 0 stop_at_set = set() if stop_at is not None: stop_at_set = set([i.unique_name() for i in stop_at]) if self.unique_name() in stop_at_set: return 0 for dep in self.get_dependencies(deep=True, ignored=stop_at_set): max_depth = max(dep.get_depth(stop_at=stop_at), max_depth) return max_depth + 1 def _check_input_types(self): if len(self.base_features) == 0: return True input_types = self.primitive.input_types if input_types is not None: if not isinstance(input_types[0], list): input_types = [input_types] for t in input_types: zipped = list(zip(t, self.base_features)) if all([is_valid_input(f.column_schema, t) for t, f in zipped]): return True else: return True return False @property def 
dataframe(self): """Dataframe this feature belongs to""" return self.entityset[self.dataframe_name] @property def number_output_features(self): return self.primitive.number_output_features def __repr__(self): return "<Feature: %s>" % (self.get_name()) def hash(self): return hash(self.get_name() + self.dataframe_name) def __hash__(self): return self.hash() @property def column_schema(self): feature = self column_schema = self.primitive.return_type while column_schema is None: # get column_schema of first base feature base_feature = feature.base_features[0] column_schema = base_feature.column_schema # only the original time index should exist # so make this feature's return type just a Datetime if "time_index" in column_schema.semantic_tags: column_schema = ColumnSchema( logical_type=column_schema.logical_type, semantic_tags=column_schema.semantic_tags - {"time_index"}, ) elif "index" in column_schema.semantic_tags: column_schema = ColumnSchema( logical_type=column_schema.logical_type, semantic_tags=column_schema.semantic_tags - {"index"}, ) # Need to add back in the numeric standard tag so the schema can get recognized # as a valid return type if column_schema.is_numeric: column_schema.semantic_tags.add("numeric") if column_schema.is_categorical: column_schema.semantic_tags.add("category") # direct features should keep the foreign key tag, but all other features should get converted if ( not isinstance(feature, DirectFeature) and "foreign_key" in column_schema.semantic_tags ): column_schema = ColumnSchema( logical_type=column_schema.logical_type, semantic_tags=column_schema.semantic_tags - {"foreign_key"}, ) feature = base_feature return column_schema @property def default_value(self): return self.primitive.default_value def get_arguments(self): raise NotImplementedError("Must define get_arguments on FeatureBase subclass") def to_dictionary(self): return { "type": type(self).__name__, "dependencies": [dep.unique_name() for dep in self.get_dependencies()], "arguments":
self.get_arguments(), } def _handle_binary_comparison(self, other, Primitive, PrimitiveScalar): if isinstance(other, FeatureBase): return Feature([self, other], primitive=Primitive) return Feature([self], primitive=PrimitiveScalar(other)) def __eq__(self, other): """Compares to other by equality""" return self._handle_binary_comparison( other, primitives.Equal, primitives.EqualScalar, ) def __ne__(self, other): """Compares to other by non-equality""" return self._handle_binary_comparison( other, primitives.NotEqual, primitives.NotEqualScalar, ) def __gt__(self, other): """Compares if greater than other""" return self._handle_binary_comparison( other, primitives.GreaterThan, primitives.GreaterThanScalar, ) def __ge__(self, other): """Compares if greater than or equal to other""" return self._handle_binary_comparison( other, primitives.GreaterThanEqualTo, primitives.GreaterThanEqualToScalar, ) def __lt__(self, other): """Compares if less than other""" return self._handle_binary_comparison( other, primitives.LessThan, primitives.LessThanScalar, ) def __le__(self, other): """Compares if less than or equal to other""" return self._handle_binary_comparison( other, primitives.LessThanEqualTo, primitives.LessThanEqualToScalar, ) def __add__(self, other): """Add other""" return self._handle_binary_comparison( other, primitives.AddNumeric, primitives.AddNumericScalar, ) def __radd__(self, other): return self.__add__(other) def __sub__(self, other): """Subtract other""" return self._handle_binary_comparison( other, primitives.SubtractNumeric, primitives.SubtractNumericScalar, ) def __rsub__(self, other): return Feature([self], primitive=primitives.ScalarSubtractNumericFeature(other)) def __div__(self, other): """Divide by other""" return self._handle_binary_comparison( other, primitives.DivideNumeric, primitives.DivideNumericScalar, ) def __truediv__(self, other): return self.__div__(other) def __rtruediv__(self, other): return self.__rdiv__(other) def __rdiv__(self, other): 
return Feature([self], primitive=primitives.DivideByFeature(other)) def __mul__(self, other): """Multiply by other""" if isinstance(other, FeatureBase): if all( [ isinstance(f.column_schema.logical_type, (Boolean, BooleanNullable)) for f in (self, other) ], ): return Feature([self, other], primitive=primitives.MultiplyBoolean) if ( "numeric" in self.column_schema.semantic_tags and isinstance( other.column_schema.logical_type, (Boolean, BooleanNullable), ) or "numeric" in other.column_schema.semantic_tags and isinstance( self.column_schema.logical_type, (Boolean, BooleanNullable), ) ): return Feature( [self, other], primitive=primitives.MultiplyNumericBoolean, ) return self._handle_binary_comparison( other, primitives.MultiplyNumeric, primitives.MultiplyNumericScalar, ) def __rmul__(self, other): return self.__mul__(other) def __mod__(self, other): """Take modulus of other""" return self._handle_binary_comparison( other, primitives.ModuloNumeric, primitives.ModuloNumericScalar, ) def __rmod__(self, other): return Feature([self], primitive=primitives.ModuloByFeature(other)) def __and__(self, other): return self.AND(other) def __rand__(self, other): return Feature([other, self], primitive=primitives.And) def __or__(self, other): return self.OR(other) def __ror__(self, other): return Feature([other, self], primitive=primitives.Or) def __not__(self): # NOT() takes no argument besides self return self.NOT() def __abs__(self): return Feature([self], primitive=primitives.Absolute) def __neg__(self): return Feature([self], primitive=primitives.Negate) def AND(self, other_feature): """Logical AND with other_feature""" return Feature([self, other_feature], primitive=primitives.And) def OR(self, other_feature): """Logical OR with other_feature""" return Feature([self, other_feature], primitive=primitives.Or) def NOT(self): """Creates inverse of feature""" return Feature([self], primitive=primitives.Not) def isin(self, list_of_output): return Feature( [self],
primitive=primitives.IsIn(list_of_outputs=list_of_output), ) def is_null(self): """Compares feature to null by equality""" return Feature([self], primitive=primitives.IsNull) def __invert__(self): return self.NOT() def unique_name(self): return "%s: %s" % (self.dataframe_name, self.get_name()) def relationship_path_name(self): return self.relationship_path.name class IdentityFeature(FeatureBase): """Feature for dataframe that is equivalent to underlying column""" def __init__(self, column, name=None): self.column_name = column.ww.name self.return_type = column.ww.schema metadata = column.ww.schema._metadata es = _ES_REF[metadata["entityset_id"]] super(IdentityFeature, self).__init__( dataframe=es[metadata["dataframe_name"]], base_features=[], relationship_path=RelationshipPath([]), primitive=PrimitiveBase, name=name, ) @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): dataframe_name = arguments["dataframe_name"] column_name = arguments["column_name"] column = entityset[dataframe_name].ww[column_name] return cls(column=column, name=arguments["name"]) def copy(self): """Return copy of feature""" return IdentityFeature(self.entityset[self.dataframe_name].ww[self.column_name]) def generate_name(self): return self.column_name def get_depth(self, stop_at=None): return 0 def get_arguments(self): return { "name": self.get_name(), "column_name": self.column_name, "dataframe_name": self.dataframe_name, } @property def column_schema(self): return self.return_type class DirectFeature(FeatureBase): """Feature for child dataframe that inherits a feature value from a parent dataframe""" input_types = [ColumnSchema()] return_type = None def __init__( self, base_feature, child_dataframe_name, relationship=None, name=None, ): base_feature = _validate_base_features(base_feature)[0] self.parent_dataframe_name = base_feature.dataframe_name relationship = self._handle_relationship( base_feature.entityset, child_dataframe_name, relationship, ) 
child_dataframe = base_feature.entityset[child_dataframe_name] super(DirectFeature, self).__init__( dataframe=child_dataframe, base_features=[base_feature], relationship_path=RelationshipPath([(True, relationship)]), primitive=PrimitiveBase, name=name, ) def _handle_relationship(self, entityset, child_dataframe_name, relationship): child_dataframe = entityset[child_dataframe_name] if relationship: relationship_child = relationship.child_dataframe assert ( child_dataframe.ww.name == relationship_child.ww.name ), "child_dataframe must be the relationship child dataframe" assert ( self.parent_dataframe_name == relationship.parent_dataframe.ww.name ), "Base feature must be defined on the relationship parent dataframe" else: child_relationships = entityset.get_forward_relationships( child_dataframe.ww.name, ) possible_relationships = ( r for r in child_relationships if r.parent_dataframe.ww.name == self.parent_dataframe_name ) relationship = next(possible_relationships, None) if not relationship: raise RuntimeError( 'No relationship from "%s" to "%s" found.' % (child_dataframe.ww.name, self.parent_dataframe_name), ) # Check for another path. elif next(possible_relationships, None): message = ( "There are multiple relationships to the base dataframe. " "You must specify a relationship." 
) raise RuntimeError(message) return relationship @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): base_feature = dependencies[arguments["base_feature"]] relationship = Relationship.from_dictionary( arguments["relationship"], entityset, ) child_dataframe_name = relationship.child_dataframe.ww.name return cls( base_feature=base_feature, child_dataframe_name=child_dataframe_name, relationship=relationship, name=arguments["name"], ) @property def number_output_features(self): return self.base_features[0].number_output_features @property def default_value(self): return self.base_features[0].default_value def copy(self): """Return copy of feature""" _is_forward, relationship = self.relationship_path[0] return DirectFeature( self.base_features[0], self.dataframe_name, relationship=relationship, ) @property def column_schema(self): return self.base_features[0].column_schema def generate_name(self): return self._name_from_base(self.base_features[0].get_name()) def generate_names(self): return [ self._name_from_base(base_name) for base_name in self.base_features[0].get_feature_names() ] def get_arguments(self): _is_forward, relationship = self.relationship_path[0] return { "name": self.get_name(), "base_feature": self.base_features[0].unique_name(), "relationship": relationship.to_dictionary(), } def _name_from_base(self, base_name): return "%s.%s" % (self.relationship_path_name(), base_name) class AggregationFeature(FeatureBase): # Feature to condition this feature by in # computation (e.g. take the Count of products where the product_id is # "basketball".) 
where = None #: (str or :class:`.Timedelta`): Use only some amount of previous data from # each time point during calculation use_previous = None def __init__( self, base_features, parent_dataframe_name, primitive, relationship_path=None, use_previous=None, where=None, name=None, ): base_features = _validate_base_features(base_features) for bf in base_features: if bf.number_output_features > 1: raise ValueError("Cannot stack on whole multi-output feature.") self.child_dataframe_name = base_features[0].dataframe_name entityset = base_features[0].entityset relationship_path, self._path_is_unique = self._handle_relationship_path( entityset, parent_dataframe_name, relationship_path, ) self.parent_dataframe_name = parent_dataframe_name if where is not None: self.where = _validate_base_features(where)[0] msg = "Where feature must be defined on child dataframe {}".format( self.child_dataframe_name, ) assert self.where.dataframe_name == self.child_dataframe_name, msg if use_previous: assert entityset[self.child_dataframe_name].ww.time_index is not None, ( "Applying function that requires time index to dataframe that " "doesn't have one" ) self.use_previous = _check_timedelta(use_previous) assert len(base_features) > 0 time_index = base_features[0].dataframe.ww.time_index time_col = base_features[0].dataframe.ww[time_index] assert time_index is not None, ( "Use previous can only be defined " "on dataframes with a time index" ) assert _check_time_against_column(self.use_previous, time_col) super(AggregationFeature, self).__init__( dataframe=entityset[parent_dataframe_name], base_features=base_features, relationship_path=relationship_path, primitive=primitive, name=name, ) def _handle_relationship_path( self, entityset, parent_dataframe_name, relationship_path, ): parent_dataframe = entityset[parent_dataframe_name] child_dataframe = entityset[self.child_dataframe_name] if relationship_path: assert all( not is_forward for is_forward, _r in relationship_path ), "All 
relationships in path must be backward" _is_forward, first_relationship = relationship_path[0] first_parent = first_relationship.parent_dataframe assert ( parent_dataframe.ww.name == first_parent.ww.name ), "parent_dataframe must match first relationship in path." _is_forward, last_relationship = relationship_path[-1] assert ( child_dataframe.ww.name == last_relationship.child_dataframe.ww.name ), "Base feature must be defined on the dataframe at the end of relationship_path" path_is_unique = entityset.has_unique_forward_path( child_dataframe.ww.name, parent_dataframe.ww.name, ) else: paths = entityset.find_backward_paths( parent_dataframe.ww.name, child_dataframe.ww.name, ) first_path = next(paths, None) if not first_path: raise RuntimeError( 'No backward path from "%s" to "%s" found.' % (parent_dataframe.ww.name, child_dataframe.ww.name), ) # Check for another path. elif next(paths, None): message = ( "There are multiple possible paths to the base dataframe. " "You must specify a relationship path." 
) raise RuntimeError(message) relationship_path = RelationshipPath([(False, r) for r in first_path]) path_is_unique = True return relationship_path, path_is_unique @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): base_features = [dependencies[name] for name in arguments["base_features"]] relationship_path = [ Relationship.from_dictionary(r, entityset) for r in arguments["relationship_path"] ] parent_dataframe_name = relationship_path[0].parent_dataframe.ww.name relationship_path = RelationshipPath([(False, r) for r in relationship_path]) use_previous_data = arguments["use_previous"] use_previous = use_previous_data and Timedelta.from_dictionary( use_previous_data, ) where_name = arguments["where"] where = where_name and dependencies[where_name] feat = cls( base_features=base_features, parent_dataframe_name=parent_dataframe_name, primitive=primitive, relationship_path=relationship_path, use_previous=use_previous, where=where, name=arguments["name"], ) feat._names = arguments.get("feature_names") return feat def copy(self): return AggregationFeature( self.base_features, parent_dataframe_name=self.parent_dataframe_name, relationship_path=self.relationship_path, primitive=self.primitive, use_previous=self.use_previous, where=self.where, ) def _where_str(self): if self.where is not None: where_str = " WHERE " + self.where.get_name() else: where_str = "" return where_str def _use_prev_str(self): if self.use_previous is not None and hasattr(self.use_previous, "get_name"): use_prev_str = ", Last {}".format(self.use_previous.get_name()) else: use_prev_str = "" return use_prev_str def generate_name(self): return self.primitive.generate_name( base_feature_names=[bf.get_name() for bf in self.base_features], relationship_path_name=self.relationship_path_name(), parent_dataframe_name=self.parent_dataframe_name, where_str=self._where_str(), use_prev_str=self._use_prev_str(), ) def generate_names(self): return self.primitive.generate_names( 
base_feature_names=[bf.get_name() for bf in self.base_features], relationship_path_name=self.relationship_path_name(), parent_dataframe_name=self.parent_dataframe_name, where_str=self._where_str(), use_prev_str=self._use_prev_str(), ) def get_arguments(self): arg_dict = { "name": self.get_name(), "base_features": [feat.unique_name() for feat in self.base_features], "relationship_path": [r.to_dictionary() for _, r in self.relationship_path], "primitive": self.primitive, "where": self.where and self.where.unique_name(), "use_previous": self.use_previous and self.use_previous.get_arguments(), } if self.number_output_features > 1: arg_dict["feature_names"] = self.get_feature_names() return arg_dict def relationship_path_name(self): if self._path_is_unique: return self.child_dataframe_name else: return self.relationship_path.name class TransformFeature(FeatureBase): def __init__(self, base_features, primitive, name=None): base_features = _validate_base_features(base_features) for bf in base_features: if bf.number_output_features > 1: raise ValueError("Cannot stack on whole multi-output feature.") dataframe = base_features[0].entityset[base_features[0].dataframe_name] super(TransformFeature, self).__init__( dataframe=dataframe, base_features=base_features, relationship_path=RelationshipPath([]), primitive=primitive, name=name, ) @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): base_features = [dependencies[name] for name in arguments["base_features"]] feat = cls( base_features=base_features, primitive=primitive, name=arguments["name"], ) feat._names = arguments.get("feature_names") return feat def copy(self): return TransformFeature(self.base_features, self.primitive) def generate_name(self): return self.primitive.generate_name( base_feature_names=[bf.get_name() for bf in self.base_features], ) def generate_names(self): return self.primitive.generate_names( base_feature_names=[bf.get_name() for bf in self.base_features], ) def 
get_arguments(self): arg_dict = { "name": self.get_name(), "base_features": [feat.unique_name() for feat in self.base_features], "primitive": self.primitive, } if self.number_output_features > 1: arg_dict["feature_names"] = self.get_feature_names() return arg_dict class GroupByTransformFeature(TransformFeature): def __init__(self, base_features, primitive, groupby, name=None): if not isinstance(groupby, FeatureBase): groupby = IdentityFeature(groupby) assert ( len({"category", "foreign_key"} - groupby.column_schema.semantic_tags) < 2 ) self.groupby = groupby base_features = _validate_base_features(base_features) base_features.append(groupby) super(GroupByTransformFeature, self).__init__( base_features=base_features, primitive=primitive, name=name, ) @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): base_features = [dependencies[name] for name in arguments["base_features"]] groupby = dependencies[arguments["groupby"]] feat = cls( base_features=base_features, primitive=primitive, groupby=groupby, name=arguments["name"], ) feat._names = arguments.get("feature_names") return feat def copy(self): # the groupby feature is appended to base_features in the __init__ # so here we separate them again return GroupByTransformFeature( self.base_features[:-1], self.primitive, self.groupby, ) def generate_name(self): # exclude the groupby feature from base_names since it has a special # place in the feature name base_names = [bf.get_name() for bf in self.base_features[:-1]] _name = self.primitive.generate_name(base_names) return "{} by {}".format(_name, self.groupby.get_name()) def generate_names(self): base_names = [bf.get_name() for bf in self.base_features[:-1]] _names = self.primitive.generate_names(base_names) names = [name + " by {}".format(self.groupby.get_name()) for name in _names] return names def get_arguments(self): # Do not include groupby in base_features. 
feature_names = [ feat.unique_name() for feat in self.base_features if feat.unique_name() != self.groupby.unique_name() ] arg_dict = { "name": self.get_name(), "base_features": feature_names, "primitive": self.primitive, "groupby": self.groupby.unique_name(), } if self.number_output_features > 1: arg_dict["feature_names"] = self.get_feature_names() return arg_dict class Feature(object): """ Alias to create feature. Infers the feature type based on init parameters. """ def __new__( self, base, dataframe_name=None, groupby=None, parent_dataframe_name=None, primitive=None, use_previous=None, where=None, ): # either direct or identity if primitive is None and dataframe_name is None: return IdentityFeature(base) elif primitive is None and dataframe_name is not None: return DirectFeature(base, dataframe_name) elif primitive is not None and parent_dataframe_name is not None: assert isinstance(primitive, AggregationPrimitive) or issubclass( primitive, AggregationPrimitive, ) return AggregationFeature( base, parent_dataframe_name=parent_dataframe_name, use_previous=use_previous, where=where, primitive=primitive, ) elif primitive is not None: assert isinstance(primitive, TransformPrimitive) or issubclass( primitive, TransformPrimitive, ) if groupby is not None: return GroupByTransformFeature( base, primitive=primitive, groupby=groupby, ) return TransformFeature(base, primitive=primitive) raise Exception("Unrecognized feature initialization") class FeatureOutputSlice(FeatureBase): """ Class to access specific multi output feature column """ def __init__(self, base_feature, n, name=None): base_features = [base_feature] self.num_output_parent = base_feature.number_output_features msg = "cannot access slice from single output feature" assert self.num_output_parent > 1, msg msg = "cannot access column that is not between 0 and " + str( self.num_output_parent - 1, ) assert n < self.num_output_parent, msg self.n = n self._name = name self._names = [name] if name else None 
self.base_features = base_features self.base_feature = base_features[0] self.dataframe_name = base_feature.dataframe_name self.entityset = base_feature.entityset self.primitive = base_feature.primitive self.relationship_path = base_feature.relationship_path def __getitem__(self, key): raise ValueError("Cannot get item from slice of multi output feature") def generate_name(self): return self.base_feature.get_feature_names()[self.n] @property def number_output_features(self): return 1 def get_arguments(self): return { "name": self.get_name(), "base_feature": self.base_feature.unique_name(), "n": self.n, } @classmethod def from_dictionary(cls, arguments, entityset, dependencies, primitive): base_feature_name = arguments["base_feature"] base_feature = dependencies[base_feature_name] n = arguments["n"] name = arguments["name"] return cls(base_feature=base_feature, n=n, name=name) def copy(self): return FeatureOutputSlice(self.base_feature, self.n) def _validate_base_features(feature): if "Series" == type(feature).__name__: return [IdentityFeature(feature)] elif hasattr(feature, "__iter__"): features = [_validate_base_features(f)[0] for f in feature] msg = "all base features must share the same dataframe" assert len(set([bf.dataframe_name for bf in features])) == 1, msg return features elif isinstance(feature, FeatureBase): return [feature] else: raise Exception("Not a feature") ================================================ FILE: featuretools/feature_base/feature_descriptions.py ================================================ import json import featuretools as ft def describe_feature( feature, feature_descriptions=None, primitive_templates=None, metadata_file=None, ): """Generates an English language description of a feature. 
Args: feature (FeatureBase) : Feature to describe feature_descriptions (dict, optional) : dictionary mapping features or unique feature names to custom descriptions primitive_templates (dict, optional) : dictionary mapping primitives or primitive names to description templates metadata_file (str, optional) : path to json metadata file Returns: str : English description of the feature """ feature_descriptions = feature_descriptions or {} primitive_templates = primitive_templates or {} if metadata_file: file_feature_descriptions, file_primitive_templates = parse_json_metadata( metadata_file, ) feature_descriptions = {**file_feature_descriptions, **feature_descriptions} primitive_templates = {**file_primitive_templates, **primitive_templates} description = generate_description( feature, feature_descriptions, primitive_templates, ) return description[:1].upper() + description[1:] + "." def generate_description(feature, feature_descriptions, primitive_templates): # Check if feature has custom description if feature in feature_descriptions or feature.unique_name() in feature_descriptions: description = feature_descriptions.get(feature) or feature_descriptions.get( feature.unique_name(), ) return description # Check if identity feature: if isinstance(feature, ft.IdentityFeature): description = feature.column_schema.description if description is None: description = 'the "{}"'.format(feature.column_name) return description # Handle direct features if isinstance(feature, ft.DirectFeature): base_feature, direct_description = get_direct_description(feature) direct_base = generate_description( base_feature, feature_descriptions, primitive_templates, ) return direct_base + direct_description # Get input descriptions input_descriptions = [] input_columns = feature.base_features if isinstance(feature, ft.feature_base.FeatureOutputSlice): input_columns = feature.base_feature.base_features for input_col in input_columns: col_description = generate_description( input_col, 
feature_descriptions, primitive_templates, ) input_descriptions.append(col_description) # Remove groupby description from input columns groupby_description = None if isinstance(feature, ft.GroupByTransformFeature): groupby_description = input_descriptions.pop() # Generate primitive description template_override = None if ( feature.primitive in primitive_templates or feature.primitive.name in primitive_templates ): template_override = primitive_templates.get( feature.primitive, ) or primitive_templates.get(feature.primitive.name) slice_num = feature.n if hasattr(feature, "n") else None primitive_description = feature.primitive.get_description( input_descriptions, slice_num=slice_num, template_override=template_override, ) if isinstance(feature, ft.feature_base.FeatureOutputSlice): feature = feature.base_feature # Generate groupby phrase if applicable groupby = "" if isinstance(feature, ft.AggregationFeature): groupby_description = get_aggregation_groupby(feature, feature_descriptions) if groupby_description is not None: if groupby_description.startswith("the "): groupby_description = groupby_description[4:] groupby = "for each {}".format(groupby_description) # Generate aggregation dataframe phrase with use_previous dataframe_description = "" if isinstance(feature, ft.AggregationFeature): if feature.use_previous: dataframe_description = "of the previous {} of ".format( feature.use_previous.get_name().lower(), ) else: dataframe_description = "of all instances of " dataframe_description += '"{}"'.format( feature.relationship_path[-1][1].child_dataframe.ww.name, ) # Generate where phrase where = "" if hasattr(feature, "where") and feature.where: where_col = generate_description( feature.where.base_features[0], feature_descriptions, primitive_templates, ) where = "where {} is {}".format(where_col, feature.where.primitive.value) # Join all parts of template description_template = [ primitive_description, dataframe_description, where, groupby, ] description = " 
".join([phrase for phrase in description_template if phrase != ""]) return description def get_direct_description(feature): direct_description = ( ' the instance of "{}" associated with this ' 'instance of "{}"'.format( feature.relationship_path[-1][1].parent_dataframe.ww.name, feature.dataframe_name, ) ) base_features = feature.base_features # shortens stacked direct features to make it easier to understand while isinstance(base_features[0], ft.DirectFeature): base_feat = base_features[0] base_feat_description = ' the instance of "{}" associated ' "with".format( base_feat.relationship_path[-1][1].parent_dataframe.ww.name, ) direct_description = base_feat_description + direct_description base_features = base_feat.base_features direct_description = " for" + direct_description return base_features[0], direct_description def get_aggregation_groupby(feature, feature_descriptions=None): if feature_descriptions is None: feature_descriptions = {} groupby_name = feature.dataframe.ww.index groupby = ft.IdentityFeature( feature.entityset[feature.dataframe_name].ww[groupby_name], ) if groupby in feature_descriptions or groupby.unique_name() in feature_descriptions: return feature_descriptions.get(groupby) or feature_descriptions.get( groupby.unique_name(), ) else: return '"{}" in "{}"'.format(groupby_name, feature.dataframe_name) def parse_json_metadata(file): with open(file) as f: json_metadata = json.load(f) return ( json_metadata.get("feature_descriptions", {}), json_metadata.get("primitive_templates", {}), ) ================================================ FILE: featuretools/feature_base/feature_visualizer.py ================================================ import html from featuretools.feature_base.feature_base import ( AggregationFeature, DirectFeature, FeatureOutputSlice, IdentityFeature, TransformFeature, ) from featuretools.feature_base.feature_descriptions import describe_feature from featuretools.utils.plot_utils import ( check_graphviz, get_graphviz_format, 
save_graph, ) TARGET_COLOR = "#D9EAD3" TABLE_TEMPLATE = """< {table_cols}
{dataframe_name}
>""" COL_TEMPLATE = """{}""" TARGET_TEMPLATE = """ {} """.format( "{}", "{}", target_color=TARGET_COLOR, ) def graph_feature(feature, to_file=None, description=False, **kwargs): """Generates a feature lineage graph for the given feature Args: feature (FeatureBase) : Feature to generate lineage graph for to_file (str, optional) : Path to where the plot should be saved. If set to None (as by default), the plot will not be saved. description (bool or str, optional): The feature description to use as a caption for the graph. If False, no description is added. Set to True to use an auto-generated description. Defaults to False. kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the ft.describe_feature function. Returns: graphviz.Digraph : Graph object that can directly be displayed in Jupyter notebooks. """ graphviz = check_graphviz() format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file) # Initialize a new directed graph graph = graphviz.Digraph( feature.get_name(), format=format_, graph_attr={"rankdir": "LR"}, ) dataframes = {} edges = ([], []) primitives = [] groupbys = [] _, max_depth = get_feature_data( feature, dataframes, groupbys, edges, primitives, layer=0, ) dataframes[feature.dataframe_name]["targets"].add(feature.get_name()) for df_name in dataframes: dataframe_name = ( "\u2605 {} (target)".format(df_name) if df_name == feature.dataframe_name else df_name ) dataframe_table = get_dataframe_table(dataframe_name, dataframes[df_name]) graph.attr("node", shape="plaintext") graph.node(df_name, dataframe_table) graph.attr("node", shape="diamond") num_primitives = len(primitives) for prim_name, prim_label, layer, prim_type in primitives: step_num = max_depth - layer if num_primitives == 1: type_str = ( '{}

'.format(prim_type) if prim_type else "" ) prim_label = "<{}{}>".format(type_str, prim_label) else: step = "Step {}".format(step_num) type_str = " " + prim_type if prim_type else "" prim_label = ( '<{}:{}

{}>'.format( step, type_str, prim_label, ) ) # sink first layer transform primitive if multiple primitives if step_num == 1 and prim_type == "Transform" and num_primitives > 1: with graph.subgraph() as init_transform: init_transform.attr(rank="min") init_transform.node(name=prim_name, label=prim_label) else: graph.node(name=prim_name, label=prim_label) graph.attr("node", shape="box") for groupby_name, groupby_label in groupbys: graph.node(name=groupby_name, label=groupby_label) graph.attr("edge", style="solid", dir="forward") for edge in edges[1]: graph.edge(*edge) graph.attr("edge", style="dotted", arrowhead="none", dir="forward") for edge in edges[0]: graph.edge(*edge) if description is True: graph.attr(label=describe_feature(feature, **kwargs)) elif description is not False: graph.attr(label=description) if to_file: save_graph(graph, to_file, format_) return graph def get_feature_data(feat, dataframes, groupbys, edges, primitives, layer=0): # 1) add feature to dataframes tables: feat_name = feat.get_name() if feat.dataframe_name not in dataframes: add_dataframe(feat.dataframe, dataframes) dataframe_dict = dataframes[feat.dataframe_name] # if we've already explored this feat, continue feat_node = "{}:{}".format(feat.dataframe_name, feat_name) if feat_name in dataframe_dict["columns"] or feat_name in dataframe_dict["feats"]: return feat_node, layer if isinstance(feat, IdentityFeature): dataframe_dict["columns"].add(feat_name) else: dataframe_dict["feats"].add(feat_name) base_node = feat_node # 2) if multi-output, convert feature to generic base if isinstance(feat, FeatureOutputSlice): feat = feat.base_feature feat_name = feat.get_name() # 3) add primitive node if feat.primitive.name or isinstance(feat, DirectFeature): prim_name = feat.primitive.name if feat.primitive.name else "join" prim_type = "" if isinstance(feat, AggregationFeature): prim_type = "Aggregation" elif isinstance(feat, TransformFeature): prim_type = "Transform" primitive_node = 
"{}_{}_{}".format(layer, feat_name, prim_name) primitives.append((primitive_node, prim_name.upper(), layer, prim_type)) edges[1].append([primitive_node, base_node]) base_node = primitive_node # 4) add groupby/join edges and nodes dependencies = [(dep.hash(), dep) for dep in feat.get_dependencies()] for is_forward, r in feat.relationship_path: if is_forward: if r.child_dataframe.ww.name not in dataframes: add_dataframe(r.child_dataframe, dataframes) dataframes[r.child_dataframe.ww.name]["columns"].add(r._child_column_name) child_node = "{}:{}".format(r.child_dataframe.ww.name, r._child_column_name) edges[0].append([base_node, child_node]) else: if r.child_dataframe.ww.name not in dataframes: add_dataframe(r.child_dataframe, dataframes) dataframes[r.child_dataframe.ww.name]["columns"].add(r._child_column_name) child_node = "{}:{}".format(r.child_dataframe.ww.name, r._child_column_name) child_name = child_node.replace(":", "--") groupby_node = "{}_groupby_{}".format(feat_name, child_name) groupby_name = "group by\n{}".format(r._child_column_name) groupbys.append((groupby_node, groupby_name)) edges[0].append([child_node, groupby_node]) edges[1].append([groupby_node, base_node]) base_node = groupby_node if hasattr(feat, "groupby"): groupby = feat.groupby _ = get_feature_data( groupby, dataframes, groupbys, edges, primitives, layer + 1, ) dependencies.remove((groupby.hash(), groupby)) groupby_name = groupby.get_name() if isinstance(groupby, IdentityFeature): dataframes[groupby.dataframe_name]["columns"].add(groupby_name) else: dataframes[groupby.dataframe_name]["feats"].add(groupby_name) child_node = "{}:{}".format(groupby.dataframe_name, groupby_name) child_name = child_node.replace(":", "--") groupby_node = "{}_groupby_{}".format(feat_name, child_name) groupby_name = "group by\n{}".format(groupby_name) groupbys.append((groupby_node, groupby_name)) edges[0].append([child_node, groupby_node]) edges[1].append([groupby_node, base_node]) base_node = groupby_node # 5) 
recurse over dependents max_depth = layer for _, f in dependencies: dependent_node, depth = get_feature_data( f, dataframes, groupbys, edges, primitives, layer + 1, ) edges[1].append([dependent_node, base_node]) max_depth = max(depth, max_depth) return feat_node, max_depth def add_dataframe(dataframe, dataframe_dict): dataframe_dict[dataframe.ww.name] = { "index": dataframe.ww.index, "targets": set(), "columns": set(), "feats": set(), } def get_dataframe_table(dataframe_name, dataframe_dict): """ given a dict of columns and feats, construct the html table for it """ index = dataframe_dict["index"] targets = dataframe_dict["targets"] columns = dataframe_dict["columns"].difference(targets) feats = dataframe_dict["feats"].difference(targets) # If the index is used, make sure it's the first element in the table clean_index = html.escape(index) if index in columns: rows = [COL_TEMPLATE.format(clean_index, clean_index + " (index)")] columns.discard(index) elif index in targets: rows = [TARGET_TEMPLATE.format(clean_index, clean_index + " (index)")] targets.discard(index) else: rows = [] for col in list(columns) + list(feats) + list(targets): template = COL_TEMPLATE if col in targets: template = TARGET_TEMPLATE col = html.escape(col) rows.append(template.format(col, col)) table = TABLE_TEMPLATE.format( dataframe_name=dataframe_name, table_cols="\n".join(rows), ) return table ================================================ FILE: featuretools/feature_base/features_deserializer.py ================================================ import json from featuretools.entityset.deserialize import ( description_to_entityset as deserialize_es, ) from featuretools.feature_base.feature_base import ( AggregationFeature, DirectFeature, Feature, FeatureBase, FeatureOutputSlice, GroupByTransformFeature, IdentityFeature, TransformFeature, ) from featuretools.primitives.utils import PrimitivesDeserializer from featuretools.utils.s3_utils import get_transport_params, use_smartopen_features from 
featuretools.utils.schema_utils import check_schema_version from featuretools.utils.wrangle import _is_s3, _is_url def load_features(features, profile_name=None): """Loads the features from a filepath, S3 path, URL, an open file, or a JSON formatted string. Args: features (str or :class:`.FileObject`): The file location of saved features. This must either be the name of the file, a JSON formatted string, or a readable file handle. profile_name (str, bool): The AWS profile used to read from S3. Will default to None and search for AWS credentials. Set to False to use an anonymous profile. Returns: features (list[:class:`.FeatureBase`]): Feature definitions list. Note: Features saved in one version of Featuretools or Python are not guaranteed to work in another. After upgrading Featuretools or Python, features may need to be generated again. Example: .. ipython:: python :suppress: import featuretools as ft import os .. code-block:: python # Option 1 filepath = os.path.join('/Home/features/', 'list.json') features = ft.load_features(filepath) # Option 2 filepath = os.path.join('/Home/features/', 'list.json') with open(filepath, 'r') as f: features = ft.load_features(f) # Option 3 filepath = os.path.join('/Home/features/', 'list.json') with open(filepath, 'r') as f: feature_str = f.read() features = ft.load_features(feature_str) .. 
seealso:: :func:`.save_features` """ return FeaturesDeserializer.load(features, profile_name).to_list() class FeaturesDeserializer(object): FEATURE_CLASSES = { "AggregationFeature": AggregationFeature, "DirectFeature": DirectFeature, "Feature": Feature, "FeatureBase": FeatureBase, "GroupByTransformFeature": GroupByTransformFeature, "IdentityFeature": IdentityFeature, "TransformFeature": TransformFeature, "FeatureOutputSlice": FeatureOutputSlice, } def __init__(self, features_dict): self.features_dict = features_dict self._check_schema_version() self.entityset = deserialize_es(features_dict["entityset"]) self._deserialized_features = {} # name -> feature primitive_deserializer = PrimitivesDeserializer() primitive_definitions = features_dict["primitive_definitions"] self._deserialized_primitives = { k: primitive_deserializer.deserialize_primitive(v) for k, v in primitive_definitions.items() } @classmethod def load(cls, features, profile_name): if isinstance(features, str): try: features_dict = json.loads(features) except ValueError: if _is_url(features) or _is_s3(features): transport_params = None if _is_s3(features): transport_params = get_transport_params(profile_name) features_dict = use_smartopen_features( features, transport_params=transport_params, ) else: with open(features, "r") as f: features_dict = json.load(f) return cls(features_dict) return cls(json.load(features)) def to_list(self): feature_names = self.features_dict["feature_list"] return [self._deserialize_feature(name) for name in feature_names] def _deserialize_feature(self, feature_name): if feature_name in self._deserialized_features: return self._deserialized_features[feature_name] feature_dict = self.features_dict["feature_definitions"][feature_name] dependencies_list = feature_dict["dependencies"] primitive = None primitive_id = feature_dict["arguments"].get("primitive") if primitive_id is not None: primitive = self._deserialized_primitives[primitive_id] # Collect dependencies into a dictionary 
of name -> feature. dependencies = { dependency: self._deserialize_feature(dependency) for dependency in dependencies_list } type = feature_dict["type"] cls = self.FEATURE_CLASSES.get(type) if not cls: raise RuntimeError('Unrecognized feature type "%s"' % type) args = feature_dict["arguments"] feature = cls.from_dictionary(args, self.entityset, dependencies, primitive) self._deserialized_features[feature_name] = feature return feature def _check_schema_version(self): check_schema_version(self, "features") ================================================ FILE: featuretools/feature_base/features_serializer.py ================================================ import json from featuretools.primitives.utils import serialize_primitive from featuretools.utils.s3_utils import get_transport_params, use_smartopen_features from featuretools.utils.wrangle import _is_s3, _is_url from featuretools.version import FEATURES_SCHEMA_VERSION from featuretools.version import __version__ as ft_version def save_features(features, location=None, profile_name=None): """Saves the features list as JSON to a specified filepath/S3 path, writes to an open file, or returns the serialized features as a JSON string. If no file provided, returns a string. Args: features (list[:class:`.FeatureBase`]): List of Feature definitions. location (str or :class:`.FileObject`, optional): The location of where to save the features list which must include the name of the file, or a writeable file handle to write to. If location is None, will return a JSON string of the serialized features. Default: None profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials. Set to False to use an anonymous profile. Note: Features saved in one version of Featuretools are not guaranteed to work in another. After upgrading Featuretools, features may need to be generated again. Example: .. 
ipython:: python :suppress: from featuretools.tests.testing_utils import ( make_ecommerce_entityset) import featuretools as ft es = make_ecommerce_entityset() import os .. code-block:: python f1 = ft.Feature(es["log"].ww["product_id"]) f2 = ft.Feature(es["log"].ww["purchased"]) f3 = ft.Feature(es["log"].ww["value"]) features = [f1, f2, f3] # Option 1 filepath = os.path.join('/Home/features/', 'list.json') ft.save_features(features, filepath) # Option 2 filepath = os.path.join('/Home/features/', 'list.json') with open(filepath, 'w') as f: ft.save_features(features, f) # Option 3 features_string = ft.save_features(features) .. seealso:: :func:`.load_features` """ return FeaturesSerializer(features).save(location, profile_name=profile_name) class FeaturesSerializer(object): def __init__(self, feature_list): self.feature_list = feature_list self._features_dict = None def to_dict(self): names_list = [feat.unique_name() for feat in self.feature_list] es = self.feature_list[0].entityset feature_defs, primitive_defs = self._feature_definitions() return { "schema_version": FEATURES_SCHEMA_VERSION, "ft_version": ft_version, "entityset": es.to_dictionary(), "feature_list": names_list, "feature_definitions": feature_defs, "primitive_definitions": primitive_defs, } def save(self, location, profile_name): features_dict = self.to_dict() if location is None: return json.dumps(features_dict) if isinstance(location, str): if _is_url(location): raise ValueError("Writing to URLs is not supported") if _is_s3(location): transport_params = get_transport_params(profile_name) use_smartopen_features( location, features_dict, transport_params, read=False, ) else: with open(location, "w") as f: json.dump(features_dict, f) else: json.dump(features_dict, location) def _feature_definitions(self): if not self._features_dict: self._features_dict = {} self._primitives_dict = {} for feature in self.feature_list: self._serialize_feature(feature) primitive_number = 0 primitive_id_to_key = {} for name, 
feature in self._features_dict.items(): primitive = feature["arguments"].get("primitive") if primitive: primitive_id = id(primitive) if primitive_id not in primitive_id_to_key.keys(): # Primitive we haven't seen before, add to dict and increment primitive_id counter # Always use string for keys because json conversion results in integer dict keys # being converted to strings, but integer dict values are not. primitives_dict_key = str(primitive_number) primitive_id_to_key[primitive_id] = primitives_dict_key self._primitives_dict[primitives_dict_key] = ( serialize_primitive(primitive) ) self._features_dict[name]["arguments"]["primitive"] = ( primitives_dict_key ) primitive_number += 1 else: # Primitive we have seen already - use existing primitive_id key key = primitive_id_to_key[primitive_id] self._features_dict[name]["arguments"]["primitive"] = key return self._features_dict, self._primitives_dict def _serialize_feature(self, feature): name = feature.unique_name() if name not in self._features_dict: self._features_dict[feature.unique_name()] = feature.to_dictionary() for dependency in feature.get_dependencies(deep=True): name = dependency.unique_name() if name not in self._features_dict: self._features_dict[name] = dependency.to_dictionary() ================================================ FILE: featuretools/feature_base/utils.py ================================================ def is_valid_input(candidate, template): """Checks if a candidate schema should be considered a match for a template schema""" if template.logical_type is not None and not isinstance( candidate.logical_type, type(template.logical_type), ): return False if len(template.semantic_tags - candidate.semantic_tags): return False return True ================================================ FILE: featuretools/feature_discovery/FeatureCollection.py ================================================ from __future__ import annotations import hashlib from itertools import combinations from typing import 
Any, Dict, List, Optional, Set, Type, Union, cast from woodwork.logical_types import LogicalType from featuretools.feature_discovery.LiteFeature import LiteFeature from featuretools.feature_discovery.type_defs import ANY from featuretools.feature_discovery.utils import hash_primitive, logical_types_map from featuretools.primitives.base.primitive_base import PrimitiveBase from featuretools.primitives.utils import ( PrimitivesDeserializer, ) class FeatureCollection: def __init__(self, features: List[LiteFeature]): self._all_features: List[LiteFeature] = features self.indexed = False self.sorted = False self._hash_key: Optional[str] = None def sort_features(self): if not self.sorted: self._all_features = sorted(self._all_features) self.sorted = True def __repr__(self): return f"" @property def all_features(self): return self._all_features.copy() @property def hash_key(self) -> str: if self._hash_key is None: if not self.sorted: self.sort_features() self._set_hash() assert self._hash_key is not None return self._hash_key def _set_hash(self): hash_msg = hashlib.sha256() for feature in self._all_features: hash_msg.update(feature.id.encode("utf-8")) self._hash_key = hash_msg.hexdigest() return self def __hash__(self): return hash(self.hash_key) def __eq__(self, other: FeatureCollection) -> bool: return self.hash_key == other.hash_key def reindex(self) -> FeatureCollection: self.by_logical_type: Dict[ Union[Type[LogicalType], None], Set[LiteFeature], ] = {} self.by_tag: Dict[str, Set[LiteFeature]] = {} self.by_origin_feature: Dict[LiteFeature, Set[LiteFeature]] = {} self.by_depth: Dict[int, Set[LiteFeature]] = {} self.by_name: Dict[str, LiteFeature] = {} self.by_key: Dict[str, List[LiteFeature]] = {} for feature in self._all_features: for key in self.feature_to_keys(feature): self.by_key.setdefault(key, []).append(feature) logical_type = feature.logical_type self.by_logical_type.setdefault(logical_type, set()).add(feature) tags = feature.tags for tag in tags: 
self.by_tag.setdefault(tag, set()).add(feature) origin_features = feature.get_origin_features() for origin_feature in origin_features: self.by_origin_feature.setdefault(origin_feature, set()).add(feature) if feature.depth == 0: self.by_origin_feature.setdefault(feature, set()).add(feature) feature_name = feature.name assert feature_name is not None assert feature_name not in self.by_name self.by_name[feature_name] = feature self.indexed = True return self def get_by_logical_type(self, logical_type: Type[LogicalType]) -> Set[LiteFeature]: return self.by_logical_type.get(logical_type, set()) def get_by_tag(self, tag: str) -> Set[LiteFeature]: return self.by_tag.get(tag, set()) def get_by_origin_feature(self, origin_feature: LiteFeature) -> Set[LiteFeature]: return self.by_origin_feature.get(origin_feature, set()) def get_by_origin_feature_name(self, name: str) -> Union[LiteFeature, None]: feature = self.by_name.get(name) return feature def get_dependencies_by_origin_name(self, name) -> Set[LiteFeature]: origin_feature = self.by_name.get(name) if origin_feature: return self.by_origin_feature[origin_feature] return set() def get_by_key(self, key: str) -> List[LiteFeature]: return self.by_key.get(key, []) def flatten_features(self) -> Dict[str, LiteFeature]: all_features_dict: Dict[str, LiteFeature] = {} def rfunc(feature_list: List[LiteFeature]): for feature in feature_list: all_features_dict.setdefault(feature.id, feature) rfunc(feature.base_features) rfunc(self._all_features) return all_features_dict def flatten_primitives(self) -> Dict[str, Dict[str, Any]]: all_primitives_dict: Dict[str, Dict[str, Any]] = {} def rfunc(feature_list: List[LiteFeature]): for feature in feature_list: if feature.primitive: key, prim_dict = hash_primitive(feature.primitive) all_primitives_dict.setdefault(key, prim_dict) rfunc(feature.base_features) rfunc(self._all_features) return all_primitives_dict def to_dict(self): all_primitives_dict = self.flatten_primitives() all_features_dict = 
self.flatten_features() return { "primitives": all_primitives_dict, "feature_ids": [f.id for f in self._all_features], "all_features": {k: f.to_dict() for k, f in all_features_dict.items()}, } @staticmethod def feature_to_keys(feature: LiteFeature) -> List[str]: """ Generate hashing keys from LiteFeature. For example: - LiteFeature("f1", Double, {"numeric"}) -> ['Double', 'numeric', 'Double,numeric', 'ANY'] - LiteFeature("f1", Datetime, {"time_index"}) -> ['Datetime', 'time_index', 'Datetime,time_index', 'ANY'] - LiteFeature("f1", Double, {"index", "other"}) -> ['Double', 'index', 'other', 'Double,index', 'Double,other', 'ANY'] Args: feature (LiteFeature): Returns: List[str] List of hashing keys """ keys: List[str] = [] logical_type = feature.logical_type logical_type_name = None if logical_type is not None: logical_type_name = logical_type.__name__ keys.append(logical_type_name) all_tags = sorted(feature.tags) tag_combinations = [] # generate combinations of all lengths from 1 to the length of the input list for i in range(1, len(all_tags) + 1): # generate combinations of length i and append to the combinations_list for comb in combinations(all_tags, i): tag_combinations.append(list(comb)) for tag_combination in tag_combinations: tags_key = ",".join(tag_combination) keys.append(tags_key) if logical_type_name: keys.append(f"{logical_type_name},{tags_key}") keys.append(ANY) return keys @staticmethod def from_dict(input_dict): primitive_deserializer = PrimitivesDeserializer() primitives = {} for prim_key, prim_dict in input_dict["primitives"].items(): primitive = primitive_deserializer.deserialize_primitive( prim_dict, ) assert isinstance(primitive, PrimitiveBase) primitives[prim_key] = primitive hydrated_features: Dict[str, LiteFeature] = {} feature_ids: List[str] = cast(List[str], input_dict["feature_ids"]) all_features: Dict[str, Any] = cast(Dict[str, Any], input_dict["all_features"]) def hydrate_feature(feature_id: str) -> LiteFeature: if feature_id in 
hydrated_features: return hydrated_features[feature_id] feature_dict = all_features[feature_id] base_features = [hydrate_feature(x) for x in feature_dict["base_features"]] logical_type = ( logical_types_map[feature_dict["logical_type"]] if feature_dict["logical_type"] else None ) hydrated_feature = LiteFeature( name=feature_dict["name"], logical_type=logical_type, tags=set(feature_dict["tags"]), primitive=primitives[feature_dict["primitive"]] if feature_dict["primitive"] else None, base_features=base_features, df_id=feature_dict["df_id"], related_features=set(), idx=feature_dict["idx"], ) assert hydrated_feature.id == feature_dict["id"] == feature_id hydrated_features[feature_id] = hydrated_feature # need to link after features are stored on cache related_features = [ hydrate_feature(x) for x in feature_dict["related_features"] ] hydrated_feature.related_features = set(related_features) return hydrated_feature return FeatureCollection([hydrate_feature(x) for x in feature_ids]) ================================================ FILE: featuretools/feature_discovery/LiteFeature.py ================================================ from __future__ import annotations import hashlib from dataclasses import field from functools import total_ordering from typing import Any, Dict, List, Optional, Set, Type, Union from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LogicalType from featuretools.feature_discovery.utils import ( get_primitive_return_type, hash_primitive, ) from featuretools.primitives.base.primitive_base import PrimitiveBase @total_ordering class LiteFeature: _name: Optional[str] = None _alias: Optional[str] = None _logical_type: Optional[Type[LogicalType]] = None _tags: Set[str] = field(default_factory=set) _primitive: Optional[PrimitiveBase] = None _base_features: List[LiteFeature] = field(default_factory=list) _df_id: Optional[str] = None _id: str _n_output_features: int = 1 _depth = 0 _related_features: Set[LiteFeature] _idx: int 
= 0 def __init__( self, name: Optional[str] = None, logical_type: Optional[Type[LogicalType]] = None, tags: Optional[Set[str]] = None, primitive: Optional[PrimitiveBase] = None, base_features: Optional[List[LiteFeature]] = None, df_id: Optional[str] = None, related_features: Optional[Set[LiteFeature]] = None, idx: Optional[int] = None, ): self._logical_type = logical_type self._tags = tags if tags else set() self._primitive = primitive self._base_features = base_features if base_features else [] self._df_id = df_id self._idx = idx if idx is not None else 0 self._related_features = related_features if related_features else set() if self._primitive: if not isinstance(self._primitive, PrimitiveBase): raise ValueError("primitive input must be of type PrimitiveBase") if len(self.base_features) == 0: raise ValueError("there must be base features if given a primitive") if self._primitive.commutative: self._base_features = sorted(self._base_features) self._n_output_features = self._primitive.number_output_features self._depth = max([x.depth for x in self.base_features]) + 1 if name: self._alias = name self._name = self._primitive.generate_name( [x.name for x in self.base_features], ) return_column_schema = get_primitive_return_type(self._primitive) self._logical_type = ( type(return_column_schema.logical_type) if return_column_schema.logical_type else None ) self._tags = return_column_schema.semantic_tags else: if name is None: raise TypeError("Name must be given if origin feature") if self._logical_type is None: raise TypeError("Logical Type must be given if origin feature") self._name = name if self._logical_type is not None and "index" not in self._tags: self._tags = self._tags | self._logical_type.standard_tags self._id = self._generate_hash() @property def name(self): if self._alias: return self._alias elif self.is_multioutput(): return f"{self._name}[{self.idx}]" return self._name @name.setter def name(self, _): raise AttributeError("name is immutable") def 
set_alias(self, value: Union[str, None]): self._alias = value @property def non_indexed_name(self): if not self.is_multioutput(): raise ValueError("only used on multioutput features") return self._name @property def logical_type(self): return self._logical_type @logical_type.setter def logical_type(self, _): raise AttributeError("logical_type is immutable") @property def tags(self): return self._tags.copy() @tags.setter def tags(self, _): raise AttributeError("tags is immutable") @property def primitive(self): return self._primitive @primitive.setter def primitive(self, _): raise AttributeError("primitive is immutable") @property def base_features(self): return self._base_features @base_features.setter def base_features(self, _): raise AttributeError("base_features are immutable") @property def df_id(self): return self._df_id @df_id.setter def df_id(self, _): raise AttributeError("df_id is immutable") @property def id(self): return self._id @id.setter def id(self, _): raise AttributeError("id is immutable") @property def n_output_features(self): return self._n_output_features @n_output_features.setter def n_output_features(self, _): raise AttributeError("n_output_features is immutable") @property def depth(self): return self._depth @depth.setter def depth(self, _): raise AttributeError("depth is immutable") @property def related_features(self): return self._related_features.copy() @related_features.setter def related_features(self, value: Set[LiteFeature]): self._related_features = value @property def idx(self): return self._idx @idx.setter def idx(self, _): raise AttributeError("idx is immutable") @staticmethod def hash( name: Optional[str], primitive: Optional[PrimitiveBase] = None, base_features: List[LiteFeature] = [], df_id: Optional[str] = None, idx: int = 0, ): hash_msg = hashlib.sha256() if primitive: # TODO: hashing should be on primitive hash_msg.update(hash_primitive(primitive)[0].encode("utf-8")) commutative = primitive.commutative assert ( 
len(base_features) > 0 ), "there must be base features if given a primitive" base_columns = base_features if commutative: base_features.sort() for c in base_columns: hash_msg.update(c.id.encode("utf-8")) else: assert name hash_msg.update(name.encode("utf-8")) if df_id: hash_msg.update(df_id.encode("utf-8")) hash_msg.update(str(idx).encode("utf-8")) return hash_msg.hexdigest() def __eq__(self, other: LiteFeature): return self._id == other._id def __lt__(self, other: LiteFeature): return self._id < other._id def __ne__(self, other): return self._id != other._id def __hash__(self): return hash(self._id) def _generate_hash(self) -> str: return self.hash( name=self._name, primitive=self._primitive, base_features=self._base_features, df_id=self._df_id, idx=self._idx, ) def get_primitive_name(self) -> Union[str, None]: return self._primitive.name if self._primitive else None def get_dependencies(self, deep=False) -> List[LiteFeature]: flattened_dependencies = [] for f in self._base_features: flattened_dependencies.append(f) if deep: dependencies = f.get_dependencies() if isinstance(dependencies, list): flattened_dependencies.extend(dependencies) else: flattened_dependencies.append(dependencies) return flattened_dependencies def get_origin_features(self) -> List[LiteFeature]: all_dependencies = self.get_dependencies(deep=True) return [f for f in all_dependencies if f._depth == 0] @property def column_schema(self) -> ColumnSchema: return ColumnSchema(logical_type=self.logical_type, semantic_tags=self.tags) def dependent_primitives(self) -> Set[Type[PrimitiveBase]]: dependent_features = self.get_dependencies(deep=True) dependent_primitives = { type(f._primitive) for f in dependent_features if f._primitive } if self._primitive: dependent_primitives.add(type(self._primitive)) return dependent_primitives def to_dict(self) -> Dict[str, Any]: return { "name": self.name, "logical_type": self.logical_type.__name__ if self.logical_type else None, "tags": list(self.tags), "primitive": 
hash_primitive(self.primitive)[0] if self.primitive else None, "base_features": [x.id for x in self.base_features], "df_id": self.df_id, "id": self.id, "related_features": [x.id for x in self.related_features], "idx": self.idx, } def is_multioutput(self) -> bool: return len(self._related_features) > 0 def copy(self) -> LiteFeature: copied_feature = LiteFeature( name=self._name, logical_type=self._logical_type, tags=self._tags.copy(), primitive=self._primitive, base_features=[f.copy() for f in self._base_features], df_id=self._df_id, idx=self._idx, related_features=self._related_features.copy(), ) copied_feature.set_alias(self._alias) return copied_feature def __repr__(self) -> str: name = f"name='{self.name}'" logical_type = f"logical_type={self.logical_type}" tags = f"tags={self.tags}" primitive = f"primitive={self.get_primitive_name()}" return f"LiteFeature({name}, {logical_type}, {tags}, {primitive})" ================================================ FILE: featuretools/feature_discovery/__init__.py ================================================ ================================================ FILE: featuretools/feature_discovery/convertors.py ================================================ from __future__ import annotations from typing import Dict, List import pandas as pd from woodwork.logical_types import LogicalType from featuretools.feature_base.feature_base import ( FeatureBase, IdentityFeature, TransformFeature, ) from featuretools.feature_discovery.LiteFeature import LiteFeature from featuretools.primitives import TransformPrimitive from featuretools.primitives.base.primitive_base import PrimitiveBase FeatureCache = Dict[str, FeatureBase] def convert_featurebase_list_to_feature_list( featurebase_list: List[FeatureBase], ) -> List[LiteFeature]: """ Convert a list of FeatureBase objects to a list of LiteFeature objects Args: featurebase_list (List[FeatureBase]): Returns: LiteFeatures (List[LiteFeature]) - converted LiteFeature objects """ def rfunc(fb: 
FeatureBase) -> List[LiteFeature]: base_features = [ feature for feature_list in [rfunc(x) for x in fb.base_features] for feature in feature_list ] col_schema = fb.column_schema logical_type = col_schema.logical_type if logical_type is not None: assert issubclass(type(logical_type), LogicalType) logical_type = type(logical_type) tags = col_schema.semantic_tags if isinstance(fb, IdentityFeature): primitive = None else: primitive = fb.primitive assert isinstance(primitive, PrimitiveBase) if fb.number_output_features > 1: features: List[LiteFeature] = [] for idx, name in enumerate(fb.get_feature_names()): f = LiteFeature( name=name, logical_type=logical_type, tags=tags, primitive=primitive, base_features=base_features, # TODO: use when working with multi-table df_id=None, idx=idx, ) features.append(f) for feature in features: related_features = [f for f in features if f.id != feature.id] feature.related_features = set(related_features) return features return [ LiteFeature( name=fb.get_name(), logical_type=logical_type, tags=tags, primitive=primitive, base_features=base_features, # TODO: use when working with multi-table df_id=None, ), ] return [ feature for feature_list in [rfunc(fb) for fb in featurebase_list] for feature in feature_list ] def _feature_to_transform_feature( feature: LiteFeature, base_features: List[FeatureBase], ) -> FeatureBase: """ Transform LiteFeature into FeatureBase object. Handles the Multi-output feature in correct way. 
Args: feature (LiteFeature) base_features (List[FeatureBase]) Returns: FeatureBase """ assert feature.primitive assert isinstance( feature.primitive, TransformPrimitive, ), "Only Transform Primitives" fb = TransformFeature(base_features, feature.primitive) if feature.is_multioutput(): sorted_features = sorted( [f for f in feature.related_features] + [feature], key=lambda x: x.idx, ) names = [x.name for x in sorted_features] fb = fb.rename(feature.non_indexed_name) fb.set_feature_names(names) else: fb = fb.rename(feature.name) return fb def _convert_feature_to_featurebase( feature: LiteFeature, dataframe: pd.DataFrame, cache: FeatureCache, ) -> FeatureBase: """ Recursively transforms a LiteFeature object into a FeatureBase object Args: feature (LiteFeature) dataframe (pd.DataFrame) cache (FeatureCache) already converted features Returns: FeatureBase """ def get_base_features( feature: LiteFeature, ) -> List[FeatureBase]: new_base_features: List[FeatureBase] = [] for bf in feature.base_features: fb = rfunc(bf) if bf.is_multioutput(): idx = bf.idx # if it's multioutput, you can index on the FeatureBase new_base_features.append(fb[idx]) else: new_base_features.append(fb) return new_base_features def rfunc(feature: LiteFeature) -> FeatureBase: # if feature has already been converted, return from cache if feature.id in cache: return cache[feature.id] # if depth is 0, we are at an origin feature if feature.depth == 0: fb = IdentityFeature(dataframe.ww[feature.name]) cache[feature.id] = fb return fb base_features = get_base_features(feature) fb = _feature_to_transform_feature(feature, base_features) cache[feature.id] = fb return fb return rfunc(feature) def convert_feature_list_to_featurebase_list( feature_list: List[LiteFeature], dataframe: pd.DataFrame, ) -> List[FeatureBase]: """ Convert a list of LiteFeature objects into a list of FeatureBase objects Args: feature_list (List[LiteFeature]) dataframe (pd.DataFrame) Returns: List[FeatureBase] """ feature_cache: 
FeatureCache = {} converted_features: List[FeatureBase] = [] for feature in feature_list: if feature.is_multioutput(): related_feature_ids = [f.id for f in feature.related_features] if any((x in feature_cache for x in related_feature_ids)): # feature base already created for related ids continue fb = _convert_feature_to_featurebase( feature=feature, dataframe=dataframe, cache=feature_cache, ) converted_features.append(fb) return converted_features ================================================ FILE: featuretools/feature_discovery/feature_discovery.py ================================================ import inspect from collections import defaultdict from itertools import combinations, permutations, product from typing import Iterable, List, Set, Tuple, Type, Union, cast from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LogicalType from woodwork.table_schema import TableSchema from featuretools.feature_discovery.FeatureCollection import FeatureCollection from featuretools.feature_discovery.LiteFeature import LiteFeature from featuretools.feature_discovery.utils import column_schema_to_keys, flatten_list from featuretools.primitives.base.primitive_base import PrimitiveBase def _index_column_set(column_set: List[ColumnSchema]) -> List[Tuple[str, int]]: """ Indexes input set to find types of columns and the quantity of each Args: column_set (List(ColumnSchema)): List of Column types needed by associated primitive. Returns: List[Tuple[str, int]] A list of key, count tuples Examples: .. 
code-block:: python from featuretools.feature_discovery.feature_discovery import _index_column_set from woodwork.column_schema import ColumnSchema column_set = [ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"})] indexed_column_set = _index_column_set(column_set) [("numeric", 2)] """ out = defaultdict(int) for column_schema in column_set: key = column_schema_to_keys(column_schema) out[key] += 1 return list(out.items()) def _get_features( feature_collection: FeatureCollection, column_keys: Tuple[Tuple[str, int]], commutative: bool, ) -> List[List[LiteFeature]]: """ Calculates all LiteFeature combinations using the given hashmap of existing features, and the input set of required columns. Args: feature_collection (FeatureCollection): An indexed feature collection object for efficient querying of features column_keys (Tuple[Tuple[str, int]]): Key, count tuples describing the column types needed by the associated primitive. commutative (bool): if True, use combinations to create feature sets; otherwise use permutations. Returns: List[List[LiteFeature]] A list of LiteFeature sets. Examples: .. 
code-block:: python from featuretools.feature_discovery.feature_discovery import _get_features, _index_column_set from woodwork.column_schema import ColumnSchema feature_groups = { "ANY": ["f1", "f2", "f3"], "Double": ["f1", "f2", "f3"], "numeric": ["f1", "f2", "f3"], "Double,numeric": ["f1", "f2", "f3"], } column_set = [ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"})] column_keys = tuple(_index_column_set(column_set)) features = _get_features(feature_groups, column_keys, commutative=False) """ prod_iter = [] for key, count in column_keys: relevant_features = list(feature_collection.get_by_key(key)) if commutative: prod_iter.append(combinations(relevant_features, count)) else: prod_iter.append(permutations(relevant_features, count)) feature_combinations = product(*prod_iter) return [flatten_list(x) for x in feature_combinations] def _primitive_to_columnsets(primitive: PrimitiveBase) -> List[List[ColumnSchema]]: column_sets = primitive.input_types assert column_sets is not None if not isinstance(column_sets[0], list): column_sets = [primitive.input_types] column_sets = cast(List[List[ColumnSchema]], column_sets) # Some primitives are commutative, yet have explicit versions of commutative pairs (e.g. MultiplyNumericBoolean), # which would create multiple versions, so this resolves that. if primitive.commutative: existing = set() uniq_column_sets = [] for column_set in column_sets: key = "_".join(sorted([x.__repr__() for x in column_set])) if key not in existing: uniq_column_sets.append(column_set) existing.add(key) column_sets = uniq_column_sets return column_sets def _get_matching_features( feature_collection: FeatureCollection, primitive: PrimitiveBase, ) -> List[List[LiteFeature]]: """ For a given primitive, find all feature sets that can be used to create a new feature Args: feature_collection (FeatureCollection): An indexed feature collection object for efficient querying of features primitive (PrimitiveBase) Returns: List[List[LiteFeature]] List of feature sets Examples: .. 
code-block:: python from featuretools.feature_discovery.feature_discovery import _get_matching_features from featuretools.primitives import AddNumeric feature_groups = { "ANY": ["f1", "f2", "f3"], "Double": ["f1", "f2", "f3"], "numeric": ["f1", "f2", "f3"], "Double,numeric": ["f1", "f2", "f3"], } feature_sets = _get_matching_features(feature_groups, AddNumeric()) [ ["f1", "f2"], ["f1", "f3"], ["f2", "f3"] ] """ column_sets = _primitive_to_columnsets(primitive=primitive) column_keys_set = [_index_column_set(c) for c in column_sets] commutative = primitive.commutative feature_sets = [] for column_keys in column_keys_set: assert column_keys is not None feature_sets_ = _get_features( feature_collection=feature_collection, column_keys=tuple(column_keys), commutative=commutative, ) feature_sets.extend(feature_sets_) return feature_sets def _features_from_primitive( primitive: PrimitiveBase, feature_collection: FeatureCollection, ) -> List[LiteFeature]: """ For a given primitive, creates all engineered features Args: primitive (PrimitiveBase): A primitive instance feature_collection (FeatureCollection): An indexed feature collection object for efficient querying of features Returns: List[LiteFeature] List of engineered features Examples: .. 
code-block:: python from featuretools.feature_discovery.feature_discovery import _features_from_primitive from featuretools.primitives import AddNumeric feature_groups = { "ANY": ["f1", "f2", "f3"], "Double": ["f1", "f2", "f3"], "numeric": ["f1", "f2", "f3"], "Double,numeric": ["f1", "f2", "f3"], } features = _features_from_primitive(AddNumeric(), feature_groups) """ assert isinstance(primitive, PrimitiveBase) features: List[LiteFeature] = [] feature_sets = _get_matching_features( feature_collection=feature_collection, primitive=primitive, ) for feature_set in feature_sets: if primitive.number_output_features > 1: related_features: Set[LiteFeature] = set() for n in range(primitive.number_output_features): feature = LiteFeature( primitive=primitive, base_features=feature_set, idx=n, ) related_features.add(feature) for f in related_features: f.related_features = related_features - {f} features.append(f) else: features.append( LiteFeature( primitive=primitive, base_features=feature_set, ), ) return features def schema_to_features(schema: TableSchema) -> List[LiteFeature]: """ ** EXPERIMENTAL ** Convert a Woodwork Schema object to a list of LiteFeatures. Args: schema (TableSchema): Woodwork TableSchema object Returns: List[LiteFeature] Examples: .. 
code-block:: python from featuretools.feature_discovery.feature_discovery import schema_to_features from featuretools.primitives import Absolute, IsNull import pandas as pd import woodwork as ww df = pd.DataFrame({ "idx": [0,1,2,3], "f1": ["A", "B", "C", "D"], "f2": [1.2, 2.3, 3.4, 4.5] }) df.ww.init() features = schema_to_features(df.ww.schema) """ features = [] for col_name, column_schema in schema.columns.items(): assert isinstance(column_schema, ColumnSchema) logical_type = column_schema.logical_type assert logical_type assert issubclass(type(logical_type), LogicalType) tags = column_schema.semantic_tags assert isinstance(tags, set) features.append( LiteFeature( name=col_name, logical_type=type(logical_type), tags=tags, ), ) return features def _check_inputs( input_features: Iterable[LiteFeature], primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]], ) -> Tuple[Iterable[LiteFeature], List[PrimitiveBase]]: if not isinstance(input_features, Iterable): raise ValueError("input_features must be an iterable of LiteFeature objects") for feature in input_features: if not isinstance(feature, LiteFeature): raise ValueError( "input_features must be an iterable of LiteFeature objects", ) if not isinstance(primitives, List): raise ValueError( "primitives must be a list of Primitive classes or Primitive instances", ) primitive_instances: List[PrimitiveBase] = [] for primitive in primitives: if inspect.isclass(primitive) and issubclass(primitive, PrimitiveBase): primitive_instances.append(primitive()) elif isinstance(primitive, PrimitiveBase): primitive_instances.append(primitive) else: raise ValueError( "primitives must be a list of Primitive classes or Primitive instances", ) return (input_features, primitive_instances) def generate_features_from_primitives( input_features: Iterable[LiteFeature], primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]], ) -> List[LiteFeature]: """ ** EXPERIMENTAL ** Calculates all Features for a given input of features 
and a list of primitives. Args: input_features (Iterable[LiteFeature]): Features to build new features from primitives (Union[List[Type[PrimitiveBase]], List[PrimitiveBase]]): List of primitive classes or primitive instances Returns: List[LiteFeature] Examples: .. code-block:: python from featuretools.feature_discovery.feature_discovery import generate_features_from_primitives, schema_to_features from featuretools.primitives import Absolute, IsNull import pandas as pd import woodwork as ww df = pd.DataFrame({ "idx": [0,1,2,3], "f1": ["A", "B", "C", "D"], "f2": [1.2, 2.3, 3.4, 4.5] }) df.ww.init() origin_features = schema_to_features(df.ww.schema) features = generate_features_from_primitives(origin_features, [Absolute, IsNull]) """ (input_features, primitives) = _check_inputs(input_features, primitives) features = [x.copy() for x in input_features] feature_collection = FeatureCollection(features=features) feature_collection.reindex() for primitive in primitives: features_ = _features_from_primitive( primitive=primitive, feature_collection=feature_collection, ) features.extend(features_) return features ================================================ FILE: featuretools/feature_discovery/type_defs.py ================================================ ANY = "ANY" ================================================ FILE: featuretools/feature_discovery/utils.py ================================================ import hashlib import json from functools import lru_cache from typing import Any, Dict, Tuple from woodwork.column_schema import ColumnSchema from featuretools.feature_discovery.type_defs import ANY from featuretools.primitives.base.primitive_base import PrimitiveBase from featuretools.primitives.utils import ( get_all_logical_type_names, get_all_primitives, serialize_primitive, ) primitives_map = get_all_primitives() logical_types_map = get_all_logical_type_names() def column_schema_to_keys(column_schema: ColumnSchema) -> str: """ Generate a hashing key from a ColumnSchema. 
For example: - ColumnSchema(logical_type=Double) -> "Double" - ColumnSchema(semantic_tags={"index"}) -> "index" - ColumnSchema(logical_type=Double, semantic_tags={"index", "other"}) -> "Double,index,other" Args: column_schema (ColumnSchema): Returns: str: hashing key """ logical_type = column_schema.logical_type tags = column_schema.semantic_tags lt_key = None if logical_type: lt_key = type(logical_type).__name__ tags = sorted(tags) if len(tags) > 0: tag_key = ",".join(tags) return f"{lt_key},{tag_key}" if lt_key is not None else tag_key elif lt_key is not None: return lt_key else: return ANY @lru_cache(maxsize=None) def hash_primitive(primitive: PrimitiveBase) -> Tuple[str, Dict[str, Any]]: hash_msg = hashlib.sha256() primitive_name = primitive.name assert isinstance(primitive_name, str) primitive_dict = serialize_primitive(primitive) primitive_json = json.dumps(primitive_dict).encode("utf-8") hash_msg.update(primitive_json) key = hash_msg.hexdigest() return (key, primitive_dict) def get_primitive_return_type(primitive: PrimitiveBase) -> ColumnSchema: """ Get Return type from a primitive Args: primitive (PrimitiveBase) Returns: ColumnSchema """ if primitive.return_type: return primitive.return_type return_type = primitive.input_types[0] if isinstance(return_type, list): return_type = return_type[0] return return_type def flatten_list(nested_list): return [item for sublist in nested_list for item in sublist] ================================================ FILE: featuretools/primitives/__init__.py ================================================ # flake8: noqa import inspect import logging import traceback import pkg_resources from featuretools.primitives.standard import * from featuretools.primitives.utils import ( get_aggregation_primitives, get_default_aggregation_primitives, get_default_transform_primitives, get_transform_primitives, list_primitives, summarize_primitives, ) def _load_primitives(): """Load in a list of primitives registered by other libraries 
into Featuretools. Example entry_points definition for a library using this entry point either in: - setup.py: setup( entry_points={ 'featuretools_primitives': [ 'other_library = other_library', ], }, ) - setup.cfg: [options.entry_points] featuretools_primitives = other_library = other_library - pyproject.toml: [project.entry-points."featuretools_primitives"] other_library = "other_library" where `other_library` is a top-level module containing all the primitives. """ logger = logging.getLogger("featuretools") base_primitives = AggregationPrimitive, TransformPrimitive # noqa: F405 for entry_point in pkg_resources.iter_entry_points("featuretools_primitives"): try: loaded = entry_point.load() except Exception: message = f'Featuretools failed to load "{entry_point.name}" primitives from "{entry_point.module_name}". ' message += "For a full stack trace, set logging to debug." logger.warning(message) logger.debug(traceback.format_exc()) continue for key in dir(loaded): primitive = getattr(loaded, key, None) if ( inspect.isclass(primitive) and issubclass(primitive, base_primitives) and primitive not in base_primitives ): name = primitive.__name__ scope = globals() if name in scope: this_module, that_module = ( primitive.__module__, scope[name].__module__, ) message = f'While loading primitives via "{entry_point.name}" entry point, ' message += ( f'ignored primitive "{name}" from "{this_module}" because ' ) message += ( f'a primitive with that name already exists in "{that_module}"' ) logger.warning(message) else: scope[name] = primitive _load_primitives() ================================================ FILE: featuretools/primitives/base/__init__.py ================================================ from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.primitives.base.primitive_base import PrimitiveBase from featuretools.primitives.base.transform_primitive_base import TransformPrimitive 
================================================ FILE: featuretools/primitives/base/aggregation_primitive_base.py ================================================ from featuretools.primitives.base.primitive_base import PrimitiveBase class AggregationPrimitive(PrimitiveBase): def generate_name( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): base_features_str = ", ".join(base_feature_names) return "%s(%s.%s%s%s%s)" % ( self.name.upper(), relationship_path_name, base_features_str, where_str, use_prev_str, self.get_args_string(), ) def generate_names( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): n = self.number_output_features base_name = self.generate_name( base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ) return [base_name + "[%s]" % i for i in range(n)] ================================================ FILE: featuretools/primitives/base/primitive_base.py ================================================ import os from inspect import signature import numpy as np import pandas as pd from featuretools import config from featuretools.utils.description_utils import convert_to_nth class PrimitiveBase(object): """Base class for all primitives.""" #: (str): Name of the primitive name = None #: (list): woodwork.ColumnSchema types of inputs input_types = None #: (woodwork.ColumnSchema): ColumnSchema type of return return_type = None #: Default value this feature returns if no data found. Defaults to np.nan default_value = np.nan #: (bool): True if feature needs to know what the current calculation time # is (provided to computational backend as "time_last") uses_calc_time = False #: (int): Maximum number of features in the largest chain proceeding # downward from this feature's base features. 
max_stack_depth = None #: (int): Number of columns in feature matrix associated with this feature number_output_features = 1 # whitelist of primitives can have this primitive in input_types base_of = None # blacklist of primitives can have this primitive in input_types base_of_exclude = None # whitelist of primitives that can be in input_types stack_on = None # blacklist of primitives that can be in signature stack_on_exclude = None # determines if primitive can be in input_types for self stack_on_self = True # (bool) If True will only make one feature per unique set of base features commutative = False #: (str, list[str]): description template of the primitive. Input column # descriptions are passed as positional arguments to the template. Slice # number (if present) in "nth" form is passed to the template via the # `nth_slice` keyword argument. Multi-output primitives can use a list to # differentiate between the base description and a slice description. description_template = None def __init__(self): pass def __call__(self, *args, **kwargs): series_args = [pd.Series(arg) for arg in args] try: return self._method(*series_args, **kwargs) except AttributeError: self._method = self.get_function() return self._method(*series_args, **kwargs) def __lt__(self, other): return (self.name + self.get_args_string()) < ( other.name + other.get_args_string() ) def generate_name(self): raise NotImplementedError("Subclass must implement") def generate_names(self): raise NotImplementedError("Subclass must implement") def get_function(self): raise NotImplementedError("Subclass must implement") def get_filepath(self, filename): return os.path.join(config.get("primitive_data_folder"), filename) def get_args_string(self): strings = [] for name, value in self.get_arguments(): # format arg to string string = "{}={}".format(name, str(value)) strings.append(string) if len(strings) == 0: return "" string = ", ".join(strings) string = ", " + string return string def get_arguments(self): 
values = [] args = signature(self.__class__).parameters.items() for name, arg in args: # assert that arg is attribute of primitive error = '"{}" must be attribute of {}' assert hasattr(self, name), error.format(name, self.__class__.__name__) value = getattr(self, name) # check if args are the same type if isinstance(value, type(arg.default)): # skip if default value if arg.default == value: continue values.append((name, value)) return values def get_description( self, input_column_descriptions, slice_num=None, template_override=None, ): template = template_override or self.description_template if template: if isinstance(template, list): if slice_num is not None: slice_index = slice_num + 1 if slice_index < len(template): return template[slice_index].format( *input_column_descriptions, nth_slice=convert_to_nth(slice_index), ) else: if len(template) > 2: raise IndexError("Slice out of range of template") return template[1].format( *input_column_descriptions, nth_slice=convert_to_nth(slice_index), ) else: template = template[0] return template.format(*input_column_descriptions) # generic case: name = self.name.upper() if self.name is not None else type(self).__name__ if slice_num is not None: nth_slice = convert_to_nth(slice_num + 1) description = "the {} output from applying {} to {}".format( nth_slice, name, ", ".join(input_column_descriptions), ) else: description = "the result of applying {} to {}".format( name, ", ".join(input_column_descriptions), ) return description @staticmethod def flatten_nested_input_types(input_types): """Flattens nested column schema inputs into a single list.""" if isinstance(input_types[0], list): input_types = [ sub_input for input_obj in input_types for sub_input in input_obj ] return input_types ================================================ FILE: featuretools/primitives/base/transform_primitive_base.py ================================================ from featuretools.primitives.base.primitive_base import PrimitiveBase class 
TransformPrimitive(PrimitiveBase): """Feature for a dataframe that is based on one or more other features in that dataframe.""" # (bool) If True, feature function depends on all values of dataframe # (and will receive these values as input, regardless of specified instance ids) uses_full_dataframe = False def generate_name(self, base_feature_names): return "%s(%s%s)" % ( self.name.upper(), ", ".join(base_feature_names), self.get_args_string(), ) def generate_names(self, base_feature_names): n = self.number_output_features base_name = self.generate_name(base_feature_names) return [base_name + "[%s]" % i for i in range(n)] ================================================ FILE: featuretools/primitives/options_utils.py ================================================ import logging import warnings from itertools import permutations from featuretools import primitives from featuretools.feature_base import IdentityFeature logger = logging.getLogger("featuretools") def _get_primitive_options(): # all possible option keys: function that verifies value type return { "ignore_dataframes": list_dataframe_check, "include_dataframes": list_dataframe_check, "ignore_columns": dict_to_list_column_check, "include_columns": dict_to_list_column_check, "ignore_groupby_dataframes": list_dataframe_check, "include_groupby_dataframes": list_dataframe_check, "ignore_groupby_columns": dict_to_list_column_check, "include_groupby_columns": dict_to_list_column_check, } def dict_to_list_column_check(option, es): if not ( isinstance(option, dict) and all([isinstance(option_val, list) for option_val in option.values()]) ): return False else: for dataframe, columns in option.items(): if dataframe not in es: warnings.warn("Dataframe '%s' not in entityset" % (dataframe)) else: for invalid_col in [ column for column in columns if column not in es[dataframe] ]: warnings.warn( "Column '%s' not in dataframe '%s'" % (invalid_col, dataframe), ) return True def list_dataframe_check(option, es): if not 
isinstance(option, list): return False else: for invalid_dataframe in [ dataframe for dataframe in option if dataframe not in es ]: warnings.warn("Dataframe '%s' not in entityset" % (invalid_dataframe)) return True def generate_all_primitive_options( all_primitives, primitive_options, ignore_dataframes, ignore_columns, es, ): dataframe_dict = { dataframe.ww.name: [col for col in dataframe.columns] for dataframe in es.dataframes } primitive_options = _init_primitive_options(primitive_options, dataframe_dict) global_ignore_dataframes = ignore_dataframes global_ignore_columns = ignore_columns.copy() # for now, only use primitive names as option keys for primitive in all_primitives: if primitive in primitive_options and primitive.name in primitive_options: msg = ( "Options present for primitive instance and generic " "primitive class (%s), primitive instance will not use generic " "options" % (primitive.name) ) warnings.warn(msg) if primitive in primitive_options or primitive.name in primitive_options: options = primitive_options.get( primitive, primitive_options.get(primitive.name), ) # Reconcile global options with individually-specified options included_dataframes = set().union( *[ option.get("include_dataframes", set()).union( option.get("include_columns", {}).keys(), ) for option in options ] ) global_ignore_dataframes = global_ignore_dataframes.difference( included_dataframes, ) for option in options: # don't globally ignore a column if it's included for a primitive if "include_columns" in option: for dataframe, include_cols in option["include_columns"].items(): global_ignore_columns[dataframe] = global_ignore_columns[ dataframe ].difference(include_cols) option["ignore_dataframes"] = option["ignore_dataframes"].union( ignore_dataframes.difference(included_dataframes), ) for dataframe, ignore_cols in ignore_columns.items(): # if already ignoring columns for this dataframe, add globals for option in options: if dataframe in option["ignore_columns"]: 
option["ignore_columns"][dataframe] = option["ignore_columns"][ dataframe ].union(ignore_cols) # if no ignore_columns and dataframe is explicitly included, don't ignore the column elif dataframe in included_dataframes: continue # Otherwise, keep the global option else: option["ignore_columns"][dataframe] = ignore_cols else: # no user specified options, just use global defaults primitive_options[primitive] = [ { "ignore_dataframes": ignore_dataframes, "ignore_columns": ignore_columns, }, ] return primitive_options, global_ignore_dataframes, global_ignore_columns def _init_primitive_options(primitive_options, es): # Flatten all tuple keys, convert value lists into sets, check for # conflicting keys flattened_options = {} for primitive_keys, options in primitive_options.items(): if not isinstance(primitive_keys, tuple): primitive_keys = (primitive_keys,) if isinstance(options, list): for primitive_key in primitive_keys: if isinstance(primitive_key, str): primitive = primitives.get_aggregation_primitives().get( primitive_key, ) or primitives.get_transform_primitives().get(primitive_key) if not primitive: msg = "Unknown primitive with name '{}'".format(primitive_key) raise ValueError(msg) else: primitive = primitive_key assert ( len(primitive.input_types[0]) == len(options) if isinstance(primitive.input_types[0], list) else len(primitive.input_types) == len(options) ), ( "Number of options does not match number of inputs for primitive %s" % (primitive_key) ) options = [ _init_option_dict(primitive_keys, option, es) for option in options ] else: options = [_init_option_dict(primitive_keys, options, es)] for primitive in primitive_keys: if isinstance(primitive, type): primitive = primitive.name # if primitive is specified more than once, raise error if primitive in flattened_options: raise KeyError("Multiple options found for primitive %s" % (primitive)) flattened_options[primitive] = options return flattened_options def _init_option_dict(key, option_dict, es): 
initialized_option_dict = {} primitive_options = _get_primitive_options() # verify all keys are valid and match expected type, convert lists to sets for option_key, option in option_dict.items(): if option_key not in primitive_options: raise KeyError( "Unrecognized primitive option '%s' for %s" % (option_key, ",".join(key)), ) if not primitive_options[option_key](option, es): raise TypeError( "Incorrect type formatting for '%s' for %s" % (option_key, ",".join(key)), ) if isinstance(option, list): initialized_option_dict[option_key] = set(option) elif isinstance(option, dict): initialized_option_dict[option_key] = { key: set(option[key]) for key in option } # initialize ignore_dataframes and ignore_columns to empty sets if not present if "ignore_columns" not in initialized_option_dict: initialized_option_dict["ignore_columns"] = dict() if "ignore_dataframes" not in initialized_option_dict: initialized_option_dict["ignore_dataframes"] = set() return initialized_option_dict def column_filter(f, options, groupby=False): if groupby and not f.column_schema.semantic_tags.intersection( {"category", "foreign_key"}, ): return False include_cols = "include_groupby_columns" if groupby else "include_columns" ignore_cols = "ignore_groupby_columns" if groupby else "ignore_columns" include_dataframes = ( "include_groupby_dataframes" if groupby else "include_dataframes" ) ignore_dataframes = "ignore_groupby_dataframes" if groupby else "ignore_dataframes" dependencies = f.get_dependencies(deep=True) + [f] for base_f in dependencies: if isinstance(base_f, IdentityFeature): if ( include_cols in options and base_f.dataframe_name in options[include_cols] ): if base_f.get_name() in options[include_cols][base_f.dataframe_name]: continue # this is a valid feature, go to next else: return False # this is not an included feature if ignore_cols in options and base_f.dataframe_name in options[ignore_cols]: if base_f.get_name() in options[ignore_cols][base_f.dataframe_name]: return False # 
ignore this feature if include_dataframes in options: return base_f.dataframe_name in options[include_dataframes] elif ( ignore_dataframes in options and base_f.dataframe_name in options[ignore_dataframes] ): return False # ignore the dataframe return True def ignore_dataframe_for_primitive(options, dataframe, groupby=False): # This logic handles whether given options ignore a dataframe or not def should_ignore_dataframe(option): if groupby: if ( "include_groupby_columns" not in option or dataframe.ww.name not in option["include_groupby_columns"] ): if ( "include_groupby_dataframes" in option and dataframe.ww.name not in option["include_groupby_dataframes"] ): return True elif ( "ignore_groupby_dataframes" in option and dataframe.ww.name in option["ignore_groupby_dataframes"] ): return True if ( "include_columns" in option and dataframe.ww.name in option["include_columns"] ): return False elif "include_dataframes" in option: return dataframe.ww.name not in option["include_dataframes"] elif dataframe.ww.name in option["ignore_dataframes"]: return True else: return False return any([should_ignore_dataframe(option) for option in options]) def filter_groupby_matches_by_options(groupby_matches, options): return filter_matches_by_options( [(groupby_match,) for groupby_match in groupby_matches], options, groupby=True, ) def filter_matches_by_options(matches, options, groupby=False, commutative=False): # If more than one option, then each option must be handled for its corresponding input if len(options) > 1: def is_valid_match(match): if all( [ column_filter(m, option, groupby) for m, option in zip(match, options) ], ): return True else: return False else: def is_valid_match(match): if all([column_filter(f, options[0], groupby) for f in match]): return True else: return False valid_matches = set() for match in matches: if is_valid_match(match): valid_matches.add(match) elif commutative: for order in permutations(match): if is_valid_match(order): valid_matches.add(order) break return sorted(
valid_matches, key=lambda features: ([feature.unique_name() for feature in features]), ) ================================================ FILE: featuretools/primitives/standard/__init__.py ================================================ # flake8: noqa from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.aggregation import * from featuretools.primitives.standard.transform import * ================================================ FILE: featuretools/primitives/standard/aggregation/__init__.py ================================================ from featuretools.primitives.standard.aggregation.all_primitive import All from featuretools.primitives.standard.aggregation.any_primitive import Any from featuretools.primitives.standard.aggregation.avg_time_between import AvgTimeBetween from featuretools.primitives.standard.aggregation.average_count_per_unique import ( AverageCountPerUnique, ) from featuretools.primitives.standard.aggregation.count import Count from featuretools.primitives.standard.aggregation.count_above_mean import CountAboveMean from featuretools.primitives.standard.aggregation.count_below_mean import CountBelowMean from featuretools.primitives.standard.aggregation.count_greater_than import ( CountGreaterThan, ) from featuretools.primitives.standard.aggregation.count_inside_nth_std import ( CountInsideNthSTD, ) from featuretools.primitives.standard.aggregation.count_inside_range import ( CountInsideRange, ) from featuretools.primitives.standard.aggregation.count_less_than import CountLessThan from featuretools.primitives.standard.aggregation.count_outside_nth_std import ( CountOutsideNthSTD, ) from featuretools.primitives.standard.aggregation.count_outside_range import ( CountOutsideRange, ) from featuretools.primitives.standard.aggregation.date_first_event import DateFirstEvent from 
featuretools.primitives.standard.aggregation.entropy import Entropy from featuretools.primitives.standard.aggregation.first import First from featuretools.primitives.standard.aggregation.first_last_time_delta import ( FirstLastTimeDelta, ) from featuretools.primitives.standard.aggregation.kurtosis import Kurtosis from featuretools.primitives.standard.aggregation.is_unique import IsUnique from featuretools.primitives.standard.aggregation.last import Last from featuretools.primitives.standard.aggregation.max_primitive import Max from featuretools.primitives.standard.aggregation.max_consecutive_false import ( MaxConsecutiveFalse, ) from featuretools.primitives.standard.aggregation.max_consecutive_negatives import ( MaxConsecutiveNegatives, ) from featuretools.primitives.standard.aggregation.max_consecutive_positives import ( MaxConsecutivePositives, ) from featuretools.primitives.standard.aggregation.max_consecutive_true import ( MaxConsecutiveTrue, ) from featuretools.primitives.standard.aggregation.max_consecutive_zeros import ( MaxConsecutiveZeros, ) from featuretools.primitives.standard.aggregation.mean import Mean from featuretools.primitives.standard.aggregation.median import Median from featuretools.primitives.standard.aggregation.max_count import MaxCount from featuretools.primitives.standard.aggregation.median_count import MedianCount from featuretools.primitives.standard.aggregation.max_min_delta import MaxMinDelta from featuretools.primitives.standard.aggregation.min_count import MinCount from featuretools.primitives.standard.aggregation.min_primitive import Min from featuretools.primitives.standard.aggregation.mode import Mode from featuretools.primitives.standard.aggregation.n_unique_days import NUniqueDays from featuretools.primitives.standard.aggregation.n_unique_days_of_calendar_year import ( NUniqueDaysOfCalendarYear, ) from featuretools.primitives.standard.aggregation.n_unique_days_of_month import ( NUniqueDaysOfMonth, ) from 
featuretools.primitives.standard.aggregation.has_no_duplicates import ( HasNoDuplicates, ) from featuretools.primitives.standard.aggregation.is_monotonically_decreasing import ( IsMonotonicallyDecreasing, ) from featuretools.primitives.standard.aggregation.is_monotonically_increasing import ( IsMonotonicallyIncreasing, ) from featuretools.primitives.standard.aggregation.n_unique_months import NUniqueMonths from featuretools.primitives.standard.aggregation.n_unique_weeks import NUniqueWeeks from featuretools.primitives.standard.aggregation.n_most_common import NMostCommon from featuretools.primitives.standard.aggregation.n_most_common_frequency import ( NMostCommonFrequency, ) from featuretools.primitives.standard.aggregation.num_true import NumTrue from featuretools.primitives.standard.aggregation.num_peaks import NumPeaks from featuretools.primitives.standard.aggregation.num_zero_crossings import ( NumZeroCrossings, ) from featuretools.primitives.standard.aggregation.num_true_since_last_false import ( NumTrueSinceLastFalse, ) from featuretools.primitives.standard.aggregation.num_false_since_last_true import ( NumFalseSinceLastTrue, ) from featuretools.primitives.standard.aggregation.num_consecutive_greater_mean import ( NumConsecutiveGreaterMean, ) from featuretools.primitives.standard.aggregation.num_consecutive_less_mean import ( NumConsecutiveLessMean, ) from featuretools.primitives.standard.aggregation.num_unique import NumUnique from featuretools.primitives.standard.aggregation.percent_unique import PercentUnique from featuretools.primitives.standard.aggregation.percent_true import PercentTrue from featuretools.primitives.standard.aggregation.skew import Skew from featuretools.primitives.standard.aggregation.std import Std from featuretools.primitives.standard.aggregation.sum_primitive import Sum from featuretools.primitives.standard.aggregation.time_since_first import TimeSinceFirst from featuretools.primitives.standard.aggregation.time_since_last import 
TimeSinceLast from featuretools.primitives.standard.aggregation.time_since_last_true import ( TimeSinceLastTrue, ) from featuretools.primitives.standard.aggregation.time_since_last_min import ( TimeSinceLastMin, ) from featuretools.primitives.standard.aggregation.time_since_last_max import ( TimeSinceLastMax, ) from featuretools.primitives.standard.aggregation.time_since_last_false import ( TimeSinceLastFalse, ) from featuretools.primitives.standard.aggregation.trend import Trend from featuretools.primitives.standard.aggregation.variance import Variance ================================================ FILE: featuretools/primitives/standard/aggregation/all_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class All(AggregationPrimitive): """Calculates if all values are 'True' in a list. Description: Given a list of booleans, return `True` if all of the values are `True`. Examples: >>> all = All() >>> all([False, False, False, True]) False """ name = "all" input_types = [ [ColumnSchema(logical_type=Boolean)], [ColumnSchema(logical_type=BooleanNullable)], ] return_type = ColumnSchema(logical_type=Boolean) stack_on_self = False description_template = "whether all of {} are true" def get_function(self): return np.all ================================================ FILE: featuretools/primitives/standard/aggregation/any_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Any(AggregationPrimitive): """Determines if any value is 'True' in a list. 
Description: Given a list of booleans, return `True` if one or more of the values are `True`. Examples: >>> any = Any() >>> any([False, False, False, True]) True """ name = "any" input_types = [ [ColumnSchema(logical_type=Boolean)], [ColumnSchema(logical_type=BooleanNullable)], ] return_type = ColumnSchema(logical_type=Boolean) stack_on_self = False description_template = "whether any of {} are true" def get_function(self): return np.any ================================================ FILE: featuretools/primitives/standard/aggregation/average_count_per_unique.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import AggregationPrimitive class AverageCountPerUnique(AggregationPrimitive): """Determines the average count across all unique values. Args: skipna (bool): Whether to skip NA/null values. Defaults to True. Examples: Determine the average count values for all unique items in the input >>> input = [1, 1, 2, 2, 3, 4, 5, 6, 7, 8] >>> avg_count_per_unique = AverageCountPerUnique() >>> avg_count_per_unique(input) 1.25 Determine the average count values for all unique items in the input with nan values ignored >>> input = [1, 1, 2, 2, 3, 4, 5, None, 6, 7, 8] >>> avg_count_per_unique = AverageCountPerUnique() >>> avg_count_per_unique(input) 1.25 Determine the average count values for all unique items in the input with nan values included >>> input = [1, 2, 2, 3, 4, 5, None, 6, 7, 8, 9] >>> avg_count_per_unique_skipna_false = AverageCountPerUnique(skipna=False) >>> avg_count_per_unique_skipna_false(input) 1.1 """ name = "average_count_per_unique" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def average_count_per_unique(x): return
x.value_counts( dropna=self.skipna, ).mean(skipna=self.skipna) return average_count_per_unique ================================================ FILE: featuretools/primitives/standard/aggregation/avg_time_between.py ================================================ from datetime import datetime import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.utils import convert_time_units class AvgTimeBetween(AggregationPrimitive): """Computes the average number of seconds between consecutive events. Description: Given a list of datetimes, return the average time (default in seconds) elapsed between consecutive events. If there are fewer than 2 non-null values, return `NaN`. Args: unit (str): Defines the unit of time. Defaults to seconds. Acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds Examples: >>> from datetime import datetime >>> avg_time_between = AvgTimeBetween() >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... 
datetime(2010, 1, 1, 11, 57, 30)] >>> avg_time_between(times) 375.0 >>> avg_time_between = AvgTimeBetween(unit="minutes") >>> avg_time_between(times) 6.25 """ name = "avg_time_between" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the average time between each of {}" def __init__(self, unit="seconds"): self.unit = unit.lower() def get_function(self): def pd_avg_time_between(x): """Assumes time scales are closer to order of seconds than to nanoseconds if times are much closer to nanoseconds we could get some floating point errors this can be fixed with another function that calculates the mean before converting to seconds """ x = x.dropna() if x.shape[0] < 2: return np.nan if isinstance(x.iloc[0], (pd.Timestamp, datetime)): x = x.view("int64") # use len(x)-1 because we care about difference # between values, len(x)-1 = len(diff(x)) avg = (x.max() - x.min()) / (len(x) - 1) avg = avg * 1e-9 # long form: # diff_in_ns = x.diff().iloc[1:].astype('int64') # diff_in_seconds = diff_in_ns * 1e-9 # avg = diff_in_seconds.mean() return convert_time_units(avg, self.unit) return pd_avg_time_between ================================================ FILE: featuretools/primitives/standard/aggregation/count.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Count(AggregationPrimitive): """Determines the total number of values, excluding `NaN`. 
Examples: >>> count = Count() >>> count([1, 2, 3, 4, 5, None]) 5 """ name = "count" input_types = [ColumnSchema(semantic_tags={"index"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 description_template = "the number" def get_function(self): return pd.Series.count def generate_name( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): return "COUNT(%s%s%s)" % (relationship_path_name, where_str, use_prev_str) ================================================ FILE: featuretools/primitives/standard/aggregation/count_above_mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountAboveMean(AggregationPrimitive): """Calculates the number of values that are above the mean. Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. Examples: >>> count_above_mean = CountAboveMean() >>> count_above_mean([1, 2, 3, 4, 5]) 2 The way NaNs are treated can be controlled. 
>>> count_above_mean_skipna = CountAboveMean(skipna=False) >>> count_above_mean_skipna([1, 2, 3, 4, 5, None]) nan """ name = "count_above_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def count_above_mean(x): mean = x.mean(skipna=self.skipna) if np.isnan(mean): return np.nan return len(x[x > mean]) return count_above_mean ================================================ FILE: featuretools/primitives/standard/aggregation/count_below_mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountBelowMean(AggregationPrimitive): """Determines the number of values that are below the mean. Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. Examples: >>> count_below_mean = CountBelowMean() >>> count_below_mean([1, 2, 3, 4, 10]) 3 The way NaNs are treated can be controlled. 
>>> count_below_mean_skipna = CountBelowMean(skipna=False) >>> count_below_mean_skipna([1, 2, 3, 4, 5, None]) nan """ name = "count_below_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def count_below_mean(x): mean = x.mean(skipna=self.skipna) if np.isnan(mean): return np.nan return len(x[x < mean]) return count_below_mean ================================================ FILE: featuretools/primitives/standard/aggregation/count_greater_than.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountGreaterThan(AggregationPrimitive): """Determines the number of values greater than a controllable threshold. Args: threshold (float): The threshold to use when counting the number of values greater than. Defaults to 10. 
Examples: >>> count_greater_than = CountGreaterThan(threshold=3) >>> count_greater_than([1, 2, 3, 4, 5]) 2 """ name = "count_greater_than" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, threshold=10): self.threshold = threshold def get_function(self): def count_greater_than(x): return x[x > self.threshold].count() return count_greater_than ================================================ FILE: featuretools/primitives/standard/aggregation/count_inside_nth_std.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountInsideNthSTD(AggregationPrimitive): """Determines the count of observations that lie inside the first N standard deviations (inclusive). Args: n (float): Number of standard deviations. 
Default is 1 Examples: >>> count_inside_nth_std = CountInsideNthSTD(n=1.5) >>> count_inside_nth_std([1, 10, 15, 20, 100]) 4 """ name = "count_inside_nth_std" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, n=1): if n < 0: raise ValueError("n must be a positive number") self.n = n def get_function(self): def count_inside_nth_std(x): cond = np.abs(x - np.mean(x)) <= np.std(x) * self.n return cond.sum() return count_inside_nth_std ================================================ FILE: featuretools/primitives/standard/aggregation/count_inside_range.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountInsideRange(AggregationPrimitive): """Determines the number of values that fall within a certain range. Args: lower (float): Lower boundary of range (inclusive). Default is 0. upper (float): Upper boundary of range (inclusive). Default is 1. skipna (bool): If this is False any value in x is NaN then the result will be NaN. If True, `nan` values are skipped. Default is True. Examples: >>> count_inside_range = CountInsideRange(lower=1.5, upper=3.6) >>> count_inside_range([1, 2, 3, 4, 5]) 2 The way NaNs are treated can be controlled. 
>>> count_inside_range_skipna = CountInsideRange(skipna=False) >>> count_inside_range_skipna([1, 2, 3, 4, 5, None]) nan """ name = "count_inside_range" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, lower=0, upper=1, skipna=True): self.lower = lower self.upper = upper self.skipna = skipna def get_function(self): def count_inside_range(x): if not self.skipna and x.isnull().values.any(): return np.nan cond = (self.lower <= x) & (x <= self.upper) return cond.sum() return count_inside_range ================================================ FILE: featuretools/primitives/standard/aggregation/count_less_than.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountLessThan(AggregationPrimitive): """Determines the number of values less than a controllable threshold. Args: threshold (float): The threshold to use when counting the number of values less than. Defaults to 10. 
Examples: >>> count_less_than = CountLessThan(threshold=3.5) >>> count_less_than([1, 2, 3, 4, 5]) 3 """ name = "count_less_than" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, threshold=10): self.threshold = threshold def get_function(self): def count_less_than(x): return x[x < self.threshold].count() return count_less_than ================================================ FILE: featuretools/primitives/standard/aggregation/count_outside_nth_std.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountOutsideNthSTD(AggregationPrimitive): """Determines the number of observations that lie outside the first N standard deviations. Args: n (float): Number of standard deviations. 
Default is 1 Examples: >>> count_outside_nth_std = CountOutsideNthSTD(n=1.5) >>> count_outside_nth_std([1, 10, 15, 20, 100]) 1 """ name = "count_outside_nth_std" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, n=1): if n < 0: raise ValueError("n must be a positive number") self.n = n def get_function(self): def count_outside_nth_std(x): cond = np.abs(x - np.mean(x)) > np.std(x) * self.n return cond.sum() return count_outside_nth_std ================================================ FILE: featuretools/primitives/standard/aggregation/count_outside_range.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class CountOutsideRange(AggregationPrimitive): """Determines the number of values that fall outside a certain range. Args: lower (float): Lower boundary of range (exclusive). Default is 0. upper (float): Upper boundary of range (exclusive). Default is 1. skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. Examples: >>> count_outside_range = CountOutsideRange(lower=1.5, upper=3.6) >>> count_outside_range([1, 2, 3, 4, 5]) 3 The way NaNs are treated can be controlled. 
>>> count_outside_range_skipna = CountOutsideRange(skipna=False) >>> count_outside_range_skipna([1, 2, 3, 4, 5, None]) nan """ name = "count_outside_range" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, lower=0, upper=1, skipna=True): self.lower = lower self.upper = upper self.skipna = skipna def get_function(self): def count_outside_range(x): if not self.skipna and x.isnull().values.any(): return np.nan cond = (x < self.lower) | (x > self.upper) return cond.sum() return count_outside_range ================================================ FILE: featuretools/primitives/standard/aggregation/date_first_event.py ================================================ from pandas import NaT from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools.primitives.base import AggregationPrimitive class DateFirstEvent(AggregationPrimitive): """Determines the first datetime from a list of datetimes. Examples: >>> from datetime import datetime >>> date_first_event = DateFirstEvent() >>> date_first_event([ ... datetime(2011, 4, 9, 10, 30, 10), ... datetime(2011, 4, 9, 10, 30, 20), ... 
datetime(2011, 4, 9, 10, 30, 30)]) Timestamp('2011-04-09 10:30:10') """ name = "date_first_event" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Datetime) stack_on_self = False default_value = 0 def get_function(self): def date_first_event(x): x = x.dropna() if x.empty: return NaT return x.iat[0] return date_first_event ================================================ FILE: featuretools/primitives/standard/aggregation/entropy.py ================================================ from scipy import stats from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Entropy(AggregationPrimitive): """Calculates the entropy for a categorical column. Description: Given a list of observations from a categorical column, return the entropy of the distribution. NaN values can be treated as a category or dropped. Args: dropna (bool): Whether to drop NaN values. If False, NaN values are treated as a separate category. Defaults to False.
base (float): The logarithmic base to use Defaults to e (natural logarithm) Examples: >>> pd_entropy = Entropy() >>> pd_entropy([1, 2, 3, 4]) 1.3862943611198906 """ name = "entropy" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False description_template = "the entropy of {}" def __init__(self, dropna=False, base=None): self.dropna = dropna self.base = base def get_function(self): def pd_entropy(s): distribution = s.value_counts(normalize=True, dropna=self.dropna) if distribution.dtype == "Float64": distribution = distribution.astype("float64") return stats.entropy(distribution.to_numpy(), base=self.base) return pd_entropy ================================================ FILE: featuretools/primitives/standard/aggregation/first.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class First(AggregationPrimitive): """Determines the first value in a list. Examples: >>> first = First() >>> first([1, 2, 3, 4, 5, None]) 1.0 """ name = "first" input_types = [ColumnSchema()] return_type = None stack_on_self = False description_template = "the first instance of {}" def get_function(self): def pd_first(x): return x.iloc[0] return pd_first ================================================ FILE: featuretools/primitives/standard/aggregation/first_last_time_delta.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base import AggregationPrimitive class FirstLastTimeDelta(AggregationPrimitive): """Determines the time between the first and last time value in seconds. Examples: >>> from datetime import datetime >>> first_last_time_delta = FirstLastTimeDelta() >>> first_last_time_delta([ ... 
datetime(2011, 4, 9, 10, 30, 0), ... datetime(2011, 4, 9, 10, 30, 15), ... datetime(2011, 4, 9, 10, 30, 35)]) 35.0 """ name = "first_last_time_delta" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = False stack_on_self = False default_value = 0 def get_function(self): def first_last_time_delta(datetime_col): datetime_col = datetime_col.dropna() if datetime_col.empty: return np.nan delta = datetime_col.iloc[-1] - datetime_col.iloc[0] return delta.total_seconds() return first_last_time_delta ================================================ FILE: featuretools/primitives/standard/aggregation/has_no_duplicates.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base import AggregationPrimitive class HasNoDuplicates(AggregationPrimitive): """Determines if the input contains no duplicates. Args: skipna (bool): Whether to skip NA/null values. Defaults to True. Examples: >>> has_no_duplicates = HasNoDuplicates() >>> has_no_duplicates([1, 1, 2]) False >>> has_no_duplicates([1, 2, 3]) True NaNs are skipped by default. >>> has_no_duplicates([1, 2, 3, None, None]) True However, the way NaNs are treated can be controlled.
>>> has_no_duplicates_skipna = HasNoDuplicates(skipna=False) >>> has_no_duplicates_skipna([1, 2, 3, None, None]) False >>> has_no_duplicates_skipna([1, 2, 3, None]) True """ name = "has_no_duplicates" input_types = [ [ColumnSchema(semantic_tags={"category"})], [ColumnSchema(semantic_tags={"numeric"})], ] return_type = ColumnSchema(logical_type=BooleanNullable) stack_on_self = False default_value = True def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def has_no_duplicates(data): if self.skipna: data = data.dropna() return not data.duplicated().any() return has_no_duplicates ================================================ FILE: featuretools/primitives/standard/aggregation/is_monotonically_decreasing.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base import AggregationPrimitive class IsMonotonicallyDecreasing(AggregationPrimitive): """Determines if a series is monotonically decreasing. Description: Given a list of numeric values, return True if the values are monotonically decreasing, i.e. each value is less than or equal to the one before it. If the series contains `NaN` values, they will be skipped.
Examples: >>> is_monotonically_decreasing = IsMonotonicallyDecreasing() >>> is_monotonically_decreasing([9, 5, 3, 1]) True """ name = "is_monotonically_decreasing" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) stack_on_self = False default_value = False def get_function(self): def is_monotonically_decreasing(x): return x.dropna().is_monotonic_decreasing return is_monotonically_decreasing ================================================ FILE: featuretools/primitives/standard/aggregation/is_monotonically_increasing.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base import AggregationPrimitive class IsMonotonicallyIncreasing(AggregationPrimitive): """Determines if a series is monotonically increasing. Description: Given a list of numeric values, return True if the values are monotonically increasing, i.e. each value is greater than or equal to the one before it. If the series contains `NaN` values, they will be skipped. Examples: >>> is_monotonically_increasing = IsMonotonicallyIncreasing() >>> is_monotonically_increasing([1, 3, 5, 9]) True """ name = "is_monotonically_increasing" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) stack_on_self = False default_value = False def get_function(self): def is_monotonically_increasing(x): return x.dropna().is_monotonic_increasing return is_monotonically_increasing ================================================ FILE: featuretools/primitives/standard/aggregation/is_unique.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base import AggregationPrimitive class IsUnique(AggregationPrimitive): """Determines whether or not a series of discrete values is all unique.
Description: Given a series of discrete values, return True if each value in the series is unique. If any value is repeated, return False. Examples: >>> is_unique = IsUnique() >>> is_unique(['red', 'blue', 'green', 'yellow']) True If the series is not unique, return False >>> is_unique = IsUnique() >>> is_unique(['red', 'blue', 'green', 'blue']) False """ name = "is_unique" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(logical_type=BooleanNullable) stack_on_self = False default_value = False def get_function(self): def is_unique(x): return x.is_unique return is_unique ================================================ FILE: featuretools/primitives/standard/aggregation/kurtosis.py ================================================ from scipy.stats import kurtosis from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, Integer from featuretools.primitives.base import AggregationPrimitive class Kurtosis(AggregationPrimitive): """Calculates the kurtosis for a list of numbers Args: fisher (bool): Optional. If True, Fisher's definition is used (normal ==> 0.0). If False, Pearson's definition is used (normal ==> 3.0). Default is True. bias (bool): Optional. If False, then the calculations are corrected for statistical bias. Default is True. nan_policy (str): Optional. Defines how to handle input that contains NaN. Possible values include `['propagate', 'raise', 'omit']`. 'propagate' returns NaN, 'raise' throws an error, 'omit' performs the calculations ignoring NaN values. Default is 'propagate'.
Examples: >>> kurtosis = Kurtosis() >>> kurtosis([1, 2, 3, 4, 5]) -1.3 You can use Pearson's definition by setting the 'fisher' argument to False >>> kurtosis_fisher = Kurtosis(fisher=False) >>> kurtosis_fisher([1, 2, 3, 4, 5]) 1.7 You can correct for statistical bias by setting the 'bias' argument to False >>> kurtosis_bias = Kurtosis(bias=False) >>> kurtosis_bias([1, 2, 3, 4, 5]) -1.2000000000000004 You can specify how to handle NaN values in the input with the 'nan_policy' argument >>> kurtosis_nan_policy = Kurtosis(nan_policy='omit') >>> kurtosis_nan_policy([1, 2, None, 3, 4, 5]) -1.3 """ name = "kurtosis" input_types = [ [ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})], [ColumnSchema(logical_type=Double, semantic_tags={"numeric"})], ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, fisher=True, bias=True, nan_policy="propagate"): if nan_policy not in ["propagate", "raise", "omit"]: raise ValueError("Invalid nan_policy") self.fisher = fisher self.bias = bias self.nan_policy = nan_policy def get_function(self): def kurtosis_func(x): return kurtosis( x, axis=0, fisher=self.fisher, bias=self.bias, nan_policy=self.nan_policy, ) return kurtosis_func ================================================ FILE: featuretools/primitives/standard/aggregation/last.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Last(AggregationPrimitive): """Determines the last value in a list.
Examples: >>> last = Last() >>> last([1, 2, 3, 4, 5, None]) nan """ name = "last" input_types = [ColumnSchema()] return_type = None stack_on_self = False description_template = "the last instance of {}" def get_function(self): def pd_last(x): return x.iloc[-1] return pd_last ================================================ FILE: featuretools/primitives/standard/aggregation/max_consecutive_false.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, Integer from featuretools.primitives.base import AggregationPrimitive class MaxConsecutiveFalse(AggregationPrimitive): """Determines the maximum number of consecutive False values in the input Examples: >>> max_consecutive_false = MaxConsecutiveFalse() >>> max_consecutive_false([True, False, False, True, True, False]) 2 """ name = "max_consecutive_false" input_types = [ColumnSchema(logical_type=Boolean)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def max_consecutive_false(x): # invert the input array to work properly with the computation x[x.notnull()] = ~(x[x.notnull()].astype(bool)) # find the locations where the value changes from the previous value not_equal = x != x.shift() # Use cumulative sum to determine where consecutive values occur. When the # sum changes, consecutive False values are present, when the cumulative # sum remains unchanged, consecutive True values are present. not_equal_sum = not_equal.cumsum() # group the input by the cumulative sum values and use cumulative count # to count the number of consecutive values.
Add 1 to account for the cumulative # sum starting at zero where the first True occurs consecutive = x.groupby(not_equal_sum).cumcount() + 1 # multiply by the inverted input to keep only the counts that correspond to # false values consecutive_false = consecutive * x # return the max of all the consecutive false values return consecutive_false.max() return max_consecutive_false ================================================ FILE: featuretools/primitives/standard/aggregation/max_consecutive_negatives.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, Integer from featuretools.primitives.base import AggregationPrimitive class MaxConsecutiveNegatives(AggregationPrimitive): """Determines the maximum number of consecutive negative values in the input Args: skipna (bool): Ignore any `NaN` values in the input. Default is True. Examples: >>> max_consecutive_negatives = MaxConsecutiveNegatives() >>> max_consecutive_negatives([1.0, -1.4, -2.4, -5.4, 2.9, -4.3]) 3 `NaN` values can be ignored with the `skipna` parameter >>> max_consecutive_negatives_skipna = MaxConsecutiveNegatives(skipna=False) >>> max_consecutive_negatives_skipna([1.0, 1.4, -2.4, None, -2.9, -4.3]) 2 """ name = "max_consecutive_negatives" input_types = [ [ColumnSchema(logical_type=Integer)], [ColumnSchema(logical_type=Double)], ] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def max_consecutive_negatives(x): if self.skipna: x = x.dropna() # convert the numeric values to booleans for processing x[x.notnull()] = x[x.notnull()].lt(0) # find the locations where the value changes from the previous value not_equal = x != x.shift() # Use cumulative sum to determine where consecutive values occur. 
When the # sum changes, consecutive non-negative values are present, when the cumulative # sum remains unchanged, consecutive negative values are present. not_equal_sum = not_equal.cumsum() # group the input by the cumulative sum values and use cumulative count # to count the number of consecutive values. Add 1 to account for the cumulative # sum starting at zero where the first negative occurs consecutive = x.groupby(not_equal_sum).cumcount() + 1 # multiply by the boolean input to keep only the counts that correspond to # negative values consecutive_neg = consecutive * x # return the max of all the consecutive negative values return consecutive_neg.max() return max_consecutive_negatives ================================================ FILE: featuretools/primitives/standard/aggregation/max_consecutive_positives.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, Integer from featuretools.primitives.base import AggregationPrimitive class MaxConsecutivePositives(AggregationPrimitive): """Determines the maximum number of consecutive positive values in the input Args: skipna (bool): Ignore any `NaN` values in the input. Default is True.
Examples: >>> max_consecutive_positives = MaxConsecutivePositives() >>> max_consecutive_positives([1.0, -1.4, 2.4, 5.4, 2.9, -4.3]) 3 `NaN` values can be ignored with the `skipna` parameter >>> max_consecutive_positives_skipna = MaxConsecutivePositives(skipna=False) >>> max_consecutive_positives_skipna([1.0, -1.4, 2.4, None, 2.9, 4.3]) 2 """ name = "max_consecutive_positives" input_types = [ [ColumnSchema(logical_type=Integer)], [ColumnSchema(logical_type=Double)], ] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def max_consecutive_positives(x): if self.skipna: x = x.dropna() # convert the numeric values to booleans for processing x[x.notnull()] = x[x.notnull()].gt(0) # find the locations where the value changes from the previous value not_equal = x != x.shift() # Use cumulative sum to determine where consecutive values occur. When the # sum changes, consecutive non-positive values are present, when the cumulative # sum remains unchanged, consecutive positive values are present. not_equal_sum = not_equal.cumsum() # group the input by the cumulative sum values and use cumulative count # to count the number of consecutive values.
Add 1 to account for the cumulative # sum starting at zero where the first positive occurs consecutive = x.groupby(not_equal_sum).cumcount() + 1 # multiply by the boolean input to keep only the counts that correspond to # positive values consecutive_pos = consecutive * x # return the max of all the consecutive positive values return consecutive_pos.max() return max_consecutive_positives ================================================ FILE: featuretools/primitives/standard/aggregation/max_consecutive_true.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, Integer from featuretools.primitives.base import AggregationPrimitive class MaxConsecutiveTrue(AggregationPrimitive): """Determines the maximum number of consecutive True values in the input Examples: >>> max_consecutive_true = MaxConsecutiveTrue() >>> max_consecutive_true([True, False, True, True, True, False]) 3 """ name = "max_consecutive_true" input_types = [ColumnSchema(logical_type=Boolean)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def max_consecutive_true(x): # find the locations where the value changes from the previous value not_equal = x != x.shift() # use cumulative sum to determine where consecutive values occur. When the # sum changes, consecutive False values are present, when the cumulative # sum remains unchanged, consecutive True values are present. not_equal_sum = not_equal.cumsum() # group the input by the cumulative sum values and use cumulative count # to count the number of consecutive values.
Add 1 to account for the cumulative # sum starting at zero where the first True occurs consecutive = x.groupby(not_equal_sum).cumcount() + 1 # multiply by the original input to keep only the counts that correspond to # true values consecutive_true = consecutive * x # return the max of all the consecutive true values return consecutive_true.max() return max_consecutive_true ================================================ FILE: featuretools/primitives/standard/aggregation/max_consecutive_zeros.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, Integer from featuretools.primitives.base import AggregationPrimitive class MaxConsecutiveZeros(AggregationPrimitive): """Determines the maximum number of consecutive zero values in the input Args: skipna (bool): Ignore any `NaN` values in the input. Default is True. Examples: >>> max_consecutive_zeros = MaxConsecutiveZeros() >>> max_consecutive_zeros([1.0, -1.4, 0, 0.0, 0, -4.3]) 3 `NaN` values can be ignored with the `skipna` parameter >>> max_consecutive_zeros_skipna = MaxConsecutiveZeros(skipna=False) >>> max_consecutive_zeros_skipna([1.0, -1.4, 0, None, 0.0, -4.3]) 1 """ name = "max_consecutive_zeros" input_types = [ [ColumnSchema(logical_type=Integer)], [ColumnSchema(logical_type=Double)], ] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def max_consecutive_zeros(x): if self.skipna: x = x.dropna() # convert the numeric values to booleans for processing x[x.notnull()] = x[x.notnull()].eq(0) # find the locations where the value changes from the previous value not_equal = x != x.shift() # Use cumulative sum to determine where consecutive values occur. 
When the # sum changes, consecutive non-zero values are present, when the cumulative # sum remains unchanged, consecutive zero values are present. not_equal_sum = not_equal.cumsum() # group the input by the cumulative sum values and use cumulative count # to count the number of consecutive values. Add 1 to account for the cumulative # sum starting at zero where the first zero occurs consecutive = x.groupby(not_equal_sum).cumcount() + 1 # multiply by the boolean input to keep only the counts that correspond to # zero values consecutive_zero = consecutive * x # return the max of all the consecutive zero values return consecutive_zero.max() return max_consecutive_zeros ================================================ FILE: featuretools/primitives/standard/aggregation/max_count.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import AggregationPrimitive class MaxCount(AggregationPrimitive): """Calculates the number of occurrences of the max value in a list Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. If skipna is False, and there are NaN values in the array, the max will be NaN regardless of the other values, and NaN will be returned.
Examples: >>> max_count = MaxCount() >>> max_count([1, 2, 5, 1, 5, 3, 5]) 3 You can optionally specify how to handle NaN values >>> max_count_skipna = MaxCount(skipna=False) >>> max_count_skipna([1, 2, 5, 1, 5, 3, None]) nan """ name = "max_count" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def max_count(x): xmax = x.max(skipna=self.skipna) if np.isnan(xmax): return np.nan return x.eq(xmax).sum() return max_count ================================================ FILE: featuretools/primitives/standard/aggregation/max_min_delta.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import AggregationPrimitive class MaxMinDelta(AggregationPrimitive): """Determines the difference between the max and min value. Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. 
Examples: >>> max_min_delta = MaxMinDelta() >>> max_min_delta([7, 2, 5, 3, 10]) 8 You can optionally specify how to handle NaN values >>> max_min_delta_skipna = MaxMinDelta(skipna=False) >>> max_min_delta_skipna([7, 2, None, 3, 10]) nan """ name = "max_min_delta" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def max_min_delta(x): max_val = x.max(skipna=self.skipna) min_val = x.min(skipna=self.skipna) return max_val - min_val return max_min_delta ================================================ FILE: featuretools/primitives/standard/aggregation/max_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Max(AggregationPrimitive): """Calculates the highest value, ignoring `NaN` values. Examples: >>> max = Max() >>> max([1, 2, 3, 4, 5, None]) 5.0 """ name = "max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False description_template = "the maximum of {}" def get_function(self): return np.max ================================================ FILE: featuretools/primitives/standard/aggregation/mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Mean(AggregationPrimitive): """Computes the average for a list of values. Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. Examples: >>> mean = Mean() >>> mean([1, 2, 3, 4, 5, None]) 3.0 We can also control the way `NaN` values are handled. 
>>> mean = Mean(skipna=False) >>> mean([1, 2, 3, 4, 5, None]) nan """ name = "mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the average of {}" def __init__(self, skipna=True): self.skipna = skipna def get_function(self): if self.skipna: # np.mean of series is functionally nanmean return np.mean def mean(series): return np.mean(series.values) return mean ================================================ FILE: featuretools/primitives/standard/aggregation/median.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Median(AggregationPrimitive): """Determines the middlemost number in a list of values. Examples: >>> median = Median() >>> median([5, 3, 2, 1, 4]) 3.0 `NaN` values are ignored. >>> median([5, 3, 2, 1, 4, None]) 3.0 """ name = "median" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the median of {}" def get_function(self): return pd.Series.median ================================================ FILE: featuretools/primitives/standard/aggregation/median_count.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base import AggregationPrimitive class MedianCount(AggregationPrimitive): """Calculates the number of occurrences of the median value in a list Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. If skipna is False, and there are NaN values in the array, the median will be NaN, regardless of the other values. 
Examples: >>> median_count = MedianCount() >>> median_count([1, 2, 3, 1, 5, 3, 5]) 2 You can optionally specify how to handle NaN values >>> median_count_skipna = MedianCount(skipna=False) >>> median_count_skipna([1, 2, 3, 1, 5, 3, None]) nan """ name = "median_count" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def median_count(x): median = x.median(skipna=self.skipna) if np.isnan(median): return np.nan return x.eq(median).sum() return median_count ================================================ FILE: featuretools/primitives/standard/aggregation/min_count.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base import AggregationPrimitive class MinCount(AggregationPrimitive): """Calculates the number of occurrences of the min value in a list Args: skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. If skipna is False, and there are NaN values in the array, the min will be NaN regardless of the other values, and NaN will be returned. 
Examples: >>> min_count = MinCount() >>> min_count([1, 2, 5, 1, 5, 3, 5]) 2 You can optionally specify how to handle NaN values >>> min_count_skipna = MinCount(skipna=False) >>> min_count_skipna([1, 2, 5, 1, 5, 3, None]) nan """ name = "min_count" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def min_count(x): xmin = x.min(skipna=self.skipna) if np.isnan(xmin): return np.nan return x.eq(xmin).sum() return min_count ================================================ FILE: featuretools/primitives/standard/aggregation/min_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Min(AggregationPrimitive): """Calculates the smallest value, ignoring `NaN` values. Examples: >>> min = Min() >>> min([1, 2, 3, 4, 5, None]) 1.0 """ name = "min" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False description_template = "the minimum of {}" def get_function(self): return np.min ================================================ FILE: featuretools/primitives/standard/aggregation/mode.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Mode(AggregationPrimitive): """Determines the most commonly repeated value. Description: Given a list of values, return the value with the highest number of occurrences. If list is empty, return `NaN`.
Examples: >>> mode = Mode() >>> mode(['red', 'blue', 'green', 'blue']) 'blue' """ name = "mode" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = None description_template = "the most frequently occurring value of {}" def get_function(self): def pd_mode(s): return s.mode().get(0, np.nan) return pd_mode ================================================ FILE: featuretools/primitives/standard/aggregation/n_most_common.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class NMostCommon(AggregationPrimitive): """Determines the `n` most common elements. Description: Given a list of values, return the `n` values which appear the most frequently. If there are fewer than `n` unique values, the output will be filled with `NaN`. Args: n (int): defines "n" in "n most common." Defaults to 3. Examples: >>> n_most_common = NMostCommon(n=2) >>> x = ['orange', 'apple', 'orange', 'apple', 'orange', 'grapefruit'] >>> n_most_common(x).tolist() ['orange', 'apple'] """ name = "n_most_common" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = None def __init__(self, n=3): self.n = n self.number_output_features = n self.description_template = [ "the {} most common values of {{}}".format(n), "the most common value of {}", *["the {nth_slice} most common value of {}"] * (n - 1), ] def get_function(self): def n_most_common(x): # Counts of 0 remain in value_counts output if dtype is category # so we need to remove them counts = x.value_counts() counts = counts[counts > 0] array = np.array(counts.index[: self.n]) if len(array) < self.n: filler = np.full(self.n - len(array), np.nan) array = np.append(array, filler) return array return n_most_common ================================================ FILE: featuretools/primitives/standard/aggregation/n_most_common_frequency.py 
================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical from featuretools.primitives.base import AggregationPrimitive class NMostCommonFrequency(AggregationPrimitive): """Determines the frequency of the n most common items. Args: n (int): defines "n" in "n most common". Defaults to 3. skipna (bool): Determines if to use NA/null values. Defaults to True to skip NA/null. Description: Given a list, find the n most common items, and return a series showing the frequency of each item. If the list has fewer than n unique values, the resulting series will be padded with nan. Examples: >>> n_most_common_frequency = NMostCommonFrequency() >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list() [3, 2, 2] We can increase n to include more items. >>> n_most_common_frequency = NMostCommonFrequency(4) >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list() [3, 2, 2, 1] NaNs are skipped by default. >>> n_most_common_frequency = NMostCommonFrequency(3) >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list() [3, 2, 2] However, the way NaNs are treated can be controlled.
>>> n_most_common_frequency = NMostCommonFrequency(3, skipna=False) >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list() [3, 3, 2] """ name = "n_most_common_frequency" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def __init__(self, n=3, skipna=True): self.n = n self.number_output_features = n self.skipna = skipna def get_function(self): def n_most_common_frequency(data, n=self.n): frequencies = data.value_counts(dropna=self.skipna) n_most_common = frequencies.iloc[0:n] nan_add = n - frequencies.shape[0] if nan_add > 0: n_most_common = pd.concat( [n_most_common, pd.Series([np.nan] * nan_add)], ) return n_most_common return n_most_common_frequency ================================================ FILE: featuretools/primitives/standard/aggregation/n_unique_days.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools.primitives.base import AggregationPrimitive class NUniqueDays(AggregationPrimitive): """Determines the number of unique days. Description: Given a list of datetimes, return the number of unique days. The same day in two different years is treated as different. So Feb 21, 2017 is different than Feb 21, 2019, even though they are both the 21st of February. Examples: >>> from datetime import datetime >>> n_unique_days = NUniqueDays() >>> times = [datetime(2019, 2, 1), ... datetime(2019, 2, 1), ... datetime(2018, 2, 1), ... 
datetime(2019, 1, 1)] >>> n_unique_days(times) 3 """ name = "n_unique_days" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def n_unique_days(x): return x.dt.floor("D").nunique() return n_unique_days ================================================ FILE: featuretools/primitives/standard/aggregation/n_unique_days_of_calendar_year.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools.primitives.base import AggregationPrimitive class NUniqueDaysOfCalendarYear(AggregationPrimitive): """Determines the number of unique calendar days. Description: Given a list of datetimes, return the number of unique calendar days. The same date in two different years is counted as one. So Feb 21, 2017 is not unique from Feb 21, 2019. Examples: >>> from datetime import datetime >>> n_unique_days_of_calendar_year = NUniqueDaysOfCalendarYear() >>> times = [datetime(2019, 2, 1), ... datetime(2019, 2, 1), ... datetime(2018, 2, 1), ... 
datetime(2019, 1, 1)] >>> n_unique_days_of_calendar_year(times) 2 """ name = "n_unique_days_of_calendar_year" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def n_unique_days_of_calendar_year(x): return x.dropna().dt.strftime("%m-%d").nunique() return n_unique_days_of_calendar_year ================================================ FILE: featuretools/primitives/standard/aggregation/n_unique_days_of_month.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools.primitives.base import AggregationPrimitive class NUniqueDaysOfMonth(AggregationPrimitive): """Determines the number of unique days of month. Description: Given a list of datetimes, return the number of unique days of month. The maximum value is 31. 2018-01-01 and 2018-02-01 will be counted as 1 unique day. 2019-01-01 and 2018-01-01 will also be counted as 1. Examples: >>> from datetime import datetime >>> n_unique_days_of_month = NUniqueDaysOfMonth() >>> times = [datetime(2019, 1, 1), ... datetime(2019, 2, 1), ... datetime(2018, 2, 1), ... datetime(2019, 1, 2), ... 
datetime(2019, 1, 3)] >>> n_unique_days_of_month(times) 3 """ name = "n_unique_days_of_month" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def n_unique_days_of_month(x): return x.dropna().dt.day.nunique() return n_unique_days_of_month ================================================ FILE: featuretools/primitives/standard/aggregation/n_unique_months.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools.primitives.base import AggregationPrimitive class NUniqueMonths(AggregationPrimitive): """Determines the number of unique months. Description: Given a list of datetimes, return the number of unique months. NUniqueMonths counts absolute month, not month of year, so the same month in two different years is treated as different. (i.e. Feb 2017 is different than Feb 2019.) Examples: >>> from datetime import datetime >>> n_unique_months = NUniqueMonths() >>> times = [datetime(2019, 1, 1), ... datetime(2019, 1, 2), ... datetime(2019, 1, 3), ... datetime(2019, 2, 1), ... 
datetime(2018, 2, 1)] >>> n_unique_months(times) 3 """ name = "n_unique_months" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def n_unique_months(x): return x.dt.to_period("M").nunique() return n_unique_months ================================================ FILE: featuretools/primitives/standard/aggregation/n_unique_weeks.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools.primitives.base import AggregationPrimitive class NUniqueWeeks(AggregationPrimitive): """Determines the number of unique weeks. Description: Given a list of datetimes, return the number of unique weeks (Monday-Sunday). NUniqueWeeks counts by absolute week, not week of year, so the first week of 2018 and the first week of 2019 count as two unique values. Examples: >>> from datetime import datetime >>> n_unique_weeks = NUniqueWeeks() >>> times = [datetime(2018, 2, 2), ... datetime(2019, 1, 1), ... datetime(2019, 2, 1), ... datetime(2019, 2, 1), ... datetime(2019, 2, 3), ... 
datetime(2019, 2, 21)] >>> n_unique_weeks(times) 4 """ name = "n_unique_weeks" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def n_unique_weeks(x): return x.dt.to_period("W").nunique() return n_unique_weeks ================================================ FILE: featuretools/primitives/standard/aggregation/num_consecutive_greater_mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base import AggregationPrimitive class NumConsecutiveGreaterMean(AggregationPrimitive): """Determines the length of the longest subsequence above the mean. Description: Given a list of numbers, find the longest subsequence of numbers larger than the mean of the entire sequence. Return the length of the longest subsequence. Args: skipna (bool): If this is False and any value in x is `NaN`, then the result will be `NaN`. If True, `NaN` values are skipped. Default is True. Examples: >>> num_consecutive_greater_mean = NumConsecutiveGreaterMean() >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6]) 3.0 We can also control the way `NaN` values are handled. 
>>> num_consecutive_greater_mean = NumConsecutiveGreaterMean(skipna=False) >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6, None]) nan """ name = "num_consecutive_greater_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def num_consecutive_greater_mean(x): # check for NaN cases if x.isnull().all(): return np.nan if not self.skipna and x.isnull().values.any(): return np.nan x_mean = x.mean() # In some cases, the mean of x may be NaN # (such as when x has both inf and -inf values) if np.isnan(x_mean): return np.nan # Find indices of points at or below mean x = x.dropna().reset_index(drop=True) below_mean_indices = x[x <= x_mean].index.to_series() # If none of x is at or below the mean, return the length of x if below_mean_indices.empty: return len(x) # Pad index with start/end values, in case the longest # sequence occurs at the beginning or end of x below_mean_indices[-1] = -1 below_mean_indices[len(x)] = len(x) below_mean_indices = below_mean_indices.sort_index() # Calculate gaps between points at or below mean below_mean_indices_shifted = below_mean_indices.shift(1) diffs = below_mean_indices - below_mean_indices_shifted # Take biggest gap, and subtract 1 to get result max_gap = (diffs).max() - 1 return max_gap return num_consecutive_greater_mean ================================================ FILE: featuretools/primitives/standard/aggregation/num_consecutive_less_mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base import AggregationPrimitive class NumConsecutiveLessMean(AggregationPrimitive): """Determines the length of the longest subsequence below the mean. 
Description: Given a list of numbers, find the longest subsequence of numbers smaller than the mean of the entire sequence. Return the length of the longest subsequence. Args: skipna (bool): If this is False and any value in x is `NaN`, then the result will be `NaN`. If True, `NaN` values are skipped. Default is True. Examples: >>> num_consecutive_less_mean = NumConsecutiveLessMean() >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6]) 3.0 We can also control the way `NaN` values are handled. >>> num_consecutive_less_mean = NumConsecutiveLessMean(skipna=False) >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6, None]) nan """ name = "num_consecutive_less_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def num_consecutive_less_mean(x): # check for NaN cases if x.isnull().all(): return np.nan if not self.skipna and x.isnull().values.any(): return np.nan x_mean = x.mean() # In some cases, the mean of x may be NaN # (such as when x has both inf and -inf values) if np.isnan(x_mean): return np.nan # Find indices of points at or above mean x = x.dropna().reset_index(drop=True) above_mean_indices = x[x >= x_mean].index.to_series() # If none of x is at or above the mean, return the length of x if above_mean_indices.empty: return len(x) # Pad index with start/end values, in case the longest # sequence occurs at the beginning or end of x above_mean_indices[-1] = -1 above_mean_indices[len(x)] = len(x) above_mean_indices = above_mean_indices.sort_index() # Calculate gaps between points at or above mean above_mean_indices_shifted = above_mean_indices.shift(1) diffs = above_mean_indices - above_mean_indices_shifted # Take biggest gap, and subtract 1 to get result max_gap = (diffs).max() - 1 return max_gap return num_consecutive_less_mean 
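The padded-index trick used by the two primitives above (mark the positions that break a run, add sentinels at -1 and len(x), then take the widest gap minus one) can be sketched without pandas. This is only an illustrative re-implementation, not featuretools code: the function name `longest_run_below_mean` is hypothetical, and it assumes missing values are dropped first, which mirrors the `skipna=True` default.

```python
import math


def longest_run_below_mean(values):
    """Length of the longest run of consecutive values strictly below the mean.

    Hypothetical helper for illustration only; None and NaN entries are
    dropped before the mean is computed.
    """
    cleaned = [v for v in values if v is not None and not math.isnan(v)]
    if not cleaned:
        return math.nan
    mean = sum(cleaned) / len(cleaned)
    if math.isnan(mean):
        # The mean itself can be NaN, e.g. when the input mixes inf and -inf.
        return math.nan
    # Positions that break a run: values at or above the mean, plus
    # sentinels just before the first element and just past the last one.
    breaks = [-1] + [i for i, v in enumerate(cleaned) if v >= mean] + [len(cleaned)]
    # The widest gap between neighbouring breaks, minus one, is the longest run.
    return max(b - a for a, b in zip(breaks, breaks[1:])) - 1


print(longest_run_below_mean([1, 2, 3, 4, 5, 6]))  # mean is 3.5, so the run is 1, 2, 3
```

The sentinels are what let a run at the very start or very end of the input be measured with the same subtraction as a run in the middle.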
================================================ FILE: featuretools/primitives/standard/aggregation/num_false_since_last_true.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, IntegerNullable from featuretools.primitives.base import AggregationPrimitive class NumFalseSinceLastTrue(AggregationPrimitive): """Calculates the number of `False` values since the last `True` value. Description: From a series of Booleans, find the last record with a `True` value. Return the count of `False` values between that record and the end of the series. Return nan if no values are `True`. Any nan values in the input are ignored. A `True` value in the last row will result in a count of 0. Inputs are converted to booleans before calculating the result. Examples: >>> num_false_since_last_true = NumFalseSinceLastTrue() >>> num_false_since_last_true([True, False, True, False, False]) 2 """ name = "num_false_since_last_true" input_types = [ColumnSchema(logical_type=Boolean)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def num_false_since_last_true(x): if x.empty: return np.nan x = x.dropna().astype(bool) true_indices = x[x] if true_indices.empty: return np.nan last_true_index = true_indices.index[-1] x_slice = x.loc[last_true_index:] return np.invert(x_slice).sum() return num_false_since_last_true ================================================ FILE: featuretools/primitives/standard/aggregation/num_peaks.py ================================================ import pandas as pd from scipy.signal import find_peaks from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base import AggregationPrimitive class NumPeaks(AggregationPrimitive): """Determines the number of peaks in a list of numbers. 
Description: Given a list of numbers, count the number of local maxima. Uses the find_peaks function from scipy.signal. Examples: >>> num_peaks = NumPeaks() >>> num_peaks([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0]) 4 """ name = "num_peaks" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def num_peaks(x): if x.dtype == "Int64": x = x.astype("float64") peaks = find_peaks(x)[0] return len(peaks[~pd.isna(peaks)]) return num_peaks ================================================ FILE: featuretools/primitives/standard/aggregation/num_true.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable, IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class NumTrue(AggregationPrimitive): """Counts the number of `True` values. Description: Given a list of booleans, return the number of `True` values. Ignores 'NaN'. 
Examples: >>> num_true = NumTrue() >>> num_true([True, False, True, True, None]) 3 """ name = "num_true" input_types = [ [ColumnSchema(logical_type=Boolean)], [ColumnSchema(logical_type=BooleanNullable)], ] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 stack_on = [] stack_on_exclude = [] description_template = "the number of times {} is true" def get_function(self): return np.sum ================================================ FILE: featuretools/primitives/standard/aggregation/num_true_since_last_false.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, IntegerNullable from featuretools.primitives.base import AggregationPrimitive class NumTrueSinceLastFalse(AggregationPrimitive): """Calculates the number of `True` values since the last `False` value. Description: From a series of Booleans, find the last record with a `False` value. Return the count of `True` values between that record and the end of the series. Return nan if no values are `False`. Any nan values in the input are ignored. A `False` value in the last row will result in a count of 0. 
Examples: >>> num_true_since_last_false = NumTrueSinceLastFalse() >>> num_true_since_last_false([False, True, False, True, True]) 2 """ name = "num_true_since_last_false" input_types = [ColumnSchema(logical_type=Boolean)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False default_value = 0 def get_function(self): def num_true_since_last_false(x): x = x.dropna().astype(bool) false_indices = x[~x] if false_indices.empty: return np.nan last_false_index = false_indices.index[-1] x_slice = x.loc[last_false_index:] return x_slice.sum() return num_true_since_last_false ================================================ FILE: featuretools/primitives/standard/aggregation/num_unique.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class NumUnique(AggregationPrimitive): """Determines the number of distinct values, ignoring `NaN` values. Args: use_string_for_pd_calc (bool): Determines if the string 'nunique' or the function pd.Series.nunique is used for making the primitive calculation. Put in place to account for the bug https://github.com/pandas-dev/pandas/issues/57317. Defaults to using the string. Examples: >>> num_unique = NumUnique(use_string_for_pd_calc=False) >>> num_unique(['red', 'blue', 'green', 'yellow']) 4 `NaN` values will be ignored. 
>>> num_unique(['red', 'blue', 'green', 'yellow', None]) 4 """ name = "num_unique" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) stack_on_self = False description_template = "the number of unique elements in {}" def __init__(self, use_string_for_pd_calc=True): self.use_string_for_pd_calc = use_string_for_pd_calc def get_function(self): if self.use_string_for_pd_calc: return "nunique" return pd.Series.nunique ================================================ FILE: featuretools/primitives/standard/aggregation/num_zero_crossings.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Integer from featuretools.primitives.base import AggregationPrimitive class NumZeroCrossings(AggregationPrimitive): """Determines the number of times a list crosses 0. Description: Given a list of numbers, return the number of times the value crosses 0. It is the number of times the value goes from a positive number to a negative number, or a negative number to a positive number. NaN values are ignored. 
Examples: >>> num_zero_crossings = NumZeroCrossings() >>> num_zero_crossings([1, -1, 2, -2, 3, -3]) 5 """ name = "num_zero_crossings" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}) def get_function(self): def num_zero_crossings(x): cleaned = x[(x != 0) & (x == x)] signs = np.sign(cleaned) difference = np.diff(signs) crossings = np.where(difference)[0] return len(crossings) return num_zero_crossings ================================================ FILE: featuretools/primitives/standard/aggregation/percent_true.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable, Double from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class PercentTrue(AggregationPrimitive): """Determines the percent of `True` values. Description: Given a list of booleans, return the percent of values which are `True` as a decimal. `NaN` values are treated as `False`, adding to the denominator. 
Examples: >>> percent_true = PercentTrue() >>> percent_true([True, False, True, True, None]) 0.6 """ name = "percent_true" input_types = [ [ColumnSchema(logical_type=BooleanNullable)], [ColumnSchema(logical_type=Boolean)], ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) stack_on = [] stack_on_exclude = [] default_value = pd.NA description_template = "the percentage of true values in {}" def get_function(self): def percent_true(s): return s.fillna(False).mean() return percent_true ================================================ FILE: featuretools/primitives/standard/aggregation/percent_unique.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import AggregationPrimitive class PercentUnique(AggregationPrimitive): """Determines the percent of unique values. Description: Given a list of values, determine what percent of the list is made up of unique values. Multiple `NaN` values are treated as one unique value. Args: skipna (bool): Determines whether to ignore `NaN` values. Defaults to True. Examples: >>> percent_unique = PercentUnique() >>> percent_unique([1, 1, 2, 2, 3, 4, 5, 6, 7, 8]) 0.8 We can control whether or not `NaN` values are ignored. 
>>> percent_unique = PercentUnique() >>> percent_unique([1, 1, 2, None]) 0.5 >>> percent_unique_skipna = PercentUnique(skipna=False) >>> percent_unique_skipna([1, 1, 2, None]) 0.75 """ name = "percent_unique" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) default_value = 0 def __init__(self, skipna=True): self.skipna = skipna def get_function(self): def percent_unique(x): return x.nunique(dropna=self.skipna) / (x.shape[0] * 1.0) return percent_unique ================================================ FILE: featuretools/primitives/standard/aggregation/skew.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Skew(AggregationPrimitive): """Computes the extent to which a distribution differs from a normal distribution. Description: For normally distributed data, the skewness should be about 0. A skewness value > 0 means that there is more weight in the right tail of the distribution. Examples: >>> skew = Skew() >>> skew([1, 10, 30, None]) 1.0437603722639681 """ name = "skew" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on = [] stack_on_self = False description_template = "the skewness of {}" def get_function(self): return pd.Series.skew ================================================ FILE: featuretools/primitives/standard/aggregation/std.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive class Std(AggregationPrimitive): """Computes the dispersion relative to the mean value, ignoring `NaN`. 
Examples: >>> std = Std() >>> round(std([1, 2, 3, 4, 5, None]), 3) 1.414 """ name = "std" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False description_template = "the standard deviation of {}" def get_function(self): return np.std ================================================ FILE: featuretools/primitives/standard/aggregation/sum_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.primitives.standard.aggregation.count import Count class Sum(AggregationPrimitive): """Calculates the sum of the values, ignoring `NaN`. Examples: >>> sum = Sum() >>> sum([1, 2, 3, 4, 5, None]) 15.0 """ name = "sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False stack_on_exclude = [Count] default_value = 0 description_template = "the sum of {}" def get_function(self): return np.sum ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_first.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.utils import convert_time_units class TimeSinceFirst(AggregationPrimitive): """Calculates the time elapsed since the first datetime (default in seconds). Description: Given a list of datetimes, calculate the time elapsed since the first datetime (default in seconds). Uses the instance's cutoff time. Args: unit (str): Defines the unit of time to count from. Defaults to seconds. 
Acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds Examples: >>> from datetime import datetime >>> time_since_first = TimeSinceFirst() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_first(times, time=cutoff_time) 900.0 >>> from datetime import datetime >>> time_since_first = TimeSinceFirst(unit = "minutes") >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_first(times, time=cutoff_time) 15.0 """ name = "time_since_first" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True description_template = "the time since the first {}" def __init__(self, unit="seconds"): self.unit = unit.lower() def get_function(self): def time_since_first(values, time=None): time_since = time - values.iloc[0] return convert_time_units(time_since.total_seconds(), self.unit) return time_since_first ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_last.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.utils import convert_time_units class TimeSinceLast(AggregationPrimitive): """Calculates the time elapsed since the last datetime (default in seconds). Description: Given a list of datetimes, calculate the time elapsed since the last datetime (default in seconds). Uses the instance's cutoff time. Args: unit (str): Defines the unit of time to count from. Defaults to seconds. 
Acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds Examples: >>> from datetime import datetime >>> time_since_last = TimeSinceLast() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_last(times, time=cutoff_time) 150.0 >>> from datetime import datetime >>> time_since_last = TimeSinceLast(unit = "minutes") >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_last(times, time=cutoff_time) 2.5 """ name = "time_since_last" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True description_template = "the time since the last {}" def __init__(self, unit="seconds"): self.unit = unit.lower() def get_function(self): def time_since_last(values, time=None): time_since = time - values.iloc[-1] return convert_time_units(time_since.total_seconds(), self.unit) return time_since_last ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_last_false.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double from featuretools.primitives.base import AggregationPrimitive class TimeSinceLastFalse(AggregationPrimitive): """Calculates the time since the last `False` value. Description: Using a series of Datetimes and a series of Booleans, find the last record with a `False` value. Return the seconds elapsed between that record and the instance's cutoff time. Return nan if no values are `False`. 
Examples: >>> from datetime import datetime >>> time_since_last_false = TimeSinceLastFalse() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> booleans = [True, False, True] >>> time_since_last_false(times, booleans, time=cutoff_time) 285.0 """ name = "time_since_last_false" input_types = [ [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=Boolean), ], [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=BooleanNullable), ], ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True stack_on_self = False default_value = 0 def get_function(self): def time_since_last_false(datetime_col, bool_col, time=None): df = pd.DataFrame( { "datetime": datetime_col, "bool": bool_col, }, ).dropna() if df.empty: return np.nan false_indices = df[~df["bool"]] if false_indices.empty: return np.nan last_false_index = false_indices.index[-1] time_since = time - datetime_col.loc[last_false_index] return time_since.total_seconds() return time_since_last_false ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_last_max.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base import AggregationPrimitive class TimeSinceLastMax(AggregationPrimitive): """Calculates the time since the maximum value occurred. Description: Given a list of numbers, and a corresponding index of datetimes, find the time of the maximum value, and return the time elapsed since it occurred. This calculation is done using an instance id's cutoff time. If multiple values equal the maximum, use the first occurring maximum. 
Examples: >>> from datetime import datetime >>> time_since_last_max = TimeSinceLastMax() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_last_max(times, [1, 3, 2], time=cutoff_time) 285.0 """ name = "time_since_last_max" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True stack_on_self = False default_value = 0 def get_function(self): def time_since_last_max(datetime_col, numeric_col, time=None): df = pd.DataFrame( { "datetime": datetime_col, "numeric": numeric_col, }, ).dropna() if df.empty: return np.nan max_row = df.loc[df["numeric"].idxmax()] max_time = max_row["datetime"] time_since = time - max_time return time_since.total_seconds() return time_since_last_max ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_last_min.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base import AggregationPrimitive class TimeSinceLastMin(AggregationPrimitive): """Calculates the time since the minimum value occurred. Description: Given a list of numbers, and a corresponding index of datetimes, find the time of the minimum value, and return the time elapsed since it occurred. This calculation is done using an instance id's cutoff time. If multiple values equal the minimum, use the first occurring minimum. Examples: >>> from datetime import datetime >>> time_since_last_min = TimeSinceLastMin() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... 
datetime(2010, 1, 1, 11, 57, 30)] >>> time_since_last_min(times, [1, 3, 2], time=cutoff_time) 900.0 """ name = "time_since_last_min" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True stack_on_self = False default_value = 0 def get_function(self): def time_since_last_min(datetime_col, numeric_col, time=None): df = pd.DataFrame( { "datetime": datetime_col, "numeric": numeric_col, }, ).dropna() if df.empty: return np.nan min_row = df.loc[df["numeric"].idxmin()] min_time = min_row["datetime"] time_since = time - min_time return time_since.total_seconds() return time_since_last_min ================================================ FILE: featuretools/primitives/standard/aggregation/time_since_last_true.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double from featuretools.primitives.base import AggregationPrimitive class TimeSinceLastTrue(AggregationPrimitive): """Calculates the time since the last `True` value. Description: Using a series of Datetimes and a series of Booleans, find the last record with a `True` value. Return the seconds elapsed between that record and the instance's cutoff time. Return nan if no values are `True`. Examples: >>> from datetime import datetime >>> time_since_last_true = TimeSinceLastTrue() >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0) >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... 
datetime(2010, 1, 1, 11, 57, 30)] >>> booleans = [True, True, False] >>> time_since_last_true(times, booleans, time=cutoff_time) 285.0 """ name = "time_since_last_true" input_types = [ [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=Boolean), ], [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=BooleanNullable), ], ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_calc_time = True stack_on_self = False default_value = 0 def get_function(self): def time_since_last_true(datetime_col, bool_col, time=None): df = pd.DataFrame( { "datetime": datetime_col, "bool": bool_col, }, ).dropna() if df.empty: return np.nan true_indices = df[df["bool"]] if true_indices.empty: return np.nan last_true_index = true_indices.index[-1] time_since = time - datetime_col.loc[last_true_index] return time_since.total_seconds() return time_since_last_true ================================================ FILE: featuretools/primitives/standard/aggregation/trend.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive from featuretools.utils import calculate_trend class Trend(AggregationPrimitive): """Calculates the trend of a column over time. Description: Given a list of values and a corresponding list of datetimes, calculate the slope of the linear trend of values. Examples: >>> from datetime import datetime >>> trend = Trend() >>> times = [datetime(2010, 1, 1, 11, 45, 0), ... datetime(2010, 1, 1, 11, 55, 15), ... datetime(2010, 1, 1, 11, 57, 30), ... datetime(2010, 1, 1, 11, 12), ... 
datetime(2010, 1, 1, 11, 12, 15)] >>> round(trend([1, 2, 3, 4, 5], times), 3) -0.053 """ name = "trend" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the linear trend of {} over time" def get_function(self): def pd_trend(y, x): return calculate_trend(pd.Series(data=y.values, index=x.values)) return pd_trend ================================================ FILE: featuretools/primitives/standard/aggregation/variance.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import AggregationPrimitive class Variance(AggregationPrimitive): """Calculates the variance of a list of numbers. Description: Given a list of numbers, return the variance, using numpy's built-in variance function. Nan values in a series will be ignored. Return nan when the series is empty or entirely null. Examples: >>> variance = Variance() >>> variance([0, 3, 4, 3]) 2.25 Null values in a series will be ignored. 
>>> variance = Variance() >>> variance([0, 3, 4, 3, None]) 2.25 """ name = "variance" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) stack_on_self = False default_value = np.nan def get_function(self): return np.var ================================================ FILE: featuretools/primitives/standard/transform/__init__.py ================================================ # flake8: noqa from featuretools.primitives.standard.transform.absolute_diff import AbsoluteDiff from featuretools.primitives.standard.transform.binary import * from featuretools.primitives.standard.transform.cumulative import * from featuretools.primitives.standard.transform.datetime import * from featuretools.primitives.standard.transform.email import * from featuretools.primitives.standard.transform.exponential import * from featuretools.primitives.standard.transform.file_extension import FileExtension from featuretools.primitives.standard.transform.full_name_to_first_name import ( FullNameToFirstName, ) from featuretools.primitives.standard.transform.full_name_to_last_name import ( FullNameToLastName, ) from featuretools.primitives.standard.transform.full_name_to_title import ( FullNameToTitle, ) from featuretools.primitives.standard.transform.nth_week_of_month import NthWeekOfMonth from featuretools.primitives.standard.transform.is_in import IsIn from featuretools.primitives.standard.transform.is_null import IsNull from featuretools.primitives.standard.transform.latlong import * from featuretools.primitives.standard.transform.natural_language import * from featuretools.primitives.standard.transform.not_primitive import Not from featuretools.primitives.standard.transform.numeric import * from featuretools.primitives.standard.transform.percent_change import PercentChange from featuretools.primitives.standard.transform.postal import * from featuretools.primitives.standard.transform.savgol_filter import 
SavgolFilter from featuretools.primitives.standard.transform.time_series import * from featuretools.primitives.standard.transform.url import * ================================================ FILE: featuretools/primitives/standard/transform/absolute_diff.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class AbsoluteDiff(TransformPrimitive): """Calculates the absolute difference from the previous element in a list of numbers. Description: The absolute difference from the previous element is computed for all elements in the input. The first item in the output will always be nan, since there is no previous element for the first element. Elements in the input containing nan will be filled using either a forward-fill or backward-fill method, specified by the method argument. Args: method (str): Method to use for filling nan values in reindexed Series. Possible values are ['pad', 'ffill', 'backfill', 'bfill']. Default is 'ffill'. `pad / ffill`: propagate last valid observation forward to fill gap `backfill / bfill`: propagate next valid observation backward to fill gap limit (int): The max number of consecutive NaN values in a gap that can be filled. Default is None. 
Examples: >>> absolute_diff = AbsoluteDiff() >>> absolute_diff([2, 5, 15, 3]).tolist() [nan, 3.0, 10.0, 12.0] Forward filling of input elements using the 'ffill' argument >>> absolute_diff_ffill = AbsoluteDiff(method="ffill") >>> absolute_diff_ffill([None, 5, 10, 20, None, 10, None]).tolist() [nan, nan, 5.0, 10.0, 0.0, 10.0, 0.0] Backward filling of input element using the 'bfill' argument >>> absolute_diff_bfill = AbsoluteDiff(method="bfill") >>> absolute_diff_bfill([None, 5, 10, 20, None, 10, None]).tolist() [nan, 0.0, 5.0, 10.0, 10.0, 0.0, nan] The number of nan values that are filled can be limited >>> absolute_diff_limitfill = AbsoluteDiff(limit=2) >>> absolute_diff_limitfill([2, None, None, None, 3, 1]).tolist() [nan, 0.0, 0.0, nan, nan, 2.0] """ name = "absolute_diff" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, method="ffill", limit=None): if method not in ["backfill", "bfill", "pad", "ffill"]: raise ValueError("Invalid method") self.method = method self.limit = limit def get_function(self): def absolute_diff(data): return data.fillna(method=self.method, limit=self.limit).diff().abs() return absolute_diff ================================================ FILE: featuretools/primitives/standard/transform/binary/__init__.py ================================================ from featuretools.primitives.standard.transform.binary.add_numeric import AddNumeric from featuretools.primitives.standard.transform.binary.add_numeric_scalar import ( AddNumericScalar, ) from featuretools.primitives.standard.transform.binary.and_primitive import And from featuretools.primitives.standard.transform.binary.divide_by_feature import ( DivideByFeature, ) from featuretools.primitives.standard.transform.binary.divide_numeric import ( DivideNumeric, ) from featuretools.primitives.standard.transform.binary.divide_numeric_scalar import ( DivideNumericScalar, ) from 
featuretools.primitives.standard.transform.binary.equal import Equal from featuretools.primitives.standard.transform.binary.equal_scalar import EqualScalar from featuretools.primitives.standard.transform.binary.greater_than import GreaterThan from featuretools.primitives.standard.transform.binary.greater_than_equal_to import ( GreaterThanEqualTo, ) from featuretools.primitives.standard.transform.binary.greater_than_equal_to_scalar import ( GreaterThanEqualToScalar, ) from featuretools.primitives.standard.transform.binary.greater_than_scalar import ( GreaterThanScalar, ) from featuretools.primitives.standard.transform.binary.less_than import LessThan from featuretools.primitives.standard.transform.binary.less_than_equal_to import ( LessThanEqualTo, ) from featuretools.primitives.standard.transform.binary.less_than_equal_to_scalar import ( LessThanEqualToScalar, ) from featuretools.primitives.standard.transform.binary.less_than_scalar import ( LessThanScalar, ) from featuretools.primitives.standard.transform.binary.modulo_by_feature import ( ModuloByFeature, ) from featuretools.primitives.standard.transform.binary.modulo_numeric import ( ModuloNumeric, ) from featuretools.primitives.standard.transform.binary.modulo_numeric_scalar import ( ModuloNumericScalar, ) from featuretools.primitives.standard.transform.binary.multiply_boolean import ( MultiplyBoolean, ) from featuretools.primitives.standard.transform.binary.multiply_numeric import ( MultiplyNumeric, ) from featuretools.primitives.standard.transform.binary.multiply_numeric_boolean import ( MultiplyNumericBoolean, ) from featuretools.primitives.standard.transform.binary.multiply_numeric_scalar import ( MultiplyNumericScalar, ) from featuretools.primitives.standard.transform.binary.not_equal import NotEqual from featuretools.primitives.standard.transform.binary.not_equal_scalar import ( NotEqualScalar, ) from featuretools.primitives.standard.transform.binary.or_primitive import Or from 
featuretools.primitives.standard.transform.binary.scalar_subtract_numeric_feature import ( ScalarSubtractNumericFeature, ) from featuretools.primitives.standard.transform.binary.subtract_numeric import ( SubtractNumeric, ) from featuretools.primitives.standard.transform.binary.subtract_numeric_scalar import ( SubtractNumericScalar, ) ================================================ FILE: featuretools/primitives/standard/transform/binary/add_numeric.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class AddNumeric(TransformPrimitive): """Performs element-wise addition of two lists. Description: Given a list of values X and a list of values Y, determine the sum of each value in X with its corresponding value in Y. Examples: >>> add_numeric = AddNumeric() >>> add_numeric([2, 1, 2], [1, 2, 2]).tolist() [3, 3, 4] """ name = "add_numeric" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) commutative = True description_template = "the sum of {} and {}" def get_function(self): return np.add def generate_name(self, base_feature_names): return "%s + %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/add_numeric_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class AddNumericScalar(TransformPrimitive): """Adds a scalar to each value in the list. Description: Given a list of numeric values and a scalar, add the given scalar to each value in the list. 
Examples: >>> add_numeric_scalar = AddNumericScalar(value=2) >>> add_numeric_scalar([3, 1, 2]).tolist() [5, 3, 4] """ name = "add_numeric_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=0): self.value = value self.description_template = "the sum of {{}} and {}".format(self.value) def get_function(self): def add_scalar(vals): return vals + self.value return add_scalar def generate_name(self, base_feature_names): return "%s + %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/and_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class And(TransformPrimitive): """Performs element-wise logical AND of two lists. Description: Given a list of booleans X and a list of booleans Y, determine whether each value in X is `True`, and whether its corresponding value in Y is also `True`. 
Examples: >>> _and = And() >>> _and([False, True, False], [True, True, False]).tolist() [False, True, False] """ name = "and" input_types = [ [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=BooleanNullable), ], [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)], [ ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=BooleanNullable), ], [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=Boolean), ], ] return_type = ColumnSchema(logical_type=BooleanNullable) commutative = True description_template = "whether {} and {} are true" def get_function(self): return np.logical_and def generate_name(self, base_feature_names): return "AND(%s, %s)" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/divide_by_feature.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class DivideByFeature(TransformPrimitive): """Divides a scalar by each value in the list. Description: Given a list of numeric values and a scalar, divide the scalar by each value and return the list of quotients. 
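The operand order is the only thing separating this primitive from DivideNumericScalar, so a side-by-side sketch may help. Both functions below are hypothetical list-based helpers, not the pandas-based implementations; in particular, plain Python raises ZeroDivisionError on a zero divisor, and the sketch makes no claim about how a Series handles that case.

```python
def divide_by_feature_sketch(vals, value=1):
    # Hypothetical helper: the scalar is the numerator.
    return [value / v for v in vals]


def divide_numeric_scalar_sketch(vals, value=1):
    # Hypothetical helper: the scalar is the denominator.
    return [v / value for v in vals]


print(divide_by_feature_sketch([4, 1, 2], value=2))      # [0.5, 2.0, 1.0]
print(divide_numeric_scalar_sketch([4, 1, 2], value=2))  # [2.0, 0.5, 1.0]
```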
Examples: >>> divide_by_feature = DivideByFeature(value=2) >>> divide_by_feature([4, 1, 2]).tolist() [0.5, 2.0, 1.0] """ name = "divide_by_feature" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=1): self.value = value self.description_template = "the result of {} divided by {{}}".format( self.value, ) def get_function(self): def divide_by_feature(vals): return self.value / vals return divide_by_feature def generate_name(self, base_feature_names): return "%s / %s" % (str(self.value), base_feature_names[0]) ================================================ FILE: featuretools/primitives/standard/transform/binary/divide_numeric.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class DivideNumeric(TransformPrimitive): """Performs element-wise division of two lists. Description: Given a list of values X and a list of values Y, determine the quotient of each value in X divided by its corresponding value in Y. Args: commutative (bool): determines if Deep Feature Synthesis should generate both x / y and y / x, or just one. If True, there is no guarantee which of the two will be generated. Defaults to False. 
Examples: >>> divide_numeric = DivideNumeric() >>> divide_numeric([2.0, 1.0, 2.0], [1.0, 2.0, 2.0]).tolist() [2.0, 0.5, 1.0] """ name = "divide_numeric" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the result of {} divided by {}" def __init__(self, commutative=False): self.commutative = commutative def get_function(self): def divide_numeric(val1, val2): return val1 / val2 return divide_numeric def generate_name(self, base_feature_names): return "%s / %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/divide_numeric_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class DivideNumericScalar(TransformPrimitive): """Divides each element in the list by a scalar. Description: Given a list of numeric values and a scalar, divide each value in the list by the scalar. 
Examples: >>> divide_numeric_scalar = DivideNumericScalar(value=2) >>> divide_numeric_scalar([3, 1, 2]).tolist() [1.5, 0.5, 1.0] """ name = "divide_numeric_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=1): self.value = value self.description_template = "the result of {{}} divided by {}".format( self.value, ) def get_function(self): def divide_scalar(vals): return vals / self.value return divide_scalar def generate_name(self, base_feature_names): return "%s / %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/equal.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class Equal(TransformPrimitive): """Determines if values in one list are equal to another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is equal to each corresponding value in Y. 
Examples: >>> equal = Equal() >>> equal([2, 1, 2], [1, 2, 2]).tolist() [False, False, True] """ name = "equal" input_types = [ColumnSchema(), ColumnSchema()] return_type = ColumnSchema(logical_type=BooleanNullable) commutative = True description_template = "whether {} equals {}" def get_function(self): def equal(x_vals, y_vals): if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance( y_vals.dtype, pd.CategoricalDtype, ): categories = set(x_vals.cat.categories).union( set(y_vals.cat.categories), ) x_vals = x_vals.cat.add_categories( categories.difference(set(x_vals.cat.categories)), ) y_vals = y_vals.cat.add_categories( categories.difference(set(y_vals.cat.categories)), ) return x_vals.eq(y_vals) return equal def generate_name(self, base_feature_names): return "%s = %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/equal_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class EqualScalar(TransformPrimitive): """Determines if values in a list are equal to a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is equal to the scalar. 
Examples: >>> equal_scalar = EqualScalar(value=2) >>> equal_scalar([3, 1, 2]).tolist() [False, False, True] """ name = "equal_scalar" input_types = [ColumnSchema()] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=None): self.value = value self.description_template = "whether {{}} equals {}".format(self.value) def get_function(self): def equal_scalar(vals): return vals == self.value return equal_scalar def generate_name(self, base_feature_names): return "%s = %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/greater_than.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, Datetime, Ordinal from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class GreaterThan(TransformPrimitive): """Determines if values in one list are greater than another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is greater than each corresponding value in Y. Equal pairs will return `False`. 
Examples: >>> greater_than = GreaterThan() >>> greater_than([2, 1, 2], [1, 2, 2]).tolist() [True, False, False] """ name = "greater_than" input_types = [ [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ], [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)], [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)], ] return_type = ColumnSchema(logical_type=BooleanNullable) description_template = "whether {} is greater than {}" def get_function(self): def greater_than(val1, val2): val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype) val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype) if val1_is_categorical and val2_is_categorical: if not all(val1.cat.categories == val2.cat.categories): return val1.where(pd.isnull, np.nan) elif val1_is_categorical or val2_is_categorical: # This can happen because CFM does not set proper dtypes for intermediate # features, so some agg features that should be Ordinal don't yet have correct type. return val1.where(pd.isnull, np.nan) return val1 > val2 return greater_than def generate_name(self, base_feature_names): return "%s > %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/greater_than_equal_to.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, Datetime, Ordinal from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class GreaterThanEqualTo(TransformPrimitive): """Determines if values in one list are greater than or equal to another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is greater than or equal to each corresponding value in Y. Equal pairs will return `True`. 
Examples: >>> greater_than_equal_to = GreaterThanEqualTo() >>> greater_than_equal_to([2, 1, 2], [1, 2, 2]).tolist() [True, False, True] """ name = "greater_than_equal_to" input_types = [ [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ], [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)], [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)], ] return_type = ColumnSchema(logical_type=BooleanNullable) description_template = "whether {} is greater than or equal to {}" def get_function(self): def greater_than_equal(val1, val2): val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype) val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype) if val1_is_categorical and val2_is_categorical: if not all(val1.cat.categories == val2.cat.categories): return val1.where(pd.isnull, np.nan) elif val1_is_categorical or val2_is_categorical: # This can happen because CFM does not set proper dtypes for intermediate # features, so some agg features that should be Ordinal don't yet have correct type. return val1.where(pd.isnull, np.nan) return val1 >= val2 return greater_than_equal def generate_name(self, base_feature_names): return "%s >= %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/greater_than_equal_to_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class GreaterThanEqualToScalar(TransformPrimitive): """Determines if values are greater than or equal to a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is greater than or equal to the scalar. If a value is equal to the scalar, return `True`. 
Examples: >>> greater_than_equal_to_scalar = GreaterThanEqualToScalar(value=2) >>> greater_than_equal_to_scalar([3, 1, 2]).tolist() [True, False, True] """ name = "greater_than_equal_to_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=0): self.value = value self.description_template = ( "whether {{}} is greater than or equal to {}".format(self.value) ) def get_function(self): def greater_than_equal_to_scalar(vals): return vals >= self.value return greater_than_equal_to_scalar def generate_name(self, base_feature_names): return "%s >= %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/greater_than_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class GreaterThanScalar(TransformPrimitive): """Determines if values are greater than a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is greater than the scalar. If a value is equal to the scalar, return `False`. 
Examples: >>> greater_than_scalar = GreaterThanScalar(value=2) >>> greater_than_scalar([3, 1, 2]).tolist() [True, False, False] """ name = "greater_than_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=0): self.value = value self.description_template = "whether {{}} is greater than {}".format(self.value) def get_function(self): def greater_than_scalar(vals): return vals > self.value return greater_than_scalar def generate_name(self, base_feature_names): return "%s > %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/less_than.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, Datetime, Ordinal from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class LessThan(TransformPrimitive): """Determines if values in one list are less than another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is less than each corresponding value in Y. Equal pairs will return `False`. 
Examples: >>> less_than = LessThan() >>> less_than([2, 1, 2], [1, 2, 2]).tolist() [False, True, False] """ name = "less_than" input_types = [ [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ], [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)], [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)], ] return_type = ColumnSchema(logical_type=BooleanNullable) description_template = "whether {} is less than {}" def get_function(self): def less_than(val1, val2): val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype) val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype) if val1_is_categorical and val2_is_categorical: if not all(val1.cat.categories == val2.cat.categories): return val1.where(pd.isnull, np.nan) elif val1_is_categorical or val2_is_categorical: # This can happen because CFM does not set proper dtypes for intermediate # features, so some agg features that should be Ordinal don't yet have correct type. return val1.where(pd.isnull, np.nan) return val1 < val2 return less_than def generate_name(self, base_feature_names): return "%s < %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/less_than_equal_to.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, Datetime, Ordinal from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class LessThanEqualTo(TransformPrimitive): """Determines if values in one list are less than or equal to another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is less than or equal to each corresponding value in Y. Equal pairs will return `True`. 
Examples: >>> less_than_equal_to = LessThanEqualTo() >>> less_than_equal_to([2, 1, 2], [1, 2, 2]).tolist() [False, True, True] """ name = "less_than_equal_to" input_types = [ [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ], [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)], [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)], ] return_type = ColumnSchema(logical_type=BooleanNullable) description_template = "whether {} is less than or equal to {}" def get_function(self): def less_than_equal(val1, val2): val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype) val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype) if val1_is_categorical and val2_is_categorical: if not all(val1.cat.categories == val2.cat.categories): return val1.where(pd.isnull, np.nan) elif val1_is_categorical or val2_is_categorical: # This can happen because CFM does not set proper dtypes for intermediate # features, so some agg features that should be Ordinal don't yet have correct type. return val1.where(pd.isnull, np.nan) return val1 <= val2 return less_than_equal def generate_name(self, base_feature_names): return "%s <= %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/less_than_equal_to_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class LessThanEqualToScalar(TransformPrimitive): """Determines if values are less than or equal to a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is less than or equal to the scalar. If a value is equal to the scalar, return `True`. 
Examples: >>> less_than_equal_to_scalar = LessThanEqualToScalar(value=2) >>> less_than_equal_to_scalar([3, 1, 2]).tolist() [False, True, True] """ name = "less_than_equal_to_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=0): self.value = value self.description_template = "whether {{}} is less than or equal to {}".format( self.value, ) def get_function(self): def less_than_equal_to_scalar(vals): return vals <= self.value return less_than_equal_to_scalar def generate_name(self, base_feature_names): return "%s <= %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/less_than_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class LessThanScalar(TransformPrimitive): """Determines if values are less than a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is less than the scalar. If a value is equal to the scalar, return `False`. 
Examples: >>> less_than_scalar = LessThanScalar(value=2) >>> less_than_scalar([3, 1, 2]).tolist() [False, True, False] """ name = "less_than_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=0): self.value = value self.description_template = "whether {{}} is less than {}".format(self.value) def get_function(self): def less_than_scalar(vals): return vals < self.value return less_than_scalar def generate_name(self, base_feature_names): return "%s < %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/modulo_by_feature.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class ModuloByFeature(TransformPrimitive): """Computes the modulo of a scalar by each element in a list. Description: Given a list of numeric values and a scalar, return the modulo, or remainder of the scalar after being divided by each value. 
Examples: >>> modulo_by_feature = ModuloByFeature(value=2) >>> modulo_by_feature([4, 1, 2]).tolist() [2, 0, 0] """ name = "modulo_by_feature" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=1): self.value = value self.description_template = "the remainder after dividing {} by {{}}".format( self.value, ) def get_function(self): def modulo_by_feature(vals): return self.value % vals return modulo_by_feature def generate_name(self, base_feature_names): return "%s %% %s" % (str(self.value), base_feature_names[0]) ================================================ FILE: featuretools/primitives/standard/transform/binary/modulo_numeric.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class ModuloNumeric(TransformPrimitive): """Performs element-wise modulo of two lists. Description: Given a list of values X and a list of values Y, determine the modulo, or remainder of each value in X after it's divided by its corresponding value in Y. 
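For negative operands, "remainder" depends on a sign convention. The sketch below uses Python's `%` operator on plain lists; `np.mod` follows the same convention, where the result takes the sign of the divisor. `modulo_numeric_sketch` is a hypothetical helper used only for illustration.

```python
def modulo_numeric_sketch(xs, ys):
    # Hypothetical helper: element-wise remainder using Python's % operator.
    return [x % y for x, y in zip(xs, ys)]


print(modulo_numeric_sketch([2, 1, 5], [1, 2, 2]))  # [0, 1, 1]
print(modulo_numeric_sketch([-7, 7], [3, -3]))      # [2, -2]
```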
Examples: >>> modulo_numeric = ModuloNumeric() >>> modulo_numeric([2, 1, 5], [1, 2, 2]).tolist() [0, 1, 1] """ name = "modulo_numeric" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the remainder after dividing {} by {}" def get_function(self): return np.mod def generate_name(self, base_feature_names): return "%s %% %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/modulo_numeric_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class ModuloNumericScalar(TransformPrimitive): """Computes the modulo of each element in the list by a given scalar. Description: Given a list of numeric values and a scalar, return the modulo, or remainder of each value after being divided by the scalar. 
Examples: >>> modulo_numeric_scalar = ModuloNumericScalar(value=2) >>> modulo_numeric_scalar([3, 1, 2]).tolist() [1, 1, 0] """ name = "modulo_numeric_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=1): self.value = value self.description_template = "the remainder after dividing {{}} by {}".format( self.value, ) def get_function(self): def modulo_scalar(vals): return vals % self.value return modulo_scalar def generate_name(self, base_feature_names): return "%s %% %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/multiply_boolean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class MultiplyBoolean(TransformPrimitive): """Performs element-wise multiplication of two lists of boolean values. Description: Given a list of boolean values X and a list of boolean values Y, determine the product of each value in X with its corresponding value in Y. 
Examples: >>> multiply_boolean = MultiplyBoolean() >>> multiply_boolean([True, True, False], [True, False, True]).tolist() [True, False, False] """ name = "multiply_boolean" input_types = [ [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=BooleanNullable), ], [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)], [ ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=BooleanNullable), ], [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=Boolean), ], ] return_type = ColumnSchema(logical_type=BooleanNullable) commutative = True description_template = "the product of {} and {}" def get_function(self): return np.bitwise_and def generate_name(self, base_feature_names): return "%s * %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/multiply_numeric.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class MultiplyNumeric(TransformPrimitive): """Performs element-wise multiplication of two lists. Description: Given a list of values X and a list of values Y, determine the product of each value in X with its corresponding value in Y. 
Examples: >>> multiply_numeric = MultiplyNumeric() >>> multiply_numeric([2, 1, 2], [1, 2, 2]).tolist() [2, 2, 4] """ name = "multiply_numeric" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) commutative = True description_template = "the product of {} and {}" def get_function(self): return np.multiply def generate_name(self, base_feature_names): return "%s * %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/multiply_numeric_boolean.py ================================================ import pandas.api.types as pdtypes from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class MultiplyNumericBoolean(TransformPrimitive): """Performs element-wise multiplication of a numeric list with a boolean list. Description: Given a list of numeric values X and a list of boolean values Y, return the values in X where the corresponding value in Y is True. 
Examples: >>> import pandas as pd >>> multiply_numeric_boolean = MultiplyNumericBoolean() >>> multiply_numeric_boolean([2, 1, 2], [True, True, False]).tolist() [2, 1, 0] >>> multiply_numeric_boolean([2, None, None], [True, True, False]).astype("float64").tolist() [2.0, nan, nan] >>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype="boolean")).tolist() [2, 1, <NA>] """ name = "multiply_numeric_boolean" input_types = [ [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=Boolean), ], [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=BooleanNullable), ], [ ColumnSchema(logical_type=Boolean), ColumnSchema(semantic_tags={"numeric"}), ], [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(semantic_tags={"numeric"}), ], ] return_type = ColumnSchema(semantic_tags={"numeric"}) commutative = True description_template = "the product of {} and {}" def get_function(self): def multiply_numeric_boolean(ser1, ser2): if pdtypes.is_bool_dtype(ser1): bools = ser1 vals = ser2 else: bools = ser2 vals = ser1 result = vals * bools.astype("Int64") return result return multiply_numeric_boolean def generate_name(self, base_feature_names): return "%s * %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/multiply_numeric_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class MultiplyNumericScalar(TransformPrimitive): """Multiplies each element in the list by a scalar. Description: Given a list of numeric values and a scalar, multiply each value in the list by the scalar. 
Examples: >>> multiply_numeric_scalar = MultiplyNumericScalar(value=2) >>> multiply_numeric_scalar([3, 1, 2]).tolist() [6, 2, 4] """ name = "multiply_numeric_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=1): self.value = value self.description_template = "the product of {{}} and {}".format(self.value) def get_function(self): def multiply_scalar(vals): return vals * self.value return multiply_scalar def generate_name(self, base_feature_names): return "%s * %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/not_equal.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class NotEqual(TransformPrimitive): """Determines if values in one list are not equal to another list. Description: Given a list of values X and a list of values Y, determine whether each value in X is not equal to each corresponding value in Y. 
Examples: >>> not_equal = NotEqual() >>> not_equal([2, 1, 2], [1, 2, 2]).tolist() [True, True, False] """ name = "not_equal" input_types = [ColumnSchema(), ColumnSchema()] return_type = ColumnSchema(logical_type=BooleanNullable) commutative = True description_template = "whether {} does not equal {}" def get_function(self): def not_equal(x_vals, y_vals): if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance( y_vals.dtype, pd.CategoricalDtype, ): categories = set(x_vals.cat.categories).union( set(y_vals.cat.categories), ) x_vals = x_vals.cat.add_categories( categories.difference(set(x_vals.cat.categories)), ) y_vals = y_vals.cat.add_categories( categories.difference(set(y_vals.cat.categories)), ) return x_vals.ne(y_vals) return not_equal def generate_name(self, base_feature_names): return "%s != %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/not_equal_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class NotEqualScalar(TransformPrimitive): """Determines if values in a list are not equal to a given scalar. Description: Given a list of values and a constant scalar, determine whether each of the values is not equal to the scalar. 
Examples: >>> not_equal_scalar = NotEqualScalar(value=2) >>> not_equal_scalar([3, 1, 2]).tolist() [True, True, False] """ name = "not_equal_scalar" input_types = [ColumnSchema()] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, value=None): self.value = value self.description_template = "whether {{}} does not equal {}".format(self.value) def get_function(self): def not_equal_scalar(vals): return vals != self.value return not_equal_scalar def generate_name(self, base_feature_names): return "%s != %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/binary/or_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class Or(TransformPrimitive): """Performs element-wise logical OR of two lists. Description: Given a list of booleans X and a list of booleans Y, determine whether each value in X is `True`, or whether its corresponding value in Y is `True`. 
Examples: >>> _or = Or() >>> _or([False, True, False], [True, True, False]).tolist() [True, True, False] """ name = "or" input_types = [ [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=BooleanNullable), ], [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)], [ ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=BooleanNullable), ], [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=Boolean), ], ] return_type = ColumnSchema(logical_type=BooleanNullable) commutative = True description_template = "whether {} is true or {} is true" def get_function(self): return np.logical_or def generate_name(self, base_feature_names): return "OR(%s, %s)" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/scalar_subtract_numeric_feature.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class ScalarSubtractNumericFeature(TransformPrimitive): """Subtracts each value in the list from a given scalar. Description: Given a list of numeric values and a scalar, subtract each value from the scalar and return the list of differences. 
Examples: >>> scalar_subtract_numeric_feature = ScalarSubtractNumericFeature(value=2) >>> scalar_subtract_numeric_feature([3, 1, 2]).tolist() [-1, 1, 0] """ name = "scalar_subtract_numeric_feature" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=0): self.value = value self.description_template = "the result {} minus {{}}".format(self.value) def get_function(self): def scalar_subtract_numeric_feature(vals): return self.value - vals return scalar_subtract_numeric_feature def generate_name(self, base_feature_names): return "%s - %s" % (str(self.value), base_feature_names[0]) ================================================ FILE: featuretools/primitives/standard/transform/binary/subtract_numeric.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class SubtractNumeric(TransformPrimitive): """Performs element-wise subtraction of two lists. Description: Given a list of values X and a list of values Y, determine the difference of each value in X from its corresponding value in Y. Args: commutative (bool): determines if Deep Feature Synthesis should generate both x - y and y - x, or just one. If True, there is no guarantee which of the two will be generated. Defaults to True. Notes: commutative is True by default since False would result in 2 perfectly correlated series. 
Examples: >>> subtract_numeric = SubtractNumeric() >>> subtract_numeric([2, 1, 2], [1, 2, 2]).tolist() [1, -1, 0] """ name = "subtract_numeric" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the result of {} minus {}" commutative = True def __init__(self, commutative=True): self.commutative = commutative def get_function(self): return np.subtract def generate_name(self, base_feature_names): return "%s - %s" % (base_feature_names[0], base_feature_names[1]) ================================================ FILE: featuretools/primitives/standard/transform/binary/subtract_numeric_scalar.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base.transform_primitive_base import TransformPrimitive class SubtractNumericScalar(TransformPrimitive): """Subtracts a scalar from each element in the list. Description: Given a list of numeric values and a scalar, subtract the given scalar from each value in the list. 
Examples: >>> subtract_numeric_scalar = SubtractNumericScalar(value=2) >>> subtract_numeric_scalar([3, 1, 2]).tolist() [1, -1, 0] """ name = "subtract_numeric_scalar" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, value=0): self.value = value self.description_template = "the result of {{}} minus {}".format(self.value) def get_function(self): def subtract_scalar(vals): return vals - self.value return subtract_scalar def generate_name(self, base_feature_names): return "%s - %s" % (base_feature_names[0], str(self.value)) ================================================ FILE: featuretools/primitives/standard/transform/cumulative/__init__.py ================================================ from featuretools.primitives.standard.transform.cumulative.cum_count import CumCount from featuretools.primitives.standard.transform.cumulative.cum_max import CumMax from featuretools.primitives.standard.transform.cumulative.cum_mean import CumMean from featuretools.primitives.standard.transform.cumulative.cum_min import CumMin from featuretools.primitives.standard.transform.cumulative.cum_sum import CumSum from featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_false import ( CumulativeTimeSinceLastFalse, ) from featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_true import ( CumulativeTimeSinceLastTrue, ) ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cum_count.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable from featuretools.primitives.base import TransformPrimitive class CumCount(TransformPrimitive): """Calculates the cumulative count. Description: Given a list of values, return the cumulative count (or running count). 
There is no set window, so the count at each point is calculated over all prior values. `NaN` values are counted. Examples: >>> cum_count = CumCount() >>> cum_count([1, 2, 3, 4, None, 5]).tolist() [1, 2, 3, 4, 5, 6] """ name = "cum_count" input_types = [ [ColumnSchema(semantic_tags={"foreign_key"})], [ColumnSchema(semantic_tags={"category"})], ] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the cumulative count of {}" def get_function(self): def cum_count(values): return np.arange(1, len(values) + 1) return cum_count ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cum_max.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class CumMax(TransformPrimitive): """Calculates the cumulative maximum. Description: Given a list of values, return the cumulative max (or running max). There is no set window, so the max at each point is calculated over all prior values. `NaN` values will return `NaN`, but in the window of a cumulative calculation, they're ignored. Examples: >>> cum_max = CumMax() >>> cum_max([1, 2, 3, 4, None, 5]).tolist() [1.0, 2.0, 3.0, 4.0, nan, 5.0] """ name = "cum_max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the cumulative maximum of {}" def get_function(self): def cum_max(values): return values.cummax() return cum_max ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cum_mean.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class CumMean(TransformPrimitive): """Calculates the cumulative mean. 
Description: Given a list of values, return the cumulative mean (or running mean). There is no set window, so the mean at each point is calculated over all prior values. `NaN` values will return `NaN`, but in the window of a cumulative calculation, they're treated as 0. Examples: >>> cum_mean = CumMean() >>> cum_mean([1, 2, 3, 4, None, 5]).tolist() [1.0, 1.5, 2.0, 2.5, nan, 2.5] """ name = "cum_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the cumulative mean of {}" def get_function(self): def cum_mean(values): return values.cumsum() / np.arange(1, len(values) + 1) return cum_mean ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cum_min.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class CumMin(TransformPrimitive): """Calculates the cumulative minimum. Description: Given a list of values, return the cumulative min (or running min). There is no set window, so the min at each point is calculated over all prior values. `NaN` values will return `NaN`, but in the window of a cumulative calculation, they're ignored. 
Examples: >>> cum_min = CumMin() >>> cum_min([1, 2, -3, 4, None, 5]).tolist() [1.0, 1.0, -3.0, -3.0, nan, -3.0] """ name = "cum_min" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the cumulative minimum of {}" def get_function(self): def cum_min(values): return values.cummin() return cum_min ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cum_sum.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class CumSum(TransformPrimitive): """Calculates the cumulative sum. Description: Given a list of values, return the cumulative sum (or running total). There is no set window, so the sum at each point is calculated over all prior values. `NaN` values will return `NaN`, but in the window of a cumulative calculation, they're ignored. Examples: >>> cum_sum = CumSum() >>> cum_sum([1, 2, 3, 4, None, 5]).tolist() [1.0, 3.0, 6.0, 10.0, nan, 15.0] """ name = "cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the cumulative sum of {}" def get_function(self): def cum_sum(values): return values.cumsum() return cum_sum ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_false.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, Datetime, Double from featuretools.primitives.base import TransformPrimitive class CumulativeTimeSinceLastFalse(TransformPrimitive): """Determines the time since last `False` value. 
Description: Given a list of booleans and a list of corresponding datetimes, determine the time at each point since the last `False` value. Returns time difference in seconds. `NaN` values are ignored. Examples: >>> from datetime import datetime >>> cumulative_time_since_last_false = CumulativeTimeSinceLastFalse() >>> booleans = [False, True, False, True] >>> datetimes = [ ... datetime(2011, 4, 9, 10, 30, 0), ... datetime(2011, 4, 9, 10, 30, 10), ... datetime(2011, 4, 9, 10, 30, 15), ... datetime(2011, 4, 9, 10, 30, 29) ... ] >>> cumulative_time_since_last_false(datetimes, booleans).tolist() [0.0, 10.0, 0.0, 14.0] """ name = "cumulative_time_since_last_false" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=Boolean), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) def get_function(self): def time_since_previous_false(datetime_col, bool_col): if bool_col.dropna().empty: return pd.Series([np.nan] * len(bool_col)) df = pd.DataFrame( { "datetime": datetime_col, "last_false_datetime": datetime_col, "bool": bool_col, }, ) not_false_indices = df["bool"] df.loc[not_false_indices, "last_false_datetime"] = np.nan df["last_false_datetime"] = df["last_false_datetime"].ffill() total_seconds = ( pd.to_datetime(df["datetime"]).subtract(df["last_false_datetime"]) ).dt.total_seconds() return pd.Series(total_seconds) return time_since_previous_false ================================================ FILE: featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_true.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, Datetime, Double from featuretools.primitives.base import TransformPrimitive class CumulativeTimeSinceLastTrue(TransformPrimitive): """Determines the time (in seconds) since the last boolean was `True` given a 
datetime index column and boolean column. Examples: >>> from datetime import datetime >>> cumulative_time_since_last_true = CumulativeTimeSinceLastTrue() >>> booleans = [False, True, False, True] >>> datetimes = [ ... datetime(2011, 4, 9, 10, 30, 0), ... datetime(2011, 4, 9, 10, 30, 10), ... datetime(2011, 4, 9, 10, 30, 15), ... datetime(2011, 4, 9, 10, 30, 30) ... ] >>> cumulative_time_since_last_true(datetimes, booleans).tolist() [nan, 0.0, 5.0, 0.0] """ name = "cumulative_time_since_last_true" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(logical_type=Boolean), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) def get_function(self): def time_since_previous_true(datetime_col, bool_col): if bool_col.dropna().empty: return pd.Series([np.nan] * len(bool_col)) df = pd.DataFrame( { "datetime": datetime_col, "last_true_datetime": datetime_col, "bool": bool_col, }, ) true_indices = df["bool"] df.loc[~true_indices, "last_true_datetime"] = np.nan df["last_true_datetime"] = df["last_true_datetime"].ffill() total_seconds = ( pd.to_datetime(df["datetime"]).subtract(df["last_true_datetime"]) ).dt.total_seconds() return pd.Series(total_seconds) return time_since_previous_true ================================================ FILE: featuretools/primitives/standard/transform/datetime/__init__.py ================================================ from featuretools.primitives.standard.transform.datetime.age import Age from featuretools.primitives.standard.transform.datetime.date_to_holiday import ( DateToHoliday, ) from featuretools.primitives.standard.transform.datetime.date_to_timezone import ( DateToTimeZone, ) from featuretools.primitives.standard.transform.datetime.day import Day from featuretools.primitives.standard.transform.datetime.day_of_year import DayOfYear from featuretools.primitives.standard.transform.datetime.days_in_month import ( DaysInMonth, ) from 
featuretools.primitives.standard.transform.datetime.diff_datetime import ( DiffDatetime, ) from featuretools.primitives.standard.transform.datetime.distance_to_holiday import ( DistanceToHoliday, ) from featuretools.primitives.standard.transform.datetime.hour import Hour from featuretools.primitives.standard.transform.datetime.is_first_week_of_month import ( IsFirstWeekOfMonth, ) from featuretools.primitives.standard.transform.datetime.is_federal_holiday import ( IsFederalHoliday, ) from featuretools.primitives.standard.transform.datetime.is_leap_year import IsLeapYear from featuretools.primitives.standard.transform.datetime.is_lunch_time import ( IsLunchTime, ) from featuretools.primitives.standard.transform.datetime.is_month_end import IsMonthEnd from featuretools.primitives.standard.transform.datetime.is_month_start import ( IsMonthStart, ) from featuretools.primitives.standard.transform.datetime.is_quarter_end import ( IsQuarterEnd, ) from featuretools.primitives.standard.transform.datetime.is_quarter_start import ( IsQuarterStart, ) from featuretools.primitives.standard.transform.datetime.is_weekend import IsWeekend from featuretools.primitives.standard.transform.datetime.is_working_hours import ( IsWorkingHours, ) from featuretools.primitives.standard.transform.datetime.is_year_end import IsYearEnd from featuretools.primitives.standard.transform.datetime.is_year_start import ( IsYearStart, ) from featuretools.primitives.standard.transform.datetime.minute import Minute from featuretools.primitives.standard.transform.datetime.month import Month from featuretools.primitives.standard.transform.datetime.part_of_day import PartOfDay from featuretools.primitives.standard.transform.datetime.quarter import Quarter from featuretools.primitives.standard.transform.datetime.season import Season from featuretools.primitives.standard.transform.datetime.second import Second from featuretools.primitives.standard.transform.datetime.time_since import TimeSince from 
featuretools.primitives.standard.transform.datetime.time_since_previous import ( TimeSincePrevious, ) from featuretools.primitives.standard.transform.datetime.week import Week from featuretools.primitives.standard.transform.datetime.weekday import Weekday from featuretools.primitives.standard.transform.datetime.year import Year ================================================ FILE: featuretools/primitives/standard/transform/datetime/age.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import AgeFractional, Datetime from featuretools.primitives.base import TransformPrimitive class Age(TransformPrimitive): """Calculates the age in years as a floating point number given a date of birth. Description: Age in years is computed by calculating the number of days between the date of birth and the reference time and dividing the result by 365. Examples: Determine the age of three people as of Jan 1, 2019 >>> import pandas as pd >>> reference_date = pd.to_datetime("01-01-2019") >>> age = Age() >>> input_ages = [pd.to_datetime("01-01-2000"), ... pd.to_datetime("05-30-1983"), ... 
pd.to_datetime("10-17-1997")] >>> age(input_ages, time=reference_date).tolist() [19.013698630136986, 35.61643835616438, 21.221917808219178] """ name = "age" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"date_of_birth"})] return_type = ColumnSchema(logical_type=AgeFractional, semantic_tags={"numeric"}) uses_calc_time = True description_template = "the age from {}" def get_function(self): def age(x, time=None): return (time - x).dt.days / 365 return age ================================================ FILE: featuretools/primitives/standard/transform/datetime/date_to_holiday.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, Datetime from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil class DateToHoliday(TransformPrimitive): """Transforms time of an instance into the holiday name, if there is one. Description: If there is no holiday, it returns `NaN`. Currently only works for the United States and Canada with dates between 1950 and 2100. Args: country (str): Country to use for determining Holidays. Default is 'US'. Should be one of the available countries here: https://github.com/dr-prodigy/python-holidays#available-countries Examples: >>> from datetime import datetime >>> date_to_holiday = DateToHoliday() >>> dates = pd.Series([datetime(2016, 1, 1), ... datetime(2016, 2, 27), ... datetime(2017, 5, 29, 10, 30, 5), ... datetime(2018, 7, 4)]) >>> date_to_holiday(dates).tolist() ["New Year's Day", nan, 'Memorial Day', 'Independence Day'] We can also change the country. >>> date_to_holiday_canada = DateToHoliday(country='Canada') >>> dates = pd.Series([datetime(2016, 7, 1), ... datetime(2016, 11, 15), ... 
datetime(2018, 12, 25)]) >>> date_to_holiday_canada(dates).tolist() ['Canada Day', nan, 'Christmas Day'] """ name = "date_to_holiday" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def __init__(self, country="US"): self.country = country self.holidayUtil = HolidayUtil(country) def get_function(self): def date_to_holiday(x): holiday_df = self.holidayUtil.to_df() df = pd.DataFrame({"date": x}) df["date"] = df["date"].dt.date.astype("datetime64[ns]") df = df.merge( holiday_df, how="left", left_on="date", right_on="holiday_date", ) return df.names.values return date_to_holiday ================================================ FILE: featuretools/primitives/standard/transform/datetime/date_to_timezone.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, Datetime from featuretools.primitives.base import TransformPrimitive class DateToTimeZone(TransformPrimitive): """Determines the timezone of a datetime. Description: Given a list of datetimes, extract the timezone from each one. Looks for the `tzinfo` attribute on `datetime.datetime` objects. If the datetime has no timezone or the date is missing, return `NaN`. Examples: >>> from datetime import datetime >>> from pytz import timezone >>> date_to_time_zone = DateToTimeZone() >>> dates = [datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")), ... datetime(2010, 1, 1, tzinfo=timezone("America/New_York")), ... datetime(2010, 1, 1, tzinfo=timezone("America/Chicago")), ... 
datetime(2010, 1, 1)] >>> date_to_time_zone(dates).tolist() ['America/Los_Angeles', 'America/New_York', 'America/Chicago', nan] """ name = "date_to_time_zone" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def date_to_time_zone(x): return x.apply(lambda x: x.tzinfo.zone if x.tzinfo else np.nan) return date_to_time_zone ================================================ FILE: featuretools/primitives/standard/transform/datetime/day.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Ordinal from featuretools.primitives.base import TransformPrimitive class Day(TransformPrimitive): """Determines the day of the month from a datetime. Examples: >>> from datetime import datetime >>> dates = [datetime(2019, 3, 1), ... datetime(2019, 3, 3), ... datetime(2019, 3, 31)] >>> day = Day() >>> day(dates).tolist() [1, 3, 31] """ name = "day" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema( logical_type=Ordinal(order=list(range(1, 32))), semantic_tags={"category"}, ) description_template = "the day of the month of {}" def get_function(self): def day(vals): return vals.dt.day return day ================================================ FILE: featuretools/primitives/standard/transform/datetime/day_of_year.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Ordinal from featuretools.primitives.base import TransformPrimitive class DayOfYear(TransformPrimitive): """Determines the ordinal day of the year from the given datetime. Description: For a list of dates, return the ordinal day of the year from the given datetime. Examples: >>> from datetime import datetime >>> dates = [datetime(2019, 1, 1), ... datetime(2020, 12, 31), ... 
datetime(2020, 2, 28)] >>> dayOfYear = DayOfYear() >>> dayOfYear(dates).tolist() [1, 366, 59] """ name = "day_of_year" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema( logical_type=Ordinal(order=list(range(1, 367))), semantic_tags={"category"}, ) description_template = "the day of year from {}" def get_function(self): def dayOfYear(vals): return vals.dt.dayofyear return dayOfYear ================================================ FILE: featuretools/primitives/standard/transform/datetime/days_in_month.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Ordinal from featuretools.primitives.base import TransformPrimitive class DaysInMonth(TransformPrimitive): """Determines the number of days in the month of given datetime. Examples: >>> from datetime import datetime >>> dates = [datetime(2019, 12, 1), ... datetime(2019, 1, 3), ... datetime(2020, 2, 1)] >>> days_in_month = DaysInMonth() >>> days_in_month(dates).tolist() [31, 31, 29] """ name = "days_in_month" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema( logical_type=Ordinal(order=list(range(1, 32))), semantic_tags={"category"}, ) description_template = "the days in the month of {}" def get_function(self): def days_in_month(vals): return vals.dt.daysinmonth return days_in_month ================================================ FILE: featuretools/primitives/standard/transform/datetime/diff_datetime.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Timedelta from featuretools.primitives.standard.transform.numeric.diff import Diff class DiffDatetime(Diff): """Computes the timedelta between a datetime in a list and the previous datetime in that list. Args: periods (int): The number of periods by which to shift the index row. Default is 0. Periods correspond to rows. 

    Description:
        Given a list of datetimes, compute the difference from the previous
        item in the list. The result for the first element of the list will
        always be `NaT`.

    Examples:
        >>> from datetime import datetime
        >>> dt_values = [datetime(2019, 3, 1), datetime(2019, 6, 30), datetime(2019, 11, 17), datetime(2020, 1, 30), datetime(2020, 3, 11)]
        >>> diff_dt = DiffDatetime()
        >>> diff_dt(dt_values).tolist()
        [NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00'), Timedelta('41 days 00:00:00')]

        You can specify the number of periods to shift the values

        >>> diff_dt_periods = DiffDatetime(periods=1)
        >>> diff_dt_periods(dt_values).tolist()
        [NaT, NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00')]
    """

    name = "diff_datetime"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Timedelta)
    uses_full_dataframe = True
    description_template = "the difference from the previous value of {}"

    def __init__(self, periods=0):
        super().__init__(periods)


================================================
FILE: featuretools/primitives/standard/transform/datetime/distance_to_holiday.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil


class DistanceToHoliday(TransformPrimitive):
    """Computes the number of days before or after a given holiday.

    Description:
        For a list of dates, return the distance from the nearest occurrence
        of a chosen holiday. The distance is returned in days. If the closest
        occurrence is prior to the date given, return a negative number.
        If a date is missing, return `NaN`. Currently only works with dates
        from 1950 through 2074, the years that `HolidayUtil` generates
        holidays for.

    Args:
        holiday (str): Name of the holiday. Defaults to New Year's Day.
        country (str): Specifies which country's calendar to use for the
            given holiday. Default is `US`.

    Examples:
        >>> from datetime import datetime
        >>> distance_to_holiday = DistanceToHoliday("New Year's Day")
        >>> dates = [datetime(2010, 1, 1),
        ...          datetime(2012, 5, 31),
        ...          datetime(2017, 7, 31),
        ...          datetime(2020, 12, 31)]
        >>> distance_to_holiday(dates).tolist()
        [0, -151, 154, 1]

        We can also control the country in which we're searching for
        a holiday.

        >>> distance_to_holiday = DistanceToHoliday("Canada Day", country='Canada')
        >>> dates = [datetime(2010, 1, 1),
        ...          datetime(2012, 5, 31),
        ...          datetime(2017, 7, 31),
        ...          datetime(2020, 12, 31)]
        >>> distance_to_holiday(dates).tolist()
        [181, 31, -30, 182]
    """

    name = "distance_to_holiday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    default_value = 0

    def __init__(self, holiday="New Year's Day", country="US"):
        self.country = country
        self.holiday = holiday
        self.holidayUtil = HolidayUtil(country)

        available_holidays = list(set(self.holidayUtil.federal_holidays.values()))
        if self.holiday not in available_holidays:
            error = "must be one of the available holidays:\n%s" % available_holidays
            raise ValueError(error)

    def get_function(self):
        def distance_to_holiday(x):
            holiday_df = self.holidayUtil.to_df()
            holiday_df = holiday_df[holiday_df.names == self.holiday]

            df = pd.DataFrame({"date": x})
            df["x_index"] = df.index  # store original index as a column
            df = df.dropna()
            df = df.sort_values("date")
            df["date"] = df["date"].dt.date.astype("datetime64[ns]")

            matches = pd.merge_asof(
                df,
                holiday_df,
                left_on="date",
                right_on="holiday_date",
                direction="nearest",
                tolerance=pd.Timedelta("365d"),
            )

            matches = matches.set_index("x_index")
            matches["days_diff"] = (matches.holiday_date - matches.date).dt.days

            return matches.days_diff.reindex_like(x)

        return distance_to_holiday


================================================
FILE: featuretools/primitives/standard/transform/datetime/hour.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Hour(TransformPrimitive):
    """Determines the hour value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...          datetime(2019, 3, 31, 19, 45, 15)]
        >>> hour = Hour()
        >>> hour(dates).tolist()
        [0, 11, 19]
    """

    name = "hour"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(24))),
        semantic_tags={"category"},
    )

    description_template = "the hour value of {}"

    def get_function(self):
        def hour(vals):
            return vals.dt.hour

        return hour


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_federal_holiday.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil


class IsFederalHoliday(TransformPrimitive):
    """Determines if a given datetime is a federal holiday.

    Description:
        This primitive currently only works for the United States
        and Canada with dates from 1950 through 2074, the years that
        `HolidayUtil` generates holidays for.

    Args:
        country (str): Country to use for determining Holidays.
            Default is 'US'. Should be one of the available countries here:
            https://github.com/dr-prodigy/python-holidays#available-countries

    Examples:
        >>> from datetime import datetime
        >>> is_federal_holiday = IsFederalHoliday(country="US")
        >>> is_federal_holiday([
        ...     datetime(2019, 7, 4, 10, 0, 30),
        ...
        datetime(2019, 2, 26)]).tolist()
        [True, False]
    """

    name = "is_federal_holiday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, country="US"):
        self.country = country
        self.holidayUtil = HolidayUtil(country)

    def get_function(self):
        def is_federal_holiday(x):
            holidays_df = self.holidayUtil.to_df()
            is_holiday = x.dt.normalize().isin(holidays_df.holiday_date)
            if x.isnull().values.any():
                is_holiday = is_holiday.astype("object")
                is_holiday[x.isnull()] = np.nan
            return is_holiday.values

        return is_federal_holiday


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_first_week_of_month.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsFirstWeekOfMonth(TransformPrimitive):
    """Determines if a date falls in the first week of the month.

    Description:
        Converts a datetime to a boolean indicating if the date
        falls in the first week of the month. The first week of
        the month starts on day 1, and the week number is incremented
        each Sunday.

    Examples:
        >>> from datetime import datetime
        >>> is_first_week_of_month = IsFirstWeekOfMonth()
        >>> times = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3),
        ...          datetime(2019, 3, 31),
        ...
        datetime(2019, 3, 30)]
        >>> is_first_week_of_month(times).tolist()
        [True, False, False, False]
    """

    name = "is_first_week_of_month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def get_function(self):
        def is_first_week_of_month(x):
            df = pd.DataFrame({"date": x})
            df["first_day"] = df.date - pd.to_timedelta(df["date"].dt.day - 1, unit="d")
            df["dom"] = df.date.dt.day
            df["first_day_weekday"] = df.first_day.dt.weekday
            df["adjusted_dom"] = df.dom + df.first_day_weekday + 1
            df.loc[df["first_day_weekday"].astype(float) == 6.0, "adjusted_dom"] = df[
                "dom"
            ]
            df["is_first_week"] = np.ceil(df.adjusted_dom / 7.0) == 1.0
            if df["date"].isnull().values.any():
                df["is_first_week"] = df["is_first_week"].astype("object")
                df.loc[df["date"].isnull(), "is_first_week"] = np.nan
            return df.is_first_week.values

        return is_first_week_of_month


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_leap_year.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsLeapYear(TransformPrimitive):
    """Determines the is_leap_year attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2020, 3, 3, 11, 10, 50),
        ...
        datetime(2021, 3, 31, 19, 45, 15)]
        >>> ily = IsLeapYear()
        >>> ily(dates).tolist()
        [False, True, False]
    """

    name = "is_leap_year"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether the year of {} is a leap year"

    def get_function(self):
        def is_leap_year(vals):
            return vals.dt.is_leap_year

        return is_leap_year


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_lunch_time.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsLunchTime(TransformPrimitive):
    """Determines if a datetime falls during configurable lunch hour, on a 24-hour clock.

    Args:
        lunch_hour (int): Hour when lunch is taken. Must adhere to 24-hour clock.
            Defaults to 12.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2022, 6, 21, 12, 3, 3),
        ...          datetime(2019, 1, 3, 4, 4, 4),
        ...          datetime(2022, 1, 1, 11, 1, 2),
        ...
        np.nan]
        >>> is_lunch_time = IsLunchTime()
        >>> is_lunch_time(dates).tolist()
        [True, False, False, False]
        >>> is_lunch_time = IsLunchTime(11)
        >>> is_lunch_time(dates).tolist()
        [False, False, True, False]
    """

    name = "is_lunch_time"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} falls during lunch time"

    def __init__(self, lunch_hour=12):
        self.lunch_hour = lunch_hour

    def get_function(self):
        def is_lunch_time(vals):
            return vals.dt.hour == self.lunch_hour

        return is_lunch_time


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_month_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsMonthEnd(TransformPrimitive):
    """Determines the is_month_end attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2021, 2, 28),
        ...          datetime(2020, 2, 29)]
        >>> ime = IsMonthEnd()
        >>> ime(dates).tolist()
        [False, True, True]
    """

    name = "is_month_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is at the end of a month"

    def get_function(self):
        def is_month_end(vals):
            return vals.dt.is_month_end

        return is_month_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_month_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsMonthStart(TransformPrimitive):
    """Determines the is_month_start attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2020, 2, 13),
        ...          datetime(2020, 2, 29)]
        >>> ims = IsMonthStart()
        >>> ims(dates).tolist()
        [True, False, False]
    """

    name = "is_month_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is at the start of a month"

    def get_function(self):
        def is_month_start(vals):
            return vals.dt.is_month_start

        return is_month_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_quarter_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsQuarterEnd(TransformPrimitive):
    """Determines the is_quarter_end attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> iqe = IsQuarterEnd()
        >>> dates = [datetime(2020, 3, 31),
        ...          datetime(2020, 1, 1)]
        >>> iqe(dates).tolist()
        [True, False]
    """

    name = "is_quarter_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is a quarter end"

    def get_function(self):
        def is_quarter_end(vals):
            return vals.dt.is_quarter_end

        return is_quarter_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_quarter_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsQuarterStart(TransformPrimitive):
    """Determines the is_quarter_start attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> iqs = IsQuarterStart()
        >>> dates = [datetime(2020, 3, 31),
        ...
        datetime(2020, 1, 1)]
        >>> iqs(dates).tolist()
        [False, True]
    """

    name = "is_quarter_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is a quarter start"

    def get_function(self):
        def is_quarter_start(vals):
            return vals.dt.is_quarter_start

        return is_quarter_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_weekend.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsWeekend(TransformPrimitive):
    """Determines if a date falls on a weekend.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> is_weekend = IsWeekend()
        >>> is_weekend(dates).tolist()
        [False, False, True]
    """

    name = "is_weekend"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on a weekend"

    def get_function(self):
        def is_weekend(vals):
            return vals.dt.weekday > 4

        return is_weekend


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_working_hours.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsWorkingHours(TransformPrimitive):
    """Determines if a datetime falls during working hours on a 24-hour clock. Can configure start_hour and end_hour.

    Args:
        start_hour (int): Start hour of workday. Must adhere to 24-hour clock. Default is 8 (8am).
        end_hour (int): End hour of workday. Must adhere to 24-hour clock. Default is 18 (6pm).

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2022, 6, 21, 16, 3, 3),
        ...          datetime(2019, 1, 3, 4, 4, 4),
        ...          datetime(2022, 1, 1, 12, 1, 2),
        ...          np.nan]
        >>> is_working_hour = IsWorkingHours()
        >>> is_working_hour(dates).tolist()
        [True, False, True, False]
        >>> is_working_hour = IsWorkingHours(15, 17)
        >>> is_working_hour(dates).tolist()
        [True, False, False, False]
    """

    name = "is_working_hours"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} falls during working hours"

    def __init__(self, start_hour=8, end_hour=18):
        self.start_hour = start_hour
        self.end_hour = end_hour

    def get_function(self):
        def is_working_hours(vals):
            return (vals.dt.hour >= self.start_hour) & (vals.dt.hour <= self.end_hour)

        return is_working_hours


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_year_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsYearEnd(TransformPrimitive):
    """Determines if a date falls on the end of a year.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 12, 31),
        ...          datetime(2019, 1, 1),
        ...          datetime(2019, 11, 30),
        ...
        np.nan]
        >>> is_year_end = IsYearEnd()
        >>> is_year_end(dates).tolist()
        [True, False, False, False]
    """

    name = "is_year_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on the end of a year"

    def get_function(self):
        def is_year_end(vals):
            return vals.dt.is_year_end

        return is_year_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_year_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsYearStart(TransformPrimitive):
    """Determines if a date falls on the start of a year.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 12, 31),
        ...          datetime(2019, 1, 1),
        ...          datetime(2019, 11, 30),
        ...          np.nan]
        >>> is_year_start = IsYearStart()
        >>> is_year_start(dates).tolist()
        [False, True, False, False]
    """

    name = "is_year_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on the start of a year"

    def get_function(self):
        def is_year_start(vals):
            return vals.dt.is_year_start

        return is_year_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/minute.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Minute(TransformPrimitive):
    """Determines the minutes value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...
        datetime(2019, 3, 31, 19, 45, 15)]
        >>> minute = Minute()
        >>> minute(dates).tolist()
        [0, 10, 45]
    """

    name = "minute"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(60))),
        semantic_tags={"category"},
    )

    description_template = "the minutes value of {}"

    def get_function(self):
        def minute(vals):
            return vals.dt.minute

        return minute


================================================
FILE: featuretools/primitives/standard/transform/datetime/month.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Month(TransformPrimitive):
    """Determines the month value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> month = Month()
        >>> month(dates).tolist()
        [3, 6, 11]
    """

    name = "month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 13))),
        semantic_tags={"category"},
    )

    description_template = "the month of {}"

    def get_function(self):
        def month(vals):
            return vals.dt.month

        return month


================================================
FILE: featuretools/primitives/standard/transform/datetime/part_of_day.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive


class PartOfDay(TransformPrimitive):
    """Determines the part of day of a datetime.

    Description:
        For a list of datetimes, determines the part of day the datetime
        falls into, based on the hour.
        If the hour falls from 4 to 5, the part of day is 'dawn'.
        If the hour falls from 6 to 7, the part of day is 'early morning'.
        If the hour falls from 8 to 10, the part of day is 'late morning'.
        If the hour falls from 11 to 13, the part of day is 'noon'.
        If the hour falls from 14 to 16, the part of day is 'afternoon'.
        If the hour falls from 17 to 19, the part of day is 'evening'.
        If the hour falls from 20 to 22, the part of day is 'night'.
        If the hour is 23, 0, or 1 to 3, the part of day is 'midnight'.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2020, 1, 11, 6, 2, 1),
        ...          datetime(2021, 3, 31, 4, 2, 1),
        ...          datetime(2020, 3, 4, 9, 2, 1)]
        >>> part_of_day = PartOfDay()
        >>> part_of_day(dates).tolist()
        ['early morning', 'dawn', 'late morning']
    """

    name = "part_of_day"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    description_template = "the part of day {} falls in"

    @staticmethod
    def construct_replacement_dict():
        # Map each hour of the day (0-23) to its part-of-day label.
        tdict = dict()
        tdict[pd.NaT] = np.nan
        for hour in [4, 5]:
            tdict[hour] = "dawn"
        for hour in [6, 7]:
            tdict[hour] = "early morning"
        for hour in [8, 9, 10]:
            tdict[hour] = "late morning"
        for hour in [11, 12, 13]:
            tdict[hour] = "noon"
        for hour in [14, 15, 16]:
            tdict[hour] = "afternoon"
        for hour in [17, 18, 19]:
            tdict[hour] = "evening"
        for hour in [20, 21, 22]:
            tdict[hour] = "night"
        for hour in [23, 0, 1, 2, 3]:
            tdict[hour] = "midnight"
        return tdict

    def get_function(self):
        replacement_dict = self.construct_replacement_dict()

        def part_of_day(vals):
            ans = vals.dt.hour.replace(replacement_dict)
            return ans

        return part_of_day


================================================
FILE: featuretools/primitives/standard/transform/datetime/quarter.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Quarter(TransformPrimitive):
    """Determines the quarter a
    datetime column falls into (1, 2, 3, 4)

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019,12,1),
        ...          datetime(2019,1,3),
        ...          datetime(2020,2,1)]
        >>> q = Quarter()
        >>> q(dates).tolist()
        [4, 1, 1]
    """

    name = "quarter"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 5))),
        semantic_tags={"category"},
    )

    description_template = "the quarter that describes {}"

    def get_function(self):
        def quarter(vals):
            return vals.dt.quarter

        return quarter


================================================
FILE: featuretools/primitives/standard/transform/datetime/season.py
================================================
from datetime import date

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive


class Season(TransformPrimitive):
    """Determines the season of a given datetime.
    Returns winter, spring, summer, or fall.
    This only works for northern hemisphere.

    Description:
        Given a list of datetimes, return the season of each one
        (`winter`, `spring`, `summer`, or `fall`).

    Examples:
        >>> from datetime import datetime
        >>> times = [datetime(2019, 1, 1),
        ...          datetime(2019, 4, 15),
        ...          datetime(2019, 7, 20),
        ...
        datetime(2019, 12, 30)]
        >>> season = Season()
        >>> season(times).tolist()
        ['winter', 'spring', 'summer', 'winter']
    """

    name = "season"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def season(x):
            # https://stackoverflow.com/a/28688724/2512385
            Y = 2000  # dummy leap year to allow input X-02-29 (leap day)
            seasons = [
                ("winter", (date(Y, 1, 1), date(Y, 3, 20))),
                ("spring", (date(Y, 3, 21), date(Y, 6, 20))),
                ("summer", (date(Y, 6, 21), date(Y, 9, 22))),
                ("fall", (date(Y, 9, 23), date(Y, 12, 20))),
                ("winter", (date(Y, 12, 21), date(Y, 12, 31))),
            ]
            x = x.apply(lambda x: x.replace(year=2000))

            def get_season(dt):
                for season, (start, end) in seasons:
                    if not pd.isna(dt) and start <= dt.date() <= end:
                        return season
                return pd.NA

            new = x.apply(get_season).astype(dtype="string")
            return new

        return season


================================================
FILE: featuretools/primitives/standard/transform/datetime/second.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Second(TransformPrimitive):
    """Determines the seconds value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...
        datetime(2019, 3, 31, 19, 45, 15)]
        >>> second = Second()
        >>> second(dates).tolist()
        [0, 50, 15]
    """

    name = "second"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(60))),
        semantic_tags={"category"},
    )

    description_template = "the seconds value of {}"

    def get_function(self):
        def second(vals):
            return vals.dt.second

        return second


================================================
FILE: featuretools/primitives/standard/transform/datetime/time_since.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.utils import convert_time_units


class TimeSince(TransformPrimitive):
    """Calculates time from a value to a specified cutoff datetime.

    Args:
        unit (str): Defines the unit of time to count from.
            Defaults to Seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Examples:
        >>> from datetime import datetime
        >>> time_since = TimeSince()
        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),
        ...          datetime(2019, 3, 1, 0, 0, 1, 0),
        ...          datetime(2019, 3, 1, 0, 2, 0, 0)]
        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)
        >>> values = time_since(times, time=cutoff_time)
        >>> list(map(int, values))
        [0, -1, -120]

        Change output to nanoseconds

        >>> from datetime import datetime
        >>> time_since_nano = TimeSince(unit='nanoseconds')
        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),
        ...          datetime(2019, 3, 1, 0, 0, 1, 0),
        ...
        datetime(2019, 3, 1, 0, 2, 0, 0)]
        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)
        >>> values = time_since_nano(times, time=cutoff_time)
        >>> list(map(lambda x: int(round(x)), values))
        [-1000, -1000000000, -120000000000]
    """

    name = "time_since"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_calc_time = True
    description_template = "the time from {} to the cutoff time"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def pd_time_since(array, time):
            return convert_time_units((time - array).dt.total_seconds(), self.unit)

        return pd_time_since


================================================
FILE: featuretools/primitives/standard/transform/datetime/time_since_previous.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.utils import convert_time_units


class TimeSincePrevious(TransformPrimitive):
    """Computes the time since the previous entry in a list.

    Args:
        unit (str): Defines the unit of time to count from.
            Defaults to Seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Description:
        Given a list of datetimes, compute the time elapsed since the
        previous item in the list, expressed in `unit` (seconds by
        default). The result for the first item in the list will
        always be `NaN`.

    Examples:
        >>> from datetime import datetime
        >>> time_since_previous = TimeSincePrevious()
        >>> dates = [datetime(2019, 3, 1, 0, 0, 0),
        ...          datetime(2019, 3, 1, 0, 2, 0),
        ...          datetime(2019, 3, 1, 0, 3, 0),
        ...          datetime(2019, 3, 1, 0, 2, 30),
        ...
        datetime(2019, 3, 1, 0, 10, 0)]
        >>> time_since_previous(dates).tolist()
        [nan, 120.0, 60.0, -30.0, 450.0]
    """

    name = "time_since_previous"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the time since the previous instance of {}"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def pd_diff(values):
            return convert_time_units(
                values.diff().apply(lambda x: x.total_seconds()),
                self.unit,
            )

        return pd_diff


================================================
FILE: featuretools/primitives/standard/transform/datetime/utils.py
================================================
from typing import Optional, Tuple

import holidays
import pandas as pd


class HolidayUtil:
    def __init__(self, country="US"):
        try:
            country, subdivision = self.convert_to_subdivision(country)
            self.holidays = holidays.country_holidays(
                country=country,
                subdiv=subdivision,
            )
        except NotImplementedError:
            available_countries = (
                "https://github.com/dr-prodigy/python-holidays#available-countries"
            )
            error = "must be one of the available countries:\n%s" % available_countries
            raise ValueError(error)

        self.federal_holidays = getattr(holidays, country)(years=range(1950, 2075))

    def to_df(self):
        holidays_df = pd.DataFrame(
            sorted(self.federal_holidays.items()),
            columns=["holiday_date", "names"],
        )
        holidays_df.holiday_date = holidays_df.holiday_date.astype("datetime64[ns]")
        return holidays_df

    def convert_to_subdivision(self, country: str) -> Tuple[str, Optional[str]]:
        """Convert country to country + subdivision

        Created in response to library changes that changed countries to subdivisions

        Args:
            country (str): Original country name

        Returns:
            Tuple[str,Optional[str]]: country, subdivision
        """
        return {
            "ENGLAND": ("GB", country),
            "NORTHERNIRELAND": ("GB", country),
            "PORTUGALEXT": ("PT", "Ext"),
            "PTE": ("PT", "Ext"),
            "SCOTLAND": ("GB", country),
            "UK": ("GB", country),
            "WALES": ("GB",
                country),
        }.get(country.upper(), (country, None))


================================================
FILE: featuretools/primitives/standard/transform/datetime/week.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Week(TransformPrimitive):
    """Determines the week of the year from a datetime.

    Description:
        Returns the week of the year from a datetime value. Week numbers
        follow the ISO 8601 calendar, as returned by pandas
        `isocalendar().week`: weeks start on Monday, and week 1 is the
        week containing the first Thursday of the year.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 1, 3),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> week = Week()
        >>> week(dates).tolist()
        [1, 25, 48]
    """

    name = "week"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 54))),
        semantic_tags={"category"},
    )

    description_template = "the week of the year of {}"

    def get_function(self):
        def week(vals):
            # Newer pandas exposes ISO weeks via isocalendar(); older versions use .dt.week
            if hasattr(vals.dt, "isocalendar"):
                return vals.dt.isocalendar().week
            else:
                return vals.dt.week

        return week


================================================
FILE: featuretools/primitives/standard/transform/datetime/weekday.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Weekday(TransformPrimitive):
    """Determines the day of the week from a datetime.

    Description:
        Returns the day of the week from a datetime value. Weeks
        start on Monday (day 0) and run through Sunday (day 6).

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...
datetime(2019, 11, 30, 19, 45, 15)] >>> weekday = Weekday() >>> weekday(dates).tolist() [4, 0, 5] """ name = "weekday" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema( logical_type=Ordinal(order=list(range(7))), semantic_tags={"category"}, ) description_template = "the day of the week of {}" def get_function(self): def weekday(vals): return vals.dt.weekday return weekday ================================================ FILE: featuretools/primitives/standard/transform/datetime/year.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Ordinal from featuretools.primitives.base import TransformPrimitive class Year(TransformPrimitive): """Determines the year value of a datetime. Examples: >>> from datetime import datetime >>> dates = [datetime(2019, 3, 1), ... datetime(2048, 6, 17, 11, 10, 50), ... datetime(1950, 11, 30, 19, 45, 15)] >>> year = Year() >>> year(dates).tolist() [2019, 2048, 1950] """ name = "year" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema( logical_type=Ordinal(order=list(range(1, 3000))), semantic_tags={"category"}, ) description_template = "the year of {}" def get_function(self): def year(vals): return vals.dt.year return year ================================================ FILE: featuretools/primitives/standard/transform/email/__init__.py ================================================ from featuretools.primitives.standard.transform.email.email_address_to_domain import ( EmailAddressToDomain, ) from featuretools.primitives.standard.transform.email.is_free_email_domain import ( IsFreeEmailDomain, ) ================================================ FILE: featuretools/primitives/standard/transform/email/email_address_to_domain.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import 
Categorical, EmailAddress from featuretools.primitives.base import TransformPrimitive class EmailAddressToDomain(TransformPrimitive): """Determines the domain of an email Description: EmailAddress input should be a string. Will return NaN if an invalid email address is provided, or if the input is not a string. Examples: >>> email_address_to_domain = EmailAddressToDomain() >>> email_address_to_domain(['name@gmail.com', 'name@featuretools.com']).tolist() ['gmail.com', 'featuretools.com'] """ name = "email_address_to_domain" input_types = [ColumnSchema(logical_type=EmailAddress)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def email_address_to_domain(emails): # if the input is empty return an empty Series if len(emails) == 0: return pd.Series([], dtype="category") emails_df = pd.DataFrame({"email": emails}) # if all emails are NaN expand won't propagate NaNs and will fail on indexing if emails_df["email"].isnull().all(): emails_df["domain"] = np.nan emails_df["domain"] = emails_df["domain"].astype(object) else: # .str.strip() and .str.split() return NaN for NaN values and propagate NaNs into new columns emails_df["domain"] = ( emails_df["email"].str.strip().str.split("@", expand=True)[1] ) return emails_df.domain.values return email_address_to_domain ================================================ FILE: featuretools/primitives/standard/transform/email/is_free_email_domain.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, EmailAddress from featuretools.primitives.base import TransformPrimitive class IsFreeEmailDomain(TransformPrimitive): """Determines if an email address is from a free email domain. Description: EmailAddress input should be a string. Will return NaN if an invalid email address is provided, or if the input is not a string. 
The list of free email domains used in this primitive was obtained from https://github.com/willwhite/freemail/blob/master/data/free.txt. Examples: >>> is_free_email_domain = IsFreeEmailDomain() >>> is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist() [True, False] """ name = "is_free_email_domain" input_types = [ColumnSchema(logical_type=EmailAddress)] return_type = ColumnSchema(logical_type=BooleanNullable) filename = "free_email_provider_domains.txt" def get_function(self): file_path = self.get_filepath(self.filename) free_domains = pd.read_csv(file_path, header=None, names=["domain"]) free_domains["domain"] = free_domains.domain.str.strip() def is_free_email_domain(emails): # if the input is empty return an empty Series if len(emails) == 0: return pd.Series([], dtype="category") emails_df = pd.DataFrame({"email": emails}) # if all emails are NaN expand won't propagate NaNs and will fail on indexing if emails_df["email"].isnull().all(): emails_df["domain"] = np.nan else: # .str.strip() and .str.split() return NaN for NaN values and propagate NaNs into new columns emails_df["domain"] = ( emails_df["email"].str.strip().str.split("@", expand=True)[1] ) emails_df["is_free"] = emails_df["domain"].isin(free_domains["domain"]) # if there are any NaN domain values, change the series type to allow for # both bools and NaN values and set is_free to NaN for the NaN domains if emails_df["domain"].isnull().values.any(): emails_df["is_free"] = emails_df["is_free"].astype("object") emails_df.loc[emails_df["domain"].isnull(), "is_free"] = np.nan return emails_df.is_free.values return is_free_email_domain ================================================ FILE: featuretools/primitives/standard/transform/exponential/__init__.py ================================================ from featuretools.primitives.standard.transform.exponential.exponential_weighted_average import ( ExponentialWeightedAverage, ) from 
featuretools.primitives.standard.transform.exponential.exponential_weighted_std import ( ExponentialWeightedSTD, ) from featuretools.primitives.standard.transform.exponential.exponential_weighted_variance import ( ExponentialWeightedVariance, ) ================================================ FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_average.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class ExponentialWeightedAverage(TransformPrimitive): """Computes the exponentially weighted moving average for a series of numbers Description: Returns the exponentially weighted moving average for a series of numbers. Exactly one of center of mass (com), span, half-life, and alpha must be provided. Missing values can be ignored when calculating weights by setting 'ignore_na' to True. Args: com (float): Specify decay in terms of center of mass for com >= 0. Default is None. span (float): Specify decay in terms of span for span >= 1. Default is None. halflife (float): Specify decay in terms of half-life for halflife > 0. Default is None. alpha (float): Specify smoothing factor alpha directly. Alpha should be greater than 0 and less than or equal to 1. Default is None. ignore_na (bool): Ignore missing values when calculating weights. Default is False. 
Examples: >>> exponential_weighted_average = ExponentialWeightedAverage(com=0.5) >>> exponential_weighted_average([1, 2, 3, 4]).tolist() [1.0, 1.75, 2.615384615384615, 3.55] Missing values can be ignored >>> ewma_ignorena = ExponentialWeightedAverage(com=0.5, ignore_na=True) >>> ewma_ignorena([1, 2, 3, None, 4]).tolist() [1.0, 1.75, 2.615384615384615, 2.615384615384615, 3.55] """ name = "exponential_weighted_average" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False): if all(x is None for x in [com, span, halflife, alpha]): com = 0.5 self.com = com self.span = span self.halflife = halflife self.alpha = alpha self.ignore_na = ignore_na def get_function(self): def exponential_weighted_average(x): return x.ewm( com=self.com, span=self.span, halflife=self.halflife, alpha=self.alpha, ignore_na=self.ignore_na, ).mean() return exponential_weighted_average ================================================ FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_std.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class ExponentialWeightedSTD(TransformPrimitive): """Computes the exponentially weighted moving standard deviation for a series of numbers Description: Returns the exponentially weighted moving standard deviation for a series of numbers. Exactly one of center of mass (com), span, half-life, and alpha must be provided. Missing values can be ignored when calculating weights by setting 'ignore_na' to True. Args: com (float): Specify decay in terms of center of mass for com >= 0. Default is None. span (float): Specify decay in terms of span for span >= 1. Default is None. 
halflife (float): Specify decay in terms of half-life for halflife > 0. Default is None. alpha (float): Specify smoothing factor alpha directly. Alpha should be greater than 0 and less than or equal to 1. Default is None. ignore_na (bool): Ignore missing values when calculating weights. Default is False. Examples: >>> exponential_weighted_std = ExponentialWeightedSTD(com=0.5) >>> exponential_weighted_std([1, 2, 3, 7]).tolist() [nan, 0.7071067811865475, 0.9198662110077998, 2.9852200022005855] Missing values can be ignored >>> ewmstd_ignorena = ExponentialWeightedSTD(com=0.5, ignore_na=True) >>> ewmstd_ignorena([1, 2, 3, None, 7]).tolist() [nan, 0.7071067811865475, 0.9198662110077998, 0.9198662110077998, 2.9852200022005855] """ name = "exponential_weighted_std" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False): if all(x is None for x in [com, span, halflife, alpha]): com = 0.5 self.com = com self.span = span self.halflife = halflife self.alpha = alpha self.ignore_na = ignore_na def get_function(self): def exponential_weighted_std(x): return x.ewm( com=self.com, span=self.span, halflife=self.halflife, alpha=self.alpha, ignore_na=self.ignore_na, ).std() return exponential_weighted_std ================================================ FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_variance.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class ExponentialWeightedVariance(TransformPrimitive): """Computes the exponentially weighted moving variance for a series of numbers Description: Returns the exponentially weighted moving variance for a series of numbers. 
Exactly one of center of mass (com), span, half-life, and alpha must be provided. Missing values can be ignored when calculating weights by setting 'ignore_na' to True. Args: com (float): Specify decay in terms of center of mass for com >= 0. Default is None. span (float): Specify decay in terms of span for span >= 1. Default is None. halflife (float): Specify decay in terms of half-life for halflife > 0. Default is None. alpha (float): Specify smoothing factor alpha directly. Alpha should be greater than 0 and less than or equal to 1. Default is None. ignore_na (bool): Ignore missing values when calculating weights. Default is False. Examples: >>> exponential_weighted_variance = ExponentialWeightedVariance(com=0.5) >>> exponential_weighted_variance([1, 2, 3, 4]).tolist() [nan, 0.49999999999999983, 0.8461538461538459, 1.1230769230769233] Missing values can be ignored >>> ewmv_ignorena = ExponentialWeightedVariance(com=0.5, ignore_na=True) >>> ewmv_ignorena([1, 2, 3, None, 4]).tolist() [nan, 0.49999999999999983, 0.8461538461538459, 0.8461538461538459, 1.1230769230769233] """ name = "exponential_weighted_variance" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False): if all(x is None for x in [com, span, halflife, alpha]): com = 0.5 self.com = com self.span = span self.halflife = halflife self.alpha = alpha self.ignore_na = ignore_na def get_function(self): def exponential_weighted_variance(x): return x.ewm( com=self.com, span=self.span, halflife=self.halflife, alpha=self.alpha, ignore_na=self.ignore_na, ).var() return exponential_weighted_variance ================================================ FILE: featuretools/primitives/standard/transform/file_extension.py ================================================ from woodwork.column_schema import ColumnSchema from 
woodwork.logical_types import Filepath from featuretools.primitives.base import TransformPrimitive class FileExtension(TransformPrimitive): """Determines the extension of a filepath. Description: Given a list of filepaths, return the extension suffix of each one. If the filepath is missing or invalid, return `NaN`. Examples: >>> file_extension = FileExtension() >>> file_extension(['doc.txt', '~/documents/data.json', 'file']).tolist() ['.txt', '.json', nan] """ name = "file_extension" input_types = [ColumnSchema(logical_type=Filepath)] return_type = ColumnSchema(semantic_tags={"category"}) def get_function(self): def file_extension(x): p = r"(\.[a-z|A-Z]+$)" return x.str.extract(p, expand=False).str.lower() return file_extension ================================================ FILE: featuretools/primitives/standard/transform/full_name_to_first_name.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, PersonFullName from featuretools.primitives.base import TransformPrimitive class FullNameToFirstName(TransformPrimitive): """Determines the first name from a person's name. Description: Given a list of names, determines the first name. If only a single name is provided, assume this is a first name. If only a title and a single name is provided return `nan`. This assumes all titles will be followed by a period. Please note, in the current implementation, last names containing spaces may result in improper first name matches. Examples: >>> full_name_to_first_name = FullNameToFirstName() >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina', ... 'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. 
Brown'] >>> full_name_to_first_name(names).to_list() ['Woolf', 'Oliva', 'Frederick', 'Michael', nan] """ name = "full_name_to_first_name" input_types = [ColumnSchema(logical_type=PersonFullName)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def full_name_to_first_name(x): title_with_last_pattern = r"(^[A-Z][a-z]+\. [A-Z][a-z]+$)" titles_pattern = r"([A-Z][a-z]+)\. " df = pd.DataFrame({"names": x}) # remove any entries with just a title and a name df["names"] = df["names"].str.replace( title_with_last_pattern, "", regex=True, ) # remove any known titles df["names"] = df["names"].str.replace(titles_pattern, "", regex=True) # extract first names pattern = r"([A-Z][a-z]+ |, [A-Z][a-z]+$|^[A-Z][a-z]+$)" df["first_name"] = df["names"].str.extract(pattern) # clean up white space and leftover commas df["first_name"] = df["first_name"].str.replace(",", "").str.strip() return df["first_name"] return full_name_to_first_name ================================================ FILE: featuretools/primitives/standard/transform/full_name_to_last_name.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, PersonFullName from featuretools.primitives.base import TransformPrimitive class FullNameToLastName(TransformPrimitive): """Determines the last name from a person's name. Description: Given a list of names, determines the last name. If only a single name is provided, assume this is a first name, and return `nan`. This assumes all titles will be followed by a period. Examples: >>> full_name_to_last_name = FullNameToLastName() >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina', ... 'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. 
Brown'] >>> full_name_to_last_name(names).to_list() ['Spector', 'Oliva y Ocana', 'Ware', 'Peter', 'Brown'] """ name = "full_name_to_last_name" input_types = [ColumnSchema(logical_type=PersonFullName)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def full_name_to_last_name(x): titles_pattern = r"([A-Z][a-z]+)\. " df = pd.DataFrame({"names": x}) # extract initial names pattern = r"(^.+?,|^[A-Z][a-z]+\. [A-Z][a-z]+$| [A-Z][a-z]+$| [A-Z][a-z]+[/-][A-Z][a-z]+$)" df["last_name"] = df["names"].str.extract(pattern) # remove titles df["last_name"] = df["last_name"].str.replace( titles_pattern, "", regex=True, ) # clean up white space and leftover commas df["last_name"] = df["last_name"].str.replace(",", "").str.strip() return df["last_name"] return full_name_to_last_name ================================================ FILE: featuretools/primitives/standard/transform/full_name_to_title.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, PersonFullName from featuretools.primitives.base import TransformPrimitive class FullNameToTitle(TransformPrimitive): """Determines the title from a person's name. Description: Given a list of names, determines the title, or prefix of each name (e.g. "Mr", "Mrs", etc). If no title is found, returns `NaN`. Examples: >>> full_name_to_title = FullNameToTitle() >>> names = ['Spector, Mr. Woolf', 'Oliva y Ocana, Dona. Fermina', ... 'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown'] >>> full_name_to_title(names).to_list() ['Mr', 'Dona', 'Mr', nan, 'Mr'] """ name = "full_name_to_title" input_types = [ColumnSchema(logical_type=PersonFullName)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def full_name_to_title(x): pattern = r"([A-Z][a-z]+)\. 
" return x.str.extract(pattern, expand=True)[0] return full_name_to_title ================================================ FILE: featuretools/primitives/standard/transform/is_in.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean from featuretools.primitives.base import TransformPrimitive class IsIn(TransformPrimitive): """Determines whether a value is present in a provided list. Examples: >>> items = ['string', 10.3, False] >>> is_in = IsIn(list_of_outputs=items) >>> is_in(['string', 10.5, False]).tolist() [True, False, True] """ name = "isin" input_types = [ColumnSchema()] return_type = ColumnSchema(logical_type=Boolean) def __init__(self, list_of_outputs=None): self.list_of_outputs = list_of_outputs if not list_of_outputs: stringified_output_list = "[]" else: stringified_output_list = ", ".join([str(x) for x in list_of_outputs]) self.description_template = "whether {{}} is in {}".format( stringified_output_list, ) def get_function(self): def pd_is_in(array): return array.isin(self.list_of_outputs or []) return pd_is_in def generate_name(self, base_feature_names): return "%s.isin(%s)" % (base_feature_names[0], str(self.list_of_outputs)) ================================================ FILE: featuretools/primitives/standard/transform/is_null.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean from featuretools.primitives.base import TransformPrimitive class IsNull(TransformPrimitive): """Determines if a value is null. 
Examples: >>> is_null = IsNull() >>> is_null([1, None, 3]).tolist() [False, True, False] """ name = "is_null" input_types = [ColumnSchema()] return_type = ColumnSchema(logical_type=Boolean) description_template = "whether {} is null" def get_function(self): def isnull(array): return array.isnull() return isnull ================================================ FILE: featuretools/primitives/standard/transform/latlong/__init__.py ================================================ from featuretools.primitives.standard.transform.latlong.cityblock_distance import ( CityblockDistance, ) from featuretools.primitives.standard.transform.latlong.geomidpoint import GeoMidpoint from featuretools.primitives.standard.transform.latlong.haversine import Haversine from featuretools.primitives.standard.transform.latlong.is_in_geobox import IsInGeoBox from featuretools.primitives.standard.transform.latlong.latitude import Latitude from featuretools.primitives.standard.transform.latlong.longitude import Longitude ================================================ FILE: featuretools/primitives/standard/transform/latlong/cityblock_distance.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, LatLong from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.latlong.utils import ( _haversine_calculate, ) class CityblockDistance(TransformPrimitive): """Calculates the distance between points in a city road grid. Description: This distance is calculated using the haversine formula, which takes into account the curvature of the Earth. If either input data contains `NaN`s, the calculated distance will be `NaN`. This calculation is also known as the Manhattan distance. Args: unit (str): Determines the unit value to output. Could be miles or kilometers. Default is miles. 
Examples: >>> cityblock_distance = CityblockDistance() >>> DC = (38, -77) >>> Boston = (43, -71) >>> NYC = (40, -74) >>> distances_mi = cityblock_distance([DC, DC], [NYC, Boston]) >>> np.round(distances_mi, 3).tolist() [301.519, 672.089] We can also change the units in which the distance is calculated. >>> cityblock_distance_kilometers = CityblockDistance(unit='kilometers') >>> distances_km = cityblock_distance_kilometers([DC, DC], [NYC, Boston]) >>> np.round(distances_km, 3).tolist() [485.248, 1081.622] """ name = "cityblock_distance" input_types = [ ColumnSchema(logical_type=LatLong), ColumnSchema(logical_type=LatLong), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) commutative = True def __init__(self, unit="miles"): if unit not in ["miles", "kilometers"]: raise ValueError("Invalid unit given") self.unit = unit def get_function(self): def cityblock(latlong_1, latlong_2): latlong_1 = np.array(latlong_1.tolist()) latlong_2 = np.array(latlong_2.tolist()) lat_1s = latlong_1[:, 0] lat_2s = latlong_2[:, 0] lon_1s = latlong_1[:, 1] lon_2s = latlong_2[:, 1] lon_dis = _haversine_calculate(lat_1s, lon_1s, lat_1s, lon_2s, self.unit) lat_dist = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_1s, self.unit) return pd.Series(lon_dis + lat_dist) return cityblock ================================================ FILE: featuretools/primitives/standard/transform/latlong/geomidpoint.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LatLong from featuretools.primitives.base import TransformPrimitive class GeoMidpoint(TransformPrimitive): """Determines the geographic center of two coordinates. 
Examples: >>> geomidpoint = GeoMidpoint() >>> geomidpoint([(42.4, -71.1)], [(40.0, -122.4)]) [(41.2, -96.75)] """ name = "geomidpoint" input_types = [ ColumnSchema(logical_type=LatLong), ColumnSchema(logical_type=LatLong), ] return_type = ColumnSchema(logical_type=LatLong) commutative = True def get_function(self): def geomidpoint_func(latlong_1, latlong_2): latlong_1 = np.array(latlong_1.tolist()) latlong_2 = np.array(latlong_2.tolist()) lat_1s = latlong_1[:, 0] lat_2s = latlong_2[:, 0] lon_1s = latlong_1[:, 1] lon_2s = latlong_2[:, 1] lat_middle = np.array([lat_1s, lat_2s]).transpose().mean(axis=1) lon_middle = np.array([lon_1s, lon_2s]).transpose().mean(axis=1) return list(zip(lat_middle, lon_middle)) return geomidpoint_func ================================================ FILE: featuretools/primitives/standard/transform/latlong/haversine.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LatLong from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.latlong.utils import ( _haversine_calculate, ) class Haversine(TransformPrimitive): """Calculates the approximate haversine distance between two LatLong columns. Args: unit (str): Determines the unit value to output. Could be `miles` or `kilometers`. Default is `miles`. Examples: >>> haversine = Haversine() >>> distances = haversine([(42.4, -71.1), (40.0, -122.4)], ... [(40.0, -122.4), (41.2, -96.75)]) >>> np.round(distances, 3).tolist() [2631.231, 1343.289] Output units can be specified >>> haversine_km = Haversine(unit='kilometers') >>> distances_km = haversine_km([(42.4, -71.1), (40.0, -122.4)], ... 
[(40.0, -122.4), (41.2, -96.75)]) >>> np.round(distances_km, 3).tolist() [4234.555, 2161.814] """ name = "haversine" input_types = [ ColumnSchema(logical_type=LatLong), ColumnSchema(logical_type=LatLong), ] return_type = ColumnSchema(semantic_tags={"numeric"}) commutative = True def __init__(self, unit="miles"): valid_units = ["miles", "kilometers"] if unit not in valid_units: error_message = "Invalid unit %s provided. Must be one of %s" % ( unit, valid_units, ) raise ValueError(error_message) self.unit = unit self.description_template = ( "the haversine distance in {} between {{}} and {{}}".format(self.unit) ) def get_function(self): def haversine(latlong_1, latlong_2): latlong_1 = np.array(latlong_1.tolist()) latlong_2 = np.array(latlong_2.tolist()) lat_1s = latlong_1[:, 0] lat_2s = latlong_2[:, 0] lon_1s = latlong_1[:, 1] lon_2s = latlong_2[:, 1] distance = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, self.unit) return distance return haversine def generate_name(self, base_feature_names): name = "{}(".format(self.name.upper()) name += ", ".join(base_feature_names) if self.unit != "miles": name += ", unit={}".format(self.unit) name += ")" return name ================================================ FILE: featuretools/primitives/standard/transform/latlong/is_in_geobox.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable, LatLong from featuretools.primitives.base import TransformPrimitive class IsInGeoBox(TransformPrimitive): """Determines if coordinates are inside a box defined by two corner coordinate points. Description: Coordinate values should be specified as (latitude, longitude) tuples. This primitive is unable to handle coordinates and boxes at the poles, and near +/- 180 degrees longitude. Args: point1 (tuple(float, float)): The coordinates of the first corner of the box. Defaults to (0, 0). 
point2 (tuple(float, float)): The coordinates of the diagonal corner of the box. Defaults to (0, 0). Example: >>> is_in_geobox = IsInGeoBox((40.7128, -74.0060), (42.2436, -71.1677)) >>> is_in_geobox([(41.034, -72.254), (39.125, -87.345)]).tolist() [True, False] """ name = "is_in_geobox" input_types = [ColumnSchema(logical_type=LatLong)] return_type = ColumnSchema(logical_type=BooleanNullable) def __init__(self, point1=(0, 0), point2=(0, 0)): self.point1 = point1 self.point2 = point2 self.lats = np.sort(np.array([point1[0], point2[0]])) self.lons = np.sort(np.array([point1[1], point2[1]])) def get_function(self): def geobox(latlongs): transposed = np.transpose(np.array(latlongs.tolist())) lats = (self.lats[0] <= transposed[0]) & (self.lats[1] >= transposed[0]) longs = (self.lons[0] <= transposed[1]) & (self.lons[1] >= transposed[1]) return lats & longs return geobox ================================================ FILE: featuretools/primitives/standard/transform/latlong/latitude.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LatLong from featuretools.primitives.base import TransformPrimitive class Latitude(TransformPrimitive): """Returns the first tuple value in a list of LatLong tuples. For use with the LatLong logical type. Examples: >>> latitude = Latitude() >>> latitude([(42.4, -71.1), ... (40.0, -122.4), ... 
(41.2, -96.75)]).tolist() [42.4, 40.0, 41.2] """ name = "latitude" input_types = [ColumnSchema(logical_type=LatLong)] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the latitude of {}" def get_function(self): def latitude(latlong): latlong = np.array(latlong.tolist()) return latlong[:, 0] return latitude ================================================ FILE: featuretools/primitives/standard/transform/latlong/longitude.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import LatLong from featuretools.primitives.base import TransformPrimitive class Longitude(TransformPrimitive): """Returns the second tuple value in a list of LatLong tuples. For use with the LatLong logical type. Examples: >>> longitude = Longitude() >>> longitude([(42.4, -71.1), ... (40.0, -122.4), ... (41.2, -96.75)]).tolist() [-71.1, -122.4, -96.75] """ name = "longitude" input_types = [ColumnSchema(logical_type=LatLong)] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the longitude of {}" def get_function(self): def longitude(latlong): latlong = np.array(latlong.tolist()) return latlong[:, 1] return longitude ================================================ FILE: featuretools/primitives/standard/transform/latlong/utils.py ================================================ import numpy as np def _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, unit): # https://stackoverflow.com/a/29546836/2512385 lon1, lat1, lon2, lat2 = map(np.radians, [lon_1s, lat_1s, lon_2s, lat_2s]) dlon = lon2 - lon1 dlat = lat2 - lat1 a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2 radius_earth = 3958.7613 if unit == "kilometers": radius_earth = 6371.0088 distances = radius_earth * 2 * np.arcsin(np.sqrt(a)) return distances ================================================ FILE: 
featuretools/primitives/standard/transform/natural_language/__init__.py ================================================ from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) from featuretools.primitives.standard.transform.natural_language.mean_characters_per_word import ( MeanCharactersPerWord, ) from featuretools.primitives.standard.transform.natural_language.median_word_length import ( MedianWordLength, ) from featuretools.primitives.standard.transform.natural_language.num_characters import ( NumCharacters, ) from featuretools.primitives.standard.transform.natural_language.num_unique_separators import ( NumUniqueSeparators, ) from featuretools.primitives.standard.transform.natural_language.num_words import ( NumWords, ) from featuretools.primitives.standard.transform.natural_language.number_of_common_words import ( NumberOfCommonWords, ) from featuretools.primitives.standard.transform.natural_language.number_of_hashtags import ( NumberOfHashtags, ) from featuretools.primitives.standard.transform.natural_language.number_of_mentions import ( NumberOfMentions, ) from featuretools.primitives.standard.transform.natural_language.number_of_unique_words import ( NumberOfUniqueWords, ) from featuretools.primitives.standard.transform.natural_language.number_of_words_in_quotes import ( NumberOfWordsInQuotes, ) from featuretools.primitives.standard.transform.natural_language.punctuation_count import ( PunctuationCount, ) from featuretools.primitives.standard.transform.natural_language.title_word_count import ( TitleWordCount, ) from featuretools.primitives.standard.transform.natural_language.total_word_length import ( TotalWordLength, ) from featuretools.primitives.standard.transform.natural_language.upper_case_count import ( UpperCaseCount, ) from featuretools.primitives.standard.transform.natural_language.upper_case_word_count import ( UpperCaseWordCount, ) from 
featuretools.primitives.standard.transform.natural_language.whitespace_count import ( WhitespaceCount, ) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/constants.py ================================================ from string import punctuation DELIMITERS = "[ \n\t]" PUNCTUATION_AND_WHITESPACE = f"[{punctuation}\n\t ]" common_words_1000 = frozenset( [ "the", "of", "to", "and", "a", "in", "is", "it", "you", "that", "he", "was", "for", "on", "are", "with", "as", "i", "his", "they", "be", "at", "one", "have", "this", "from", "or", "had", "by", "not", "word", "but", "what", "some", "we", "can", "out", "other", "were", "all", "there", "when", "up", "use", "your", "how", "said", "an", "each", "she", "which", "do", "their", "time", "if", "will", "way", "about", "many", "then", "them", "write", "would", "like", "so", "these", "her", "long", "make", "thing", "see", "him", "two", "has", "look", "more", "day", "could", "go", "come", "did", "number", "sound", "no", "most", "people", "my", "over", "know", "water", "than", "call", "first", "who", "may", "down", "side", "been", "now", "find", "any", "new", "work", "part", "take", "get", "place", "made", "live", "where", "after", "back", "little", "only", "round", "man", "year", "came", "show", "every", "good", "me", "give", "our", "under", "name", "very", "through", "just", "form", "sentence", "great", "think", "say", "help", "low", "line", "differ", "turn", "cause", "much", "mean", "before", "move", "right", "boy", "old", "too", "same", "tell", "does", "set", "three", "want", "air", "well", "also", "play", "small", "end", "put", "home", "read", "hand", "port", "large", "spell", "add", "even", "land", "here", "must", "big", "high", "such", "follow", "act", "why", "ask", "men", "change", "went", "light", "kind", "off", "need", "house", "picture", "try", "us", "again", "animal", "point", "mother", "world", "near", "build", "self", "earth", "father", "head", 
"stand", "own", "page", "should", "country", "found", "answer", "school", "grow", "study", "still", "learn", "plant", "cover", "food", "sun", "four", "between", "state", "keep", "eye", "never", "last", "let", "thought", "city", "tree", "cross", "farm", "hard", "start", "might", "story", "saw", "far", "sea", "draw", "left", "late", "run", "don't", "while", "press", "close", "night", "real", "life", "few", "north", "open", "seem", "together", "next", "white", "children", "begin", "got", "walk", "example", "ease", "paper", "group", "always", "music", "those", "both", "mark", "often", "letter", "until", "mile", "river", "car", "feet", "care", "second", "book", "carry", "took", "science", "eat", "room", "friend", "began", "idea", "fish", "mountain", "stop", "once", "base", "hear", "horse", "cut", "sure", "watch", "color", "face", "wood", "main", "enough", "plain", "girl", "usual", "young", "ready", "above", "ever", "red", "list", "though", "feel", "talk", "bird", "soon", "body", "dog", "family", "direct", "pose", "leave", "song", "measure", "door", "product", "black", "short", "numeral", "class", "wind", "question", "happen", "complete", "ship", "area", "half", "rock", "order", "fire", "south", "problem", "piece", "told", "knew", "pass", "since", "top", "whole", "king", "space", "heard", "best", "hour", "better", "true", "during", "hundred", "five", "remember", "step", "early", "hold", "west", "ground", "interest", "reach", "fast", "verb", "sing", "listen", "six", "table", "travel", "less", "morning", "ten", "simple", "several", "vowel", "toward", "war", "lay", "against", "pattern", "slow", "center", "love", "person", "money", "serve", "appear", "road", "map", "rain", "rule", "govern", "pull", "cold", "notice", "voice", "unit", "power", "town", "fine", "certain", "fly", "fall", "lead", "cry", "dark", "machine", "note", "wait", "plan", "figure", "star", "box", "noun", "field", "rest", "correct", "able", "pound", "done", "beauty", "drive", "stood", "contain", "front", 
"teach", "week", "final", "gave", "green", "oh", "quick", "develop", "ocean", "warm", "free", "minute", "strong", "special", "mind", "behind", "clear", "tail", "produce", "fact", "street", "inch", "multiply", "nothing", "course", "stay", "wheel", "full", "force", "blue", "object", "decide", "surface", "deep", "moon", "island", "foot", "system", "busy", "test", "record", "boat", "common", "gold", "possible", "plane", "stead", "dry", "wonder", "laugh", "thousand", "ago", "ran", "check", "game", "shape", "equate", "hot", "miss", "brought", "heat", "snow", "tire", "bring", "yes", "distant", "fill", "east", "paint", "language", "among", "grand", "ball", "yet", "wave", "drop", "heart", "am", "present", "heavy", "dance", "engine", "position", "arm", "wide", "sail", "material", "size", "vary", "settle", "speak", "weight", "general", "ice", "matter", "circle", "pair", "include", "divide", "syllable", "felt", "perhaps", "pick", "sudden", "count", "square", "reason", "length", "represent", "art", "subject", "region", "energy", "hunt", "probable", "bed", "brother", "egg", "ride", "cell", "believe", "fraction", "forest", "sit", "race", "window", "store", "summer", "train", "sleep", "prove", "lone", "leg", "exercise", "wall", "catch", "mount", "wish", "sky", "board", "joy", "winter", "sat", "written", "wild", "instrument", "kept", "glass", "grass", "cow", "job", "edge", "sign", "visit", "past", "soft", "fun", "bright", "gas", "weather", "month", "million", "bear", "finish", "happy", "hope", "flower", "clothe", "strange", "gone", "jump", "baby", "eight", "village", "meet", "root", "buy", "raise", "solve", "metal", "whether", "push", "seven", "paragraph", "third", "shall", "held", "hair", "describe", "cook", "floor", "either", "result", "burn", "hill", "safe", "cat", "century", "consider", "type", "law", "bit", "coast", "copy", "phrase", "silent", "tall", "sand", "soil", "roll", "temperature", "finger", "industry", "value", "fight", "lie", "beat", "excite", "natural", "view", 
"sense", "ear", "else", "quite", "broke", "case", "middle", "kill", "son", "lake", "moment", "scale", "loud", "spring", "observe", "child", "straight", "consonant", "nation", "dictionary", "milk", "speed", "method", "organ", "pay", "age", "section", "dress", "cloud", "surprise", "quiet", "stone", "tiny", "climb", "cool", "design", "poor", "lot", "experiment", "bottom", "key", "iron", "single", "stick", "flat", "twenty", "skin", "smile", "crease", "hole", "trade", "melody", "trip", "office", "receive", "row", "mouth", "exact", "symbol", "die", "least", "trouble", "shout", "except", "wrote", "seed", "tone", "join", "suggest", "clean", "break", "lady", "yard", "rise", "bad", "blow", "oil", "blood", "touch", "grew", "cent", "mix", "team", "wire", "cost", "lost", "brown", "wear", "garden", "equal", "sent", "choose", "fell", "fit", "flow", "fair", "bank", "collect", "save", "control", "decimal", "gentle", "woman", "captain", "practice", "separate", "difficult", "doctor", "please", "protect", "noon", "whose", "locate", "ring", "character", "insect", "caught", "period", "indicate", "radio", "spoke", "atom", "human", "history", "effect", "electric", "expect", "crop", "modern", "element", "hit", "student", "corner", "party", "supply", "bone", "rail", "imagine", "provide", "agree", "thus", "capital", "won't", "chair", "danger", "fruit", "rich", "thick", "soldier", "process", "operate", "guess", "necessary", "sharp", "wing", "create", "neighbor", "wash", "bat", "rather", "crowd", "corn", "compare", "poem", "string", "bell", "depend", "meat", "rub", "tube", "famous", "dollar", "stream", "fear", "sight", "thin", "triangle", "planet", "hurry", "chief", "colony", "clock", "mine", "tie", "enter", "major", "fresh", "search", "send", "yellow", "gun", "allow", "print", "dead", "spot", "desert", "suit", "current", "lift", "rose", "continue", "block", "chart", "hat", "sell", "success", "company", "subtract", "event", "particular", "deal", "swim", "term", "opposite", "wife", "shoe", 
"shoulder", "spread", "arrange", "camp", "invent", "cotton", "born", "determine", "quart", "nine", "truck", "noise", "level", "chance", "gather", "shop", "stretch", "throw", "shine", "property", "column", "molecule", "select", "wrong", "gray", "repeat", "require", "broad", "prepare", "salt", "nose", "plural", "anger", "claim", "continent", "oxygen", "sugar", "death", "pretty", "skill", "women", "season", "solution", "magnet", "silver", "thank", "branch", "match", "suffix", "especially", "fig", "afraid", "huge", "sister", "steel", "discuss", "forward", "similar", "guide", "experience", "score", "apple", "bought", "led", "pitch", "coat", "mass", "card", "band", "rope", "slip", "win", "dream", "evening", "condition", "feed", "tool", "total", "basic", "smell", "valley", "nor", "double", "seat", "arrive", "master", "track", "parent", "shore", "division", "sheet", "substance", "favor", "connect", "post", "spend", "chord", "fat", "glad", "original", "share", "station", "dad", "bread", "charge", "proper", "bar", "offer", "segment", "slave", "duck", "instant", "market", "degree", "populate", "chick", "dear", "enemy", "reply", "drink", "occur", "support", "speech", "nature", "range", "steam", "motion", "path", "liquid", "log", "meant", "quotient", "teeth", "shell", "neck", ], ) # https://gist.github.com/deekayen/4148741 ================================================ FILE: featuretools/primitives/standard/transform/natural_language/count_string.py ================================================ import re import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive class CountString(TransformPrimitive): """Determines how many times a given string shows up in a text field. Args: string (str): The string to determine the count of. Defaults to the word "the". ignore_case (bool): Determines if case of the string should be considered or not. 
Defaults to True. ignore_non_alphanumeric (bool): Determines if non-alphanumeric characters should be ignored in the search. Defaults to False. is_regex (bool): Defines if the string argument is a regex or not. Defaults to False. match_whole_words_only (bool): Determines if whole words should be matched or not. For example, searching for the word `the` in `then, the, there` matches only `the` when this argument is True. Defaults to False. Examples: >>> count_string = CountString(string="the") >>> count_string(["The problem was difficult.", ... "He was there.", ... "The girl went to the store."]).tolist() [1.0, 1.0, 2.0] >>> # Match case of string >>> count_string_ignore_case = CountString(string="the", ignore_case=False) >>> count_string_ignore_case(["The problem was difficult.", ... "He was there.", ... "The girl went to the store."]).tolist() [0.0, 1.0, 1.0] >>> # Ignore non-alphanumeric characters in the search >>> count_string_ignore_non_alphanumeric = CountString(string="the", ... ignore_non_alphanumeric=True) >>> count_string_ignore_non_alphanumeric(["Th*/e problem was difficult.", ... "He was there.", ... "The girl went to the store."]).tolist() [1.0, 1.0, 2.0] >>> # Specify the string as a regex >>> count_string_is_regex = CountString(string="t.e", is_regex=True) >>> count_string_is_regex(["The problem was difficult.", ... "He was there.", ... "The girl went to the store."]).tolist() [1.0, 1.0, 2.0] >>> # Match whole words only >>> count_string_match_whole_words_only = CountString(string="the", ... match_whole_words_only=True) >>> count_string_match_whole_words_only(["The problem was difficult.", ... "He was there.", ...
"The girl went to the store."]).tolist() [1.0, 0.0, 2.0] """ name = "count_string" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) def __init__( self, string="the", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ): self.string = string self.ignore_case = ignore_case self.ignore_non_alphanumeric = ignore_non_alphanumeric self.match_whole_words_only = match_whole_words_only self.is_regex = is_regex # we don't want to strip non alphanumeric characters from the pattern # ie h.ll. should match "hello" so we can't strip the dots to make hll if not is_regex: self.pattern = re.escape(self.process_text(string)) else: self.pattern = string if ignore_case: self.pattern = self.pattern.lower() # \b\b.*\b\b is the same as \b.*\b so we don't have to check if # the pattern is given to us as regex and if it already has leading # and trailing \b's if match_whole_words_only: self.pattern = "\\b" + self.pattern + "\\b" def process_text(self, text): if self.ignore_non_alphanumeric: text = re.sub("[^0-9a-zA-Z ]+", "", text) if self.ignore_case: text = text.lower() return text def get_function(self): def count_string(words): if not isinstance(words, str): return np.nan words = self.process_text(words) return len(re.findall(self.pattern, words)) return np.vectorize(count_string, otypes=[float]) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/mean_characters_per_word.py ================================================ # -*- coding: utf-8 -*- import re import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, NaturalLanguage from featuretools.primitives.base import TransformPrimitive PUNCTUATION = re.escape("!,.:;?") END_OF_SENTENCE_PUNCT_RE = re.compile( rf"[{PUNCTUATION}]+$|[{PUNCTUATION}]+ |[{PUNCTUATION}]+\n", 
) def _mean_characters_per_word(value): if pd.isna(value): return np.nan # replace end-of-sentence punctuation with space value = END_OF_SENTENCE_PUNCT_RE.sub(" ", value) words = value.split() character_count = [len(x) for x in words] return np.mean(character_count) if len(character_count) else 0 class MeanCharactersPerWord(TransformPrimitive): """Determines the mean number of characters per word. Description: Given a list of strings, determine the mean number of characters per word in each string. A word is defined as a series of any characters not separated by white space. Trailing punctuation (any of `!,.:;?` followed by whitespace or the end of the string) is replaced by a space before counting; other punctuation is counted. If a string is `NaN`, return `NaN`. If a string contains no words, return 0. Examples: >>> x = ['This is a test file', 'This is second line', 'third line $1,000'] >>> mean_characters_per_word = MeanCharactersPerWord() >>> mean_characters_per_word(x).tolist() [3.0, 4.0, 5.0] """ name = "mean_characters_per_word" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) default_value = 0 def get_function(self): def mean_characters_per_word(series): return series.apply(_mean_characters_per_word) return mean_characters_per_word ================================================ FILE: featuretools/primitives/standard/transform/natural_language/median_word_length.py ================================================ from numpy import median from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, ) class MedianWordLength(TransformPrimitive): """Determines the median word length. Description: Given a list of strings, determine the median word length in each string. A word is defined as a series of any characters not separated by a delimiter. If a string is empty or `NaN`, return `NaN`.
Args: delimiters_regex (str): Delimiters as a regex string for splitting text into words. Defaults to whitespace characters. Examples: >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None] >>> median_word_length = MedianWordLength() >>> median_word_length(x).tolist() [4.0, 4.0, 5.0, nan] """ name = "median_word_length" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) default_value = 0 def __init__(self, delimiters_regex=DELIMITERS): self.delimiters_regex = delimiters_regex def get_function(self): def get_median(words): if isinstance(words, list): return median([len(word) for word in words if len(word) != 0]) def median_word_length(x): words = x.str.split(self.delimiters_regex) return words.apply(get_median) return median_word_length ================================================ FILE: featuretools/primitives/standard/transform/natural_language/num_characters.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive class NumCharacters(TransformPrimitive): """Calculates the number of characters in a given string, including whitespace and punctuation. Description: Returns the number of characters in a string. This is equivalent to the length of a string. Examples: >>> num_characters = NumCharacters() >>> num_characters(['This is a string', ... 'second item', ... 
'final1']).tolist() [16, 11, 6] """ name = "num_characters" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) description_template = "the number of characters in {}" def get_function(self): def character_counter(array): def _get_num_characters(elem): """Returns the length of elem, or pd.NA given null input""" if pd.isna(elem): return pd.NA return len(elem) return array.apply(_get_num_characters) return character_counter ================================================ FILE: featuretools/primitives/standard/transform/natural_language/num_unique_separators.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive NATURAL_LANGUAGE_SEPARATORS = [" ", ".", ",", "!", "?", ";", "\n"] class NumUniqueSeparators(TransformPrimitive): r"""Calculates the number of unique separators. Description: Given a string and a list of separators, determine the number of unique separators in each string. If a string is null determined by pd.isnull return pd.NA. Args: separators (list, optional): a list of separator characters to count. ``[" ", ".", ",", "!", "?", ";", "\n"]`` is used by default. Examples: >>> x = ["First. Line.", "This. 
is the second, line!", "notinlist@#$%^%&"] >>> num_unique_separators = NumUniqueSeparators([".", ",", "!"]) >>> num_unique_separators(x).tolist() [1, 3, 0] """ name = "num_unique_separators" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) def __init__(self, separators=NATURAL_LANGUAGE_SEPARATORS): assert separators is not None, "separators needs to be defined" self.separators = separators def get_function(self): def count_unique_separator(s): if pd.isnull(s): return pd.NA return len(set(self.separators).intersection(set(s))) def get_separator_count(column): return column.apply(count_unique_separator) return get_separator_count ================================================ FILE: featuretools/primitives/standard/transform/natural_language/num_words.py ================================================ import re from string import punctuation from typing import Optional import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, ) class NumWords(TransformPrimitive): """Determines the number of words in a string. Words are sequences of characters delimited by whitespace. Examples: >>> num_words = NumWords() >>> num_words(['This is a string', ... 'Two words', ... 'no-spaces', ... 'Also works with sentences. 
Second sentence!']).tolist() [4, 2, 1, 6] """ name = "num_words" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) description_template = "the number of words in {}" def get_function(self): def word_counter(array): def _get_number_of_words(elem: Optional[str]): """Returns the number of words in given element, or pd.NA given null input""" if pd.isna(elem): return pd.NA return sum( 1 for word in re.split(DELIMITERS, elem) if word.strip(punctuation) ) return array.apply(_get_number_of_words) return word_counter ================================================ FILE: featuretools/primitives/standard/transform/natural_language/number_of_common_words.py ================================================ from string import punctuation from typing import Iterable import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, common_words_1000, ) class NumberOfCommonWords(TransformPrimitive): """Determines the number of common words in a string. Description: Given a string, determine the number of words that appear in a supplied word set. The word set defaults to common_words_1000 from featuretools.primitives.standard.transform.natural_language.constants. Matching is case-insensitive: each word is lowercased and stripped of surrounding punctuation before lookup, so the word set should consist of only lower case strings. If a string is missing, return `NaN`. Args: word_set (set, optional): The set of words to look for in the string. These words should all be lower case strings. delimiters_regex (str, optional): The regular expression used to determine what separates words. Defaults to whitespace characters. Examples: >>> x = ['Hey! This is some natural language', 'bacon, cheeseburger, AND, fries', 'I! Am.
A; duck?'] >>> number_of_common_words = NumberOfCommonWords(word_set={'and', 'some', 'am', 'a', 'the', 'is', 'i'}) >>> number_of_common_words(x).tolist() [2, 1, 3] >>> x = ['Hey! This is. some. natural language'] >>> number_of_common_words = NumberOfCommonWords(word_set={'hey', 'is', 'some'}, delimiters_regex="[ .]") >>> number_of_common_words(x).tolist() [3] """ name = "number_of_common_words" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__( self, word_set=common_words_1000, delimiters_regex=DELIMITERS, ): self.delimiters_regex = delimiters_regex self.word_set = word_set def get_function(self): def get_num_in_word_bank(words): if not isinstance(words, Iterable): return pd.NA num_common_words = 0 for w in words: if ( w.lower().strip(punctuation) in self.word_set ): # assumes word_set is all lowercase num_common_words += 1 return num_common_words def num_common_words(x): words = x.str.split(self.delimiters_regex) return words.apply(get_num_in_word_bank) return num_common_words ================================================ FILE: featuretools/primitives/standard/transform/natural_language/number_of_hashtags.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class NumberOfHashtags(CountString): """Determines the number of hashtags in a string. Description: Given a list of strings, determine the number of hashtags in each string. 
A hashtag is defined as a string that meets the following criteria: - Starts with a '#' character, followed by a sequence of alphanumeric characters containing at least one alphabetic character - Present at the start of a string or after whitespace - Terminated by the end of the string, a whitespace, or a punctuation character other than '#' - e.g. The string '#yes-no' contains a valid hashtag ('#yes') - e.g. The string '#yes#' does not contain a valid hashtag This implementation handles Unicode characters. This implementation does not impose any character limit on hashtags. If a string is missing, return `NaN`. Examples: >>> x = ['#regular #expression', 'this is a string', '###__regular#1and_0#expression'] >>> number_of_hashtags = NumberOfHashtags() >>> number_of_hashtags(x).tolist() [2.0, 0.0, 0.0] """ name = "number_of_hashtags" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self): pattern = r"((^#)|\s#)(\w*([^\W\d])+\w*)(?![#\w])" super().__init__(string=pattern, is_regex=True, ignore_case=False) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/number_of_mentions.py ================================================ import re import string from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class NumberOfMentions(CountString): """Determines the number of mentions in a string. Description: Given a list of strings, determine the number of mentions in each string. 
A mention is defined as a string that meets the following criteria: - Starts with a '@' character, followed by a sequence of alphanumeric characters - Present at the start of a string or after whitespace - Terminated by the end of the string, a whitespace, or a punctuation character other than '@' - e.g. The string '@yes-no' contains a valid mention ('@yes') - e.g. The string '@yes@' does not contain a valid mention This implementation handles Unicode characters. This implementation does not impose any character limit on mentions. If a string is missing, return `NaN`. Examples: >>> x = ['@user1 @user2', 'this is a string', '@@@__user1@1and_0@expression'] >>> number_of_mentions = NumberOfMentions() >>> number_of_mentions(x).tolist() [2.0, 0.0, 0.0] """ name = "number_of_mentions" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self): SPECIALS_MINUS_AT = "".join(list(set(string.punctuation) - {"@"})) SPECIALS_MINUS_AT = re.escape(SPECIALS_MINUS_AT) pattern = rf"((^@)|(\s+@))(\w+)(?=\s|$|[{SPECIALS_MINUS_AT}])" super().__init__(string=pattern, is_regex=True, ignore_case=False) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/number_of_unique_words.py ================================================ from string import punctuation from typing import Iterable import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, ) class NumberOfUniqueWords(TransformPrimitive): """Determines the number of unique words in a string. Description: Determines the number of unique words in a given string. Includes options for case-insensitive behavior. 
Args: case_insensitive (bool, optional): Specify case_insensitivity when searching for unique words. For example, setting this to True would mean "WORD word" would be treated as having one unique word. Defaults to False. Examples: >>> x = ['Word word Word', 'This is a SENTENCE.', 'green red green'] >>> number_of_unique_words = NumberOfUniqueWords() >>> number_of_unique_words(x).tolist() [2, 4, 2] >>> x = ['word WoRD WORD worD', 'dog dog dog', 'catt CAT caT'] >>> number_of_unique_words = NumberOfUniqueWords(case_insensitive=True) >>> number_of_unique_words(x).tolist() [1, 1, 2] """ name = "number_of_unique_words" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self, case_insensitive=False): self.case_insensitive = case_insensitive def get_function(self): def _unique_word_helper(text): if not isinstance(text, Iterable): return pd.NA unique = set() for t in text: punct_less = t.strip(punctuation) if len(punct_less) > 0: unique.add(punct_less) return len(unique) def num_unique_words(array): if self.case_insensitive: array = array.str.lower() array = array.str.split(f"{DELIMITERS}") return array.apply(_unique_word_helper) return num_unique_words ================================================ FILE: featuretools/primitives/standard/transform/natural_language/number_of_words_in_quotes.py ================================================ import re from string import punctuation import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, ) class NumberOfWordsInQuotes(TransformPrimitive): """Determines the number of words in quotes in a string. 
Description: Given a list of strings, determine the number of words in quotes in each string. This implementation handles Unicode characters. If a string is missing, return `NaN`. Args: quote_type (str, optional): Specifies what type of quotation marks to match. Argument "single" matches on only single quotes (' '). Argument "double" matches words between double quotes (" "). Argument "both" matches words between either type of quotes. Defaults to "both". Examples: >>> x = ['"python" java prolog "Diffie-Hellman" "4.99"', "Reach me at 'user@email.com'", "'Here's an interesting example!'"] >>> number_of_words_in_quotes = NumberOfWordsInQuotes() >>> number_of_words_in_quotes(x).tolist() [3, 1, 4] """ name = "number_of_words_in_quotes" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self, quote_type="both"): if quote_type not in ["both", "single", "double"]: raise ValueError( f"{quote_type} is not a valid quote_type. 
Specify 'both', 'single', or 'double'", ) self.quote_type = quote_type IN_DOUBLE_QUOTES = r'((^|\W)"(.)*?"(?!\w))' IN_SINGLE_QUOTES = r"((^|\W)'(.)*?'(?!\w))" if quote_type == "double": self.regex = IN_DOUBLE_QUOTES elif quote_type == "single": self.regex = IN_SINGLE_QUOTES else: self.regex = f"({IN_SINGLE_QUOTES}|{IN_DOUBLE_QUOTES})" def get_function(self): def count_words_in_quotes(text): if pd.isnull(text): return pd.NA matches = re.findall(self.regex, text, re.DOTALL) count = 0 for match in matches: matched_phrase = match[0] words = re.split(f"{DELIMITERS}", matched_phrase) for word in words: if len(word.strip(punctuation + " ")): count += 1 return count def num_words_in_quotes(array): return array.apply(count_words_in_quotes).astype("Int64") return num_words_in_quotes ================================================ FILE: featuretools/primitives/standard/transform/natural_language/punctuation_count.py ================================================ # -*- coding: utf-8 -*- import re import string from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class PunctuationCount(CountString): """Determines number of punctuation characters in a string. Description: Given list of strings, determine the number of punctuation characters in each string. Looks for any of the following: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ If a string is missing, return `NaN`. 
Examples: >>> x = ['This is a test file.', 'This is second line', 'third line: $1,000'] >>> punctuation_count = PunctuationCount() >>> punctuation_count(x).tolist() [1.0, 0.0, 3.0] """ name = "punctuation_count" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self): pattern = "(%s)" % "|".join([re.escape(x) for x in string.punctuation]) super().__init__(string=pattern, is_regex=True, ignore_case=False) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/title_word_count.py ================================================ # -*- coding: utf-8 -*- from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class TitleWordCount(CountString): """Determines the number of title words in a string. Description: Given list of strings, determine the number of title words in each string. A title word is defined as any word starting with a capital letter. Words at the start of a sentence will be counted. If a string is missing, return `NaN`. 
Examples: >>> x = ['My favorite movie is Jaws.', 'this is a string', 'AAA'] >>> title_word_count = TitleWordCount() >>> title_word_count(x).tolist() [2.0, 0.0, 1.0] """ name = "title_word_count" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self): pattern = r"([A-Z][^\s]*)" super().__init__(string=pattern, is_regex=True, ignore_case=False) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/total_word_length.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( PUNCTUATION_AND_WHITESPACE, ) class TotalWordLength(TransformPrimitive): """Determines the total word length. Description: Given a list of strings, determine the total word length in each string. A word is defined as a series of any characters not separated by a delimiter. If a string is empty or `NaN`, return `NaN`. Args: do_not_count (str): Regex string matching the characters to exclude from the total length. Defaults to punctuation and whitespace characters.
Examples: >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None] >>> total_word_length = TotalWordLength() >>> total_word_length(x).tolist() [15.0, 16.0, 13.0, nan] """ name = "total_word_length" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self, do_not_count=PUNCTUATION_AND_WHITESPACE): self.do_not_count = do_not_count def get_function(self): def total_word_length(x): return x.str.len() - x.str.count(self.do_not_count) return total_word_length ================================================ FILE: featuretools/primitives/standard/transform/natural_language/upper_case_count.py ================================================ # -*- coding: utf-8 -*- from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class UpperCaseCount(CountString): """Calculates the number of upper case letters in text. Description: Given a list of strings, determine the number of characters in each string that are capitalized. Counts every letter individually, not just every word that contains capitalized letters. 
If a string is missing, return `NaN` Examples: >>> x = ['This IS a string.', 'This is a string', 'aaa'] >>> upper_case_count = UpperCaseCount() >>> upper_case_count(x).tolist() [3.0, 1.0, 0.0] """ name = "upper_case_count" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def __init__(self): pattern = r"([A-Z])" super().__init__(string=pattern, is_regex=True, ignore_case=False) ================================================ FILE: featuretools/primitives/standard/transform/natural_language/upper_case_word_count.py ================================================ import re from string import punctuation import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import IntegerNullable, NaturalLanguage from featuretools.primitives.base import TransformPrimitive from featuretools.primitives.standard.transform.natural_language.constants import ( DELIMITERS, ) class UpperCaseWordCount(TransformPrimitive): """Determines the number of words in a string that are entirely capitalized. Description: Given list of strings, determine the number of words in each string that are entirely capitalized. If a string is missing, return `NaN`. 
Examples: >>> x = ['This IS a string.', 'This is a string', 'AAA'] >>> upper_case_word_count = UpperCaseWordCount() >>> upper_case_word_count(x).tolist() [1, 0, 1] """ name = "upper_case_word_count" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) default_value = 0 def get_function(self): def upper_case_word_count(x): def _count_upper_case_words(elem): if pd.isna(elem): return pd.NA return sum( 1 for word in re.split(DELIMITERS, elem) if word.strip(punctuation) and word.upper() == word ) return x.apply(_count_upper_case_words) return upper_case_word_count ================================================ FILE: featuretools/primitives/standard/transform/natural_language/whitespace_count.py ================================================ from featuretools.primitives.standard.transform.natural_language.count_string import ( CountString, ) class WhitespaceCount(CountString): """Calculates the number of whitespaces in a string. Description: Given a list of strings, determine the number of whitespaces in each string. If a string is missing, return `NaN`. Examples: >>> x = ['', 'hi im ethan', 'multiple    spaces'] >>> whitespace_count = WhitespaceCount() >>> whitespace_count(x).tolist() [0.0, 2.0, 4.0] """ name = "whitespace_count" default_value = 0 def __init__(self): super().__init__(string=" ") ================================================ FILE: featuretools/primitives/standard/transform/not_primitive.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools.primitives.base import TransformPrimitive class Not(TransformPrimitive): """Negates a boolean value.
Examples: >>> not_func = Not() >>> not_func([True, True, False]).tolist() [False, False, True] """ name = "not" input_types = [ [ColumnSchema(logical_type=Boolean)], [ColumnSchema(logical_type=BooleanNullable)], ] return_type = ColumnSchema(logical_type=BooleanNullable) description_template = "the negation of {}" def generate_name(self, base_feature_names): return "NOT({})".format(base_feature_names[0]) def get_function(self): return np.logical_not ================================================ FILE: featuretools/primitives/standard/transform/nth_week_of_month.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base import TransformPrimitive class NthWeekOfMonth(TransformPrimitive): """Determines the nth week of the month from a given date. Description: Converts a datetime to a float representing the week of the month in which the date falls. The first day of the month starts week 1, and the week number is incremented each Sunday. Examples: >>> from datetime import datetime >>> nth_week_of_month = NthWeekOfMonth() >>> times = [datetime(2019, 3, 1), ... datetime(2019, 3, 3), ... datetime(2019, 3, 31), ...
datetime(2019, 3, 30)] >>> nth_week_of_month(times).tolist() [1.0, 2.0, 6.0, 5.0] """ name = "nth_week_of_month" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) def get_function(self): def nth_week_of_month(x): df = pd.DataFrame({"date": x}) df["first_day"] = df.date - pd.to_timedelta(df["date"].dt.day - 1, unit="d") df["dom"] = df.date.dt.day df["first_day_weekday"] = df.first_day.dt.weekday df["adjusted_dom"] = df.dom + df.first_day_weekday + 1 df.loc[df["first_day_weekday"].astype(float) == 6.0, "adjusted_dom"] = df[ "dom" ] df["week_of_month"] = np.ceil(df.adjusted_dom / 7.0) return df.week_of_month.values return nth_week_of_month ================================================ FILE: featuretools/primitives/standard/transform/numeric/__init__.py ================================================ from featuretools.primitives.standard.transform.numeric.absolute import Absolute from featuretools.primitives.standard.transform.numeric.cosine import Cosine from featuretools.primitives.standard.transform.numeric.diff import Diff from featuretools.primitives.standard.transform.numeric.natural_logarithm import ( NaturalLogarithm, ) from featuretools.primitives.standard.transform.numeric.negate import Negate from featuretools.primitives.standard.transform.numeric.percentile import Percentile from featuretools.primitives.standard.transform.numeric.rate_of_change import ( RateOfChange, ) from featuretools.primitives.standard.transform.numeric.same_as_previous import ( SameAsPrevious, ) from featuretools.primitives.standard.transform.numeric.sine import Sine from featuretools.primitives.standard.transform.numeric.square_root import SquareRoot from featuretools.primitives.standard.transform.numeric.tangent import Tangent ================================================ FILE: featuretools/primitives/standard/transform/numeric/absolute.py ================================================ import numpy 
as np from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class Absolute(TransformPrimitive): """Computes the absolute value of a number. Examples: >>> absolute = Absolute() >>> absolute([3.0, -5.0, -2.4]).tolist() [3.0, 5.0, 2.4] """ name = "absolute" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the absolute value of {}" def get_function(self): return np.absolute ================================================ FILE: featuretools/primitives/standard/transform/numeric/cosine.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class Cosine(TransformPrimitive): """Computes the cosine of a number. Examples: >>> cos = Cosine() >>> cos([0.0, np.pi/2.0, np.pi]).tolist() [1.0, 6.123233995736766e-17, -1.0] """ name = "cosine" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the cosine of {}" def get_function(self): return np.cos ================================================ FILE: featuretools/primitives/standard/transform/numeric/diff.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class Diff(TransformPrimitive): """Computes the difference between the value in a list and the previous value in that list. Args: periods (int): The number of periods by which to shift the index row. Default is 0. Periods correspond to rows. Description: Given a list of values, compute the difference from the previous item in the list. The result for the first element of the list will always be `NaN`. 
Examples: >>> diff = Diff() >>> values = [1, 10, 3, 4, 15] >>> diff(values).tolist() [nan, 9.0, -7.0, 1.0, 11.0] You can specify the number of periods to shift the values >>> values = [1, 2, 4, 7, 11, 16] >>> diff_periods = Diff(periods = 1) >>> diff_periods(values).tolist() [nan, nan, 1.0, 2.0, 3.0, 4.0] """ name = "diff" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the difference from the previous value of {}" def __init__(self, periods=0): self.periods = periods def get_function(self): def pd_diff(values): return values.shift(self.periods).diff() return pd_diff ================================================ FILE: featuretools/primitives/standard/transform/numeric/natural_logarithm.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class NaturalLogarithm(TransformPrimitive): """Computes the natural logarithm of a number. Examples: >>> log = NaturalLogarithm() >>> results = log([1.0, np.e]).tolist() >>> results = [round(x, 2) for x in results] >>> results [0.0, 1.0] """ name = "natural_logarithm" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the natural logarithm of {}" def get_function(self): return np.log ================================================ FILE: featuretools/primitives/standard/transform/numeric/negate.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class Negate(TransformPrimitive): """Negates a numeric value. 
Examples: >>> negate = Negate() >>> negate([1.0, 23.2, -7.0]).tolist() [-1.0, -23.2, 7.0] """ name = "negate" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the negation of {}" def get_function(self): def negate(vals): return vals * -1 return negate def generate_name(self, base_feature_names): return "-(%s)" % (base_feature_names[0]) ================================================ FILE: featuretools/primitives/standard/transform/numeric/percentile.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class Percentile(TransformPrimitive): """Determines the percentile rank for each value in a list. Examples: >>> percentile = Percentile() >>> percentile([10, 15, 1, 20]).tolist() [0.5, 0.75, 0.25, 1.0] Nan values are ignored when determining rank >>> percentile([10, 15, 1, None, 20]).tolist() [0.5, 0.75, 0.25, nan, 1.0] """ name = "percentile" uses_full_dataframe = True input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) description_template = "the percentile rank of {}" def get_function(self): return lambda array: array.rank(pct=True) ================================================ FILE: featuretools/primitives/standard/transform/numeric/rate_of_change.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base import TransformPrimitive class RateOfChange(TransformPrimitive): """Computes the rate of change of a value per second. 
Examples: >>> import pandas as pd >>> rate_of_change = RateOfChange() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> results = rate_of_change([0, 30, 180, -90, 0], times).tolist() >>> results = [round(x, 2) for x in results] >>> results [nan, 0.5, 2.5, -4.5, 1.5] """ name = "rate_of_change" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True description_template = "the rate of change of {} per second" def get_function(self): def rate_of_change(values, time): time_delta = time.diff().dt.total_seconds() value_delta = values.diff() return value_delta / time_delta return rate_of_change ================================================ FILE: featuretools/primitives/standard/transform/numeric/same_as_previous.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import BooleanNullable from featuretools.primitives.base import TransformPrimitive class SameAsPrevious(TransformPrimitive): """Determines if a value is equal to the previous value in a list. Description: Compares a value in a list to the previous value and returns True if the value is equal to the previous value or False otherwise. The first item in the output will always be False, since there is no previous element for the first element comparison. Any nan values in the input will be filled using either a forward-fill or backward-fill method, specified by the fill_method argument. The number of consecutive nan values that get filled can be limited with the limit argument. Any nan values left after filling will result in False being returned for any comparison involving the nan value. Args: fill_method (str): Method for filling gaps in series. Valid options are `backfill`, `bfill`, `pad`, `ffill`. 
`pad / ffill`: fill gap with last valid observation. `backfill / bfill`: fill gap with next valid observation. Default is `pad`. limit (int): The max number of consecutive NaN values in a gap that can be filled. Default is None. Examples: >>> same_as_previous = SameAsPrevious() >>> same_as_previous([1, 2, 2, 4]).tolist() [False, False, True, False] The fill method for nan values can be specified >>> same_as_previous_fillna = SameAsPrevious(fill_method="bfill") >>> same_as_previous_fillna([1, None, 2, 4]).tolist() [False, False, True, False] The number of nan values that are filled can be limited >>> same_as_previous_limitfill = SameAsPrevious(limit=2) >>> same_as_previous_limitfill([1, None, None, None, 2, 3]).tolist() [False, True, True, False, False, False] """ name = "same_as_previous" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(BooleanNullable) def __init__(self, fill_method="pad", limit=None): if fill_method not in ["backfill", "bfill", "pad", "ffill"]: raise ValueError("Invalid fill_method") self.fill_method = fill_method self.limit = limit def get_function(self): def same_as_previous(x): x = x.fillna(method=self.fill_method, limit=self.limit) x = x.eq(x.shift()) # first value will always be false, since there is no previous value x.iloc[0] = False return x return same_as_previous ================================================ FILE: featuretools/primitives/standard/transform/numeric/sine.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class Sine(TransformPrimitive): """Computes the sine of a number. 
Examples: >>> sin = Sine() >>> sin([-np.pi/2.0, 0.0, np.pi/2.0]).tolist() [-1.0, 0.0, 1.0] """ name = "sine" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the sine of {}" def get_function(self): return np.sin ================================================ FILE: featuretools/primitives/standard/transform/numeric/square_root.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class SquareRoot(TransformPrimitive): """Computes the square root of a number. Examples: >>> sqrt = SquareRoot() >>> sqrt([9.0, 16.0, 4.0]).tolist() [3.0, 4.0, 2.0] """ name = "square_root" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the square root of {}" def get_function(self): return np.sqrt ================================================ FILE: featuretools/primitives/standard/transform/numeric/tangent.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class Tangent(TransformPrimitive): """Computes the tangent of a number. 
Examples: >>> tan = Tangent() >>> tan([-np.pi, 0.0, np.pi/2.0]).tolist() [1.2246467991473532e-16, 0.0, 1.633123935319537e+16] """ name = "tangent" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) description_template = "the tangent of {}" def get_function(self): return np.tan ================================================ FILE: featuretools/primitives/standard/transform/percent_change.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class PercentChange(TransformPrimitive): """Determines the percent difference between values in a list. Description: Given a list of numbers, return the percent difference between each subsequent number. Percentages are shown in decimal form (not multiplied by 100). Uses pandas' pct_change function. Args: periods (int): Periods to shift for calculating percent change. Default is 1. fill_method (str): Method for filling gaps in reindexed Series. Valid options are `backfill`, `bfill`, `pad`, `ffill`. `pad / ffill`: fill gap with last valid observation. `backfill / bfill`: fill gap with next valid observation. Default is `pad`. limit (int): The max number of consecutive NaN values in a gap that can be filled. Default is None. freq (DateOffset, timedelta, or offset alias string): If `freq` is specified, instead of calculating change between subsequent points, PercentChange will calculate change between points with a certain interval between their date indices. `freq` defines the desired interval. When freq is used, the resulting index will also be filled to include any missing dates from the specified interval. If the index is not date/datetime and freq is used, it will raise a NotImplementedError. If freq is None, no changes will be applied. Default is None.
Examples: >>> percent_change = PercentChange() >>> percent_change([2, 5, 15, 3, 3, 9, 4.5]).to_list() [nan, 1.5, 2.0, -0.8, 0.0, 2.0, -0.5] We can control the number of periods to return the percent difference between points further from one another. >>> percent_change_2 = PercentChange(periods=2) >>> percent_change_2([2, 5, 15, 3, 3, 9, 4.5]).to_list() [nan, nan, 6.5, -0.4, -0.8, 2.0, 0.5] We can control the method used to handle gaps in data. >>> percent_change = PercentChange() >>> percent_change([2, 4, 8, None, 16, None, 32, None]).to_list() [nan, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0] >>> percent_change_backfill = PercentChange(fill_method='backfill') >>> percent_change_backfill([2, 4, 8, None, 16, None, 32, None]).to_list() [nan, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, nan] We can also control the maximum number of NaN values to fill in a gap. >>> percent_change = PercentChange() >>> percent_change([2, None, None, None, 4]).to_list() [nan, 0.0, 0.0, 0.0, 1.0] >>> percent_change_limited = PercentChange(limit=2) >>> percent_change_limited([2, None, None, None, 4]).to_list() [nan, 0.0, 0.0, nan, nan] Finally, we can specify a date frequency on which to calculate percent change. 
>>> import pandas as pd >>> dates = pd.DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-05']) >>> x_indexed = pd.Series([1, 2, 3, 4], index=dates) >>> percent_change = PercentChange() >>> percent_change(x_indexed).to_list() [nan, 1.0, 0.5, 0.33333333333333326] >>> date_offset = pd.tseries.offsets.DateOffset(days=1) >>> percent_change_freq = PercentChange(freq=date_offset) >>> percent_change_freq(x_indexed).to_list() [nan, 1.0, 0.5, nan] """ name = "percent_change" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) def __init__(self, periods=1, fill_method="pad", limit=None, freq=None): if fill_method not in ["backfill", "bfill", "pad", "ffill"]: raise ValueError("Invalid fill_method") self.periods = periods self.fill_method = fill_method self.limit = limit self.freq = freq def get_function(self): def percent_change(data): return data.pct_change( self.periods, self.fill_method, self.limit, self.freq, ) return percent_change ================================================ FILE: featuretools/primitives/standard/transform/postal/__init__.py ================================================ from featuretools.primitives.standard.transform.postal.one_digit_postal_code import ( OneDigitPostalCode, ) from featuretools.primitives.standard.transform.postal.two_digit_postal_code import ( TwoDigitPostalCode, ) ================================================ FILE: featuretools/primitives/standard/transform/postal/one_digit_postal_code.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, PostalCode from featuretools.primitives.base import TransformPrimitive class OneDigitPostalCode(TransformPrimitive): """Returns the one digit prefix of a given postal code. Description: Given a list of postal codes, returns the one digit prefix for each postal code. 
Examples: >>> one_digit_postal_code = OneDigitPostalCode() >>> one_digit_postal_code(['92432', '34514']).tolist() ['9', '3'] """ name = "one_digit_postal_code" input_types = [ColumnSchema(logical_type=PostalCode)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) description_template = "The one digit postal code prefix of {}" def get_function(self): def one_digit_postal_code(postal_codes): def transform_postal_code(pc): return str(pc)[0] if pd.notna(pc) else pd.NA return postal_codes.apply(transform_postal_code) return one_digit_postal_code ================================================ FILE: featuretools/primitives/standard/transform/postal/two_digit_postal_code.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, PostalCode from featuretools.primitives.base import TransformPrimitive class TwoDigitPostalCode(TransformPrimitive): """Returns the two digit prefix of a given postal code. Description: Given a list of postal codes, returns the two digit prefix for each postal code. 
Examples: >>> two_digit_postal_code = TwoDigitPostalCode() >>> two_digit_postal_code(['92432', '34514']).tolist() ['92', '34'] """ name = "two_digit_postal_code" input_types = [ColumnSchema(logical_type=PostalCode)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) description_template = "The two digit postal code prefix of {}" def get_function(self): def two_digit_postal_code(postal_codes): def transform_postal_code(pc): return str(pc)[:2] if pd.notna(pc) else pd.NA return postal_codes.apply(transform_postal_code) return two_digit_postal_code ================================================ FILE: featuretools/primitives/standard/transform/savgol_filter.py ================================================ from math import floor import numpy as np from scipy.signal import savgol_coeffs, savgol_filter from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Double from featuretools.primitives.base import TransformPrimitive class SavgolFilter(TransformPrimitive): """Applies a Savitzky-Golay filter to a list of values. Description: Given a list of values, return a smoothed list which increases the signal to noise ratio without greatly distorting the signal. Uses the `Savitzky–Golay filter` method. If the input list has fewer than 20 values, it will be returned as is. See the following page for more info: https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.signal.savgol_filter.html Args: window_length (int): The length of the filter window (i.e. the number of coefficients). `window_length` must be a positive odd integer. polyorder (int): The order of the polynomial used to fit the samples. `polyorder` must be less than `window_length`. deriv (int): Optional. The order of the derivative to compute. This must be a nonnegative integer. The default is 0, which means to filter the data without differentiating. delta (float): Optional. The spacing of the samples to which the filter will be applied.
This is only used if deriv > 0. Default is 1.0. mode (str): Optional. Must be 'mirror', 'constant', 'nearest', 'wrap' or 'interp'. This determines the type of extension to use for the padded signal to which the filter is applied. When `mode` is 'constant', the padding value is given by `cval`. See the Notes for more details on 'mirror', 'constant', 'wrap', and 'nearest'. When the 'interp' mode is selected (the default), no extension is used. Instead, a degree `polyorder` polynomial is fit to the last `window_length` values of the edges, and this polynomial is used to evaluate the last `window_length // 2` output values. cval (scalar): Optional. Value to fill past the edges of the input if `mode` is 'constant'. Default is 0.0. Examples: >>> savgol_filter = SavgolFilter() >>> data = [0, 1, 1, 2, 3, 4, 5, 7, 8, 7, 9, 9, 12, 11, 12, 14, 15, 17, 17, 17, 20] >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]] [0.0429, 0.8286, 1.2571] We can control `window_length` and `polyorder` of the filter. >>> savgol_filter = SavgolFilter(window_length=13, polyorder=3) >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]] [-0.0962, 0.6484, 1.4451] We can also control the `deriv` and `delta` parameters. >>> savgol_filter = SavgolFilter(deriv=1, delta=1.5) >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]] [0.754, 0.3492, 0.2778] Finally, we can use `mode` to control how edge values are handled. 
>>> savgol_filter = SavgolFilter(mode='constant', cval=5) >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]] [1.5429, 0.2286, 1.2571] """ name = "savgol_filter" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) def __init__( self, window_length=None, polyorder=None, deriv=0, delta=1.0, mode="interp", cval=0.0, ): if window_length is not None and polyorder is not None: try: if mode not in ["mirror", "constant", "nearest", "interp", "wrap"]: raise ValueError( "mode must be 'mirror', 'constant', " "'nearest', 'wrap' or 'interp'.", ) savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta) except Exception: raise elif (window_length is None and polyorder is not None) or ( window_length is not None and polyorder is None ): error_text = ( "Both window_length and polyorder must be defined if you define one." ) raise ValueError(error_text) self.window_length = window_length self.polyorder = polyorder self.deriv = deriv self.delta = delta self.mode = mode self.cval = cval def get_function(self): def smooth(x): if x.shape[0] < 20: return x if np.isnan(np.min(x)): # interpolate the nan values, works for edges & middle nans mask = np.isnan(x) x[mask] = np.interp( np.flatnonzero(mask), np.flatnonzero(~mask), x[~mask], ) window_length = self.window_length polyorder = self.polyorder if window_length is None and polyorder is None: window_length = floor(len(x) / 10) * 2 + 1 polyorder = 3 return savgol_filter( x, window_length=window_length, polyorder=polyorder, deriv=self.deriv, delta=self.delta, mode=self.mode, cval=self.cval, ) return smooth ================================================ FILE: featuretools/primitives/standard/transform/time_series/__init__.py ================================================ from featuretools.primitives.standard.transform.time_series.lag import Lag from featuretools.primitives.standard.transform.time_series.numeric_lag import ( NumericLag, ) from 
featuretools.primitives.standard.transform.time_series.rolling_count import ( RollingCount, ) from featuretools.primitives.standard.transform.time_series.rolling_max import ( RollingMax, ) from featuretools.primitives.standard.transform.time_series.rolling_mean import ( RollingMean, ) from featuretools.primitives.standard.transform.time_series.rolling_min import ( RollingMin, ) from featuretools.primitives.standard.transform.time_series.rolling_outlier_count import ( RollingOutlierCount, ) from featuretools.primitives.standard.transform.time_series.rolling_std import ( RollingSTD, ) from featuretools.primitives.standard.transform.time_series.rolling_trend import ( RollingTrend, ) from featuretools.primitives.standard.transform.time_series.expanding import ( ExpandingCount, ExpandingMax, ExpandingMean, ExpandingMin, ExpandingSTD, ExpandingTrend, ) ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/__init__.py ================================================ from featuretools.primitives.standard.transform.time_series.expanding.expanding_count import ( ExpandingCount, ) from featuretools.primitives.standard.transform.time_series.expanding.expanding_max import ( ExpandingMax, ) from featuretools.primitives.standard.transform.time_series.expanding.expanding_mean import ( ExpandingMean, ) from featuretools.primitives.standard.transform.time_series.expanding.expanding_min import ( ExpandingMin, ) from featuretools.primitives.standard.transform.time_series.expanding.expanding_std import ( ExpandingSTD, ) from featuretools.primitives.standard.transform.time_series.expanding.expanding_trend import ( ExpandingTrend, ) ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_count.py ================================================ import numpy as np from woodwork.column_schema import ColumnSchema from woodwork.logical_types import 
Datetime, IntegerNullable from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) class ExpandingCount(TransformPrimitive): """Computes the expanding count of events over a given window. Description: Given a list of datetimes, returns an expanding count starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. Examples: >>> import pandas as pd >>> expanding_count = ExpandingCount() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_count(times).tolist() [nan, 1.0, 2.0, 3.0, 4.0] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_count = ExpandingCount(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_count(times).tolist() [1.0, 2.0, 3.0, 4.0, 5.0] We can also control the minimum number of periods required for the expanding calculation.
>>> import pandas as pd >>> expanding_count = ExpandingCount(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_count(times).tolist() [nan, nan, nan, 3.0, 4.0] """ name = "expanding_count" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_count(datetime_series): datetime_series = _apply_gap_for_expanding_primitives( datetime_series, self.gap, ) count_series = datetime_series.expanding( min_periods=self.min_periods, ).count() num_nans = self.gap + self.min_periods - 1 count_series[range(num_nans)] = np.nan return count_series return expanding_count ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_max.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) class ExpandingMax(TransformPrimitive): """Computes the expanding maximum of events over a given window. Description: Given a list of datetimes, returns an expanding maximum starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. 
min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. Examples: >>> import pandas as pd >>> expanding_max = ExpandingMax() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist() [nan, 2.0, 4.0, 6.0, 7.0] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_max = ExpandingMax(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist() [2.0, 4.0, 6.0, 7.0, 7.0] We can also control the minimum number of periods required for the expanding calculation. >>> import pandas as pd >>> expanding_max = ExpandingMax(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist() [nan, nan, nan, 6.0, 7.0] """ name = "expanding_max" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_max(datetime, numeric): x = pd.Series(numeric.values, index=datetime) x = _apply_gap_for_expanding_primitives(x, self.gap) return x.expanding(min_periods=self.min_periods).max().values return expanding_max ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_mean.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, )
class ExpandingMean(TransformPrimitive): """Computes the expanding mean of events over a given window. Description: Given a list of numbers and a corresponding list of datetimes, returns an expanding mean of the numeric values starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. Examples: >>> import pandas as pd >>> expanding_mean = ExpandingMean() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist() [nan, 5.0, 4.5, 4.0, 3.5] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_mean = ExpandingMean(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist() [5.0, 4.5, 4.0, 3.5, 3.0] We can also control the minimum number of periods required for the expanding calculation.
>>> import pandas as pd >>> expanding_mean = ExpandingMean(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist() [nan, nan, nan, 4.0, 3.5] """ name = "expanding_mean" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_mean(datetime, numeric): x = pd.Series(numeric.values, index=datetime) x = _apply_gap_for_expanding_primitives(x, self.gap) return x.expanding(min_periods=self.min_periods).mean().values return expanding_mean ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_min.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) class ExpandingMin(TransformPrimitive): """Computes the expanding minimum of events over a given window. Description: Given a list of datetimes, returns an expanding minimum starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. 
Examples: >>> import pandas as pd >>> expanding_min = ExpandingMin() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist() [nan, 5.0, 4.0, 3.0, 2.0] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_min = ExpandingMin(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist() [5.0, 4.0, 3.0, 2.0, 1.0] We can also control the minimum number of periods required for the expanding calculation. >>> import pandas as pd >>> expanding_min = ExpandingMin(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist() [nan, nan, nan, 3.0, 2.0] """ name = "expanding_min" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_min(datetime, numeric): x = pd.Series(numeric.values, index=datetime) x = _apply_gap_for_expanding_primitives(x, self.gap) return x.expanding(min_periods=self.min_periods).min().values return expanding_min ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_std.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) class ExpandingSTD(TransformPrimitive): """Computes the expanding standard deviation for events over a given window.
Description: Given a list of numbers and a corresponding list of datetimes, returns the expanding standard deviation of the numeric values starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. Examples: >>> import pandas as pd >>> expanding_std = ExpandingSTD() >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist() >>> [round(x, 2) for x in ans] [nan, nan, 0.71, 1.0, 1.29] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_std = ExpandingSTD(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist() >>> [round(x, 2) for x in ans] [nan, 0.71, 1.0, 1.29, 1.58] We can also control the minimum number of periods required for the expanding calculation.
>>> import pandas as pd >>> expanding_std = ExpandingSTD(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist() >>> [round(x, 2) for x in ans] [nan, nan, nan, 1.0, 1.29] """ name = "expanding_std" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_std(datetime, numeric): x = pd.Series(numeric.values, index=datetime) x = _apply_gap_for_expanding_primitives(x, self.gap) return x.expanding(min_periods=self.min_periods).std().values return expanding_std ================================================ FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_trend.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) from featuretools.utils import calculate_trend class ExpandingTrend(TransformPrimitive): """Computes the expanding trend for events over a given window. Description: Given a list of datetimes, returns the expanding trend starting at the row `gap` rows away from the current row. An expanding primitive calculates the value of a primitive for a given time with all the data available up to the corresponding point in time. Input datetimes should be monotonic. Args: gap (int, optional): Specifies a gap backwards from each instance before the usable data begins. Corresponds to number of rows. Defaults to 1. 
min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Defaults to 1. Examples: >>> import pandas as pd >>> expanding_trend = ExpandingTrend() >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5) >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist() >>> [round(x, 2) for x in ans] [nan, nan, nan, -1.0, -1.0] We can also control the gap before the expanding calculation. >>> import pandas as pd >>> expanding_trend = ExpandingTrend(gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5) >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist() >>> [round(x, 2) for x in ans] [nan, nan, -1.0, -1.0, -1.0] We can also control the minimum number of periods required for the expanding calculation. >>> import pandas as pd >>> expanding_trend = ExpandingTrend(min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> ans = expanding_trend(times, [50, 4, 13, 22, 10]).tolist() >>> [round(x, 2) for x in ans] [nan, nan, nan, -18.5, -7.5] """ name = "expanding_trend" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, gap=1, min_periods=1): self.gap = gap self.min_periods = min_periods def get_function(self): def expanding_trend(datetime, numeric): x = pd.Series(numeric.values, index=datetime) x = _apply_gap_for_expanding_primitives(x, self.gap) return ( x.expanding(min_periods=self.min_periods) .aggregate(calculate_trend) .values ) return expanding_trend ================================================ FILE: featuretools/primitives/standard/transform/time_series/lag.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from
featuretools.primitives.base import TransformPrimitive class Lag(TransformPrimitive): """Shifts an array of values by a specified number of periods. Args: periods (int): The number of periods by which to shift the input. Default is 1. Periods correspond to rows. Examples: >>> lag = Lag() >>> lag([1, 2, 3, 4, 5], pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D'))).tolist() [nan, 1.0, 2.0, 3.0, 4.0] You can specify the number of periods to shift the values >>> lag_periods = Lag(periods=3) >>> lag_periods([True, False, False, True, True], pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D'))).tolist() [nan, nan, nan, True, False] """ # Note: with pandas 1.5.0, using Lag with a string input will result in `None` values # being introduced instead of `nan` values that were present in previous versions. # All missing values will be replaced by `np.nan` (for Double) or `pd.NA` (all other types) # once Woodwork is initialized on the feature matrix. name = "lag" input_types = [ [ ColumnSchema(semantic_tags={"category"}), ColumnSchema(semantic_tags={"time_index"}), ], [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"time_index"}), ], [ ColumnSchema(logical_type=Boolean), ColumnSchema(semantic_tags={"time_index"}), ], [ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(semantic_tags={"time_index"}), ], ] return_type = None uses_full_dataframe = True def __init__(self, periods=1): self.periods = periods def get_function(self): def lag(input_col, time_index): x = pd.Series(input_col.values, index=time_index.values) return x.shift(periods=self.periods, fill_value=None).values return lag ================================================ FILE: featuretools/primitives/standard/transform/time_series/numeric_lag.py ================================================ import warnings import pandas as pd from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import TransformPrimitive class 
NumericLag(TransformPrimitive): """Shifts an array of values by a specified number of periods. Args: periods (int): The number of periods by which to shift the input. Default is 1. Periods correspond to rows. fill_value (int, float, optional): The value to use to fill in the gaps left after shifting the input. Default is None. Examples: >>> lag = NumericLag() >>> lag(pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist() [nan, 1.0, 2.0, 3.0, 4.0] You can specify the number of periods to shift the values >>> lag_periods = NumericLag(periods=3) >>> lag_periods(pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist() [nan, nan, nan, 1.0, 2.0] You can specify the fill value to use >>> lag_fill_value = NumericLag(fill_value=100) >>> lag_fill_value(pd.Series(pd.date_range(start="2020-01-01", periods=4, freq='D')), [1, 2, 3, 4]).tolist() [100, 1, 2, 3] """ name = "numeric_lag" input_types = [ ColumnSchema(semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, periods=1, fill_value=None): self.periods = periods self.fill_value = fill_value warnings.warn( "NumericLag is deprecated and will be removed in a future version. 
Please use the 'Lag' primitive instead.", FutureWarning, ) def get_function(self): def lag(time_index, numeric): x = pd.Series(numeric.values, index=time_index.values) return x.shift(periods=self.periods, fill_value=self.fill_value).values return lag ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_count.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingCount(TransformPrimitive): """Determines a rolling count of events over a given window. Description: Given a list of datetimes, return a rolling count starting at the row `gap` rows away from the current row and looking backward over the specified time window (by `window_length` and `gap`). Input datetimes should be monotonic. Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. 
If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 0. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W', because a weekly alias is anchored to a particular day of the week, like W-Wed, so it indicates a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. The reverse does not hold: an offset alias `window_length` can be combined with a numeric `gap`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_count = RollingCount(window_length=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_count(times).tolist() [nan, 1.0, 2.0, 3.0, 3.0] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_count = RollingCount(window_length=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_count(times).tolist() [1.0, 2.0, 3.0, 3.0, 3.0] We can also control the minimum number of periods required for the rolling calculation.
>>> import pandas as pd >>> rolling_count = RollingCount(window_length=3, min_periods=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_count(times).tolist() [nan, nan, 3.0, 3.0, 3.0] We can also set the window_length and gap using offset alias strings. >>> import pandas as pd >>> rolling_count = RollingCount(window_length='3min', gap='1min') >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_count(times).tolist() [nan, 1.0, 2.0, 3.0, 3.0] """ name = "rolling_count" input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=0): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_count(datetime): x = pd.Series(1, index=datetime) return apply_rolling_agg_to_series( x, lambda series: series.count(), self.window_length, self.gap, self.min_periods, ignore_window_nans=True, ) return rolling_count ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_max.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingMax(TransformPrimitive): """Determines the maximum of entries over a given window. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling maximum of the numeric values, starting at the row `gap` rows away from the current row and looking backward over the specified window (by `window_length` and `gap`). Input datetimes should be monotonic. 
Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 1. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W' because week will indicate a certain day of the week, like W-Wed, so that will indicate a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. 
This limitation does not exist when using an offset alias to define `window_length`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_max = RollingMax(window_length=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist() [nan, 4.0, 4.0, 4.0, 3.0] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_max = RollingMax(window_length=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist() [4.0, 4.0, 4.0, 3.0, 2.0] We can also control the minimum number of periods required for the rolling calculation. >>> import pandas as pd >>> rolling_max = RollingMax(window_length=3, min_periods=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, 4.0, 3.0, 2.0] We can also set the window_length and gap using offset alias strings. 
>>> import pandas as pd >>> rolling_max = RollingMax(window_length='3min', gap='1min') >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist() [nan, 4.0, 4.0, 4.0, 3.0] """ name = "rolling_max" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=1): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_max(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( x, lambda series: series.max(), self.window_length, self.gap, self.min_periods, ) return rolling_max ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_mean.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingMean(TransformPrimitive): """Calculates the mean of entries over a given window. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling mean of the numeric values, starting at the row `gap` rows away from the current row and looking backward over the specified time window (by `window_length` and `gap`). Input datetimes should be monotonic. Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. 
For data with a uniform sampling frequency, the window_length will correspond to a period of time; for example, with daily data a window_length of 7 spans 7 days. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 0. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W', because a weekly alias is anchored to a particular day of the week, like W-Wed, so it indicates a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. This limitation does not exist when using an offset alias to define `window_length`.
In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_mean = RollingMean(window_length=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist() [nan, 4.0, 3.5, 3.0, 2.0] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_mean = RollingMean(window_length=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist() [4.0, 3.5, 3.0, 2.0, 1.0] We can also control the minimum number of periods required for the rolling calculation. >>> import pandas as pd >>> rolling_mean = RollingMean(window_length=3, min_periods=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, 3.0, 2.0, 1.0] """ name = "rolling_mean" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=0): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_mean(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( x, np.mean, self.window_length, self.gap, self.min_periods, ) return rolling_mean ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_min.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from 
featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingMin(TransformPrimitive): """Determines the minimum of entries over a given window. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling minimum of the numeric values, starting at the row `gap` rows away from the current row and looking backward over the specified window (by `window_length` and `gap`). Input datetimes should be monotonic. Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 1. 
Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W' because week will indicate a certain day of the week, like W-Wed, so that will indicate a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. This limitation does not exist when using an offset alias to define `window_length`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_min = RollingMin(window_length=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist() [nan, 4.0, 3.0, 2.0, 1.0] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_min = RollingMin(window_length=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist() [4.0, 3.0, 2.0, 1.0, 0.0] We can also control the minimum number of periods required for the rolling calculation. >>> import pandas as pd >>> rolling_min = RollingMin(window_length=3, min_periods=3, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, 2.0, 1.0, 0.0] We can also set the window_length and gap using offset alias strings. 
>>> import pandas as pd >>> rolling_min = RollingMin(window_length='3min', gap='1min') >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist() [nan, 4.0, 3.0, 2.0, 1.0] """ name = "rolling_min" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=1): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_min(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( x, lambda series: series.min(), self.window_length, self.gap, self.min_periods, ) return rolling_min ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_outlier_count.py ================================================ import numpy as np import pandas as pd from woodwork import init_series from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingOutlierCount(TransformPrimitive): """Determines how many values are outliers over a given window. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling count of outliers within the numeric values, starting at the row `gap` rows away from the current row and looking backward over the specified window (by `window_length` and `gap`). Values are deemed outliers using the IQR method, computed over the whole series. Input datetimes should be monotonic. 
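As an illustration of the IQR rule mentioned above, here is a minimal standard-library sketch that counts outliers in a list of values. It assumes the conventional 1.5 x IQR fences and linearly interpolated quartiles; the helper names `quantile` and `count_iqr_outliers` are hypothetical, and the primitive itself delegates to Woodwork's `box_plot_dict` rather than to this code.

```python
import math


def quantile(sorted_values, q):
    # Linear interpolation between the two closest ranks (assumed convention).
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def count_iqr_outliers(values):
    # Hypothetical helper: count values outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return math.nan
    q1, q3 = quantile(clean, 0.25), quantile(clean, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in clean if v < low or v > high)


print(count_iqr_outliers([0, 0, 0, 10]))  # 1
print(count_iqr_outliers([0, 0, 0, 0]))   # 0
```

With the values `[0, 0, 0, 10]`, the third quartile is 2.5, so the upper fence is 6.25 and only the value 10 falls outside it.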
Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1, which excludes the target instance from the window. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 0, which is treated the same as 1. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W' because week will indicate a certain day of the week, like W-Wed, so that will indicate a different number of days depending on the anchoring date.)
Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. This limitation does not exist when using an offset alias to define `window_length`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_outlier_count = RollingOutlierCount(window_length=4) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6) >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist() [nan, 0.0, 0.0, 0.0, 0.0, 1.0] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_outlier_count = RollingOutlierCount(window_length=4, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6) >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist() [0.0, 0.0, 0.0, 0.0, 1.0, 1.0] We can also control the minimum number of periods required for the rolling calculation. >>> import pandas as pd >>> rolling_outlier_count = RollingOutlierCount(window_length=4, min_periods=3) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6) >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist() [nan, nan, nan, 0.0, 0.0, 1.0] We can also set the window_length and gap using offset alias strings. 
>>> import pandas as pd >>> rolling_outlier_count = RollingOutlierCount(window_length='4min', gap='1min') >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6) >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist() [nan, 0.0, 0.0, 0.0, 0.0, 1.0] """ name = "rolling_outlier_count" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=0): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_outliers_count(self, numeric_series): # We know the column is numeric, so use the Double logical type in case Woodwork's # type inference could not infer a numeric type if not len(numeric_series.dropna()): return np.nan if numeric_series.ww.schema is None: numeric_series = init_series(numeric_series, logical_type="Double") box_plot_info = numeric_series.ww.box_plot_dict() return len(box_plot_info["high_values"]) + len(box_plot_info["low_values"]) def get_function(self): def rolling_outlier_count(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( series=x, agg_func=self.get_outliers_count, window_length=self.window_length, gap=self.gap, min_periods=self.min_periods, ignore_window_nans=False, ) return rolling_outlier_count ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_std.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) class RollingSTD(TransformPrimitive): 
"""Calculates the standard deviation of entries over a given window. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling standard deviation of the numeric values, starting at the row `gap` rows away from the current row and looking backward over the specified time window (by `window_length` and `gap`). Input datetimes should be monotonic. Args: window_length (int, string, optional): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 1. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. 
This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days. ('M', because different months have different numbers of days; 'W' because week will indicate a certain day of the week, like W-Wed, so that will indicate a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. This limitation does not exist when using an offset alias to define `window_length`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient. Examples: >>> import pandas as pd >>> rolling_std = RollingSTD(window_length=4) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056] We can also control the gap before the rolling calculation. >>> import pandas as pd >>> rolling_std = RollingSTD(window_length=4, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist() [nan, 0.7071067811865476, 1.0, 1.2909944487358056, 1.2909944487358056] We can also control the minimum number of periods required for the rolling calculation. >>> import pandas as pd >>> rolling_std = RollingSTD(window_length=4, min_periods=4, gap=0) >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, nan, 1.2909944487358056, 1.2909944487358056] We can also set the window_length and gap using offset alias strings. 
>>> import pandas as pd >>> rolling_std = RollingSTD(window_length='4min', gap='1min') >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5) >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist() [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056] """ name = "rolling_std" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=1): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_std(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( x, lambda series: series.std(), self.window_length, self.gap, self.min_periods, ) return rolling_std ================================================ FILE: featuretools/primitives/standard/transform/time_series/rolling_trend.py ================================================ import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Double from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) from featuretools.utils import calculate_trend class RollingTrend(TransformPrimitive): """Calculates the trend of a given window of entries of a column over time. Description: Given a list of numbers and a corresponding list of datetimes, return a rolling slope of the linear trend of values, starting at the row `gap` rows away from the current row and looking backward over the specified time window (by `window_length` and `gap`). Input datetimes should be monotonic. Args: window_length (int, string, optional): Specifies the amount of data included in each window. 
If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. Defaults to 3. gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 1. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 0, which is treated the same as 1. Examples: >>> import pandas as pd >>> rolling_trend = RollingTrend() >>> times = pd.date_range(start="2019-01-01", freq="1D", periods=10) >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist() [nan, nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0] We can also control the gap before the rolling calculation.
>>> rolling_trend = RollingTrend(gap=0) >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist() [nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0, 144.0] We can also control the minimum number of periods required for the rolling calculation. >>> rolling_trend = RollingTrend(window_length=4, min_periods=4, gap=0) >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist() [nan, nan, nan, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997, 110.39999999999993] We can also set the window_length and gap using offset alias strings. >>> rolling_trend = RollingTrend(window_length="4D", gap="1D") >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist() [nan, nan, nan, 1.4999999999999998, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997] """ name = "rolling_trend" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"}) uses_full_dataframe = True def __init__(self, window_length=3, gap=1, min_periods=0): self.window_length = window_length self.gap = gap self.min_periods = min_periods def get_function(self): def rolling_trend(datetime, numeric): x = pd.Series(numeric.values, index=datetime.values) return apply_rolling_agg_to_series( x, calculate_trend, self.window_length, self.gap, self.min_periods, ) return rolling_trend ================================================ FILE: featuretools/primitives/standard/transform/time_series/utils.py ================================================ from typing import Callable, Optional, Union import numpy as np import pandas as pd from pandas import Series from pandas.core.window.rolling import Rolling from pandas.tseries.frequencies import to_offset def 
roll_series_with_gap( series: Series, window_length: Union[int, str], gap: Union[int, str], min_periods: int, ) -> Rolling: """Provide rolling window calculations where the windows are determined using both a gap parameter that indicates the amount of time between each instance and its window and a window length parameter that determines the amount of data in each window. Args: series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex. window_length (int, string): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases gap (int, string): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. A gap of 0 includes the target instance in the window. min_periods (int): Minimum number of observations required for performing calculations over the window. Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window.
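For an integer `gap`, the approach can be pictured as shifting the values down by `gap` rows and then taking an ordinary trailing window. The following standard-library sketch mimics that for a plain list. The name `shift_then_roll` is hypothetical, and the sketch assumes that only non-NaN values count towards `min_periods`; it also drops NaNs before aggregating, which is a simplification of what happens on a real Rolling object.

```python
import math


def mean(values):
    return sum(values) / len(values)


def shift_then_roll(values, agg, window_length, gap, min_periods):
    # Shift by `gap` rows: pad the front with NaN and drop the overflow.
    shifted = ([math.nan] * gap + list(values))[: len(values)]
    results = []
    for i in range(len(shifted)):
        # Trailing window of at most `window_length` rows ending at row i.
        window = shifted[max(0, i - window_length + 1) : i + 1]
        observed = [v for v in window if not math.isnan(v)]
        if len(observed) < max(min_periods, 1):
            results.append(math.nan)
        else:
            results.append(float(agg(observed)))
    return results


print(shift_then_roll([4, 3, 2, 1, 0], mean, window_length=3, gap=1, min_periods=1))
# [nan, 4.0, 3.5, 3.0, 2.0]
```

The printed values agree with the `RollingMean(window_length=3)` example elsewhere in this package, which uses the default gap of 1.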
Returns: pandas.core.window.rolling.Rolling: The Rolling object for the series passed in. """ _check_window_length(window_length) _check_gap(window_length, gap) functional_window_length = window_length if isinstance(gap, str): # Add the window_length and gap so that the rolling operation correctly takes gap into account. # That way, we can later remove the gap rows in order to apply the primitive function # to the correct window functional_window_length = to_offset(window_length) + to_offset(gap) elif gap > 0: # When gap is numeric, we can apply a shift to incorporate gap right now # since the gap will be the same number of rows for the whole dataset series = series.shift(gap) return series.rolling(functional_window_length, min_periods) def _get_rolled_series_without_gap(window: Series, gap_offset: str) -> Series: """Applies the gap offset_string to the rolled window, returning a window that is the correct length of time away from the original instance. Args: window (Series): A rolling window that includes both the window length and gap spans of time. gap_offset (string): The pandas offset alias that determines how much time at the end of the window should be removed. Returns: Series: The window with gap rows removed """ if not len(window): return window window_start_date = window.index[0] window_end_date = window.index[-1] gap_bound = window_end_date - to_offset(gap_offset) # If the gap is larger than the series, no rows are left in the window if gap_bound < window_start_date: return Series(dtype="float64") # Only return the rows that are within the offset's bounds return window[window.index <= gap_bound] def apply_roll_with_offset_gap( window: Series, gap_offset: str, reducer_fn: Callable[[Series], float], min_periods: int, ) -> float: """Takes in a series to which an offset gap will be applied, removing however many rows fall under the gap before applying the reducing function. 
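To make the trimming step concrete, here is a small standard-library sketch, assuming the window is a time-sorted list of `(timestamp, value)` pairs and the gap is a `timedelta`. The name `reduce_after_gap` is hypothetical; the function documented here operates on a pandas Series and an offset alias instead.

```python
import math
from datetime import datetime, timedelta


def reduce_after_gap(window, gap, reducer_fn, min_periods=1):
    # Keep only rows that are at least `gap` older than the last row.
    if not window:
        return math.nan
    gap_bound = window[-1][0] - gap
    kept = [value for timestamp, value in window if timestamp <= gap_bound]
    # A min_periods of 0 or None is handled like 1.
    if len(kept) < (min_periods or 1):
        return math.nan
    return reducer_fn(kept)


start = datetime(2019, 1, 1)
window = [(start + timedelta(minutes=i), v) for i, v in enumerate([4, 3, 2, 1])]
print(reduce_after_gap(window, timedelta(minutes=1), min))   # 2
print(reduce_after_gap(window, timedelta(minutes=10), min))  # nan
```

In the first call the last row is removed by the one-minute gap, so the minimum is taken over `[4, 3, 2]`. In the second call the gap is longer than the window, no rows remain, and the result is NaN.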
Args: window (Series): A rolling window that includes both the window length and gap spans of time. gap_offset (string): The pandas offset alias that determines how much time at the end of the window should be removed. reducer_fn (callable[Series -> float]): The function to be applied to the window in order to produce the aggregate that will be included in the resulting feature. min_periods (int): Minimum number of observations required for performing calculations over the window. Returns: float: The aggregate value to be used as a feature value. """ window = _get_rolled_series_without_gap(window, gap_offset) if min_periods is None: min_periods = 1 if len(window) < min_periods or not len(window): return np.nan return reducer_fn(window) def _check_window_length(window_length: Union[int, str]) -> None: # Window length must either be a valid offset alias if isinstance(window_length, str): try: to_offset(window_length) except ValueError: raise ValueError( f"Cannot roll series. The specified window length, {window_length}, is not a valid offset alias.", ) # Or an integer greater than zero elif isinstance(window_length, int): if window_length <= 0: raise ValueError("Window length must be greater than zero.") else: raise TypeError("Window length must be either an offset string or an integer.") def _check_gap(window_length: Union[int, str], gap: Union[int, str]) -> None: # Gap must either be a valid offset string that also has an offset string window length if isinstance(gap, str): if not isinstance(window_length, str): raise TypeError( f"Cannot roll series with offset gap, {gap}, and numeric window length, {window_length}. " "If an offset alias is used for gap, the window length must also be defined as an offset alias. " "Please either change gap to be numeric or change window length to be an offset alias.", ) try: to_offset(gap) except ValueError: raise ValueError( f"Cannot roll series. 
The specified gap, {gap}, is not a valid offset alias.", ) # Or an integer greater than or equal to zero elif isinstance(gap, int): if gap < 0: raise ValueError("Gap must be greater than or equal to zero.") else: raise TypeError("Gap must be either an offset string or an integer.") def apply_rolling_agg_to_series( series: Series, agg_func: Callable[[Series], float], window_length: Union[int, str], gap: Union[int, str] = 0, min_periods: int = 1, ignore_window_nans: bool = False, ) -> np.ndarray: """Applies a given aggregation function to a rolled series. Args: series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex. agg_func (callable[Series -> float]): The aggregation function to apply to a rolled series. window_length (int, string): Specifies the amount of data included in each window. If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency, for example of one day, the window_length will correspond to a period of time, in this case, 7 days for a window_length of 7. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time that each window should span. The list of available offset aliases can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases gap (int, string, optional): Specifies a gap backwards from each instance before the window of usable data begins. If an integer is provided, it will correspond to a number of rows. If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc), and it will indicate a length of time between a target instance and the beginning of its window. Defaults to 0, which will include the target instance in the window. min_periods (int, optional): Minimum number of observations required for performing calculations over the window. 
Can only be as large as window_length when window_length is an integer. When window_length is an offset alias string, this limitation does not exist, but care should be taken to not choose a min_periods that will always be larger than the number of observations in a window. Defaults to 1. ignore_window_nans (bool, optional): Whether or not NaNs in the rolling window should be included in the rolling calculation. NaNs by default get counted towards min_periods. When set to True, all partial values calculated by `agg_func` in the rolling window get replaced with NaN. Defaults to False. Returns: numpy.ndarray: The array of rolling calculated values. Note: Certain operations, like `pandas.core.window.rolling.Rolling.count` that can be performed on the Rolling object returned here may treat NaNs as periods to include in window calculations. So a window [NaN, 1, 3] when `min_periods=3` will proceed with count, saying there are three periods but only two values and would return count=2. The calculation `max` on the other hand, would not recognize NaN as a valid period, and would therefore return `max=NaN` as the window has fewer valid periods (two, in this case) than `min_periods` (three, in this case). Most rolling calculations act this way. The implication of that here is that in order to achieve the gap, we insert NaNs at the beginning of the series, which would cause `count` to calculate on windows that technically should not have the correct number of periods. Any primitive that uses this function should determine whether `ignore_window_nans` should be set to `True`. Note: Only offset aliases with fixed frequencies can be used when defining gap and window_length. This means that aliases such as `M` or `W` cannot be used, as they can indicate different numbers of days.
('M', because different months have different numbers of days; 'W' because week will indicate a certain day of the week, like W-Wed, so that will indicate a different number of days depending on the anchoring date.) Note: When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`. This limitation does not exist when using an offset alias to define `window_length`. In fact, if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more efficient.""" rolled_series = roll_series_with_gap(series, window_length, gap, min_periods) if isinstance(gap, str): additional_args = (gap, agg_func, min_periods) return rolled_series.apply( apply_roll_with_offset_gap, args=additional_args, ).values applied_rolled_series = rolled_series.apply(agg_func) if ignore_window_nans: if not min_periods: # when min periods is 0 or None it's treated the same as if it's 1 num_nans = gap else: num_nans = min_periods - 1 + gap applied_rolled_series.iloc[range(num_nans)] = np.nan return applied_rolled_series.values def _apply_gap_for_expanding_primitives( x: Union[Series, pd.Index], gap: Union[int, str], ) -> Optional[Series]: if not isinstance(gap, int): raise TypeError( "String offsets are not supported for the gap parameter in Expanding primitives", ) if isinstance(x, pd.Index): return x.to_series().shift(gap) return x.shift(gap) ================================================ FILE: featuretools/primitives/standard/transform/url/__init__.py ================================================ from featuretools.primitives.standard.transform.url.url_to_domain import URLToDomain from featuretools.primitives.standard.transform.url.url_to_protocol import URLToProtocol from featuretools.primitives.standard.transform.url.url_to_tld import URLToTLD ================================================ FILE: featuretools/primitives/standard/transform/url/url_to_domain.py ================================================ from 
woodwork.column_schema import ColumnSchema from woodwork.logical_types import URL, Categorical from featuretools.primitives.base import TransformPrimitive class URLToDomain(TransformPrimitive): """Determines the domain of a url. Description: Calculates the label to identify the network domain of a URL. Supports urls with or without protocol as well as international country domains. Examples: >>> url_to_domain = URLToDomain() >>> urls = ['https://play.google.com', ... 'http://www.google.co.in', ... 'www.facebook.com'] >>> url_to_domain(urls).tolist() ['play.google.com', 'google.co.in', 'facebook.com'] """ name = "url_to_domain" input_types = [ColumnSchema(logical_type=URL)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def url_to_domain(x): p = r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)" return x.str.extract(p, expand=False) return url_to_domain ================================================ FILE: featuretools/primitives/standard/transform/url/url_to_protocol.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import URL, Categorical from featuretools.primitives.base import TransformPrimitive class URLToProtocol(TransformPrimitive): """Determines the protocol (http or https) of a url. Description: Extract the protocol of a url using regex. It will be either https or http. Returns nan if the url doesn't contain a protocol. Examples: >>> url_to_protocol = URLToProtocol() >>> urls = ['https://play.google.com', ... 'http://www.google.co.in', ... 
'www.facebook.com'] >>> url_to_protocol(urls).to_list() ['https', 'http', nan] """ name = "url_to_protocol" input_types = [ColumnSchema(logical_type=URL)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): def url_to_protocol(x): p = r"^(https|http)(?:\:)" return x.str.extract(p, expand=False) return url_to_protocol ================================================ FILE: featuretools/primitives/standard/transform/url/url_to_tld.py ================================================ from woodwork.column_schema import ColumnSchema from woodwork.logical_types import URL, Categorical from featuretools.primitives.base import TransformPrimitive from featuretools.utils.common_tld_utils import COMMON_TLDS class URLToTLD(TransformPrimitive): """Determines the top level domain of a url. Description: Extract the top level domain of a url, using regex, and a list of common top level domains. Returns nan if the url is invalid or null. Common top level domains were pulled from this list: https://www.hayksaakian.com/most-popular-tlds/ Examples: >>> url_to_tld = URLToTLD() >>> urls = ['https://www.google.com', 'http://www.google.co.in', ... 
'www.facebook.com'] >>> url_to_tld(urls).to_list() ['com', 'in', 'com'] """ name = "url_to_tld" input_types = [ColumnSchema(logical_type=URL)] return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"}) def get_function(self): self.tlds_pattern = r"(?:\.({}))".format("|".join(COMMON_TLDS)) def url_to_domain(x): p = r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)" return x.str.extract(p, expand=False) def url_to_tld(x): domains = url_to_domain(x) df = domains.str.extractall(self.tlds_pattern) matches = df.groupby(level=0).last()[0] return matches.reindex(x.index) return url_to_tld ================================================ FILE: featuretools/primitives/utils.py ================================================ import importlib.util import os from inspect import getfullargspec, getsource, isclass from typing import Dict, List import pandas as pd from woodwork import list_logical_types, list_semantic_tags, type_system from woodwork.column_schema import ColumnSchema from woodwork.logical_types import NaturalLanguage import featuretools from featuretools.primitives import NumberOfCommonWords from featuretools.primitives.base import ( AggregationPrimitive, PrimitiveBase, TransformPrimitive, ) from featuretools.utils.gen_utils import find_descendents def _get_primitives(primitive_kind): """Helper function that selects all primitives that are instances of `primitive_kind` """ primitives = set() for attribute_string in dir(featuretools.primitives): attribute = getattr(featuretools.primitives, attribute_string) if isclass(attribute): if issubclass(attribute, primitive_kind) and attribute.name: primitives.add(attribute) return {prim.name.lower(): prim for prim in primitives} def get_aggregation_primitives(): """Returns all aggregation primitives""" return _get_primitives(featuretools.primitives.AggregationPrimitive) def get_transform_primitives(): """Returns all transform primitives""" return 
_get_primitives(featuretools.primitives.TransformPrimitive) def get_all_primitives(): """Helper function to return all primitives""" primitives = set() for attribute_string in dir(featuretools.primitives): attribute = getattr(featuretools.primitives, attribute_string) if isclass(attribute): if issubclass(attribute, PrimitiveBase) and attribute.name: primitives.add(attribute) return {prim.__name__: prim for prim in primitives} def _get_natural_language_primitives(): """Returns all Natural Language transform primitives""" transform_primitives = get_transform_primitives() def _natural_language_in_input_type(primitive): for input_type in primitive.input_types: if isinstance(input_type, list): if any( isinstance(column_schema.logical_type, NaturalLanguage) for column_schema in input_type ): return True else: if isinstance(input_type.logical_type, NaturalLanguage): return True return False return { name: primitive for name, primitive in transform_primitives.items() if _natural_language_in_input_type(primitive) } def list_primitives(): """Returns a DataFrame that lists and describes each built-in primitive.""" trans_names, trans_primitives, valid_inputs, return_type = _get_names_primitives( get_transform_primitives, ) transform_df = pd.DataFrame( { "name": trans_names, "description": _get_descriptions(trans_primitives), "valid_inputs": valid_inputs, "return_type": return_type, }, ) transform_df["type"] = "transform" agg_names, agg_primitives, valid_inputs, return_type = _get_names_primitives( get_aggregation_primitives, ) agg_df = pd.DataFrame( { "name": agg_names, "description": _get_descriptions(agg_primitives), "valid_inputs": valid_inputs, "return_type": return_type, }, ) agg_df["type"] = "aggregation" columns = [ "name", "type", "description", "valid_inputs", "return_type", ] return pd.concat([agg_df, transform_df], ignore_index=True)[columns] def summarize_primitives() -> pd.DataFrame: """Returns a metrics summary DataFrame of all primitives found in 
list_primitives.""" ( trans_names, trans_primitives, trans_valid_inputs, trans_return_type, ) = _get_names_primitives(get_transform_primitives) ( agg_names, agg_primitives, agg_valid_inputs, agg_return_type, ) = _get_names_primitives(get_aggregation_primitives) tot_trans = len(trans_names) tot_agg = len(agg_names) tot_prims = tot_trans + tot_agg all_primitives = trans_primitives + agg_primitives primitives_summary = _get_summary_primitives(all_primitives) summary_dict = { "total_primitives": tot_prims, "aggregation_primitives": tot_agg, "transform_primitives": tot_trans, **primitives_summary["general_metrics"], } summary_dict.update( { f"uses_{ltype}_input": count for ltype, count in primitives_summary["logical_type_input_metrics"].items() }, ) summary_dict.update( { f"uses_{tag}_tag_input": count for tag, count in primitives_summary["semantic_tag_metrics"].items() }, ) summary_df = pd.DataFrame( [{"Metric": k, "Count": v} for k, v in summary_dict.items()], ) return summary_df def get_default_aggregation_primitives(): agg_primitives = [ featuretools.primitives.Sum, featuretools.primitives.Std, featuretools.primitives.Max, featuretools.primitives.Skew, featuretools.primitives.Min, featuretools.primitives.Mean, featuretools.primitives.Count, featuretools.primitives.PercentTrue, featuretools.primitives.NumUnique, featuretools.primitives.Mode, ] return agg_primitives def get_default_transform_primitives(): # featuretools.primitives.TimeSince trans_primitives = [ featuretools.primitives.Age, featuretools.primitives.Day, featuretools.primitives.Year, featuretools.primitives.Month, featuretools.primitives.Weekday, featuretools.primitives.Haversine, featuretools.primitives.NumWords, featuretools.primitives.NumCharacters, ] return trans_primitives def _get_descriptions(primitives): descriptions = [] for prim in primitives: description = "" if prim.__doc__ is not None: # Break on the empty line between the docstring description and the remainder of the docstring description 
= prim.__doc__.split("\n\n")[0] # remove any excess whitespace from line breaks description = " ".join(description.split()) descriptions.append(description) return descriptions def _get_summary_primitives(primitives: List) -> Dict[str, int]: """Provides metrics for a list of primitives.""" unique_input_types = set() unique_output_types = set() uses_multi_input = 0 uses_multi_output = 0 uses_external_data = 0 are_controllable = 0 logical_type_metrics = { log_type: 0 for log_type in list(list_logical_types()["type_string"]) } semantic_tag_metrics = { sem_tag: 0 for sem_tag in list(list_semantic_tags()["name"]) } semantic_tag_metrics.update( {"foreign_key": 0}, ) # not currently in list_semantic_tags() for prim in primitives: log_in_type_checks = set() sem_tag_type_checks = set() input_types = prim.flatten_nested_input_types(prim.input_types) _check_input_types( input_types, log_in_type_checks, sem_tag_type_checks, unique_input_types, ) for ltype in list(log_in_type_checks): logical_type_metrics[ltype] += 1 for sem_tag in list(sem_tag_type_checks): semantic_tag_metrics[sem_tag] += 1 if len(prim.input_types) > 1: uses_multi_input += 1 # checks if number_output_features is set as an instance variable or set as a constant if ( "self.number_output_features =" in getsource(prim.__init__) or prim.number_output_features > 1 ): uses_multi_output += 1 unique_output_types.add(str(prim.return_type)) if hasattr(prim, "filename"): uses_external_data += 1 if len(getfullargspec(prim.__init__).args) > 1: are_controllable += 1 return { "general_metrics": { "unique_input_types": len(unique_input_types), "unique_output_types": len(unique_output_types), "uses_multi_input": uses_multi_input, "uses_multi_output": uses_multi_output, "uses_external_data": uses_external_data, "are_controllable": are_controllable, }, "logical_type_input_metrics": logical_type_metrics, "semantic_tag_metrics": semantic_tag_metrics, } def _check_input_types( input_types: List[ColumnSchema], log_in_type_checks: 
set, sem_tag_type_checks: set, unique_input_types: set, ): """Checks if any logical types or semantic tags occur in a list of Woodwork input types and keeps track of unique input types.""" for in_type in input_types: if in_type.semantic_tags: for sem_tag in in_type.semantic_tags: sem_tag_type_checks.add(sem_tag) if in_type.logical_type: log_in_type_checks.add(in_type.logical_type.type_string) unique_input_types.add(str(in_type)) def _get_names_primitives(primitive_func): names = [] primitives = [] valid_inputs = [] return_type = [] for name, primitive in primitive_func().items(): names.append(name) primitives.append(primitive) input_types = _get_unique_input_types(primitive.input_types) valid_inputs.append(", ".join(input_types)) return_type.append( str(primitive.return_type) if primitive.return_type is not None else None, ) return names, primitives, valid_inputs, return_type def _get_unique_input_types(input_types): types = set() for input_type in input_types: if isinstance(input_type, list): types |= _get_unique_input_types(input_type) else: types.add(str(input_type)) return types def list_primitive_files(directory): """Returns a list of files in directory that might contain primitives""" files = os.listdir(directory) keep = [] for path in files: if not check_valid_primitive_path(path): continue keep.append(os.path.join(directory, path)) return keep def check_valid_primitive_path(path): if os.path.isdir(path): return False filename = os.path.basename(path) if filename[:2] == "__" or filename[0] == "." or filename[-3:] != ".py": return False return True def load_primitive_from_file(filepath): """Load the single primitive class defined in a file""" module = os.path.basename(filepath)[:-3] # TODO: what is the first argument?
spec = importlib.util.spec_from_file_location(module, filepath) module = importlib.util.module_from_spec(spec) spec.loader.exec_module(module) primitives = [] for primitive_name in vars(module): primitive_class = getattr(module, primitive_name) if ( isclass(primitive_class) and issubclass(primitive_class, PrimitiveBase) and primitive_class not in (AggregationPrimitive, TransformPrimitive) ): primitives.append((primitive_name, primitive_class)) if len(primitives) == 0: raise RuntimeError("No primitive defined in file %s" % filepath) elif len(primitives) > 1: raise RuntimeError("More than one primitive defined in file %s" % filepath) return primitives[0] def serialize_primitive(primitive: PrimitiveBase): """build a dictionary with the data necessary to construct the given primitive""" args_dict = {name: val for name, val in primitive.get_arguments()} cls = type(primitive) if cls == NumberOfCommonWords and "word_set" in args_dict: args_dict["word_set"] = list(args_dict["word_set"]) return { "type": cls.__name__, "module": cls.__module__, "arguments": args_dict, } class PrimitivesDeserializer(object): """ This class wraps a cache and a generator which iterates over all primitive classes. When deserializing a primitive if it is not in the cache then we iterate until it is found, adding every seen class to the cache. When deserializing the next primitive the iteration resumes where it left off. This means that we never visit a class more than once. """ def __init__(self): # Cache to avoid repeatedly searching for primitive class # (class_name, module_name) -> class self.class_cache = {} self.primitive_classes = find_descendents(PrimitiveBase) def deserialize_primitive(self, primitive_dict): """ Construct a primitive from the given dictionary (output from serialize_primitive). 
""" class_name = primitive_dict["type"] module_name = primitive_dict["module"] class_cache_key = (class_name, module_name.split(".")[0]) if class_cache_key in self.class_cache: cls = self.class_cache[class_cache_key] else: cls = self._find_class_in_descendants(class_cache_key) if not cls: raise RuntimeError( 'Primitive "%s" in module "%s" not found' % (class_name, module_name), ) arguments = primitive_dict["arguments"] if cls == NumberOfCommonWords and "word_set" in arguments: # We converted word_set from a set to a list to make it serializable, # we should convert it back now. arguments["word_set"] = set(arguments["word_set"]) primitive_instance = cls(**arguments) return primitive_instance def _find_class_in_descendants(self, search_key): for cls in self.primitive_classes: cls_key = (cls.__name__, cls.__module__.split(".")[0]) self.class_cache[cls_key] = cls if cls_key == search_key: return cls def get_all_logical_type_names(): """Helper function that returns all registered woodwork logical types""" return {lt.__name__: lt for lt in type_system.registered_types} ================================================ FILE: featuretools/selection/__init__.py ================================================ # flake8: noqa from featuretools.selection.api import * ================================================ FILE: featuretools/selection/api.py ================================================ # flake8: noqa from featuretools.selection.selection import * ================================================ FILE: featuretools/selection/selection.py ================================================ import pandas as pd from woodwork.logical_types import Boolean, BooleanNullable def remove_low_information_features(feature_matrix, features=None): """Select features that have at least 2 unique values and that are not all null Args: feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances features (list[:class:`featuretools.FeatureBase`] 
or list[str], optional): List of features to select Returns: (feature_matrix, features) """ keep = [ c for c in feature_matrix if ( feature_matrix[c].nunique(dropna=False) > 1 and feature_matrix[c].dropna().shape[0] > 0 ) ] feature_matrix = feature_matrix[keep] if features is not None: features = [f for f in features if f.get_name() in feature_matrix.columns] return feature_matrix, features return feature_matrix def remove_highly_null_features(feature_matrix, features=None, pct_null_threshold=0.95): """ Removes columns from a feature matrix that have higher than a set threshold of null values. Args: feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances. features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select. pct_null_threshold (float): If the percentage of NaN values in an input feature exceeds this amount, that feature will be considered highly-null. Defaults to 0.95. Returns: pd.DataFrame, list[:class:`.FeatureBase`]: The feature matrix and the list of generated feature definitions. Matches dfs output. If no feature list is provided as input, the feature list will not be returned. """ if pct_null_threshold < 0 or pct_null_threshold > 1: raise ValueError( "pct_null_threshold must be a float between 0 and 1, inclusive.", ) percent_null_by_col = (feature_matrix.isnull().mean()).to_dict() if pct_null_threshold == 0.0: keep = [ f_name for f_name, pct_null in percent_null_by_col.items() if pct_null <= pct_null_threshold ] else: keep = [ f_name for f_name, pct_null in percent_null_by_col.items() if pct_null < pct_null_threshold ] return _apply_feature_selection(keep, feature_matrix, features) def remove_single_value_features( feature_matrix, features=None, count_nan_as_value=False, ): """Removes columns in feature matrix where all the values are the same. Args: feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances. 
features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select. count_nan_as_value (bool): If True, missing values will be counted as their own unique value. If set to False, a feature that has one unique value and all other data missing will be removed from the feature matrix. Defaults to False. Returns: pd.DataFrame, list[:class:`.FeatureBase`]: The feature matrix and the list of generated feature definitions. Matches dfs output. If no feature list is provided as input, the feature list will not be returned. """ unique_counts_by_col = feature_matrix.nunique( dropna=not count_nan_as_value, ).to_dict() keep = [ f_name for f_name, unique_count in unique_counts_by_col.items() if unique_count > 1 ] return _apply_feature_selection(keep, feature_matrix, features) def remove_highly_correlated_features( feature_matrix, features=None, pct_corr_threshold=0.95, features_to_check=None, features_to_keep=None, ): """Removes columns in feature matrix that are highly correlated with another column. Note: We make the assumption that, for a pair of features, the feature that is further right in the feature matrix produced by ``dfs`` is the more complex one. The assumption does not hold if the order of columns in the feature matrix has changed from what ``dfs`` produces. Args: feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances. If Woodwork is not initialized, will perform Woodwork initialization, which may result in slightly different types than those in the original feature matrix created by Featuretools. features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select. pct_corr_threshold (float): The correlation threshold to be considered highly correlated. Defaults to 0.95. features_to_check (list[str], optional): List of column names to check whether any pairs are highly correlated. 
Will not check any other columns, meaning the only columns that can be removed are in this list. If None, defaults to checking all columns. features_to_keep (list[str], optional): List of column names to keep even if correlated to another column. If None, all columns will be candidates for removal. Returns: pd.DataFrame, list[:class:`.FeatureBase`]: The feature matrix and the list of generated feature definitions. Matches dfs output. If no feature list is provided as input, the feature list will not be returned. For consistent results, do not change the order of features output by dfs. """ if feature_matrix.ww.schema is None: feature_matrix.ww.init() if pct_corr_threshold < 0 or pct_corr_threshold > 1: raise ValueError( "pct_corr_threshold must be a float between 0 and 1, inclusive.", ) if features_to_check is None: features_to_check = list(feature_matrix.columns) else: for f_name in features_to_check: assert ( f_name in feature_matrix.columns ), "feature named {} is not in feature matrix".format(f_name) if features_to_keep is None: features_to_keep = [] to_select = ["numeric", Boolean, BooleanNullable] fm = feature_matrix.ww[features_to_check] fm_to_check = fm.ww.select(include=to_select) dropped = set() columns_to_check = fm_to_check.columns # When two features are found to be highly correlated, # we drop the more complex feature # Columns produced later in dfs are more complex for i in range(len(columns_to_check) - 1, 0, -1): more_complex_name = columns_to_check[i] more_complex_col = fm_to_check[more_complex_name] # Convert boolean or Int64 column to be float64 if pd.api.types.is_bool_dtype(more_complex_col) or isinstance( more_complex_col.dtype, pd.Int64Dtype, ): more_complex_col = more_complex_col.astype("float64") for j in range(i - 1, -1, -1): less_complex_name = columns_to_check[j] less_complex_col = fm_to_check[less_complex_name] # Convert boolean or Int64 column to be float64 if pd.api.types.is_bool_dtype(less_complex_col) or isinstance( 
less_complex_col.dtype, pd.Int64Dtype, ): less_complex_col = less_complex_col.astype("float64") if abs(more_complex_col.corr(less_complex_col)) >= pct_corr_threshold: dropped.add(more_complex_name) break keep = [ f_name for f_name in feature_matrix.columns if (f_name in features_to_keep or f_name not in dropped) ] return _apply_feature_selection(keep, feature_matrix, features) def _apply_feature_selection(keep, feature_matrix, features=None): new_matrix = feature_matrix[keep] new_feature_names = set(new_matrix.columns) if features is not None: new_features = [] for f in features: if f.number_output_features > 1: slices = [ f[i] for i in range(f.number_output_features) if f[i].get_name() in new_feature_names ] if len(slices) == f.number_output_features: new_features.append(f) else: new_features.extend(slices) else: if f.get_name() in new_feature_names: new_features.append(f) return new_matrix, new_features return new_matrix ================================================ FILE: featuretools/synthesis/__init__.py ================================================ # flake8: noqa from featuretools.synthesis.api import * ================================================ FILE: featuretools/synthesis/api.py ================================================ # flake8: noqa from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis from featuretools.synthesis.dfs import dfs from featuretools.synthesis.encode_features import encode_features from featuretools.synthesis.get_valid_primitives import get_valid_primitives ================================================ FILE: featuretools/synthesis/deep_feature_synthesis.py ================================================ import functools import logging import operator import warnings from collections import defaultdict from typing import Any, DefaultDict, Dict, List, Tuple, Type from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable from featuretools import 
primitives from featuretools.entityset.entityset import LTI_COLUMN_NAME from featuretools.entityset.relationship import RelationshipPath from featuretools.feature_base import ( AggregationFeature, DirectFeature, FeatureBase, GroupByTransformFeature, IdentityFeature, TransformFeature, ) from featuretools.feature_base.cache import CacheType, feature_cache from featuretools.feature_base.utils import is_valid_input from featuretools.primitives.base import ( AggregationPrimitive, PrimitiveBase, TransformPrimitive, ) from featuretools.primitives.options_utils import ( filter_groupby_matches_by_options, filter_matches_by_options, generate_all_primitive_options, ignore_dataframe_for_primitive, ) from featuretools.utils.gen_utils import camel_and_title_to_snake logger = logging.getLogger("featuretools") class DeepFeatureSynthesis(object): """Automatically produce features for a target dataframe in an Entityset. Args: target_dataframe_name (str): Name of dataframe for which to build features. entityset (EntitySet): Entityset for which to build features. agg_primitives (list[str or :class:`.primitives.AggregationPrimitive`], optional): List of aggregation primitives to apply. Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"] trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional): List of transform primitives to use. Default: ["age", "day", "year", "month", "weekday", "haversine", "num_words", "num_characters"] where_primitives (list[str or :class:`.primitives.PrimitiveBase`], optional): Only add where clauses to these types of primitives. Default: ["count"] groupby_trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional): List of transform primitives to make GroupByTransformFeatures with. max_depth (int, optional) : Maximum allowed depth of features. Default: 2. If -1, no limit. max_features (int, optional) : Cap the number of generated features to this number. If -1, no limit. 
allowed_paths (list[list[str]], optional): Allowed dataframe paths to make features for. If None, use all paths. ignore_dataframes (list[str], optional): List of dataframes to blacklist when creating features. If None, use all dataframes. ignore_columns (dict[str -> list[str]], optional): List of specific columns within each dataframe to blacklist when creating features. If None, use all columns. seed_features (list[:class:`.FeatureBase`], optional): List of manually defined features to use. drop_contains (list[str], optional): Drop features whose names contain any of these strings. drop_exact (list[str], optional): Drop features whose names exactly match any of these strings. where_stacking_limit (int, optional): Cap the depth of the where features. Default: 1 primitive_options (dict[str or tuple[str] or PrimitiveBase -> dict or list[dict]], optional): Specify options for a single primitive or a group of primitives. Lists of option dicts are used to specify options per input for primitives with multiple inputs. Each option ``dict`` can have the following keys: ``"include_dataframes"`` List of dataframes to be included when creating features for the primitive(s). All other dataframes will be ignored (list[str]). ``"ignore_dataframes"`` List of dataframes to be blacklisted when creating features for the primitive(s) (list[str]). ``"include_columns"`` List of specific columns within each dataframe to include when creating features for the primitive(s). All other columns in a given dataframe will be ignored (dict[str -> list[str]]). ``"ignore_columns"`` List of specific columns within each dataframe to blacklist when creating features for the primitive(s) (dict[str -> list[str]]). ``"include_groupby_dataframes"`` List of dataframes to be included when finding groupbys. All other dataframes will be ignored (list[str]). ``"ignore_groupby_dataframes"`` List of dataframes to blacklist when finding groupbys (list[str]). 
``"include_groupby_columns"`` List of specific columns within each dataframe to include as groupbys, if applicable. All other columns in each dataframe will be ignored (dict[str -> list[str]]). ``"ignore_groupby_columns"`` List of specific columns within each dataframe to blacklist as groupbys (dict[str -> list[str]]). """ def __init__( self, target_dataframe_name, entityset, agg_primitives=None, trans_primitives=None, where_primitives=None, groupby_trans_primitives=None, max_depth=2, max_features=-1, allowed_paths=None, ignore_dataframes=None, ignore_columns=None, primitive_options=None, seed_features=None, drop_contains=None, drop_exact=None, where_stacking_limit=1, ): if target_dataframe_name not in entityset.dataframe_dict: es_name = entityset.id or "entity set" msg = "Provided target dataframe %s does not exist in %s" % ( target_dataframe_name, es_name, ) raise KeyError(msg) # Multiple calls to dfs() should start with a fresh cache feature_cache.clear_all() feature_cache.enabled = True # need to change max_depth to None because DFs terminates when <0 if max_depth == -1: max_depth = None # if just one dataframe, set max depth to 1 (transform stacking rule) if len(entityset.dataframe_dict) == 1 and (max_depth is None or max_depth > 1): warnings.warn( "Only one dataframe in entityset, changing max_depth to " "1 since deeper features cannot be created", ) max_depth = 1 self.max_depth = max_depth self.max_features = max_features self.allowed_paths = allowed_paths if self.allowed_paths: self.allowed_paths = set() for path in allowed_paths: self.allowed_paths.add(tuple(path)) if ignore_dataframes is None: self.ignore_dataframes = set() else: if not isinstance(ignore_dataframes, list): raise TypeError("ignore_dataframes must be a list") assert ( target_dataframe_name not in ignore_dataframes ), "Can't ignore target_dataframe!" 
self.ignore_dataframes = set(ignore_dataframes) self.ignore_columns = _build_ignore_columns(ignore_columns) self.target_dataframe_name = target_dataframe_name self.es = entityset aggregation_primitive_dict = primitives.get_aggregation_primitives() transform_primitive_dict = primitives.get_transform_primitives() if agg_primitives is None: agg_primitives = primitives.get_default_aggregation_primitives() self.agg_primitives = sorted( [ check_primitive( p, "aggregation", aggregation_primitive_dict, transform_primitive_dict, ) for p in agg_primitives ], ) if trans_primitives is None: trans_primitives = primitives.get_default_transform_primitives() self.trans_primitives = sorted( [ check_primitive( p, "transform", aggregation_primitive_dict, transform_primitive_dict, ) for p in trans_primitives ], ) if where_primitives is None: where_primitives = [primitives.Count] self.where_primitives = sorted( [ check_primitive( p, "where", aggregation_primitive_dict, transform_primitive_dict, ) for p in where_primitives ], ) if groupby_trans_primitives is None: groupby_trans_primitives = [] self.groupby_trans_primitives = sorted( [ check_primitive( p, "groupby transform", aggregation_primitive_dict, transform_primitive_dict, ) for p in groupby_trans_primitives ], ) if primitive_options is None: primitive_options = {} all_primitives = ( self.trans_primitives + self.agg_primitives + self.where_primitives + self.groupby_trans_primitives ) ( self.primitive_options, self.ignore_dataframes, self.ignore_columns, ) = generate_all_primitive_options( all_primitives, primitive_options, self.ignore_dataframes, self.ignore_columns, self.es, ) self.seed_features = sorted(seed_features or [], key=lambda f: f.unique_name()) self.drop_exact = drop_exact or [] self.drop_contains = drop_contains or [] self.where_stacking_limit = where_stacking_limit def build_features(self, return_types=None, verbose=False): """Automatically builds feature definitions for target dataframe using Deep Feature Synthesis 
algorithm Args: return_types (list[woodwork.ColumnSchema] or str, optional): List of ColumnSchemas defining the types of columns to return. If None, defaults to returning all numeric, categorical and boolean types. If given as the string 'all', use all available return types. verbose (bool, optional): If True, print progress. Returns: list[BaseFeature]: Returns a list of features for target dataframe, sorted by feature depth (shallow first). """ all_features = {} self.where_clauses = defaultdict(set) if return_types is None: return_types = [ ColumnSchema(semantic_tags=["numeric"]), ColumnSchema(semantic_tags=["category"]), ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=BooleanNullable), ] elif return_types == "all": pass else: msg = "return_types must be a list, or 'all'" assert isinstance(return_types, list), msg self._run_dfs( self.es[self.target_dataframe_name], RelationshipPath([]), all_features, max_depth=self.max_depth, ) new_features = list(all_features[self.target_dataframe_name].values()) def filt(f): # remove identity features of the ID field of the target dataframe if ( isinstance(f, IdentityFeature) and f.dataframe_name == self.target_dataframe_name and f.column_name == self.es[self.target_dataframe_name].ww.index ): return False return True # filter out features with undesired return types if return_types != "all": new_features = [ f for f in new_features if any( True for schema in return_types if is_valid_input(f.column_schema, schema) ) ] new_features = list(filter(filt, new_features)) new_features.sort(key=lambda f: f.get_depth()) new_features = self._filter_features(new_features) if self.max_features > 0: new_features = new_features[: self.max_features] if verbose: print("Built {} features".format(len(new_features))) verbose = None return new_features def _filter_features(self, features): assert isinstance(self.drop_exact, list), "drop_exact must be a list" assert isinstance(self.drop_contains, list), "drop_contains must be a list" 
f_keep = [] for f in features: keep = True for contains in self.drop_contains: if contains in f.get_name(): keep = False break if f.get_name() in self.drop_exact: keep = False if keep: f_keep.append(f) return f_keep def _run_dfs(self, dataframe, relationship_path, all_features, max_depth): """ Create features for the provided dataframe Args: dataframe (DataFrame): Dataframe for which to create features. relationship_path (RelationshipPath): The path to this dataframe. all_features (dict[dataframe name -> dict[str -> BaseFeature]]): Dict containing a dict for each dataframe. Each nested dict has features as values with their ids as keys. max_depth (int) : Maximum allowed depth of features. """ if max_depth is not None and max_depth < 0: return all_features[dataframe.ww.name] = {} """ Step 1 - Create identity features """ self._add_identity_features(all_features, dataframe) """ Step 2 - Recursively build features for each dataframe in a backward relationship """ backward_dataframes = self.es.get_backward_dataframes(dataframe.ww.name) for b_dataframe_id, sub_relationship_path in backward_dataframes: # Skip if we've already created features for this dataframe. 
if b_dataframe_id in all_features: continue if b_dataframe_id in self.ignore_dataframes: continue new_path = relationship_path + sub_relationship_path if ( self.allowed_paths and tuple(new_path.dataframes()) not in self.allowed_paths ): continue new_max_depth = None if max_depth is not None: new_max_depth = max_depth - 1 self._run_dfs( dataframe=self.es[b_dataframe_id], relationship_path=new_path, all_features=all_features, max_depth=new_max_depth, ) """ Step 3 - Create aggregation features for all deep backward relationships """ backward_dataframes = self.es.get_backward_dataframes( dataframe.ww.name, deep=True, ) for b_dataframe_id, sub_relationship_path in backward_dataframes: if b_dataframe_id in self.ignore_dataframes: continue new_path = relationship_path + sub_relationship_path if ( self.allowed_paths and tuple(new_path.dataframes()) not in self.allowed_paths ): continue self._build_agg_features( parent_dataframe=self.es[dataframe.ww.name], child_dataframe=self.es[b_dataframe_id], all_features=all_features, max_depth=max_depth, relationship_path=sub_relationship_path, ) """ Step 4 - Create transform features of identity and aggregation features """ self._build_transform_features(all_features, dataframe, max_depth=max_depth) """ Step 5 - Recursively build features for each dataframe in a forward relationship """ forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name) for f_dataframe_id, sub_relationship_path in forward_dataframes: # Skip if we've already created features for this dataframe. 
if f_dataframe_id in all_features: continue if f_dataframe_id in self.ignore_dataframes: continue new_path = relationship_path + sub_relationship_path if ( self.allowed_paths and tuple(new_path.dataframes()) not in self.allowed_paths ): continue new_max_depth = None if max_depth is not None: new_max_depth = max_depth - 1 self._run_dfs( dataframe=self.es[f_dataframe_id], relationship_path=new_path, all_features=all_features, max_depth=new_max_depth, ) """ Step 6 - Create direct features for forward relationships """ forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name) for f_dataframe_id, sub_relationship_path in forward_dataframes: if f_dataframe_id in self.ignore_dataframes: continue new_path = relationship_path + sub_relationship_path if ( self.allowed_paths and tuple(new_path.dataframes()) not in self.allowed_paths ): continue self._build_forward_features( all_features=all_features, relationship_path=sub_relationship_path, max_depth=max_depth, ) """ Step 7 - Create transform features of direct features """ self._build_transform_features( all_features, dataframe, max_depth=max_depth, require_direct_input=True, ) # now that all features are added, build where clauses self._build_where_clauses(all_features, dataframe) def _handle_new_feature(self, new_feature, all_features): """Adds new feature to the dict Args: new_feature (:class:`.FeatureBase`): New feature being checked. all_features (dict[dataframe name -> dict[str -> BaseFeature]]): Dict containing a dict for each dataframe. Each nested dict has features as values with their ids as keys. Returns: dict[PrimitiveBase -> dict[feature id -> feature]]: Dict of features with any new features. Raises: Exception: Attempted to add a single feature multiple times """ dataframe_name = new_feature.dataframe_name name = new_feature.unique_name() # Warn if this feature is already present, and it is not a seed feature. # It is expected that a seed feature could also be generated by dfs. 
if name in all_features[dataframe_name] and name not in ( f.unique_name() for f in self.seed_features ): logger.warning( "Attempting to add feature %s which is already " "present. This is likely a bug." % new_feature, ) return all_features[dataframe_name][name] = new_feature def _add_identity_features(self, all_features, dataframe): """converts all columns from the given dataframe into features Args: all_features (dict[dataframe name -> dict[str -> BaseFeature]]): Dict containing a dict for each dataframe. Each nested dict has features as values with their ids as keys. dataframe (DataFrame): DataFrame to calculate features for. """ for col in dataframe.columns: if col in self.ignore_columns[dataframe.ww.name] or col == LTI_COLUMN_NAME: continue new_f = IdentityFeature(self.es[dataframe.ww.name].ww[col]) self._handle_new_feature(all_features=all_features, new_feature=new_f) # add seed features, if any, for dfs to build on top of # if there are any multi output features, this will build on # top of each output of the feature. for f in self.seed_features: if f.dataframe_name == dataframe.ww.name: self._handle_new_feature(all_features=all_features, new_feature=f) def _build_where_clauses(self, all_features, dataframe): """Traverses all identity features and creates a Compare for each one, based on some heuristics Args: all_features (dict[dataframe name -> dict[str -> BaseFeature]]): Dict containing a dict for each dataframe. Each nested dict has features as values with their ids as keys. dataframe (DataFrame): DataFrame to calculate features for. """ def is_valid_feature(f): if isinstance(f, IdentityFeature): return True if isinstance(f, DirectFeature) and getattr( f.base_features[0], "column_name", None, ): return True return False for feat in [ f for f in all_features[dataframe.ww.name].values() if is_valid_feature(f) ]: # Get interesting_values from the EntitySet that was passed, which # is assumed to be the most recent version of the EntitySet. 
# Features can contain a stale EntitySet reference without # interesting_values if isinstance(feat, DirectFeature): df = feat.base_features[0].dataframe_name col = feat.base_features[0].column_name else: df = feat.dataframe_name col = feat.column_name metadata = self.es[df].ww.columns[col].metadata interesting_values = metadata.get("interesting_values") if interesting_values: for val in interesting_values: self.where_clauses[dataframe.ww.name].add(feat == val) def _build_transform_features( self, all_features, dataframe, max_depth=0, require_direct_input=False, ): """Creates trans_features for all the columns in a dataframe Args: all_features (dict[dataframe name: dict->[str->:class:`BaseFeature`]]): Dict containing a dict for each dataframe. Each nested dict has features as values with their ids as keys dataframe (DataFrame): DataFrame to calculate features for. """ new_max_depth = None if max_depth is not None: new_max_depth = max_depth - 1 # Keep track of features to add until the end to avoid applying # transform primitives to features that were also built by transform primitives features_to_add = [] for trans_prim in self.trans_primitives: current_options = self.primitive_options.get( trans_prim, self.primitive_options.get(trans_prim.name), ) if ignore_dataframe_for_primitive(current_options, dataframe): continue input_types = trans_prim.input_types matching_inputs = self._get_matching_inputs( all_features, dataframe, new_max_depth, input_types, trans_prim, current_options, require_direct_input=require_direct_input, feature_filter=not_a_transform_input, ) for matching_input in matching_inputs: if not can_stack_primitive_on_inputs(trans_prim, matching_input): continue if not any( True for bf in matching_input if bf.number_output_features != 1 ): new_f = TransformFeature(matching_input, primitive=trans_prim) features_to_add.append(new_f) for groupby_prim in self.groupby_trans_primitives: current_options = self.primitive_options.get( groupby_prim, 
self.primitive_options.get(groupby_prim.name), ) if ignore_dataframe_for_primitive(current_options, dataframe, groupby=True): continue input_types = groupby_prim.input_types[:] matching_inputs = self._get_matching_inputs( all_features, dataframe, new_max_depth, input_types, groupby_prim, current_options, feature_filter=not_a_transform_input, ) # get columns to use as groupbys, use IDs as default unless other groupbys specified if any( True for option in current_options if dataframe.ww.name in option.get("include_groupby_columns", []) ): column_schemas = "all" else: column_schemas = [ColumnSchema(semantic_tags=["foreign_key"])] groupby_matches = self._features_by_type( all_features=all_features, dataframe=dataframe, max_depth=new_max_depth, column_schemas=column_schemas, ) groupby_matches = filter_groupby_matches_by_options( groupby_matches, current_options, ) for matching_input in matching_inputs: if not can_stack_primitive_on_inputs(groupby_prim, matching_input): continue if any(True for bf in matching_input if bf.number_output_features != 1): continue if require_direct_input: if any_direct_in_matching_input := any( isinstance(bf, DirectFeature) for bf in matching_input ): all_direct_and_same_path_in_matching_input = ( _all_direct_and_same_path(matching_input) ) for groupby in groupby_matches: if require_direct_input: # If require_direct_input, require a DirectFeature in input or as a # groupby, and don't create features of inputs/groupbys which are # all direct features with the same relationship path # # If we require_direct_input, we skip Feature generation # in the following two cases: # (1) --> There are no DirectFeatures in the matching input, # and groupby is not a DirectFeature # (2) --> All of the matching input and groupby are DirectFeatures # with the same relationship path groupby_is_direct = isinstance(groupby[0], DirectFeature) # Checks case (1) if not any_direct_in_matching_input: if not groupby_is_direct: continue elif 
all_direct_and_same_path_in_matching_input: # Checks case (2) if ( groupby_is_direct and groupby[0].relationship_path == matching_input[0].relationship_path ): continue new_f = GroupByTransformFeature( list(matching_input), groupby=groupby[0], primitive=groupby_prim, ) features_to_add.append(new_f) for new_f in features_to_add: self._handle_new_feature(all_features=all_features, new_feature=new_f) def _build_forward_features(self, all_features, relationship_path, max_depth=0): _, relationship = relationship_path[0] child_dataframe_name = relationship.child_dataframe.ww.name parent_dataframe = relationship.parent_dataframe features = self._features_by_type( all_features=all_features, dataframe=parent_dataframe, max_depth=max_depth, column_schemas="all", ) for f in features: if self._feature_in_relationship_path(relationship_path, f): continue # limits allowing direct features of agg_feats with where clauses if isinstance(f, AggregationFeature): deep_base_features = [f] + f.get_dependencies(deep=True) for feat in deep_base_features: if isinstance(feat, AggregationFeature) and feat.where is not None: continue new_f = DirectFeature(f, child_dataframe_name, relationship=relationship) self._handle_new_feature(all_features=all_features, new_feature=new_f) def _build_agg_features( self, all_features, parent_dataframe, child_dataframe, max_depth, relationship_path, ): new_max_depth = None if max_depth is not None: new_max_depth = max_depth - 1 for agg_prim in self.agg_primitives: current_options = self.primitive_options.get( agg_prim, self.primitive_options.get(agg_prim.name), ) if ignore_dataframe_for_primitive(current_options, child_dataframe): continue def feature_filter(f): # Remove direct features of parent dataframe and features in relationship path. 
return ( not _direct_of_dataframe(f, parent_dataframe) ) and not self._feature_in_relationship_path(relationship_path, f) input_types = agg_prim.input_types matching_inputs = self._get_matching_inputs( all_features, child_dataframe, new_max_depth, input_types, agg_prim, current_options, feature_filter=feature_filter, ) matching_inputs = filter_matches_by_options( matching_inputs, current_options, ) wheres = list(self.where_clauses[child_dataframe.ww.name]) for matching_input in matching_inputs: if not can_stack_primitive_on_inputs(agg_prim, matching_input): continue new_f = AggregationFeature( matching_input, parent_dataframe_name=parent_dataframe.ww.name, relationship_path=relationship_path, primitive=agg_prim, ) self._handle_new_feature(new_f, all_features) # limit the stacking of where features # count up the number of where features # in this feature and its dependencies feat_wheres = [] for f in matching_input: if isinstance(f, AggregationFeature) and f.where is not None: feat_wheres.append(f) for feat in f.get_dependencies(deep=True): if ( isinstance(feat, AggregationFeature) and feat.where is not None ): feat_wheres.append(feat) if len(feat_wheres) >= self.where_stacking_limit: continue # limits the aggregation feature by the given allowed feature types.
if not any( True for primitive in self.where_primitives if issubclass(type(agg_prim), type(primitive)) ): continue for where in wheres: # limits the where feats so they are different than base feats base_names = [f.unique_name() for f in new_f.base_features] if any( True for base_feat in where.base_features if base_feat.unique_name() in base_names ): continue new_f = AggregationFeature( matching_input, parent_dataframe_name=parent_dataframe.ww.name, relationship_path=relationship_path, where=where, primitive=agg_prim, ) self._handle_new_feature(new_f, all_features) def _features_by_type( self, all_features, dataframe, max_depth, column_schemas=None, ): if max_depth is not None and max_depth < 0: return [] if dataframe.ww.name not in all_features: return [] def expand_features(feature) -> List[Any]: """Internal method to return either the single feature or the output features Args: feature (Feature): Feature instance Returns: List[Any]: list of features """ outputs = feature.number_output_features if outputs > 1: return [feature[i] for i in range(outputs)] return [feature] # Build the complete list of features prior to processing selected_features = [ expand_features(feature) for feature in all_features[dataframe.ww.name].values() ] selected_features = functools.reduce(operator.iconcat, selected_features, []) column_schemas = column_schemas if column_schemas else set() if max_depth is None and column_schemas == "all": return selected_features # assigning seed_features locally adds a slight performance benefit by not having to look # up the property for each round of the comprehension seed_features = self.seed_features if max_depth is not None: selected_features = [ feature for feature in selected_features if get_feature_depth(feature, stop_at=seed_features) <= max_depth ] def valid_input(column_schema) -> bool: """Helper method to validate the feature schema to the allowed column_schemas Args: column_schema (ColumnSchema): feature column schema Returns: bool: True 
if valid """ return any( True for schema in column_schemas if is_valid_input(column_schema, schema) ) if column_schemas and column_schemas != "all": selected_features = [ feature for feature in selected_features if valid_input(feature.column_schema) ] return selected_features def _feature_in_relationship_path(self, relationship_path, feature): # must be identity feature to be in the relationship path if not isinstance(feature, IdentityFeature): return False for _, relationship in relationship_path: if ( relationship.child_name == feature.dataframe_name and relationship._child_column_name == feature.column_name ): return True if ( relationship.parent_name == feature.dataframe_name and relationship._parent_column_name == feature.column_name ): return True return False def _get_matching_inputs( self, all_features, dataframe, max_depth, input_types, primitive, primitive_options, require_direct_input=False, feature_filter=None, ): if not isinstance(input_types[0], list): input_types = [input_types] matching_inputs = [] for input_type in input_types: features = self._features_by_type( all_features=all_features, dataframe=dataframe, max_depth=max_depth, column_schemas=list(input_type), ) if not features: continue if feature_filter: features = [f for f in features if feature_filter(f)] matches = match( input_type, features, commutative=primitive.commutative, require_direct_input=require_direct_input, ) matching_inputs.extend(matches) # everything following depends on populated matching_inputs if not matching_inputs: return matching_inputs if require_direct_input: # Don't create trans features of inputs which are all direct # features with the same relationship_path. 
matching_inputs = { inputs for inputs in matching_inputs if not _all_direct_and_same_path(inputs) } matching_inputs = filter_matches_by_options( matching_inputs, primitive_options, commutative=primitive.commutative, ) # Don't build features on numeric foreign key columns matching_inputs = [ match for match in matching_inputs if not _match_contains_numeric_foreign_key(match) ] return matching_inputs def _match_contains_numeric_foreign_key(match): match_schema = ColumnSchema(semantic_tags={"foreign_key", "numeric"}) return any(True for f in match if is_valid_input(f.column_schema, match_schema)) def not_a_transform_input(feature): """ Verifies transform inputs are not transform features or direct features of transform features Returns True if a transform primitive can stack on the feature, and False if it cannot. """ primitive = _find_root_primitive(feature) return not isinstance(primitive, TransformPrimitive) def _find_root_primitive(feature): """ If a feature is a DirectFeature, finds the primitive of the "original" base feature. 
""" if isinstance(feature, DirectFeature): return _find_root_primitive(feature.base_features[0]) return feature.primitive def _check_if_stacking_is_prohibited( feature: FeatureBase, f_primitive: PrimitiveBase, primitive: PrimitiveBase, primitive_class: Type[PrimitiveBase], primitive_stack_on_self: bool, tuple_primitive_stack_on_exclude: Tuple[Type[PrimitiveBase]], ): if not primitive_stack_on_self and isinstance(f_primitive, primitive_class): return True if isinstance(f_primitive, tuple_primitive_stack_on_exclude): return True if feature.number_output_features > 1: return True if f_primitive.base_of_exclude is not None and isinstance( primitive, tuple(f_primitive.base_of_exclude), ): return True return False def _check_if_stacking_is_permitted( f_primitive: PrimitiveBase, primitive_class: Type[PrimitiveBase], primitive_stack_on_self: bool, tuple_primitive_stack_on: Tuple[Type[PrimitiveBase]], ): if primitive_stack_on_self and isinstance(f_primitive, primitive_class): return True if tuple_primitive_stack_on is None or isinstance( f_primitive, tuple_primitive_stack_on, ): return True if f_primitive.base_of is None: return True if primitive_class in f_primitive.base_of: return True return False def can_stack_primitive_on_inputs(primitive: PrimitiveBase, inputs: List[FeatureBase]): """ Checks if features in inputs can be used with supplied primitive using the stacking rules. Returns True if stacking is possible, and False if not. """ primitive_class = primitive.__class__ tuple_primitive_stack_on = ( tuple(primitive.stack_on) if primitive.stack_on is not None else None ) tuple_primitive_stack_on_exclude = ( tuple(primitive.stack_on_exclude) if primitive.stack_on_exclude is not None else tuple() ) primitive_stack_on_self: bool = primitive.stack_on_self for feature in inputs: # In the case that the feature is a DirectFeature, the feature's primitive will be a PrimitiveBase object. # However, we want to check stacking rules with the primitive the DirectFeature is based on. 
f_primitive = _find_root_primitive(feature) # check if stacking is prohibited if _check_if_stacking_is_prohibited( feature, f_primitive, primitive, primitive_class, primitive_stack_on_self, tuple_primitive_stack_on_exclude, ): return False # we permit stacking only if it is not prohibited and meets the criterion to be permitted if not _check_if_stacking_is_permitted( f_primitive, primitive_class, primitive_stack_on_self, tuple_primitive_stack_on, ): return False # if we reach this line nothing is prohibited and stacking is permitted for all inputs return True def match_by_schema(features, column_schema): return [f for f in features if is_valid_input(f.column_schema, column_schema)] def match( input_types, features, replace=False, commutative=False, require_direct_input=False, ): to_match = input_types[0] matches = match_by_schema(features, to_match) if len(input_types) == 1: return [ (m,) for m in matches if (not require_direct_input or isinstance(m, DirectFeature)) ] matching_inputs = set() for m in matches: copy = features[:] if not replace: copy = [c for c in copy if c.unique_name() != m.unique_name()] # If we need a DirectFeature and this is not a DirectFeature then one of the rest must be. 
still_require_direct_input = require_direct_input and not isinstance( m, DirectFeature, ) rest = match( input_types[1:], copy, replace, require_direct_input=still_require_direct_input, ) for r in rest: new_match = [m] + list(r) # commutative uses frozenset instead of tuple because it doesn't # want multiple orderings of the same input if commutative: new_match = frozenset(new_match) else: new_match = tuple(new_match) matching_inputs.add(new_match) if commutative: matching_inputs = { tuple(sorted(s, key=lambda x: x.get_name().lower())) for s in matching_inputs } return matching_inputs def handle_primitive(primitive): if not isinstance(primitive, PrimitiveBase): primitive = primitive() assert isinstance(primitive, PrimitiveBase), "must be a primitive" return primitive def check_primitive( primitive, prim_type, aggregation_primitive_dict, transform_primitive_dict, ): if prim_type in ("transform", "groupby transform"): prim_dict = transform_primitive_dict supertype = TransformPrimitive arg_name = ( "trans_primitives" if prim_type == "transform" else "groupby_trans_primitives" ) s = "a transform" if prim_type in ("aggregation", "where"): prim_dict = aggregation_primitive_dict supertype = AggregationPrimitive arg_name = ( "agg_primitives" if prim_type == "aggregation" else "where_primitives" ) s = "an aggregation" if isinstance(primitive, str): prim_string = camel_and_title_to_snake(primitive) if prim_string not in prim_dict: raise ValueError( "Unknown {} primitive {}. 
" "Call ft.primitives.list_primitives() to get" " a list of available primitives".format(prim_type, prim_string), ) primitive = prim_dict[prim_string] primitive = handle_primitive(primitive) if not isinstance(primitive, supertype): raise ValueError( "Primitive {} in {} is not {} " "primitive".format( type(primitive), arg_name, s, ), ) return primitive def _all_direct_and_same_path(input_features: List[FeatureBase]) -> bool: """Given a list of features, returns True if they are all DirectFeatures with the same relationship_path, and False if not """ path = input_features[0].relationship_path for f in input_features: if not isinstance(f, DirectFeature) or f.relationship_path != path: return False return True def _build_ignore_columns(input_dict: Dict[str, List[str]]) -> DefaultDict[str, set]: """Iterates over the input dictionary to build the ignore_columns defaultdict. Expects the input_dict's keys to be strings, and values to be lists of strings. Throws a TypeError if they are not. """ ignore_columns = defaultdict(set) if input_dict is not None: for df_name, cols in input_dict.items(): if not isinstance(df_name, str) or not isinstance(cols, list): raise TypeError("ignore_columns should be dict[str -> list]") elif not all(isinstance(c, str) for c in cols): raise TypeError("list in ignore_columns must only have string values") ignore_columns[df_name] = set(cols) return ignore_columns def _direct_of_dataframe(feature, parent_dataframe): return ( isinstance(feature, DirectFeature) and feature.parent_dataframe_name == parent_dataframe.ww.name ) def get_feature_depth(feature, stop_at=None): """Helper method to allow caching of feature.get_depth() Why here and not in FeatureBase? This keeps the caching local to DFS. 
""" hash_key = hash(f"{feature.get_name()}{feature.dataframe_name}{stop_at}") if cached_depth := feature_cache.get(CacheType.DEPTH, hash_key): return cached_depth depth = feature.get_depth(stop_at=stop_at) feature_cache.add(CacheType.DEPTH, hash_key, depth) return depth ================================================ FILE: featuretools/synthesis/dfs.py ================================================ import warnings from featuretools.computational_backends import calculate_feature_matrix from featuretools.entityset import EntitySet from featuretools.exceptions import UnusedPrimitiveWarning from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis from featuretools.synthesis.utils import _categorize_features, get_unused_primitives from featuretools.utils import entry_point @entry_point("featuretools_dfs") def dfs( dataframes=None, relationships=None, entityset=None, target_dataframe_name=None, cutoff_time=None, instance_ids=None, agg_primitives=None, trans_primitives=None, groupby_trans_primitives=None, allowed_paths=None, max_depth=2, ignore_dataframes=None, ignore_columns=None, primitive_options=None, seed_features=None, drop_contains=None, drop_exact=None, where_primitives=None, max_features=-1, cutoff_time_in_index=False, save_progress=None, features_only=False, training_window=None, approximate=None, chunk_size=None, n_jobs=1, dask_kwargs=None, verbose=False, return_types=None, progress_callback=None, include_cutoff_time=True, ): """Calculates a feature matrix and features given a dictionary of dataframes and a list of relationships. Args: dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]): Dictionary of DataFrames. Entries take the format {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}. Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters will be ignored. 
relationships (list[(str, str, str, str)]): List of relationships between dataframes. List items are a tuple with the format (parent dataframe name, parent column, child dataframe name, child column). entityset (EntitySet): An already initialized entityset. Required if dataframes and relationships are not defined. target_dataframe_name (str): Name of dataframe on which to make predictions. cutoff_time (pd.DataFrame or Datetime or str): Specifies times at which to calculate the features for each instance. The resulting feature matrix will use data up to and including the cutoff_time. Can either be a DataFrame, a single value, or a string that can be parsed into a datetime. If a DataFrame is passed the instance ids for which to calculate features must be in a column with the same name as the target dataframe index or a column named `instance_id`. The cutoff time values in the DataFrame must be in a column with the same name as the target dataframe time index or a column named `time`. If the DataFrame has more than two columns, any additional columns will be added to the resulting feature matrix. If a single value is passed, this value will be used for all instances. instance_ids (list): List of instances on which to calculate features. Only used if cutoff_time is a single datetime. agg_primitives (list[str or AggregationPrimitive], optional): List of Aggregation Feature types to apply. Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"] trans_primitives (list[str or TransformPrimitive], optional): List of Transform Feature functions to apply. Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"] groupby_trans_primitives (list[str or TransformPrimitive], optional): list of Transform primitives to make GroupByTransformFeatures with allowed_paths (list[list[str]]): Allowed dataframe paths on which to make features. max_depth (int) : Maximum allowed depth of features. 
ignore_dataframes (list[str], optional): List of dataframes to blacklist when creating features. ignore_columns (dict[str -> list[str]], optional): List of specific columns within each dataframe to blacklist when creating features. primitive_options (list[dict[str or tuple[str] -> dict] or dict[str or tuple[str] -> dict, optional]): Specify options for a single primitive or a group of primitives. Lists of option dicts are used to specify options per input for primitives with multiple inputs. Each option ``dict`` can have the following keys: ``"include_dataframes"`` List of dataframes to be included when creating features for the primitive(s). All other dataframes will be ignored (list[str]). ``"ignore_dataframes"`` List of dataframes to be blacklisted when creating features for the primitive(s) (list[str]). ``"include_columns"`` List of specific columns within each dataframe to include when creating features for the primitive(s). All other columns in a given dataframe will be ignored (dict[str -> list[str]]). ``"ignore_columns"`` List of specific columns within each dataframe to blacklist when creating features for the primitive(s) (dict[str -> list[str]]). ``"include_groupby_dataframes"`` List of dataframes to be included when finding groupbys. All other dataframes will be ignored (list[str]). ``"ignore_groupby_dataframes"`` List of dataframes to blacklist when finding groupbys (list[str]). ``"include_groupby_columns"`` List of specific columns within each dataframe to include as groupbys, if applicable. All other columns in each dataframe will be ignored (dict[str -> list[str]]). ``"ignore_groupby_columns"`` List of specific columns within each dataframe to blacklist as groupbys (dict[str -> list[str]]). seed_features (list[:class:`.FeatureBase`]): List of manually defined features to use. drop_contains (list[str], optional): Drop features that contain these strings in their name.
drop_exact (list[str], optional): Drop features that exactly match these strings in name. where_primitives (list[str or PrimitiveBase], optional): List of primitive names (or types) to apply with where clauses. Default: ["count"] max_features (int, optional) : Cap the number of generated features to this number. If -1, no limit. features_only (bool, optional): If True, returns the list of features without calculating the feature matrix. cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex where the second index is the cutoff time (first is instance id). DataFrame will be sorted by (time, instance_id). training_window (Timedelta or str, optional): Window defining how much time before the cutoff time data can be used when calculating features. If ``None``, all data before cutoff time is used. Defaults to ``None``. Month and year units are not relative when Pandas Timedeltas are used. Relative units should be passed as a Featuretools Timedelta or a string. approximate (Timedelta): Bucket size to group instances with similar cutoff times by for features with costly calculations. For example, if bucket is 24 hours, all instances with cutoff times on the same day will use the same calculation for expensive features. save_progress (str, optional): Path to save intermediate computational results. n_jobs (int, optional): Number of parallel processes to use when calculating the feature matrix. chunk_size (int or float or None or "cutoff time", optional): Number of rows of output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that percentage of all instances. If passed the string "cutoff time", rows are split per cutoff time. dask_kwargs (dict, optional): Dictionary of keyword arguments to be passed when creating the dask client and scheduler. Even if n_jobs is not set, using `dask_kwargs` will enable multiprocessing.
Main parameters: cluster (str or dask.distributed.LocalCluster): cluster or address of cluster to send tasks to. If unspecified, a cluster will be created. diagnostics_port (int): port number to use for web dashboard. If left unspecified, web interface will not be enabled. Valid keyword arguments for LocalCluster will also be accepted. return_types (list[woodwork.ColumnSchema] or str, optional): List of ColumnSchemas defining the types of columns to return. If None, defaults to returning all numeric, categorical and boolean types. If given as the string 'all', returns all available types. progress_callback (callable): function to be called with incremental progress updates. Has the following parameters: update: percentage change (float between 0 and 100) in progress since last call progress_percent: percentage (float between 0 and 100) of total computation completed time_elapsed: total time in seconds that has elapsed since start of call include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``. Returns: pd.DataFrame, list[:class:`.FeatureBase`]: The feature matrix and the list of generated feature definitions. If ``features_only`` is ``True``, the feature matrix will not be generated and only the list of feature definitions is returned. Examples: .. 
code-block:: python from featuretools.primitives import Mean # cutoff times per instance dataframes = { "sessions" : (session_df, "id"), "transactions" : (transactions_df, "id", "transaction_time") } relationships = [("sessions", "id", "transactions", "session_id")] feature_matrix, features = dfs(dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", cutoff_time=cutoff_times) feature_matrix features = dfs(dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", features_only=True) """ if not isinstance(entityset, EntitySet): entityset = EntitySet("dfs", dataframes, relationships) dfs_object = DeepFeatureSynthesis( target_dataframe_name, entityset, agg_primitives=agg_primitives, trans_primitives=trans_primitives, groupby_trans_primitives=groupby_trans_primitives, max_depth=max_depth, where_primitives=where_primitives, allowed_paths=allowed_paths, drop_exact=drop_exact, drop_contains=drop_contains, ignore_dataframes=ignore_dataframes, ignore_columns=ignore_columns, primitive_options=primitive_options, max_features=max_features, seed_features=seed_features, ) features = dfs_object.build_features(verbose=verbose, return_types=return_types) trans, agg, groupby, where = _categorize_features(features) trans_unused = get_unused_primitives(trans_primitives, trans) agg_unused = get_unused_primitives(agg_primitives, agg) groupby_unused = get_unused_primitives(groupby_trans_primitives, groupby) where_unused = get_unused_primitives(where_primitives, where) unused_primitives = [trans_unused, agg_unused, groupby_unused, where_unused] if any(unused_primitives): warn_unused_primitives(unused_primitives) if features_only: return features assert ( features != [] ), "No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data." 
feature_matrix = calculate_feature_matrix( features, entityset=entityset, cutoff_time=cutoff_time, instance_ids=instance_ids, training_window=training_window, approximate=approximate, cutoff_time_in_index=cutoff_time_in_index, save_progress=save_progress, chunk_size=chunk_size, n_jobs=n_jobs, dask_kwargs=dask_kwargs, verbose=verbose, progress_callback=progress_callback, include_cutoff_time=include_cutoff_time, ) return feature_matrix, features def warn_unused_primitives(unused_primitives): messages = [ " trans_primitives: {}\n", " agg_primitives: {}\n", " groupby_trans_primitives: {}\n", " where_primitives: {}\n", ] unused_string = "" for primitives, message in zip(unused_primitives, messages): if primitives: unused_string += message.format(primitives) warning_msg = ( "Some specified primitives were not used during DFS:\n{}".format(unused_string) + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." ) warnings.warn(warning_msg, UnusedPrimitiveWarning) ================================================ FILE: featuretools/synthesis/encode_features.py ================================================ import logging import pandas as pd from featuretools.computational_backends.utils import get_ww_types_from_features from featuretools.utils.gen_utils import make_tqdm_iterator logger = logging.getLogger("featuretools") DEFAULT_TOP_N = 10 def encode_features( feature_matrix, features, top_n=DEFAULT_TOP_N, include_unknown=True, to_encode=None, inplace=False, drop_first=False, verbose=False, ): """Encode categorical features Args: feature_matrix (pd.DataFrame): Dataframe of features. features (list[PrimitiveBase]): Feature definitions in feature_matrix. top_n (int or dict[string -> int]): Number of top values to include. 
If dict[string -> int] is used, key is feature name and value is the number of top values to include for that feature. If a feature's name is not in the dictionary, a default value of 10 is used. include_unknown (bool): Add feature encoding an unknown class. Defaults to True. to_encode (list[str]): List of feature names to encode. Features not in this list are left unencoded in the output matrix. Defaults to encoding all necessary features. inplace (bool): Encode feature_matrix in place. Defaults to False. drop_first (bool): Whether to get k-1 dummies out of k categorical levels by removing the first level. Defaults to False. verbose (bool): Print progress info. Returns: (pd.DataFrame, list): encoded feature_matrix, encoded features Example: .. ipython:: python :suppress: from featuretools.tests.testing_utils import make_ecommerce_entityset import featuretools as ft es = make_ecommerce_entityset() .. ipython:: python f1 = ft.Feature(es["log"].ww["product_id"]) f2 = ft.Feature(es["log"].ww["purchased"]) f3 = ft.Feature(es["log"].ww["value"]) features = [f1, f2, f3] ids = [0, 1, 2, 3, 4, 5] feature_matrix = ft.calculate_feature_matrix(features, es, instance_ids=ids) fm_encoded, f_encoded = ft.encode_features(feature_matrix, features) f_encoded fm_encoded, f_encoded = ft.encode_features(feature_matrix, features, top_n=2) f_encoded fm_encoded, f_encoded = ft.encode_features(feature_matrix, features, include_unknown=False) f_encoded fm_encoded, f_encoded = ft.encode_features(feature_matrix, features, to_encode=['purchased']) f_encoded fm_encoded, f_encoded = ft.encode_features(feature_matrix, features, drop_first=True) f_encoded """ if inplace: X = feature_matrix else: X = feature_matrix.copy() old_feature_names = set() for feature in features: for fname in feature.get_feature_names(): assert fname in X.columns, "Feature %s not found in feature matrix" % ( fname ) old_feature_names.add(fname) pass_through = [col for col in X.columns if col not in old_feature_names] if
verbose: iterator = make_tqdm_iterator( iterable=features, total=len(features), desc="Encoding pass 1", unit="feature", ) else: iterator = features new_feature_list = [] kept_columns = [] encoded_columns = [] columns_info = feature_matrix.ww.columns for f in iterator: # TODO: features with multiple columns are not encoded by this method, # which can cause an "encoded" matrix with non-numeric values is_discrete = {"category", "foreign_key"}.intersection( f.column_schema.semantic_tags, ) if f.number_output_features > 1 or not is_discrete: if f.number_output_features > 1: logger.warning( "Feature %s has multiple columns and will not " "be encoded. This may result in a matrix with" " non-numeric values." % (f), ) new_feature_list.append(f) kept_columns.extend(f.get_feature_names()) continue if to_encode is not None and f.get_name() not in to_encode: new_feature_list.append(f) kept_columns.extend(f.get_feature_names()) continue val_counts = X[f.get_name()].value_counts() # Remove 0 count category values val_counts = val_counts[val_counts > 0].to_frame() index_name = val_counts.index.name val_counts = val_counts.rename(columns={val_counts.columns[0]: "count"}) if index_name is None: if "index" in val_counts.columns: index_name = "level_0" else: index_name = "index" val_counts.reset_index(inplace=True) val_counts = val_counts.sort_values(["count", index_name], ascending=False) val_counts.set_index(index_name, inplace=True) select_n = top_n if isinstance(top_n, dict): select_n = top_n.get(f.get_name(), DEFAULT_TOP_N) if drop_first: select_n = min(len(val_counts), select_n) select_n = max(select_n - 1, 1) unique = val_counts.head(select_n).index.tolist() for label in unique: add = f == label add_name = add.get_name() new_feature_list.append(add) new_col = X[f.get_name()] == label new_col.rename(add_name, inplace=True) encoded_columns.append(new_col) if include_unknown: unknown = f.isin(unique).NOT().rename(f.get_name() + " is unknown") unknown_name = unknown.get_name()
new_feature_list.append(unknown) new_col = ~X[f.get_name()].isin(unique) new_col.rename(unknown_name, inplace=True) encoded_columns.append(new_col) if inplace: X.drop(f.get_name(), axis=1, inplace=True) kept_columns.extend(pass_through) if inplace: for encoded_column in encoded_columns: X[encoded_column.name] = encoded_column else: X = pd.concat([X[kept_columns]] + encoded_columns, axis=1) entityset = new_feature_list[0].entityset ww_init_kwargs = get_ww_types_from_features(new_feature_list, entityset) # Grab ww metadata from feature matrix since it may be more exact for column in kept_columns: ww_init_kwargs["logical_types"][column] = columns_info[column].logical_type ww_init_kwargs["semantic_tags"][column] = columns_info[column].semantic_tags ww_init_kwargs["column_origins"][column] = columns_info[column].origin X.ww.init(**ww_init_kwargs) return X, new_feature_list ================================================ FILE: featuretools/synthesis/get_valid_primitives.py ================================================ from featuretools.primitives import AggregationPrimitive, TransformPrimitive from featuretools.primitives.utils import ( get_aggregation_primitives, get_transform_primitives, ) from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis from featuretools.synthesis.utils import _categorize_features, get_unused_primitives def get_valid_primitives( entityset, target_dataframe_name, max_depth=2, selected_primitives=None, **dfs_kwargs, ): """ Returns two lists of primitives (transform and aggregation) containing primitives that can be applied to the specific target dataframe to create features. If the optional 'selected_primitives' parameter is not used, all discoverable primitives will be considered. Note: When using a ``max_depth`` greater than 1, some primitives returned by this function may not create any features if passed to DFS alone. These primitives relied on features created by other primitives as input (primitive stacking). 
Args: entityset (EntitySet): An already initialized entityset. target_dataframe_name (str): Name of dataframe to create features for. max_depth (int, optional): Maximum allowed depth of features. selected_primitives (list[str or AggregationPrimitive/TransformPrimitive], optional): List of primitives to consider when looking for valid primitives. If None, all primitives will be considered. dfs_kwargs (keywords): Additional keyword arguments to pass to the DeepFeatureSynthesis object. Should not include ``max_depth``, ``agg_primitives``, or ``trans_primitives``, as those are passed in explicitly. Returns: list[AggregationPrimitive], list[TransformPrimitive]: The list of valid aggregation primitives and the list of valid transform primitives. """ agg_primitives = [] trans_primitives = [] available_aggs = get_aggregation_primitives() available_trans = get_transform_primitives() if selected_primitives: for prim in selected_primitives: if not isinstance(prim, str): if issubclass(prim, AggregationPrimitive): prim_list = agg_primitives elif issubclass(prim, TransformPrimitive): prim_list = trans_primitives else: raise ValueError( f"Selected primitive {prim} is not an " "AggregationPrimitive, TransformPrimitive, or str", ) elif prim in available_aggs: prim = available_aggs[prim] prim_list = agg_primitives elif prim in available_trans: prim = available_trans[prim] prim_list = trans_primitives else: raise ValueError(f"'{prim}' is not a recognized primitive name") prim_list.append(prim) else: agg_primitives = [agg for agg in available_aggs.values()] trans_primitives = [trans for trans in available_trans.values()] dfs_object = DeepFeatureSynthesis( target_dataframe_name, entityset, agg_primitives=agg_primitives, trans_primitives=trans_primitives, max_depth=max_depth, **dfs_kwargs, ) features = dfs_object.build_features() trans, agg, _, _ = _categorize_features(features) trans_unused = get_unused_primitives(trans_primitives, trans) agg_unused =
get_unused_primitives(agg_primitives, agg) # switch from str to class agg_unused = [available_aggs[name] for name in agg_unused] trans_unused = [available_trans[name] for name in trans_unused] used_agg_prims = set(agg_primitives).difference(set(agg_unused)) used_trans_prims = set(trans_primitives).difference(set(trans_unused)) return list(used_agg_prims), list(used_trans_prims) ================================================ FILE: featuretools/synthesis/utils.py ================================================ from featuretools.feature_base import ( AggregationFeature, FeatureOutputSlice, GroupByTransformFeature, TransformFeature, ) from featuretools.utils.gen_utils import camel_and_title_to_snake def _categorize_features(features): """Categorize each feature by its primitive type in a set of primitives along with any dependencies""" transform = set() agg = set() groupby = set() where = set() explored = set() def get_feature_data(feature): if feature.get_name() in explored: return dependencies = [] if isinstance(feature, FeatureOutputSlice): feature = feature.base_feature if isinstance(feature, AggregationFeature): if feature.where: where.add(feature.primitive.name) else: agg.add(feature.primitive.name) elif isinstance(feature, GroupByTransformFeature): groupby.add(feature.primitive.name) elif isinstance(feature, TransformFeature): transform.add(feature.primitive.name) feature_deps = feature.get_dependencies() if feature_deps: dependencies.extend(feature_deps) explored.add(feature.get_name()) for dep in dependencies: get_feature_data(dep) for feature in features: get_feature_data(feature) return transform, agg, groupby, where def get_unused_primitives(specified, used): """Get a list of unused primitives based on a list of specified primitives and a list of output features""" if not specified: return [] specified = { camel_and_title_to_snake(primitive) if isinstance(primitive, str) else primitive.name for primitive in specified } return 
sorted(specified.difference(used)) ================================================ FILE: featuretools/tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/computational_backend/__init__.py ================================================ ================================================ FILE: featuretools/tests/computational_backend/test_calculate_feature_matrix.py ================================================ import logging import os import re import shutil from datetime import datetime from itertools import combinations from random import randint import numpy as np import pandas as pd import psutil import pytest from tqdm import tqdm from woodwork.column_schema import ColumnSchema from woodwork.logical_types import ( Age, AgeNullable, Boolean, BooleanNullable, Integer, IntegerNullable, ) from featuretools import ( EntitySet, Feature, GroupByTransformFeature, Timedelta, calculate_feature_matrix, dfs, ) from featuretools.computational_backends import utils from featuretools.computational_backends.calculate_feature_matrix import ( FEATURE_CALCULATION_PERCENTAGE, _chunk_dataframe_groups, _handle_chunk_size, scatter_warning, ) from featuretools.computational_backends.utils import ( bin_cutoff_times, create_client_and_cluster, n_jobs_to_workers, ) from featuretools.feature_base import ( AggregationFeature, DirectFeature, FeatureOutputSlice, IdentityFeature, ) from featuretools.primitives import ( Count, Max, Min, Negate, NMostCommon, Percentile, Sum, TransformPrimitive, ) from featuretools.tests.testing_utils import ( backward_path, get_mock_client_cluster, ) def test_scatter_warning(caplog): logger = logging.getLogger("featuretools") match = "EntitySet was only scattered to {} out of {} workers" warning_message = match.format(1, 2) logger.propagate = True scatter_warning(1, 2) logger.propagate = False assert warning_message in caplog.text def test_calc_feature_matrix(es): times 
= list( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)], ) instances = range(17) cutoff_time = pd.DataFrame({"time": times, es["log"].ww.index: instances}) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 property_feature = Feature(es["log"].ww["value"]) > 10 feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, verbose=True, ) assert (feature_matrix[property_feature.get_name()] == labels).values.all() error_text = "features must be a non-empty list of features" with pytest.raises(AssertionError, match=error_text): feature_matrix = calculate_feature_matrix( "features", es, cutoff_time=cutoff_time, ) with pytest.raises(AssertionError, match=error_text): feature_matrix = calculate_feature_matrix([], es, cutoff_time=cutoff_time) with pytest.raises(AssertionError, match=error_text): feature_matrix = calculate_feature_matrix( [1, 2, 3], es, cutoff_time=cutoff_time, ) error_text = ( "cutoff_time times must be datetime type: try casting via " "pd\\.to_datetime\\(\\)" ) with pytest.raises(TypeError, match=error_text): calculate_feature_matrix( [property_feature], es, instance_ids=range(17), cutoff_time=17, ) error_text = "cutoff_time must be a single value or DataFrame" with pytest.raises(TypeError, match=error_text): calculate_feature_matrix( [property_feature], es, instance_ids=range(17), cutoff_time=times, ) cutoff_times_dup = pd.DataFrame( { "time": [datetime(2018, 3, 1), datetime(2018, 3, 1)], es["log"].ww.index: [1, 1], }, ) error_text = "Duplicated rows in cutoff time dataframe." 
with pytest.raises(AssertionError, match=error_text): feature_matrix = calculate_feature_matrix( [property_feature], entityset=es, cutoff_time=cutoff_times_dup, ) cutoff_reordered = cutoff_time.iloc[[-1, 10, 1]] # 3 ids not ordered by cutoff time feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_reordered, verbose=True, ) assert all(feature_matrix.index == cutoff_reordered["id"].values) def test_cfm_compose(es, lt): property_feature = Feature(es["log"].ww["value"]) > 10 feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=lt, verbose=True, ) assert ( feature_matrix[property_feature.get_name()] == feature_matrix["label_func"] ).values.all() def test_cfm_compose_approximate(es, lt): property_feature = Feature(es["log"].ww["value"]) > 10 feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=lt, approximate="1s", verbose=True, ) assert type(feature_matrix) == pd.core.frame.DataFrame assert ( feature_matrix[property_feature.get_name()] == feature_matrix["label_func"] ).values.all() def test_cfm_approximate_correct_ordering(): trips = { "trip_id": [i for i in range(1000)], "flight_time": [datetime(1998, 4, 2) for i in range(350)] + [datetime(1997, 4, 3) for i in range(650)], "flight_id": [randint(1, 25) for i in range(1000)], "trip_duration": [randint(1, 999) for i in range(1000)], } df = pd.DataFrame.from_dict(trips) es = EntitySet("flights") es.add_dataframe( dataframe_name="trips", dataframe=df, index="trip_id", time_index="flight_time", ) es.normalize_dataframe( base_dataframe_name="trips", new_dataframe_name="flights", index="flight_id", make_time_index=True, ) features = dfs(entityset=es, target_dataframe_name="trips", features_only=True) flight_features = [ feature for feature in features if isinstance(feature, DirectFeature) and isinstance(feature.base_features[0], AggregationFeature) ] property_feature = IdentityFeature(es["trips"].ww["trip_id"]) cutoff_time = 
pd.DataFrame.from_dict( {"instance_id": df["trip_id"], "time": df["flight_time"]}, ) time_feature = IdentityFeature(es["trips"].ww["flight_time"]) feature_matrix = calculate_feature_matrix( flight_features + [property_feature, time_feature], es, cutoff_time_in_index=True, cutoff_time=cutoff_time, ) feature_matrix.index.names = ["instance", "time"] assert np.all( feature_matrix.reset_index("time").reset_index()[["instance", "time"]].values == feature_matrix[["trip_id", "flight_time"]].values, ) feature_matrix_2 = calculate_feature_matrix( flight_features + [property_feature, time_feature], es, cutoff_time=cutoff_time, cutoff_time_in_index=True, approximate=Timedelta(2, "d"), ) feature_matrix_2.index.names = ["instance", "time"] assert np.all( feature_matrix_2.reset_index("time").reset_index()[["instance", "time"]].values == feature_matrix_2[["trip_id", "flight_time"]].values, ) for column in feature_matrix: for x, y in zip(feature_matrix[column], feature_matrix_2[column]): assert (pd.isnull(x) and pd.isnull(y)) or (x == y) def test_cfm_no_cutoff_time_index(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat4 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat4, "sessions") cutoff_time = pd.DataFrame( { "time": [datetime(2013, 4, 9, 10, 31, 19), datetime(2013, 4, 9, 11, 0, 0)], "instance_id": [0, 2], }, ) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, cutoff_time_in_index=False, approximate=Timedelta(12, "s"), cutoff_time=cutoff_time, ) assert feature_matrix.index.name == "id" assert feature_matrix.index.tolist() == [0, 2] assert feature_matrix[dfeat.get_name()].tolist() == [10, 10] assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1] cutoff_time = pd.DataFrame( { "time": [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)], "instance_id": [0, 2], }, ) feature_matrix_2 = calculate_feature_matrix( [dfeat, agg_feat], es, 
cutoff_time_in_index=False, approximate=Timedelta(10, "s"), cutoff_time=cutoff_time, ) assert feature_matrix_2.index.name == "id" assert feature_matrix_2.index.tolist() == [0, 2] assert feature_matrix_2[dfeat.get_name()].tolist() == [7, 10] assert feature_matrix_2[agg_feat.get_name()].tolist() == [5, 1] def test_cfm_duplicated_index_in_cutoff_time(es): times = [ datetime(2011, 4, 1), datetime(2011, 5, 1), datetime(2011, 4, 1), datetime(2011, 5, 1), ] instances = [1, 1, 2, 2] property_feature = Feature(es["log"].ww["value"]) > 10 cutoff_time = pd.DataFrame({"id": instances, "time": times}, index=[1, 1, 1, 1]) feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, chunk_size=1, ) assert feature_matrix.shape[0] == cutoff_time.shape[0] def test_saveprogress(es, tmp_path): times = list( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)], ) cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = Feature(es["log"].ww["value"]) > 10 save_progress = str(tmp_path) fm_save = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, save_progress=save_progress, ) _, _, files = next(os.walk(save_progress)) files = [os.path.join(save_progress, file) for file in files] # there are 17 datetime files created above assert len(files) == 17 list_df = [] for file_ in files: df = pd.read_csv(file_, index_col="id", header=0) list_df.append(df) merged_df = pd.concat(list_df) merged_df.set_index(pd.DatetimeIndex(times), inplace=True, append=True) fm_no_save = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, ) assert np.all((merged_df.sort_index().values) == (fm_save.sort_index().values)) assert 
np.all((fm_no_save.sort_index().values) == (fm_save.sort_index().values)) assert np.all((fm_no_save.sort_index().values) == (merged_df.sort_index().values)) shutil.rmtree(save_progress) def test_cutoff_time_correctly(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 1, 2]}) feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, ) labels = [10, 5, 0] assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_cutoff_time_binning(): cutoff_time = pd.DataFrame( { "time": [ datetime(2011, 4, 9, 12, 31), datetime(2011, 4, 10, 11), datetime(2011, 4, 10, 13, 10, 1), ], "instance_id": [1, 2, 3], }, ) cutoff_time.ww.init() binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(4, "h")) labels = [ datetime(2011, 4, 9, 12), datetime(2011, 4, 10, 8), datetime(2011, 4, 10, 12), ] for i in binned_cutoff_times.index: assert binned_cutoff_times["time"][i] == labels[i] binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(25, "h")) labels = [ datetime(2011, 4, 8, 22), datetime(2011, 4, 9, 23), datetime(2011, 4, 9, 23), ] for i in binned_cutoff_times.index: assert binned_cutoff_times["time"][i] == labels[i] error_text = "Unit is relative" with pytest.raises(ValueError, match=error_text): binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(1, "mo")) def test_cutoff_time_columns_order(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)] id_col_names = ["instance_id", es["customers"].ww.index] time_col_names = ["time", es["customers"].ww.time_index] for id_col in id_col_names: for time_col in time_col_names: cutoff_time = pd.DataFrame( { "dummy_col_1": [1, 2, 3], id_col: [0, 1, 
2], "dummy_col_2": [True, False, False], time_col: times, }, ) feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, ) labels = [10, 5, 0] assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_cutoff_time_df_redundant_column_names(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)] cutoff_time = pd.DataFrame( { es["customers"].ww.index: [0, 1, 2], "instance_id": [0, 1, 2], "dummy_col": [True, False, False], "time": times, }, ) err_msg = ( 'Cutoff time DataFrame cannot contain both a column named "instance_id" and a column' " with the same name as the target dataframe index" ) with pytest.raises(AttributeError, match=err_msg): calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time) cutoff_time = pd.DataFrame( { es["customers"].ww.time_index: [0, 1, 2], "instance_id": [0, 1, 2], "dummy_col": [True, False, False], "time": times, }, ) err_msg = ( 'Cutoff time DataFrame cannot contain both a column named "time" and a column' " with the same name as the target dataframe time index" ) with pytest.raises(AttributeError, match=err_msg): calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time) def test_training_window(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) top_level_agg = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) # include a feature that is a direct feature of a higher-level agg # so we have multiple "filter eids" in get_pandas_data_slice, # and we go through the loop to pull data with a training_window param more than once dagg = DirectFeature(top_level_agg, "customers") # for now, warns if last_time_index not present times = [ datetime(2011, 4, 9, 12, 31), datetime(2011, 4, 10, 11), datetime(2011, 4, 10, 13, 10), ] cutoff_time =
pd.DataFrame({"time": times, "instance_id": [0, 1, 2]}) warn_text = ( "Using training_window but last_time_index is not set for dataframe customers" ) with pytest.warns(UserWarning, match=warn_text): feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=cutoff_time, training_window="2 hours", ) es.add_last_time_indexes() error_text = "Training window cannot be in observations" with pytest.raises(AssertionError, match=error_text): feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, training_window=Timedelta(2, "observations"), ) # Case1. include_cutoff_time = True feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=cutoff_time, training_window="2 hours", include_cutoff_time=True, ) prop_values = [4, 5, 1] dagg_values = [3, 2, 1] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() # Case2. include_cutoff_time = False feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=cutoff_time, training_window="2 hours", include_cutoff_time=False, ) prop_values = [5, 5, 2] dagg_values = [3, 2, 1] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() # Case3. include_cutoff_time = False with single cutoff time value feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=pd.to_datetime("2011-04-09 10:40:00"), training_window="9 minutes", include_cutoff_time=False, ) prop_values = [0, 4, 0] dagg_values = [3, 3, 3] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() # Case4. 
include_cutoff_time = True with single cutoff time value feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=pd.to_datetime("2011-04-10 10:40:00"), training_window="2 days", include_cutoff_time=True, ) prop_values = [0, 10, 1] dagg_values = [3, 3, 3] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() def test_training_window_overlap(es): es.add_last_time_indexes() count_log = Feature( Feature(es["log"].ww["id"]), parent_dataframe_name="customers", primitive=Count, ) cutoff_time = pd.DataFrame( { "id": [0, 0], "time": ["2011-04-09 10:30:00", "2011-04-09 10:40:00"], }, ).astype({"time": "datetime64[ns]"}) # Case1. include_cutoff_time = True actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=cutoff_time, cutoff_time_in_index=True, training_window="10 minutes", include_cutoff_time=True, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [1, 9]) # Case2. include_cutoff_time = False actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=cutoff_time, cutoff_time_in_index=True, training_window="10 minutes", include_cutoff_time=False, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [0, 9]) def test_include_cutoff_time_without_training_window(es): es.add_last_time_indexes() count_log = Feature( base=Feature(es["log"].ww["id"]), parent_dataframe_name="customers", primitive=Count, ) cutoff_time = pd.DataFrame( { "id": [0, 0], "time": ["2011-04-09 10:30:00", "2011-04-09 10:31:00"], }, ).astype({"time": "datetime64[ns]"}) # Case1. include_cutoff_time = True actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=cutoff_time, cutoff_time_in_index=True, include_cutoff_time=True, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [1, 6]) # Case2. 
include_cutoff_time = False actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=cutoff_time, cutoff_time_in_index=True, include_cutoff_time=False, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [0, 5]) # Case3. include_cutoff_time = True with single cutoff time value actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=pd.to_datetime("2011-04-09 10:31:00"), instance_ids=[0], cutoff_time_in_index=True, include_cutoff_time=True, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [6]) # Case4. include_cutoff_time = False with single cutoff time value actual = calculate_feature_matrix( features=[count_log], entityset=es, cutoff_time=pd.to_datetime("2011-04-09 10:31:00"), instance_ids=[0], cutoff_time_in_index=True, include_cutoff_time=False, ) actual = actual["COUNT(log)"] np.testing.assert_array_equal(actual.values, [5]) def test_approximate_dfeat_of_agg_on_target_include_cutoff_time(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") cutoff_time = pd.DataFrame( {"time": [datetime(2011, 4, 9, 10, 31, 19)], "instance_id": [0]}, ) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat2, agg_feat], es, approximate=Timedelta(20, "s"), cutoff_time=cutoff_time, include_cutoff_time=False, ) # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0) and # log event 5 at datetime(2011, 4, 9, 10, 31, 0) will be # excluded due to approximate cutoff time point assert feature_matrix[dfeat.get_name()].tolist() == [5] assert feature_matrix[agg_feat.get_name()].tolist() == [5] feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(20, "s"), cutoff_time=cutoff_time, include_cutoff_time=True, ) # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0) and # log 
event 5 at datetime(2011, 4, 9, 10, 31, 0) will be # included due to approximate cutoff time point assert feature_matrix[dfeat.get_name()].tolist() == [6] assert feature_matrix[agg_feat.get_name()].tolist() == [5] def test_training_window_recent_time_index(es): # customer with no sessions row = { "id": [3], "age": [73], "région_id": ["United States"], "cohort": [1], "cancel_reason": ["Lost interest"], "loves_ice_cream": [True], "favorite_quote": ["Don't look back. Something might be gaining on you."], "signup_date": [datetime(2011, 4, 10)], "upgrade_date": [datetime(2011, 4, 12)], "cancel_date": [datetime(2011, 5, 13)], "birthday": [datetime(1938, 2, 1)], "engagement_level": [2], } to_add_df = pd.DataFrame(row) to_add_df.index = range(3, 4) # have to convert category to int in order to concat old_df = es["customers"] old_df.index = old_df.index.astype("int") old_df["id"] = old_df["id"].astype(int) df = pd.concat([old_df, to_add_df], sort=True) # convert back after df.index = df.index.astype("category") df["id"] = df["id"].astype("category") es.replace_dataframe( dataframe_name="customers", df=df, recalculate_last_time_indexes=False, ) es.add_last_time_indexes() property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) top_level_agg = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dagg = DirectFeature(top_level_agg, "customers") instance_ids = [0, 1, 2, 3] times = [ datetime(2011, 4, 9, 12, 31), datetime(2011, 4, 10, 11), datetime(2011, 4, 10, 13, 10, 1), datetime(2011, 4, 10, 1, 59, 59), ] cutoff_time = pd.DataFrame({"time": times, "instance_id": instance_ids}) # Case1. 
include_cutoff_time = True feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=cutoff_time, training_window="2 hours", include_cutoff_time=True, ) prop_values = [4, 5, 1, 0] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() dagg_values = [3, 2, 1, 3] feature_matrix.sort_index(inplace=True) assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() # Case2. include_cutoff_time = False feature_matrix = calculate_feature_matrix( [property_feature, dagg], es, cutoff_time=cutoff_time, training_window="2 hours", include_cutoff_time=False, ) prop_values = [5, 5, 1, 0] assert (feature_matrix[property_feature.get_name()] == prop_values).values.all() dagg_values = [3, 2, 1, 3] feature_matrix.sort_index(inplace=True) assert (feature_matrix[dagg.get_name()] == dagg_values).values.all() def test_approximate_multiple_instances_per_cutoff_time(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(1, "week"), cutoff_time=cutoff_time, ) assert feature_matrix.shape[0] == 2 assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1] def test_approximate_with_multiple_paths(diamond_es): es = diamond_es path = backward_path(es, ["regions", "customers", "transactions"]) agg_feat = AggregationFeature( Feature(es["transactions"].ww["id"]), parent_dataframe_name="regions", relationship_path=path, primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = 
calculate_feature_matrix( [dfeat], es, approximate=Timedelta(1, "week"), cutoff_time=cutoff_time, ) assert feature_matrix[dfeat.get_name()].tolist() == [6, 2] def test_approximate_dfeat_of_agg_on_target(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_time, ) assert feature_matrix[dfeat.get_name()].tolist() == [7, 10] assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1] def test_approximate_dfeat_of_need_all_values(es): p = Feature(es["log"].ww["value"], primitive=Percentile) agg_feat = Feature(p, parent_dataframe_name="sessions", primitive=Sum) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time_in_index=True, cutoff_time=cutoff_time, ) log_df = es["log"] instances = [0, 2] cutoffs = [pd.Timestamp("2011-04-09 10:31:19"), pd.Timestamp("2011-04-09 11:00:00")] approxes = [ pd.Timestamp("2011-04-09 10:31:10"), pd.Timestamp("2011-04-09 11:00:00"), ] true_vals = [] true_vals_approx = [] for instance, cutoff, approx in zip(instances, cutoffs, approxes): log_data_cutoff = log_df[log_df["datetime"] < cutoff] log_data_cutoff["percentile"] = log_data_cutoff["value"].rank(pct=True) true_agg = ( log_data_cutoff.loc[log_data_cutoff["session_id"] == instance, "percentile"] .fillna(0) .sum() ) true_vals.append(round(true_agg, 
3)) log_data_approx = log_df[log_df["datetime"] < approx] log_data_approx["percentile"] = log_data_approx["value"].rank(pct=True) true_agg_approx = ( log_data_approx.loc[ log_data_approx["session_id"].isin([0, 1, 2]), "percentile", ] .fillna(0) .sum() ) true_vals_approx.append(round(true_agg_approx, 3)) lapprox = [round(x, 3) for x in feature_matrix[dfeat.get_name()].tolist()] test_list = [round(x, 3) for x in feature_matrix[agg_feat.get_name()].tolist()] assert lapprox == true_vals_approx assert test_list == true_vals def test_uses_full_dataframe_feat_of_approximate(es): agg_feat = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) agg_feat3 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Max) dfeat = DirectFeature(agg_feat2, "sessions") dfeat2 = DirectFeature(agg_feat3, "sessions") p = Feature(dfeat, primitive=Percentile) times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) # only dfeat2 should be approximated # because Percentile needs all values feature_matrix_only_dfeat2 = calculate_feature_matrix( [dfeat2], es, approximate=Timedelta(10, "s"), cutoff_time_in_index=True, cutoff_time=cutoff_time, ) assert feature_matrix_only_dfeat2[dfeat2.get_name()].tolist() == [50, 50] feature_matrix_approx = calculate_feature_matrix( [p, dfeat, dfeat2, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time_in_index=True, cutoff_time=cutoff_time, ) assert ( feature_matrix_only_dfeat2[dfeat2.get_name()].tolist() == feature_matrix_approx[dfeat2.get_name()].tolist() ) feature_matrix_small_approx = calculate_feature_matrix( [p, dfeat, dfeat2, agg_feat], es, approximate=Timedelta(10, "ms"), cutoff_time_in_index=True, cutoff_time=cutoff_time, ) feature_matrix_no_approx = calculate_feature_matrix( [p, dfeat, dfeat2, agg_feat], es, cutoff_time_in_index=True, 
cutoff_time=cutoff_time, ) for f in [p, dfeat, agg_feat]: for fm1, fm2 in combinations( [ feature_matrix_approx, feature_matrix_small_approx, feature_matrix_no_approx, ], 2, ): assert fm1[f.get_name()].tolist() == fm2[f.get_name()].tolist() def test_approximate_dfeat_of_dfeat_of_agg_on_target(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(Feature(agg_feat2, "sessions"), "log") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [dfeat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_time, ) assert feature_matrix[dfeat.get_name()].tolist() == [7, 10] def test_empty_path_approximate_full(es): es["sessions"].ww["customer_id"] = pd.Series( [np.nan, np.nan, np.nan, 1, 1, 2], dtype="category", ) # Need to reassign the `foreign_key` tag as the column reassignment above removes it es["sessions"].ww.set_types(semantic_tags={"customer_id": "foreign_key"}) agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_time, ) vals1 = feature_matrix[dfeat.get_name()].tolist() assert vals1[0] == 0 assert vals1[1] == 0 assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1] def test_approx_base_feature_is_also_first_class_feature(es): log_to_products = DirectFeature(Feature(es["products"].ww["rating"]), "log") # This should still be computed properly agg_feat = 
Feature(log_to_products, parent_dataframe_name="sessions", primitive=Min) customer_agg_feat = Feature( agg_feat, parent_dataframe_name="customers", primitive=Sum, ) # This is to be approximated sess_to_cust = DirectFeature(customer_agg_feat, "sessions") times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)] cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]}) feature_matrix = calculate_feature_matrix( [sess_to_cust, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_time, ) vals1 = feature_matrix[sess_to_cust.get_name()].tolist() assert vals1 == [8.5, 7] vals2 = feature_matrix[agg_feat.get_name()].tolist() assert vals2 == [4, 1.5] def test_approximate_time_split_returns_the_same_result(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum) dfeat = DirectFeature(agg_feat2, "sessions") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:07:30"), pd.Timestamp("2011-04-09 10:07:40"), ], "instance_id": [0, 0], }, ) feature_matrix_at_once = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_df, ) divided_matrices = [] separate_cutoff = [cutoff_df.iloc[0:1], cutoff_df.iloc[1:]] # Make sure indexes are different # Note that this step is unnecessary and done to showcase the issue here separate_cutoff[0].index = [0] separate_cutoff[1].index = [1] for ct in separate_cutoff: fm = calculate_feature_matrix( [dfeat, agg_feat], es, approximate=Timedelta(10, "s"), cutoff_time=ct, ) divided_matrices.append(fm) feature_matrix_from_split = pd.concat(divided_matrices) assert feature_matrix_from_split.shape == feature_matrix_at_once.shape for i1, i2 in zip(feature_matrix_at_once.index, feature_matrix_from_split.index): assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2) for c in feature_matrix_from_split: for i1, i2 in zip(feature_matrix_at_once[c], 
feature_matrix_from_split[c]): assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2) def test_approximate_returns_correct_empty_default_values(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) dfeat = DirectFeature(agg_feat, "sessions") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-08 11:00:00"), pd.Timestamp("2011-04-09 11:00:00"), ], "instance_id": [0, 0], }, ) fm = calculate_feature_matrix( [dfeat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_df, ) assert fm[dfeat.get_name()].tolist() == [0, 10] def test_approximate_child_aggs_handled_correctly(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") agg_feat_2 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum, ) cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-08 10:30:00"), pd.Timestamp("2011-04-09 10:30:06"), ], "instance_id": [0, 0], }, ) fm = calculate_feature_matrix( [dfeat], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_df, ) fm_2 = calculate_feature_matrix( [dfeat, agg_feat_2], es, approximate=Timedelta(10, "s"), cutoff_time=cutoff_df, ) assert fm[dfeat.get_name()].tolist() == [2, 3] assert fm_2[agg_feat_2.get_name()].tolist() == [0, 5] def test_cutoff_time_naming(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-08 10:30:00"), pd.Timestamp("2011-04-09 10:30:06"), ], "instance_id": [0, 0], }, ) cutoff_df_index_name = cutoff_df.rename(columns={"instance_id": "id"}) cutoff_df_wrong_index_name = cutoff_df.rename(columns={"instance_id": "wrong_id"}) cutoff_df_wrong_time_name = cutoff_df.rename(columns={"time": "cutoff_time"}) fm1 = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df) fm2 = calculate_feature_matrix([dfeat], es, 
cutoff_time=cutoff_df_index_name) assert all((fm1 == fm2.values).values) error_text = ( "Cutoff time DataFrame must contain a column with either the same name" ' as the target dataframe index or a column named "instance_id"' ) with pytest.raises(AttributeError, match=error_text): calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_index_name) time_error_text = ( "Cutoff time DataFrame must contain a column with either the same name" ' as the target dataframe time_index or a column named "time"' ) with pytest.raises(AttributeError, match=time_error_text): calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_time_name) def test_cutoff_time_extra_columns(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], "label": [True, True, False], }, columns=["time", "instance_id", "label"], ) fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df) # check column was added to end of matrix assert "label" == fm.columns[-1] assert (fm["label"].values == cutoff_df["label"].values).all() def test_cutoff_time_extra_columns_approximate(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], "label": [True, True, False], }, columns=["time", "instance_id", "label"], ) fm = calculate_feature_matrix( [dfeat], es, cutoff_time=cutoff_df, approximate="2 days", ) # check column was passed through to the matrix assert "label" in fm.columns assert (fm["label"].values == cutoff_df["label"].values).all() def
test_cutoff_time_extra_columns_same_name(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], "régions.COUNT(customers)": [False, False, True], }, columns=["time", "instance_id", "régions.COUNT(customers)"], ) fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df) assert ( fm["régions.COUNT(customers)"].values == cutoff_df["régions.COUNT(customers)"].values ).all() def test_cutoff_time_extra_columns_same_name_approximate(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], "régions.COUNT(customers)": [False, False, True], }, columns=["time", "instance_id", "régions.COUNT(customers)"], ) fm = calculate_feature_matrix( [dfeat], es, cutoff_time=cutoff_df, approximate="2 days", ) assert ( fm["régions.COUNT(customers)"].values == cutoff_df["régions.COUNT(customers)"].values ).all() def test_instances_after_cutoff_time_removed(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) cutoff_time = datetime(2011, 4, 8) fm = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, cutoff_time_in_index=True, ) actual_ids = ( [id for (id, _) in fm.index] if isinstance(fm.index, pd.MultiIndex) else fm.index ) # Customer with id 1 should be removed assert set(actual_ids) == set([2, 0]) def test_instances_with_id_kept_after_cutoff(es): property_feature = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) cutoff_time = datetime(2011, 
4, 8) fm = calculate_feature_matrix( [property_feature], es, instance_ids=[0, 1, 2], cutoff_time=cutoff_time, cutoff_time_in_index=True, ) # Customer #1 is after cutoff, but since it is included in instance_ids it # should be kept. actual_ids = ( [id for (id, _) in fm.index] if isinstance(fm.index, pd.MultiIndex) else fm.index ) assert set(actual_ids) == set([0, 1, 2]) def test_cfm_returns_original_time_indexes(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], }, ) fm = calculate_feature_matrix( [dfeat], es, cutoff_time=cutoff_df, cutoff_time_in_index=True, ) instance_level_vals = fm.index.get_level_values(0).values time_level_vals = fm.index.get_level_values(1).values assert (instance_level_vals == cutoff_df["instance_id"].values).all() assert (time_level_vals == cutoff_df["time"].values).all() def test_cfm_returns_original_time_indexes_approximate(es): agg_feat = Feature( es["customers"].ww["id"], parent_dataframe_name="régions", primitive=Count, ) dfeat = DirectFeature(agg_feat, "customers") agg_feat_2 = Feature( es["sessions"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) cutoff_df = pd.DataFrame( { "time": [ pd.Timestamp("2011-04-09 10:30:06"), pd.Timestamp("2011-04-09 10:30:03"), pd.Timestamp("2011-04-08 10:30:00"), ], "instance_id": [0, 1, 0], }, ) # approximate, in different windows, no unapproximated aggs fm = calculate_feature_matrix( [dfeat], es, cutoff_time=cutoff_df, cutoff_time_in_index=True, approximate="1 m", ) instance_level_vals = fm.index.get_level_values(0).values time_level_vals = fm.index.get_level_values(1).values assert (instance_level_vals == cutoff_df["instance_id"].values).all() assert (time_level_vals == cutoff_df["time"].values).all() # 
approximate, in different windows, unapproximated aggs fm = calculate_feature_matrix( [dfeat, agg_feat_2], es, cutoff_time=cutoff_df, cutoff_time_in_index=True, approximate="1 m", ) instance_level_vals = fm.index.get_level_values(0).values time_level_vals = fm.index.get_level_values(1).values assert (instance_level_vals == cutoff_df["instance_id"].values).all() assert (time_level_vals == cutoff_df["time"].values).all() # approximate, in same window, no unapproximated aggs fm2 = calculate_feature_matrix( [dfeat], es, cutoff_time=cutoff_df, cutoff_time_in_index=True, approximate="2 d", ) instance_level_vals = fm2.index.get_level_values(0).values time_level_vals = fm2.index.get_level_values(1).values assert (instance_level_vals == cutoff_df["instance_id"].values).all() assert (time_level_vals == cutoff_df["time"].values).all() # approximate, in same window, unapproximated aggs fm3 = calculate_feature_matrix( [dfeat, agg_feat_2], es, cutoff_time=cutoff_df, cutoff_time_in_index=True, approximate="2 d", ) instance_level_vals = fm3.index.get_level_values(0).values time_level_vals = fm3.index.get_level_values(1).values assert (instance_level_vals == cutoff_df["instance_id"].values).all() assert (time_level_vals == cutoff_df["time"].values).all() def test_dask_kwargs(es, dask_cluster): times = ( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)] ) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = IdentityFeature(es["log"].ww["value"]) > 10 dkwargs = {"cluster": dask_cluster.scheduler.address} feature_matrix = calculate_feature_matrix( [property_feature], entityset=es, cutoff_time=cutoff_time, 
verbose=True, chunk_size=0.13, dask_kwargs=dkwargs, approximate="1 hour", ) assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_dask_persisted_es(es, capsys, dask_cluster): times = ( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)] ) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = IdentityFeature(es["log"].ww["value"]) > 10 dkwargs = {"cluster": dask_cluster.scheduler.address} feature_matrix = calculate_feature_matrix( [property_feature], entityset=es, cutoff_time=cutoff_time, verbose=True, chunk_size=0.13, dask_kwargs=dkwargs, approximate="1 hour", ) assert (feature_matrix[property_feature.get_name()] == labels).values.all() feature_matrix = calculate_feature_matrix( [property_feature], entityset=es, cutoff_time=cutoff_time, verbose=True, chunk_size=0.13, dask_kwargs=dkwargs, approximate="1 hour", ) captured = capsys.readouterr() assert "Using EntitySet persisted on the cluster as dataset " in captured[0] assert (feature_matrix[property_feature.get_name()] == labels).values.all() class TestCreateClientAndCluster(object): def test_user_cluster_as_string(self, monkeypatch): monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster) # cluster in dask_kwargs case client, cluster = create_client_and_cluster( n_jobs=2, dask_kwargs={"cluster": "tcp://127.0.0.1:54321"}, entityset_size=1, ) assert cluster == "tcp://127.0.0.1:54321" def test_cluster_creation(self, monkeypatch): total_memory = psutil.virtual_memory().total monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster) try: cpus = len(psutil.Process().cpu_affinity()) 
except AttributeError: # pragma: no cover cpus = psutil.cpu_count() # jobs < tasks case client, cluster = create_client_and_cluster( n_jobs=2, dask_kwargs={}, entityset_size=1, ) num_workers = min(cpus, 2) memory_limit = int(total_memory / float(num_workers)) assert cluster == (min(cpus, 2), 1, None, memory_limit) # jobs > tasks case match = r".*workers requested, but only .* workers created" with pytest.warns(UserWarning, match=match) as record: client, cluster = create_client_and_cluster( n_jobs=1000, dask_kwargs={"diagnostics_port": 8789}, entityset_size=1, ) assert len(record) == 1 num_workers = cpus memory_limit = int(total_memory / float(num_workers)) assert cluster == (num_workers, 1, 8789, memory_limit) # dask_kwargs sets memory limit client, cluster = create_client_and_cluster( n_jobs=2, dask_kwargs={"diagnostics_port": 8789, "memory_limit": 1000}, entityset_size=1, ) num_workers = min(cpus, 2) assert cluster == (num_workers, 1, 8789, 1000) def test_not_enough_memory(self, monkeypatch): total_memory = psutil.virtual_memory().total monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster) # errors if not enough memory for each worker to store the entityset with pytest.raises(ValueError, match=""): create_client_and_cluster( n_jobs=1, dask_kwargs={}, entityset_size=total_memory * 2, ) # does not error even if worker memory is less than 2x entityset size create_client_and_cluster( n_jobs=1, dask_kwargs={}, entityset_size=total_memory * 0.75, ) def test_parallel_failure_raises_correct_error(es): times = ( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)] ) cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = 
IdentityFeature(es["log"].ww["value"]) > 10 error_text = "Need at least one worker" with pytest.raises(AssertionError, match=error_text): calculate_feature_matrix( [property_feature], entityset=es, cutoff_time=cutoff_time, verbose=True, chunk_size=0.13, n_jobs=0, approximate="1 hour", ) def test_warning_not_enough_chunks( es, capsys, three_worker_dask_cluster, ): # pragma: no cover property_feature = IdentityFeature(es["log"].ww["value"]) > 10 dkwargs = {"cluster": three_worker_dask_cluster.scheduler.address} calculate_feature_matrix( [property_feature], entityset=es, chunk_size=0.5, verbose=True, dask_kwargs=dkwargs, ) captured = capsys.readouterr() pattern = r"Fewer chunks \([0-9]+\), than workers \([0-9]+\) consider reducing the chunk size" assert re.search(pattern, captured.out) is not None def test_n_jobs(): try: cpus = len(psutil.Process().cpu_affinity()) except AttributeError: # pragma: no cover cpus = psutil.cpu_count() assert n_jobs_to_workers(1) == 1 assert n_jobs_to_workers(-1) == cpus assert n_jobs_to_workers(cpus) == cpus assert n_jobs_to_workers((cpus + 1) * -1) == 1 if cpus > 1: assert n_jobs_to_workers(-2) == cpus - 1 error_text = "Need at least one worker" with pytest.raises(AssertionError, match=error_text): n_jobs_to_workers(0) def test_parallel_cutoff_time_column_pass_through(es, dask_cluster): times = ( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)] ) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 cutoff_time = pd.DataFrame( {"time": times, "instance_id": range(17), "labels": labels}, ) property_feature = IdentityFeature(es["log"].ww["value"]) > 10 dkwargs = {"cluster": dask_cluster.scheduler.address} feature_matrix = calculate_feature_matrix( 
[property_feature], entityset=es, cutoff_time=cutoff_time, verbose=True, dask_kwargs=dkwargs, approximate="1 hour", ) assert ( feature_matrix[property_feature.get_name()] == feature_matrix["labels"] ).values.all() def test_integer_time_index(int_es): times = list(range(8, 18)) + list(range(19, 26)) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 cutoff_df = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10 feature_matrix = calculate_feature_matrix( [property_feature], int_es, cutoff_time=cutoff_df, cutoff_time_in_index=True, ) time_level_vals = feature_matrix.index.get_level_values(1).values sorted_df = cutoff_df.sort_values(["time", "instance_id"], kind="mergesort") assert (time_level_vals == sorted_df["time"].values).all() assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_integer_time_index_single_cutoff_value(int_es): labels = [False] * 3 + [True] * 2 + [False] * 4 property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10 cutoff_times = [16, pd.Series([16])[0], 16.0, pd.Series([16.0])[0]] for cutoff_time in cutoff_times: feature_matrix = calculate_feature_matrix( [property_feature], int_es, cutoff_time=cutoff_time, cutoff_time_in_index=True, ) time_level_vals = feature_matrix.index.get_level_values(1).values assert (time_level_vals == [16] * 9).all() assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_integer_time_index_datetime_cutoffs(int_es): times = [datetime.now()] * 17 cutoff_df = pd.DataFrame({"time": times, "instance_id": range(17)}) property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10 error_text = ( "cutoff_time times must be numeric: try casting via pd\\.to_numeric\\(\\)" ) with pytest.raises(TypeError, match=error_text): calculate_feature_matrix( [property_feature], int_es, cutoff_time=cutoff_df, cutoff_time_in_index=True, ) def 
test_integer_time_index_passes_extra_columns(int_es): times = list(range(8, 18)) + list(range(19, 23)) + [25, 24, 23] labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True] instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14] cutoff_df = pd.DataFrame( {"time": times, "instance_id": instances, "labels": labels}, ) cutoff_df = cutoff_df[["time", "instance_id", "labels"]] property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10 fm = calculate_feature_matrix( [property_feature], int_es, cutoff_time=cutoff_df, cutoff_time_in_index=True, ) assert (fm[property_feature.get_name()] == fm["labels"]).all() def test_integer_time_index_mixed_cutoff(int_es): times_dt = list(range(8, 17)) + [datetime(2011, 1, 1), 19, 20, 21, 22, 25, 24, 23] labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True] instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14] cutoff_df = pd.DataFrame( {"time": times_dt, "instance_id": instances, "labels": labels}, ) cutoff_df = cutoff_df[["time", "instance_id", "labels"]] property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10 error_text = "cutoff_time times must be.*try casting via.*" with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df) times_str = list(range(8, 17)) + ["foobar", 19, 20, 21, 22, 25, 24, 23] cutoff_df["time"] = times_str with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df) times_date_str = list(range(8, 17)) + ["2018-04-02", 19, 20, 21, 22, 25, 24, 23] cutoff_df["time"] = times_date_str with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df) times_int_str = list(range(8, 17)) + ["17", 19, 20, 21, 22, 25, 24, 23] cutoff_df["time"] = times_int_str # a numeric string mixed into the integer time column is not cast automatically, so a TypeError is expected
with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df) def test_datetime_index_mixed_cutoff(es): times = list( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [17] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)], ) labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True] instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14] cutoff_df = pd.DataFrame( {"time": times, "instance_id": instances, "labels": labels}, ) cutoff_df = cutoff_df[["time", "instance_id", "labels"]] property_feature = IdentityFeature(es["log"].ww["value"]) > 10 error_text = "cutoff_time times must be.*try casting via.*" with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df) times[9] = "foobar" cutoff_df["time"] = times with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df) times[9] = "17" cutoff_df["time"] = times with pytest.raises(TypeError, match=error_text): calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df) def test_no_data_for_cutoff_time(mock_customer): es = mock_customer cutoff_times = pd.DataFrame( {"customer_id": [4], "time": pd.Timestamp("2011-04-08 20:08:13")}, ) trans_per_session = Feature( es["transactions"].ww["transaction_id"], parent_dataframe_name="sessions", primitive=Count, ) trans_per_customer = Feature( es["transactions"].ww["transaction_id"], parent_dataframe_name="customers", primitive=Count, ) max_count = Feature( trans_per_session, parent_dataframe_name="customers", primitive=Max, ) features = [trans_per_customer, max_count] fm = calculate_feature_matrix(features, entityset=es,
        cutoff_time=cutoff_times)

    # due to default values for each primitive
    # count will be 0, but max will be nan
    answer = pd.DataFrame(
        {
            trans_per_customer.get_name(): pd.Series([0], dtype="Int64"),
            max_count.get_name(): pd.Series([np.nan], dtype="float"),
        },
    )

    for column in fm.columns:
        pd.testing.assert_series_equal(
            fm[column],
            answer[column],
            check_index=False,
            check_names=False,
        )


def test_instances_not_in_data(es):
    last_instance = max(es["log"].index.values)
    instances = list(range(last_instance + 1, last_instance + 11))
    identity_feature = IdentityFeature(es["log"].ww["value"])
    property_feature = identity_feature > 10
    agg_feat = AggregationFeature(
        Feature(es["log"].ww["value"]),
        parent_dataframe_name="sessions",
        primitive=Max,
    )
    direct_feature = DirectFeature(agg_feat, "log")
    features = [identity_feature, property_feature, direct_feature]
    fm = calculate_feature_matrix(features, entityset=es, instance_ids=instances)
    assert all(fm.index.values == instances)
    for column in fm.columns:
        assert fm[column].isnull().all()

    fm = calculate_feature_matrix(
        features,
        entityset=es,
        instance_ids=instances,
        approximate="730 days",
    )
    assert all(fm.index.values == instances)
    for column in fm.columns:
        assert fm[column].isnull().all()


def test_some_instances_not_in_data(es):
    a_time = datetime(2011, 4, 10, 10, 41, 9)  # only valid data
    b_time = datetime(2011, 4, 10, 11, 10, 5)  # some missing data
    c_time = datetime(2011, 4, 10, 12, 0, 0)  # all missing data

    times = [a_time, b_time, a_time, a_time, b_time, b_time] + [c_time] * 4
    cutoff_time = pd.DataFrame({"instance_id": list(range(12, 22)), "time": times})
    identity_feature = IdentityFeature(es["log"].ww["value"])
    property_feature = identity_feature > 10
    agg_feat = AggregationFeature(
        Feature(es["log"].ww["value"]),
        parent_dataframe_name="sessions",
        primitive=Max,
    )
    direct_feature = DirectFeature(agg_feat, "log")
    features = [identity_feature, property_feature, direct_feature]
    fm = calculate_feature_matrix(features, entityset=es,
cutoff_time=cutoff_time) ifeat_answer = pd.Series([0, 7, 14, np.nan] + [np.nan] * 6) prop_answer = pd.Series([0, 0, 1, pd.NA, 0] + [pd.NA] * 5, dtype="boolean") dfeat_answer = pd.Series([14, 14, 14, np.nan] + [np.nan] * 6) assert all(fm.index.values == cutoff_time["instance_id"].values) for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]): pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False) fm = calculate_feature_matrix( features, entityset=es, cutoff_time=cutoff_time, approximate="5 seconds", ) dfeat_answer[0] = 7 # approximate calculated before 14 appears dfeat_answer[2] = 7 # approximate calculated before 14 appears prop_answer[3] = False # no_unapproximated_aggs code ignores cutoff time assert all(fm.index.values == cutoff_time["instance_id"].values) for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]): pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False) def test_missing_instances_with_categorical_index(es): instance_ids = ["coke zero", "car", 3, "taco clock"] features = dfs( entityset=es, target_dataframe_name="products", features_only=True, ) fm = calculate_feature_matrix( entityset=es, features=features, instance_ids=instance_ids, ) assert fm.index.values.to_list() == instance_ids assert isinstance(fm.index, pd.CategoricalIndex) def test_handle_chunk_size(): total_size = 100 # user provides no chunk size assert _handle_chunk_size(None, total_size) is None # user provides fractional size assert _handle_chunk_size(0.1, total_size) == total_size * 0.1 assert _handle_chunk_size(0.001, total_size) == 1 # rounds up assert _handle_chunk_size(0.345, total_size) == 35 # rounds up # user provides absolute size assert _handle_chunk_size(1, total_size) == 1 assert _handle_chunk_size(100, total_size) == 100 assert isinstance(_handle_chunk_size(100.0, total_size), int) # test invalid cases with pytest.raises(AssertionError, match="Chunk size must be greater than 0"): 
_handle_chunk_size(0, total_size) with pytest.raises(AssertionError, match="Chunk size must be greater than 0"): _handle_chunk_size(-1, total_size) def test_chunk_dataframe_groups(): df = pd.DataFrame({"group": [1, 1, 1, 1, 2, 2, 3]}) grouped = df.groupby("group") chunked_grouped = _chunk_dataframe_groups(grouped, 2) # test group larger than chunk size gets split up first = next(chunked_grouped) assert first[0] == 1 and first[1].shape[0] == 2 second = next(chunked_grouped) assert second[0] == 1 and second[1].shape[0] == 2 # test that equal to and less than chunk size stays together third = next(chunked_grouped) assert third[0] == 2 and third[1].shape[0] == 2 fourth = next(chunked_grouped) assert fourth[0] == 3 and fourth[1].shape[0] == 1 def test_calls_progress_callback(mock_customer): class MockProgressCallback: def __init__(self): self.progress_history = [] self.total_update = 0 self.total_progress_percent = 0 def __call__(self, update, progress_percent, time_elapsed): self.total_update += update self.total_progress_percent = progress_percent self.progress_history.append(progress_percent) mock_progress_callback = MockProgressCallback() es = mock_customer # make sure to calculate features that have different paths to same base feature trans_per_session = Feature( es["transactions"].ww["transaction_id"], parent_dataframe_name="sessions", primitive=Count, ) trans_per_customer = Feature( es["transactions"].ww["transaction_id"], parent_dataframe_name="customers", primitive=Count, ) features = [trans_per_session, Feature(trans_per_customer, "sessions")] calculate_feature_matrix( features, entityset=es, progress_callback=mock_progress_callback, ) # second to last entry is the last update from feature calculation assert np.isclose( mock_progress_callback.progress_history[-2], FEATURE_CALCULATION_PERCENTAGE * 100, ) assert np.isclose(mock_progress_callback.total_update, 100.0) assert np.isclose(mock_progress_callback.total_progress_percent, 100.0) # test with cutoff time 
dataframe mock_progress_callback = MockProgressCallback() cutoff_time = pd.DataFrame( { "instance_id": [1, 2, 3], "time": [ pd.to_datetime("2014-01-01 01:00:00"), pd.to_datetime("2014-01-01 02:00:00"), pd.to_datetime("2014-01-01 03:00:00"), ], }, ) calculate_feature_matrix( features, entityset=es, cutoff_time=cutoff_time, progress_callback=mock_progress_callback, ) assert np.isclose( mock_progress_callback.progress_history[-2], FEATURE_CALCULATION_PERCENTAGE * 100, ) assert np.isclose(mock_progress_callback.total_update, 100.0) assert np.isclose(mock_progress_callback.total_progress_percent, 100.0) def test_calls_progress_callback_cluster(mock_customer, dask_cluster): class MockProgressCallback: def __init__(self): self.progress_history = [] self.total_update = 0 self.total_progress_percent = 0 def __call__(self, update, progress_percent, time_elapsed): self.total_update += update self.total_progress_percent = progress_percent self.progress_history.append(progress_percent) mock_progress_callback = MockProgressCallback() trans_per_session = Feature( mock_customer["transactions"].ww["transaction_id"], parent_dataframe_name="sessions", primitive=Count, ) trans_per_customer = Feature( mock_customer["transactions"].ww["transaction_id"], parent_dataframe_name="customers", primitive=Count, ) features = [trans_per_session, Feature(trans_per_customer, "sessions")] dkwargs = {"cluster": dask_cluster.scheduler.address} calculate_feature_matrix( features, entityset=mock_customer, progress_callback=mock_progress_callback, dask_kwargs=dkwargs, ) assert np.isclose(mock_progress_callback.total_update, 100.0) assert np.isclose(mock_progress_callback.total_progress_percent, 100.0) def test_closes_tqdm(es): class ErrorPrim(TransformPrimitive): """A primitive whose function raises an error""" name = "error_prim" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = "Numeric" def get_function(self): def error(s): raise RuntimeError("This primitive has errored") return 
error value = Feature(es["log"].ww["value"]) property_feature = value > 10 error_feature = Feature(value, primitive=ErrorPrim) calculate_feature_matrix([property_feature], es, verbose=True) assert len(tqdm._instances) == 0 match = "This primitive has errored" with pytest.raises(RuntimeError, match=match): calculate_feature_matrix([value, error_feature], es, verbose=True) assert len(tqdm._instances) == 0 def test_approximate_with_single_cutoff_warns(es): features = dfs( entityset=es, target_dataframe_name="customers", features_only=True, ignore_dataframes=["cohorts"], agg_primitives=["sum"], ) match = ( "Using approximate with a single cutoff_time value or no cutoff_time " "provides no computational efficiency benefit" ) # test warning with single cutoff time with pytest.warns(UserWarning, match=match): calculate_feature_matrix( features, es, cutoff_time=pd.to_datetime("2020-01-01"), approximate="1 day", ) # test warning with no cutoff time with pytest.warns(UserWarning, match=match): calculate_feature_matrix(features, es, approximate="1 day") # check proper handling of approximate feature_matrix = calculate_feature_matrix( features, es, cutoff_time=pd.to_datetime("2011-04-09 10:31:30"), approximate="1 minute", ) expected_values = [50, 50, 50] assert (feature_matrix["régions.SUM(log.value)"] == expected_values).values.all() def test_calc_feature_matrix_with_cutoff_df_and_instance_ids(es): times = list( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)] + [datetime(2011, 4, 9, 10, 40, 0)] + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)] + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)] + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)], ) instances = range(17) cutoff_time = pd.DataFrame({"time": times, es["log"].ww.index: instances}) labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2 property_feature = Feature(es["log"].ww["value"]) > 10 match = "Passing 
'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring" with pytest.warns(UserWarning, match=match): feature_matrix = calculate_feature_matrix( [property_feature], es, cutoff_time=cutoff_time, instance_ids=[1, 3, 5], verbose=True, ) assert (feature_matrix[property_feature.get_name()] == labels).values.all() def test_calculate_feature_matrix_returns_default_values(default_value_es): sum_features = Feature( default_value_es["transactions"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) sessions_sum = Feature(sum_features, "transactions") feature_matrix = calculate_feature_matrix( features=[sessions_sum], entityset=default_value_es, ) expected_values = [2.0, 2.0, 1.0, 0.0] assert (feature_matrix[sessions_sum.get_name()] == expected_values).values.all() def test_dataframes_relationships(dataframes, relationships): fm_1, features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", ) fm_2 = calculate_feature_matrix( features=features, dataframes=dataframes, relationships=relationships, ) assert fm_1.equals(fm_2) def test_no_dataframes(dataframes, relationships): features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", features_only=True, ) msg = "No dataframes or valid EntitySet provided" with pytest.raises(TypeError, match=msg): calculate_feature_matrix(features=features, dataframes=None, relationships=None) def test_no_relationships(dataframes): fm_1, features = dfs( dataframes=dataframes, relationships=None, target_dataframe_name="transactions", ) fm_2 = calculate_feature_matrix( features=features, dataframes=dataframes, relationships=None, ) assert fm_1.equals(fm_2) def test_cfm_with_invalid_time_index(es): features = dfs(entityset=es, target_dataframe_name="customers", features_only=True) es["customers"].ww.set_types(logical_types={"signup_date": "integer"}) match = "customers time index is numeric type " match += "which differs 
from other entityset time indexes" with pytest.raises(TypeError, match=match): calculate_feature_matrix(features=features, entityset=es) def test_cfm_introduces_nan_values_in_direct_feats(es): es["customers"].ww.set_types( logical_types={"age": "Age", "engagement_level": "Integer"}, ) age_feat = Feature(es["customers"].ww["age"]) engagement_feat = Feature(es["customers"].ww["engagement_level"]) loves_ice_cream_feat = Feature(es["customers"].ww["loves_ice_cream"]) features = [age_feat, engagement_feat, loves_ice_cream_feat] fm = calculate_feature_matrix( features=features, entityset=es, cutoff_time=pd.Timestamp("2010-04-08 04:00"), instance_ids=[1], ) assert isinstance(es["customers"].ww.logical_types["age"], Age) assert isinstance(es["customers"].ww.logical_types["engagement_level"], Integer) assert isinstance(es["customers"].ww.logical_types["loves_ice_cream"], Boolean) assert isinstance(fm.ww.logical_types["age"], AgeNullable) assert isinstance(fm.ww.logical_types["engagement_level"], IntegerNullable) assert isinstance(fm.ww.logical_types["loves_ice_cream"], BooleanNullable) def test_feature_origins_present_on_all_fm_cols(es): class MultiCumSum(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 def get_function(self): def multi_cum_sum(x): return x.cumsum(), x.cummax(), x.cummin() return multi_cum_sum feature_matrix, _ = dfs( entityset=es, target_dataframe_name="log", trans_primitives=[MultiCumSum], ) for col in feature_matrix.columns: origin = feature_matrix.ww[col].ww.origin assert origin in ["base", "engineered"] def test_renamed_features_have_expected_column_names_in_feature_matrix(es): class MultiCumulative(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 def get_function(self): def multi_cum_sum(x): 
return x.cumsum(), x.cummax(), x.cummin() return multi_cum_sum multi_output_trans_feat = Feature( es["log"].ww["value"], primitive=MultiCumulative, ) groupby_trans_feat = GroupByTransformFeature( es["log"].ww["value"], primitive=MultiCumulative, groupby=es["log"].ww["product_id"], ) multi_output_agg_feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) slice = FeatureOutputSlice(multi_output_trans_feat, 1) stacked_feat = Feature(slice, primitive=Negate) multi_output_trans_names = ["cumulative_sum", "cumulative_max", "cumulative_min"] multi_output_trans_feat.set_feature_names(multi_output_trans_names) groupby_trans_feat_names = ["grouped_sum", "grouped_max", "grouped_min"] groupby_trans_feat.set_feature_names(groupby_trans_feat_names) agg_names = ["first_most_common", "second_most_common"] multi_output_agg_feat.set_feature_names(agg_names) features = [ multi_output_trans_feat, multi_output_agg_feat, stacked_feat, groupby_trans_feat, ] feature_matrix = calculate_feature_matrix(entityset=es, features=features) expected_names = multi_output_trans_names + agg_names + groupby_trans_feat_names for renamed_col in expected_names: assert renamed_col in feature_matrix.columns expected_stacked_name = "-(cumulative_max)" assert expected_stacked_name in feature_matrix.columns ================================================ FILE: featuretools/tests/computational_backend/test_feature_set.py ================================================ from featuretools import ( AggregationFeature, DirectFeature, IdentityFeature, TransformFeature, primitives, ) from featuretools.computational_backends.feature_set import FeatureSet from featuretools.entityset.relationship import RelationshipPath from featuretools.tests.testing_utils import backward_path from featuretools.utils import Trie def test_feature_trie_without_needs_full_dataframe(diamond_es): es = diamond_es country_name = IdentityFeature(es["countries"].ww["name"]) direct_name = 
DirectFeature(country_name, "regions") amount = IdentityFeature(es["transactions"].ww["amount"]) path_through_customers = backward_path(es, ["regions", "customers", "transactions"]) through_customers = AggregationFeature( amount, "regions", primitive=primitives.Mean, relationship_path=path_through_customers, ) path_through_stores = backward_path(es, ["regions", "stores", "transactions"]) through_stores = AggregationFeature( amount, "regions", primitive=primitives.Mean, relationship_path=path_through_stores, ) customers_to_transactions = backward_path(es, ["customers", "transactions"]) customers_mean = AggregationFeature( amount, "customers", primitive=primitives.Mean, relationship_path=customers_to_transactions, ) negation = TransformFeature(customers_mean, primitives.Negate) regions_to_customers = backward_path(es, ["regions", "customers"]) mean_of_mean = AggregationFeature( negation, "regions", primitive=primitives.Mean, relationship_path=regions_to_customers, ) features = [direct_name, through_customers, through_stores, mean_of_mean] feature_set = FeatureSet(features) trie = feature_set.feature_trie assert trie.value == (False, set(), {f.unique_name() for f in features}) assert trie.get_node(direct_name.relationship_path).value == ( False, set(), {country_name.unique_name()}, ) assert trie.get_node(regions_to_customers).value == ( False, set(), {negation.unique_name(), customers_mean.unique_name()}, ) regions_to_stores = backward_path(es, ["regions", "stores"]) assert trie.get_node(regions_to_stores).value == (False, set(), set()) assert trie.get_node(path_through_customers).value == ( False, set(), {amount.unique_name()}, ) assert trie.get_node(path_through_stores).value == ( False, set(), {amount.unique_name()}, ) def test_feature_trie_with_needs_full_dataframe(diamond_es): es = diamond_es amount = IdentityFeature(es["transactions"].ww["amount"]) path_through_customers = backward_path( es, ["regions", "customers", "transactions"], ) agg = AggregationFeature( 
amount, "regions", primitive=primitives.Mean, relationship_path=path_through_customers, ) trans_of_agg = TransformFeature(agg, primitives.CumSum) path_through_stores = backward_path(es, ["regions", "stores", "transactions"]) trans = TransformFeature(amount, primitives.CumSum) agg_of_trans = AggregationFeature( trans, "regions", primitive=primitives.Mean, relationship_path=path_through_stores, ) features = [agg, trans_of_agg, agg_of_trans] feature_set = FeatureSet(features) trie = feature_set.feature_trie assert trie.value == ( True, {agg.unique_name(), trans_of_agg.unique_name()}, {agg_of_trans.unique_name()}, ) assert trie.get_node(path_through_customers).value == ( True, {amount.unique_name()}, set(), ) assert trie.get_node(path_through_customers[:1]).value == (True, set(), set()) assert trie.get_node(path_through_stores).value == ( True, {amount.unique_name(), trans.unique_name()}, set(), ) assert trie.get_node(path_through_stores[:1]).value == (False, set(), set()) def test_feature_trie_with_needs_full_dataframe_direct(es): value = IdentityFeature(es["log"].ww["value"]) agg = AggregationFeature(value, "sessions", primitive=primitives.Mean) agg_of_agg = AggregationFeature(agg, "customers", primitive=primitives.Sum) direct = DirectFeature(agg_of_agg, "sessions") trans = TransformFeature(direct, primitives.CumSum) features = [trans, agg] feature_set = FeatureSet(features) trie = feature_set.feature_trie assert trie.value == ( True, {direct.unique_name(), trans.unique_name()}, {agg.unique_name()}, ) assert trie.get_node(agg.relationship_path).value == ( False, set(), {value.unique_name()}, ) parent_node = trie.get_node(direct.relationship_path) assert parent_node.value == (True, {agg_of_agg.unique_name()}, set()) child_through_parent_node = parent_node.get_node(agg_of_agg.relationship_path) assert child_through_parent_node.value == (True, {agg.unique_name()}, set()) assert child_through_parent_node.get_node(agg.relationship_path).value == ( True, 
{value.unique_name()}, set(), ) def test_feature_trie_ignores_approximate_features(es): value = IdentityFeature(es["log"].ww["value"]) agg = AggregationFeature(value, "sessions", primitive=primitives.Mean) agg_of_agg = AggregationFeature(agg, "customers", primitive=primitives.Sum) direct = DirectFeature(agg_of_agg, "sessions") features = [direct, agg] approximate_feature_trie = Trie(default=list, path_constructor=RelationshipPath) approximate_feature_trie.get_node(direct.relationship_path).value = [agg_of_agg] feature_set = FeatureSet( features, approximate_feature_trie=approximate_feature_trie, ) trie = feature_set.feature_trie # Since agg_of_agg is ignored it and its dependencies should not be in the # trie. sub_trie = trie.get_node(direct.relationship_path) for _path, (_, _, features) in sub_trie: assert not features assert trie.value == (False, set(), {direct.unique_name(), agg.unique_name()}) assert trie.get_node(agg.relationship_path).value == ( False, set(), {value.unique_name()}, ) ================================================ FILE: featuretools/tests/computational_backend/test_feature_set_calculator.py ================================================ from datetime import datetime import numpy as np import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Categorical, Datetime, Double, Integer from featuretools import ( AggregationFeature, EntitySet, Feature, Timedelta, calculate_feature_matrix, ) from featuretools.computational_backends.feature_set import FeatureSet from featuretools.computational_backends.feature_set_calculator import ( FeatureSetCalculator, ) from featuretools.entityset.relationship import RelationshipPath from featuretools.feature_base import DirectFeature, IdentityFeature from featuretools.primitives import ( And, Count, CumSum, EqualScalar, GreaterThanEqualToScalar, GreaterThanScalar, LessThanEqualToScalar, LessThanScalar, Mean, Min, Mode, Negate, NMostCommon, 
NotEqualScalar, NumTrue, Sum, TimeSinceLast, Trend, ) from featuretools.primitives.base import AggregationPrimitive from featuretools.primitives.standard.aggregation.num_unique import NumUnique from featuretools.tests.testing_utils import backward_path from featuretools.utils import Trie def test_make_identity(es): f = IdentityFeature(es["log"].ww["datetime"]) feature_set = FeatureSet([f]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[f.get_name()][0] assert v == datetime(2011, 4, 9, 10, 30, 0) def test_make_dfeat(es): f = DirectFeature( Feature(es["customers"].ww["age"]), child_dataframe_name="sessions", ) feature_set = FeatureSet([f]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[f.get_name()][0] assert v == 33 def test_make_agg_feat_of_identity_column(es): agg_feat = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] assert v == 50 def test_full_dataframe_trans_of_agg(es): agg_feat = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum, ) trans_feat = Feature(agg_feat, primitive=CumSum) feature_set = FeatureSet([trans_feat]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([1])) v = df[trans_feat.get_name()].values[0] assert v == 82 def test_make_agg_feat_of_identity_index_column(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] assert v == 5 def test_make_agg_feat_where_count(es): agg_feat = 
Feature( es["log"].ww["id"], parent_dataframe_name="sessions", where=IdentityFeature(es["log"].ww["product_id"]) == "coke zero", primitive=Count, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] assert v == 3 def test_make_agg_feat_using_prev_time(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", use_previous=Timedelta(10, "s"), primitive=Count, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator( es, time_last=datetime(2011, 4, 9, 10, 30, 10), feature_set=feature_set, ) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] assert v == 2 calculator = FeatureSetCalculator( es, time_last=datetime(2011, 4, 9, 10, 30, 30), feature_set=feature_set, ) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] assert v == 1 def test_make_agg_feat_using_prev_n_events(es): agg_feat_1 = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", use_previous=Timedelta(1, "observations"), primitive=Min, ) agg_feat_2 = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", use_previous=Timedelta(3, "observations"), primitive=Min, ) assert ( agg_feat_1.get_name() != agg_feat_2.get_name() ), "Features should have different names based on use_previous" feature_set = FeatureSet([agg_feat_1, agg_feat_2]) calculator = FeatureSetCalculator( es, time_last=datetime(2011, 4, 9, 10, 30, 6), feature_set=feature_set, ) df = calculator.run(np.array([0])) # time_last is included by default v1 = df[agg_feat_1.get_name()][0] v2 = df[agg_feat_2.get_name()][0] assert v1 == 5 assert v2 == 0 calculator = FeatureSetCalculator( es, time_last=datetime(2011, 4, 9, 10, 30, 30), feature_set=feature_set, ) df = calculator.run(np.array([0])) v1 = df[agg_feat_1.get_name()][0] v2 = df[agg_feat_2.get_name()][0] assert v1 == 20 assert v2 == 10 def test_make_agg_feat_multiple_dtypes(es): 
compare_prod = IdentityFeature(es["log"].ww["product_id"]) == "coke zero" agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", where=compare_prod, primitive=Count, ) agg_feat2 = Feature( es["log"].ww["product_id"], parent_dataframe_name="sessions", where=compare_prod, primitive=Mode, ) feature_set = FeatureSet([agg_feat, agg_feat2]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()][0] v2 = df[agg_feat2.get_name()][0] assert v == 3 assert v2 == "coke zero" def test_make_agg_feat_where_different_identity_feat(es): feats = [] where_cmps = [ LessThanScalar, GreaterThanScalar, LessThanEqualToScalar, GreaterThanEqualToScalar, EqualScalar, NotEqualScalar, ] for where_cmp in where_cmps: feats.append( Feature( es["log"].ww["id"], parent_dataframe_name="sessions", where=Feature( es["log"].ww["value"], primitive=where_cmp(10.0), ), primitive=Count, ), ) df = calculate_feature_matrix( entityset=es, features=feats, instance_ids=[0, 1, 2, 3], ) for i, where_cmp in enumerate(where_cmps): name = feats[i].get_name() instances = df[name] v0, v1, v2, v3 = instances[0:4] if where_cmp == LessThanScalar: assert v0 == 2 assert v1 == 4 assert v2 == 1 assert v3 == 2 elif where_cmp == GreaterThanScalar: assert v0 == 2 assert v1 == 0 assert v2 == 0 assert v3 == 0 elif where_cmp == LessThanEqualToScalar: assert v0 == 3 assert v1 == 4 assert v2 == 1 assert v3 == 2 elif where_cmp == GreaterThanEqualToScalar: assert v0 == 3 assert v1 == 0 assert v2 == 0 assert v3 == 0 elif where_cmp == EqualScalar: assert v0 == 1 assert v1 == 0 assert v2 == 0 assert v3 == 0 elif where_cmp == NotEqualScalar: assert v0 == 4 assert v1 == 4 assert v2 == 1 assert v3 == 2 def test_make_agg_feat_of_grandchild_dataframe(es): agg_feat = Feature( es["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator(es, time_last=None, 
        feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[agg_feat.get_name()].values[0]
    assert v == 10


def test_make_agg_feat_where_count_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session is greater than 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=log_count_feat > 1,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1]))
    name = feat.get_name()
    instances = df[name]
    v0, v1 = instances[0:2]
    assert v0 == 2
    assert v1 == 2


def test_make_compare_feat(es):
    """
    Feature we're creating is:
    Whether the number of logs in each session is greater than
    the mean number of logs per session for that session's customer
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    mean_agg_feat = Feature(
        log_count_feat,
        parent_dataframe_name="customers",
        primitive=Mean,
    )

    mean_feat = DirectFeature(mean_agg_feat, child_dataframe_name="sessions")

    feat = log_count_feat > mean_feat

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))
    name = feat.get_name()
    instances = df[name]
    v0, v1, v2 = instances[0:3]
    assert v0
    assert v1
    assert not v2


def test_make_agg_feat_where_count_and_device_type_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session equals 1 and the device type equals 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    compare_count = log_count_feat == 1
    compare_device_type = IdentityFeature(es["sessions"].ww["device_type"]) == 1
    and_feat = Feature([compare_count, compare_device_type], primitive=And)
    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=and_feat,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    name = feat.get_name()
    instances = df[name]
    assert instances.values[0] == 1


def test_make_agg_feat_where_count_or_device_type_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session is greater than 1 or the device type equals 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    compare_count = log_count_feat > 1
    compare_device_type = IdentityFeature(es["sessions"].ww["device_type"]) == 1
    or_feat = compare_count.OR(compare_device_type)
    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=or_feat,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    name = feat.get_name()
    instances = df[name]
    assert instances.values[0] == 3


def test_make_agg_feat_of_agg_feat(es):
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    customer_sum_feat = Feature(
        log_count_feat,
        parent_dataframe_name="customers",
        primitive=Sum,
    )

    feature_set = FeatureSet([customer_sum_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[customer_sum_feat.get_name()].values[0]
    assert v == 10


@pytest.fixture
def df():
    return pd.DataFrame(
        {
            "id": ["a", "b", "c", "d", "e"],
            "e1": ["h", "h", "i", "i", "j"],
            "e2": ["x", "x", "y", "y", "x"],
            "e3": ["z", "z", "z", "z", "z"],
            "val": [1, 1, 1, 1, 1],
        },
    )


def test_make_3_stacked_agg_feats(df):
    """
    Tests stacking 3 agg features.

    The test specifically uses non-numeric indices to test how ancestor columns
    are handled as dataframes are merged together
    """
    es = EntitySet()
    ltypes = {"e1": Categorical, "e2": Categorical, "e3": Categorical, "val": Double}
    es.add_dataframe(
        dataframe=df,
        index="id",
        dataframe_name="e0",
        logical_types=ltypes,
    )

    es.normalize_dataframe(
        base_dataframe_name="e0",
        new_dataframe_name="e1",
        index="e1",
        additional_columns=["e2", "e3"],
    )

    es.normalize_dataframe(
        base_dataframe_name="e1",
        new_dataframe_name="e2",
        index="e2",
        additional_columns=["e3"],
    )

    es.normalize_dataframe(
        base_dataframe_name="e2",
        new_dataframe_name="e3",
        index="e3",
    )

    sum_1 = Feature(es["e0"].ww["val"], parent_dataframe_name="e1", primitive=Sum)
    sum_2 = Feature(sum_1, parent_dataframe_name="e2", primitive=Sum)
    sum_3 = Feature(sum_2, parent_dataframe_name="e3", primitive=Sum)

    feature_set = FeatureSet([sum_3])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array(["z"]))
    v = df[sum_3.get_name()][0]
    assert v == 5


def test_make_dfeat_of_agg_feat_on_self(es):
    """
    The graph looks like this:

        R       R = Regions, a parent of customers
        |
        C       C = Customers, the dataframe we're trying to predict on
        |
       etc.

    We're trying to calculate a DFeat from C to R on an agg_feat of R on C.
    """
    customer_count_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )

    num_customers_feat = DirectFeature(
        customer_count_feat,
        child_dataframe_name="customers",
    )

    feature_set = FeatureSet([num_customers_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[num_customers_feat.get_name()].values[0]
    assert v == 3


def test_make_dfeat_of_agg_feat_through_parent(es):
    """
    The graph looks like this:

        R       C = Customers, the dataframe we're trying to predict on
       / \\     R = Regions, a parent of customers
      S   C     S = Stores, a child of regions
          |
         etc.
    We're trying to calculate a DFeat from C to R on an agg_feat of R on S.
    """
    store_id_feat = IdentityFeature(es["stores"].ww["id"])

    store_count_feat = Feature(
        store_id_feat,
        parent_dataframe_name="régions",
        primitive=Count,
    )

    num_stores_feat = DirectFeature(store_count_feat, child_dataframe_name="customers")

    feature_set = FeatureSet([num_stores_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[num_stores_feat.get_name()].values[0]
    assert v == 3


def test_make_deep_agg_feat_of_dfeat_of_agg_feat(es):
    """
    The graph looks like this (higher implies parent):

          C       C = Customers, the dataframe we're trying to predict on
          |       S = Sessions, a child of Customers
      P   S       L = Log, a child of both Sessions and Products
       \\ /       P = Products, a parent of Log which is not a descendant of customers
        L

    We're trying to calculate a DFeat from L to P on an agg_feat of P on L, and
    then aggregate it with another agg_feat of C on L.
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="products",
        primitive=Count,
    )

    product_purchases_feat = DirectFeature(log_count_feat, child_dataframe_name="log")

    purchase_popularity = Feature(
        product_purchases_feat,
        parent_dataframe_name="customers",
        primitive=Mean,
    )

    feature_set = FeatureSet([purchase_popularity])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[purchase_popularity.get_name()].values[0]
    assert v == 38.0 / 10.0


def test_deep_agg_feat_chain(es):
    """
    Agg feat of agg feat:
        region.Mean(customer.Count(Log))
    """
    customer_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )

    region_avg_feat = Feature(
        customer_count_feat,
        parent_dataframe_name="régions",
        primitive=Mean,
    )

    feature_set = FeatureSet([region_avg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array(["United States"]))
    v = df[region_avg_feat.get_name()][0]
    assert v == 17 / 3.0


def test_topn(es):
    topn = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feature_set = FeatureSet([topn])

    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))
    true_results = pd.DataFrame(
        [
            ["toothpaste", "coke zero"],
            ["coke zero", "Haribo sugar-free gummy bears"],
            ["taco clock", np.nan],
        ],
    )
    # a bare list is always truthy, so wrap the membership checks in all()
    assert all(name in df.columns for name in topn.get_feature_names())

    for i in range(df.shape[0]):
        true = true_results.loc[i]
        actual = df.loc[i]
        if i == 0:
            # coke zero and toothpaste have the same number of occurrences
            assert set(true.values) == set(actual.values)
        else:
            for i1, i2 in zip(true, actual):
                assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)


def test_trend(es):
    trend = Feature(
        [Feature(es["log"].ww["value"]), Feature(es["log"].ww["datetime"])],
        parent_dataframe_name="customers",
        primitive=Trend,
    )
    feature_set = FeatureSet([trend])

    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))

    true_results = [-0.812730, 4.870378, np.nan]

    np.testing.assert_almost_equal(
        df[trend.get_name()].tolist(),
        true_results,
        decimal=5,
    )


def test_direct_squared(es):
    feature = IdentityFeature(es["log"].ww["value"])
    squared = feature * feature
    feature_set = FeatureSet([feature, squared])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))
    for i, row in df.iterrows():
        assert (row[0] * row[0]) == row[1]


def test_agg_empty_child(es):
    customer_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    feature_set = FeatureSet([customer_count_feat])

    # time last before the customer had any events, so child frame is empty
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 8),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))
assert df["COUNT(log)"].iloc[0] == 0 def test_diamond_entityset(diamond_es): es = diamond_es amount = IdentityFeature(es["transactions"].ww["amount"]) path = backward_path(es, ["regions", "customers", "transactions"]) through_customers = AggregationFeature( amount, "regions", primitive=Sum, relationship_path=path, ) path = backward_path(es, ["regions", "stores", "transactions"]) through_stores = AggregationFeature( amount, "regions", primitive=Sum, relationship_path=path, ) feature_set = FeatureSet([through_customers, through_stores]) calculator = FeatureSetCalculator( es, time_last=datetime(2011, 4, 8), feature_set=feature_set, ) df = calculator.run(np.array([0, 1, 2])) assert (df["SUM(stores.transactions.amount)"] == [94, 261, 128]).all() assert (df["SUM(customers.transactions.amount)"] == [72, 411, 0]).all() def test_two_relationships_to_single_dataframe(games_es): es = games_es home_team, away_team = es.relationships path = RelationshipPath([(False, home_team)]) mean_at_home = AggregationFeature( Feature(es["games"].ww["home_team_score"]), "teams", relationship_path=path, primitive=Mean, ) path = RelationshipPath([(False, away_team)]) mean_at_away = AggregationFeature( Feature(es["games"].ww["away_team_score"]), "teams", relationship_path=path, primitive=Mean, ) home_team_mean = DirectFeature(mean_at_home, "games", relationship=home_team) away_team_mean = DirectFeature(mean_at_away, "games", relationship=away_team) feature_set = FeatureSet([home_team_mean, away_team_mean]) calculator = FeatureSetCalculator( es, time_last=datetime(2011, 8, 28), feature_set=feature_set, ) df = calculator.run(np.array(range(3))) assert (df[home_team_mean.get_name()] == [1.5, 1.5, 2.5]).all() assert (df[away_team_mean.get_name()] == [1, 0.5, 2]).all() @pytest.fixture def parent_child(): parent_df = pd.DataFrame({"id": [1]}) child_df = pd.DataFrame( { "id": [1, 2, 3], "parent_id": [1, 1, 1], "time_index": pd.date_range(start="1/1/2018", periods=3), "value": [10, 5, 2], "cat": ["a", 
"a", "b"], }, ).astype({"cat": "category"}) return (parent_df, child_df) def test_empty_child_dataframe(parent_child): parent_df, child_df = parent_child child_ltypes = { "parent_id": Integer, "time_index": Datetime, "value": Double, "cat": Categorical, } es = EntitySet(id="blah") es.add_dataframe(dataframe_name="parent", dataframe=parent_df, index="id") es.add_dataframe( dataframe_name="child", dataframe=child_df, index="id", time_index="time_index", logical_types=child_ltypes, ) es.add_relationship("parent", "id", "child", "parent_id") # create regular agg count = Feature( es["child"].ww["id"], parent_dataframe_name="parent", primitive=Count, ) # create agg feature that requires multiple arguments trend = Feature( [Feature(es["child"].ww["value"]), Feature(es["child"].ww["time_index"])], parent_dataframe_name="parent", primitive=Trend, ) # create multi-output agg feature n_most_common = Feature( es["child"].ww["cat"], parent_dataframe_name="parent", primitive=NMostCommon, ) # create aggs with where where = Feature(es["child"].ww["value"]) == 1 count_where = Feature( es["child"].ww["id"], parent_dataframe_name="parent", where=where, primitive=Count, ) trend_where = Feature( [Feature(es["child"].ww["value"]), Feature(es["child"].ww["time_index"])], parent_dataframe_name="parent", where=where, primitive=Trend, ) n_most_common_where = Feature( es["child"].ww["cat"], parent_dataframe_name="parent", where=where, primitive=NMostCommon, ) features = [ count, count_where, trend, trend_where, n_most_common, n_most_common_where, ] data = { count.get_name(): pd.Series([0], dtype="Int64"), count_where.get_name(): pd.Series([0], dtype="Int64"), trend.get_name(): pd.Series([np.nan], dtype="float"), trend_where.get_name(): pd.Series([np.nan], dtype="float"), } for name in n_most_common.get_feature_names(): data[name] = pd.Series([np.nan], dtype="category") for name in n_most_common_where.get_feature_names(): data[name] = pd.Series([np.nan], dtype="category") answer = 
pd.DataFrame(data) # cutoff time before all rows fm = calculate_feature_matrix( entityset=es, features=features, cutoff_time=pd.Timestamp("12/31/2017"), ) for column in data.keys(): pd.testing.assert_series_equal( fm[column], answer[column], check_names=False, check_index=False, ) # cutoff time after all rows, but where clause filters all rows data = { count_where.get_name(): pd.Series([0], dtype="Int64"), trend_where.get_name(): pd.Series([np.nan], dtype="float"), } for name in n_most_common_where.get_feature_names(): data[name] = pd.Series([np.nan], dtype="category") answer = pd.DataFrame(data) for column in data.keys(): pd.testing.assert_series_equal( fm[column], answer[column], check_names=False, check_index=False, ) def test_with_features_built_from_es_metadata(es): metadata = es.metadata agg_feat = Feature( metadata["log"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) feature_set = FeatureSet([agg_feat]) calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[agg_feat.get_name()].values[0] assert v == 10 def test_handles_primitive_function_name_uniqueness(es): class SumTimesN(AggregationPrimitive): name = "sum_times_n" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, n): self.n = n def get_function(self): def my_function(values): return values.sum() * self.n return my_function # works as expected f1 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=SumTimesN(n=1), ) fm = calculate_feature_matrix(features=[f1], entityset=es) value_sum = pd.Series([56, 26, 0]) assert all(fm[f1.get_name()].sort_index() == value_sum) # works as expected f2 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=SumTimesN(n=2), ) fm = calculate_feature_matrix(features=[f2], entityset=es) double_value_sum = pd.Series([112, 52, 0]) assert 
all(fm[f2.get_name()].sort_index() == double_value_sum) # same primitive, same column, different args fm = calculate_feature_matrix(features=[f1, f2], entityset=es) assert all(fm[f1.get_name()].sort_index() == value_sum) assert all(fm[f2.get_name()].sort_index() == double_value_sum) # different primitives, same function returned by get_function, # different base features f3 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum, ) f4 = Feature( es["log"].ww["purchased"], parent_dataframe_name="customers", primitive=NumTrue, ) fm = calculate_feature_matrix(features=[f3, f4], entityset=es) purchased_sum = pd.Series([10, 1, 1]) assert all(fm[f3.get_name()].sort_index() == value_sum) assert all(fm[f4.get_name()].sort_index() == purchased_sum) # different primitives, same function returned by get_function, # same base feature class Sum1(AggregationPrimitive): """Sums elements of a numeric or boolean feature.""" name = "sum1" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False stack_on_exclude = [Count] default_value = 0 def get_function(self): return np.sum class Sum2(AggregationPrimitive): """Sums elements of a numeric or boolean feature.""" name = "sum2" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False stack_on_exclude = [Count] default_value = 0 def get_function(self): return np.sum class Sum3(AggregationPrimitive): """Sums elements of a numeric or boolean feature.""" name = "sum3" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False stack_on_exclude = [Count] default_value = 0 def get_function(self): return np.sum f5 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum1, ) f6 = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum2, ) f7 = 
Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum3,
    )
    fm = calculate_feature_matrix(features=[f5, f6, f7], entityset=es)
    assert all(fm[f5.get_name()].sort_index() == value_sum)
    assert all(fm[f6.get_name()].sort_index() == value_sum)
    assert all(fm[f7.get_name()].sort_index() == value_sum)


def test_returns_order_of_instance_ids(es):
    feature_set = FeatureSet([Feature(es["customers"].ww["age"])])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)

    instance_ids = [0, 1, 2]
    assert list(es["customers"]["id"]) != instance_ids

    df = calculator.run(np.array(instance_ids))

    assert list(df.index) == instance_ids


def test_calls_progress_callback(es):
    # call with all feature types. make sure progress callback calls sum to 1
    identity = Feature(es["customers"].ww["age"])
    direct = Feature(es["cohorts"].ww["cohort_name"], "customers")
    agg = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    agg_apply = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=TimeSinceLast,
    )  # this feature is handled differently than simple features
    trans = Feature(agg, primitive=Negate)
    trans_full = Feature(agg, primitive=CumSum)
    groupby_trans = Feature(
        agg,
        primitive=CumSum,
        groupby=Feature(es["customers"].ww["cohort"]),
    )

    all_features = [
        identity,
        direct,
        agg,
        agg_apply,
        trans,
        trans_full,
        groupby_trans,
    ]

    feature_set = FeatureSet(all_features)
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)

    class MockProgressCallback:
        def __init__(self):
            self.total = 0

        def __call__(self, update):
            self.total += update

    mock_progress_callback = MockProgressCallback()

    instance_ids = [0, 1, 2]
    calculator.run(np.array(instance_ids), mock_progress_callback)

    assert np.isclose(mock_progress_callback.total, 1)

    # testing again with a time_last with no data
    feature_set = FeatureSet(all_features)
    calculator = FeatureSetCalculator(
        es,
        time_last=pd.Timestamp("1950"),
feature_set=feature_set, ) mock_progress_callback = MockProgressCallback() calculator.run(np.array(instance_ids), mock_progress_callback) assert np.isclose(mock_progress_callback.total, 1) # precalculated_features is only used with approximate def test_precalculated_features(es): error_msg = ( "This primitive should never be used because the features are precalculated" ) class ErrorPrim(AggregationPrimitive): """A primitive whose function raises an error.""" name = "error_prim" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): def error(s): raise RuntimeError(error_msg) return error value = Feature(es["log"].ww["value"]) agg = Feature(value, parent_dataframe_name="sessions", primitive=ErrorPrim) agg2 = Feature(agg, parent_dataframe_name="customers", primitive=ErrorPrim) direct = Feature(agg2, dataframe_name="sessions") # Set up a FeatureSet which knows which features are precalculated. precalculated_feature_trie = Trie(default=set, path_constructor=RelationshipPath) precalculated_feature_trie.get_node(direct.relationship_path).value.add( agg2.unique_name(), ) feature_set = FeatureSet( [direct], approximate_feature_trie=precalculated_feature_trie, ) # Fake precalculated data. values = [0, 1, 2] parent_fm = pd.DataFrame({agg2.get_name(): values}) precalculated_fm_trie = Trie(path_constructor=RelationshipPath) precalculated_fm_trie.get_node(direct.relationship_path).value = parent_fm calculator = FeatureSetCalculator( es, feature_set=feature_set, precalculated_features=precalculated_fm_trie, ) instance_ids = [0, 2, 3, 5] fm = calculator.run(np.array(instance_ids)) assert list(fm[direct.get_name()]) == [values[0], values[0], values[1], values[2]] # Calculating without precalculated features should error. 
    with pytest.raises(RuntimeError, match=error_msg):
        FeatureSetCalculator(es, feature_set=FeatureSet([direct])).run(instance_ids)


def test_nunique_nested_with_agg_bug(es):
    """Pandas 2.2.0 has a bug where pd.Series.nunique produces columns with
    the category dtype instead of int64 dtype, causing an error when we attempt
    another aggregation"""
    num_unique_feature = AggregationFeature(
        Feature(es["log"].ww["priority_level"]),
        "sessions",
        primitive=NumUnique,
    )

    mean_nunique_feature = AggregationFeature(
        num_unique_feature,
        "customers",
        primitive=Mean,
    )
    feature_set = FeatureSet([mean_nunique_feature])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    assert df.iloc[0, 0].round(4) == 1.6667


================================================
FILE: featuretools/tests/computational_backend/test_utils.py
================================================
import numpy as np

from featuretools import dfs
from featuretools.computational_backends import replace_inf_values
from featuretools.primitives import DivideByFeature, DivideNumericScalar


def test_replace_inf_values(divide_by_zero_es):
    div_by_scalar = DivideNumericScalar(value=0)
    div_by_feature = DivideByFeature(value=1)
    div_by_feature_neg = DivideByFeature(value=-1)
    for primitive in [
        "divide_numeric",
        div_by_scalar,
        div_by_feature,
        div_by_feature_neg,
    ]:
        fm, _ = dfs(
            entityset=divide_by_zero_es,
            target_dataframe_name="zero",
            trans_primitives=[primitive],
            max_depth=1,
        )
        assert np.inf in fm.values or -np.inf in fm.values
        replaced_fm = replace_inf_values(fm)
        assert np.inf not in replaced_fm.values
        assert -np.inf not in replaced_fm.values

        custom_value_fm = replace_inf_values(fm, replacement_value="custom_val")
        assert np.inf not in custom_value_fm.values
        # check the custom-value matrix here, not replaced_fm a second time
        assert -np.inf not in custom_value_fm.values
        assert "custom_val" in custom_value_fm.values


def test_replace_inf_values_specify_cols(divide_by_zero_es):
    div_by_scalar = DivideNumericScalar(value=0)
    fm, _ = dfs(
entityset=divide_by_zero_es, target_dataframe_name="zero", trans_primitives=[div_by_scalar], max_depth=1, ) assert np.inf in fm["col1 / 0"].values replaced_fm = replace_inf_values(fm, columns=["col1 / 0"]) assert np.inf not in replaced_fm["col1 / 0"].values assert np.inf in replaced_fm["col2 / 0"].values ================================================ FILE: featuretools/tests/config_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/config_tests/test_config.py ================================================ from featuretools import config def test_get_default_config_does_not_change(): old_config = config.get_all() key = "primitive_data_folder" value = "This is an example string" config.set({key: value}) config.set_to_default() assert config.get(key) != value config.set(old_config) def test_set_and_get_config(): key = "primitive_data_folder" old_value = config.get(key) value = "This is an example string" config.set({key: value}) assert config.get(key) == value config.set({key: old_value}) def test_get_all(): assert config.get_all() == config._data ================================================ FILE: featuretools/tests/conftest.py ================================================ import contextlib import copy import os import composeml as cp import numpy as np import pandas as pd import pytest from packaging.version import parse from woodwork.column_schema import ColumnSchema from featuretools import EntitySet, demo from featuretools.primitives import AggregationPrimitive, TransformPrimitive from featuretools.tests.testing_utils import make_ecommerce_entityset @pytest.fixture() def dask_cluster(): distributed = pytest.importorskip( "distributed", reason="Dask not installed, skipping", ) if distributed: with distributed.LocalCluster() as cluster: yield cluster @pytest.fixture() def three_worker_dask_cluster(): distributed = pytest.importorskip( "distributed", reason="Dask not 
installed, skipping", ) if distributed: with distributed.LocalCluster(n_workers=3) as cluster: yield cluster @pytest.fixture(scope="session") def make_es(): return make_ecommerce_entityset() @pytest.fixture(scope="session") def make_int_es(): return make_ecommerce_entityset(with_integer_time_index=True) @pytest.fixture def es(make_es): return copy.deepcopy(make_es) @pytest.fixture def int_es(make_int_es): return copy.deepcopy(make_int_es) @pytest.fixture def latlong_df(): df = pd.DataFrame({"idx": [0, 1, 2], "latLong": [pd.NA, (1, 2), (pd.NA, pd.NA)]}) return df @pytest.fixture def diamond_es(): countries_df = pd.DataFrame({"id": range(2), "name": ["US", "Canada"]}) regions_df = pd.DataFrame( { "id": range(3), "country_id": [0, 0, 1], "name": ["Northeast", "South", "Quebec"], }, ).astype({"name": "category"}) stores_df = pd.DataFrame( { "id": range(5), "region_id": [0, 1, 2, 2, 1], "square_ft": [2000, 3000, 1500, 2500, 2700], }, ) customers_df = pd.DataFrame( { "id": range(5), "region_id": [1, 0, 0, 1, 1], "name": ["A", "B", "C", "D", "E"], }, ) transactions_df = pd.DataFrame( { "id": range(8), "store_id": [4, 4, 2, 3, 4, 0, 1, 1], "customer_id": [3, 0, 2, 4, 3, 3, 2, 3], "amount": [100, 40, 45, 83, 13, 94, 27, 81], }, ) dataframes = { "countries": (countries_df, "id"), "regions": (regions_df, "id"), "stores": (stores_df, "id"), "customers": (customers_df, "id"), "transactions": (transactions_df, "id"), } relationships = [ ("countries", "id", "regions", "country_id"), ("regions", "id", "stores", "region_id"), ("regions", "id", "customers", "region_id"), ("stores", "id", "transactions", "store_id"), ("customers", "id", "transactions", "customer_id"), ] return EntitySet( id="ecommerce_diamond", dataframes=dataframes, relationships=relationships, ) @pytest.fixture def default_value_es(): transactions = pd.DataFrame( {"id": [1, 2, 3, 4], "session_id": ["a", "a", "b", "c"], "value": [1, 1, 1, 1]}, ) sessions = pd.DataFrame({"id": ["a", "b"]}) es = EntitySet() 
es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="id") es.add_dataframe(dataframe_name="sessions", dataframe=sessions, index="id") es.add_relationship("sessions", "id", "transactions", "session_id") return es @pytest.fixture def home_games_es(): teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]}) games = pd.DataFrame( { "id": range(5), "home_team_id": [2, 2, 1, 0, 1], "away_team_id": [1, 0, 2, 1, 0], "home_team_score": [3, 0, 1, 0, 4], "away_team_score": [2, 1, 2, 0, 0], }, ) dataframes = {"teams": (teams, "id"), "games": (games, "id")} relationships = [("teams", "id", "games", "home_team_id")] return EntitySet(dataframes=dataframes, relationships=relationships) @pytest.fixture def games_es(home_games_es): return home_games_es.add_relationship("teams", "id", "games", "away_team_id") @pytest.fixture def mock_customer(): return demo.load_mock_customer(return_entityset=True, random_seed=0) @pytest.fixture def lt(es): def label_func(df): return df["value"].sum() > 10 kwargs = { "time_index": "datetime", "labeling_function": label_func, "window_size": "1m", } if parse(cp.__version__) >= parse("0.10.0"): kwargs["target_dataframe_index"] = "id" else: kwargs["target_dataframe_name"] = "id" # pragma: no cover lm = cp.LabelMaker(**kwargs) df = es["log"] labels = lm.search(df, num_examples_per_instance=-1) labels = labels.rename(columns={"cutoff_time": "time"}) return labels @pytest.fixture def dataframes(): cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]}) transactions_df = pd.DataFrame( { "id": [1, 2, 3, 4, 5, 6], "card_id": [1, 2, 1, 3, 4, 5], "transaction_time": [10, 12, 13, 20, 21, 20], "fraud": [True, False, False, False, True, True], }, ) dataframes = { "cards": (cards_df, "id"), "transactions": (transactions_df, "id", "transaction_time"), } return dataframes @pytest.fixture def relationships(): return [("cards", "id", "transactions", "card_id")] @pytest.fixture def transform_es(): # Create dataframe df = 
pd.DataFrame( { "a": [14, 12, 10], "b": [False, False, True], "b1": [True, True, False], "b12": [4, 5, 6], "P": [10, 15, 12], }, ) es = EntitySet(id="test") # Add dataframe to entityset es.add_dataframe( dataframe_name="first", dataframe=df, index="index", make_index=True, ) return es @pytest.fixture def divide_by_zero_es(): df = pd.DataFrame( { "id": [0, 1, 2, 3], "col1": [1, 0, -3, 4], "col2": [0, 0, 0, 4], }, ) return EntitySet("data", {"zero": (df, "id", None)}) @pytest.fixture def window_series(): return pd.Series( range(20), index=pd.date_range(start="2020-01-01", end="2020-01-20"), ) @pytest.fixture def window_date_range(): return pd.date_range(start="2022-11-1", end="2022-11-5", periods=30) @pytest.fixture def rolling_outlier_series(): return pd.Series( [0] * 4 + [10] + [0] * 4 + [10] + [0] * 5, index=pd.date_range(start="2020-01-01", end="2020-01-15", periods=15), ) @pytest.fixture def postal_code_dataframe(): df = pd.DataFrame( { "string_dtype": pd.Series(["90210", "60018", "10010", "92304-4201"]), "int_dtype": pd.Series([10000, 20000, 30000]).astype("category"), "has_nulls": pd.Series([np.nan, 20000, 30000]).astype("category"), }, ) df.ww.init( logical_types={ "string_dtype": "PostalCode", "int_dtype": "PostalCode", "has_nulls": "PostalCode", }, ) return df def create_test_credentials(test_path): with open(test_path, "w+") as f: f.write("[test]\n") f.write("aws_access_key_id=AKIAIOSFODNN7EXAMPLE\n") f.write("aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\n") def create_test_config(test_path_config): with open(test_path_config, "w+") as f: f.write("[profile test]\n") f.write("region=us-east-2\n") f.write("output=text\n") @pytest.fixture def setup_test_profile(monkeypatch, tmp_path): cache = tmp_path.joinpath(".cache") cache.mkdir() test_path = str(cache.joinpath("test_credentials")) test_path_config = str(cache.joinpath("test_config")) monkeypatch.setenv("AWS_SHARED_CREDENTIALS_FILE", test_path) monkeypatch.setenv("AWS_CONFIG_FILE", 
test_path_config) monkeypatch.delenv("AWS_ACCESS_KEY_ID", raising=False) monkeypatch.delenv("AWS_SECRET_ACCESS_KEY", raising=False) monkeypatch.setenv("AWS_PROFILE", "test") with contextlib.suppress(OSError): os.remove(test_path) os.remove(test_path_config) # pragma: no cover create_test_credentials(test_path) create_test_config(test_path_config) yield os.remove(test_path) os.remove(test_path_config) @pytest.fixture def test_aggregation_primitive(): class TestAgg(AggregationPrimitive): name = "test" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on = [] return TestAgg @pytest.fixture def test_transform_primitive(): class TestTransform(TransformPrimitive): name = "test" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on = [] return TestTransform @pytest.fixture def strings_that_have_triggered_errors_before(): return [ " ", '"This Borderlands game here"" is the perfect conclusion to the ""Borderlands 3"" line, which focuses on the fans ""favorite character and gives the players the opportunity to close for a long time some very important questions about\'s character and the memorable scenery with which the players interact.', ] ================================================ FILE: featuretools/tests/demo_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/demo_tests/test_demo_data.py ================================================ import urllib.request import pandas as pd import pytest from featuretools import EntitySet from featuretools.demo import load_flight, load_mock_customer, load_retail, load_weather @pytest.fixture(autouse=True) def set_testing_headers(): opener = urllib.request.build_opener() opener.addheaders = [("Testing", "True")] urllib.request.install_opener(opener) def test_load_retail_diff(): nrows = 10 es_first = 
load_retail(nrows=nrows)
    assert isinstance(es_first, EntitySet)
    assert es_first["order_products"].shape[0] == nrows

    nrows_second = 11
    es_second = load_retail(nrows=nrows_second)
    assert es_second["order_products"].shape[0] == nrows_second


def test_mock_customer():
    n_customers = 4
    n_products = 3
    n_sessions = 30
    n_transactions = 400
    es = load_mock_customer(
        n_customers=n_customers,
        n_products=n_products,
        n_sessions=n_sessions,
        n_transactions=n_transactions,
        random_seed=0,
        return_entityset=True,
    )
    assert isinstance(es, EntitySet)
    df_names = [df.ww.name for df in es.dataframes]
    expected_names = ["transactions", "products", "sessions", "customers"]
    assert set(expected_names) == set(df_names)
    assert len(es["customers"]) == 4
    assert len(es["products"]) == 3
    assert len(es["sessions"]) == 30
    assert len(es["transactions"]) == 400


def test_load_flight():
    es = load_flight(
        month_filter=[1],
        categorical_filter={"origin_city": ["Charlotte, NC"]},
        return_single_table=False,
        nrows=1000,
    )
    assert isinstance(es, EntitySet)

    dataframe_names = ["airports", "flights", "trip_logs", "airlines"]
    realvals = [(11, 3), (13, 9), (103, 21), (1, 1)]
    for i, name in enumerate(dataframe_names):
        assert es[name].shape == realvals[i]


def test_weather():
    es = load_weather()
    assert isinstance(es, EntitySet)

    dataframe_names = ["temperatures"]
    realvals = [(3650, 3)]
    for i, name in enumerate(dataframe_names):
        assert es[name].shape == realvals[i]
    es = load_weather(return_single_table=True)
    assert isinstance(es, pd.DataFrame)

================================================
FILE: featuretools/tests/entityset_tests/__init__.py
================================================

================================================
FILE: featuretools/tests/entityset_tests/test_es.py
================================================
import copy
import logging
import pickle
import re
from datetime import datetime
from unittest.mock import patch

import numpy as np
import pandas as pd
import pytest
from woodwork.logical_types import (
    URL,
    Boolean,
    Categorical,
    CountryCode,
    Datetime,
    Double,
    EmailAddress,
    Integer,
    LatLong,
    NaturalLanguage,
    Ordinal,
    PostalCode,
    SubRegionCode,
)

from featuretools import Relationship
from featuretools.demo import load_retail
from featuretools.entityset import EntitySet
from featuretools.entityset.entityset import LTI_COLUMN_NAME, WW_SCHEMA_KEY
from featuretools.tests.testing_utils import get_df_tags


def test_normalize_time_index_as_additional_column(es):
    error_text = "Not moving signup_date as it is the base time index column. Perhaps, move the column to the copy_columns."
    with pytest.raises(ValueError, match=error_text):
        assert "signup_date" in es["customers"].columns
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="signup_date",
            additional_columns=["signup_date"],
            copy_columns=[],
        )


def test_normalize_time_index_as_copy_column(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index="signup_date",
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "signup_date" in es["cancellations"].columns
    assert es["cancellations"].ww.time_index == "signup_date"


def test_normalize_time_index_as_copy_column_new_time_index(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index=True,
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "first_customers_time" in es["cancellations"].columns
    assert "signup_date" not in es["cancellations"].columns
    assert es["cancellations"].ww.time_index == "first_customers_time"


def test_normalize_time_index_as_copy_column_no_time_index(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index=False,
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "signup_date" in es["cancellations"].columns
    assert es["cancellations"].ww.time_index is None


def test_cannot_re_add_relationships_that_already_exists(es):
    warn_text = "Not adding duplicate relationship: " + str(es.relationships[0])
    before_len = len(es.relationships)
    rel = es.relationships[0]
    with pytest.warns(UserWarning, match=warn_text):
        es.add_relationship(relationship=rel)
    with pytest.warns(UserWarning, match=warn_text):
        es.add_relationship(
            rel._parent_dataframe_name,
            rel._parent_column_name,
            rel._child_dataframe_name,
            rel._child_column_name,
        )
    after_len = len(es.relationships)
    assert before_len == after_len


def test_add_relationships_convert_type(es):
    for r in es.relationships:
        parent_df = es[r.parent_dataframe.ww.name]
        child_df = es[r.child_dataframe.ww.name]
        assert parent_df.ww.index == r._parent_column_name
        assert "foreign_key" in r.child_column.ww.semantic_tags
        assert str(parent_df[r._parent_column_name].dtype) == str(
            child_df[r._child_column_name].dtype,
        )


def test_add_relationship_diff_param_logical_types(es):
    ordinal_1 = Ordinal(order=[0, 1, 2, 3, 4, 5, 6])
    ordinal_2 = Ordinal(order=[0, 1, 2, 3, 4, 5])
    es["sessions"].ww.set_types(logical_types={"id": ordinal_1})
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": ordinal_2,
        "product_id": Categorical(),
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["session_id"], Ordinal)
    assert isinstance(es["sessions"].ww.logical_types["id"], Ordinal)
    assert (
        es["sessions"].ww.logical_types["id"]
        != es["log2"].ww.logical_types["session_id"]
    )

    warning_text = "Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("sessions", "id", "log2", "session_id")
    assert isinstance(es["log2"].ww.logical_types["product_id"], Categorical)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)


def test_add_relationship_different_logical_types_same_dtype(es):
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": Integer,
        "product_id": CountryCode,
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["product_id"], CountryCode)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)

    warning_text = "Logical type CountryCode for child column product_id does not match parent column id logical type Categorical. Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("products", "id", "log2", "product_id")
    assert isinstance(es["log2"].ww.logical_types["product_id"], Categorical)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)
    assert "foreign_key" in es["log2"].ww.semantic_tags["product_id"]


def test_add_relationship_different_compatible_dtypes(es):
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": Datetime,
        "product_id": Categorical,
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["session_id"], Datetime)
    assert isinstance(es["customers"].ww.logical_types["id"], Integer)

    warning_text = "Logical type Datetime for child column session_id does not match parent column id logical type Integer. Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("customers", "id", "log2", "session_id")
    assert isinstance(es["log2"].ww.logical_types["session_id"], Integer)
    assert isinstance(es["customers"].ww.logical_types["id"], Integer)


def test_add_relationship_errors_child_v_index(es):
    new_df = es["log"].ww.copy()
    new_df.ww._schema.name = "log2"
    es.add_dataframe(dataframe=new_df)

    to_match = "Unable to add relationship because child column 'id' in 'log2' is also its index"
    with pytest.raises(ValueError, match=to_match):
        es.add_relationship("log", "id", "log2", "id")


def test_add_relationship_empty_child_convert_dtype(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    empty_log_df = pd.DataFrame(columns=es["log"].columns)

    es.add_dataframe(empty_log_df, "log")

    assert len(es["log"]) == 0
    # session_id will be Unknown logical type with dtype string
    assert es["log"]["session_id"].dtype == "string"

    es.relationships.remove(relationship)
    assert relationship not in es.relationships

    es.add_relationship(relationship=relationship)
    assert es["log"]["session_id"].dtype == "int64"


def test_add_relationship_with_relationship_object(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    es.add_relationship(relationship=relationship)
    assert relationship in es.relationships


def test_add_relationships_with_relationship_object(es):
    relationships = [Relationship(es, "sessions", "id", "log", "session_id")]
    es.add_relationships(relationships)
    assert relationships[0] in es.relationships


def test_add_relationship_error(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    error_message = (
        "Cannot specify dataframe and column name values and also supply a Relationship"
    )
    with pytest.raises(ValueError, match=error_message):
        es.add_relationship(parent_dataframe_name="sessions", relationship=relationship)


def test_query_by_values_returns_rows_in_given_order():
    data = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5],
            "value": ["a", "c", "b", "a", "a"],
            "time": [1000, 2000, 3000, 4000, 5000],
        },
    )

    es = EntitySet()
    es = es.add_dataframe(
        dataframe=data,
        dataframe_name="test",
        index="id",
        time_index="time",
        logical_types={"value": "Categorical"},
    )
    query = es.query_by_values("test", ["b", "a"], column_name="value")
    assert np.array_equal(query["id"], [1, 3, 4, 5])


def test_query_by_values_secondary_time_index(es):
    end = np.datetime64(datetime(2011, 10, 1))
    all_instances = [0, 1, 2]
    result = es.query_by_values("customers", all_instances, time_last=end)

    for col in ["cancel_date", "cancel_reason"]:
        nulls = result.loc[all_instances][col].isnull() == [False, True, True]
        assert nulls.all(), "Some instance has data it shouldn't for column %s" % col


def test_query_by_id(es):
    df = es.query_by_values("log", instance_vals=[0])
    assert df["id"].values[0] == 0


def test_query_by_single_value(es):
    df = es.query_by_values("log", instance_vals=0)
    assert df["id"].values[0] == 0


def test_query_by_df(es):
    instance_df = pd.DataFrame({"id": [1, 3], "vals": [0, 1]})
    df = es.query_by_values("log", instance_vals=instance_df)

    assert np.array_equal(df["id"], [1, 3])


def test_query_by_id_with_time(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=[0, 1, 2, 3, 4],
        time_last=datetime(2011, 4, 9, 10, 30, 2 * 6),
    )
    assert list(df["id"].values) == [0, 1, 2]


def test_query_by_column_with_time(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=[0, 1, 2],
        column_name="session_id",
        time_last=datetime(2011, 4, 9, 10, 50, 0),
    )

    true_values = [i * 5 for i in range(5)] + [i * 1 for i in range(4)] + [0]

    assert list(df["id"].values) == list(range(10))
    assert list(df["value"].values) == true_values


def test_query_by_column_with_no_lti_and_training_window(es):
    match = (
        "Using training_window but last_time_index is not set for dataframe customers"
    )
    with pytest.warns(UserWarning, match=match):
        df = es.query_by_values(
            dataframe_name="customers",
            instance_vals=[0, 1, 2],
            column_name="cohort",
            time_last=datetime(2011, 4, 11),
            training_window="3d",
        )

    assert list(df["id"].values) == [1]
    assert list(df["age"].values) == [25]


def test_query_by_column_with_lti_and_training_window(es):
    es.add_last_time_indexes()
    df = es.query_by_values(
        dataframe_name="customers",
        instance_vals=[0, 1, 2],
        column_name="cohort",
        time_last=datetime(2011, 4, 11),
        training_window="3d",
    )
    df = df.reset_index(drop=True).sort_values("id")
    assert list(df["id"].values) == [0, 1, 2]
    assert list(df["age"].values) == [33, 25, 56]


def test_query_by_indexed_column(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=["taco clock"],
        column_name="product_id",
    )
    df = df.reset_index(drop=True).sort_values("id")
    assert list(df["id"].values) == [15, 16]


@pytest.fixture
def df():
    return pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "c"]})


def test_check_columns_and_dataframe(df):
    # matches
    logical_types = {"id": Integer, "category": Categorical}
    es = EntitySet(id="test")
    es.add_dataframe(
        df,
        dataframe_name="test_dataframe",
        index="id",
        logical_types=logical_types,
    )
    assert isinstance(
        es.dataframe_dict["test_dataframe"].ww.logical_types["category"],
        Categorical,
    )
    assert es.dataframe_dict["test_dataframe"].ww.semantic_tags["category"] == {
        "category",
    }


def test_make_index_any_location(df):
    logical_types = {"id": Integer, "category": Categorical}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id1",
        make_index=True,
        logical_types=logical_types,
        dataframe=df,
    )
    assert es.dataframe_dict["test_dataframe"].columns[0] == "id1"
    assert es.dataframe_dict["test_dataframe"].ww.index == "id1"


def test_replace_dataframe_and_create_index(es):
    df = pd.DataFrame({"ints": [3, 4, 5], "category": ["a", "b", "a"]})
    final_df = df.copy()
    final_df["id"] = [0, 1, 2]
    needs_idx_df = df.copy()

    logical_types = {"ints": Integer, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id",
        make_index=True,
        logical_types=logical_types,
    )
    assert es["test_df"].ww.index == "id"

    # DataFrame that needs the index column added
    assert "id" not in needs_idx_df.columns
    es.replace_dataframe("test_df", needs_idx_df)

    assert es["test_df"].ww.index == "id"
    df = es["test_df"].sort_values(by="id")
    assert all(df["id"] == final_df["id"])
    assert all(df["ints"] == final_df["ints"])


def test_replace_dataframe_created_index_present(es):
    df = pd.DataFrame({"ints": [3, 4, 5], "category": ["a", "b", "a"]})

    logical_types = {"ints": Integer, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id",
        make_index=True,
        logical_types=logical_types,
    )

    # DataFrame that already has the index column
    has_idx_df = es["test_df"].replace({0: 100})
    has_idx_df.set_index("id", drop=False, inplace=True)

    assert "id" in has_idx_df.columns

    es.replace_dataframe("test_df", has_idx_df)

    assert es["test_df"].ww.index == "id"
    df = es["test_df"].sort_values(by="ints")
    assert all(df["id"] == [100, 1, 2])


def test_index_any_location(df):
    logical_types = {"id": Integer, "category": Categorical}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="category",
        logical_types=logical_types,
        dataframe=df,
    )
    assert es.dataframe_dict["test_dataframe"].columns[1] == "category"
    assert es.dataframe_dict["test_dataframe"].ww.index == "category"


def test_extra_column_type(df):
    # more columns
    logical_types = {"id": Integer, "category": Categorical, "category2": Categorical}

    error_text = re.escape(
        "logical_types contains columns that are not present in dataframe: ['category2']",
    )
    with pytest.raises(LookupError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            index="id",
            logical_types=logical_types,
            dataframe=df,
        )


def test_add_parent_not_index_column(es):
    error_text = "Parent column 'language' is not the index of dataframe régions"
    with pytest.raises(AttributeError, match=error_text):
        es.add_relationship("régions", "language", "customers", "région_id")


@pytest.fixture
def df2():
    return pd.DataFrame({"category": [1, 2, 3], "category2": ["1", "2", "3"]})


def test_none_index(df2):
    es = EntitySet(id="test")

    copy_df = df2.copy()
    copy_df.ww.init(name="test_dataframe")
    error_msg = "Cannot add Woodwork DataFrame to EntitySet without index"
    with pytest.raises(ValueError, match=error_msg):
        es.add_dataframe(dataframe=copy_df)

    warn_text = (
        "Using first column as index. To change this, specify the index parameter"
    )
    with pytest.warns(UserWarning, match=warn_text):
        es.add_dataframe(
            dataframe_name="test_dataframe",
            logical_types={"category": "Categorical"},
            dataframe=df2,
        )
    assert es["test_dataframe"].ww.index == "category"
    assert es["test_dataframe"].ww.semantic_tags["category"] == {"index"}
    assert isinstance(es["test_dataframe"].ww.logical_types["category"], Categorical)


@pytest.fixture
def df3():
    return pd.DataFrame({"category": [1, 2, 3]})


def test_unknown_index(df3):
    warn_text = "index id not found in dataframe, creating new integer column"
    es = EntitySet(id="test")
    with pytest.warns(UserWarning, match=warn_text):
        es.add_dataframe(
            dataframe_name="test_dataframe",
            dataframe=df3,
            index="id",
            logical_types={"category": "Categorical"},
        )
    assert es["test_dataframe"].ww.index == "id"
    assert list(es["test_dataframe"]["id"]) == list(
        range(3),
    )


def test_doesnt_remake_index(df):
    logical_types = {"id": "Integer", "category": "Categorical"}
    error_text = "Cannot make index: column with name id already present"
    with pytest.raises(RuntimeError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            index="id",
            make_index=True,
            dataframe=df,
            logical_types=logical_types,
        )


def test_bad_time_index_column(df3):
    logical_types = {"category": "Categorical"}
    error_text = "Specified time index column `time` not found in dataframe"
    with pytest.raises(LookupError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            dataframe=df3,
            index="category",
            time_index="time",
            logical_types=logical_types,
        )


@pytest.fixture
def df4():
    df = pd.DataFrame(
        {
            "id": [0, 1, 2],
            "category": ["a", "b", "a"],
            "category_int": [1, 2, 3],
            "ints": ["1", "2", "3"],
            "floats": ["1", "2", "3.0"],
        },
    )
    df["category_int"] = df["category_int"].astype("category")
    return df


def test_converts_dtype_on_init(df4):
    logical_types = {"id": Integer, "ints": Integer, "floats": Double}
    es = EntitySet(id="test")
    df4.ww.init(name="test_dataframe", index="id", logical_types=logical_types)
    es.add_dataframe(dataframe=df4)

    df = es["test_dataframe"]
    assert df["ints"].dtype.name == "int64"
    assert df["floats"].dtype.name == "float64"

    # this is inferred from the pandas dtype
    df = es["test_dataframe"]
    assert isinstance(df.ww.logical_types["category_int"], Categorical)


def test_converts_dtype_after_init(df4):
    category_dtype = "category"

    df4["category"] = df4["category"].astype(category_dtype)

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        dataframe=df4,
        logical_types=None,
    )
    df = es["test_dataframe"]

    df.ww.set_types(logical_types={"ints": "Integer"})
    assert isinstance(df.ww.logical_types["ints"], Integer)
    assert df["ints"].dtype == "int64"

    df.ww.set_types(logical_types={"ints": "Categorical"})
    assert isinstance(df.ww.logical_types["ints"], Categorical)
    assert df["ints"].dtype == category_dtype

    df.ww.set_types(logical_types={"ints": Ordinal(order=[1, 2, 3])})
    assert df.ww.logical_types["ints"] == Ordinal(order=[1, 2, 3])
    assert df["ints"].dtype == category_dtype

    df.ww.set_types(logical_types={"ints": "NaturalLanguage"})
    assert isinstance(df.ww.logical_types["ints"], NaturalLanguage)
    assert df["ints"].dtype == "string"


@pytest.fixture
def datetime1():
    times = pd.date_range("1/1/2011", periods=3, freq="H")
    time_strs = times.strftime("%Y-%m-%d")
    return pd.DataFrame({"id": [0, 1, 2], "time": time_strs})


def test_converts_datetime(datetime1):
    # string converts to datetime correctly
    # This test fails without defining logical types.
    # Entityset infers time column should be numeric type
    logical_types = {"id": Integer, "time": Datetime}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        time_index="time",
        logical_types=logical_types,
        dataframe=datetime1,
    )
    pd_col = es["test_dataframe"]["time"]
    assert isinstance(es["test_dataframe"].ww.logical_types["time"], Datetime)
    assert type(pd_col[0]) == pd.Timestamp


@pytest.fixture
def datetime2():
    datetime_format = "%d-%m-%Y"
    actual = pd.Timestamp("Jan 2, 2011")
    time_strs = [actual.strftime(datetime_format)] * 3
    return pd.DataFrame(
        {"id": [0, 1, 2], "time_format": time_strs, "time_no_format": time_strs},
    )


def test_handles_datetime_format(datetime2):
    # check if we load according to the format string
    # pass in an ambiguous date
    datetime_format = "%d-%m-%Y"
    actual = pd.Timestamp("Jan 2, 2011")

    logical_types = {
        "id": Integer,
        "time_format": (Datetime(datetime_format=datetime_format)),
        "time_no_format": Datetime,
    }

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        logical_types=logical_types,
        dataframe=datetime2,
    )

    col_format = es["test_dataframe"]["time_format"]
    col_no_format = es["test_dataframe"]["time_no_format"]
    # without formatting pandas gets it wrong
    assert (col_no_format != actual).all()

    # with formatting we correctly get jan2
    assert (col_format == actual).all()


def test_handles_datetime_mismatch():
    # can't convert arbitrary strings
    df = pd.DataFrame({"id": [0, 1, 2], "time": ["a", "b", "tomorrow"]})
    logical_types = {"id": Integer, "time": Datetime}

    error_text = "Time index column must contain datetime or numeric values"
    with pytest.raises(TypeError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            df,
            dataframe_name="test_dataframe",
            index="id",
            time_index="time",
            logical_types=logical_types,
        )


def test_dataframe_init(es):
    df = pd.DataFrame(
        {
            "id": ["0", "1", "2"],
            "time": [datetime(2011, 4, 9, 10, 31, 3 * i) for i in range(3)],
            "category": ["a", "b", "a"],
            "number": [4, 5, 6],
        },
    )

    logical_types = {"id": Categorical, "time": Datetime}
    es.add_dataframe(
        df.copy(),
        dataframe_name="test_dataframe",
        index="id",
        time_index="time",
        logical_types=logical_types,
    )
    df_shape = df.shape

    es_df_shape = es["test_dataframe"].shape
    assert es_df_shape == df_shape
    assert es["test_dataframe"].ww.index == "id"
    assert es["test_dataframe"].ww.time_index == "time"
    assert set([v for v in es["test_dataframe"].ww.columns]) == set(df.columns)

    assert es["test_dataframe"]["time"].dtype == df["time"].dtype
    assert set(es["test_dataframe"]["id"]) == set(df["id"])


@pytest.fixture
def bad_df():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], 3: ["a", "b", "c"]})


def test_nonstr_column_names(bad_df):
    es = EntitySet(id="Failure")
    error_text = r"All column names must be strings \(Columns \[3\] are not strings\)"
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(dataframe_name="str_cols", dataframe=bad_df, index="a")

    bad_df.ww.init()
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(dataframe_name="str_cols", dataframe=bad_df)


def test_sort_time_id():
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s")[
                ::-1
            ],
        },
    )

    es = EntitySet(
        "test",
        dataframes={"t": (transactions_df.copy(), "id", "transaction_time")},
    )
    assert es["t"] is not transactions_df
    times = list(es["t"].transaction_time)
    assert times == sorted(list(transactions_df.transaction_time))


def test_already_sorted_parameter():
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "transaction_time": [
                datetime(2014, 4, 6),
                datetime(2012, 4, 8),
                datetime(2012, 4, 8),
                datetime(2013, 4, 8),
                datetime(2015, 4, 8),
                datetime(2016, 4, 9),
            ],
        },
    )

    es = EntitySet(id="test")
    es.add_dataframe(
        transactions_df.copy(),
        dataframe_name="t",
        index="id",
        time_index="transaction_time",
        already_sorted=True,
    )

    assert es["t"] is not transactions_df
    times = list(es["t"].transaction_time)
    assert times == list(transactions_df.transaction_time)


def test_concat_not_inplace(es):
    first_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    second_es = copy.deepcopy(es)

    # set the data description
    first_es.metadata

    new_es = first_es.concat(second_es)

    assert new_es == es
    assert new_es._data_description is None
    assert first_es._data_description is not None


def test_concat_inplace(es):
    first_es = copy.deepcopy(es)
    second_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    # set the data description
    es.metadata

    es.concat(first_es, inplace=True)

    assert second_es == es
    assert es._data_description is None


def test_concat_with_lti(es):
    first_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    second_es = copy.deepcopy(es)

    first_es.add_last_time_indexes()
    second_es.add_last_time_indexes()
    es.add_last_time_indexes()

    new_es = first_es.concat(second_es)

    assert new_es == es

    first_es["stores"].ww.pop(LTI_COLUMN_NAME)
    first_es["stores"].ww.metadata.pop("last_time_index")
    second_es["stores"].ww.pop(LTI_COLUMN_NAME)
    second_es["stores"].ww.metadata.pop("last_time_index")
    assert not first_es.__eq__(es, deep=False)
    assert not second_es.__eq__(es, deep=False)
    assert LTI_COLUMN_NAME not in first_es["stores"]
    assert LTI_COLUMN_NAME not in second_es["stores"]

    new_es = first_es.concat(second_es)

    assert new_es.__eq__(es, deep=True)
    # stores will get last time index re-added because it has children that will get lti calculated
    assert LTI_COLUMN_NAME in new_es["stores"]


def test_concat_errors(es):
    # entitysets are not equal
    copy_es = copy.deepcopy(es)
    copy_es["customers"].ww.pop("phone_number")

    error = (
        "Entitysets must have the same dataframes, relationships"
        ", and column names"
    )
    with pytest.raises(ValueError, match=error):
        es.concat(copy_es)


def test_concat_sort_index_with_time_index(es):
    # only pandas dataframes sort on the index and time index
    es1 = copy.deepcopy(es)
    es1.replace_dataframe(
        dataframe_name="customers",
        df=es["customers"].loc[[0, 1], :],
        already_sorted=True,
    )
    es2 = copy.deepcopy(es)
    es2.replace_dataframe(
        dataframe_name="customers",
        df=es["customers"].loc[[2], :],
        already_sorted=True,
    )

    combined_es_order_1 = es1.concat(es2)
    combined_es_order_2 = es2.concat(es1)

    assert list(combined_es_order_1["customers"].index) == [2, 0, 1]
    assert list(combined_es_order_2["customers"].index) == [2, 0, 1]
    assert combined_es_order_1.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(combined_es_order_1, deep=True)


def test_concat_sort_index_without_time_index(es):
    # Sorting is only performed on DataFrames with time indices
    es1 = copy.deepcopy(es)
    es1.replace_dataframe(
        dataframe_name="products",
        df=es["products"].iloc[[0, 1, 2], :],
        already_sorted=True,
    )
    es2 = copy.deepcopy(es)
    es2.replace_dataframe(
        dataframe_name="products",
        df=es["products"].iloc[[3, 4, 5], :],
        already_sorted=True,
    )

    combined_es_order_1 = es1.concat(es2)
    combined_es_order_2 = es2.concat(es1)

    # order matters when we don't sort
    assert list(combined_es_order_1["products"].index) == [
        "Haribo sugar-free gummy bears",
        "car",
        "toothpaste",
        "brown bag",
        "coke zero",
        "taco clock",
    ]
    assert list(combined_es_order_2["products"].index) == [
        "brown bag",
        "coke zero",
        "taco clock",
        "Haribo sugar-free gummy bears",
        "car",
        "toothpaste",
    ]
    assert combined_es_order_1.__eq__(es, deep=True)
    assert not combined_es_order_2.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(es, deep=False)
    assert not combined_es_order_2.__eq__(combined_es_order_1, deep=True)


def test_concat_with_make_index(es):
    df = pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "a"]})
    logical_types = {"id": Categorical, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id1",
        make_index=True,
        logical_types=logical_types,
    )

    es_1 = copy.deepcopy(es)
    es_2 = copy.deepcopy(es)

    assert es.__eq__(es_1, deep=True)
    assert es.__eq__(es_2, deep=True)

    # map of what rows to take from es_1 and es_2 for each dataframe
    emap = {
        "log": [list(range(10)) + [14, 15, 16], list(range(10, 14)) + [15, 16]],
        "sessions": [[0, 1, 2], [1, 3, 4, 5]],
        "customers": [[0, 2], [1, 2]],
        "test_df": [[0, 1], [0, 2]],
    }

    for i, _es in enumerate([es_1, es_2]):
        for df_name, rows in emap.items():
            df = _es[df_name]
            _es.replace_dataframe(dataframe_name=df_name, df=df.loc[rows[i]])

    assert es.__eq__(es_1, deep=False)
    assert es.__eq__(es_2, deep=False)
    assert not es.__eq__(es_1, deep=True)
    assert not es.__eq__(es_2, deep=True)

    old_es_1 = copy.deepcopy(es_1)
    old_es_2 = copy.deepcopy(es_2)
    es_3 = es_1.concat(es_2)

    assert old_es_1.__eq__(es_1, deep=True)
    assert old_es_2.__eq__(es_2, deep=True)

    assert es_3.__eq__(es, deep=True)


@pytest.fixture
def transactions_df():
    return pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud": [True, False, False, False, True, True],
        },
    )


def test_set_time_type_on_init(transactions_df):
    # create cards dataframe
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    cards_logical_types = None
    transactions_logical_types = None

    dataframes = {
        "cards": (cards_df, "id", None, cards_logical_types),
        "transactions": (
            transactions_df,
            "id",
            "transaction_time",
            transactions_logical_types,
        ),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    es = EntitySet("fraud", dataframes, relationships)
    # assert time_type is set
    assert es.time_type == "numeric"


def test_sets_time_when_adding_dataframe(transactions_df):
    accounts_df = pd.DataFrame(
        {
            "id": [3, 4, 5],
            "signup_date": [
                datetime(2002, 5, 1),
                datetime(2006, 3, 20),
                datetime(2011, 11, 11),
            ],
        },
    )

    accounts_df_string = pd.DataFrame(
        {"id": [3, 4, 5], "signup_date": ["element", "exporting", "editable"]},
    )

    accounts_logical_types = None
    transactions_logical_types = None

    # create empty entityset
    es = EntitySet("fraud")
    # assert it's not set
    assert getattr(es, "time_type", None) is None
    # add dataframe
    es.add_dataframe(
        transactions_df,
        dataframe_name="transactions",
        index="id",
        time_index="transaction_time",
        logical_types=transactions_logical_types,
    )
    # assert time_type is set
    assert es.time_type == "numeric"
    # add another dataframe
    es.normalize_dataframe("transactions", "cards", "card_id", make_time_index=True)
    # assert time_type unchanged
    assert es.time_type == "numeric"
    # add wrong time type dataframe
    error_text = "accounts time index is Datetime type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.add_dataframe(
            accounts_df,
            dataframe_name="accounts",
            index="id",
            time_index="signup_date",
            logical_types=accounts_logical_types,
        )

    error_text = "Time index column must contain datetime or numeric values"
    with pytest.raises(TypeError, match=error_text):
        es.add_dataframe(
            accounts_df_string,
            dataframe_name="accounts",
            index="id",
            time_index="signup_date",
        )


def test_secondary_time_index_no_primary_time_index(es):
    es["products"].ww.set_types(logical_types={"rating": "Datetime"})
    assert es["products"].ww.time_index is None

    error = (
        "Cannot set secondary time index on a DataFrame that has no primary time index."
    )
    with pytest.raises(ValueError, match=error):
        es.set_secondary_time_index("products", {"rating": ["url"]})

    assert "secondary_time_index" not in es["products"].ww.metadata
    assert es["products"].ww.time_index is None


def test_set_non_valid_time_index_type(es):
    error_text = "Time index column must be a Datetime or numeric column."
    with pytest.raises(TypeError, match=error_text):
        es["log"].ww.set_time_index("purchased")


def test_checks_time_type_setting_secondary_time_index(es):
    # entityset is timestamp time type
    assert es.time_type == Datetime
    # add secondary index that is timestamp type
    new_2nd_ti = {
        "upgrade_date": ["upgrade_date", "favorite_quote"],
        "cancel_date": ["cancel_date", "cancel_reason"],
    }
    es.set_secondary_time_index("customers", new_2nd_ti)
    assert es.time_type == Datetime
    # add secondary index that is numeric type
    new_2nd_ti = {"age": ["age", "loves_ice_cream"]}

    error_text = "customers time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)
    # add secondary index that is non-time type
    new_2nd_ti = {"favorite_quote": ["favorite_quote", "loves_ice_cream"]}

    error_text = "customers time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)
    # add mismatched pair of secondary time indexes
    new_2nd_ti = {
        "upgrade_date": ["upgrade_date", "favorite_quote"],
        "age": ["age", "loves_ice_cream"],
    }

    error_text = "customers time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)

    # create entityset with numeric time type
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud_decision_time": [11, 14, 15, 21, 22, 21],
            "transaction_city": ["City A"] * 6,
            "transaction_date": [datetime(1989, 2, i) for i in range(1, 7)],
            "fraud": [True, False, False, False, True, True],
        },
    )
    dataframes = {
        "cards": (cards_df, "id"),
        "transactions": (transactions_df, "id", "transaction_time"),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    card_es = EntitySet("fraud", dataframes, relationships)
    assert card_es.time_type == "numeric"
    # add secondary index that is numeric time type
    new_2nd_ti = {"fraud_decision_time": ["fraud_decision_time", "fraud"]}
    card_es.set_secondary_time_index("transactions", new_2nd_ti)
    assert card_es.time_type == "numeric"
    # add secondary index that is timestamp type
    new_2nd_ti = {"transaction_date": ["transaction_date", "fraud"]}

    error_text = "transactions time index is Datetime type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)
    # add secondary index that is non-time type
    new_2nd_ti = {"transaction_city": ["transaction_city", "fraud"]}

    error_text = "transactions time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)
    # add mixed secondary time indexes
    new_2nd_ti = {
        "transaction_city": ["transaction_city", "fraud"],
        "fraud_decision_time": ["fraud_decision_time", "fraud"],
    }
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)

    # add bool secondary time index
    error_text = "transactions time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", {"fraud": ["fraud"]})


def test_normalize_dataframe(es):
    error_text = "'additional_columns' must be a list, but received type.*"
    with pytest.raises(TypeError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            additional_columns="log",
        )

    error_text = "'copy_columns' must be a list, but received type.*"
    with pytest.raises(TypeError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            copy_columns="log",
        )

    es.normalize_dataframe(
        "sessions",
        "device_types",
        "device_type",
        additional_columns=["device_name"],
        make_time_index=False,
    )

    assert len(es.get_forward_relationships("sessions")) == 2
    assert (
        es.get_forward_relationships("sessions")[1].parent_dataframe.ww.name
        == "device_types"
    )
    assert "device_name" in es["device_types"].columns
    assert "device_name" not in es["sessions"].columns
    assert "device_type" in es["device_types"].columns


def test_normalize_dataframe_add_index_as_column(es):
    error_text = "Not adding device_type as both index and column in additional_columns"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            additional_columns=["device_name", "device_type"],
            make_time_index=False,
        )

    error_text = "Not adding device_type as both index and column in copy_columns"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            copy_columns=["device_name", "device_type"],
            make_time_index=False,
        )


def test_normalize_dataframe_new_time_index_in_base_dataframe_error_check(es):
    error_text = "'make_time_index' must be a column in the base dataframe"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="non-existent",
        )


def test_normalize_dataframe_new_time_index_in_column_list_error_check(es):
    error_text = (
        "'make_time_index' must be specified in 'additional_columns' or 'copy_columns'"
    )
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="cancel_date",
        )


def test_normalize_dataframe_new_time_index_copy_success_check(es):
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index="cancel_date",
        additional_columns=[],
        copy_columns=["cancel_date"],
    )


def test_normalize_dataframe_new_time_index_additional_success_check(es):
es.normalize_dataframe( base_dataframe_name="customers", new_dataframe_name="cancellations", index="cancel_reason", make_time_index="cancel_date", additional_columns=["cancel_date"], copy_columns=[], ) @pytest.fixture def normalize_es(): df = pd.DataFrame( { "id": [0, 1, 2, 3], "A": [5, 4, 2, 3], "time": [ datetime(2020, 6, 3), (datetime(2020, 3, 12)), datetime(2020, 5, 1), datetime(2020, 4, 22), ], }, ) es = EntitySet("es") return es.add_dataframe(dataframe_name="data", dataframe=df, index="id") def test_normalize_time_index_from_none(normalize_es): assert normalize_es["data"].ww.time_index is None normalize_es.normalize_dataframe( base_dataframe_name="data", new_dataframe_name="normalized", index="A", make_time_index="time", copy_columns=["time"], ) assert normalize_es["normalized"].ww.time_index == "time" df = normalize_es["normalized"] assert df["time"].is_monotonic_increasing def test_raise_error_if_dupicate_additional_columns_passed(es): error_text = ( "'additional_columns' contains duplicate columns. All columns must be unique." ) with pytest.raises(ValueError, match=error_text): es.normalize_dataframe( "sessions", "device_types", "device_type", additional_columns=["device_name", "device_name"], ) def test_raise_error_if_dupicate_copy_columns_passed(es): error_text = ( "'copy_columns' contains duplicate columns. All columns must be unique." 
) with pytest.raises(ValueError, match=error_text): es.normalize_dataframe( "sessions", "device_types", "device_type", copy_columns=["device_name", "device_name"], ) def test_normalize_dataframe_copies_logical_types(es): es["log"].ww.set_types( logical_types={ "value": Ordinal( order=[0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 14.0, 15.0, 20.0], ), }, ) assert isinstance(es["log"].ww.logical_types["value"], Ordinal) assert len(es["log"].ww.logical_types["value"].order) == 10 assert isinstance(es["log"].ww.logical_types["priority_level"], Ordinal) assert len(es["log"].ww.logical_types["priority_level"].order) == 3 es.normalize_dataframe( "log", "values_2", "value_2", additional_columns=["priority_level"], copy_columns=["value"], make_time_index=False, ) assert len(es.get_forward_relationships("log")) == 3 assert es.get_forward_relationships("log")[2].parent_dataframe.ww.name == "values_2" assert "priority_level" in es["values_2"].columns assert "value" in es["values_2"].columns assert "priority_level" not in es["log"].columns assert "value" in es["log"].columns assert "value_2" in es["values_2"].columns assert isinstance(es["values_2"].ww.logical_types["priority_level"], Ordinal) assert len(es["values_2"].ww.logical_types["priority_level"].order) == 3 assert isinstance(es["values_2"].ww.logical_types["value"], Ordinal) assert len(es["values_2"].ww.logical_types["value"].order) == 10 def test_make_time_index_keeps_original_sorting(): trips = { "trip_id": [999 - i for i in range(1000)], "flight_time": [datetime(1997, 4, 1) for i in range(1000)], "flight_id": [1 for i in range(350)] + [2 for i in range(650)], } order = [i for i in range(1000)] df = pd.DataFrame.from_dict(trips) es = EntitySet("flights") es.add_dataframe( dataframe=df, dataframe_name="trips", index="trip_id", time_index="flight_time", ) assert (es["trips"]["trip_id"] == order).all() es.normalize_dataframe( base_dataframe_name="trips", new_dataframe_name="flights", index="flight_id", make_time_index=True, ) 
assert (es["trips"]["trip_id"] == order).all() def test_normalize_dataframe_new_time_index(es): new_time_index = "value_time" es.normalize_dataframe( "log", "values", "value", make_time_index=True, new_dataframe_time_index=new_time_index, ) assert es["values"].ww.time_index == new_time_index assert new_time_index in es["values"].columns assert len(es["values"].columns) == 2 df = es["values"] assert df[new_time_index].is_monotonic_increasing def test_normalize_dataframe_same_index(es): transactions_df = pd.DataFrame( { "id": [1, 2, 3], "transaction_time": pd.date_range(start="10:00", periods=3, freq="10s"), "first_df_time": [1, 2, 3], }, ) es = EntitySet("example") es.add_dataframe( dataframe_name="df", index="id", time_index="transaction_time", dataframe=transactions_df, ) error_text = "'index' must be different from the index column of the base dataframe" with pytest.raises(ValueError, match=error_text): es.normalize_dataframe( base_dataframe_name="df", new_dataframe_name="new_dataframe", index="id", make_time_index=True, ) def test_secondary_time_index(es): es.normalize_dataframe( "log", "values", "value", make_time_index=True, make_secondary_time_index={"datetime": ["comments"]}, new_dataframe_time_index="value_time", new_dataframe_secondary_time_index="second_ti", ) assert isinstance(es["values"].ww.logical_types["second_ti"], Datetime) assert es["values"].ww.semantic_tags["second_ti"] == set() assert es["values"].ww.metadata["secondary_time_index"] == { "second_ti": ["comments", "second_ti"], } def test_sizeof(es): es.add_last_time_indexes() total_size = 0 for df in es.dataframes: total_size += df.__sizeof__() assert es.__sizeof__() == total_size def test_construct_without_id(): assert EntitySet().id is None def test_repr_without_id(): match = "Entityset: None\n DataFrames:\n Relationships:\n No relationships" assert repr(EntitySet()) == match def test_getitem_without_id(): error_text = "DataFrame test does not exist in entity set" with pytest.raises(KeyError, 
match=error_text): EntitySet()["test"] def test_metadata_without_id(): es = EntitySet() assert es.metadata.id is None @pytest.fixture def datetime3(): return pd.DataFrame({"id": [0, 1, 2], "ints": ["1", "2", "1"]}) def test_datetime64_conversion(datetime3): df = datetime3 df["time"] = pd.Timestamp.now() df["time"] = df["time"].dt.tz_localize("UTC") es = EntitySet(id="test") es.add_dataframe( dataframe_name="test_dataframe", index="id", dataframe=df, logical_types=None, ) es["test_dataframe"].ww.set_time_index("time") assert es["test_dataframe"].ww.time_index == "time" @pytest.fixture def index_df(): return pd.DataFrame( { "id": [1, 2, 3, 4, 5, 6], "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s"), "first_dataframe_time": [1, 2, 3, 5, 6, 6], }, ) def test_same_index_values(index_df): es = EntitySet("example") error_text = ( '"id" is already set as the index. An index cannot also be the time index.' ) with pytest.raises(ValueError, match=error_text): es.add_dataframe( dataframe_name="dataframe", index="id", time_index="id", dataframe=index_df, logical_types=None, ) es.add_dataframe( dataframe_name="dataframe", index="id", time_index="transaction_time", dataframe=index_df, logical_types=None, ) error_text = "time_index and index cannot be the same value, first_dataframe_time" with pytest.raises(ValueError, match=error_text): es.normalize_dataframe( base_dataframe_name="dataframe", new_dataframe_name="new_dataframe", index="first_dataframe_time", make_time_index=True, ) def test_use_time_index(index_df): bad_ltypes = {"transaction_time": Datetime} bad_semantic_tags = {"transaction_time": "time_index"} logical_types = None es = EntitySet() error_text = re.escape( "Cannot add 'time_index' tag directly for column transaction_time. 
To set a column as the time index, use DataFrame.ww.set_time_index() instead.", ) with pytest.raises(ValueError, match=error_text): es.add_dataframe( dataframe_name="dataframe", index="id", logical_types=bad_ltypes, semantic_tags=bad_semantic_tags, dataframe=index_df, ) es.add_dataframe( dataframe_name="dataframe", index="id", time_index="transaction_time", logical_types=logical_types, dataframe=index_df, ) def test_normalize_with_datetime_time_index(es): es.normalize_dataframe( base_dataframe_name="customers", new_dataframe_name="cancel_reason", index="cancel_reason", make_time_index=False, copy_columns=["signup_date", "upgrade_date"], ) assert isinstance(es["cancel_reason"].ww.logical_types["signup_date"], Datetime) assert isinstance(es["cancel_reason"].ww.logical_types["upgrade_date"], Datetime) def test_normalize_with_numeric_time_index(int_es): int_es.normalize_dataframe( base_dataframe_name="customers", new_dataframe_name="cancel_reason", index="cancel_reason", make_time_index=False, copy_columns=["signup_date", "upgrade_date"], ) assert int_es["cancel_reason"].ww.semantic_tags["signup_date"] == {"numeric"} def test_normalize_with_invalid_time_index(es): error_text = "Time index column must contain datetime or numeric values" with pytest.raises(TypeError, match=error_text): es.normalize_dataframe( base_dataframe_name="customers", new_dataframe_name="cancel_reason", index="cancel_reason", copy_columns=["upgrade_date", "favorite_quote"], make_time_index="favorite_quote", ) def test_entityset_init(): cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]}) transactions_df = pd.DataFrame( { "id": [1, 2, 3, 4, 5, 6], "card_id": [1, 2, 1, 3, 4, 5], "transaction_time": [10, 12, 13, 20, 21, 20], "upgrade_date": [51, 23, 45, 12, 22, 53], "fraud": [True, False, False, False, True, True], }, ) logical_types = {"fraud": "boolean", "card_id": "integer"} dataframes = { "cards": (cards_df.copy(), "id", None, {"id": "Integer"}), "transactions": ( transactions_df.copy(), "id", 
"transaction_time", logical_types, None, False, ), } relationships = [("cards", "id", "transactions", "card_id")] es = EntitySet(id="fraud_data", dataframes=dataframes, relationships=relationships) assert es["transactions"].ww.index == "id" assert es["transactions"].ww.time_index == "transaction_time" es_copy = EntitySet(id="fraud_data") es_copy.add_dataframe(dataframe_name="cards", dataframe=cards_df.copy(), index="id") es_copy.add_dataframe( dataframe_name="transactions", dataframe=transactions_df.copy(), index="id", logical_types=logical_types, make_index=False, time_index="transaction_time", ) es_copy.add_relationship("cards", "id", "transactions", "card_id") assert es["cards"].ww == es_copy["cards"].ww assert es["transactions"].ww == es_copy["transactions"].ww def test_add_interesting_values_specified_vals(es): product_vals = ["coke zero", "taco clock"] country_vals = ["AL", "US"] interesting_values = { "product_id": product_vals, "countrycode": country_vals, } es.add_interesting_values(dataframe_name="log", values=interesting_values) assert es["log"].ww["product_id"].ww.metadata["interesting_values"] == product_vals assert es["log"].ww["countrycode"].ww.metadata["interesting_values"] == country_vals def test_add_interesting_values_vals_specified_without_dataframe_name(es): interesting_values = { "countrycode": ["AL", "US"], } error_msg = "dataframe_name must be specified if values are provided" with pytest.raises(ValueError, match=error_msg): es.add_interesting_values(values=interesting_values) def test_add_interesting_values_single_dataframe(es): es.add_interesting_values(dataframe_name="log") expected_vals = { "zipcode": ["02116", "02116-3899", "12345-6789", "1234567890", "0"], "countrycode": ["US", "AL", "ALB", "USA"], "subregioncode": ["US-AZ", "US-MT", "ZM-06", "UG-219"], "priority_level": [0, 1, 2], } for col in es["log"].columns: if col in expected_vals: assert ( es["log"].ww.columns[col].metadata.get("interesting_values") == expected_vals[col] ) else: 
assert es["log"].ww.columns[col].metadata.get("interesting_values") is None def test_add_interesting_values_multiple_dataframes(es): es.add_interesting_values() expected_cols_with_vals = { "régions": {"language"}, "stores": {}, "products": {"department"}, "customers": {"cancel_reason", "engagement_level"}, "sessions": {"device_type", "device_name"}, "log": {"zipcode", "countrycode", "subregioncode", "priority_level"}, "cohorts": {"cohort_name"}, } for df_id, df in es.dataframe_dict.items(): expected_cols = expected_cols_with_vals[df_id] for col in df.columns: if col in expected_cols: assert df.ww.columns[col].metadata.get("interesting_values") is not None else: assert df.ww.columns[col].metadata.get("interesting_values") is None def test_add_interesting_values_verbose_output(caplog): es = load_retail(nrows=200) es["order_products"].ww.set_types({"quantity": "Categorical"}) es["orders"].ww.set_types({"country": "Categorical"}) logger = logging.getLogger("featuretools") logger.propagate = True logger_es = logging.getLogger("featuretools.entityset") logger_es.propagate = True es.add_interesting_values(verbose=True, max_values=10) logger.propagate = False logger_es.propagate = False assert ( "Column country: Marking United Kingdom as an interesting value" in caplog.text ) assert "Column quantity: Marking 6 as an interesting value" in caplog.text def test_entityset_equality(es): first_es = EntitySet() second_es = EntitySet() assert first_es == second_es first_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", time_index="signup_date", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) assert first_es != second_es second_es.add_dataframe( dataframe_name="sessions", dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) assert first_es != second_es first_es.add_dataframe( dataframe_name="sessions", 
dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) second_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", time_index="signup_date", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) assert first_es == second_es first_es.add_relationship("customers", "id", "sessions", "customer_id") assert first_es != second_es assert second_es != first_es second_es.add_relationship("customers", "id", "sessions", "customer_id") assert first_es == second_es def test_entityset_dataframe_dict_and_relationship_equality(es): first_es = EntitySet() second_es = EntitySet() first_es.add_dataframe( dataframe_name="sessions", dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) # Tests if two entity sets are not equal if they have a different # number of dataframes attached. # first_es has 1 dataframe, second_es has 0 dataframes attached. assert first_es != second_es second_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) # Tests if two entity sets are not equal if they have different # dataframes attached. # first_es has the sessions dataframe attached, # second_es has the customers dataframe attached. 
assert first_es != second_es first_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) first_es.add_dataframe( dataframe_name="stores", dataframe=es["stores"].copy(), index="id", logical_types=es["stores"].ww.logical_types, semantic_tags=get_df_tags(es["stores"]), ) first_es.add_dataframe( dataframe_name="régions", dataframe=es["régions"].copy(), index="id", logical_types=es["régions"].ww.logical_types, semantic_tags=get_df_tags(es["régions"]), ) second_es.add_dataframe( dataframe_name="sessions", dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) second_es.add_dataframe( dataframe_name="stores", dataframe=es["stores"].copy(), index="id", logical_types=es["stores"].ww.logical_types, semantic_tags=get_df_tags(es["stores"]), ) second_es.add_dataframe( dataframe_name="régions", dataframe=es["régions"].copy(), index="id", logical_types=es["régions"].ww.logical_types, semantic_tags=get_df_tags(es["régions"]), ) # Now the two entity sets should be equal, # since they have the same dataframes. assert first_es == second_es first_es.add_relationship("customers", "id", "sessions", "customer_id") second_es.add_relationship("régions", "id", "stores", "région_id") # Test if two entity sets are not equal # if they have different relationships. 
assert first_es != second_es def test_entityset_id_equality(): first_es = EntitySet(id="first") first_es_copy = EntitySet(id="first") second_es = EntitySet(id="second") assert first_es != second_es assert first_es == first_es_copy def test_entityset_time_type_equality(): first_es = EntitySet() second_es = EntitySet() assert first_es == second_es first_es.time_type = "numeric" assert first_es != second_es second_es.time_type = Datetime assert first_es != second_es second_es.time_type = "numeric" assert first_es == second_es def test_entityset_deep_equality(es): first_es = EntitySet() second_es = EntitySet() first_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", time_index="signup_date", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) first_es.add_dataframe( dataframe_name="sessions", dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) second_es.add_dataframe( dataframe_name="sessions", dataframe=es["sessions"].copy(), index="id", logical_types=es["sessions"].ww.logical_types, semantic_tags=get_df_tags(es["sessions"]), ) second_es.add_dataframe( dataframe_name="customers", dataframe=es["customers"].copy(), index="id", time_index="signup_date", logical_types=es["customers"].ww.logical_types, semantic_tags=get_df_tags(es["customers"]), ) assert first_es.__eq__(second_es, deep=False) assert first_es.__eq__(second_es, deep=True) # Woodwork metadata only gets included in deep equality check first_es["sessions"].ww.metadata["created_by"] = "user0" assert first_es.__eq__(second_es, deep=False) assert not first_es.__eq__(second_es, deep=True) second_es["sessions"].ww.metadata["created_by"] = "user0" assert first_es.__eq__(second_es, deep=False) assert first_es.__eq__(second_es, deep=True) updated_df = first_es["customers"].loc[[2, 0], :] first_es.replace_dataframe("customers", updated_df) assert 
first_es.__eq__(second_es, deep=False) assert not first_es.__eq__(second_es, deep=True) def test_deepcopy_entityset(make_es): # Uses make_es since the es fixture uses deepcopy copied_es = copy.deepcopy(make_es) assert copied_es == make_es assert copied_es is not make_es for df_name in make_es.dataframe_dict.keys(): original_df = make_es[df_name] new_df = copied_es[df_name] assert new_df.ww.schema == original_df.ww.schema assert new_df.ww._schema is not original_df.ww._schema pd.testing.assert_frame_equal(new_df, original_df) assert new_df is not original_df def test_deepcopy_entityset_woodwork_changes(es): copied_es = copy.deepcopy(es) assert copied_es == es assert copied_es is not es copied_es["products"].ww.add_semantic_tags({"id": "new_tag"}) assert copied_es["products"].ww.semantic_tags["id"] == {"index", "new_tag"} assert es["products"].ww.semantic_tags["id"] == {"index"} assert copied_es != es def test_deepcopy_entityset_featuretools_changes(es): copied_es = copy.deepcopy(es) assert copied_es == es assert copied_es is not es copied_es.set_secondary_time_index( "customers", {"upgrade_date": ["engagement_level"]}, ) assert copied_es["customers"].ww.metadata["secondary_time_index"] == { "upgrade_date": ["engagement_level", "upgrade_date"], } assert es["customers"].ww.metadata["secondary_time_index"] == { "cancel_date": ["cancel_reason", "cancel_date"], } def test_es__getstate__key_unique(es): assert not hasattr(es, WW_SCHEMA_KEY) def test_es_pickling(es): pkl = pickle.dumps(es) unpickled = pickle.loads(pkl) assert es.__eq__(unpickled, deep=True) assert not hasattr(unpickled, WW_SCHEMA_KEY) def test_empty_es_pickling(): es = EntitySet(id="empty") pkl = pickle.dumps(es) unpickled = pickle.loads(pkl) assert es.__eq__(unpickled, deep=True) @patch("featuretools.entityset.entityset.EntitySet.add_dataframe") def test_setitem(add_dataframe): es = EntitySet() df = pd.DataFrame() es["new_df"] = df assert add_dataframe.called add_dataframe.assert_called_with(dataframe=df, 
dataframe_name="new_df") def test_latlong_nan_normalization(latlong_df): latlong_df.ww.init( name="latLong", index="idx", logical_types={"latLong": "LatLong"}, ) dataframes = {"latLong": (latlong_df,)} relationships = [] es = EntitySet("latlong-test", dataframes, relationships) normalized_df = es["latLong"] expected_df = pd.DataFrame( {"idx": [0, 1, 2], "latLong": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]}, ) pd.testing.assert_frame_equal(normalized_df, expected_df) def test_latlong_nan_normalization_add_dataframe(latlong_df): latlong_df.ww.init( name="latLong", index="idx", logical_types={"latLong": "LatLong"}, ) es = EntitySet("latlong-test") es.add_dataframe(latlong_df) normalized_df = es["latLong"] expected_df = pd.DataFrame( {"idx": [0, 1, 2], "latLong": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]}, ) pd.testing.assert_frame_equal(normalized_df, expected_df) ================================================ FILE: featuretools/tests/entityset_tests/test_es_metadata.py ================================================ import pandas as pd import pytest from featuretools import EntitySet from featuretools.tests.testing_utils import backward_path, forward_path def test_cannot_re_add_relationships_that_already_exists(es): before_len = len(es.relationships) es.add_relationship(relationship=es.relationships[0]) after_len = len(es.relationships) assert before_len == after_len def test_add_relationships_convert_type(es): for r in es.relationships: assert r.parent_dataframe.ww.index == r._parent_column_name assert "foreign_key" in r.child_column.ww.semantic_tags assert r.child_column.ww.logical_type == r.parent_column.ww.logical_type def test_get_forward_dataframes(es): dataframes = es.get_forward_dataframes("log") path_to_sessions = forward_path(es, ["log", "sessions"]) path_to_products = forward_path(es, ["log", "products"]) assert list(dataframes) == [ ("sessions", path_to_sessions), ("products", path_to_products), ] def test_get_backward_dataframes(es): dataframes 
= es.get_backward_dataframes("customers") path_to_sessions = backward_path(es, ["customers", "sessions"]) assert list(dataframes) == [("sessions", path_to_sessions)] def test_get_forward_dataframes_deep(es): dataframes = es.get_forward_dataframes("log", deep=True) path_to_sessions = forward_path(es, ["log", "sessions"]) path_to_products = forward_path(es, ["log", "products"]) path_to_customers = forward_path(es, ["log", "sessions", "customers"]) path_to_regions = forward_path(es, ["log", "sessions", "customers", "régions"]) path_to_cohorts = forward_path(es, ["log", "sessions", "customers", "cohorts"]) assert list(dataframes) == [ ("sessions", path_to_sessions), ("customers", path_to_customers), ("cohorts", path_to_cohorts), ("régions", path_to_regions), ("products", path_to_products), ] def test_get_backward_dataframes_deep(es): dataframes = es.get_backward_dataframes("customers", deep=True) path_to_log = backward_path(es, ["customers", "sessions", "log"]) path_to_sessions = backward_path(es, ["customers", "sessions"]) assert list(dataframes) == [("sessions", path_to_sessions), ("log", path_to_log)] def test_get_forward_relationships(es): relationships = es.get_forward_relationships("log") assert len(relationships) == 2 assert relationships[0]._parent_dataframe_name == "sessions" assert relationships[0]._child_dataframe_name == "log" assert relationships[1]._parent_dataframe_name == "products" assert relationships[1]._child_dataframe_name == "log" relationships = es.get_forward_relationships("sessions") assert len(relationships) == 1 assert relationships[0]._parent_dataframe_name == "customers" assert relationships[0]._child_dataframe_name == "sessions" def test_get_backward_relationships(es): relationships = es.get_backward_relationships("sessions") assert len(relationships) == 1 assert relationships[0]._parent_dataframe_name == "sessions" assert relationships[0]._child_dataframe_name == "log" relationships = es.get_backward_relationships("customers") assert 
len(relationships) == 1 assert relationships[0]._parent_dataframe_name == "customers" assert relationships[0]._child_dataframe_name == "sessions" def test_find_forward_paths(es): paths = list(es.find_forward_paths("log", "customers")) assert len(paths) == 1 path = paths[0] assert len(path) == 2 assert path[0]._child_dataframe_name == "log" assert path[0]._parent_dataframe_name == "sessions" assert path[1]._child_dataframe_name == "sessions" assert path[1]._parent_dataframe_name == "customers" def test_find_forward_paths_multiple_paths(diamond_es): paths = list(diamond_es.find_forward_paths("transactions", "regions")) assert len(paths) == 2 path1, path2 = paths r1, r2 = path1 assert r1._child_dataframe_name == "transactions" assert r1._parent_dataframe_name == "stores" assert r2._child_dataframe_name == "stores" assert r2._parent_dataframe_name == "regions" r1, r2 = path2 assert r1._child_dataframe_name == "transactions" assert r1._parent_dataframe_name == "customers" assert r2._child_dataframe_name == "customers" assert r2._parent_dataframe_name == "regions" def test_find_forward_paths_multiple_relationships(games_es): paths = list(games_es.find_forward_paths("games", "teams")) assert len(paths) == 2 path1, path2 = paths assert len(path1) == 1 assert len(path2) == 1 r1 = path1[0] r2 = path2[0] assert r1._child_dataframe_name == "games" assert r2._child_dataframe_name == "games" assert r1._parent_dataframe_name == "teams" assert r2._parent_dataframe_name == "teams" assert r1._child_column_name == "home_team_id" assert r2._child_column_name == "away_team_id" assert r1._parent_column_name == "id" assert r2._parent_column_name == "id" @pytest.fixture def employee_df(): return pd.DataFrame({"id": [0], "manager_id": [0]}) def test_find_forward_paths_ignores_loops(employee_df): dataframes = {"employees": (employee_df, "id")} relationships = [("employees", "id", "employees", "manager_id")] es = EntitySet(dataframes=dataframes, relationships=relationships) paths = 
list(es.find_forward_paths("employees", "employees")) assert len(paths) == 1 assert paths[0] == [] def test_find_backward_paths(es): paths = list(es.find_backward_paths("customers", "log")) assert len(paths) == 1 path = paths[0] assert len(path) == 2 assert path[0]._child_dataframe_name == "sessions" assert path[0]._parent_dataframe_name == "customers" assert path[1]._child_dataframe_name == "log" assert path[1]._parent_dataframe_name == "sessions" def test_find_backward_paths_multiple_paths(diamond_es): paths = list(diamond_es.find_backward_paths("regions", "transactions")) assert len(paths) == 2 path1, path2 = paths r1, r2 = path1 assert r1._child_dataframe_name == "stores" assert r1._parent_dataframe_name == "regions" assert r2._child_dataframe_name == "transactions" assert r2._parent_dataframe_name == "stores" r1, r2 = path2 assert r1._child_dataframe_name == "customers" assert r1._parent_dataframe_name == "regions" assert r2._child_dataframe_name == "transactions" assert r2._parent_dataframe_name == "customers" def test_find_backward_paths_multiple_relationships(games_es): paths = list(games_es.find_backward_paths("teams", "games")) assert len(paths) == 2 path1, path2 = paths assert len(path1) == 1 assert len(path2) == 1 r1 = path1[0] r2 = path2[0] assert r1._child_dataframe_name == "games" assert r2._child_dataframe_name == "games" assert r1._parent_dataframe_name == "teams" assert r2._parent_dataframe_name == "teams" assert r1._child_column_name == "home_team_id" assert r2._child_column_name == "away_team_id" assert r1._parent_column_name == "id" assert r2._parent_column_name == "id" def test_has_unique_path(diamond_es): assert diamond_es.has_unique_forward_path("customers", "regions") assert not diamond_es.has_unique_forward_path("transactions", "regions") def test_raise_key_error_missing_dataframe(es): error_text = "DataFrame testing does not exist in ecommerce" with pytest.raises(KeyError, match=error_text): es["testing"] es_without_id = EntitySet() 
error_text = "DataFrame testing does not exist in entity set" with pytest.raises(KeyError, match=error_text): es_without_id["testing"] def test_add_parent_not_index_column(es): error_text = "Parent column 'language' is not the index of dataframe régions" with pytest.raises(AttributeError, match=error_text): es.add_relationship("régions", "language", "customers", "région_id") ================================================ FILE: featuretools/tests/entityset_tests/test_last_time_index.py ================================================ from datetime import datetime import pandas as pd import pytest from woodwork.logical_types import Categorical, Datetime, Integer from featuretools.entityset.entityset import LTI_COLUMN_NAME @pytest.fixture def values_es(es): es.normalize_dataframe( "log", "values", "value", make_time_index=True, new_dataframe_time_index="value_time", ) return es @pytest.fixture def true_values_lti(): true_values_lti = pd.Series( [ datetime(2011, 4, 10, 10, 41, 0), datetime(2011, 4, 9, 10, 31, 9), datetime(2011, 4, 9, 10, 31, 18), datetime(2011, 4, 9, 10, 31, 27), datetime(2011, 4, 10, 10, 40, 1), datetime(2011, 4, 10, 10, 41, 3), datetime(2011, 4, 9, 10, 30, 12), datetime(2011, 4, 10, 10, 41, 6), datetime(2011, 4, 9, 10, 30, 18), datetime(2011, 4, 9, 10, 30, 24), datetime(2011, 4, 10, 11, 10, 3), ], ) return true_values_lti @pytest.fixture def true_sessions_lti(): sessions_lti = pd.Series( [ datetime(2011, 4, 9, 10, 30, 24), datetime(2011, 4, 9, 10, 31, 27), datetime(2011, 4, 9, 10, 40, 0), datetime(2011, 4, 10, 10, 40, 1), datetime(2011, 4, 10, 10, 41, 6), datetime(2011, 4, 10, 11, 10, 3), ], ) return sessions_lti @pytest.fixture def wishlist_df(): wishlist_df = pd.DataFrame( { "session_id": [0, 1, 2, 2, 3, 4, 5], "datetime": [ datetime(2011, 4, 9, 10, 30, 15), datetime(2011, 4, 9, 10, 31, 30), datetime(2011, 4, 9, 10, 30, 30), datetime(2011, 4, 9, 10, 35, 30), datetime(2011, 4, 10, 10, 41, 0), datetime(2011, 4, 10, 10, 39, 59), datetime(2011, 4, 
                10, 11, 10, 2),
            ],
            "product_id": [
                "coke zero",
                "taco clock",
                "coke zero",
                "car",
                "toothpaste",
                "brown bag",
                "coke zero",
            ],
        },
    )
    return wishlist_df


@pytest.fixture
def extra_session_df(es):
    row_values = {"customer_id": 2, "device_name": "PC", "device_type": 0, "id": 6}
    row = pd.DataFrame(row_values, index=pd.Index([6], name="id"))
    df = es["sessions"]
    df = pd.concat([df, row]).sort_index()
    return df


class TestLastTimeIndex(object):
    def test_leaf(self, es):
        es.add_last_time_indexes()
        log = es["log"]
        lti_name = log.ww.metadata.get("last_time_index")
        assert lti_name == LTI_COLUMN_NAME
        assert len(log[lti_name]) == 17

        log_df = log

        for v1, v2 in zip(log_df[lti_name], log_df["datetime"]):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_leaf_no_time_index(self, es):
        es.add_last_time_indexes()
        stores = es["stores"]
        true_lti = pd.Series([None for x in range(6)], dtype="datetime64[ns]")

        assert len(true_lti) == len(stores[LTI_COLUMN_NAME])

        stores_lti = stores[LTI_COLUMN_NAME]

        for v1, v2 in zip(stores_lti, true_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    # TODO: possible issue with either normalize_dataframe or add_last_time_indexes
    def test_parent(self, values_es, true_values_lti):
        # test dataframe with time index and all instances in child dataframe
        values_es.add_last_time_indexes()
        values = values_es["values"]
        lti_name = values.ww.metadata.get("last_time_index")
        assert len(values[lti_name]) == 10
        sorted_lti = values[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_values_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_some_missing(self, values_es, true_values_lti):
        # test dataframe with time index and not all instances have children
        values = values_es["values"]

        # add extra value instance with no children
        row_values = {
            "value": [21.0],
            "value_time": [pd.Timestamp("2011-04-10 11:10:02")],
        }
        # make sure index doesn't have same name as column to suppress pandas warning
        row = pd.DataFrame(row_values,
            index=pd.Index([21]))
        df = pd.concat([values, row])
        df = df.sort_values(by="value")
        df.index.name = None

        values_es.replace_dataframe(dataframe_name="values", df=df)
        values_es.add_last_time_indexes()
        # lti value should default to instance's time index
        true_values_lti[10] = pd.Timestamp("2011-04-10 11:10:02")
        true_values_lti[11] = pd.Timestamp("2011-04-10 11:10:03")

        values = values_es["values"]
        lti_name = values.ww.metadata.get("last_time_index")
        assert len(values[lti_name]) == 11
        sorted_lti = values[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_values_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_no_time_index(self, es, true_sessions_lti):
        # test dataframe without time index and all instances have children
        es.add_last_time_indexes()
        sessions = es["sessions"]
        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_no_time_index_missing(
        self,
        es,
        extra_session_df,
        true_sessions_lti,
    ):
        # test dataframe without time index and not all instances have children

        # add session instance with no associated log instances
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)
        es.add_last_time_indexes()
        # since sessions has no time index, default value is NaT
        true_sessions_lti[6] = pd.NaT
        sessions = es["sessions"]

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children(self, es, wishlist_df, true_sessions_lti):
        # test all instances in both children
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]
        # wishlist df has more recent events for two session ids
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_right_missing(self, es, wishlist_df, true_sessions_lti):
        # test all instances in left child

        # drop wishlist instance related to id 3 so it's only in log
        wishlist_df.drop(4, inplace=True)
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]

        # now only session id 1 has newer event in wishlist_log
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_left_missing(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # add row to sessions so not all session instances are in log
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)

        # add row to wishlist df so new session instance is in wishlist_log
        row_values = {
            "session_id": [6],
            "datetime": [pd.Timestamp("2011-04-11 11:11:11")],
            "product_id": ["toothpaste"],
        }
        row = pd.DataFrame(row_values,
            index=pd.RangeIndex(start=7, stop=8))
        df = pd.concat([wishlist_df, row])
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()

        # test all instances in right child
        sessions = es["sessions"]

        # now wishlist_log has newer events for 3 session ids
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")
        true_sessions_lti[6] = pd.Timestamp("2011-04-11 11:11:11")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_all_combined(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # add row to sessions so not all session instances are in log
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)

        # add row to wishlist_log so extra session has child instance
        row_values = {
            "session_id": [6],
            "datetime": [pd.Timestamp("2011-04-11 11:11:11")],
            "product_id": ["toothpaste"],
        }
        row = pd.DataFrame(row_values, index=pd.RangeIndex(start=7, stop=8))
        df = pd.concat([wishlist_df, row])

        # drop instance 4 so wishlist_log does not have session id 3 instance
        df.drop(4, inplace=True)
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()

        # test some instances in right, some in left, all when combined
        sessions = es["sessions"]

        # wishlist has newer events for 2 sessions
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[6] = pd.Timestamp("2011-04-11 11:11:11")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_both_missing(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # test all instances in neither child
        sessions = es["sessions"]

        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        # add row to sessions to create session with no events
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]

        # wishlist has 2 newer events and one is NaT
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")
        true_sessions_lti[6] = pd.NaT

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_grandparent(self, es):
        # test sorting by time works correctly across several generations
        df = es["log"]

        # For one user, change a log event to be newer than the user's normal
        # last time index. This event should be from a different session than
        # the current last time index.
        # use .loc rather than chained indexing so the assignment reliably
        # modifies the dataframe itself
        df.loc[5, "datetime"] = pd.Timestamp("2011-4-09 10:40:01")
        df = (
            df.set_index("datetime", append=True)
            .sort_index(level=[1, 0], kind="mergesort")
            .reset_index("datetime", drop=False)
        )
        es.replace_dataframe(dataframe_name="log", df=df)
        es.add_last_time_indexes()
        customers = es["customers"]

        true_customers_lti = pd.Series(
            [
                datetime(2011, 4, 9, 10, 40, 1),
                datetime(2011, 4, 10, 10, 41, 6),
                datetime(2011, 4, 10, 11, 10, 3),
            ],
        )

        lti_name = customers.ww.metadata.get("last_time_index")
        assert len(customers[lti_name]) == 3
        sorted_lti = customers.sort_values("id")[lti_name]
        for v1, v2 in zip(sorted_lti, true_customers_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2


================================================
FILE: featuretools/tests/entityset_tests/test_plotting.py
================================================
import os
import re

import graphviz
import pandas as pd
import pytest

from featuretools import EntitySet


@pytest.fixture
def simple_es():
    es = EntitySet("test")
    df = pd.DataFrame({"foo": [1]})
    es.add_dataframe(df, dataframe_name="test", index="foo")
    return es


def test_returns_digraph_object(es):
    graph = es.plot()

    assert isinstance(graph, graphviz.Digraph)


def test_saving_png_file(es, tmp_path):
    output_path = str(tmp_path.joinpath("test1.png"))

    es.plot(to_file=output_path)

    assert os.path.isfile(output_path)
    os.remove(output_path)


def test_missing_file_extension(es):
    output_path = "test1"

    with pytest.raises(ValueError) as excinfo:
        es.plot(to_file=output_path)

    assert str(excinfo.value).startswith("Please use a file extension")


def test_invalid_format(es):
    output_path = "test1.xzy"

    with pytest.raises(ValueError) as excinfo:
        es.plot(to_file=output_path)

    assert str(excinfo.value).startswith("Unknown format")


def test_multiple_rows(es):
    plot_ = es.plot()
    result = re.findall(r"\((\d+\srows?)\)", plot_.source)
    expected = ["{} rows".format(str(i.shape[0])) for i in es.dataframes]
    assert result == expected


def test_single_row(simple_es):
    plot_ = simple_es.plot()
    result = re.findall(r"\((\d+\srows?)\)", plot_.source)
    expected = ["1 row"]
    assert result == expected


================================================
FILE: featuretools/tests/entityset_tests/test_relationship.py
================================================
from featuretools.entityset.relationship import Relationship, RelationshipPath


def test_relationship_path(es):
    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )
    path_list = [
        (True, log_to_sessions),
        (True, sessions_to_customers),
        (False, sessions_to_customers),
    ]
    path = RelationshipPath(path_list)

    for i, edge in enumerate(path_list):
        assert path[i] == edge

    assert [edge for edge in path] == path_list


def test_relationship_path_name(es):
    assert RelationshipPath([]).name == ""

    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )

    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]
    assert RelationshipPath(forward_path).name == "sessions.customers"

    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]
    assert RelationshipPath(backward_path).name == "sessions.log"

    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]
    assert RelationshipPath(mixed_path).name == "sessions.log"


def test_relationship_path_dataframes(es):
    assert list(RelationshipPath([]).dataframes()) == []

    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )

    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]
    assert list(RelationshipPath(forward_path).dataframes()) == [
        "log",
        "sessions",
        "customers",
    ]

    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]
    assert list(RelationshipPath(backward_path).dataframes()) == [
        "customers",
        "sessions",
        "log",
    ]

    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]
    assert list(RelationshipPath(mixed_path).dataframes()) == ["log", "sessions", "log"]


def test_names_when_multiple_relationships_between_dataframes(games_es):
    relationship = Relationship(games_es, "teams", "id", "games", "home_team_id")
    assert relationship.child_name == "games[home_team_id]"
    assert relationship.parent_name == "teams[home_team_id]"


def test_names_when_no_other_relationship_between_dataframes(home_games_es):
    relationship = Relationship(home_games_es, "teams", "id", "games", "home_team_id")
    assert relationship.child_name == "games"
    assert relationship.parent_name == "teams"


def test_relationship_serialization(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")

    dictionary = {
        "parent_dataframe_name": "sessions",
        "parent_column_name": "id",
        "child_dataframe_name": "log",
        "child_column_name": "session_id",
    }
    assert relationship.to_dictionary() == dictionary
    assert Relationship.from_dictionary(dictionary, es) == relationship


================================================
FILE: featuretools/tests/entityset_tests/test_serialization.py
================================================
import json
import logging
import os
import tempfile
from unittest.mock import MagicMock, patch
from urllib.request import urlretrieve

import boto3
import pandas as pd
import pytest
import woodwork.type_sys.type_system as ww_type_system
from woodwork.logical_types import LogicalType, Ordinal
from woodwork.serializers.serializer_base import typing_info_to_dict
from woodwork.type_sys.utils import list_logical_types

from featuretools.entityset import EntitySet, deserialize, serialize
from featuretools.version import ENTITYSET_SCHEMA_VERSION

BUCKET_NAME = "test-bucket"
WRITE_KEY_NAME = "test-key"
TEST_S3_URL = "s3://{}/{}".format(BUCKET_NAME, WRITE_KEY_NAME)
TEST_FILE = "test_serialization_data_entityset_schema_{}_2022_09_02.tar".format(
    ENTITYSET_SCHEMA_VERSION,
)
S3_URL = "s3://featuretools-static/" + TEST_FILE
URL = "https://featuretools-static.s3.amazonaws.com/" + TEST_FILE
TEST_KEY = "test_access_key_es"


def test_entityset_description(es):
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert es.metadata.__eq__(_es, deep=True)


def test_all_ww_logical_types():
    logical_types = list_logical_types()["type_string"].to_list()
    dataframe = pd.DataFrame(columns=logical_types)
    es = EntitySet()
    ltype_dict = {ltype: ltype for ltype in logical_types}
    ltype_dict["ordinal"] = Ordinal(order=[])
    es.add_dataframe(
        dataframe=dataframe,
        dataframe_name="all_types",
        index="integer",
        logical_types=ltype_dict,
    )
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert es.__eq__(_es, deep=True)


def test_with_custom_ww_logical_type():
    class CustomLogicalType(LogicalType):
        pass

    ww_type_system.add_type(CustomLogicalType)
    columns = ["integer", "natural_language", "custom_logical_type"]
    dataframe = pd.DataFrame(columns=columns)
    es = EntitySet()
    ltype_dict = {
        "integer": "integer",
        "natural_language": "natural_language",
        "custom_logical_type": CustomLogicalType,
    }
    es.add_dataframe(
        dataframe=dataframe,
        dataframe_name="custom_type",
        index="integer",
        logical_types=ltype_dict,
    )
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert isinstance(
        _es["custom_type"].ww.logical_types["custom_logical_type"],
        CustomLogicalType,
    )
    assert es.__eq__(_es, deep=True)


def test_serialize_invalid_formats(es, tmp_path):
    error_text = "must be one of the following formats: {}"
    error_text = error_text.format(", ".join(serialize.FORMATS))
    with pytest.raises(ValueError, match=error_text):
        serialize.write_data_description(es, path=str(tmp_path), format="")


def test_empty_dataframe(es):
    for df in es.dataframes:
        description = typing_info_to_dict(df)
        dataframe = deserialize.empty_dataframe(description)
        assert dataframe.empty
        assert all(dataframe.columns == df.columns)


def test_to_csv(es, tmp_path):
    es.to_csv(str(tmp_path), encoding="utf-8", engine="python")
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    df = es["log"]
    new_df = new_es["log"]
    assert type(df["latlong"][0]) in (tuple, list)
    assert type(new_df["latlong"][0]) in (tuple, list)


def test_to_csv_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_csv(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_csv_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_csv(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_pickle(es, tmp_path):
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert type(es["log"]["latlong"][0]) == tuple
    assert type(new_es["log"]["latlong"][0]) == tuple


def test_to_pickle_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_pickle_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_parquet(es, tmp_path):
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    df = es["log"]
    new_df = new_es["log"]
    assert type(df["latlong"][0]) in (tuple, list)
    assert type(new_df["latlong"][0]) in (tuple, list)


def test_to_parquet_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_parquet_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_parquet_with_lti(tmp_path, mock_customer):
    es = mock_customer
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_pickle_id_none(tmp_path):
    es = EntitySet()
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


# TODO: Fix Moto tests needing to explicitly set permissions for objects
@pytest.fixture
def s3_client():
    _environ = os.environ.copy()
    from moto import mock_aws

    with mock_aws():
        s3 = boto3.resource("s3")
        yield s3
    os.environ.clear()
    os.environ.update(_environ)


@pytest.fixture
def s3_bucket(s3_client, region="us-east-2"):
    location = {"LocationConstraint": region}
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        ACL="public-read-write",
        CreateBucketConfiguration=location,
    )
    s3_bucket = s3_client.Bucket(BUCKET_NAME)
    yield s3_bucket


def make_public(s3_client, s3_bucket):
    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write")


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_csv(es, s3_client, s3_bucket, profile_name):
    es.to_csv(TEST_S3_URL, encoding="utf-8", engine="python", profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_pickle(es, s3_client, s3_bucket, profile_name):
    es.to_pickle(TEST_S3_URL, profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_parquet(es, s3_client, s3_bucket, profile_name):
    es.to_parquet(TEST_S3_URL, profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


def test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile):
    es.to_csv(TEST_S3_URL, encoding="utf-8", engine="python", profile_name="test")
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name="test")
    assert es.__eq__(new_es, deep=True)


def test_serialize_url_csv(es):
    error_text = "Writing to URLs is not supported"
    with pytest.raises(ValueError, match=error_text):
        es.to_csv(URL, encoding="utf-8", engine="python")


def test_serialize_subdirs_not_removed(es, tmp_path):
    write_path = tmp_path.joinpath("test")
    write_path.mkdir()
    test_dir = write_path.joinpath("test_dir")
    test_dir.mkdir()
    description_path = write_path.joinpath("data_description.json")
    with open(description_path, "w") as f:
        json.dump("__SAMPLE_TEXT__", f)
    compression = None
    serialize.write_data_description(
        es,
        path=str(write_path),
        index="1",
        sep="\t",
        encoding="utf-8",
        compression=compression,
    )
    assert os.path.exists(str(test_dir))
    with open(description_path, "r") as f:
        assert "__SAMPLE_TEXT__" not in json.load(f)


def test_deserialize_local_tar(es):
    with tempfile.TemporaryDirectory() as tmp_path:
        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)
        urlretrieve(URL, filename=temp_tar_filepath)
        new_es = deserialize.read_entityset(temp_tar_filepath)
        assert es.__eq__(new_es, deep=True)


@patch("featuretools.entityset.deserialize.getfullargspec")
def test_deserialize_errors_if_python_version_unsafe(mock_inspect, es):
    mock_response = MagicMock()
    mock_response.kwonlyargs = []
    mock_inspect.return_value = mock_response
    with tempfile.TemporaryDirectory() as tmp_path:
        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)
        urlretrieve(URL, filename=temp_tar_filepath)
        with pytest.raises(RuntimeError, match=""):
            deserialize.read_entityset(temp_tar_filepath)


def test_deserialize_url_csv(es):
    new_es = deserialize.read_entityset(URL)
    assert es.__eq__(new_es, deep=True)


def test_deserialize_s3_csv(es):
    new_es = deserialize.read_entityset(S3_URL, profile_name=False)
    assert es.__eq__(new_es, deep=True)


def test_operations_invalidate_metadata(es):
    new_es = EntitySet(id="test")
    # test metadata gets created on access
    assert new_es._data_description is None
    assert new_es.metadata is not None  # generated after access
    assert new_es._data_description is not None
    customers_ltypes = None
    new_es.add_dataframe(
        es["customers"],
        "customers",
        logical_types=customers_ltypes,
    )
    sessions_ltypes = None
    new_es.add_dataframe(
        es["sessions"],
        "sessions",
        logical_types=sessions_ltypes,
    )

    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es = new_es.add_relationship("customers", "id", "sessions", "customer_id")
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es = new_es.normalize_dataframe("customers", "cohort", "cohort")
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es.add_last_time_indexes()
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es.add_interesting_values()
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None


def test_reset_metadata(es):
    assert es.metadata is not None
    assert es._data_description is not None
    es.reset_data_description()
    assert es._data_description is None


@patch("featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("2.1.1", True), ("1.2.1", True), ("1.1.2", True), ("1.0.2", False)],
)
def test_later_schema_version(es, caplog, hardcoded_schema_version, warns):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved entityset"
                "(%s) is greater than the latest supported (%s). "
                "You may need to upgrade featuretools. Attempting to load entityset ..."
                % (version, "1.1.1")
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "warn")

    test_version(hardcoded_schema_version, warns)


@patch("featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("0.1.1", True), ("1.0.1", False), ("1.1.0", False)],
)
def test_earlier_schema_version(
    es,
    caplog,
    monkeypatch,
    hardcoded_schema_version,
    warns,
):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved entityset"
                "(%s) is no longer supported by this version "
                "of featuretools. Attempting to load entityset ..."
                % version
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "log")

    test_version(hardcoded_schema_version, warns)


def _check_schema_version(version, es, warning_text, caplog, warning_type=None):
    dataframes = {
        dataframe.ww.name: typing_info_to_dict(dataframe) for dataframe in es.dataframes
    }
    relationships = [relationship.to_dictionary() for relationship in es.relationships]
    dictionary = {
        "schema_version": version,
        "id": es.id,
        "dataframes": dataframes,
        "relationships": relationships,
    }

    if warning_type == "warn" and warning_text:
        with pytest.warns(UserWarning) as record:
            deserialize.description_to_entityset(dictionary)
        assert record[0].message.args[0] == warning_text
    elif warning_type == "log":
        logger = logging.getLogger("featuretools")
        logger.propagate = True
        deserialize.description_to_entityset(dictionary)
        if warning_text:
            assert warning_text in caplog.text
        else:
            assert not len(caplog.text)
        logger.propagate = False


================================================
FILE: featuretools/tests/entityset_tests/test_timedelta.py
================================================
import pandas as pd
import pytest
from dateutil.relativedelta import relativedelta

from featuretools.entityset import Timedelta
from featuretools.feature_base import Feature
from featuretools.primitives import Count
from featuretools.utils.wrangle import _check_timedelta


def test_timedelta_equality():
    assert Timedelta(10, "d") == Timedelta(10, "d")
    assert Timedelta(10, "d") != 1


def test_singular():
    assert Timedelta.make_singular("Month") == "Month"
    assert Timedelta.make_singular("Months") == "Month"


def test_delta_with_observations(es):
    four_delta = Timedelta(4, "observations")
    assert not four_delta.is_absolute()
    assert four_delta.get_value("o") == 4

    neg_four_delta = -four_delta
    assert not neg_four_delta.is_absolute()
    assert neg_four_delta.get_value("o") == -4

    time = pd.to_datetime("2019-05-01")

    error_txt = "Invalid unit"
    with pytest.raises(Exception,
            match=error_txt):
        time + four_delta

    with pytest.raises(Exception, match=error_txt):
        time - four_delta


def test_delta_with_time_unit_matches_pandas(es):
    customer_id = 0
    sessions_df = es["sessions"]
    sessions_df = sessions_df[sessions_df["customer_id"] == customer_id]
    log_df = es["log"]
    log_df = log_df[log_df["session_id"].isin(sessions_df["id"])]
    all_times = log_df["datetime"].sort_values().tolist()

    # 4 hour delta
    value = 4
    unit = "h"
    delta = Timedelta(value, unit)
    neg_delta = -delta
    # adding the delta matches adding the equivalent pandas Timedelta
    assert all_times[0] + delta == all_times[0] + pd.Timedelta(value, unit)
    # using negative
    assert all_times[0] - neg_delta == all_times[0] + pd.Timedelta(value, unit)

    # subtracting the delta matches subtracting the equivalent pandas Timedelta
    assert all_times[4] - delta == all_times[4] - pd.Timedelta(value, unit)
    # using negative
    assert all_times[4] + neg_delta == all_times[4] - pd.Timedelta(value, unit)


def test_check_timedelta(es):
    time_units = list(Timedelta._readable_units.keys())
    expanded_units = list(Timedelta._readable_units.values())
    exp_to_standard_unit = {e: t for e, t in zip(expanded_units, time_units)}
    singular_units = [u[:-1] for u in expanded_units]
    sing_to_standard_unit = {s: t for s, t in zip(singular_units, time_units)}
    to_standard_unit = {}
    to_standard_unit.update(exp_to_standard_unit)
    to_standard_unit.update(sing_to_standard_unit)
    full_units = singular_units + expanded_units + time_units + time_units
    strings = ["2 {}".format(u) for u in singular_units + expanded_units + time_units]
    strings += ["2{}".format(u) for u in time_units]
    for i, s in enumerate(strings):
        unit = full_units[i]
        standard_unit = unit
        if unit in to_standard_unit:
            standard_unit = to_standard_unit[unit]

        td = _check_timedelta(s)
        assert td.get_value(standard_unit) == 2


def test_check_pd_timedelta(es):
    pdtd = pd.Timedelta(5, "m")
    td = _check_timedelta(pdtd)
    assert td.get_value("s") == 300


def test_string_timedelta_args():
    assert Timedelta("1 second") == Timedelta(1, "second")
    assert Timedelta("1 seconds") == Timedelta(1, "second")
    assert Timedelta("10 days") == Timedelta(10, "days")
    assert Timedelta("100 days") == Timedelta(100, "days")
    assert Timedelta("1001 days") == Timedelta(1001, "days")
    assert Timedelta("1001 weeks") == Timedelta(1001, "weeks")


def test_feature_takes_timedelta_string(es):
    feature = Feature(
        Feature(es["log"].ww["id"]),
        parent_dataframe_name="customers",
        use_previous="1 day",
        primitive=Count,
    )
    assert feature.use_previous == Timedelta(1, "d")


def test_deltas_week(es):
    customer_id = 0
    sessions_df = es["sessions"]
    sessions_df = sessions_df[sessions_df["customer_id"] == customer_id]
    log_df = es["log"]
    log_df = log_df[log_df["session_id"].isin(sessions_df["id"])]
    all_times = log_df["datetime"].sort_values().tolist()
    delta_week = Timedelta(1, "w")
    delta_days = Timedelta(7, "d")

    assert all_times[0] + delta_days == all_times[0] + delta_week


def test_relative_year():
    td_time = "1 years"
    td = _check_timedelta(td_time)
    assert td.get_value("Y") == 1
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-02-29")
    assert time + td == pd.to_datetime("2021-02-28")


def test_serialization():
    times = [Timedelta(1, unit="w"), Timedelta(3, unit="d"), Timedelta(5, unit="o")]

    dictionaries = [
        {"value": 1, "unit": "w"},
        {"value": 3, "unit": "d"},
        {"value": 5, "unit": "o"},
    ]

    for td, expected in zip(times, dictionaries):
        assert expected == td.get_arguments()

    for expected, dictionary in zip(times, dictionaries):
        assert expected == Timedelta.from_dictionary(dictionary)

    # Test multiple temporal parameters separately since it is not deterministic
    mult_time = {"years": 4, "months": 3, "days": 2}
    mult_td = Timedelta(mult_time)

    # Serialize
    td_units = mult_td.get_arguments()["unit"]
    td_values = mult_td.get_arguments()["value"]
    arg_list = list(zip(td_values, td_units))

    assert (4, "Y") in arg_list
    assert (3, "mo") in arg_list
    assert (2, "d") in arg_list

    # Deserialize
    assert mult_td == Timedelta.from_dictionary(
        {"value": [4, 3, 2], "unit": ["Y", "mo", "d"]},
    )
def test_relative_month():
    td_time = "1 month"
    td = _check_timedelta(td_time)
    assert td.get_value("mo") == 1
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-01-31")
    assert time + td == pd.to_datetime("2020-02-29")

    td_time = "6 months"
    td = _check_timedelta(td_time)
    assert td.get_value("mo") == 6
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-01-31")
    assert time + td == pd.to_datetime("2020-07-31")


def test_has_multiple_units():
    single_unit = pd.DateOffset(months=3)
    multiple_units = pd.DateOffset(months=3, years=3, days=5)
    single_td = _check_timedelta(single_unit)
    multiple_td = _check_timedelta(multiple_units)
    assert single_td.has_multiple_units() is False
    assert multiple_td.has_multiple_units() is True


def test_pd_dateoffset_to_timedelta():
    single_temporal = pd.DateOffset(months=3)
    single_td = _check_timedelta(single_temporal)
    assert single_td.get_value("mo") == 3
    assert single_td.delta_obj == pd.DateOffset(months=3)

    mult_temporal = pd.DateOffset(years=10, months=3, days=5)
    mult_td = _check_timedelta(mult_temporal)
    expected = {"Y": 10, "mo": 3, "d": 5}
    assert mult_td.get_value() == expected
    assert mult_td.delta_obj == mult_temporal
    # get_name() for multiple values is not deterministic
    assert len(mult_td.get_name()) == len("10 Years 3 Months 5 Days")

    special_dateoffset = pd.offsets.BDay(100)
    special_td = _check_timedelta(special_dateoffset)
    assert special_td.get_value("businessdays") == 100
    assert special_td.delta_obj == special_dateoffset


def test_pd_dateoffset_to_timedelta_math():
    base = pd.to_datetime("2020-01-31")
    add = _check_timedelta(pd.DateOffset(months=2))
    res = base + add
    assert res == pd.to_datetime("2020-03-31")

    base_2 = pd.to_datetime("2020-01-31")
    add_2 = _check_timedelta(pd.DateOffset(months=2, days=3))
    res_2 = base_2 + add_2
    assert res_2 == pd.to_datetime("2020-04-03")

    base_3 = pd.to_datetime("2019-09-20")
    sub = _check_timedelta(pd.offsets.BDay(10))
    res_3 = base_3 - sub
    assert res_3 == pd.to_datetime("2019-09-06")

================================================
FILE: featuretools/tests/entityset_tests/test_ww_es.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
import pytest
from woodwork.exceptions import TypeConversionError
from woodwork.logical_types import (
    Boolean,
    Categorical,
    Datetime,
    Double,
    Integer,
    NaturalLanguage,
)

from featuretools.entityset.entityset import LTI_COLUMN_NAME, EntitySet


def test_empty_es():
    es = EntitySet("es")
    assert es.id == "es"
    assert es.dataframe_dict == {}
    assert es.relationships == []
    assert es.time_type is None


@pytest.fixture
def df():
    return pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "c"]}).astype(
        {"category": "category"},
    )


def test_init_es_with_dataframe(df):
    es = EntitySet("es", dataframes={"table": (df, "id")})
    assert es.id == "es"
    assert len(es.dataframe_dict) == 1
    assert es["table"] is df

    assert es["table"].ww.schema is not None
    assert isinstance(es["table"].ww.logical_types["id"], Integer)
    assert isinstance(es["table"].ww.logical_types["category"], Categorical)


def test_init_es_with_woodwork_table_same_name(df):
    df.ww.init(index="id", name="table")
    es = EntitySet("es", dataframes={"table": (df,)})

    assert es.id == "es"
    assert len(es.dataframe_dict) == 1
    assert es["table"] is df

    assert es["table"].ww.schema is not None

    assert es["table"].ww.index == "id"
    assert es["table"].ww.time_index is None

    assert isinstance(es["table"].ww.logical_types["id"], Integer)
    assert isinstance(es["table"].ww.logical_types["category"], Categorical)


def test_init_es_with_woodwork_table_diff_name_error(df):
    df.ww.init(index="id", name="table")

    error = "Naming conflict in dataframes dictionary: dictionary key 'diff_name' does not match dataframe name 'table'"
    with pytest.raises(ValueError, match=error):
        EntitySet("es", dataframes={"diff_name": (df,)})


def test_init_es_with_dataframe_and_params(df):
    logical_types = {"id": "NaturalLanguage",
"category": NaturalLanguage} semantic_tags = {"category": "new_tag"} es = EntitySet( "es", dataframes={"table": (df, "id", None, logical_types, semantic_tags)}, ) assert es.id == "es" assert len(es.dataframe_dict) == 1 assert es["table"] is df assert es["table"].ww.schema is not None assert es["table"].ww.index == "id" assert es["table"].ww.time_index is None assert isinstance(es["table"].ww.logical_types["id"], NaturalLanguage) assert isinstance(es["table"].ww.logical_types["category"], NaturalLanguage) assert es["table"].ww.semantic_tags["id"] == {"index"} assert es["table"].ww.semantic_tags["category"] == {"new_tag"} def test_init_es_with_multiple_dataframes(df): second_df = pd.DataFrame({"id": [0, 1, 2, 3], "first_table_id": [1, 2, 2, 1]}) df.ww.init(name="first_table", index="id") es = EntitySet( "es", dataframes={ "first_table": (df,), "second_table": ( second_df, "id", None, None, {"first_table_id": "foreign_key"}, ), }, ) assert len(es.dataframe_dict) == 2 assert es["first_table"].ww.schema is not None assert es["second_table"].ww.schema is not None def test_add_dataframe_to_es(df): es1 = EntitySet("es") assert es1.dataframe_dict == {} es1.add_dataframe( df, dataframe_name="table", index="id", semantic_tags={"category": "new_tag"}, ) assert len(es1.dataframe_dict) == 1 copy_df = df.ww.copy() es2 = EntitySet("es") assert es2.dataframe_dict == {} es2.add_dataframe(copy_df) assert len(es2.dataframe_dict) == 1 assert es1["table"].ww == es2["table"].ww def test_change_es_dataframe_schema(df): df.ww.init(index="id", name="table") es = EntitySet("es", dataframes={"table": (df,)}) assert es["table"].ww.index == "id" es["table"].ww.set_index("category") assert es["table"].ww.index == "category" def test_init_es_with_relationships(df): second_df = pd.DataFrame({"id": [0, 1, 2, 3], "first_table_id": [1, 2, 2, 1]}) df.ww.init(name="first_table", index="id") second_df.ww.init(name="second_table", index="id") es = EntitySet( "es", dataframes={"first_table": (df,), 
"second_table": (second_df,)}, relationships=[("first_table", "id", "second_table", "first_table_id")], ) assert len(es.relationships) == 1 forward_dataframes = [name for name, _ in es.get_forward_dataframes("second_table")] assert forward_dataframes[0] == "first_table" relationship = es.relationships[0] assert "foreign_key" in relationship.child_column.ww.semantic_tags assert "index" in relationship.parent_column.ww.semantic_tags @pytest.fixture def dates_df(): return pd.DataFrame( { "backwards_order": [8, 7, 6, 5, 4, 3, 2, 1, 0], "dates_backwards": [ "2020-09-09", "2020-09-08", "2020-09-07", "2020-09-06", "2020-09-05", "2020-09-04", "2020-09-03", "2020-09-02", "2020-09-01", ], "random_order": [7, 6, 8, 0, 2, 4, 3, 1, 5], "repeating_dates": [ "2020-08-01", "2019-08-01", "2020-08-01", "2012-08-01", "2019-08-01", "2019-08-01", "2019-08-01", "2013-08-01", "2019-08-01", ], "special": [7, 8, 0, 1, 4, 2, 6, 3, 5], "special_dates": [ "2020-08-01", "2019-08-01", "2020-08-01", "2012-08-01", "2019-08-01", "2019-08-01", "2019-08-01", "2013-08-01", "2019-08-01", ], }, ) def test_add_secondary_time_index(dates_df): dates_df.ww.init( name="dates_table", index="backwards_order", time_index="dates_backwards", ) es = EntitySet("es") es.add_dataframe( dates_df, secondary_time_index={"repeating_dates": ["random_order", "special"]}, ) assert dates_df.ww.metadata["secondary_time_index"] == { "repeating_dates": ["random_order", "special", "repeating_dates"], } def test_time_type_check_order(dates_df): dates_df.ww.init( name="dates_table", index="backwards_order", time_index="random_order", ) es = EntitySet("es") error = "dates_table time index is Datetime type which differs from other entityset time indexes" with pytest.raises(TypeError, match=error): es.add_dataframe( dates_df, secondary_time_index={"repeating_dates": ["random_order", "special"]}, ) assert "secondary_time_index" not in dates_df.ww.metadata def test_add_time_index_through_woodwork_different_type(dates_df): 
dates_df.ww.init( name="dates_table", index="backwards_order", time_index="dates_backwards", ) es = EntitySet("es") es.add_dataframe( dates_df, secondary_time_index={"repeating_dates": ["random_order", "special"]}, ) assert dates_df.ww.metadata["secondary_time_index"] == { "repeating_dates": ["random_order", "special", "repeating_dates"], } assert es.time_type == Datetime assert es._check_uniform_time_index(es["dates_table"]) is None dates_df.ww.set_time_index("random_order") assert dates_df.ww.time_index == "random_order" error = "dates_table time index is numeric type which differs from other entityset time indexes" with pytest.raises(TypeError, match=error): es._check_uniform_time_index(es["dates_table"]) def test_init_with_mismatched_time_types(dates_df): dates_df.ww.init( name="dates_table", index="backwards_order", time_index="repeating_dates", ) es = EntitySet("es") es.add_dataframe(dates_df, secondary_time_index={"special_dates": ["special"]}) assert es.time_type == Datetime nums_df = pd.DataFrame({"id": [1, 2, 3], "times": [9, 8, 7]}) nums_df.ww.init(name="numerics_table", index="id", time_index="times") error = "numerics_table time index is numeric type which differs from other entityset time indexes" with pytest.raises(TypeError, match=error): es.add_dataframe(nums_df) def test_int_double_time_type(dates_df): dates_df.ww.init( name="dates_table", index="backwards_order", time_index="random_order", logical_types={"random_order": "Integer", "special": "Double"}, ) es = EntitySet("es") # Both random_order and special are numeric, but they are different logical types es.add_dataframe(dates_df, secondary_time_index={"special": ["dates_backwards"]}) assert isinstance(es["dates_table"].ww.logical_types["random_order"], Integer) assert isinstance(es["dates_table"].ww.logical_types["special"], Double) assert es["dates_table"].ww.time_index == "random_order" assert "special" in es["dates_table"].ww.metadata["secondary_time_index"] def test_normalize_dataframe(): 
df = pd.DataFrame( { "id": range(4), "full_name": [ "Mr. John Doe", "Doe, Mrs. Jane", "James Brown", "Ms. Paige Turner", ], "email": [ "john.smith@example.com", np.nan, "team@featuretools.com", "junk@example.com", ], "phone_number": [ "5555555555", "555-555-5555", "1-(555)-555-5555", "555-555-5555", ], "age": pd.Series([33, None, 33, 57], dtype="Int64"), "signup_date": [pd.to_datetime("2020-09-01")] * 4, "is_registered": pd.Series([True, False, True, None], dtype="boolean"), }, ) df.ww.init(name="first_table", index="id", time_index="signup_date") es = EntitySet("es") es.add_dataframe(df) es.normalize_dataframe( "first_table", "second_table", "age", additional_columns=["phone_number", "full_name"], make_time_index=True, ) assert len(es.dataframe_dict) == 2 assert "foreign_key" in es["first_table"].ww.semantic_tags["age"] def test_replace_dataframe(): df = pd.DataFrame( { "id": range(4), "full_name": [ "Mr. John Doe", "Doe, Mrs. Jane", "James Brown", "Ms. Paige Turner", ], "email": [ "john.smith@example.com", np.nan, "team@featuretools.com", "junk@example.com", ], "phone_number": [ "5555555555", "555-555-5555", "1-(555)-555-5555", "555-555-5555", ], "age": pd.Series([33, None, 33, 57], dtype="Int64"), "signup_date": [pd.to_datetime("2020-09-01")] * 4, "is_registered": pd.Series([True, False, True, None], dtype="boolean"), }, ) df.ww.init(name="table", index="id") es = EntitySet("es") es.add_dataframe(df) original_schema = es["table"].ww.schema new_df = df.iloc[2:] es.replace_dataframe("table", new_df) assert len(es["table"]) == 2 assert es["table"].ww.schema == original_schema def test_add_last_time_index(es): es.add_last_time_indexes(["products"]) assert "last_time_index" in es["products"].ww.metadata assert es["products"].ww.metadata["last_time_index"] == LTI_COLUMN_NAME assert LTI_COLUMN_NAME in es["products"] assert "last_time_index" in es["products"].ww.semantic_tags[LTI_COLUMN_NAME] assert isinstance(es["products"].ww.logical_types[LTI_COLUMN_NAME], Datetime) 
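

# Illustration (not part of the featuretools test suite): the last-time-index
# tests check that a parent dataframe such as "products" gains a column holding,
# for each row, the latest time at which that row or any of its child rows
# appears. The plain-Python sketch below shows that idea under the assumption
# that the rule is a simple maximum over parent and child times. The function
# name `last_time_index` and its dict/tuple inputs are hypothetical and do not
# mirror the EntitySet API.

```python
from datetime import datetime


def last_time_index(parent_times, child_rows):
    """Map each parent id to the latest time on the parent row or any child row.

    parent_times: dict of parent id -> the parent's own time (may be empty)
    child_rows: iterable of (parent id, child time) pairs
    """
    latest = dict(parent_times)
    for parent_id, child_time in child_rows:
        # A parent without its own time index takes its value from its children.
        if parent_id not in latest or child_time > latest[parent_id]:
            latest[parent_id] = child_time
    return latest


# Usage: parent "a" has its own time, parent "b" is only seen through a child.
example = last_time_index(
    {"a": datetime(2020, 1, 1)},
    [
        ("a", datetime(2020, 3, 1)),
        ("b", datetime(2019, 5, 1)),
        ("a", datetime(2020, 2, 1)),
    ],
)
```

# Because the values are times, the resulting column shares the time type of
# the entityset, which is what the logical-type assertions in these tests check.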
def test_lti_already_has_last_time_column_name(es): col = es["customers"].ww.pop("loves_ice_cream") col.name = LTI_COLUMN_NAME es["customers"].ww[LTI_COLUMN_NAME] = col assert LTI_COLUMN_NAME in es["customers"].columns assert isinstance(es["customers"].ww.logical_types[LTI_COLUMN_NAME], Boolean) error = ( "Cannot add a last time index on DataFrame with an existing " f"'{LTI_COLUMN_NAME}' column. Please rename '{LTI_COLUMN_NAME}'." ) with pytest.raises(ValueError, match=error): es.add_last_time_indexes(["customers"]) def test_numeric_es_last_time_index_logical_type(int_es): assert int_es.time_type == "numeric" int_es.add_last_time_indexes() for df in int_es.dataframes: assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Double) int_es._check_uniform_time_index(df, LTI_COLUMN_NAME) def test_datetime_es_last_time_index_logical_type(es): assert es.time_type == Datetime es.add_last_time_indexes() for df in es.dataframes: assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Datetime) es._check_uniform_time_index(df, LTI_COLUMN_NAME) def test_dataframe_without_name(es): new_es = EntitySet() new_df = es["sessions"].copy() assert new_df.ww.schema is None error = "Cannot add dataframe to EntitySet without a name. Please provide a value for the dataframe_name parameter." 
with pytest.raises(ValueError, match=error): new_es.add_dataframe(new_df) def test_dataframe_with_name_parameter(es): new_es = EntitySet() new_df = es["sessions"][["id"]] assert new_df.ww.schema is None new_es.add_dataframe( new_df, dataframe_name="df_name", index="id", logical_types={"id": "Integer"}, ) assert new_es["df_name"].ww.name == "df_name" def test_woodwork_dataframe_without_name_errors(es): new_es = EntitySet() new_df = es["sessions"].ww.copy() new_df.ww._schema.name = None assert new_df.ww.name is None error = "Cannot add a Woodwork DataFrame to EntitySet without a name" with pytest.raises(ValueError, match=error): new_es.add_dataframe(new_df) def test_woodwork_dataframe_with_name(es): new_es = EntitySet() new_df = es["sessions"].ww.copy() new_df.ww._schema.name = "df_name" assert new_df.ww.name == "df_name" new_es.add_dataframe(new_df) assert new_es["df_name"].ww.name == "df_name" def test_woodwork_dataframe_ignore_conflicting_name_parameter_warning(es): new_es = EntitySet() new_df = es["sessions"].ww.copy() new_df.ww._schema.name = "df_name" assert new_df.ww.name == "df_name" warning = "A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: dataframe_name" with pytest.warns(UserWarning, match=warning): new_es.add_dataframe(new_df, dataframe_name="conflicting_name") assert new_es["df_name"].ww.name == "df_name" def test_woodwork_dataframe_same_name_parameter(es): new_es = EntitySet() new_df = es["sessions"].ww.copy() new_df.ww._schema.name = "df_name" assert new_df.ww.name == "df_name" new_es.add_dataframe(new_df, dataframe_name="df_name") assert new_es["df_name"].ww.name == "df_name" def test_extra_woodwork_params(es): new_es = EntitySet() sessions_df = es["sessions"].ww.copy() assert sessions_df.ww.index == "id" assert sessions_df.ww.time_index is None assert isinstance(sessions_df.ww.logical_types["id"], Integer) warning_msg = ( "A Woodwork-initialized DataFrame was provided, so the following parameters were 
ignored: " "index, time_index, logical_types, make_index, semantic_tags, already_sorted" ) with pytest.warns(UserWarning, match=warning_msg): new_es.add_dataframe( dataframe_name="sessions", dataframe=sessions_df, index="filepath", time_index="customer_id", logical_types={"id": Categorical}, make_index=True, already_sorted=True, semantic_tags={"id": "new_tag"}, ) assert sessions_df.ww.index == "id" assert sessions_df.ww.time_index is None assert isinstance(sessions_df.ww.logical_types["id"], Integer) assert "new_tag" not in sessions_df.ww.semantic_tags def test_replace_dataframe_errors(es): df = es["customers"].copy() df["new"] = pd.Series([1, 2, 3]) error_text = "New dataframe is missing new cohort column" with pytest.raises(ValueError, match=error_text): es.replace_dataframe(dataframe_name="customers", df=df.drop(columns=["cohort"])) error_text = "New dataframe contains 16 columns, expecting 15" with pytest.raises(ValueError, match=error_text): es.replace_dataframe(dataframe_name="customers", df=df) def test_replace_dataframe_already_sorted(es): # test already_sorted on dataframe without time index df = es["sessions"].copy() updated_id = df["id"] updated_id.iloc[1] = 2 updated_id.iloc[2] = 1 df = df.set_index("id", drop=False) df.index.name = None es.replace_dataframe(dataframe_name="sessions", df=df.copy(), already_sorted=False) sessions_df = es["sessions"] assert sessions_df["id"].iloc[1] == 2 # no sorting since time index not defined es.replace_dataframe(dataframe_name="sessions", df=df.copy(), already_sorted=True) sessions_df = es["sessions"] assert sessions_df["id"].iloc[1] == 2 # test already_sorted on dataframe with time index df = es["customers"].copy() updated_signup = df["signup_date"] updated_signup.iloc[0] = datetime(2011, 4, 11) assert es["customers"].ww.time_index == "signup_date" df["signup_date"] = updated_signup es.replace_dataframe(dataframe_name="customers", df=df.copy(), already_sorted=True) customers_df = es["customers"] assert 
customers_df["id"].iloc[0] == 2 es.replace_dataframe(dataframe_name="customers", df=df.copy(), already_sorted=False) updated_customers = es["customers"] assert updated_customers["id"].iloc[0] == 0 def test_replace_dataframe_invalid_schema(es): df = es["customers"].copy() df["id"] = pd.Series([1, 1, 1]) error_text = "Index column must be unique" with pytest.raises(IndexError, match=error_text): es.replace_dataframe(dataframe_name="customers", df=df) def test_replace_dataframe_mismatched_index(es): df = es["customers"].copy() df["id"] = pd.Series([99, 88, 77]) es.replace_dataframe(dataframe_name="customers", df=df) assert all([77, 99, 88] == es["customers"]["id"]) assert all([77, 99, 88] == (es["customers"]["id"]).index) def test_replace_dataframe_different_dtypes(es): float_dtype_df = es["customers"].copy() float_dtype_df = float_dtype_df.astype({"age": "float64"}) es.replace_dataframe(dataframe_name="customers", df=float_dtype_df) assert es["customers"]["age"].dtype == "int64" assert isinstance(es["customers"].ww.logical_types["age"], Integer) incompatible_dtype_df = es["customers"].copy() incompatible_list = ["hi", "bye", "bye"] incompatible_dtype_df["age"] = pd.Series(incompatible_list) error_msg = "Error converting datatype for age from type object to type int64. Please confirm the underlying data is consistent with logical type Integer." 
with pytest.raises(TypeConversionError, match=error_msg): es.replace_dataframe(dataframe_name="customers", df=incompatible_dtype_df) @pytest.fixture() def latlong_df(): latlong_df = pd.DataFrame( { "tuples": pd.Series([(1, 2), (3, 4)]), "string_tuple": pd.Series(["(1, 2)", "(3, 4)"]), "bracketless_string_tuple": pd.Series(["1, 2", "3, 4"]), "list_strings": pd.Series([["1", "2"], ["3", "4"]]), "combo_tuple_types": pd.Series(["[1, 2]", "(3, 4)"]), }, ) latlong_df.set_index("string_tuple", drop=False, inplace=True) latlong_df.index.name = None return latlong_df def test_replace_dataframe_data_transformation(latlong_df): initial_df = latlong_df.copy() initial_df.ww.init( name="latlongs", index="string_tuple", logical_types={col_name: "LatLong" for col_name in initial_df.columns}, ) es = EntitySet() es.add_dataframe(dataframe=initial_df) df = es["latlongs"] expected_val = (1, 2) for col in latlong_df.columns: series = df[col] assert series.iloc[0] == expected_val es.replace_dataframe("latlongs", latlong_df) df = es["latlongs"] expected_val = (3, 4) for col in latlong_df.columns: series = df[col] assert series.iloc[-1] == expected_val def test_replace_dataframe_column_order(es): original_column_order = es["customers"].columns.copy() df = es["customers"].copy() col = df.pop("cohort") df[col.name] = col assert not df.columns.equals(original_column_order) assert set(df.columns) == set(original_column_order) es.replace_dataframe(dataframe_name="customers", df=df) assert es["customers"].columns.equals(original_column_order) def test_replace_dataframe_different_woodwork_initialized(es): df = es["customers"].copy() df["age"] = pd.Series([1, 2, 3]) # Initialize Woodwork on the new DataFrame and change the schema so it won't match the original DataFrame's schema df.ww.init(schema=es["customers"].ww.schema) df.ww.set_types( logical_types={"id": "NaturalLanguage", "cancel_date": "NaturalLanguage"}, ) assert df["id"].dtype == "string" assert df["cancel_date"].dtype == "string" 
assert es["customers"]["id"].dtype == "int64" assert es["customers"]["cancel_date"].dtype == "datetime64[ns]" original_schema = es["customers"].ww.schema warning = "Woodwork typing information on new dataframe will be replaced with existing typing information from customers" with pytest.warns(UserWarning, match=warning): es.replace_dataframe("customers", df, already_sorted=True) actual = es["customers"]["age"].sort_values() assert all(actual == [1, 2, 3]) assert es["customers"].ww._schema == original_schema assert es["customers"]["id"].dtype == "int64" assert es["customers"]["cancel_date"].dtype == "datetime64[ns]" def test_replace_dataframe_and_min_last_time_index(es): es.add_last_time_indexes(["products"]) original_time_index = es["log"]["datetime"].copy() original_last_time_index = es["products"][LTI_COLUMN_NAME].copy() new_time_index = original_time_index + pd.Timedelta(days=1) expected_last_time_index = original_last_time_index + pd.Timedelta(days=1) new_dataframe = es["log"].copy() new_dataframe["datetime"] = new_time_index new_dataframe.pop(LTI_COLUMN_NAME) es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True) pd.testing.assert_series_equal( es["products"][LTI_COLUMN_NAME].sort_index(), expected_last_time_index.sort_index(), ) pd.testing.assert_series_equal( es["log"][LTI_COLUMN_NAME].sort_index(), new_time_index.sort_index(), check_names=False, ) def test_replace_dataframe_dont_recalculate_last_time_index_present(es): es.add_last_time_indexes() original_time_index = es["customers"]["signup_date"].copy() original_last_time_index = es["customers"][LTI_COLUMN_NAME].copy() new_time_index = original_time_index + pd.Timedelta(days=10) new_dataframe = es["customers"].copy() new_dataframe["signup_date"] = new_time_index es.replace_dataframe( "customers", new_dataframe, recalculate_last_time_indexes=False, ) pd.testing.assert_series_equal( es["customers"][LTI_COLUMN_NAME], original_last_time_index, ) def 
test_replace_dataframe_dont_recalculate_last_time_index_not_present(es): es.add_last_time_indexes() original_lti_name = es["customers"].ww.metadata.get("last_time_index") assert original_lti_name is not None original_time_index = es["customers"]["signup_date"].copy() new_time_index = original_time_index + pd.Timedelta(days=10) new_dataframe = es["customers"].copy() new_dataframe["signup_date"] = new_time_index new_dataframe.pop(LTI_COLUMN_NAME) es.replace_dataframe( "customers", new_dataframe, recalculate_last_time_indexes=False, ) assert "last_time_index" not in es["customers"].ww.metadata assert original_lti_name not in es["customers"].columns def test_replace_dataframe_recalculate_last_time_index_not_present(es): es.add_last_time_indexes() original_time_index = es["log"]["datetime"].copy() new_time_index = original_time_index + pd.Timedelta(days=10) new_dataframe = es["log"].copy() new_dataframe["datetime"] = new_time_index new_dataframe.pop(LTI_COLUMN_NAME) es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True) pd.testing.assert_series_equal( es["log"]["datetime"].sort_index(), new_time_index.sort_index(), check_names=False, ) pd.testing.assert_series_equal( es["log"][LTI_COLUMN_NAME].sort_index(), new_time_index.sort_index(), check_names=False, ) def test_replace_dataframe_recalculate_last_time_index_present(es): es.add_last_time_indexes() original_time_index = es["log"]["datetime"].copy() new_time_index = original_time_index + pd.Timedelta(days=10) new_dataframe = es["log"].copy() new_dataframe["datetime"] = new_time_index assert LTI_COLUMN_NAME in new_dataframe.columns es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True) pd.testing.assert_series_equal( es["log"]["datetime"].sort_index(), new_time_index.sort_index(), check_names=False, ) pd.testing.assert_series_equal( es["log"][LTI_COLUMN_NAME].sort_index(), new_time_index.sort_index(), check_names=False, ) def 
test_normalize_dataframe_loses_column_metadata(es): es["log"].ww.columns["value"].metadata["interesting_values"] = [0.0, 1.0] es["log"].ww.columns["priority_level"].metadata["interesting_values"] = [1] es["log"].ww.columns["value"].description = "a value column" es["log"].ww.columns["priority_level"].description = "a priority level column" assert "interesting_values" in es["log"].ww.columns["priority_level"].metadata assert "interesting_values" in es["log"].ww.columns["value"].metadata assert es["log"].ww.columns["value"].description == "a value column" assert ( es["log"].ww.columns["priority_level"].description == "a priority level column" ) es.normalize_dataframe( "log", "values_2", "value_2", additional_columns=["priority_level"], copy_columns=["value"], make_time_index=False, ) # Metadata in the original dataframe and the new dataframe are maintained assert "interesting_values" in es["log"].ww.columns["value"].metadata assert "interesting_values" in es["values_2"].ww.columns["value"].metadata assert "interesting_values" in es["values_2"].ww.columns["priority_level"].metadata assert es["log"].ww.columns["value"].description == "a value column" assert es["values_2"].ww.columns["value"].description == "a value column" assert ( es["values_2"].ww.columns["priority_level"].description == "a priority level column" ) def test_normalize_ww_init(): es = EntitySet() df = pd.DataFrame( { "id": [1, 2, 3, 4], "col": ["a", "b", "c", "d"], "df2_id": [1, 1, 2, 2], "df2_col": [True, False, True, True], }, ) df.ww.init(index="id", name="test_name") es.add_dataframe(dataframe=df) assert es["test_name"].ww.name == "test_name" assert es["test_name"].ww.schema.name == "test_name" es.normalize_dataframe( "test_name", "new_df", "df2_id", additional_columns=["df2_col"], ) assert es["test_name"].ww.name == "test_name" assert es["test_name"].ww.schema.name == "test_name" assert es["new_df"].ww.name == "new_df" assert es["new_df"].ww.schema.name == "new_df" 
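

# Illustration (not part of the featuretools test suite): the normalize tests
# above split one dataframe into two, keyed by an existing column. The sketch
# below shows the general shape of that operation on plain lists of dicts,
# using the same data as test_normalize_ww_init. It assumes the first row seen
# for each key supplies the moved and copied values; the function name
# `normalize` and its signature are hypothetical and only loosely follow the
# arguments passed to EntitySet.normalize_dataframe in the tests.

```python
def normalize(rows, index, additional_columns=(), copy_columns=()):
    """Split `rows` into (child_rows, parent_rows), with parents keyed by `index`."""
    parents = {}
    for row in rows:
        key = row[index]
        # Keep only the first row seen for each key (assumed de-duplication rule).
        if key not in parents:
            parents[key] = {index: key}
            for col in (*additional_columns, *copy_columns):
                parents[key][col] = row[col]
    # Additional columns move to the parent; copied columns stay in both tables.
    moved = set(additional_columns)
    children = [{c: v for c, v in row.items() if c not in moved} for row in rows]
    return children, list(parents.values())


rows = [
    {"id": 1, "col": "a", "df2_id": 1, "df2_col": True},
    {"id": 2, "col": "b", "df2_id": 1, "df2_col": False},
    {"id": 3, "col": "c", "df2_id": 2, "df2_col": True},
    {"id": 4, "col": "d", "df2_id": 2, "df2_col": True},
]
children, parents = normalize(rows, "df2_id", additional_columns=["df2_col"])
```

# The key column stays in the child rows, where it acts as the foreign key that
# the tests look for via the "foreign_key" semantic tag.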
================================================
FILE: featuretools/tests/entry_point_tests/__init__.py
================================================

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/__init__.py
================================================

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/__init__.py
================================================

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/featuretools_plugin/__init__.py
================================================
raise NotImplementedError("plugin not implemented")

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/setup.py
================================================
from setuptools import setup

setup(
    name="featuretools_plugin",
    packages=["featuretools_plugin"],
    entry_points={
        "featuretools_plugin": [
            "module = featuretools_plugin",
        ],
    },
)

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/__init__.py
================================================

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/__init__.py
================================================

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/existing_primitive.py
================================================
from featuretools.primitives.base import AggregationPrimitive


class Sum(AggregationPrimitive):
    """A primitive that should currently exist for testing."""

    pass

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/invalid_primitive.py
================================================
raise NotImplementedError("invalid primitive")

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/new_primitive.py
================================================
from featuretools.primitives.base import TransformPrimitive


class NewPrimitive(TransformPrimitive):
    """A primitive that should not currently exist for testing."""

    pass

================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/setup.py
================================================
from setuptools import find_packages, setup

setup(
    name="featuretools_primitives",
    packages=find_packages(),
    entry_points={
        "featuretools_primitives": [
            "new = featuretools_primitives.new_primitive",
            "invalid = featuretools_primitives.invalid_primitive",
            "existing = featuretools_primitives.existing_primitive",
        ],
    },
)

================================================
FILE: featuretools/tests/entry_point_tests/test_plugin.py
================================================
from featuretools.tests.entry_point_tests.utils import (
    _import_featuretools,
    _install_featuretools_plugin,
    _uninstall_featuretools_plugin,
)


def test_plugin_warning():
    _install_featuretools_plugin()
    warning = _import_featuretools("warning").stdout.decode()
    debug = _import_featuretools("debug").stdout.decode()
    _uninstall_featuretools_plugin()

    message = (
        "Featuretools failed to load plugin module from library featuretools_plugin"
    )
    traceback = "NotImplementedError: plugin not implemented"

    assert message in warning
    assert traceback not in warning
    assert message in debug
    assert traceback in debug

================================================
FILE: featuretools/tests/entry_point_tests/test_primitives.py
================================================
from featuretools.tests.entry_point_tests.utils import (
    _import_featuretools,
    _install_featuretools_primitives,
    _python,
    _uninstall_featuretools_primitives,
)


def test_entry_point():
    _install_featuretools_primitives()
    featuretools_log = _import_featuretools("debug").stdout.decode()
    new_primitive = _python("-c", "from featuretools.primitives import NewPrimitive")
    _uninstall_featuretools_primitives()
    assert new_primitive.returncode == 0

    invalid_primitive = 'Featuretools failed to load "invalid" primitives from "featuretools_primitives.invalid_primitive". '
    invalid_primitive += "For a full stack trace, set logging to debug."
    assert invalid_primitive in featuretools_log

    existing_primitive = 'While loading primitives via "existing" entry point, '
    existing_primitive += 'ignored primitive "Sum" from "featuretools_primitives.existing_primitive" because a primitive '
    existing_primitive += 'with that name already exists in "featuretools.primitives.standard.aggregation.sum_primitive"'
    assert existing_primitive in featuretools_log

================================================
FILE: featuretools/tests/entry_point_tests/utils.py
================================================
import os
import subprocess
import sys


def _get_path_to_add_ons(*args):
    pwd = os.path.dirname(__file__)
    return os.path.join(pwd, "add-ons", *args)


def _python(*args):
    command = [sys.executable, *args]
    return subprocess.run(command, stdout=subprocess.PIPE)


def _install_featuretools_plugin():
    os.chdir(_get_path_to_add_ons("featuretools_plugin"))
    return _python("-m", "pip", "install", "-e", ".")


def _uninstall_featuretools_plugin():
    return _python("-m", "pip", "uninstall", "featuretools_plugin", "-y")


def _install_featuretools_primitives():
    os.chdir(_get_path_to_add_ons("featuretools_primitives"))
    return _python("-m", "pip", "install", "-e", ".")


def _uninstall_featuretools_primitives():
    return _python("-m", "pip", "uninstall", "featuretools_primitives", "-y")


def _import_featuretools(level=None):
    c = ""
    if level:
        c += "import os;"
        c += 'os.environ["FEATURETOOLS_LOG_LEVEL"] = "%s";' % level

    c += "import featuretools;"
    return _python("-c", c)

================================================
FILE: featuretools/tests/feature_discovery/__init__.py
================================================

================================================
FILE: featuretools/tests/feature_discovery/test_convertors.py
================================================
from woodwork.logical_types import Double, NaturalLanguage

from featuretools.entityset.entityset import EntitySet
from featuretools.feature_base.feature_base import (
    FeatureBase,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_discovery.convertors import (
    _convert_feature_to_featurebase,
    convert_feature_list_to_featurebase_list,
    convert_featurebase_list_to_feature_list,
)
from featuretools.feature_discovery.feature_discovery import (
    generate_features_from_primitives,
    schema_to_features,
)
from featuretools.feature_discovery.LiteFeature import (
    LiteFeature,
)
from featuretools.primitives import Absolute, AddNumeric, Lag
from featuretools.synthesis import dfs
from featuretools.tests.feature_discovery.test_feature_discovery import (
    MultiOutputPrimitiveForTest,
)
from featuretools.tests.testing_utils.generate_fake_dataframe import (
    generate_fake_dataframe,
)


def test_convert_featurebase_list_to_feature_list():
    col_defs = [
        ("idx", "Integer", {"index"}),
        ("f_1", "Double"),
        ("f_2", "Double"),
        ("f_3", "NaturalLanguage"),
    ]

    df = generate_fake_dataframe(
        col_defs=col_defs,
    )

    es = EntitySet(id="es")
    es.add_dataframe(df, df.ww.name)

    fdefs = dfs(
        entityset=es,
        target_dataframe_name=df.ww.name,
        trans_primitives=[AddNumeric, MultiOutputPrimitiveForTest],
        features_only=True,
        max_depth=1,
    )

    assert isinstance(fdefs, list)
    assert isinstance(fdefs[0], FeatureBase)

    converted_features = set(convert_featurebase_list_to_feature_list(fdefs))

    f1 = LiteFeature("f_1", Double)
f2 = LiteFeature("f_2", Double) f3 = LiteFeature("f_3", NaturalLanguage) fadd = LiteFeature( name="f_1 + f_2", tags={"numeric"}, primitive=AddNumeric(), base_features=[f1, f2], ) fmo0 = LiteFeature( name="TEST_MO(f_3)[0]", tags={"numeric"}, primitive=MultiOutputPrimitiveForTest(), base_features=[f3], idx=0, ) fmo1 = LiteFeature( name="TEST_MO(f_3)[1]", tags={"numeric"}, primitive=MultiOutputPrimitiveForTest(), base_features=[f3], idx=1, ) fmo0.related_features = {fmo1} fmo1.related_features = {fmo0} orig_features = set([f1, f2, fadd, fmo0, fmo1]) assert len(orig_features.symmetric_difference(converted_features)) == 0 def test_origin_feature_to_featurebase(): df = generate_fake_dataframe( col_defs=[("idx", "Double", {"index"}), ("f_1", "Double")], ) es = EntitySet(id="test") es.add_dataframe(df, df.ww.name) origin_features = schema_to_features(df.ww.schema) f_1 = [f for f in origin_features if f.name == "f_1"][0] fb = _convert_feature_to_featurebase(f_1, df, {}) assert isinstance(fb, IdentityFeature) assert fb.get_name() == "f_1" f_1.set_alias("new name") df.ww.rename({"f_1": "new name"}, inplace=True) fb = _convert_feature_to_featurebase(f_1, df, {}) assert isinstance(fb, IdentityFeature) assert fb.get_name() == "new name" def test_stacked_feature_to_featurebase(): df = generate_fake_dataframe( col_defs=[("idx", "Double", {"index"}), ("f_1", "Double")], ) es = EntitySet(id="test") es.add_dataframe(df, df.ww.name) origin_features = schema_to_features(df.ww.schema) f_1 = [f for f in origin_features if f.name == "f_1"][0] features = generate_features_from_primitives([f_1], [Absolute()]) f_2 = [f for f in features if f.name == "ABSOLUTE(f_1)"][0] fb = _convert_feature_to_featurebase(f_2, df, {}) assert isinstance(fb, TransformFeature) assert fb.get_name() == "ABSOLUTE(f_1)" assert len(fb.base_features) == 1 assert fb.base_features[0].get_name() == "f_1" f_2.set_alias("f_2") fb = _convert_feature_to_featurebase(f_2, df, {}) assert isinstance(fb, TransformFeature) assert 
fb.get_name() == "f_2" assert len(fb.base_features) == 1 assert fb.base_features[0].get_name() == "f_1" def test_multi_output_to_featurebase(): df = generate_fake_dataframe( col_defs=[ ("idx", "Double", {"index"}), ("f_1", "NaturalLanguage"), ], ) es = EntitySet(id="test") es.add_dataframe(df, df.ww.name) origin_features = schema_to_features(df.ww.schema) f_1 = [f for f in origin_features if f.name == "f_1"][0] features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()]) lsa_features = [f for f in features if f.get_primitive_name() == "test_mo"] assert len(lsa_features) == 2 # Test Single LiteFeature fb = _convert_feature_to_featurebase(lsa_features[0], df, {}) assert isinstance(fb, TransformFeature) assert fb.get_name() == "TEST_MO(f_1)" assert len(fb.base_features) == 1 assert set(fb.get_feature_names()) == set(["TEST_MO(f_1)[0]", "TEST_MO(f_1)[1]"]) assert fb.base_features[0].get_name() == "f_1" # Test that feature gets consolidated fb_list = convert_feature_list_to_featurebase_list(lsa_features, df) assert len(fb_list) == 1 assert fb_list[0].get_name() == "TEST_MO(f_1)" assert len(fb_list[0].base_features) == 1 assert set(fb_list[0].get_feature_names()) == set( ["TEST_MO(f_1)[0]", "TEST_MO(f_1)[1]"], ) assert fb_list[0].base_features[0].get_name() == "f_1" lsa_features[0].set_alias("f_2") lsa_features[1].set_alias("f_3") fb = _convert_feature_to_featurebase(lsa_features[0], df, {}) assert isinstance(fb, TransformFeature) assert len(fb.base_features) == 1 assert set(fb.get_feature_names()) == set(["f_2", "f_3"]) assert fb.base_features[0].get_name() == "f_1" # Test that feature gets consolidated fb_list = convert_feature_list_to_featurebase_list(lsa_features, df) assert len(fb_list) == 1 assert len(fb_list[0].base_features) == 1 assert set(fb_list[0].get_feature_names()) == set(["f_2", "f_3"]) assert fb_list[0].base_features[0].get_name() == "f_1" def test_stacking_on_multioutput_to_featurebase(): col_defs = [ ("idx", "Double", 
{"index"}), ("t_idx", "Datetime", {"time_index"}), ("f_1", "NaturalLanguage"), ] df = generate_fake_dataframe( col_defs=col_defs, ) es = EntitySet(id="test") es.add_dataframe(df, df.ww.name) origin_features = schema_to_features(df.ww.schema) time_index_feature = [f for f in origin_features if f.name == "t_idx"][0] f_1 = [f for f in origin_features if f.name == "f_1"][0] features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()]) lsa_features = [f for f in features if f.get_primitive_name() == "test_mo"] assert len(lsa_features) == 2 features = generate_features_from_primitives( lsa_features + [time_index_feature], [Lag(periods=2)], ) lag_features = [f for f in features if f.get_primitive_name() == "lag"] assert len(lag_features) == 2 fb_list = convert_feature_list_to_featurebase_list(lag_features, df) assert len(fb_list) == 2 assert isinstance(fb_list[0], TransformFeature) assert set([x.get_name() for x in fb_list]) == set( [ "LAG(TEST_MO(f_1)[0], t_idx, periods=2)", "LAG(TEST_MO(f_1)[1], t_idx, periods=2)", ], ) lsa_features[0].set_alias("f_2") lsa_features[1].set_alias("f_3") features = generate_features_from_primitives( lsa_features + [time_index_feature], [Lag(periods=2)], ) lag_features = [f for f in features if f.get_primitive_name() == "lag"] assert len(lag_features) == 2 fb_list = convert_feature_list_to_featurebase_list(lag_features, df) assert len(fb_list) == 2 assert isinstance(fb_list[0], TransformFeature) assert set([x.get_name() for x in fb_list]) == set( ["LAG(f_2, t_idx, periods=2)", "LAG(f_3, t_idx, periods=2)"], ) ================================================ FILE: featuretools/tests/feature_discovery/test_feature_collection.py ================================================ import pytest from woodwork.logical_types import ( Boolean, Double, Ordinal, ) from featuretools.feature_discovery.FeatureCollection import FeatureCollection from featuretools.feature_discovery.LiteFeature import LiteFeature from 
featuretools.primitives import Absolute, AddNumeric @pytest.mark.parametrize( "feature_args, expected", [ ( ("idx", Double), ["ANY", "Double", "Double,numeric", "numeric"], ), ( ("idx", Double, {"index"}), ["ANY", "Double", "Double,index", "index"], ), ( ("idx", Double, {"other"}), [ "ANY", "Double", "other", "numeric", "Double,other", "Double,numeric", "numeric,other", "Double,numeric,other", ], ), ( ("idx", Ordinal, {"other"}), [ "ANY", "Ordinal", "other", "category", "Ordinal,other", "Ordinal,category", "category,other", "Ordinal,category,other", ], ), ( ("idx", Double, {"a", "b", "numeric"}), [ "ANY", "Double", "a", "b", "numeric", "Double,a", "Double,b", "Double,numeric", "a,b", "a,numeric", "b,numeric", "a,b,numeric", "Double,a,b", "Double,a,numeric", "Double,b,numeric", "Double,a,b,numeric", ], ), ], ) def test_to_keys_method(feature_args, expected): feature = LiteFeature(*feature_args) keys = FeatureCollection.feature_to_keys(feature) assert set(keys) == set(expected) def test_feature_collection_hashing(): f1 = LiteFeature(name="f1", logical_type=Double) f2 = LiteFeature(name="f2", logical_type=Double, tags={"index"}) f3 = LiteFeature(name="f3", logical_type=Boolean, tags={"other"}) f4 = LiteFeature(name="f4", primitive=Absolute(), base_features=[f1]) f5 = LiteFeature(name="f5", primitive=AddNumeric(), base_features=[f1, f2]) fc1 = FeatureCollection([f1, f2, f3, f4, f5]) fc2 = FeatureCollection([f1, f2, f3, f4, f5]) assert len(set([fc1, fc2])) == 1 fc1.reindex() assert fc1.get_by_logical_type(Double) == set([f1, f2]) assert fc1.get_by_tag("index") == set([f2]) assert fc1.get_by_origin_feature(f1) == set([f1, f4, f5]) assert fc1.get_dependencies_by_origin_name("f1") == set([f1, f4, f5]) assert fc1.get_dependencies_by_origin_name("null") == set() assert fc1.get_by_origin_feature_name("f1") == f1 assert fc1.get_by_origin_feature_name("null") is None ================================================ FILE: 
featuretools/tests/feature_discovery/test_feature_discovery.py ================================================ from unittest.mock import patch import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import ( Boolean, BooleanNullable, Datetime, Double, NaturalLanguage, Ordinal, ) from featuretools.entityset.entityset import EntitySet from featuretools.feature_discovery.feature_discovery import ( _get_features, _get_matching_features, _index_column_set, generate_features_from_primitives, schema_to_features, ) from featuretools.feature_discovery.FeatureCollection import FeatureCollection from featuretools.feature_discovery.LiteFeature import ( LiteFeature, ) from featuretools.feature_discovery.utils import column_schema_to_keys from featuretools.primitives import ( Absolute, AddNumeric, Count, DateFirstEvent, Equal, Lag, MultiplyNumericBoolean, NumUnique, TransformPrimitive, ) from featuretools.primitives.utils import get_transform_primitives from featuretools.synthesis import dfs from featuretools.tests.testing_utils.generate_fake_dataframe import ( generate_fake_dataframe, ) DEFAULT_LT_FOR_TAG = { "category": Ordinal, "numeric": Double, "time_index": Datetime, } class MultiOutputPrimitiveForTest(TransformPrimitive): name = "test_mo" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 2 class DoublePrimitiveForTest(TransformPrimitive): name = "test_double" input_types = [ColumnSchema(logical_type=Double)] return_type = ColumnSchema(logical_type=Double) @pytest.mark.parametrize( "column_schema, expected", [ (ColumnSchema(logical_type=Double), "Double"), (ColumnSchema(semantic_tags={"index"}), "index"), ( ColumnSchema(logical_type=Double, semantic_tags={"index", "other"}), "Double,index,other", ), ], ) def test_column_schema_to_keys(column_schema, expected): actual = column_schema_to_keys(column_schema) assert set(actual) == set(expected) 
@pytest.mark.parametrize( "column_list, expected", [ ([ColumnSchema(logical_type=Boolean)], [("Boolean", 1)]), ([ColumnSchema()], [("ANY", 1)]), ( [ ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean), ], [("Boolean", 2)], ), ], ) def test_index_input_set(column_list, expected): actual = _index_column_set(column_list) assert actual == expected @pytest.mark.parametrize( "feature_args, input_set, commutative, expected", [ ( [("f1", Boolean), ("f2", Boolean), ("f3", Boolean)], [ColumnSchema(logical_type=Boolean)], False, [["f1"], ["f2"], ["f3"]], ), ( [("f1", Boolean), ("f2", Boolean)], [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)], False, [["f1", "f2"], ["f2", "f1"]], ), ( [("f1", Boolean), ("f2", Boolean)], [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)], True, [["f1", "f2"]], ), ( [("f1", Datetime, {"time_index"})], [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})], False, [["f1"]], ), ( [("f1", Double, {"other", "index"})], [ColumnSchema(logical_type=Double, semantic_tags={"index", "other"})], False, [["f1"]], ), ( [ ("f1", Double), ("f2", Boolean), ("f3", Double), ("f4", Boolean), ("f5", Double), ], [ ColumnSchema(logical_type=Double), ColumnSchema(logical_type=Double), ColumnSchema(logical_type=Boolean), ], True, [ ["f1", "f3", "f2"], ["f1", "f3", "f4"], ["f1", "f5", "f2"], ["f1", "f5", "f4"], ["f3", "f5", "f2"], ["f3", "f5", "f4"], ], ), ], ) @patch.object(LiteFeature, "_generate_hash", lambda x: x.name) def test_get_features(feature_args, input_set, commutative, expected): features = [LiteFeature(*args) for args in feature_args] feature_collection = FeatureCollection(features).reindex() column_keys = _index_column_set(input_set) actual = _get_features(feature_collection, tuple(column_keys), commutative) assert set([tuple([y.id for y in x]) for x in actual]) == set( [tuple(x) for x in expected], ) @pytest.mark.parametrize( "feature_args, primitive, expected", [ ( [("f1", 
Double), ("f2", Double), ("f3", Double)], AddNumeric, [["f1", "f2"], ["f1", "f3"], ["f2", "f3"]], ), ( [("f1", Boolean), ("f2", Boolean), ("f3", Boolean)], AddNumeric, [], ), ( [("f7", Double), ("f8", Boolean)], MultiplyNumericBoolean, [["f7", "f8"]], ), ( [("f9", Datetime)], DateFirstEvent, [], ), ( [("f10", Datetime, {"time_index"})], DateFirstEvent, [["f10"]], ), ( [("f11", Datetime, {"time_index"}), ("f12", Double)], NumUnique, [], ), ( [("f13", Datetime, {"time_index"}), ("f14", Double), ("f15", Ordinal)], NumUnique, [["f15"]], ), ( [("f16", Datetime, {"time_index"}), ("f17", Double), ("f18", Ordinal)], Equal, [["f16", "f17"], ["f16", "f18"], ["f17", "f18"]], ), ( [ ("t_idx", Datetime, {"time_index"}), ("f19", Ordinal), ("f20", Double), ("f21", Boolean), ("f22", BooleanNullable), ], Lag, [["f19", "t_idx"], ["f20", "t_idx"], ["f21", "t_idx"], ["f22", "t_idx"]], ), ( [ ("idx", Double, {"index"}), ("f23", Double), ], Count, [["idx"]], ), ( [ ("idx", Double, {"index"}), ("f23", Double), ], AddNumeric, [], ), ], ) @patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name) def test_get_matching_features(feature_args, primitive, expected): features = [LiteFeature(*args) for args in feature_args] feature_collection = FeatureCollection(features).reindex() actual = _get_matching_features(feature_collection, primitive()) assert [[y.name for y in x] for x in actual] == expected @pytest.mark.parametrize( "col_defs, primitives, expected", [ ( [ ("f_1", "Double"), ("f_2", "Double"), ("f_3", "Boolean"), ("f_4", "Double"), ], [AddNumeric], {"f_1 + f_2", "f_1 + f_4", "f_2 + f_4"}, ), ( [ ("f_1", "Double"), ("f_2", "Double"), ], [Absolute], {"ABSOLUTE(f_1)", "ABSOLUTE(f_2)"}, ), ], ) @patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name) def test_generate_features_from_primitives(col_defs, primitives, expected): input_feature_names = set([x[0] for x in col_defs]) df = generate_fake_dataframe( col_defs=col_defs, ) origin_features = 
schema_to_features(df.ww.schema) features = generate_features_from_primitives(origin_features, primitives) new_feature_names = set([x.name for x in features]) - input_feature_names assert new_feature_names == expected ALL_TRANSFORM_PRIMITIVES = list(get_transform_primitives().values()) @pytest.mark.parametrize( "col_defs, primitives", [ ( [ ("idx", "Double", {"index"}), ("t_idx", "Datetime", {"time_index"}), ("f_3", "Boolean"), ("f_4", "Boolean"), ("f_5", "BooleanNullable"), ("f_6", "BooleanNullable"), ("f_7", "Categorical"), ("f_8", "Categorical"), ("f_9", "Datetime"), ("f_10", "Datetime"), ("f_11", "Double"), ("f_12", "Double"), ("f_13", "Integer"), ("f_14", "Integer"), ("f_15", "IntegerNullable"), ("f_16", "IntegerNullable"), ("f_17", "EmailAddress"), ("f_18", "EmailAddress"), ("f_19", "LatLong"), ("f_20", "LatLong"), ("f_21", "NaturalLanguage"), ("f_22", "NaturalLanguage"), ("f_23", "Ordinal"), ("f_24", "Ordinal"), ("f_25", "URL"), ("f_26", "URL"), ("f_27", "PostalCode"), ("f_28", "PostalCode"), ], ALL_TRANSFORM_PRIMITIVES, ), ], ) @patch.object(LiteFeature, "_generate_hash", lambda x: x.name) def test_compare_dfs(col_defs, primitives): input_feature_names = set([x[0] for x in col_defs]) df = generate_fake_dataframe( col_defs=col_defs, ) es = EntitySet(id="test") es.add_dataframe(df, "df") features_old = dfs( entityset=es, target_dataframe_name="df", trans_primitives=primitives, features_only=True, return_types="all", max_depth=1, ) origin_features = schema_to_features(df.ww.schema) features = generate_features_from_primitives(origin_features, primitives) feature_names_old = set([x.get_name() for x in features_old]) - input_feature_names # type: ignore feature_names_new = set([x.name for x in features]) - input_feature_names assert feature_names_old == feature_names_new def test_generate_features_from_primitives_inputs(): f1 = LiteFeature("f1", Double) with pytest.raises( ValueError, match="input_features must be an iterable of LiteFeature objects", ): 
generate_features_from_primitives(f1, [Absolute]) with pytest.raises( ValueError, match="input_features must be an iterable of LiteFeature objects", ): generate_features_from_primitives([f1, "other"], [Absolute]) with pytest.raises( ValueError, match="primitives must be a list of Primitive classes or Primitive instances", ): generate_features_from_primitives([f1], ["absolute"]) with pytest.raises( ValueError, match="primitives must be a list of Primitive classes or Primitive instances", ): generate_features_from_primitives([f1], Absolute) ================================================ FILE: featuretools/tests/feature_discovery/test_type_defs.py ================================================ import json from unittest.mock import patch import pytest from woodwork.logical_types import Boolean, Double from featuretools.feature_discovery.feature_discovery import ( generate_features_from_primitives, schema_to_features, ) from featuretools.feature_discovery.FeatureCollection import FeatureCollection from featuretools.feature_discovery.LiteFeature import LiteFeature from featuretools.primitives import ( Absolute, AddNumeric, DivideNumeric, Lag, MultiplyNumeric, ) from featuretools.tests.feature_discovery.test_feature_discovery import ( MultiOutputPrimitiveForTest, ) from featuretools.tests.testing_utils.generate_fake_dataframe import ( generate_fake_dataframe, ) def test_feature_type_equality(): f1 = LiteFeature("f1", Double) f2 = LiteFeature("f2", Double) # Add Numeric is Commutative, so should all be equal f3 = LiteFeature( name="Column 1", primitive=AddNumeric(), logical_type=Double, base_features=[f1, f2], ) f4 = LiteFeature( name="Column 10", primitive=AddNumeric(), logical_type=Double, base_features=[f1, f2], ) f5 = LiteFeature( name="Column 20", primitive=AddNumeric(), logical_type=Double, base_features=[f2, f1], ) assert f3 == f4 == f5 # Divide Numeric is not Commutative, so should not be equal f6 = LiteFeature( name="Column 1", primitive=DivideNumeric(), 
logical_type=Double, base_features=[f1, f2], ) f7 = LiteFeature( name="Column 1", primitive=DivideNumeric(), logical_type=Double, base_features=[f2, f1], ) assert f6 != f7 def test_feature_type_assertions(): with pytest.raises( ValueError, match="there must be base features if given a primitive", ): LiteFeature( name="Column 1", primitive=AddNumeric(), logical_type=Double, ) @patch.object(LiteFeature, "_generate_hash", lambda x: x.name) @patch( "featuretools.feature_discovery.LiteFeature.hash_primitive", lambda x: (x.name, None), ) def test_feature_to_dict(): f1 = LiteFeature("f1", Double) f2 = LiteFeature("f2", Double) f = LiteFeature( name="Column 1", primitive=AddNumeric(), base_features=[f1, f2], ) expected = { "name": "Column 1", "logical_type": None, "tags": ["numeric"], "primitive": "add_numeric", "base_features": ["f1", "f2"], "df_id": None, "id": "Column 1", "related_features": [], "idx": 0, } actual = f.to_dict() json_str = json.dumps(actual) assert actual == expected assert json.dumps(expected) == json_str def test_feature_hash(): bf1 = LiteFeature("bf", Double) bf2 = LiteFeature("bf", Double, df_id="df") p1 = Lag(periods=1) p2 = Lag(periods=2) f1 = LiteFeature( primitive=p1, logical_type=Double, base_features=[bf1], ) f2 = LiteFeature( primitive=p2, logical_type=Double, base_features=[bf1], ) f3 = LiteFeature( primitive=p2, logical_type=Double, base_features=[bf1], ) f4 = LiteFeature( primitive=p1, logical_type=Double, base_features=[bf2], ) # TODO(dreed): ensure ID is parquet and arrow acceptable, length and starting character might be problematic assert f1 != f2 assert f2 == f3 assert f1 != f4 def test_feature_forced_name(): bf = LiteFeature("bf", Double) p1 = Lag(periods=1) f1 = LiteFeature( name="target_delay_1", primitive=p1, logical_type=Double, base_features=[bf], ) assert f1.name == "target_delay_1" @patch.object(LiteFeature, "_generate_hash", lambda x: x.name) @patch( "featuretools.feature_discovery.FeatureCollection.hash_primitive", lambda x: 
(x.name, None), ) @patch( "featuretools.feature_discovery.LiteFeature.hash_primitive", lambda x: (x.name, None), ) def test_feature_collection_to_dict(): f1 = LiteFeature("f1", Double) f2 = LiteFeature("f2", Double) f3 = LiteFeature( name="Column 1", primitive=AddNumeric(), base_features=[f1, f2], ) fc = FeatureCollection([f3]) expected = { "primitives": { "add_numeric": None, }, "feature_ids": ["Column 1"], "all_features": { "Column 1": { "name": "Column 1", "logical_type": None, "tags": ["numeric"], "primitive": "add_numeric", "base_features": ["f1", "f2"], "df_id": None, "id": "Column 1", "related_features": [], "idx": 0, }, "f1": { "name": "f1", "logical_type": "Double", "tags": ["numeric"], "primitive": None, "base_features": [], "df_id": None, "id": "f1", "related_features": [], "idx": 0, }, "f2": { "name": "f2", "logical_type": "Double", "tags": ["numeric"], "primitive": None, "base_features": [], "df_id": None, "id": "f2", "related_features": [], "idx": 0, }, }, } actual = fc.to_dict() assert actual == expected assert json.dumps(expected, sort_keys=True) == json.dumps(actual, sort_keys=True) @patch.object(LiteFeature, "_generate_hash", lambda x: x.name) def test_feature_collection_from_dict(): f1 = LiteFeature("f1", Double) f2 = LiteFeature("f2", Double) f3 = LiteFeature( name="Column 1", primitive=AddNumeric(), base_features=[f1, f2], ) expected = FeatureCollection([f3]) input_dict = { "primitives": { "009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3": { "type": "AddNumeric", "module": "featuretools.primitives.standard.transform.binary.add_numeric", "arguments": {}, }, }, "feature_ids": ["Column 1"], "all_features": { "f2": { "name": "f2", "logical_type": "Double", "tags": ["numeric"], "primitive": None, "base_features": [], "df_id": None, "id": "f2", "related_features": [], "idx": 0, }, "f1": { "name": "f1", "logical_type": "Double", "tags": ["numeric"], "primitive": None, "base_features": [], "df_id": None, "id": "f1", "related_features": 
[], "idx": 0, }, "Column 1": { "name": "Column 1", "logical_type": None, "tags": ["numeric"], "primitive": "009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3", "base_features": ["f1", "f2"], "df_id": None, "id": "Column 1", "related_features": [], "idx": 0, }, }, } actual = FeatureCollection.from_dict(input_dict) assert actual == expected @patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name) def test_feature_collection_serialization_roundtrip(): col_defs = [ ("idx", "Integer", {"index"}), ("t_idx", "Datetime", {"time_index"}), ("f_1", "Double"), ("f_2", "Double"), ("f_3", "Categorical"), ("f_4", "Boolean"), ("f_5", "NaturalLanguage"), ] df = generate_fake_dataframe( col_defs=col_defs, ) origin_features = schema_to_features(df.ww.schema) features = generate_features_from_primitives( origin_features, [Absolute, MultiplyNumeric, MultiOutputPrimitiveForTest], ) features = generate_features_from_primitives(features, [Lag]) assert set([x.name for x in features]) == set( [ "idx", "t_idx", "f_1", "f_2", "f_3", "f_4", "f_5", "ABSOLUTE(f_1)", "ABSOLUTE(f_2)", "f_1 * f_2", "TEST_MO(f_5)[0]", "TEST_MO(f_5)[1]", "LAG(f_1, t_idx)", "LAG(f_2, t_idx)", "LAG(f_3, t_idx)", "LAG(f_4, t_idx)", "LAG(ABSOLUTE(f_1), t_idx)", "LAG(ABSOLUTE(f_2), t_idx)", "LAG(f_1 * f_2, t_idx)", "LAG(TEST_MO(f_5)[1], t_idx)", "LAG(TEST_MO(f_5)[0], t_idx)", ], ) fc = FeatureCollection(features=features) fc_dict = fc.to_dict() fc_json = json.dumps(fc_dict) fc2_dict = json.loads(fc_json) fc2 = FeatureCollection.from_dict(fc2_dict) assert fc == fc2 lsa_features = [x for x in fc2.all_features if x.get_primitive_name() == "test_mo"] assert len(lsa_features[0].related_features) == 1 def test_lite_feature_assertions(): f1 = LiteFeature(name="f1", logical_type=Double) f2 = LiteFeature(name="f1", logical_type=Double, df_id="df1") assert f1 != f2 with pytest.raises( TypeError, match="Name must be given if origin feature", ): LiteFeature(logical_type=Double) with pytest.raises( TypeError, 
match="Logical Type must be given if origin feature", ): LiteFeature(name="f1") with pytest.raises( ValueError, match="primitive input must be of type PrimitiveBase", ): LiteFeature(name="f3", primitive="AddNumeric", base_features=[f1, f2]) f = LiteFeature("f4", logical_type=Double) with pytest.raises(AttributeError, match="name is immutable"): f.name = "new name" with pytest.raises(ValueError, match="only used on multioutput features"): f.non_indexed_name with pytest.raises(AttributeError, match="logical_type is immutable"): f.logical_type = Boolean with pytest.raises(AttributeError, match="tags is immutable"): f.tags = {"other"} with pytest.raises(AttributeError, match="primitive is immutable"): f.primitive = AddNumeric with pytest.raises(AttributeError, match="base_features are immutable"): f.base_features = [f1] with pytest.raises(AttributeError, match="df_id is immutable"): f.df_id = "df_id" with pytest.raises(AttributeError, match="id is immutable"): f.id = "id" with pytest.raises(AttributeError, match="n_output_features is immutable"): f.n_output_features = "n_output_features" with pytest.raises(AttributeError, match="depth is immutable"): f.depth = "depth" with pytest.raises(AttributeError, match="idx is immutable"): f.idx = "idx" def test_lite_feature_to_column_schema(): f1 = LiteFeature(name="f1", logical_type=Double, tags={"index", "numeric"}) column_schema = f1.column_schema assert column_schema.is_numeric assert isinstance(column_schema.logical_type, Double) assert column_schema.semantic_tags == {"index", "numeric"} f2 = LiteFeature(name="f2", primitive=Absolute(), base_features=[f1]) column_schema = f2.column_schema assert column_schema.semantic_tags == {"numeric"} def test_lite_feature_to_dependent_primitives(): f1 = LiteFeature(name="f1", logical_type=Double) f2 = LiteFeature(name="f2", primitive=Absolute(), base_features=[f1]) f3 = LiteFeature(name="f3", primitive=AddNumeric(), base_features=[f1, f2]) f4 = LiteFeature(name="f4", 
primitive=MultiplyNumeric(), base_features=[f1, f3]) assert set([x.name for x in f4.dependent_primitives()]) == set( ["multiply_numeric", "absolute", "add_numeric"], ) ================================================ FILE: featuretools/tests/primitive_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_agg_primitives.py ================================================ from datetime import datetime from math import sqrt import numpy as np import pandas as pd import pytest from pandas.core.dtypes.dtypes import CategoricalDtype from pytest import raises from featuretools.primitives import ( AverageCountPerUnique, DateFirstEvent, Entropy, FirstLastTimeDelta, HasNoDuplicates, IsMonotonicallyDecreasing, IsMonotonicallyIncreasing, Kurtosis, MaxCount, MaxMinDelta, MedianCount, MinCount, NMostCommon, NMostCommonFrequency, NumFalseSinceLastTrue, NumPeaks, NumTrueSinceLastFalse, NumZeroCrossings, NUniqueDays, NUniqueDaysOfCalendarYear, NUniqueDaysOfMonth, NUniqueMonths, NUniqueWeeks, PercentTrue, Trend, Variance, get_aggregation_primitives, ) from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, check_serialize, find_applicable_primitives, valid_dfs, ) def test_nmostcommon_categorical(): n_most = NMostCommon(3) expected = pd.Series([1.0, 2.0, np.nan]) ints = pd.Series([1, 2, 1, 1]).astype("int64") assert pd.Series(n_most(ints)).equals(expected) cats = pd.Series([1, 2, 1, 1]).astype("category") assert pd.Series(n_most(cats)).equals(expected) # Value counts includes data for categories that are not present in data. 
# Make sure these counts are not included in most common outputs extra_dtype = CategoricalDtype(categories=[1, 2, 3]) cats_extra = pd.Series([1, 2, 1, 1]).astype(extra_dtype) assert pd.Series(n_most(cats_extra)).equals(expected) def test_agg_primitives_can_init_without_params(): agg_primitives = get_aggregation_primitives().values() for agg_primitive in agg_primitives: agg_primitive() def test_trend_works_with_different_input_dtypes(): dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]) numeric = pd.Series([1, 2, 3]) trend = Trend() dtypes = ["float64", "int64", "Int64"] for dtype in dtypes: actual = trend(numeric.astype(dtype), dates) assert np.isclose(actual, 1) def test_percent_true_boolean(): booleans = pd.Series([True, False, True, pd.NA], dtype="boolean") pct_true = PercentTrue() assert pct_true(booleans) == 0.5 class TestAverageCountPerUnique(PrimitiveTestBase): primitive = AverageCountPerUnique array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8]) def test_percent_unique(self): primitive_func = AverageCountPerUnique().get_function() assert primitive_func(self.array) == 1.25 def test_nans(self): primitive_func = AverageCountPerUnique().get_function() array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])]) assert primitive_func(array_nans) == 1.25 primitive_func = AverageCountPerUnique(skipna=False).get_function() array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])]) assert primitive_func(array_nans) == (11 / 9.0) def test_empty_string(self): primitive_func = AverageCountPerUnique().get_function() array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, "", ""])]) assert primitive_func(array_empty_string) == (4 / 3.0) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestVariance(PrimitiveTestBase): primitive = Variance def
test_regular(self): variance = self.primitive().get_function() np.testing.assert_almost_equal(variance(np.array([0, 3, 4, 3])), 2.25) def test_single(self): variance = self.primitive().get_function() np.testing.assert_almost_equal(variance(np.array([4])), 0) def test_double(self): variance = self.primitive().get_function() np.testing.assert_almost_equal(variance(np.array([3, 4])), 0.25) def test_empty(self): variance = self.primitive().get_function() np.testing.assert_almost_equal(variance(np.array([])), np.nan) def test_nan(self): variance = self.primitive().get_function() np.testing.assert_almost_equal( variance(pd.Series([0, np.nan, 4, 3])), 2.8888888888888893, ) def test_allnan(self): variance = self.primitive().get_function() np.testing.assert_almost_equal( variance(pd.Series([np.nan, np.nan, np.nan])), np.nan, ) class TestFirstLastTimeDelta(PrimitiveTestBase): primitive = FirstLastTimeDelta times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) actual_delta = (times.iloc[-1] - times.iloc[0]).total_seconds() def test_first_last_time_delta(self): primitive_func = self.primitive().get_function() assert primitive_func(self.times) == self.actual_delta def test_with_nans(self): primitive_func = self.primitive().get_function() times = pd.concat([self.times, pd.Series([np.nan])]) assert primitive_func(times) == self.actual_delta assert pd.isna(primitive_func(pd.Series([np.nan]))) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestEntropy(PrimitiveTestBase): primitive = Entropy @pytest.mark.parametrize( "dtype", ["category", "object", "string"], ) def test_regular(self, dtype): data = pd.Series([1, 2, 3, 2], dtype=dtype) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert np.isclose(given_answer, 1.03, atol=0.01) 
@pytest.mark.parametrize( "dtype", ["category", "object", "string"], ) def test_empty(self, dtype): data = pd.Series([], dtype=dtype) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == 0.0 @pytest.mark.parametrize( "dtype", ["category", "object", "string"], ) def test_args(self, dtype): data = pd.Series([1, 2, 3, 2], dtype=dtype) if dtype == "string": data = pd.concat([data, pd.Series([pd.NA, pd.NA], dtype=dtype)]) else: data = pd.concat([data, pd.Series([np.nan, np.nan], dtype=dtype)]) primitive_func = self.primitive(dropna=True, base=2).get_function() given_answer = primitive_func(data) assert np.isclose(given_answer, 1.5, atol=0.001) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive, max_depth=2) class TestKurtosis(PrimitiveTestBase): primitive = Kurtosis @pytest.mark.parametrize( "dtype", ["int64", "float64"], ) def test_regular(self, dtype): data = pd.Series([1, 2, 3, 4, 5], dtype=dtype) answer = -1.3 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert np.isclose(answer, given_answer, atol=0.01) data = pd.Series([1, 2, 3, 4, 5, 6], dtype=dtype) answer = -1.26 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert np.isclose(answer, given_answer, atol=0.01) data = pd.Series([x * x for x in list(range(100))], dtype=dtype) answer = -0.85 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert np.isclose(answer, given_answer, atol=0.01) if dtype == "float64": # Series contains floating point values - only check with float dtype data = pd.Series([sqrt(x) for x in list(range(100))], dtype=dtype) answer = -0.46 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert 
np.isclose(answer, given_answer, atol=0.01) def test_nan(self): data = pd.Series([np.nan, 5, 3], dtype="float64") primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert pd.isna(given_answer) @pytest.mark.parametrize( "dtype", ["int64", "float64"], ) def test_empty(self, dtype): data = pd.Series([], dtype=dtype) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert pd.isna(given_answer) def test_inf(self): data = pd.Series([1, np.inf], dtype="float64") primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert pd.isna(given_answer) data = pd.Series([np.NINF, 1, np.inf], dtype="float64") primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert pd.isna(given_answer) def test_arg(self): data = pd.Series([1, 2, 3, 4, 5, np.nan, np.nan], dtype="float64") answer = -1.3 primitive_func = self.primitive(nan_policy="omit").get_function() given_answer = primitive_func(data) assert answer == given_answer primitive_func = self.primitive(nan_policy="propagate").get_function() given_answer = primitive_func(data) assert np.isnan(given_answer) primitive_func = self.primitive(nan_policy="raise").get_function() with raises(ValueError): primitive_func(data) def test_error(self): with raises(ValueError): self.primitive(nan_policy="invalid_policy").get_function() def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNumZeroCrossings(PrimitiveTestBase): primitive = NumZeroCrossings def test_nan(self): data = pd.Series([3, np.nan, 5, 3, np.nan, 0, np.nan, 0, np.nan, -2]) # crossing from 0 to np.nan to -2, which is 1 crossing answer = 1 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer def 
test_empty(self): data = pd.Series([], dtype="int64") answer = 0 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer def test_inf(self): data = pd.Series([-1, np.inf]) answer = 1 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer data = pd.Series([np.NINF, 1, np.inf]) answer = 1 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer def test_zeros(self): data = pd.Series([1, 0, -1, 0, 1, 0, -1]) answer = 3 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer data = pd.Series([1, 0, 1, 0, 1]) answer = 0 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer def test_regular(self): data = pd.Series([1, 2, 3, 4, 5]) answer = 0 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer data = pd.Series([1, -1, 2, -2, 3, -3]) answer = 5 primitive_func = self.primitive().get_function() given_answer = primitive_func(data) assert given_answer == answer def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNumTrueSinceLastFalse(PrimitiveTestBase): primitive = NumTrueSinceLastFalse def test_regular(self): primitive_func = self.primitive().get_function() bools = pd.Series([False, True, False, True, True]) answer = primitive_func(bools) correct_answer = 2 assert answer == correct_answer def test_regular_end_in_false(self): primitive_func = self.primitive().get_function() bools = pd.Series([False, True, False, True, True, False]) answer = primitive_func(bools) correct_answer = 0 assert answer == correct_answer def test_no_false(self): 
primitive_func = self.primitive().get_function() bools = pd.Series([True] * 5) assert pd.isna(primitive_func(bools)) def test_all_false(self): primitive_func = self.primitive().get_function() bools = pd.Series([False, False, False]) answer = primitive_func(bools) correct_answer = 0 assert answer == correct_answer def test_nan(self): primitive_func = self.primitive().get_function() bools = pd.Series([False, True, np.nan, True, True]) answer = primitive_func(bools) correct_answer = 3 assert answer == correct_answer def test_all_nan(self): primitive_func = self.primitive().get_function() bools = pd.Series([np.nan, np.nan, np.nan]) assert pd.isna(primitive_func(bools)) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNumFalseSinceLastTrue(PrimitiveTestBase): primitive = NumFalseSinceLastTrue def test_regular(self): primitive_func = self.primitive().get_function() bools = pd.Series([True, False, True, False, False]) answer = primitive_func(bools) correct_answer = 2 assert answer == correct_answer def test_regular_end_in_true(self): primitive_func = self.primitive().get_function() bools = pd.Series([True, False, True, False, False, True]) answer = primitive_func(bools) correct_answer = 0 assert answer == correct_answer def test_no_true(self): primitive_func = self.primitive().get_function() bools = pd.Series([False] * 5) assert pd.isna(primitive_func(bools)) def test_all_true(self): primitive_func = self.primitive().get_function() bools = pd.Series([True, True, True]) answer = primitive_func(bools) correct_answer = 0 assert answer == correct_answer def test_nan(self): primitive_func = self.primitive().get_function() bools = pd.Series([True, False, np.nan, False, False]) answer = primitive_func(bools) correct_answer = 3 assert answer == correct_answer def test_all_nan(self): 
primitive_func = self.primitive().get_function() bools = pd.Series([np.nan, np.nan, np.nan]) assert pd.isna(primitive_func(bools)) def test_numeric_and_string_input(self): primitive_func = self.primitive().get_function() bools = pd.Series([True, 0, 1, "10", ""]) answer = primitive_func(bools) correct_answer = 1 assert answer == correct_answer def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNumPeaks(PrimitiveTestBase): primitive = NumPeaks @pytest.mark.parametrize( "dtype", ["int64", "float64", "Int64"], ) def test_negative_and_positive_nums(self, dtype): get_peaks = self.primitive().get_function() assert ( get_peaks(pd.Series([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0], dtype=dtype)) == 4 ) @pytest.mark.parametrize( "dtype", ["int64", "float64", "Int64"], ) def test_plateau(self, dtype): get_peaks = self.primitive().get_function() assert get_peaks(pd.Series([1, 2, 3, 3, 3, 3, 3, 2, 1], dtype=dtype)) == 1 assert get_peaks(pd.Series([1, 2, 3, 3, 3, 4, 3, 3, 3, 2, 1], dtype=dtype)) == 1 assert ( get_peaks( pd.Series( [ 5, 4, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 3, 3, 3, 3, 4, ], dtype=dtype, ), ) == 1 ) assert ( get_peaks( pd.Series( [ 1, 2, 3, 3, 3, 2, 1, 2, 3, 3, 3, 2, 5, 5, 5, 2, ], dtype=dtype, ), ) == 3 ) @pytest.mark.parametrize( "dtype", ["int64", "float64", "Int64"], ) def test_regular(self, dtype): get_peaks = self.primitive().get_function() assert get_peaks(pd.Series([1, 7, 3, 8, 2, 3, 4, 3, 4, 2, 4], dtype=dtype)) == 4 assert get_peaks(pd.Series([1, 2, 3, 2, 1], dtype=dtype)) == 1 @pytest.mark.parametrize( "dtype", ["int64", "float64", "Int64"], ) def test_no_peak(self, dtype): get_peaks = self.primitive().get_function() assert get_peaks(pd.Series([1, 2, 3], dtype=dtype)) == 0 assert get_peaks(pd.Series([3, 2, 2, 2, 2, 1], dtype=dtype)) == 0 @pytest.mark.parametrize( 
"dtype", ["int64", "float64", "Int64"], ) def test_too_small_data(self, dtype): get_peaks = self.primitive().get_function() assert get_peaks(pd.Series([], dtype=dtype)) == 0 assert get_peaks(pd.Series([1])) == 0 assert get_peaks(pd.Series([1, 1])) == 0 assert get_peaks(pd.Series([1, 2])) == 0 assert get_peaks(pd.Series([2, 1])) == 0 @pytest.mark.parametrize( "dtype", ["int64", "float64", "Int64"], ) def test_nans(self, dtype): get_peaks = self.primitive().get_function() array = pd.Series( [ 0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14, ], dtype=dtype, ) if dtype == "float64": array = pd.concat([array, pd.Series([np.nan, np.nan])]) elif dtype == "Int64": array = pd.concat([array, pd.Series([pd.NA, pd.NA])]) array = array.astype(dtype=dtype) assert get_peaks(array) == 3 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestDateFirstEvent(PrimitiveTestBase): primitive = DateFirstEvent def test_regular(self): primitive_func = self.primitive().get_function() case = pd.Series( [ "2011-04-09 10:30:00", "2011-04-09 10:30:06", "2011-04-09 10:30:12", "2011-04-09 10:30:18", ], dtype="datetime64[ns]", ) answer = pd.Timestamp("2011-04-09 10:30:00") given_answer = primitive_func(case) assert given_answer == answer def test_nat(self): primitive_func = self.primitive().get_function() case = pd.Series( [ pd.NaT, pd.NaT, "2011-04-09 10:30:12", "2011-04-09 10:30:18", ], dtype="datetime64[ns]", ) answer = pd.Timestamp("2011-04-09 10:30:12") given_answer = primitive_func(case) assert given_answer == answer def test_empty(self): primitive_func = self.primitive().get_function() case = pd.Series([], dtype="datetime64[ns]") given_answer = primitive_func(case) assert pd.isna(given_answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) 
primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) def test_serialize(self, es): check_serialize(self.primitive, es, target_dataframe_name="sessions") class TestMinCount(PrimitiveTestBase): primitive = MinCount def test_nan(self): data = pd.Series([np.nan, np.nan, np.nan]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert pd.isna(answer) def test_inf(self): data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 1 def test_regular(self): data = pd.Series([1, 2, 2, 2, 3, 4, 4, 4, 5]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 1 data = pd.Series([2, 2, 2, 3, 4, 4, 4]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 3 def test_skipna(self): data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5]) primitive_func = self.primitive(skipna=False).get_function() answer = primitive_func(data) assert pd.isna(answer) def test_ninf(self): data = pd.Series([np.NINF, np.NINF, np.nan]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 2 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestMaxCount(PrimitiveTestBase): primitive = MaxCount def test_nan(self): data = pd.Series([np.nan, np.nan, np.nan]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert pd.isna(answer) def test_inf(self): data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 3 def test_regular(self): data = pd.Series([1, 1, 2, 3, 4, 4, 4, 5]) primitive_func 
= self.primitive().get_function() answer = primitive_func(data) assert answer == 1 data = pd.Series([1, 1, 2, 3, 4, 4, 4]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 3 def test_skipna(self): data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5]) primitive_func = self.primitive(skipna=False).get_function() answer = primitive_func(data) assert pd.isna(answer) def test_ninf(self): data = pd.Series([np.NINF, np.NINF, np.nan]) primitive_func = self.primitive().get_function() answer = primitive_func(data) assert answer == 2 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestMaxMinDelta(PrimitiveTestBase): primitive = MaxMinDelta array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8]) def test_max_min_delta(self): primitive_func = self.primitive().get_function() assert primitive_func(self.array) == 7.0 def test_nans(self): primitive_func = self.primitive().get_function() array_nans = pd.concat([self.array, pd.Series([np.nan])]) assert primitive_func(array_nans) == 7.0 primitive_func = self.primitive(skipna=False).get_function() array_nans = pd.concat([self.array, pd.Series([np.nan])]) assert pd.isna(primitive_func(array_nans)) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestMedianCount(PrimitiveTestBase): primitive = MedianCount def test_regular(self): primitive_func = self.primitive().get_function() case = pd.Series([1, 3, 5, 7]) given_answer = primitive_func(case) assert given_answer == 0 def test_nans(self): primitive_func = self.primitive().get_function() case = pd.Series([1, 3, 4, 4, 4, 5, 7, np.nan, np.nan]) given_answer = 
primitive_func(case) assert given_answer == 3 primitive_func = self.primitive(skipna=False).get_function() given_answer = primitive_func(case) assert pd.isna(given_answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNMostCommonFrequency(PrimitiveTestBase): primitive = NMostCommonFrequency def test_regular(self): test_cases = [ pd.Series([8, 7, 10, 10, 10, 3, 4, 5, 10, 8, 7]), pd.Series([7, 7, 7, 6, 6, 5, 4]), pd.Series([4, 5, 6, 6, 7, 7, 7]), ] answers = [ pd.Series([4, 2, 2]), pd.Series([3, 2, 1]), pd.Series([3, 2, 1]), ] primitive_func = self.primitive(3).get_function() for case, answer in zip(test_cases, answers): given_answer = primitive_func(case) given_answer = given_answer.reset_index(drop=True) assert given_answer.equals(answer) def test_n_larger_than_len(self): test_cases = [ pd.Series(["red", "red", "blue", "green"]), pd.Series(["red", "red", "red", "blue", "green"]), pd.Series(["red", "blue", "green", "orange"]), ] answers = [ pd.Series([2, 1, 1, np.nan, np.nan]), pd.Series([3, 1, 1, np.nan, np.nan]), pd.Series([1, 1, 1, 1, np.nan]), ] primitive_func = self.primitive(5).get_function() for case, answer in zip(test_cases, answers): given_answer = primitive_func(case) given_answer = given_answer.reset_index(drop=True) assert given_answer.equals(answer) def test_skipna(self): array = pd.Series(["red", "red", "blue", "green", np.nan, np.nan]) primitive_func = self.primitive(5, skipna=False).get_function() given_answer = primitive_func(array) given_answer = given_answer.reset_index(drop=True) answer = pd.Series([2, 2, 1, 1, np.nan]) assert given_answer.equals(answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) aggregation.append(self.primitive(5)) valid_dfs( es, aggregation, transform, self.primitive, 
target_dataframe_name="customers", multi_output=True, ) def test_with_featuretools_args(self, es): transform, aggregation = find_applicable_primitives(self.primitive) aggregation.append(self.primitive(5, skipna=False)) valid_dfs( es, aggregation, transform, self.primitive, target_dataframe_name="customers", multi_output=True, ) def test_serialize(self, es): check_serialize( primitive=self.primitive, es=es, target_dataframe_name="customers", ) class TestNUniqueDays(PrimitiveTestBase): primitive = NUniqueDays def test_two_years(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2011-12-31")) assert primitive_func(array) == 365 * 2 def test_leap_year(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2016-01-01", "2017-12-31")) assert primitive_func(array) == 365 * 2 + 1 def test_ten_years(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2019-12-31")) assert primitive_func(array) == 365 * 10 + 1 + 1 def test_distinct_dt(self): primitive_func = self.primitive().get_function() array = pd.Series( [ datetime(2019, 2, 21), datetime(2019, 2, 1, 1, 20, 0), datetime(2019, 2, 1, 1, 30, 0), datetime(2018, 2, 1), datetime(2019, 1, 1), ], ) assert primitive_func(array) == 4 def test_NaT(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2011-12-31")) NaT_array = pd.Series([pd.NaT] * 100) assert primitive_func(pd.concat([array, NaT_array])) == 365 * 2 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNUniqueDaysOfCalendarYear(PrimitiveTestBase): primitive = NUniqueDaysOfCalendarYear def test_two_years(self): primitive_func = self.primitive().get_function() array = 
pd.Series(pd.date_range("2010-01-01", "2011-12-31")) assert primitive_func(array) == 365 def test_leap_year(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2016-01-01", "2017-12-31")) assert primitive_func(array) == 366 def test_ten_years(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2019-12-31")) assert primitive_func(array) == 366 def test_distinct_dt(self): primitive_func = self.primitive().get_function() array = pd.Series( [ datetime(2019, 2, 21), datetime(2019, 2, 1, 1, 20, 0), datetime(2019, 2, 1, 1, 30, 0), datetime(2018, 2, 1), datetime(2019, 1, 1), ], ) assert primitive_func(array) == 3 def test_NaT(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2011-12-31")) NaT_array = pd.Series([pd.NaT] * 100) assert primitive_func(pd.concat([array, NaT_array])) == 365 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNUniqueDaysOfMonth(PrimitiveTestBase): primitive = NUniqueDaysOfMonth def test_two_days(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2010-01-02")) assert primitive_func(array) == 2 def test_one_year(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2010-12-31")) assert primitive_func(array) == 31 def test_leap_year(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2016-01-01", "2017-12-31")) assert primitive_func(array) == 31 def test_distinct_dt(self): primitive_func = self.primitive().get_function() array = pd.Series( [ datetime(2019, 2, 21), datetime(2019, 2, 1, 1, 20, 0), datetime(2019, 2, 1, 1, 30, 0), datetime(2018, 2, 1), datetime(2019, 1, 1), ], ) assert 
primitive_func(array) == 2 def test_NaT(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2010-12-31")) NaT_array = pd.Series([pd.NaT] * 100) assert primitive_func(pd.concat([array, NaT_array])) == 31 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNUniqueMonths(PrimitiveTestBase): primitive = NUniqueMonths def test_two_days(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2010-01-02")) assert primitive_func(array) == 1 def test_ten_years(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2019-12-31")) assert primitive_func(array) == 12 * 10 def test_distinct_dt(self): primitive_func = self.primitive().get_function() array = pd.Series( [ datetime(2019, 2, 21), datetime(2019, 2, 1, 1, 20, 0), datetime(2019, 2, 1, 1, 30, 0), datetime(2018, 2, 1), datetime(2019, 1, 1), ], ) assert primitive_func(array) == 3 def test_NaT(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2010-01-01", "2011-12-31")) NaT_array = pd.Series([pd.NaT] * 100) assert primitive_func(pd.concat([array, NaT_array])) == 12 * 2 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNUniqueWeeks(PrimitiveTestBase): primitive = NUniqueWeeks def test_same_week(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2019-01-01", "2019-01-02")) assert primitive_func(array) == 1 def test_ten_years(self): primitive_func = self.primitive().get_function() array = 
pd.Series(pd.date_range("2010-01-01", "2019-12-31")) assert primitive_func(array) == 523 def test_distinct_dt(self): primitive_func = self.primitive().get_function() array = pd.Series( [ datetime(2019, 2, 21), datetime(2019, 2, 1, 1, 20, 0), datetime(2019, 2, 1, 1, 30, 0), datetime(2018, 2, 2), datetime(2019, 2, 3, 1, 30, 0), datetime(2019, 1, 1), ], ) assert primitive_func(array) == 4 def test_NaT(self): primitive_func = self.primitive().get_function() array = pd.Series(pd.date_range("2019-01-01", "2019-01-02")) NaT_array = pd.Series([pd.NaT] * 100) assert primitive_func(pd.concat([array, NaT_array])) == 1 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() aggregation.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestHasNoDuplicates(PrimitiveTestBase): primitive = HasNoDuplicates def test_regular(self): primitive_func = self.primitive().get_function() data = pd.Series([1, 1, 2]) assert not primitive_func(data) assert isinstance(primitive_func(data), bool) data = pd.Series([1, 2, 3]) assert primitive_func(data) assert isinstance(primitive_func(data), bool) data = pd.Series([1, 2, 4]) assert primitive_func(data) assert isinstance(primitive_func(data), bool) data = pd.Series(["red", "blue", "orange"]) assert primitive_func(data) assert isinstance(primitive_func(data), bool) data = pd.Series(["red", "blue", "red"]) assert not primitive_func(data) def test_nan(self): primitive_func = self.primitive().get_function() data = pd.Series([np.nan, 1, 2, 3]) assert primitive_func(data) assert isinstance(primitive_func(data), bool) data = pd.Series([np.nan, np.nan, 1]) # drop both nans, so has 1 value assert primitive_func(data) is True assert isinstance(primitive_func(data), bool) primitive_func = self.primitive(skipna=False).get_function() data = pd.Series([np.nan, np.nan, 1]) assert primitive_func(data) is False assert 
isinstance(primitive_func(data), bool) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instantiate = self.primitive() aggregation.append(primitive_instantiate) valid_dfs( es, aggregation, transform, self.primitive, target_dataframe_name="customers", instance_ids=[0, 1, 2], ) class TestIsMonotonicallyDecreasing(PrimitiveTestBase): primitive = IsMonotonicallyDecreasing def test_monotonically_decreasing(self): primitive_func = self.primitive().get_function() case = pd.Series([9, 5, 3, 1, -1]) assert primitive_func(case) is True def test_monotonically_increasing(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, 5, 9]) assert primitive_func(case) is False def test_non_monotonic(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, 2, 5]) assert primitive_func(case) is False def test_weakly_decreasing(self): primitive_func = self.primitive().get_function() case = pd.Series([9, 3, 3, 1, -1]) assert primitive_func(case) is True def test_nan(self): primitive_func = self.primitive().get_function() case = pd.Series([9, 5, 3, np.nan, 1, -1]) assert primitive_func(case) is True primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, np.nan, 5, 9]) assert primitive_func(case) is False def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instantiate = self.primitive() aggregation.append(primitive_instantiate) valid_dfs(es, aggregation, transform, self.primitive) class TestIsMonotonicallyIncreasing(PrimitiveTestBase): primitive = IsMonotonicallyIncreasing def test_monotonically_increasing(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, 5, 9]) assert primitive_func(case) is True def test_monotonically_decreasing(self): primitive_func = self.primitive().get_function() case = pd.Series([9, 5, 3, 1, -1]) assert primitive_func(case) is 
False def test_non_monotonic(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, 2, 5]) assert primitive_func(case) is False def test_weakly_increasing(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, 3, 9]) assert primitive_func(case) is True def test_nan(self): primitive_func = self.primitive().get_function() case = pd.Series([-1, 1, 3, np.nan, 5, 9]) assert primitive_func(case) is True primitive_func = self.primitive().get_function() case = pd.Series([9, 5, 3, np.nan, 1, -1]) assert primitive_func(case) is False def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instantiate = self.primitive() aggregation.append(primitive_instantiate) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_count_aggregation_primitives.py ================================================ import numpy as np import pandas as pd from pytest import raises from featuretools.primitives import ( CountAboveMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange, ) from featuretools.tests.primitive_tests.utils import PrimitiveTestBase class TestCountAboveMean(PrimitiveTestBase): primitive = CountAboveMean def test_regular(self): data = pd.Series([1, 2, 3, 4, 5]) expected = 2 primitive_func = self.primitive().get_function() actual = primitive_func(data) assert expected == actual data = pd.Series([1, 2, 3.1, 4, 5]) expected = 3 primitive_func = self.primitive().get_function() actual = primitive_func(data) assert expected == actual def test_nan_without_ignore_nan(self): data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan]) expected = np.nan primitive_func = self.primitive(skipna=False).get_function() actual = primitive_func(data) assert np.isnan(actual) == np.isnan(expected) data = 
pd.Series([np.nan]) primitive_func = self.primitive(skipna=False).get_function() actual = primitive_func(data) assert np.isnan(actual) == np.isnan(expected) def test_nan_with_ignore_nan(self): data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan]) expected = 2 primitive_func = self.primitive(skipna=True).get_function() actual = primitive_func(data) assert expected == actual data = pd.Series([np.nan, 1, 2, 3.1, 4, 5, np.nan, np.nan]) expected = 3 primitive_func = self.primitive(skipna=True).get_function() actual = primitive_func(data) assert expected == actual data = pd.Series([np.nan]) expected = np.nan primitive_func = self.primitive(skipna=True).get_function() actual = primitive_func(data) assert np.isnan(actual) == np.isnan(expected) def test_inf(self): data = pd.Series([np.NINF, 1, 2, 3, 4, 5]) expected = 5 primitive_func = self.primitive().get_function() actual = primitive_func(data) assert expected == actual data = pd.Series([1, 2, 3, 4, 5, np.inf]) expected = 0 primitive_func = self.primitive().get_function() actual = primitive_func(data) assert expected == actual data = pd.Series([np.NINF, 1, 2, 3, 4, 5, np.inf]) expected = np.nan primitive_func = self.primitive().get_function() actual = primitive_func(data) assert np.isnan(actual) == np.isnan(expected) primitive_func = self.primitive(skipna=False).get_function() actual = primitive_func(data) assert np.isnan(actual) == np.isnan(expected) class TestCountGreaterThan(PrimitiveTestBase): primitive = CountGreaterThan def compare_results(self, data, thresholds, results): for threshold, result in zip(thresholds, results): primitive = self.primitive(threshold=threshold) function = primitive.get_function() assert function(data) == result assert isinstance(function(data), np.int64) def test_regular(self): data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]) thresholds = pd.Series([-5, -2, 0, 2, 5]) results = pd.Series([10, 7, 5, 3, 0]) self.compare_results(data, thresholds, results) def test_edges(self): data 
= pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]) thresholds = pd.Series([np.inf, np.NINF, None, np.nan]) results = pd.Series([0, len(data), 0, 0]) self.compare_results(data, thresholds, results) def test_nans(self): data = pd.Series([-5, -4, -3, np.inf, np.NINF, np.nan, 1, 2, 3, 4, 5]) thresholds = pd.Series([np.inf, np.NINF, None, 0, np.nan]) results = pd.Series([0, 9, 0, 6, 0]) self.compare_results(data, thresholds, results) class TestCountInsideNthSTD: primitive = CountInsideNthSTD def test_normal_distribution(self): x = pd.Series( [ -76.0, 41.0, -43.0, -152.0, -89.0, 28.0, 49.0, 298.0, -132.0, 146.0, -107.0, -26.0, 26.0, -81.0, 116.0, -217.0, -102.0, 144.0, 120.0, -130.0, ], ) first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0] primitive_instance = self.primitive(1) primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(x) - len(first_outliers) second_outliers = [298.0] primitive_instance = self.primitive(2) primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(x) - len(second_outliers) def test_poisson_distribution(self): x = pd.Series( [ 1, 1, 3, 3, 0, 0, 1, 3, 3, 1, 2, 3, 2, 0, 1, 3, 2, 1, 0, 2, ], ) first_outliers = [3, 3, 0, 0, 3, 3, 3, 0, 3, 0] primitive_instance = self.primitive(1) primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(x) - len(first_outliers) second_outliers = [] primitive_instance = self.primitive(2) primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(x) - len(second_outliers) def test_nan(self): # test if function ignores nan values x = pd.Series( [ -76.0, 41.0, -43.0, -152.0, -89.0, 28.0, 49.0, 298.0, -132.0, 146.0, -107.0, -26.0, 26.0, -81.0, 116.0, -217.0, -102.0, 144.0, 120.0, -130.0, ], ) x = pd.concat([x, pd.Series([np.nan] * 20)]) first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0] primitive_instance = self.primitive(1) primitive_func = primitive_instance.get_function() assert 
primitive_func(x) == len(x) - len(first_outliers) - 20 # test a series with all nan values x = pd.Series([np.nan] * 20) primitive_instance = self.primitive(1) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 0 def test_negative_n(self): with raises(ValueError): self.primitive(-1) class TestCountInsideRange(PrimitiveTestBase): primitive = CountInsideRange def test_integer_range(self): # all integers from -100 to 100 x = pd.Series(np.arange(-100, 101, 1)) primitive_instance = self.primitive(-100, 100) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 201 primitive_instance = self.primitive(-50, 50) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 101 primitive_instance = self.primitive(1, 1) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 1 def test_float_range(self): x = pd.Series(np.linspace(-3, 3, 10)) primitive_instance = self.primitive(-3, 3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 10 primitive_instance = self.primitive(-0.34, 1.68) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 4 primitive_instance = self.primitive(-3, -3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 1 def test_nan(self): x = pd.Series(np.linspace(-3, 3, 10)) x = pd.concat([x, pd.Series([np.nan] * 20)]) primitive_instance = self.primitive(-0.34, 1.68) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 4 primitive_instance = self.primitive(-3, 3, False) primitive_func = primitive_instance.get_function() assert np.isnan(primitive_func(x)) def test_inf(self): x = pd.Series(np.linspace(-3, 3, 10)) num_NINF = 20 x = pd.concat([x, pd.Series([np.NINF] * num_NINF)]) num_inf = 10 x = pd.concat([x, pd.Series([np.inf] * num_inf)]) primitive_instance = self.primitive(-3, 3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 10 
primitive_instance = self.primitive(np.NINF, 3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 10 + num_NINF primitive_instance = self.primitive(-3, np.inf) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 10 + num_inf class TestCountLessThan(PrimitiveTestBase): primitive = CountLessThan def compare_answers(self, data, thresholds, answers): for threshold, answer in zip(thresholds, answers): primitive = self.primitive(threshold=threshold) function = primitive.get_function() assert function(data) == answer assert isinstance(function(data), np.int64) def test_regular(self): data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]) thresholds = pd.Series([-5, -2, 0, 2, 5]) answers = pd.Series([0, 3, 5, 7, 10]) self.compare_answers(data, thresholds, answers) def test_edges(self): data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]) thresholds = pd.Series([np.inf, np.NINF, None, np.nan]) answers = pd.Series([len(data), 0, 0, 0]) self.compare_answers(data, thresholds, answers) def test_nans(self): data = pd.Series([-5, -4, -3, np.inf, np.NINF, np.nan, 1, 2, 3, 4, 5]) thresholds = pd.Series([np.inf, np.NINF, None, 0, np.nan]) answers = pd.Series([9, 0, 0, 4, 0]) self.compare_answers(data, thresholds, answers) class TestCountOutsideNthSTD(PrimitiveTestBase): primitive = CountOutsideNthSTD def test_normal_distribution(self): x = pd.Series( [ 10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437, 483, 529, 363, 711, 543, ], ) outliers = [10, 20, 763] primitive_instance = self.primitive(2) primitive_func = 
        primitive_instance.get_function()
        assert primitive_func(x) == len(outliers)

    def test_poisson_distribution(self):
        x = pd.Series(
            [1, 1, 3, 3, 0, 0, 1, 3, 3, 1, 2, 3, 2, 0, 1, 3, 2, 1, 0, 2],
        )
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10
        primitive_instance = self.primitive(2)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

    def test_nan(self):
        # test if function ignores nan values
        x = pd.Series(
            [
                -76.0, 41.0, -43.0, -152.0, -89.0, 28.0, 49.0, 298.0, -132.0, 146.0,
                -107.0, -26.0, 26.0, -81.0, 116.0, -217.0, -102.0, 144.0, 120.0, -130.0,
            ],
        )
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 7

        # test a series with all nan values
        x = pd.Series([np.nan] * 20)
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

    def test_negative_n(self):
        with raises(ValueError):
            self.primitive(-1)


class TestCountOutsideRange(PrimitiveTestBase):
    primitive = CountOutsideRange

    def test_integer_range(self):
        # all integers from -100 to 100
        x = pd.Series(np.arange(-100, 101, 1))
        primitive_instance = CountOutsideRange(-100, 100)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0
        primitive_instance = CountOutsideRange(-50, 50)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 100
        primitive_instance = CountOutsideRange(1, 1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - 1

    def test_float_range(self):
        x = pd.Series(np.linspace(-3, 3, 10))
        primitive_instance = CountOutsideRange(-3, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0
        primitive_instance = CountOutsideRange(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 6
        primitive_instance
= CountOutsideRange(-3, -3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 9 def test_nan(self): x = pd.Series(np.linspace(-3, 3, 10)) x = pd.concat([x, pd.Series([np.nan] * 20)]) primitive_instance = CountOutsideRange(-0.34, 1.68) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 6 primitive_instance = CountOutsideRange(-3, 3, False) primitive_func = primitive_instance.get_function() assert np.isnan(primitive_func(x)) def test_inf(self): x = pd.Series(np.linspace(-3, 3, 10)) num_NINF = 20 x = pd.concat([x, pd.Series([np.NINF] * num_NINF)]) num_inf = 10 x = pd.concat([x, pd.Series([np.inf] * num_inf)]) primitive_instance = CountOutsideRange(-3, 3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == num_inf + num_NINF primitive_instance = CountOutsideRange(-0.34, 1.68) primitive_func = primitive_instance.get_function() assert primitive_func(x) == 6 + num_inf + num_NINF primitive_instance = CountOutsideRange(np.NINF, 3) primitive_func = primitive_instance.get_function() assert primitive_func(x) == num_inf primitive_instance = CountOutsideRange(-3, np.inf) primitive_func = primitive_instance.get_function() assert primitive_func(x) == num_NINF ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_max_consecutive.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import ( MaxConsecutiveFalse, MaxConsecutiveNegatives, MaxConsecutivePositives, MaxConsecutiveTrue, MaxConsecutiveZeros, ) class TestMaxConsecutiveFalse: def test_regular(self): primitive_instance = MaxConsecutiveFalse() primitive_func = primitive_instance.get_function() array = pd.Series([False, False, False, True, True, False, True], dtype="bool") assert primitive_func(array) == 3 def test_all_true(self): primitive_instance = MaxConsecutiveFalse() primitive_func = 
primitive_instance.get_function() array = pd.Series([True, True, True, True], dtype="bool") assert primitive_func(array) == 0 def test_all_false(self): primitive_instance = MaxConsecutiveFalse() primitive_func = primitive_instance.get_function() array = pd.Series([False, False, False], dtype="bool") assert primitive_func(array) == 3 class TestMaxConsecutiveTrue: def test_regular(self): primitive_instance = MaxConsecutiveTrue() primitive_func = primitive_instance.get_function() array = pd.Series([True, False, True, True, True, False, True], dtype="bool") assert primitive_func(array) == 3 def test_all_true(self): primitive_instance = MaxConsecutiveTrue() primitive_func = primitive_instance.get_function() array = pd.Series([True, True, True, True], dtype="bool") assert primitive_func(array) == 4 def test_all_false(self): primitive_instance = MaxConsecutiveTrue() primitive_func = primitive_instance.get_function() array = pd.Series([False, False, False], dtype="bool") assert primitive_func(array) == 0 @pytest.mark.parametrize("dtype", ["float64", "int64"]) class TestMaxConsecutiveNegatives: def test_regular(self, dtype): if dtype == "int64": pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutiveNegatives() primitive_func = primitive_instance.get_function() array = pd.Series([1.3, -3.4, -1, -4, 10, -1.7, -4.9], dtype=dtype) assert primitive_func(array) == 3 def test_all_int(self, dtype): primitive_instance = MaxConsecutiveNegatives() primitive_func = primitive_instance.get_function() array = pd.Series([1, -1, 2, 4, -5], dtype=dtype) assert primitive_func(array) == 1 def test_all_float(self, dtype): if dtype == "int64": pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutiveNegatives() primitive_func = primitive_instance.get_function() array = pd.Series([1.0, -1.0, -2.0, 0.0, 5.0], dtype=dtype) assert primitive_func(array) == 2 def test_with_nan(self, dtype): if dtype 
== "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveNegatives() primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, -2, -3], dtype=dtype) assert primitive_func(array) == 2 def test_with_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveNegatives(skipna=False) primitive_func = primitive_instance.get_function() array = pd.Series([-1, np.nan, -2, -3], dtype=dtype) assert primitive_func(array) == 2 def test_all_nan(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveNegatives() primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) def test_all_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveNegatives(skipna=True) primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) @pytest.mark.parametrize("dtype", ["float64", "int64"]) class TestMaxConsecutivePositives: def test_regular(self, dtype): if dtype == "int64": pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutivePositives() primitive_func = primitive_instance.get_function() array = pd.Series([1.3, -3.4, 1, 4, 10, -1.7, -4.9], dtype=dtype) assert primitive_func(array) == 3 def test_all_int(self, dtype): primitive_instance = MaxConsecutivePositives() primitive_func = primitive_instance.get_function() array = pd.Series([1, -1, 2, 4, -5], dtype=dtype) assert primitive_func(array) == 2 def test_all_float(self, dtype): if dtype == "int64": pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutivePositives() primitive_func = 
primitive_instance.get_function() array = pd.Series([1.0, -1.0, 2.0, 4.0, 5.0], dtype=dtype) assert primitive_func(array) == 3 def test_with_nan(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutivePositives() primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, 2, -3], dtype=dtype) assert primitive_func(array) == 2 def test_with_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutivePositives(skipna=False) primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, 2, -3], dtype=dtype) assert primitive_func(array) == 1 def test_all_nan(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutivePositives() primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) def test_all_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutivePositives(skipna=True) primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) @pytest.mark.parametrize("dtype", ["float64", "int64"]) class TestMaxConsecutiveZeros: def test_regular(self, dtype): if dtype == "int64": pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutiveZeros() primitive_func = primitive_instance.get_function() array = pd.Series([1.3, -3.4, 0, 0, 0.0, 1.7, -4.9], dtype=dtype) assert primitive_func(array) == 3 def test_all_int(self, dtype): primitive_instance = MaxConsecutiveZeros() primitive_func = primitive_instance.get_function() array = pd.Series([1, -1, 0, 0, -5], dtype=dtype) assert primitive_func(array) == 2 def test_all_float(self, dtype): if dtype == "int64": 
pytest.skip("test array contains floats which are not supported int64") primitive_instance = MaxConsecutiveZeros() primitive_func = primitive_instance.get_function() array = pd.Series([1.0, 0.0, 0.0, 0.0, -5.3], dtype=dtype) assert primitive_func(array) == 3 def test_with_nan(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveZeros() primitive_func = primitive_instance.get_function() array = pd.Series([0, np.nan, 0, -3], dtype=dtype) assert primitive_func(array) == 2 def test_with_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveZeros(skipna=False) primitive_func = primitive_instance.get_function() array = pd.Series([0, np.nan, 0, -3], dtype=dtype) assert primitive_func(array) == 1 def test_all_nan(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveZeros() primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) def test_all_nan_skipna(self, dtype): if dtype == "int64": pytest.skip("nans not supported in int64") primitive_instance = MaxConsecutiveZeros(skipna=True) primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype) assert np.isnan(primitive_func(array)) ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_num_consecutive.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumConsecutiveGreaterMean, NumConsecutiveLessMean class TestNumConsecutiveGreaterMean: primitive = NumConsecutiveGreaterMean def test_continuous_range(self): x = pd.Series(range(10)) longest_sequence = [5, 6, 7, 8, 9] primitive_instance = self.primitive() primitive_func = 
primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_in_middle(self): x = pd.Series( [ 0.6, 0.18, 1.11, -0.19, 0.25, -1.41, 0.54, 0.29, -1.59, 1.67, 1.19, 0.44, 2.39, -1.38, 0.15, -1.16, 1.54, -0.34, -1.41, 0.58, ], ) longest_sequence = [1.67, 1.19, 0.44, 2.39] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_at_start(self): x = pd.Series( [ 1.67, 1.19, 0.44, 2.39, -0.19, 0.6, 0.18, 1.11, 0.25, -1.41, 0.54, 0.29, -1.59, -1.38, 0.15, -1.16, 1.54, -0.34, -1.41, 0.58, ], ) longest_sequence = [1.67, 1.19, 0.44, 2.39] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_at_end(self): x = pd.Series( [ 0.6, 0.18, 1.11, -0.19, 0.25, -1.41, 0.54, 0.29, -1.59, -1.38, 0.15, -1.16, 1.54, -0.34, 0.58, -1.41, 1.67, 1.19, 0.44, 2.39, ], ) longest_sequence = [1.67, 1.19, 0.44, 2.39] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_nan(self): x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.nan] * 20)]) longest_sequence = [5, 6, 7, 8, 9] # test ignoring NaN values primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) # test skipna=False primitive_instance = self.primitive(skipna=False) primitive_func = primitive_instance.get_function() assert np.isnan(primitive_func(x)) def test_inf(self): primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.inf])]) assert primitive_func(x) == 0 x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.NINF])]) assert primitive_func(x) == 10 x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.NINF, 
np.inf, np.inf])]) assert np.isnan(primitive_func(x)) class TestNumConsecutiveLessMean: primitive = NumConsecutiveLessMean def test_continuous_range(self): x = pd.Series(range(10)) longest_sequence = [0, 1, 2, 3, 4] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_in_middle(self): x = pd.Series( [ 0.6, 0.18, 1.11, -0.19, 0.25, -1.41, 0.54, 0.29, -1.59, 1.67, 1.19, 0.44, 2.39, -1.38, 0.15, -1.16, 1.54, -0.34, -1.41, 0.58, ], ) longest_sequence = [-1.38, 0.15, -1.16] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_at_start(self): x = pd.Series( [ -1.38, 0.15, -1.16, 0.6, 0.18, 1.11, -0.19, 0.25, -1.41, 0.54, 0.29, -1.59, 1.67, 1.19, 0.44, 2.39, 1.54, -0.34, -1.41, 0.58, ], ) longest_sequence = [-1.38, 0.15, -1.16] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_subsequence_at_end(self): x = pd.Series( [ 0.6, 0.18, 1.11, -0.19, 0.25, -1.41, 0.54, 0.29, -1.59, 1.67, 1.19, 0.44, 2.39, 1.54, -0.34, -1.41, 0.58, -1.38, 0.15, -1.16, ], ) longest_sequence = [-1.38, 0.15, -1.16] primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) def test_nan(self): x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.nan] * 20)]) longest_sequence = [0, 1, 2, 3, 4] # test ignoring NaN values primitive_instance = self.primitive() primitive_func = primitive_instance.get_function() assert primitive_func(x) == len(longest_sequence) # test skipna=False primitive_instance = self.primitive(skipna=False) primitive_func = primitive_instance.get_function() assert np.isnan(primitive_func(x)) def test_inf(self): primitive_instance = self.primitive() primitive_func = 
primitive_instance.get_function() x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.inf])]) assert primitive_func(x) == 10 x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.NINF])]) assert primitive_func(x) == 0 x = pd.Series(range(10)) x = pd.concat([x, pd.Series([np.NINF, np.inf, np.inf])]) assert np.isnan(primitive_func(x)) ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_percent_true.py ================================================ import pandas as pd from woodwork.logical_types import BooleanNullable import featuretools as ft def test_percent_true_default_value_with_dfs(): es = ft.EntitySet(id="customer_data") customers_df = pd.DataFrame(data={"customer_id": [1, 2]}) transactions_df = pd.DataFrame( data={"tx_id": [1], "customer_id": [1], "is_foo": [True]}, ) es.add_dataframe( dataframe_name="customers_df", dataframe=customers_df, index="customer_id", ) es.add_dataframe( dataframe_name="transactions_df", dataframe=transactions_df, index="tx_id", logical_types={"is_foo": BooleanNullable}, ) es = es.add_relationship( "customers_df", "customer_id", "transactions_df", "customer_id", ) feature_matrix, _ = ft.dfs( entityset=es, target_dataframe_name="customers_df", agg_primitives=["percent_true"], ) assert pd.isna(feature_matrix["PERCENT_TRUE(transactions_df.is_foo)"][2]) ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_rolling_primitive.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import ( RollingCount, RollingMax, RollingMean, RollingMin, RollingOutlierCount, RollingSTD, RollingTrend, ) from featuretools.primitives.standard.transform.time_series.utils import ( apply_rolling_agg_to_series, ) from featuretools.tests.primitive_tests.utils import get_number_from_offset @pytest.mark.parametrize( "window_length, gap", [ 
(5, 2), (5, 0), ("5d", "7d"), ("5d", "0d"), ], ) @pytest.mark.parametrize("min_periods", [1, 0, 2, 5]) def test_rolling_max(min_periods, window_length, gap, window_series): gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) # Since we're using a uniform series we can check correctness using numeric parameters expected_vals = apply_rolling_agg_to_series( window_series, lambda x: x.max(), window_length_num, gap=gap_num, min_periods=min_periods, ) primitive_instance = RollingMax( window_length=window_length, gap=gap, min_periods=min_periods, ) primitive_func = primitive_instance.get_function() actual_vals = pd.Series( primitive_func(window_series.index, pd.Series(window_series.values)), ) # Since min_periods of 0 is the same as min_periods of 1 num_nans_from_min_periods = min_periods or 1 assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1 pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals) @pytest.mark.parametrize( "window_length, gap", [ (5, 2), (5, 0), ("5d", "7d"), ("5d", "0d"), ], ) @pytest.mark.parametrize("min_periods", [1, 0, 2, 5]) def test_rolling_min(min_periods, window_length, gap, window_series): gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) # Since we're using a uniform series we can check correctness using numeric parameters expected_vals = apply_rolling_agg_to_series( window_series, lambda x: x.min(), window_length_num, gap=gap_num, min_periods=min_periods, ) primitive_instance = RollingMin( window_length=window_length, gap=gap, min_periods=min_periods, ) primitive_func = primitive_instance.get_function() actual_vals = pd.Series( primitive_func(window_series.index, pd.Series(window_series.values)), ) # Since min_periods of 0 is the same as min_periods of 1 num_nans_from_min_periods = min_periods or 1 assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1 
pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals) @pytest.mark.parametrize( "window_length, gap", [ (5, 2), (5, 0), ("5d", "7d"), ("5d", "0d"), ], ) @pytest.mark.parametrize("min_periods", [1, 0, 2, 5]) def test_rolling_mean(min_periods, window_length, gap, window_series): gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) # Since we're using a uniform series we can check correctness using numeric parameters expected_vals = apply_rolling_agg_to_series( window_series, np.mean, window_length_num, gap=gap_num, min_periods=min_periods, ) primitive_instance = RollingMean( window_length=window_length, gap=gap, min_periods=min_periods, ) primitive_func = primitive_instance.get_function() actual_vals = pd.Series( primitive_func(window_series.index, pd.Series(window_series.values)), ) # Since min_periods of 0 is the same as min_periods of 1 num_nans_from_min_periods = min_periods or 1 assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1 pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals) @pytest.mark.parametrize( "window_length, gap", [ (5, 2), (5, 0), ("5d", "7d"), ("5d", "0d"), ], ) @pytest.mark.parametrize("min_periods", [1, 0, 2, 5]) def test_rolling_std(min_periods, window_length, gap, window_series): gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) # Since we're using a uniform series we can check correctness using numeric parameters expected_vals = apply_rolling_agg_to_series( window_series, lambda x: x.std(), window_length_num, gap=gap_num, min_periods=min_periods, ) primitive_instance = RollingSTD( window_length=window_length, gap=gap, min_periods=min_periods, ) primitive_func = primitive_instance.get_function() actual_vals = pd.Series( primitive_func(window_series.index, pd.Series(window_series.values)), ) # Since min_periods of 0 is the same as min_periods of 1 num_nans_from_min_periods = min_periods or 2 if 
    min_periods in [0, 1]:
        # the additional nan is because std pandas function returns NaN if there's only one value
        num_nans = gap_num + 1
    else:
        num_nans = gap_num + num_nans_from_min_periods - 1

    # The extra 1 at the beginning is because the std pandas function returns NaN if there's only one value
    assert actual_vals.isna().sum() == num_nans
    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        ("6d", "7d"),
    ],
)
def test_rolling_count(window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    expected_vals = apply_rolling_agg_to_series(
        window_series,
        lambda x: x.count(),
        window_length_num,
        gap=gap_num,
    )

    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=window_length_num,
    )
    primitive_func = primitive_instance.get_function()
    actual_vals = pd.Series(primitive_func(window_series.index))

    num_nans = gap_num + window_length_num - 1
    assert actual_vals.isna().sum() == num_nans
    # RollingCount will not match the exact roll_series_with_gap call,
    # because it handles the min_periods difference within the primitive
    pd.testing.assert_series_equal(
        pd.Series(expected_vals).iloc[num_nans:],
        actual_vals.iloc[num_nans:],
    )


@pytest.mark.parametrize(
    "min_periods, expected_num_nans",
    [(0, 2), (1, 2), (3, 4), (5, 6)],  # 0 and 1 get treated the same
)
@pytest.mark.parametrize("window_length, gap", [("5d", "2d"), (5, 2)])
def test_rolling_count_primitive_min_periods_nans(
    window_length,
    gap,
    min_periods,
    expected_num_nans,
    window_series,
):
    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()
    vals = pd.Series(primitive_func(window_series.index))

    assert vals.isna().sum() == expected_num_nans


@pytest.mark.parametrize(
    "min_periods, expected_num_nans",
    [(0, 0), (1, 0), (3, 2), (5, 4)],  # 0 and 1 get treated the same
)
@pytest.mark.parametrize("window_length, gap", [("5d", "0d"), (5, 0)])
def test_rolling_count_with_no_gap(
    window_length,
    gap,
    min_periods,
    expected_num_nans,
    window_series,
):
    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()
    vals = pd.Series(primitive_func(window_series.index))

    assert vals.isna().sum() == expected_num_nans


@pytest.mark.parametrize(
    "window_length, gap, expected_vals",
    [
        (3, 0, [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
        (
            4,
            1,
            [np.nan, np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        ),
        (
            "5d",
            "7d",
            [
                np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ],
        ),
        (
            "5d",
            "0d",
            [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        ),
    ],
)
def test_rolling_trend(window_length, gap, expected_vals, window_series):
    primitive_instance = RollingTrend(window_length=window_length, gap=gap)

    actual_vals = primitive_instance(window_series.index, window_series.values)

    pd.testing.assert_series_equal(pd.Series(expected_vals), pd.Series(actual_vals))


def test_rolling_trend_window_length_less_than_three(window_series):
    primitive_instance = RollingTrend(window_length=2)

    vals = primitive_instance(window_series.index, window_series.values)

    for v in vals:
        assert np.isnan(v)


@pytest.mark.parametrize(
    "primitive",
    [
        RollingCount,
        RollingMax,
        RollingMin,
        RollingMean,
        RollingOutlierCount,
    ],
)
def test_rolling_primitives_non_uniform(primitive):
    # When the data isn't uniform, this impacts the number of values in each rolling window
    datetimes = (
        list(pd.date_range(start="2017-01-01", freq="1d", periods=3))
        + list(pd.date_range(start="2017-01-10", freq="2d", periods=4))
        + list(pd.date_range(start="2017-01-22", freq="1d", periods=7))
    )
    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)

    # Should match RollingCount exactly and have same nan
values as other primitives expected_series = pd.Series( [None, 1, 2] + [None, 1, 1, 1] + [None, 1, 2, 3, 3, 3, 3], ) primitive_instance = primitive(window_length="3d", gap="1d") if isinstance(primitive_instance, RollingCount): rolled_series = pd.Series(primitive_instance(no_freq_series.index)) pd.testing.assert_series_equal(rolled_series, expected_series) else: rolled_series = pd.Series( primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)), ) pd.testing.assert_series_equal(expected_series.isna(), rolled_series.isna()) def test_rolling_std_non_uniform(): # When the data isn't uniform, this impacts the number of values in each rolling window datetimes = ( list(pd.date_range(start="2017-01-01", freq="1d", periods=3)) + list(pd.date_range(start="2017-01-10", freq="2d", periods=4)) + list(pd.date_range(start="2017-01-22", freq="1d", periods=7)) ) no_freq_series = pd.Series(range(len(datetimes)), index=datetimes) # There will be at least two null values at the beginning of each range's rows, the first for the # row skipped by the gap, and the second because pandas' std returns NaN if there's only one row expected_series = pd.Series( [None, None, 0.707107] + [None, None, None, None] + [ # Because the freq was 2 days, there will never be more than 1 observation None, None, 0.707107, 1.0, 1.0, 1.0, 1.0, ], ) primitive_instance = RollingSTD(window_length="3d", gap="1d") rolled_series = pd.Series( primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)), ) pd.testing.assert_series_equal(rolled_series, expected_series) def test_rolling_trend_non_uniform(): datetimes = ( list(pd.date_range(start="2017-01-01", freq="1d", periods=3)) + list(pd.date_range(start="2017-01-10", freq="2d", periods=4)) + list(pd.date_range(start="2017-01-22", freq="1d", periods=7)) ) no_freq_series = pd.Series(range(len(datetimes)), index=datetimes) expected_series = pd.Series( [None, None, None] + [None, None, None, None] + [ None, None, None, 1.0, 1.0, 1.0, 
1.0, ], ) primitive_instance = RollingTrend(window_length="3d", gap="1d") rolled_series = pd.Series( primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)), ) pd.testing.assert_series_equal(rolled_series, expected_series) @pytest.mark.parametrize( "window_length, gap", [ (5, 2), (5, 0), ("5d", "7d"), ("5d", "0d"), ], ) @pytest.mark.parametrize( "min_periods", [1, 0, 2, 5], ) def test_rolling_outlier_count( min_periods, window_length, gap, rolling_outlier_series, ): primitive_instance = RollingOutlierCount( window_length=window_length, gap=gap, min_periods=min_periods, ) primitive_func = primitive_instance.get_function() actual_vals = pd.Series( primitive_func( rolling_outlier_series.index, pd.Series(rolling_outlier_series.values), ), ) expected_vals = apply_rolling_agg_to_series( series=rolling_outlier_series, agg_func=primitive_instance.get_outliers_count, window_length=window_length, gap=gap, min_periods=min_periods, ) # Since min_periods of 0 is the same as min_periods of 1 num_nans_from_min_periods = min_periods or 1 assert ( actual_vals.isna().sum() == get_number_from_offset(gap) + num_nans_from_min_periods - 1 ) pd.testing.assert_series_equal(actual_vals, pd.Series(data=expected_vals)) ================================================ FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_time_since.py ================================================ from datetime import datetime from math import isnan import numpy as np import pandas as pd from featuretools.primitives import ( TimeSinceLastFalse, TimeSinceLastMax, TimeSinceLastMin, TimeSinceLastTrue, ) class TestTimeSinceLastFalse: primitive = TimeSinceLastFalse cutoff_time = datetime(2011, 4, 9, 11, 31, 27) times = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)], ) booleans = pd.Series([True] * 5 + [False] * 4) def test_booleans(self): primitive_func = self.primitive().get_function() answer = 
self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27) assert ( primitive_func( self.times, self.booleans, time=self.cutoff_time, ) == answer.total_seconds() ) def test_booleans_reversed(self): primitive_func = self.primitive().get_function() answer = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 18) reversed_booleans = pd.Series(self.booleans.values[::-1]) assert ( primitive_func( self.times, reversed_booleans, time=self.cutoff_time, ) == answer.total_seconds() ) def test_no_false(self): primitive_func = self.primitive().get_function() times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) booleans = pd.Series([True] * 5) assert isnan(primitive_func(times, booleans, time=self.cutoff_time)) def test_nans(self): primitive_func = self.primitive().get_function() times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])]) booleans = pd.concat( [self.booleans.copy(), pd.Series([np.nan], dtype="boolean")], ) times = times.reset_index(drop=True) booleans = booleans.reset_index(drop=True) answer = self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27) assert ( primitive_func( times, booleans, time=self.cutoff_time, ) == answer.total_seconds() ) def test_empty(self): primitive_func = self.primitive().get_function() times = pd.Series([], dtype="datetime64[ns]") booleans = pd.Series([], dtype="boolean") times = times.reset_index(drop=True) answer = primitive_func( times, booleans, time=self.cutoff_time, ) assert pd.isna(answer) class TestTimeSinceLastMax: primitive = TimeSinceLastMax cutoff_time = datetime(2011, 4, 9, 11, 31, 27) times = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)], ) numerics = pd.Series([0, 1, 2, 8, 2, 5, 1, 3, 7]) actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 18) actual_seconds = actual_time_since.total_seconds() def test_primitive_func_1(self): primitive_func = self.primitive().get_function() assert ( primitive_func( self.times, 
self.numerics, time=self.cutoff_time, ) == self.actual_seconds ) def test_no_max(self): primitive_func = self.primitive().get_function() times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) numerics = pd.Series([0] * 5) actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0) actual_seconds = actual_time_since.total_seconds() assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds def test_nans(self): primitive_func = self.primitive().get_function() times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])]) numerics = pd.concat( [self.numerics.copy(), pd.Series([np.nan], dtype="float64")], ) times = times.reset_index(drop=True) numerics = numerics.reset_index(drop=True) assert ( primitive_func( times, numerics, time=self.cutoff_time, ) == self.actual_seconds ) class TestTimeSinceLastMin: primitive = TimeSinceLastMin cutoff_time = datetime(2011, 4, 9, 11, 31, 27) times = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)], ) numerics = pd.Series([1, 0, 2, 8, 2, 5, 1, 3, 7]) actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 6) actual_seconds = actual_time_since.total_seconds() def test_primitive_func_1(self): primitive_func = self.primitive().get_function() assert ( primitive_func( self.times, self.numerics, time=self.cutoff_time, ) == self.actual_seconds ) def test_no_min(self): primitive_func = self.primitive().get_function() times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) numerics = pd.Series([0] * 5) actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0) actual_seconds = actual_time_since.total_seconds() assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds def test_nans(self): primitive_func = self.primitive().get_function() times = pd.concat( [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype="datetime64[ns]")], ) numerics = pd.concat( 
[self.numerics.copy(), pd.Series([np.nan, np.nan], dtype="float64")], ) times = times.reset_index(drop=True) numerics = numerics.reset_index(drop=True) assert ( primitive_func( times, numerics, time=self.cutoff_time, ) == self.actual_seconds ) class TestTimeSinceLastTrue: primitive = TimeSinceLastTrue cutoff_time = datetime(2011, 4, 9, 11, 31, 27) times = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)] + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)], ) booleans = pd.Series([True] * 5 + [False] * 4) actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 24) actual_seconds = actual_time_since.total_seconds() def test_primitive_func_1(self): primitive_func = self.primitive().get_function() assert ( primitive_func( self.times, self.booleans, time=self.cutoff_time, ) == self.actual_seconds ) def test_no_true(self): primitive_func = self.primitive().get_function() times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) booleans = pd.Series([False] * 5) assert isnan(primitive_func(times, booleans, time=self.cutoff_time)) def test_nans(self): primitive_func = self.primitive().get_function() times = pd.concat( [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype="datetime64[ns]")], ) booleans = pd.concat( [self.booleans.copy(), pd.Series([np.nan], dtype="boolean")], ) times = times.reset_index(drop=True) booleans = booleans.reset_index(drop=True) assert ( primitive_func( times, booleans, time=self.cutoff_time, ) == self.actual_seconds ) def test_no_cutofftime(self): primitive_func = self.primitive().get_function() times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]) booleans = pd.Series([False] * 5) assert isnan(primitive_func(times, booleans)) def test_empty(self): primitive_func = self.primitive().get_function() times = pd.Series([], dtype="datetime64[ns]") booleans = pd.Series([], dtype="boolean") times = times.reset_index(drop=True) answer = primitive_func( times, booleans, 
time=self.cutoff_time, ) assert pd.isna(answer) ================================================ FILE: featuretools/tests/primitive_tests/bad_primitive_files/__init__.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/bad_primitive_files/multiple_primitives.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives import AggregationPrimitive class CustomMax(AggregationPrimitive): name = "custom_max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) class CustomSum(AggregationPrimitive): name = "custom_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) ================================================ FILE: featuretools/tests/primitive_tests/bad_primitive_files/no_primitives.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_count_string.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import CountString from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestCountString(PrimitiveTestBase): primitive = CountString def compare(self, primitive_initiated, test_cases, answers): primitive_func = primitive_initiated.get_function() primitive_answers = primitive_func(test_cases) return np.testing.assert_array_equal(answers, primitive_answers) test_cases = pd.Series( [ # Ignore case "Hello other words hello hEllo HELLO", # ignore non alphanumeric 
"he\\{ll\t\n\t.--?o othe/r words hello hello h.el./lo", # match whole word "hellohellohello other hello word go hello here 9hello hello9", # all combined # hello/ counts as hello being its own word # since * and / are non-word characters # but 9 is a "word character" so 9hello9 # does not count as hello being its own word "helloHellohello 9Hello 9hello9 *hello/ test'hel..lo' 'hE.l.lO' \ hello", ], ) def test_non_regex_with_no_other_parameters(self): primitive = self.primitive( "hello", ignore_case=False, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) answers = [1, 2, 7, 5] self.compare(primitive, self.test_cases, answers) def test_non_regex_ignore_case(self): primitive1 = self.primitive( "hello", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) primitive2 = self.primitive( "HeLLo", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) answers = [4, 2, 7, 7] self.compare(primitive1, self.test_cases, answers) self.compare(primitive2, self.test_cases, answers) def test_non_regex_ignore_non_alphanumeric(self): primitive = self.primitive( "hello", ignore_case=False, ignore_non_alphanumeric=True, is_regex=False, match_whole_words_only=False, ) answers = [1, 4, 7, 6] self.compare(primitive, self.test_cases, answers) def test_non_regex_match_whole_words_only(self): primitive = self.primitive( "hello", ignore_case=False, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=True, ) answers = [1, 2, 2, 2] self.compare(primitive, self.test_cases, answers) def test_non_regex_with_all_others_parameters(self): primitive = self.primitive( "hello", ignore_case=True, ignore_non_alphanumeric=True, is_regex=False, match_whole_words_only=True, ) answers = [4, 4, 2, 3] self.compare(primitive, self.test_cases, answers) def test_regex_with_no_other_parameters(self): primitive = self.primitive( "h.l.o", ignore_case=False, 
ignore_non_alphanumeric=False, is_regex=True, match_whole_words_only=False, ) answers = [2, 2, 7, 5] self.compare(primitive, self.test_cases, answers) def test_regex_with_ignore_case(self): primitive = self.primitive( "h.l.o", ignore_case=True, ignore_non_alphanumeric=False, is_regex=True, match_whole_words_only=False, ) answers = [4, 2, 7, 7] self.compare(primitive, self.test_cases, answers) def test_regex_with_ignore_non_alphanumeric(self): primitive = self.primitive( "h.l.o", ignore_case=False, ignore_non_alphanumeric=True, is_regex=True, match_whole_words_only=False, ) answers = [2, 4, 7, 6] self.compare(primitive, self.test_cases, answers) def test_regex_with_match_whole_words_only(self): primitive = self.primitive( "h.l.o", ignore_case=False, ignore_non_alphanumeric=False, is_regex=True, match_whole_words_only=True, ) answers = [2, 2, 2, 2] self.compare(primitive, self.test_cases, answers) def test_regex_with_all_other_parameters(self): primitive = self.primitive( "h.l.o", ignore_case=True, ignore_non_alphanumeric=True, is_regex=True, match_whole_words_only=True, ) answers = [4, 4, 2, 3] self.compare(primitive, self.test_cases, answers) def test_overlapping_regex(self): primitive = self.primitive( "(?=(a.*a))", ignore_case=True, ignore_non_alphanumeric=True, is_regex=True, match_whole_words_only=False, ) test_cases = pd.Series(["aaaaaaaaaa", "atesta aa aa a"]) answers = [9, 6] self.compare(primitive, test_cases, answers) def test_the(self): primitive = self.primitive( "the", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) test_cases = pd.Series(["The fox jumped over the cat", "The there then"]) answers = [2, 3] self.compare(primitive, test_cases, answers) def test_nan(self): primitive = self.primitive( "the", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) test_cases = pd.Series( [np.nan, None, pd.NA, "The fox jumped over the cat", "The there then"], ) answers = 
[np.nan, np.nan, np.nan, 2, 3] self.compare(primitive, test_cases, answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive( "the", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) def test_with_featuretools_nan(self, es): log_df = es["log"] comments = log_df["comments"] comments[1] = pd.NA comments[2] = np.nan comments[3] = None log_df["comments"] = comments es.replace_dataframe(dataframe_name="log", df=log_df) transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive( "the", ignore_case=True, ignore_non_alphanumeric=False, is_regex=False, match_whole_words_only=False, ) transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_mean_characters_per_word.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import MeanCharactersPerWord from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestMeanCharactersPerWord(PrimitiveTestBase): primitive = MeanCharactersPerWord def test_sentences(self): x = pd.Series( [ "This is a test file", "This is second line", "third line $1,000", "and subsequent lines", "and more", ], ) primitive_func = self.primitive().get_function() answers = pd.Series([3.0, 4.0, 5.0, 6.0, 3.5]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_punctuation(self): x = pd.Series( [ "This: is a test file", "This, is second line?", "third/line $1,000;", "and--subsequen't lines...", "*and, more..", ], ) primitive_func = 
self.primitive().get_function() answers = pd.Series([3.0, 4.0, 8.0, 10.5, 4.0]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_multiline(self): x = pd.Series( [ "This is a test file", "This is second line\nthird line $1000;\nand subsequent lines", "and more", ], ) primitive_func = self.primitive().get_function() answers = pd.Series([3.0, 4.8, 3.5]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) @pytest.mark.parametrize( "na_value", [None, np.nan, pd.NA], ) def test_nans(self, na_value): x = pd.Series([na_value, "", "third line"]) primitive_func = self.primitive().get_function() answers = pd.Series([np.nan, 0, 4.5]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) @pytest.mark.parametrize( "na_value", [None, np.nan, pd.NA], ) def test_all_nans(self, na_value): x = pd.Series([na_value, na_value, na_value]) primitive_func = self.primitive().get_function() answers = pd.Series([np.nan, np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_median_word_length.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import MedianWordLength from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestMedianWordLength(PrimitiveTestBase): primitive = MedianWordLength def test_delimiter_override(self): x = pd.Series( ["This is a test file.", "This,is,second,line?", "and;subsequent;lines..."], ) expected = pd.Series([4.0, 4.5, 8.0]) actual = 
self.primitive("[ ,;]").get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiline(self): x = pd.Series( [ "This is a test file.", "This is second line\nthird line $1000;\nand subsequent lines", ], ) expected = pd.Series([4.0, 4.5]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) actual = self.primitive().get_function()(x) expected = pd.Series([np.nan, np.nan, np.nan, 4.0]) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_natural_language_primitives_terminate.py ================================================ import pandas as pd import pytest from featuretools.primitives.utils import _get_natural_language_primitives TIMEOUT_THRESHOLD = 20 class TestNaturalLanguagePrimitivesTerminate: # need to sort primitives to avoid pytest collection error primitives = sorted(_get_natural_language_primitives().items()) @pytest.mark.timeout(TIMEOUT_THRESHOLD) @pytest.mark.parametrize("primitive", [prim for _, prim in primitives]) def test_natlang_primitive_does_not_timeout( self, strings_that_have_triggered_errors_before, primitive, ): for text in strings_that_have_triggered_errors_before: primitive().get_function()(pd.Series(text)) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_characters.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumCharacters 
from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumCharacters(PrimitiveTestBase): primitive = NumCharacters def test_general(self): x = pd.Series( [ "test test test test", "test TEST test TEST,test test test", "and subsequent lines...", ], ) expected = pd.Series([19, 34, 23]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_special_characters_and_whitespace(self): x = pd.Series(["50% 50 50% \t\t\t\n\n", "$5,3040 a test* test"]) expected = pd.Series([16, 20]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_unicode_input(self): x = pd.Series( [ "Ángel Angel Ángel ángel", ], ) expected = pd.Series([23]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) actual = self.primitive().get_function()(x) expected = pd.Series([pd.NA, pd.NA, pd.NA, 20]) pd.testing.assert_series_equal( actual, expected, check_names=False, check_dtype=False, ) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_unique_separators.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumUniqueSeparators from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumUniqueSeparators(PrimitiveTestBase): primitive = NumUniqueSeparators def test_punctuation(self): x = pd.Series( [ "This: is a test 
file", "This, is second line?", "third/line $1,000;", "and--subsequen't lines...", "*and, more..", ], ) primitive_func = self.primitive().get_function() answers = pd.Series([1, 3, 3, 2, 3]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_other_delimiters(self): x = pd.Series(["@#$%^&*()<>/[]\\`~-_=+"]) primitive_func = self.primitive().get_function() answers = pd.Series([0]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_multiline(self): x = pd.Series( [ "This is a test file", "This is second line\nthird line $1000;\nand subsequent lines", "and more!", ], ) primitive_func = self.primitive().get_function() answers = pd.Series([1, 3, 2]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_nans(self): x = pd.Series([np.nan, "", "third line."]) primitive_func = self.primitive().get_function() answers = pd.Series([pd.NA, 0, 2]) pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_words.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumWords from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumWords(PrimitiveTestBase): primitive = NumWords def test_general(self): x = pd.Series( [ "test test test test", "test TEST test TEST,test test test", "and subsequent lines...", ], ) expected = pd.Series([4, 6, 3]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def 
test_special_characters_and_whitespace(self): x = pd.Series(["50% 50 50% \t\t\t\n\n", "$5,3040 a test* test"]) expected = pd.Series([3, 4]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_unicode_input(self): x = pd.Series( [ "Ángel Angel Ángel ángel", ], ) expected = pd.Series([4]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_contractions(self): x = pd.Series( [ "can't won't don't can't aren't won't don't they'd there's", ], ) expected = pd.Series([9]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiple_spaces(self): x = pd.Series( [ " word word word word .", "This is \nthird line \nthird line", ], ) expected = pd.Series([4, 6]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) actual = self.primitive().get_function()(x) expected = pd.Series([pd.NA, pd.NA, pd.NA, 5]) pd.testing.assert_series_equal( actual, expected, check_names=False, check_dtype=False, ) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_common_words.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumberOfCommonWords from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumberOfCommonWords(PrimitiveTestBase): primitive = NumberOfCommonWords test_word_bank = {"and", 
"a", "is"} def test_delimiter_override(self): x = pd.Series( [ "This is a test file.", "This,is,second,line, and?", "and;subsequent;lines...", ], ) expected = pd.Series([2, 2, 1]) actual = self.primitive( word_set=self.test_word_bank, delimiters_regex="[ ,;]", ).get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiline(self): x = pd.Series( [ "This is a test file.", "This is second line\nthird line $1000;\nand subsequent lines", ], ) expected = pd.Series([2, 2]) actual = self.primitive(self.test_word_bank).get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) actual = self.primitive(self.test_word_bank).get_function()(x) expected = pd.Series([pd.NA, pd.NA, pd.NA, 2]) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_case_insensitive(self): x = pd.Series(["Is", "a", "AND"]) actual = self.primitive(self.test_word_bank).get_function()(x) expected = pd.Series([1, 1, 1]) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_hashtags.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumberOfHashtags from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumberOfHashtags(PrimitiveTestBase): primitive = NumberOfHashtags def test_regular_input(self): x = pd.Series( [ "#hello #hi #hello", "#regular#expression#0or1#yes", "andorandorand #32309", ], ) expected = [3.0, 
0.0, 0.0] actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_unicode_input(self): x = pd.Series( [ "#Ángel #Æ #ĘÁÊÚ", "#############Āndandandandand###", "andorandorand #32309", ], ) expected = [3.0, 0.0, 0.0] actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_multiline(self): x = pd.Series( [ "#\n\t\n", "#hashtag\n#hashtag2\n#\n\n", ], ) expected = [0.0, 2.0] actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "#test"]) actual = self.primitive().get_function()(x) expected = [np.nan, np.nan, np.nan, 1.0] np.testing.assert_array_equal(actual, expected) def test_alphanumeric_and_special(self): x = pd.Series(["#1or0", "#12", "#??!>@?@#>"]) actual = self.primitive().get_function()(x) expected = [1.0, 0.0, 0.0] np.testing.assert_array_equal(actual, expected) def test_underscore(self): x = pd.Series(["#no", "#__yes", "#??!>@?@#>"]) actual = self.primitive().get_function()(x) expected = [1.0, 1.0, 0.0] np.testing.assert_array_equal(actual, expected) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_mentions.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumberOfMentions from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumberOfMentions(PrimitiveTestBase): primitive = NumberOfMentions def test_regular_input(self): x = pd.Series( [ "@hello @hi @hello", "@and@", "andorandorand", ], ) expected = [3.0, 0.0, 0.0] 
actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_unicode_input(self): x = pd.Series( [ "@Ángel @Æ @ĘÁÊÚ", "@@@@Āndandandandand@", "andorandorand @32309", "example@gmail.com", "@example-20329", ], ) expected = [3.0, 0.0, 1.0, 0.0, 1.0] actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_multiline(self): x = pd.Series( [ "@\n\t\n", "@mention\n @mention2\n@\n\n", ], ) expected = [0.0, 2.0] actual = self.primitive().get_function()(x) np.testing.assert_array_equal(actual, expected) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "@test"]) actual = self.primitive().get_function()(x) expected = [np.nan, np.nan, np.nan, 1.0] np.testing.assert_array_equal(actual, expected) def test_alphanumeric_and_special(self): x = pd.Series(["@1or0", "@12", "#??!>@?@#>"]) actual = self.primitive().get_function()(x) expected = [1.0, 1.0, 0.0] np.testing.assert_array_equal(actual, expected) def test_underscore(self): x = pd.Series(["@user1", "@__yes", "#??!>@?@#>"]) actual = self.primitive().get_function()(x) expected = [1.0, 1.0, 0.0] np.testing.assert_array_equal(actual, expected) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_unique_words.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import NumberOfUniqueWords from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumberOfUniqueWords(PrimitiveTestBase): primitive = NumberOfUniqueWords def test_general(self): x = pd.Series( [ "test test test test", "test TEST test 
TEST", "and subsequent lines...", ], ) expected = pd.Series([1, 2, 3]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_special_characters_and_whitespace(self): x = pd.Series(["50% 50 50% \t\t\t\n\n", "a test* test"]) expected = pd.Series([1, 2]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_unicode_input(self): x = pd.Series( [ "Ángel Angel Ángel ángel", ], ) expected = pd.Series([3]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_contractions(self): x = pd.Series( [ "can't won't don't can't aren't won't don't they'd there's", ], ) expected = pd.Series([6]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiline(self): x = pd.Series( [ "word word word word.", "This is \nthird line \nthird line", ], ) expected = pd.Series([1, 4]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) actual = self.primitive().get_function()(x) expected = pd.Series([pd.NA, pd.NA, pd.NA, 5]) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_case_insensitive(self): x = pd.Series(["WORD word WORd WORd WOrD word"]) actual = self.primitive(case_insensitive=True).get_function()(x) expected = pd.Series([1]) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: 
featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_words_in_quotes.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import NumberOfWordsInQuotes from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestNumberOfWordsInQuotes(PrimitiveTestBase): primitive = NumberOfWordsInQuotes def test_regular_double_quotes_input(self): x = pd.Series( [ 'Yes " "', '"Hello this is a test"', '"Yes" " "', "", '"Python, java prolog"', '"Python, java prolog" three words here "binary search algorithm"', '"Diffie-Hellman key exchange"', '"user@email.com"', '"https://alteryx.com"', '"100,000"', '"This Borderlands game here"" is the perfect conclusion to the ""Borderlands 3"" line, which focuses on the fans ""favorite character and gives the players the opportunity to close for a long time some very important questions about\'s character and the memorable scenery with which the players interact.', ], ) expected = pd.Series([0, 5, 1, 0, 3, 6, 3, 1, 1, 1, 6], dtype="Int64") actual = self.primitive("double").get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_captures_regular_single_quotes(self): x = pd.Series( [ "'Hello this is a test'", "'Python, Java Prolog'", "'Python, Java Prolog' three words here 'three words here'", "'Diffie-Hellman key exchange'", "'user@email.com'", "'https://alteryx.com'", "'there's where's here's' word 'word'", "'100,000'", ], ) expected = pd.Series([5, 3, 6, 3, 1, 1, 4, 1], dtype="Int64") actual = self.primitive("single").get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_captures_both_single_and_double_quotes(self): x = pd.Series( [ "'test test test test' three words here \"test test test!\"", ], ) expected = pd.Series([7], dtype="Int64") actual = self.primitive().get_function()(x) 
pd.testing.assert_series_equal(actual, expected, check_names=False) def test_unicode_input(self): x = pd.Series( [ '"Ángel"', '"Ángel" word word', ], ) expected = pd.Series([1, 1], dtype="Int64") actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiline(self): x = pd.Series( [ "'Yes\n, this is me'", ], ) expected = pd.Series([4], dtype="Int64") actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_raises_error_invalid_args(self): error_msg = ( "NULL is not a valid quote_type. Specify 'both', 'single', or 'double'" ) with pytest.raises( ValueError, match=error_msg, ): self.primitive(quote_type="NULL") def test_null(self): x = pd.Series([np.nan, pd.NA, None, '"test"']) actual = self.primitive().get_function()(x) expected = pd.Series([pd.NA, pd.NA, pd.NA, 1.0], dtype="Int64") pd.testing.assert_series_equal(actual, expected, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_punctuation_count.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import PunctuationCount from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestPunctuationCount(PrimitiveTestBase): primitive = PunctuationCount def test_punctuation(self): x = pd.Series( [ "This is a test file.", "This, is second line?", "third/line $1,000;", "and--subsequen't lines...", "*and, more..", ], ) primitive_func = self.primitive().get_function() answers = [1.0, 2.0, 4.0, 6.0, 4.0] 
np.testing.assert_array_equal(primitive_func(x), answers) def test_multiline(self): x = pd.Series( [ "This is a test file.", "This is second line\nthird line $1000;\nand subsequent lines", ], ) primitive_func = self.primitive().get_function() answers = [1.0, 2.0] np.testing.assert_array_equal(primitive_func(x), answers) def test_nan(self): x = pd.Series([np.nan, "", "This is a test file."]) primitive_func = self.primitive().get_function() answers = [np.nan, 0.0, 1.0] np.testing.assert_array_equal(primitive_func(x), answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_title_word_count.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import TitleWordCount from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestTitleWordCount(PrimitiveTestBase): primitive = TitleWordCount def test_strings(self): x = pd.Series( [ "My favorite movie is Jaws.", "this is a string", "AAA", "I bought a Yo-Yo", ], ) primitive_func = self.primitive().get_function() answers = [2.0, 0.0, 1.0, 2.0] np.testing.assert_array_equal(answers, primitive_func(x)) def test_nan(self): x = pd.Series([np.nan, "", "My favorite movie is Jaws."]) primitive_func = self.primitive().get_function() answers = [np.nan, 0.0, 2.0] np.testing.assert_array_equal(answers, primitive_func(x)) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ 
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_total_word_length.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import TotalWordLength from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestTotalWordLength(PrimitiveTestBase): primitive = TotalWordLength def test_delimiter_override(self): x = pd.Series( ["This is a test file.", "This,is,second,line?", "and;subsequent;lines..."], ) expected = pd.Series([16, 17, 21]) actual = self.primitive("[ ,;]").get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_multiline(self): x = pd.Series( [ "This is a test file.", "This is second line\nthird line $1000;\nand subsequent lines", ], ) expected = pd.Series([15, 47]) actual = self.primitive().get_function()(x) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_null(self): x = pd.Series([np.nan, pd.NA, None, "This is a test file."]) expected = pd.Series([np.nan, np.nan, np.nan, 15]) actual = self.primitive().get_function()(x).astype(float) pd.testing.assert_series_equal(actual, expected, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_count.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import UpperCaseCount from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestUpperCaseCount(PrimitiveTestBase): primitive = UpperCaseCount def test_strings(self): x = pd.Series( 
["This IS a STRING.", "Testing AaA", "Testing AAA-BBB", "testing aaa"], ) primitive_func = self.primitive().get_function() answers = [9.0, 3.0, 7.0, 0.0] np.testing.assert_array_equal(primitive_func(x), answers) def test_nan(self): x = pd.Series([np.nan, "", "This IS a STRING."]) primitive_func = self.primitive().get_function() answers = [np.nan, 0.0, 9.0] np.testing.assert_array_equal(primitive_func(x), answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_word_count.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import UpperCaseWordCount class TestUpperCaseWordCount: primitive = UpperCaseWordCount def test_strings(self): x = pd.Series( [ "This IS a STRING.", "Testing AAA", "Testing AAA BBB", "Testing TEsTIng AA3 AA_33 HELLO", "AAA $@()#$@@#$", ], dtype="string", ) primitive_func = self.primitive().get_function() answers = pd.Series([2, 1, 2, 3, 1], dtype="Int64") pd.testing.assert_series_equal( primitive_func(x).astype("Int64"), answers, check_names=False, ) def test_nan(self): x = pd.Series( [ np.nan, "", "This IS a STRING.", ], dtype="string", ) primitive_func = self.primitive().get_function() answers = pd.Series([pd.NA, 0, 2], dtype="Int64") pd.testing.assert_series_equal( primitive_func(x).astype("Int64"), answers, check_names=False, ) ================================================ FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_whitespace_count.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import WhitespaceCount from featuretools.tests.primitive_tests.utils 
import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestWhitespaceCount(PrimitiveTestBase): primitive = WhitespaceCount def compare(self, primitive_initiated, test_cases, answers): primitive_func = primitive_initiated.get_function() primitive_answers = primitive_func(test_cases) return np.testing.assert_array_equal(answers, primitive_answers) def test_strings(self): x = pd.Series( ["", "hi im ethan!", "consecutive. spaces.", " spaces-on-ends "], ) answers = [0, 2, 4, 2] self.compare(self.primitive(), x, answers) def test_nan(self): x = pd.Series([np.nan, None, pd.NA, "", "This IS a STRING."]) answers = [np.nan, np.nan, np.nan, 0, 3] self.compare(self.primitive(), x, answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/primitives_to_install/__init__.py ================================================ ================================================ FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_max.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import AggregationPrimitive class CustomMax(AggregationPrimitive): name = "custom_max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) ================================================ FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_mean.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import AggregationPrimitive class CustomMean(AggregationPrimitive): name = "custom_mean" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = 
ColumnSchema(semantic_tags={"numeric"}) ================================================ FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_sum.py ================================================ from woodwork.column_schema import ColumnSchema from featuretools.primitives.base import AggregationPrimitive class CustomSum(AggregationPrimitive): name = "custom_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) ================================================ FILE: featuretools/tests/primitive_tests/test_absolute_diff.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import AbsoluteDiff class TestAbsoluteDiff: def test_nan(self): data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan]) answer = pd.Series([np.nan, np.nan, 5, 10, 0, 10, 0]) primitive_func = AbsoluteDiff().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_regular(self): data = pd.Series([2, 5, 15, 3, 9, 4.5]) answer = pd.Series([np.nan, 3, 10, 12, 6, 4.5]) primitive_func = AbsoluteDiff().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_method(self): data = pd.Series([2, np.nan, 15, 3, np.nan, 4.5]) answer = pd.Series([np.nan, 13, 0, 12, 1.5, 0]) primitive_func = AbsoluteDiff(method="backfill").get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_limit(self): data = pd.Series([2, np.nan, np.nan, np.nan, 3.0, 4.5]) answer = pd.Series([np.nan, 0, 0, np.nan, np.nan, 1.5]) primitive_func = AbsoluteDiff(limit=2).get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_zero(self): data = pd.Series([2, 0, 0, 5, 0, -4]) answer = pd.Series([np.nan, 2, 0, 5, 5, 4]) primitive_func = AbsoluteDiff().get_function() 
given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_empty(self): data = pd.Series([], dtype="float64") answer = pd.Series([], dtype="float64") primitive_func = AbsoluteDiff().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_inf(self): data = pd.Series([0, np.inf, 0, 5, -np.inf, np.inf, -np.inf]) answer = pd.Series([np.nan, np.inf, np.inf, 5, np.inf, np.inf, np.inf]) primitive_func = AbsoluteDiff().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_raises(self): with pytest.raises(ValueError): AbsoluteDiff(method="invalid") ================================================ FILE: featuretools/tests/primitive_tests/test_agg_feats.py ================================================ from datetime import datetime from inspect import isclass from math import isnan import numpy as np import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools import ( AggregationFeature, Feature, IdentityFeature, Timedelta, calculate_feature_matrix, dfs, primitives, ) from featuretools.entityset.relationship import RelationshipPath from featuretools.feature_base.cache import feature_cache from featuretools.primitives import ( Count, Max, Mean, Median, NMostCommon, NumTrue, NumUnique, Sum, TimeSinceFirst, TimeSinceLast, get_aggregation_primitives, ) from featuretools.primitives.base import AggregationPrimitive from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis, match from featuretools.tests.testing_utils import backward_path, feature_with_name @pytest.fixture(autouse=True) def reset_dfs_cache(): feature_cache.enabled = False feature_cache.clear_all() def test_get_depth(es): log_id_feat = IdentityFeature(es["log"].ww["id"]) customer_id_feat = IdentityFeature(es["customers"].ww["id"]) count_logs = Feature(log_id_feat, 
parent_dataframe_name="sessions", primitive=Count) sum_count_logs = Feature( count_logs, parent_dataframe_name="customers", primitive=Sum, ) num_logs_greater_than_5 = sum_count_logs > 5 count_customers = Feature( customer_id_feat, parent_dataframe_name="régions", where=num_logs_greater_than_5, primitive=Count, ) num_customers_region = Feature(count_customers, dataframe_name="customers") depth = num_customers_region.get_depth() assert depth == 5 def test_makes_count(es): dfs = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Count], trans_primitives=[], ) features = dfs.build_features() assert feature_with_name(features, "device_type") assert feature_with_name(features, "customer_id") assert feature_with_name(features, "customers.région_id") assert feature_with_name(features, "customers.age") assert feature_with_name(features, "COUNT(log)") assert feature_with_name(features, "customers.COUNT(sessions)") assert feature_with_name(features, "customers.régions.language") assert feature_with_name(features, "customers.COUNT(log)") def test_count_null(es): class Count(AggregationPrimitive): name = "count" input_types = [[ColumnSchema(semantic_tags={"foreign_key"})], [ColumnSchema()]] return_type = ColumnSchema(semantic_tags={"numeric"}) stack_on_self = False def __init__(self, count_null=True): self.count_null = count_null def get_function(self): def count_func(values): if self.count_null: values = values.fillna(0) return values.count() return count_func def generate_name( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): return "COUNT(%s%s%s)" % (relationship_path_name, where_str, use_prev_str) count_null = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Count(count_null=True), ) feature_matrix = calculate_feature_matrix([count_null], entityset=es) values = [5, 4, 1, 2, 3, 2] assert (values == feature_matrix[count_null.get_name()]).all() def 
test_check_input_types(es): count = Feature( es["sessions"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) mean = Feature(count, parent_dataframe_name="régions", primitive=Mean) assert mean._check_input_types() boolean = count > 3 mean = Feature( count, parent_dataframe_name="régions", where=boolean, primitive=Mean, ) assert mean._check_input_types() def test_mean_nan(es): array = pd.Series([5, 5, 5, 5, 5]) mean_func_nans_default = Mean().get_function() mean_func_nans_false = Mean(skipna=False).get_function() mean_func_nans_true = Mean(skipna=True).get_function() assert mean_func_nans_default(array) == 5 assert mean_func_nans_false(array) == 5 assert mean_func_nans_true(array) == 5 array = pd.Series([5, np.nan, np.nan, np.nan, np.nan, 10]) assert mean_func_nans_default(array) == 7.5 assert isnan(mean_func_nans_false(array)) assert mean_func_nans_true(array) == 7.5 array_nans = pd.Series([np.nan, np.nan, np.nan, np.nan]) assert isnan(mean_func_nans_default(array_nans)) assert isnan(mean_func_nans_false(array_nans)) assert isnan(mean_func_nans_true(array_nans)) # test naming default_feat = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Mean, ) assert default_feat.get_name() == "MEAN(log.value)" ignore_nan_feat = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Mean(skipna=True), ) assert ignore_nan_feat.get_name() == "MEAN(log.value)" include_nan_feat = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Mean(skipna=False), ) assert include_nan_feat.get_name() == "MEAN(log.value, skipna=False)" def test_init_and_name(es): log = es["log"] # Add a BooleanNullable column so primitives with that input type get tested boolean_nullable = log.ww["purchased"] boolean_nullable = boolean_nullable.ww.set_logical_type("BooleanNullable") log.ww["boolean_nullable"] = boolean_nullable features = [Feature(es["log"].ww[col]) for col in log.columns] # check all primitives have name for 
attribute_string in dir(primitives): attr = getattr(primitives, attribute_string) if isclass(attr): if issubclass(attr, AggregationPrimitive) and attr != AggregationPrimitive: assert getattr(attr, "name") is not None agg_primitives = get_aggregation_primitives().values() for agg_prim in agg_primitives: input_types = agg_prim.input_types if not isinstance(input_types[0], list): input_types = [input_types] # test each allowed input_types for this primitive for it in input_types: # use the input_types matching function from DFS matching_types = match(it, features) if len(matching_types) == 0: raise Exception("Agg Primitive %s not tested" % agg_prim.name) for t in matching_types: instance = Feature( t, parent_dataframe_name="sessions", primitive=agg_prim, ) # try to get name and calculate instance.get_name() calculate_feature_matrix([instance], entityset=es) def test_invalid_init_args(diamond_es): error_text = "parent_dataframe must match first relationship in path" with pytest.raises(AssertionError, match=error_text): path = backward_path(diamond_es, ["stores", "transactions"]) AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "customers", Mean, relationship_path=path, ) error_text = ( "Base feature must be defined on the dataframe at the end of relationship_path" ) with pytest.raises(AssertionError, match=error_text): path = backward_path(diamond_es, ["regions", "stores"]) AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "regions", Mean, relationship_path=path, ) error_text = "All relationships in path must be backward" with pytest.raises(AssertionError, match=error_text): backward = backward_path(diamond_es, ["customers", "transactions"]) forward = RelationshipPath([(True, r) for _, r in backward]) path = RelationshipPath(list(forward) + list(backward)) AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "transactions", Mean, relationship_path=path, ) def 
test_init_with_multiple_possible_paths(diamond_es): error_text = ( "There are multiple possible paths to the base dataframe. " "You must specify a relationship path." ) with pytest.raises(RuntimeError, match=error_text): AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "regions", Mean, ) # Does not raise if path specified. path = backward_path(diamond_es, ["regions", "customers", "transactions"]) AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "regions", Mean, relationship_path=path, ) def test_init_with_single_possible_path(diamond_es): # This uses diamond_es to test that there being a cycle somewhere in the # graph doesn't cause an error. feat = AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "customers", Mean, ) expected_path = backward_path(diamond_es, ["customers", "transactions"]) assert feat.relationship_path == expected_path def test_init_with_no_path(diamond_es): error_text = 'No backward path from "transactions" to "customers" found.' with pytest.raises(RuntimeError, match=error_text): AggregationFeature( IdentityFeature(diamond_es["customers"].ww["name"]), "transactions", Count, ) error_text = 'No backward path from "transactions" to "transactions" found.' 
with pytest.raises(RuntimeError, match=error_text): AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "transactions", Mean, ) def test_name_with_multiple_possible_paths(diamond_es): path = backward_path(diamond_es, ["regions", "customers", "transactions"]) feat = AggregationFeature( IdentityFeature(diamond_es["transactions"].ww["amount"]), "regions", Mean, relationship_path=path, ) assert feat.get_name() == "MEAN(customers.transactions.amount)" assert feat.relationship_path_name() == "customers.transactions" def test_copy(games_es): home_games = next( r for r in games_es.relationships if r._child_column_name == "home_team_id" ) path = RelationshipPath([(False, home_games)]) feat = AggregationFeature( IdentityFeature(games_es["games"].ww["home_team_score"]), "teams", relationship_path=path, primitive=Mean, ) copied = feat.copy() assert copied.dataframe_name == feat.dataframe_name assert copied.base_features == feat.base_features assert copied.relationship_path == feat.relationship_path assert copied.primitive == feat.primitive def test_serialization(es): value = IdentityFeature(es["log"].ww["value"]) primitive = Max() max1 = AggregationFeature(value, "customers", primitive) path = next(es.find_backward_paths("customers", "log")) dictionary = { "name": max1.get_name(), "base_features": [value.unique_name()], "relationship_path": [r.to_dictionary() for r in path], "primitive": primitive, "where": None, "use_previous": None, } assert dictionary == max1.get_arguments() deserialized = AggregationFeature.from_dictionary( dictionary, es, {value.unique_name(): value}, primitive, ) _assert_agg_feats_equal(max1, deserialized) is_purchased = IdentityFeature(es["log"].ww["purchased"]) use_previous = Timedelta(3, "d") max2 = AggregationFeature( value, "customers", primitive, where=is_purchased, use_previous=use_previous, ) dictionary = { "name": max2.get_name(), "base_features": [value.unique_name()], "relationship_path": [r.to_dictionary() for r in 
path], "primitive": primitive, "where": is_purchased.unique_name(), "use_previous": use_previous.get_arguments(), } assert dictionary == max2.get_arguments() dependencies = { value.unique_name(): value, is_purchased.unique_name(): is_purchased, } deserialized = AggregationFeature.from_dictionary( dictionary, es, dependencies, primitive, ) _assert_agg_feats_equal(max2, deserialized) def test_time_since_last(es): f = Feature( es["log"].ww["datetime"], parent_dataframe_name="customers", primitive=TimeSinceLast, ) fm = calculate_feature_matrix( [f], entityset=es, instance_ids=[0, 1, 2], cutoff_time=datetime(2015, 6, 8), ) correct = [131376000.0, 131289534.0, 131287797.0] # note: must round to nearest second assert all(fm[f.get_name()].round().values == correct) def test_time_since_first(es): f = Feature( es["log"].ww["datetime"], parent_dataframe_name="customers", primitive=TimeSinceFirst, ) fm = calculate_feature_matrix( [f], entityset=es, instance_ids=[0, 1, 2], cutoff_time=datetime(2015, 6, 8), ) correct = [131376600.0, 131289600.0, 131287800.0] # note: must round to nearest second assert all(fm[f.get_name()].round().values == correct) def test_median(es): f = Feature( es["log"].ww["value_many_nans"], parent_dataframe_name="customers", primitive=Median, ) fm = calculate_feature_matrix( [f], entityset=es, instance_ids=[0, 1, 2], cutoff_time=datetime(2015, 6, 8), ) correct = [1, 3, np.nan] np.testing.assert_equal(fm[f.get_name()].values, correct) def test_agg_same_method_name(es): """ Pandas relies on the function name when calculating aggregations. This means that if two primitives with the same function name are applied to the same column, pandas can't differentiate them. We have a workaround for this based on the name property that we test here. 
""" # test with normally defined functions class Sum(AggregationPrimitive): name = "sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): def custom_primitive(x): return x.sum() return custom_primitive class Max(AggregationPrimitive): name = "max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): def custom_primitive(x): return x.max() return custom_primitive f_sum = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum, ) f_max = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Max, ) fm = calculate_feature_matrix([f_sum, f_max], entityset=es) assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()] # test with lambdas class Sum(AggregationPrimitive): name = "sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): return lambda x: x.sum() class Max(AggregationPrimitive): name = "max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): return lambda x: x.max() f_sum = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Sum, ) f_max = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Max, ) fm = calculate_feature_matrix([f_sum, f_max], entityset=es) assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()] def test_time_since_last_custom(es): class TimeSinceLast(AggregationPrimitive): name = "time_since_last" input_types = [ ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_calc_time = True def get_function(self): def time_since_last(values, time): time_since = time - values.iloc[0] return time_since.total_seconds() return time_since_last 
f = Feature( es["log"].ww["datetime"], parent_dataframe_name="customers", primitive=TimeSinceLast, ) fm = calculate_feature_matrix( [f], entityset=es, instance_ids=[0, 1, 2], cutoff_time=datetime(2015, 6, 8), ) correct = [131376600, 131289600, 131287800] # note: must round to nearest second assert all(fm[f.get_name()].round().values == correct) def test_custom_primitive_multiple_inputs(es): class MeanSunday(AggregationPrimitive): name = "mean_sunday" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=Datetime), ] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): def mean_sunday(numeric, datetime): """ Finds the mean of non-null values of a feature that occurred on Sundays """ days = pd.DatetimeIndex(datetime).weekday.values df = pd.DataFrame({"numeric": numeric, "time": days}) return df[df["time"] == 6]["numeric"].mean() return mean_sunday fm, features = dfs( entityset=es, target_dataframe_name="sessions", agg_primitives=[MeanSunday], trans_primitives=[], ) mean_sunday_value = pd.Series([None, None, None, 2.5, 7, None]) iterator = zip(fm["MEAN_SUNDAY(log.value, datetime)"], mean_sunday_value) for x, y in iterator: assert (pd.isnull(x) and pd.isnull(y)) or (x == y) es.add_interesting_values() mean_sunday_value_priority_0 = pd.Series([None, None, None, 2.5, 0, None]) fm, features = dfs( entityset=es, target_dataframe_name="sessions", agg_primitives=[MeanSunday], trans_primitives=[], where_primitives=[MeanSunday], ) where_feat = "MEAN_SUNDAY(log.value, datetime WHERE priority_level = 0)" for x, y in zip(fm[where_feat], mean_sunday_value_priority_0): assert (pd.isnull(x) and pd.isnull(y)) or (x == y) def test_custom_primitive_default_kwargs(es): class SumNTimes(AggregationPrimitive): name = "sum_n_times" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def __init__(self, n=1): self.n = n sum_n_1_n = 1 sum_n_1_base_f = Feature(es["log"].ww["value"]) 
sum_n_1 = Feature( [sum_n_1_base_f], parent_dataframe_name="sessions", primitive=SumNTimes(n=sum_n_1_n), ) sum_n_2_n = 2 sum_n_2_base_f = Feature(es["log"].ww["value_2"]) sum_n_2 = Feature( [sum_n_2_base_f], parent_dataframe_name="sessions", primitive=SumNTimes(n=sum_n_2_n), ) assert sum_n_1_base_f == sum_n_1.base_features[0] assert sum_n_1_n == sum_n_1.primitive.n assert sum_n_2_base_f == sum_n_2.base_features[0] assert sum_n_2_n == sum_n_2.primitive.n def test_makes_numtrue(es): dfs = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[NumTrue], trans_primitives=[], ) features = dfs.build_features() assert feature_with_name(features, "customers.NUM_TRUE(log.purchased)") assert feature_with_name(features, "NUM_TRUE(log.purchased)") def test_make_three_most_common(es): class NMostCommoner(AggregationPrimitive): name = "pd_top3" input_types = ([ColumnSchema(semantic_tags={"category"})],) return_type = None number_output_features = 3 def get_function(self): def pd_top3(x): counts = x.value_counts() counts = counts[counts > 0] array = np.array(counts[:3].index) if len(array) < 3: filler = np.full(3 - len(array), np.nan) array = np.append(array, filler) return array return pd_top3 fm, features = dfs( entityset=es, target_dataframe_name="customers", instance_ids=[0, 1, 2], agg_primitives=[NMostCommoner], trans_primitives=[], ) df = fm[["PD_TOP3(log.product_id)[%s]" % i for i in range(3)]] assert set(df.iloc[0].values[:2]) == set( ["coke zero", "toothpaste"], ) # coke zero and toothpaste have same number of occurrences assert df.iloc[0].values[2] in [ "car", "brown bag", ] # so just check that the top two match assert ( df.iloc[1] .reset_index(drop=True) .equals(pd.Series(["coke zero", "Haribo sugar-free gummy bears", np.nan])) ) assert ( df.iloc[2] .reset_index(drop=True) .equals(pd.Series(["taco clock", np.nan, np.nan])) ) def test_stacking_multi(es): threecommon = NMostCommon(3) tc = Feature( es["log"].ww["product_id"], 
parent_dataframe_name="sessions", primitive=threecommon, ) stacked = [] for i in range(3): stacked.append( Feature(tc[i], parent_dataframe_name="customers", primitive=NumUnique), ) fm = calculate_feature_matrix(stacked, entityset=es, instance_ids=[0, 1, 2]) correct_vals = [[3, 2, 1], [2, 1, 0], [0, 0, 0]] correct_vals1 = [[3, 1, 1], [2, 1, 0], [0, 0, 0]] # either of the above can be correct, and the outcome depends on the sorting of # two values in the initial n most common function, which changes arbitrarily. for i in range(3): f = "NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[%d])" % i cols = fm.columns assert f in cols assert ( fm[cols[i]].tolist() == correct_vals[i] or fm[cols[i]].tolist() == correct_vals1[i] ) def test_use_previous_pd_dateoffset(es): total_events_pd = Feature( es["log"].ww["id"], parent_dataframe_name="customers", use_previous=pd.DateOffset(hours=47, minutes=60), primitive=Count, ) feature_matrix = calculate_feature_matrix( [total_events_pd], es, cutoff_time=pd.Timestamp("2011-04-11 10:31:30"), instance_ids=[0, 1, 2], ) col_name = list(feature_matrix.head().keys())[0] assert (feature_matrix[col_name] == [1, 5, 2]).all() def _assert_agg_feats_equal(f1, f2): assert f1.unique_name() == f2.unique_name() assert f1.child_dataframe_name == f2.child_dataframe_name assert f1.parent_dataframe_name == f2.parent_dataframe_name assert f1.relationship_path == f2.relationship_path assert f1.use_previous == f2.use_previous def test_override_multi_feature_names(es): def gen_custom_names( primitive, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): base_string = "Custom_%s({}.{})".format( parent_dataframe_name, base_feature_names, ) return [base_string % i for i in range(primitive.number_output_features)] class NMostCommoner(AggregationPrimitive): name = "pd_top3" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"category"}) number_output_features = 3 def 
generate_names( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ): return gen_custom_names( self, base_feature_names, relationship_path_name, parent_dataframe_name, where_str, use_prev_str, ) fm, features = dfs( entityset=es, target_dataframe_name="products", instance_ids=[0, 1, 2], agg_primitives=[NMostCommoner], trans_primitives=[], ) expected_names = [] base_names = [["value"], ["value_2"], ["value_many_nans"]] for name in base_names: expected_names += gen_custom_names( NMostCommoner, name, None, "products", None, None, ) for name in expected_names: assert name in fm.columns ================================================ FILE: featuretools/tests/primitive_tests/test_all_primitive_docstrings.py ================================================ from featuretools.primitives import get_aggregation_primitives, get_transform_primitives def docstring_is_uniform(primitive): docstring = primitive.__doc__ valid_verbs = [ "Calculates", "Determines", "Transforms", "Computes", "Counts", "Negates", "Adds", "Subtracts", "Multiplies", "Divides", "Performs", "Returns", "Shifts", "Extracts", "Applies", ] return any(docstring.startswith(s) for s in valid_verbs) def test_transform_primitive_docstrings(): for primitive in get_transform_primitives().values(): assert docstring_is_uniform(primitive) def test_aggregation_primitive_docstrings(): for primitive in get_aggregation_primitives().values(): assert docstring_is_uniform(primitive) ================================================ FILE: featuretools/tests/primitive_tests/test_direct_features.py ================================================ import numpy as np import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools.computational_backends.feature_set import FeatureSet from featuretools.computational_backends.feature_set_calculator import ( FeatureSetCalculator, ) from featuretools.feature_base 
import DirectFeature, Feature, IdentityFeature
from featuretools.primitives import (
    AggregationPrimitive,
    Day,
    Hour,
    Minute,
    Month,
    NMostCommon,
    Second,
    TransformPrimitive,
    Year,
)
from featuretools.primitives.utils import PrimitivesDeserializer
from featuretools.synthesis import dfs


def test_direct_from_identity(es):
    device = Feature(es["sessions"].ww["device_type"])
    d = DirectFeature(base_feature=device, child_dataframe_name="log")
    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)
    df = calculator.run(np.array([0, 5]))
    v = df[d.get_name()].tolist()
    expected = [0, 1]
    assert v == expected


def test_direct_from_column(es):
    # should be same behavior as test_direct_from_identity
    device = Feature(es["sessions"].ww["device_type"])
    d = DirectFeature(base_feature=device, child_dataframe_name="log")
    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)
    df = calculator.run(np.array([0, 5]))
    v = df[d.get_name()].tolist()
    expected = [0, 1]
    assert v == expected


def test_direct_rename_multioutput(es):
    n_common = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feat = DirectFeature(n_common, "sessions")
    copy_feat = feat.rename("session_test")
    assert feat.unique_name() != copy_feat.unique_name()
    assert feat.get_name() != copy_feat.get_name()
    assert (
        feat.base_features[0].generate_name()
        == copy_feat.base_features[0].generate_name()
    )
    assert feat.dataframe_name == copy_feat.dataframe_name


def test_direct_rename(es):
    # same checks as test_direct_rename_multioutput, for a single-output base feature
    feat = DirectFeature(
        base_feature=IdentityFeature(es["sessions"].ww["device_type"]),
        child_dataframe_name="log",
    )
    copy_feat = feat.rename("session_test")
    assert feat.unique_name() != copy_feat.unique_name()
    assert feat.get_name() != copy_feat.get_name()
    assert (
        feat.base_features[0].generate_name()
        == copy_feat.base_features[0].generate_name()
    )
    assert feat.dataframe_name == copy_feat.dataframe_name


def test_direct_copy(games_es):
    home_team = next(
        r for r in games_es.relationships if r._child_column_name == "home_team_id"
    )
    feat = DirectFeature(
        IdentityFeature(games_es["teams"].ww["name"]),
        "games",
        relationship=home_team,
    )
    copied = feat.copy()
    assert copied.dataframe_name == feat.dataframe_name
    assert copied.base_features == feat.base_features
    assert copied.relationship_path == feat.relationship_path


def test_direct_of_multi_output_transform_feat(es):
    class TestTime(TransformPrimitive):
        name = "test_time"
        input_types = [ColumnSchema(logical_type=Datetime)]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 6

        def get_function(self):
            def test_f(x):
                times = pd.Series(x)
                units = ["year", "month", "day", "hour", "minute", "second"]
                return [times.apply(lambda x: getattr(x, unit)) for unit in units]

            return test_f

    base_feature = IdentityFeature(es["customers"].ww["signup_date"])
    join_time_split = Feature(base_feature, primitive=TestTime)
    alt_features = [
        Feature(base_feature, primitive=Year),
        Feature(base_feature, primitive=Month),
        Feature(base_feature, primitive=Day),
        Feature(base_feature, primitive=Hour),
        Feature(base_feature, primitive=Minute),
        Feature(base_feature, primitive=Second),
    ]
    fm, fl = dfs(
        entityset=es,
        target_dataframe_name="sessions",
        trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second],
    )

    # Get column names for the multi-output feature and the single-output features
    subnames = DirectFeature(join_time_split, "sessions").get_feature_names()
    altnames = [DirectFeature(f, "sessions").get_name() for f in alt_features]

    # Check that values are equal between the two sets of columns
    for col1, col2 in zip(subnames, altnames):
        assert (fm[col1] == fm[col2]).all()


def test_direct_features_of_multi_output_agg_primitives(es):
    class ThreeMostCommonCat(AggregationPrimitive):
        name = "n_most_common_categorical"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type =
ColumnSchema(semantic_tags={"category"}) number_output_features = 3 def get_function(self): def pd_top3(x): counts = x.value_counts() counts = counts[counts > 0] array = np.array(counts.index[:3]) if len(array) < 3: filler = np.full(3 - len(array), np.nan) array = np.append(array, filler) return array return pd_top3 fm, fl = dfs( entityset=es, target_dataframe_name="log", agg_primitives=[ThreeMostCommonCat], trans_primitives=[], max_depth=3, ) has_nmost_as_base = [] for feature in fl: is_base = False if len(feature.base_features) > 0 and isinstance( feature.base_features[0].primitive, ThreeMostCommonCat, ): is_base = True has_nmost_as_base.append(is_base) assert any(has_nmost_as_base) true_result_rows = [] session_data = { 0: ["coke zero", "car", np.nan], 1: ["toothpaste", "brown bag", np.nan], 2: ["brown bag", np.nan, np.nan], 3: set(["Haribo sugar-free gummy bears", "coke zero", np.nan]), 4: ["coke zero", np.nan, np.nan], 5: ["taco clock", np.nan, np.nan], } for i, count in enumerate([5, 4, 1, 2, 3, 2]): while count > 0: true_result_rows.append(session_data[i]) count -= 1 tempname = "sessions.N_MOST_COMMON_CATEGORICAL(log.product_id)[%s]" for i, row in enumerate(true_result_rows): for j in range(3): value = fm[tempname % (j)][i] if isinstance(row, set): assert pd.isnull(value) or value in row else: assert (pd.isnull(value) and pd.isnull(row[j])) or value == row[j] def test_direct_with_invalid_init_args(diamond_es): customer_to_region = diamond_es.get_forward_relationships("customers")[0] error_text = "child_dataframe must be the relationship child dataframe" with pytest.raises(AssertionError, match=error_text): DirectFeature( IdentityFeature(diamond_es["regions"].ww["name"]), "stores", relationship=customer_to_region, ) transaction_relationships = diamond_es.get_forward_relationships("transactions") transaction_to_store = next( r for r in transaction_relationships if r.parent_dataframe.ww.name == "stores" ) error_text = "Base feature must be defined on the 
relationship parent dataframe" with pytest.raises(AssertionError, match=error_text): DirectFeature( IdentityFeature(diamond_es["regions"].ww["name"]), "transactions", relationship=transaction_to_store, ) def test_direct_with_multiple_possible_paths(games_es): error_text = ( "There are multiple relationships to the base dataframe. " "You must specify a relationship." ) with pytest.raises(RuntimeError, match=error_text): DirectFeature(IdentityFeature(games_es["teams"].ww["name"]), "games") # Does not raise if path specified. relationship = next( r for r in games_es.get_forward_relationships("games") if r._child_column_name == "home_team_id" ) feat = DirectFeature( IdentityFeature(games_es["teams"].ww["name"]), "games", relationship=relationship, ) assert feat.relationship_path_name() == "teams[home_team_id]" assert feat.get_name() == "teams[home_team_id].name" def test_direct_with_single_possible_path(es): feat = DirectFeature(IdentityFeature(es["customers"].ww["age"]), "sessions") assert feat.relationship_path_name() == "customers" assert feat.get_name() == "customers.age" def test_direct_with_no_path(diamond_es): error_text = 'No relationship from "regions" to "customers" found.' with pytest.raises(RuntimeError, match=error_text): DirectFeature(IdentityFeature(diamond_es["customers"].ww["name"]), "regions") error_text = 'No relationship from "customers" to "customers" found.' 
with pytest.raises(RuntimeError, match=error_text): DirectFeature(IdentityFeature(diamond_es["customers"].ww["name"]), "customers") def test_serialization(es): value = IdentityFeature(es["products"].ww["rating"]) direct = DirectFeature(value, "log") log_to_products = next( r for r in es.get_forward_relationships("log") if r.parent_dataframe.ww.name == "products" ) dictionary = { "name": direct.get_name(), "base_feature": value.unique_name(), "relationship": log_to_products.to_dictionary(), } assert dictionary == direct.get_arguments() assert direct == DirectFeature.from_dictionary( dictionary, es, {value.unique_name(): value}, PrimitivesDeserializer(), ) ================================================ FILE: featuretools/tests/primitive_tests/test_feature_base.py ================================================ import os.path import re import pytest from pympler.asizeof import asizeof from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime, Integer from featuretools import Feature, config, feature_base from featuretools.feature_base import IdentityFeature from featuretools.primitives import ( Count, Diff, Last, Mode, Negate, NMostCommon, NumUnique, Sum, TransformPrimitive, ) from featuretools.synthesis.deep_feature_synthesis import can_stack_primitive_on_inputs from featuretools.tests.testing_utils import check_rename def test_copy_features_does_not_copy_entityset(es): agg = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) agg_where = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", where=IdentityFeature(es["log"].ww["value"]) == 2, primitive=Sum, ) agg_use_previous = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", use_previous="4 days", primitive=Sum, ) agg_use_previous_where = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", where=IdentityFeature(es["log"].ww["value"]) == 2, use_previous="4 days", primitive=Sum, ) features = [agg, agg_where, 
agg_use_previous, agg_use_previous_where] in_memory_size = asizeof(locals()) copied = [f.copy() for f in features] new_in_memory_size = asizeof(locals()) assert new_in_memory_size < 2 * in_memory_size def test_get_dependencies(es): f = Feature(es["log"].ww["value"]) agg1 = Feature(f, parent_dataframe_name="sessions", primitive=Sum) agg2 = Feature(agg1, parent_dataframe_name="customers", primitive=Sum) d1 = Feature(agg2, "sessions") shallow = d1.get_dependencies(deep=False, ignored=None) deep = d1.get_dependencies(deep=True, ignored=None) ignored = set([agg1.unique_name()]) deep_ignored = d1.get_dependencies(deep=True, ignored=ignored) assert [s.unique_name() for s in shallow] == [agg2.unique_name()] assert [d.unique_name() for d in deep] == [ agg2.unique_name(), agg1.unique_name(), f.unique_name(), ] assert [d.unique_name() for d in deep_ignored] == [agg2.unique_name()] def test_get_depth(es): f = Feature(es["log"].ww["value"]) g = Feature(es["log"].ww["value"]) agg1 = Feature(f, parent_dataframe_name="sessions", primitive=Last) agg2 = Feature(agg1, parent_dataframe_name="customers", primitive=Last) d1 = Feature(agg2, "sessions") d2 = Feature(d1, "log") assert d2.get_depth() == 4 # Make sure this works if we pass in two of the same # feature. This came up when user supplied duplicates # in the seed_features of DFS. 
    assert d2.get_depth(stop_at=[f, g]) == 4
    assert d2.get_depth(stop_at=[f, g, agg1]) == 3
    assert d2.get_depth(stop_at=[f, g, agg2]) == 2
    assert d2.get_depth(stop_at=[f, g, d1]) == 1
    assert d2.get_depth(stop_at=[f, g, d2]) == 0


def test_squared(es):
    feature = Feature(es["log"].ww["value"])
    squared = feature * feature
    assert len(squared.base_features) == 2
    assert (
        squared.base_features[0].unique_name() == squared.base_features[1].unique_name()
    )


def test_return_type_inference(es):
    mode = Feature(
        es["log"].ww["priority_level"],
        parent_dataframe_name="customers",
        primitive=Mode,
    )
    assert (
        mode.column_schema
        == IdentityFeature(es["log"].ww["priority_level"]).column_schema
    )


def test_return_type_inference_direct_feature(es):
    mode = Feature(
        es["log"].ww["priority_level"],
        parent_dataframe_name="customers",
        primitive=Mode,
    )
    mode_session = Feature(mode, "sessions")
    assert (
        mode_session.column_schema
        == IdentityFeature(es["log"].ww["priority_level"]).column_schema
    )


def test_return_type_inference_index(es):
    last = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert "index" not in last.column_schema.semantic_tags
    assert isinstance(last.column_schema.logical_type, Integer)


def test_return_type_inference_datetime_time_index(es):
    last = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert isinstance(last.column_schema.logical_type, Datetime)


def test_return_type_inference_numeric_time_index(int_es):
    last = Feature(
        int_es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert "numeric" in last.column_schema.semantic_tags


def test_return_type_inference_id(es):
    # direct features should keep foreign key tag
    direct_id_feature = Feature(es["sessions"].ww["customer_id"], "log")
    assert "foreign_key" in direct_id_feature.column_schema.semantic_tags

    # aggregations of foreign key types should get converted
    last_feat = Feature(
es["log"].ww["session_id"], parent_dataframe_name="customers", primitive=Last, ) assert "foreign_key" not in last_feat.column_schema.semantic_tags assert isinstance(last_feat.column_schema.logical_type, Integer) # also test direct feature of aggregation last_direct = Feature(last_feat, "sessions") assert "foreign_key" not in last_direct.column_schema.semantic_tags assert isinstance(last_direct.column_schema.logical_type, Integer) def test_set_data_path(es): key = "primitive_data_folder" # Don't change orig_path orig_path = config.get(key) new_path = "/example/new/directory" filename = "test.csv" # Test that default path works sum_prim = Sum() assert sum_prim.get_filepath(filename) == os.path.join(orig_path, filename) # Test that new path works config.set({key: new_path}) assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename) # Test that new path with trailing / works new_path += "/" config.set({key: new_path}) assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename) # Test that the path is correct on newly defined feature sum_prim2 = Sum() assert sum_prim2.get_filepath(filename) == os.path.join(new_path, filename) # Ensure path was reset config.set({key: orig_path}) assert config.get(key) == orig_path def test_to_dictionary_direct(es): actual = Feature( IdentityFeature(es["sessions"].ww["customer_id"]), "log", ).to_dictionary() expected = { "type": "DirectFeature", "dependencies": ["sessions: customer_id"], "arguments": { "name": "sessions.customer_id", "base_feature": "sessions: customer_id", "relationship": { "parent_dataframe_name": "sessions", "child_dataframe_name": "log", "parent_column_name": "id", "child_column_name": "session_id", }, }, } assert expected == actual def test_to_dictionary_identity(es): actual = Feature(es["sessions"].ww["customer_id"]).to_dictionary() expected = { "type": "IdentityFeature", "dependencies": [], "arguments": { "name": "customer_id", "column_name": "customer_id", "dataframe_name": 
"sessions", }, } assert expected == actual def test_to_dictionary_agg(es): primitive = Sum() actual = Feature( es["customers"].ww["age"], primitive=primitive, parent_dataframe_name="cohorts", ).to_dictionary() expected = { "type": "AggregationFeature", "dependencies": ["customers: age"], "arguments": { "name": "SUM(customers.age)", "base_features": ["customers: age"], "relationship_path": [ { "parent_dataframe_name": "cohorts", "child_dataframe_name": "customers", "parent_column_name": "cohort", "child_column_name": "cohort", }, ], "primitive": primitive, "where": None, "use_previous": None, }, } assert expected == actual def test_to_dictionary_where(es): primitive = Sum() actual = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", where=IdentityFeature(es["log"].ww["value"]) == 2, primitive=primitive, ).to_dictionary() expected = { "type": "AggregationFeature", "dependencies": ["log: value", "log: value = 2"], "arguments": { "name": "SUM(log.value WHERE value = 2)", "base_features": ["log: value"], "relationship_path": [ { "parent_dataframe_name": "sessions", "child_dataframe_name": "log", "parent_column_name": "id", "child_column_name": "session_id", }, ], "primitive": primitive, "where": "log: value = 2", "use_previous": None, }, } assert expected == actual def test_to_dictionary_trans(es): primitive = Negate() trans_feature = Feature(es["customers"].ww["age"], primitive=primitive) expected = { "type": "TransformFeature", "dependencies": ["customers: age"], "arguments": { "name": "-(age)", "base_features": ["customers: age"], "primitive": primitive, }, } assert expected == trans_feature.to_dictionary() def test_to_dictionary_groupby_trans(es): primitive = Negate() id_feat = Feature(es["log"].ww["product_id"]) groupby_feature = Feature( es["log"].ww["value"], primitive=primitive, groupby=id_feat, ) expected = { "type": "GroupByTransformFeature", "dependencies": ["log: value", "log: product_id"], "arguments": { "name": "-(value) by product_id", 
"base_features": ["log: value"], "primitive": primitive, "groupby": "log: product_id", }, } assert expected == groupby_feature.to_dictionary() def test_to_dictionary_multi_slice(es): slice_feature = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), )[0] expected = { "type": "FeatureOutputSlice", "dependencies": ["customers: N_MOST_COMMON(log.product_id, n=2)"], "arguments": { "name": "N_MOST_COMMON(log.product_id, n=2)[0]", "base_feature": "customers: N_MOST_COMMON(log.product_id, n=2)", "n": 0, }, } assert expected == slice_feature.to_dictionary() def test_multi_output_base_error_agg(es): three_common = NMostCommon(3) tc = Feature( es["log"].ww["product_id"], parent_dataframe_name="sessions", primitive=three_common, ) error_text = "Cannot stack on whole multi-output feature." with pytest.raises(ValueError, match=error_text): Feature(tc, parent_dataframe_name="customers", primitive=NumUnique) def test_multi_output_base_error_trans(es): class TestTime(TransformPrimitive): name = "test_time" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 6 tc = Feature(es["customers"].ww["birthday"], primitive=TestTime) error_text = "Cannot stack on whole multi-output feature." 
    with pytest.raises(ValueError, match=error_text):
        Feature(tc, primitive=Diff)


def test_multi_output_attributes(es):
    tc = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=NMostCommon,
    )

    assert tc.generate_name() == "N_MOST_COMMON(log.product_id)"
    assert tc.number_output_features == 3
    assert tc.base_features == ["<Feature: product_id>"]

    assert tc[0].generate_name() == "N_MOST_COMMON(log.product_id)[0]"
    assert tc[0].number_output_features == 1
    assert tc[0].base_features == [tc]
    assert tc.relationship_path == tc[0].relationship_path


def test_multi_output_index_error(es):
    error_text = "can only access slice of multi-output feature"
    three_common = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=NMostCommon,
    )

    with pytest.raises(AssertionError, match=error_text):
        single = Feature(
            es["log"].ww["product_id"],
            parent_dataframe_name="sessions",
            primitive=NumUnique,
        )
        single[0]

    error_text = "Cannot get item from slice of multi output feature"
    with pytest.raises(ValueError, match=error_text):
        three_common[0][0]

    error_text = "index is higher than the number of outputs"
    with pytest.raises(AssertionError, match=error_text):
        three_common[10]


def test_rename(es):
    feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    new_name = "session_test"
    new_names = ["session_test"]
    check_rename(feat, new_name, new_names)


def test_rename_multioutput(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_name = "session_test"
    new_names = ["session_test[0]", "session_test[1]"]
    check_rename(feat, new_name, new_names)


def test_rename_featureoutputslice(es):
    multi_output_feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feat = feature_base.FeatureOutputSlice(multi_output_feat, 0)
    new_name = "session_test"
    new_names = ["session_test"]
    check_rename(feat, new_name, new_names)


def
test_set_feature_names_wrong_number_of_names(es): feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) new_names = ["col1"] error_msg = re.escape( "Number of names provided must match the number of output features: 1 name(s) provided, 2 expected.", ) with pytest.raises(ValueError, match=error_msg): feat.set_feature_names(new_names) def test_set_feature_names_not_unique(es): feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) new_names = ["col1", "col1"] error_msg = "Provided output feature names must be unique." with pytest.raises(ValueError, match=error_msg): feat.set_feature_names(new_names) def test_set_feature_names_error_on_single_output_feature(es): feat = Feature(es["sessions"].ww["device_name"], "log") new_names = ["sessions_device"] error_msg = "The set_feature_names can only be used on features that have more than one output column." with pytest.raises(ValueError, match=error_msg): feat.set_feature_names(new_names) def test_set_feature_names_transform_feature(es): class MultiCumulative(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 feat = Feature(es["log"].ww["value"], primitive=MultiCumulative) new_names = ["cumulative_sum", "cumulative_max", "cumulative_min"] feat.set_feature_names(new_names) assert feat.get_feature_names() == new_names def test_set_feature_names_aggregation_feature(es): feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) new_names = ["agg_col_1", "second_agg_col"] feat.set_feature_names(new_names) assert feat.get_feature_names() == new_names def test_renaming_resets_feature_output_names_to_default(es): feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) new_names = ["renamed1", 
"renamed2"] feat.set_feature_names(new_names) assert feat.get_feature_names() == new_names feat = feat.rename("new_feature_name") assert feat.get_feature_names() == ["new_feature_name[0]", "new_feature_name[1]"] def test_base_of_and_stack_on_heuristic(es, test_aggregation_primitive): child = Feature( es["sessions"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) test_aggregation_primitive.stack_on = [] child.primitive.base_of = [] assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = [] child.primitive.base_of = None assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = [] child.primitive.base_of = [test_aggregation_primitive] assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = None child.primitive.base_of = [] assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = None child.primitive.base_of = None assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = None child.primitive.base_of = [test_aggregation_primitive] assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = [type(child.primitive)] child.primitive.base_of = [] assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = [type(child.primitive)] child.primitive.base_of = None assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = [type(child.primitive)] child.primitive.base_of = [test_aggregation_primitive] assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on = None child.primitive.base_of = None child.primitive.base_of_exclude = [test_aggregation_primitive] assert not 
can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) test_aggregation_primitive.stack_on_exclude = [Count] assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) child.primitive.number_output_features = 2 test_aggregation_primitive.stack_on_exclude = [] test_aggregation_primitive.stack_on = [] child.primitive.base_of = [] assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child]) def test_stack_on_self(es, test_transform_primitive): # test stacks on self child = Feature( es["log"].ww["value"], primitive=test_transform_primitive, ) test_transform_primitive.stack_on = [] child.primitive.base_of = [] test_transform_primitive.stack_on_self = False child.primitive.stack_on_self = False assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child]) test_transform_primitive.stack_on_self = True assert can_stack_primitive_on_inputs(test_transform_primitive(), [child]) test_transform_primitive.stack_on = None test_transform_primitive.stack_on_self = False assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child]) ================================================ FILE: featuretools/tests/primitive_tests/test_feature_descriptions.py ================================================ import json import os import pytest from woodwork.column_schema import ColumnSchema from featuretools import describe_feature from featuretools.feature_base import ( AggregationFeature, DirectFeature, GroupByTransformFeature, IdentityFeature, TransformFeature, ) from featuretools.primitives import ( Absolute, AggregationPrimitive, CumMean, EqualScalar, Mean, Mode, NMostCommon, NumUnique, PercentTrue, Sum, TransformPrimitive, ) def test_identity_description(es): feature = IdentityFeature(es["log"].ww["session_id"]) description = 'The "session_id".' 
    assert describe_feature(feature) == description


def test_direct_description(es):
    feature = DirectFeature(
        IdentityFeature(es["customers"].ww["loves_ice_cream"]),
        "sessions",
    )
    description = (
        'The "loves_ice_cream" for the instance of "customers" associated '
        'with this instance of "sessions".'
    )
    assert describe_feature(feature) == description

    deep_direct = DirectFeature(feature, "log")
    deep_description = (
        'The "loves_ice_cream" for the instance of "customers" '
        'associated with the instance of "sessions" associated with '
        'this instance of "log".'
    )
    assert describe_feature(deep_direct) == deep_description

    agg = AggregationFeature(
        IdentityFeature(es["log"].ww["purchased"]),
        "sessions",
        PercentTrue,
    )
    complicated_direct = DirectFeature(agg, "log")
    agg_on_direct = AggregationFeature(complicated_direct, "products", Mean)

    complicated_description = (
        "The average of the percentage of true values in "
        'the "purchased" of all instances of "log" for each "id" in "sessions" for '
        'the instance of "sessions" associated with this instance of "log" of all '
        'instances of "log" for each "id" in "products".'
    )
    assert describe_feature(agg_on_direct) == complicated_description


def test_transform_description(es):
    feature = TransformFeature(IdentityFeature(es["log"].ww["value"]), Absolute)
    description = 'The absolute value of the "value".'

    assert describe_feature(feature) == description


def test_groupby_transform_description(es):
    feature = GroupByTransformFeature(
        IdentityFeature(es["log"].ww["value"]),
        CumMean,
        IdentityFeature(es["log"].ww["session_id"]),
    )
    description = 'The cumulative mean of the "value" for each "session_id".'

    assert describe_feature(feature) == description


def test_aggregation_description(es):
    feature = AggregationFeature(
        IdentityFeature(es["log"].ww["value"]),
        "sessions",
        Mean,
    )
    description = 'The average of the "value" of all instances of "log" for each "id" in "sessions".'
assert describe_feature(feature) == description stacked_agg = AggregationFeature(feature, "customers", Sum) stacked_description = ( 'The sum of t{} of all instances of "sessions" for each "id" ' 'in "customers".'.format(description[1:-1]) ) assert describe_feature(stacked_agg) == stacked_description def test_aggregation_description_where(es): where_feature = TransformFeature( IdentityFeature(es["log"].ww["countrycode"]), EqualScalar("US"), ) feature = AggregationFeature( IdentityFeature(es["log"].ww["value"]), "sessions", Mean, where=where_feature, ) description = ( 'The average of the "value" of all instances of "log" where the ' '"countrycode" is US for each "id" in "sessions".' ) assert describe_feature(feature) == description def test_aggregation_description_use_previous(es): feature = AggregationFeature( IdentityFeature(es["log"].ww["value"]), "sessions", Mean, use_previous="5d", ) description = 'The average of the "value" of the previous 5 days of "log" for each "id" in "sessions".' assert describe_feature(feature) == description def test_multioutput_description(es): n_most_common = NMostCommon(2) n_most_common_feature = AggregationFeature( IdentityFeature(es["log"].ww["zipcode"]), "sessions", n_most_common, ) first_most_common_slice = n_most_common_feature[0] second_most_common_slice = n_most_common_feature[1] n_most_common_base = 'The 2 most common values of the "zipcode" of all instances of "log" for each "id" in "sessions".' n_most_common_first = ( 'The most common value of the "zipcode" of all instances of "log" ' 'for each "id" in "sessions".' ) n_most_common_second = ( 'The 2nd most common value of the "zipcode" of all instances of ' '"log" for each "id" in "sessions".' 
) assert describe_feature(n_most_common_feature) == n_most_common_base assert describe_feature(first_most_common_slice) == n_most_common_first assert describe_feature(second_most_common_slice) == n_most_common_second class CustomMultiOutput(TransformPrimitive): name = "custom_multioutput" input_types = [ColumnSchema(semantic_tags={"category"})] return_type = ColumnSchema(semantic_tags={"category"}) number_output_features = 4 custom_feat = TransformFeature( IdentityFeature(es["log"].ww["zipcode"]), CustomMultiOutput, ) generic_base = 'The result of applying CUSTOM_MULTIOUTPUT to the "zipcode".' generic_first = 'The 1st output from applying CUSTOM_MULTIOUTPUT to the "zipcode".' generic_second = 'The 2nd output from applying CUSTOM_MULTIOUTPUT to the "zipcode".' assert describe_feature(custom_feat) == generic_base assert describe_feature(custom_feat[0]) == generic_first assert describe_feature(custom_feat[1]) == generic_second CustomMultiOutput.description_template = [ "the multioutput of {}", "the {nth_slice} multioutput part of {}", ] template_base = 'The multioutput of the "zipcode".' template_first_slice = 'The 1st multioutput part of the "zipcode".' template_second_slice = 'The 2nd multioutput part of the "zipcode".' template_third_slice = 'The 3rd multioutput part of the "zipcode".' template_fourth_slice = 'The 4th multioutput part of the "zipcode".' assert describe_feature(custom_feat) == template_base assert describe_feature(custom_feat[0]) == template_first_slice assert describe_feature(custom_feat[1]) == template_second_slice assert describe_feature(custom_feat[2]) == template_third_slice assert describe_feature(custom_feat[3]) == template_fourth_slice CustomMultiOutput.description_template = [ "the multioutput of {}", "the primary multioutput part of {}", "the secondary multioutput part of {}", ] custom_base = 'The multioutput of the "zipcode".' custom_first_slice = 'The primary multioutput part of the "zipcode".' 
custom_second_slice = 'The secondary multioutput part of the "zipcode".' bad_slice_error = "Slice out of range of template" assert describe_feature(custom_feat) == custom_base assert describe_feature(custom_feat[0]) == custom_first_slice assert describe_feature(custom_feat[1]) == custom_second_slice with pytest.raises(IndexError, match=bad_slice_error): describe_feature(custom_feat[2]) def test_generic_description(es): class NoName(TransformPrimitive): input_types = [ColumnSchema(semantic_tags={"category"})] output_type = ColumnSchema(semantic_tags={"category"}) def generate_name(self, base_feature_names): return "%s(%s%s)" % ( "NO_NAME", ", ".join(base_feature_names), self.get_args_string(), ) class CustomAgg(AggregationPrimitive): name = "custom_aggregation" input_types = [ColumnSchema(semantic_tags={"category"})] output_type = ColumnSchema(semantic_tags={"category"}) class CustomTrans(TransformPrimitive): name = "custom_transform" input_types = [ColumnSchema(semantic_tags={"category"})] output_type = ColumnSchema(semantic_tags={"category"}) no_name = TransformFeature(IdentityFeature(es["log"].ww["zipcode"]), NoName) no_name_description = 'The result of applying NoName to the "zipcode".' assert describe_feature(no_name) == no_name_description custom_agg = AggregationFeature( IdentityFeature(es["log"].ww["zipcode"]), "customers", CustomAgg, ) custom_agg_description = 'The result of applying CUSTOM_AGGREGATION to the "zipcode" of all instances of "log" for each "id" in "customers".' assert describe_feature(custom_agg) == custom_agg_description custom_trans = TransformFeature( IdentityFeature(es["log"].ww["zipcode"]), CustomTrans, ) custom_trans_description = ( 'The result of applying CUSTOM_TRANSFORM to the "zipcode".' 
) assert describe_feature(custom_trans) == custom_trans_description def test_column_description(es): column_description = "the name of the device used for each session" es["sessions"].ww.columns["device_name"].description = column_description identity_feat = IdentityFeature(es["sessions"].ww["device_name"]) assert ( describe_feature(identity_feat) == column_description[0].upper() + column_description[1:] + "." ) def test_metadata(es, tmp_path): identity_feature_descriptions = { "sessions: device_name": "the name of the device used for each session", "customers: id": "the customer's id", } agg_feat = AggregationFeature( IdentityFeature(es["sessions"].ww["device_name"]), "customers", NumUnique, ) agg_description = ( "The number of unique elements in the name of the device used for each " 'session of all instances of "sessions" for each customer\'s id.' ) assert ( describe_feature(agg_feat, feature_descriptions=identity_feature_descriptions) == agg_description ) transform_feat = GroupByTransformFeature( IdentityFeature(es["log"].ww["value"]), CumMean, IdentityFeature(es["log"].ww["session_id"]), ) transform_description = 'The running average of the "value" for each "session_id".' primitive_templates = {"cum_mean": "the running average of {}"} assert ( describe_feature(transform_feat, primitive_templates=primitive_templates) == transform_description ) custom_agg = AggregationFeature( IdentityFeature(es["log"].ww["zipcode"]), "sessions", Mode, ) auto_description = 'The most frequently occurring value of the "zipcode" of all instances of "log" for each "id" in "sessions".' custom_agg_description = "the most frequently used zipcode" custom_feature_description = ( custom_agg_description[0].upper() + custom_agg_description[1:] + "." 
) feature_description_dict = {"sessions: MODE(log.zipcode)": custom_agg_description} assert describe_feature(custom_agg) == auto_description assert ( describe_feature(custom_agg, feature_descriptions=feature_description_dict) == custom_feature_description ) metadata = { "feature_descriptions": { **identity_feature_descriptions, **feature_description_dict, }, "primitive_templates": primitive_templates, } metadata_path = os.path.join(tmp_path, "description_metadata.json") with open(metadata_path, "w") as f: json.dump(metadata, f) assert describe_feature(agg_feat, metadata_file=metadata_path) == agg_description assert ( describe_feature(transform_feat, metadata_file=metadata_path) == transform_description ) assert ( describe_feature(custom_agg, metadata_file=metadata_path) == custom_feature_description ) ================================================ FILE: featuretools/tests/primitive_tests/test_feature_serialization.py ================================================ import os import boto3 import pandas as pd import pytest from pympler.asizeof import asizeof from smart_open import open from woodwork.column_schema import ColumnSchema from featuretools import ( AggregationFeature, DirectFeature, EntitySet, Feature, GroupByTransformFeature, IdentityFeature, TransformFeature, dfs, feature_base, load_features, primitives, save_features, ) from featuretools.feature_base import FeatureOutputSlice from featuretools.feature_base.cache import feature_cache from featuretools.feature_base.features_deserializer import FeaturesDeserializer from featuretools.feature_base.features_serializer import FeaturesSerializer from featuretools.primitives import ( Count, CumSum, Day, DistanceToHoliday, Haversine, IsIn, Max, Mean, Min, Mode, Month, MultiplyNumericScalar, Negate, NMostCommon, NumberOfCommonWords, NumCharacters, NumUnique, NumWords, PercentTrue, Skew, Std, Sum, TransformPrimitive, Weekday, Year, ) from featuretools.primitives.base import AggregationPrimitive from 
featuretools.tests.testing_utils import check_names from featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION BUCKET_NAME = "test-bucket" WRITE_KEY_NAME = "test-key" TEST_S3_URL = "s3://{}/{}".format(BUCKET_NAME, WRITE_KEY_NAME) TEST_FILE = "test_feature_serialization_feature_schema_{}_entityset_schema_{}_2022_12_28.json".format( FEATURES_SCHEMA_VERSION, ENTITYSET_SCHEMA_VERSION, ) S3_URL = "s3://featuretools-static/" + TEST_FILE URL = "https://featuretools-static.s3.amazonaws.com/" + TEST_FILE TEST_CONFIG = "CheckConfigPassesOn" TEST_KEY = "test_access_key_features" @pytest.fixture(autouse=True) def reset_dfs_cache(): feature_cache.enabled = False feature_cache.clear_all() def assert_features(original, deserialized): for feat_1, feat_2 in zip(original, deserialized): assert feat_1.unique_name() == feat_2.unique_name() assert feat_1.entityset == feat_2.entityset def pickle_features_test_helper(es_size, features_original, dir_path): filepath = os.path.join(dir_path, "test_feature") save_features(features_original, filepath) features_deserializedA = load_features(filepath) assert os.path.getsize(filepath) < es_size os.remove(filepath) with open(filepath, "w") as f: save_features(features_original, f) features_deserializedB = load_features(open(filepath)) assert os.path.getsize(filepath) < es_size os.remove(filepath) features = save_features(features_original) features_deserializedC = load_features(features) assert asizeof(features) < es_size features_deserialized_options = [ features_deserializedA, features_deserializedB, features_deserializedC, ] for features_deserialized in features_deserialized_options: assert_features(features_original, features_deserialized) def test_pickle_features(es, tmp_path): features_original = dfs( target_dataframe_name="sessions", entityset=es, features_only=True, ) pickle_features_test_helper(asizeof(es), features_original, str(tmp_path)) def test_pickle_features_with_custom_primitive(es, tmp_path): class 
NewMax(AggregationPrimitive): name = "new_max" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) features_original = dfs( target_dataframe_name="sessions", entityset=es, agg_primitives=["Last", "Mean", NewMax], features_only=True, ) assert any([isinstance(feat.primitive, NewMax) for feat in features_original]) pickle_features_test_helper(asizeof(es), features_original, str(tmp_path)) def test_serialized_renamed_features(es): def serialize_name_unchanged(original): new_name = "MyFeature" original_names = original.get_feature_names() renamed = original.rename(new_name) new_names = ( [new_name] if len(original_names) == 1 else [new_name + "[{}]".format(i) for i in range(len(original_names))] ) check_names(renamed, new_name, new_names) serializer = FeaturesSerializer([renamed]) serialized = serializer.to_dict() deserializer = FeaturesDeserializer(serialized) deserialized = deserializer.to_list()[0] check_names(deserialized, new_name, new_names) identity_original = IdentityFeature(es["log"].ww["value"]) assert identity_original.get_name() == "value" value = IdentityFeature(es["log"].ww["value"]) primitive = primitives.Max() agg_original = AggregationFeature(value, "customers", primitive) assert agg_original.get_name() == "MAX(log.value)" direct_original = DirectFeature( IdentityFeature(es["customers"].ww["age"]), "sessions", ) assert direct_original.get_name() == "customers.age" primitive = primitives.MultiplyNumericScalar(value=2) transform_original = TransformFeature(value, primitive) assert transform_original.get_name() == "value * 2" zipcode = IdentityFeature(es["log"].ww["zipcode"]) primitive = CumSum() groupby_original = feature_base.GroupByTransformFeature(value, primitive, zipcode) assert groupby_original.get_name() == "CUM_SUM(value) by zipcode" multioutput_original = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) assert 
multioutput_original.get_name() == "N_MOST_COMMON(log.product_id, n=2)" featureslice_original = feature_base.FeatureOutputSlice(multioutput_original, 0) assert featureslice_original.get_name() == "N_MOST_COMMON(log.product_id, n=2)[0]" feature_type_list = [ identity_original, agg_original, direct_original, transform_original, groupby_original, multioutput_original, featureslice_original, ] for feature_type in feature_type_list: serialize_name_unchanged(feature_type) @pytest.fixture def s3_client(): _environ = os.environ.copy() from moto import mock_aws with mock_aws(): s3 = boto3.resource("s3") yield s3 os.environ.clear() os.environ.update(_environ) @pytest.fixture def s3_bucket(s3_client, region="us-east-2"): location = {"LocationConstraint": region} s3_client.create_bucket( Bucket=BUCKET_NAME, ACL="public-read-write", CreateBucketConfiguration=location, ) s3_bucket = s3_client.Bucket(BUCKET_NAME) yield s3_bucket def test_serialize_features_mock_s3(es, s3_client, s3_bucket): features_original = dfs( target_dataframe_name="sessions", entityset=es, features_only=True, ) save_features(features_original, TEST_S3_URL) obj = list(s3_bucket.objects.all())[0].key s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write") features_deserialized = load_features(TEST_S3_URL) assert_features(features_original, features_deserialized) def test_serialize_features_mock_anon_s3(es, s3_client, s3_bucket): features_original = dfs( target_dataframe_name="sessions", entityset=es, features_only=True, ) save_features(features_original, TEST_S3_URL, profile_name=False) obj = list(s3_bucket.objects.all())[0].key s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write") features_deserialized = load_features(TEST_S3_URL, profile_name=False) assert_features(features_original, features_deserialized) @pytest.mark.parametrize("profile_name", ["test", False]) def test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile, profile_name): features_original = dfs( 
target_dataframe_name="sessions", entityset=es, features_only=True, ) save_features(features_original, TEST_S3_URL, profile_name="test") obj = list(s3_bucket.objects.all())[0].key s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write") features_deserialized = load_features(TEST_S3_URL, profile_name=profile_name) assert_features(features_original, features_deserialized) @pytest.mark.parametrize("url,profile_name", [(S3_URL, False), (URL, None)]) def test_deserialize_features_s3(es, url, profile_name): agg_primitives = [ Sum, Std, Max, Skew, Min, Mean, Count, PercentTrue, NumUnique, Mode, ] trans_primitives = [Day, Year, Month, Weekday, Haversine, NumWords, NumCharacters] features_original = dfs( target_dataframe_name="sessions", entityset=es, features_only=True, agg_primitives=agg_primitives, trans_primitives=trans_primitives, ) features_deserialized = load_features(url, profile_name=profile_name) assert_features(features_original, features_deserialized) def test_serialize_url(es): features_original = dfs( target_dataframe_name="sessions", entityset=es, features_only=True, ) error_text = "Writing to URLs is not supported" with pytest.raises(ValueError, match=error_text): save_features(features_original, URL) def test_custom_feature_names_retained_during_serialization(es, tmp_path): class MultiCumulative(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 multi_output_trans_feat = Feature( es["log"].ww["value"], primitive=MultiCumulative, ) groupby_trans_feat = GroupByTransformFeature( es["log"].ww["value"], primitive=MultiCumulative, groupby=es["log"].ww["product_id"], ) multi_output_agg_feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="customers", primitive=NMostCommon(n=2), ) slice = FeatureOutputSlice(multi_output_trans_feat, 1) stacked_feat = Feature(slice, primitive=Negate) trans_names = ["cumulative_sum", 
"cumulative_max", "cumulative_min"]
    multi_output_trans_feat.set_feature_names(trans_names)
    groupby_trans_names = ["grouped_sum", "grouped_max", "grouped_min"]
    groupby_trans_feat.set_feature_names(groupby_trans_names)
    agg_names = ["first_most_common", "second_most_common"]
    multi_output_agg_feat.set_feature_names(agg_names)

    features = [
        multi_output_trans_feat,
        multi_output_agg_feat,
        groupby_trans_feat,
        stacked_feat,
    ]
    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)

    new_trans, new_agg, new_groupby, new_stacked = deserialized_features
    assert new_trans.get_feature_names() == trans_names
    assert new_agg.get_feature_names() == agg_names
    assert new_groupby.get_feature_names() == groupby_trans_names
    assert new_stacked.get_feature_names() == ["-(cumulative_max)"]


def test_deserializer_uses_common_primitive_instances_no_args(es, tmp_path):
    features = dfs(
        entityset=es,
        target_dataframe_name="products",
        features_only=True,
        agg_primitives=["sum"],
        trans_primitives=["is_null"],
    )

    is_null_features = [f for f in features if f.primitive.name == "is_null"]
    sum_features = [f for f in features if f.primitive.name == "sum"]

    # Make sure we have multiple features of each type
    assert len(is_null_features) > 1
    assert len(sum_features) > 1

    # DFS should use the same primitive instance for all features that share a primitive
    is_null_primitive = is_null_features[0].primitive
    sum_primitive = sum_features[0].primitive
    assert all([f.primitive is is_null_primitive for f in is_null_features])
    assert all([f.primitive is sum_primitive for f in sum_features])

    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)
    new_is_null_features = [
        f for f in deserialized_features if f.primitive.name == "is_null"
    ]
    new_sum_features = [f for f in deserialized_features if f.primitive.name == "sum"]

    # After deserialization all features that share a primitive should use the same
    # primitive instance
    new_is_null_primitive = new_is_null_features[0].primitive
    new_sum_primitive = new_sum_features[0].primitive
    assert all([f.primitive is new_is_null_primitive for f in new_is_null_features])
    assert all([f.primitive is new_sum_primitive for f in new_sum_features])


def test_deserializer_uses_common_primitive_instances_with_args(es, tmp_path):
    # Single argument
    scalar1 = MultiplyNumericScalar(value=1)
    scalar5 = MultiplyNumericScalar(value=5)
    features = dfs(
        entityset=es,
        target_dataframe_name="products",
        features_only=True,
        agg_primitives=["sum"],
        trans_primitives=[scalar1, scalar5],
    )

    scalar1_features = [
        f
        for f in features
        if f.primitive.name == "multiply_numeric_scalar" and " * 1" in f.get_name()
    ]
    scalar5_features = [
        f
        for f in features
        if f.primitive.name == "multiply_numeric_scalar" and " * 5" in f.get_name()
    ]

    # Make sure we have multiple features of each type
    assert len(scalar1_features) > 1
    assert len(scalar5_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is scalar1 for f in scalar1_features])
    assert all([f.primitive is scalar5 for f in scalar5_features])

    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)

    new_scalar1_features = [
        f
        for f in deserialized_features
        if f.primitive.name == "multiply_numeric_scalar" and " * 1" in f.get_name()
    ]
    new_scalar5_features = [
        f
        for f in deserialized_features
        if f.primitive.name == "multiply_numeric_scalar" and " * 5" in f.get_name()
    ]

    # After deserialization all features that share a primitive should use the same primitive instance
    new_scalar1_primitive = new_scalar1_features[0].primitive
    new_scalar5_primitive = new_scalar5_features[0].primitive
    assert all([f.primitive is new_scalar1_primitive for f in new_scalar1_features])
    assert all([f.primitive is new_scalar5_primitive for f in new_scalar5_features])
    assert new_scalar1_primitive.value == 1
    assert new_scalar5_primitive.value == 5

    # Test primitive with multiple args
    distance_to_holiday = DistanceToHoliday(
        holiday="Canada Day",
        country="Canada",
    )
    features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        features_only=True,
        agg_primitives=[],
        trans_primitives=[distance_to_holiday],
    )

    distance_features = [
        f for f in features if f.primitive.name == "distance_to_holiday"
    ]

    assert len(distance_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is distance_to_holiday for f in distance_features])

    file = os.path.join(tmp_path, "distance_features.json")
    save_features(distance_features, file)
    new_distance_features = load_features(file)

    # After deserialization all features that share a primitive should use the same primitive instance
    new_distance_primitive = new_distance_features[0].primitive
    assert all(
        [f.primitive is new_distance_primitive for f in new_distance_features],
    )
    assert new_distance_primitive.holiday == "Canada Day"
    assert new_distance_primitive.country == "Canada"

    # Test primitive with list arg
    is_in = IsIn(list_of_outputs=[5, True, "coke zero"])
    features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        features_only=True,
        agg_primitives=[],
        trans_primitives=[is_in],
    )

    is_in_features = [f for f in features if f.primitive.name == "isin"]
    assert len(is_in_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is is_in for f in is_in_features])

    file = os.path.join(tmp_path, "distance_features.json")
    save_features(is_in_features, file)
    new_is_in_features = load_features(file)

    # After deserialization all features that share a primitive should use the same primitive instance
    new_is_in_primitive = new_is_in_features[0].primitive
    assert all([f.primitive is new_is_in_primitive for f in new_is_in_features])
    assert new_is_in_primitive.list_of_outputs == [5, True, "coke zero"]


def test_can_serialize_word_set_for_number_of_common_words_feature(es):
    # The word_set argument is passed in as a set, which is not JSON-serializable.
    # This test checks the internal logic that converts the set to a list so it can be serialized.
    common_word_set = {"hello", "my"}
    df = pd.DataFrame({"text": ["hello my name is hi"]})
    es = EntitySet()
    es.add_dataframe(dataframe_name="df", index="idx", dataframe=df, make_index=True)

    num_common_words = NumberOfCommonWords(word_set=common_word_set)
    fm, fd = dfs(
        entityset=es,
        target_dataframe_name="df",
        trans_primitives=[num_common_words],
    )

    feat = fd[-1]
    save_features([feat])


================================================
FILE: featuretools/tests/primitive_tests/test_feature_utils.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.feature_base.utils import is_valid_input


def test_is_valid_input():
    assert is_valid_input(candidate=ColumnSchema(), template=ColumnSchema())

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
        template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(
            logical_type=Integer,
            semantic_tags={"index", "numeric"},
        ),
        template=ColumnSchema(semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(semantic_tags={"index"}),
        template=ColumnSchema(semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
        template=ColumnSchema(),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer),
        template=ColumnSchema(logical_type=Integer),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}),
        template=ColumnSchema(logical_type=Integer),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
template=ColumnSchema(logical_type=Double, semantic_tags={"index"}), ) assert not is_valid_input( candidate=ColumnSchema(logical_type=Integer, semantic_tags={}), template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}), ) assert not is_valid_input( candidate=ColumnSchema(), template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}), ) assert not is_valid_input( candidate=ColumnSchema(), template=ColumnSchema(logical_type=Integer), ) assert not is_valid_input( candidate=ColumnSchema(), template=ColumnSchema(semantic_tags={"index"}), ) ================================================ FILE: featuretools/tests/primitive_tests/test_feature_visualizer.py ================================================ import json import os import re import graphviz import pytest from featuretools.feature_base import ( AggregationFeature, DirectFeature, FeatureOutputSlice, GroupByTransformFeature, IdentityFeature, TransformFeature, graph_feature, ) from featuretools.primitives import Count, CumMax, Mode, NMostCommon, Year @pytest.fixture def simple_feat(es): return IdentityFeature(es["log"].ww["id"]) @pytest.fixture def trans_feat(es): return TransformFeature(IdentityFeature(es["customers"].ww["cancel_date"]), Year) def test_returns_digraph_object(simple_feat): graph = graph_feature(simple_feat) assert isinstance(graph, graphviz.Digraph) def test_saving_png_file(simple_feat, tmp_path): output_path = str(tmp_path.joinpath("test1.png")) graph_feature(simple_feat, to_file=output_path) assert os.path.isfile(output_path) def test_missing_file_extension(simple_feat): output_path = "test1" with pytest.raises(ValueError, match="Please use a file extension"): graph_feature(simple_feat, to_file=output_path) def test_invalid_format(simple_feat): output_path = "test1.xyz" with pytest.raises(ValueError, match="Unknown format"): graph_feature(simple_feat, to_file=output_path) def test_transform(es, trans_feat): feat = trans_feat graph = graph_feature(feat).source feat_name = 
feat.get_name() prim_node = "0_{}_year".format(feat_name) dataframe_table = "\u2605 customers (target)" prim_edge = 'customers:cancel_date -> "{}"'.format(prim_node) feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name) graph_components = [feat_name, dataframe_table, prim_node, prim_edge, feat_edge] for component in graph_components: assert component in graph matches = re.findall(r"customers \[label=<\n>", graph, re.DOTALL) assert len(matches) == 1 rows = re.findall(r"", matches[0], re.DOTALL) assert len(rows) == 3 to_match = ["customers", "cancel_date", feat_name] for match, row in zip(to_match, rows): assert match in row def test_html_symbols(es, tmp_path): output_path_template = str(tmp_path.joinpath("test{}.png")) value = IdentityFeature(es["log"].ww["value"]) gt = value > 5 lt = value < 5 ge = value >= 5 le = value <= 5 for i, feat in enumerate([gt, lt, ge, le]): output_path = output_path_template.format(i) graph = graph_feature(feat, to_file=output_path).source assert os.path.isfile(output_path) assert feat.get_name() in graph def test_groupby_transform(es): feat = GroupByTransformFeature( IdentityFeature(es["customers"].ww["age"]), CumMax, IdentityFeature(es["customers"].ww["cohort"]), ) graph = graph_feature(feat).source feat_name = feat.get_name() prim_node = "0_{}_cum_max".format(feat_name) groupby_node = "{}_groupby_customers--cohort".format(feat_name) dataframe_table = "\u2605 customers (target)" groupby_edge = 'customers:cohort -> "{}"'.format(groupby_node) groupby_input = 'customers:age -> "{}"'.format(groupby_node) prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node) feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name) graph_components = [ feat_name, prim_node, groupby_node, dataframe_table, groupby_edge, groupby_input, prim_input, feat_edge, ] for component in graph_components: assert component in graph matches = re.findall(r"customers \[label=<\n>", graph, re.DOTALL) assert len(matches) == 1 rows = re.findall(r"", 
matches[0], re.DOTALL) assert len(rows) == 4 assert dataframe_table in rows[0] assert feat_name in rows[-1] assert ("age" in rows[1] and "cohort" in rows[2]) or ( "age" in rows[2] and "cohort" in rows[1] ) def test_groupby_transform_direct_groupby(es): groupby = DirectFeature( IdentityFeature(es["cohorts"].ww["cohort_name"]), "customers", ) feat = GroupByTransformFeature( IdentityFeature(es["customers"].ww["age"]), CumMax, groupby, ) graph = graph_feature(feat).source groupby_name = groupby.get_name() feat_name = feat.get_name() join_node = "1_{}_join".format(groupby_name) prim_node = "0_{}_cum_max".format(feat_name) groupby_node = "{}_groupby_customers--{}".format(feat_name, groupby_name) customers_table = "\u2605 customers (target)" cohorts_table = "cohorts" join_groupby = '"{}" -> customers:cohort'.format(join_node) join_input = 'cohorts:cohort_name -> "{}"'.format(join_node) join_out_edge = '"{}" -> customers:"{}"'.format(join_node, groupby_name) groupby_edge = 'customers:"{}" -> "{}"'.format(groupby_name, groupby_node) groupby_input = 'customers:age -> "{}"'.format(groupby_node) prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node) feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name) graph_components = [ groupby_name, feat_name, join_node, prim_node, groupby_node, customers_table, cohorts_table, join_groupby, join_input, join_out_edge, groupby_edge, groupby_input, prim_input, feat_edge, ] for component in graph_components: assert component in graph dataframes = { "cohorts": [cohorts_table, "cohort_name"], "customers": [customers_table, "cohort", "age", groupby_name, feat_name], } for dataframe in dataframes: regex = r"{} \[label=<\n>".format(dataframe) matches = re.findall(regex, graph, re.DOTALL) assert len(matches) == 1 rows = re.findall(r"", matches[0], re.DOTALL) assert len(rows) == len(dataframes[dataframe]) for row in rows: matched = False for i in dataframes[dataframe]: if i in row: matched = True dataframes[dataframe].remove(i) break 
assert matched def test_aggregation(es): feat = AggregationFeature(IdentityFeature(es["log"].ww["id"]), "sessions", Count) graph = graph_feature(feat).source feat_name = feat.get_name() prim_node = "0_{}_count".format(feat_name) groupby_node = "{}_groupby_log--session_id".format(feat_name) sessions_table = "\u2605 sessions (target)" log_table = "log" groupby_edge = 'log:session_id -> "{}"'.format(groupby_node) groupby_input = 'log:id -> "{}"'.format(groupby_node) prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node) feat_edge = '"{}" -> sessions:"{}"'.format(prim_node, feat_name) graph_components = [ feat_name, prim_node, groupby_node, sessions_table, log_table, groupby_edge, groupby_input, prim_input, feat_edge, ] for component in graph_components: assert component in graph dataframes = { "log": [log_table, "id", "session_id"], "sessions": [sessions_table, feat_name], } for dataframe in dataframes: regex = r"{} \[label=<\n>".format(dataframe) matches = re.findall(regex, graph, re.DOTALL) assert len(matches) == 1 rows = re.findall(r"", matches[0], re.DOTALL) assert len(rows) == len(dataframes[dataframe]) for row in rows: matched = False for i in dataframes[dataframe]: if i in row: matched = True dataframes[dataframe].remove(i) break assert matched def test_multioutput(es): multioutput = AggregationFeature( IdentityFeature(es["log"].ww["zipcode"]), "sessions", NMostCommon, ) feat = FeatureOutputSlice(multioutput, 0) graph = graph_feature(feat).source feat_name = feat.get_name() prim_node = "0_{}_n_most_common".format(multioutput.get_name()) groupby_node = "{}_groupby_log--session_id".format(multioutput.get_name()) sessions_table = "\u2605 sessions (target)" log_table = "log" groupby_edge = 'log:session_id -> "{}"'.format(groupby_node) groupby_input = 'log:zipcode -> "{}"'.format(groupby_node) prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node) feat_edge = '"{}" -> sessions:"{}"'.format(prim_node, feat_name) graph_components = [ feat_name, prim_node, 
groupby_node, sessions_table, log_table, groupby_edge, groupby_input, prim_input, feat_edge, ] for component in graph_components: assert component in graph dataframes = { "log": [log_table, "zipcode", "session_id"], "sessions": [sessions_table, feat_name], } for dataframe in dataframes: regex = r"{} \[label=<\n>".format(dataframe) matches = re.findall(regex, graph, re.DOTALL) assert len(matches) == 1 rows = re.findall(r"", matches[0], re.DOTALL) assert len(rows) == len(dataframes[dataframe]) for row in rows: matched = False for i in dataframes[dataframe]: if i in row: matched = True dataframes[dataframe].remove(i) break assert matched def test_direct(es): d1 = DirectFeature( IdentityFeature(es["customers"].ww["engagement_level"]), "sessions", ) d2 = DirectFeature(d1, "log") graph = graph_feature(d2).source d1_name = d1.get_name() d2_name = d2.get_name() prim_node1 = "1_{}_join".format(d1_name) prim_node2 = "0_{}_join".format(d2_name) log_table = "\u2605 log (target)" sessions_table = "sessions" customers_table = "customers" groupby_edge1 = '"{}" -> sessions:customer_id'.format(prim_node1) groupby_edge2 = '"{}" -> log:session_id'.format(prim_node2) groupby_input1 = 'customers:engagement_level -> "{}"'.format(prim_node1) groupby_input2 = 'sessions:"{}" -> "{}"'.format(d1_name, prim_node2) d1_edge = '"{}" -> sessions:"{}"'.format(prim_node1, d1_name) d2_edge = '"{}" -> log:"{}"'.format(prim_node2, d2_name) graph_components = [ d1_name, d2_name, prim_node1, prim_node2, log_table, sessions_table, customers_table, groupby_edge1, groupby_edge2, groupby_input1, groupby_input2, d1_edge, d2_edge, ] for component in graph_components: assert component in graph dataframes = { "customers": [customers_table, "engagement_level"], "sessions": [sessions_table, "customer_id", d1_name], "log": [log_table, "session_id", d2_name], } for dataframe in dataframes: regex = r"{} \[label=<\n>".format(dataframe) matches = re.findall(regex, graph, re.DOTALL) assert len(matches) == 1 rows = 
re.findall(r"", matches[0], re.DOTALL) assert len(rows) == len(dataframes[dataframe]) for row in rows: matched = False for i in dataframes[dataframe]: if i in row: matched = True dataframes[dataframe].remove(i) break assert matched def test_stacked(es, trans_feat): stacked = AggregationFeature(trans_feat, "cohorts", Mode) graph = graph_feature(stacked).source feat_name = stacked.get_name() intermediate_name = trans_feat.get_name() agg_primitive = "0_{}_mode".format(feat_name) trans_primitive = "1_{}_year".format(intermediate_name) groupby_node = "{}_groupby_customers--cohort".format(feat_name) trans_prim_edge = 'customers:cancel_date -> "{}"'.format(trans_primitive) intermediate_edge = '"{}" -> customers:"{}"'.format( trans_primitive, intermediate_name, ) groupby_edge = 'customers:cohort -> "{}"'.format(groupby_node) groupby_input = 'customers:"{}" -> "{}"'.format(intermediate_name, groupby_node) agg_input = '"{}" -> "{}"'.format(groupby_node, agg_primitive) feat_edge = '"{}" -> cohorts:"{}"'.format(agg_primitive, feat_name) graph_components = [ feat_name, intermediate_name, agg_primitive, trans_primitive, groupby_node, trans_prim_edge, intermediate_edge, groupby_edge, groupby_input, agg_input, feat_edge, ] for component in graph_components: assert component in graph agg_primitive = agg_primitive.replace("(", "\\(").replace(")", "\\)") agg_node = re.findall('"{}" \\[label.*'.format(agg_primitive), graph) assert len(agg_node) == 1 assert "Step 2" in agg_node[0] trans_primitive = trans_primitive.replace("(", "\\(").replace(")", "\\)") trans_node = re.findall('"{}" \\[label.*'.format(trans_primitive), graph) assert len(trans_node) == 1 assert "Step 1" in trans_node[0] def test_description_auto_caption(trans_feat): default_graph = graph_feature(trans_feat, description=True).source default_label = 'label="The year of the \\"cancel_date\\"."' assert default_label in default_graph def test_description_auto_caption_metadata(trans_feat, tmp_path): feature_descriptions = 
{"customers: cancel_date": "the date the customer cancelled"} primitive_templates = {"year": "the year that {} occurred"} metadata_graph = graph_feature( trans_feat, description=True, feature_descriptions=feature_descriptions, primitive_templates=primitive_templates, ).source metadata_label = 'label="The year that the date the customer cancelled occurred."' assert metadata_label in metadata_graph metadata = { "feature_descriptions": feature_descriptions, "primitive_templates": primitive_templates, } metadata_path = os.path.join(tmp_path, "description_metadata.json") with open(metadata_path, "w") as f: json.dump(metadata, f) json_metadata_graph = graph_feature( trans_feat, description=True, metadata_file=metadata_path, ).source assert metadata_label in json_metadata_graph def test_description_custom_caption(trans_feat): custom_description = "A custom feature description" custom_description_graph = graph_feature( trans_feat, description=custom_description, ).source custom_description_label = 'label="A custom feature description"' assert custom_description_label in custom_description_graph ================================================ FILE: featuretools/tests/primitive_tests/test_features_deserializer.py ================================================ import logging from unittest.mock import patch import pandas as pd import pytest from featuretools import ( AggregationFeature, Feature, IdentityFeature, TransformFeature, __version__, ) from featuretools.feature_base.features_deserializer import FeaturesDeserializer from featuretools.primitives import ( Count, Max, MultiplyNumericScalar, NMostCommon, NumberOfCommonWords, NumUnique, ) from featuretools.primitives.utils import serialize_primitive from featuretools.utils.schema_utils import FEATURES_SCHEMA_VERSION def test_single_feature(es): feature = IdentityFeature(es["log"].ww["value"]) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), 
"feature_list": [feature.unique_name()], "feature_definitions": {feature.unique_name(): feature.to_dictionary()}, "primitive_definitions": {}, } deserializer = FeaturesDeserializer(dictionary) expected = [feature] assert expected == deserializer.to_list() def test_multioutput_feature(es): value = IdentityFeature(es["log"].ww["product_id"]) threecommon = NMostCommon() num_unique = NumUnique() tc = Feature(value, parent_dataframe_name="sessions", primitive=threecommon) features = [tc, value] for i in range(3): features.append( Feature( tc[i], parent_dataframe_name="customers", primitive=num_unique, ), ) features.append(tc[i]) flist = [feat.unique_name() for feat in features] fd = [feat.to_dictionary() for feat in features] fdict = dict(zip(flist, fd)) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": flist, "feature_definitions": fdict, } dictionary["primitive_definitions"] = { "0": serialize_primitive(threecommon), "1": serialize_primitive(num_unique), } dictionary["feature_definitions"][flist[0]]["arguments"]["primitive"] = "0" dictionary["feature_definitions"][flist[2]]["arguments"]["primitive"] = "1" dictionary["feature_definitions"][flist[4]]["arguments"]["primitive"] = "1" dictionary["feature_definitions"][flist[6]]["arguments"]["primitive"] = "1" deserializer = FeaturesDeserializer(dictionary).to_list() for i in range(len(features)): assert features[i].unique_name() == deserializer[i].unique_name() def test_base_features_in_list(es): max_primitive = Max() value = IdentityFeature(es["log"].ww["value"]) max_feat = AggregationFeature(value, "sessions", max_primitive) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feat.unique_name(), value.unique_name()], "feature_definitions": { max_feat.unique_name(): max_feat.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } 
dictionary["primitive_definitions"] = {"0": serialize_primitive(max_primitive)} dictionary["feature_definitions"][max_feat.unique_name()]["arguments"][ "primitive" ] = "0" deserializer = FeaturesDeserializer(dictionary) expected = [max_feat, value] assert expected == deserializer.to_list() def test_base_features_not_in_list(es): max_primitive = Max() mult_primitive = MultiplyNumericScalar(value=2) value = IdentityFeature(es["log"].ww["value"]) value_x2 = TransformFeature(value, mult_primitive) max_feat = AggregationFeature(value_x2, "sessions", max_primitive) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feat.unique_name()], "feature_definitions": { max_feat.unique_name(): max_feat.to_dictionary(), value_x2.unique_name(): value_x2.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } dictionary["primitive_definitions"] = { "0": serialize_primitive(max_primitive), "1": serialize_primitive(mult_primitive), } dictionary["feature_definitions"][max_feat.unique_name()]["arguments"][ "primitive" ] = "0" dictionary["feature_definitions"][value_x2.unique_name()]["arguments"][ "primitive" ] = "1" deserializer = FeaturesDeserializer(dictionary) expected = [max_feat] assert expected == deserializer.to_list() @patch("featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION", "1.1.1") @pytest.mark.parametrize( "hardcoded_schema_version, warns", [("2.1.1", True), ("1.2.1", True), ("1.1.2", True), ("1.0.2", False)], ) def test_later_schema_version(es, caplog, hardcoded_schema_version, warns): def test_version(version, warns): if warns: warning_text = ( "The schema version of the saved features" "(%s) is greater than the latest supported (%s). " "You may need to upgrade featuretools. Attempting to load features ..." 
% (version, "1.1.1") ) else: warning_text = None _check_schema_version(version, es, warning_text, caplog, "warn") test_version(hardcoded_schema_version, warns) @patch("featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION", "1.1.1") @pytest.mark.parametrize( "hardcoded_schema_version, warns", [("0.1.1", True), ("1.0.1", False), ("1.1.0", False)], ) def test_earlier_schema_version(es, caplog, hardcoded_schema_version, warns): def test_version(version, warns): if warns: warning_text = ( "The schema version of the saved features" "(%s) is no longer supported by this version " "of featuretools. Attempting to load features ..." % version ) else: warning_text = None _check_schema_version(version, es, warning_text, caplog, "log") test_version(hardcoded_schema_version, warns) def test_unknown_feature_type(es): dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": ["feature_1"], "feature_definitions": { "feature_1": {"type": "FakeFeature", "dependencies": [], "arguments": {}}, }, "primitive_definitions": {}, } deserializer = FeaturesDeserializer(dictionary) with pytest.raises(RuntimeError, match='Unrecognized feature type "FakeFeature"'): deserializer.to_list() def test_unknown_primitive_type(es): value = IdentityFeature(es["log"].ww["value"]) max_feat = AggregationFeature(value, "sessions", Max) primitive_dict = serialize_primitive(Max()) primitive_dict["type"] = "FakePrimitive" dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feat.unique_name(), value.unique_name()], "feature_definitions": { max_feat.unique_name(): max_feat.to_dictionary(), value.unique_name(): value.to_dictionary(), }, "primitive_definitions": {"0": primitive_dict}, } with pytest.raises(RuntimeError) as excinfo: FeaturesDeserializer(dictionary) error_text = 'Primitive "FakePrimitive" in module "%s" not found' % Max.__module__ assert 
error_text == str(excinfo.value) def test_unknown_primitive_module(es): value = IdentityFeature(es["log"].ww["value"]) max_feat = AggregationFeature(value, "sessions", Max) primitive_dict = serialize_primitive(Max()) primitive_dict["module"] = "fake.module" dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feat.unique_name(), value.unique_name()], "feature_definitions": { max_feat.unique_name(): max_feat.to_dictionary(), value.unique_name(): value.to_dictionary(), }, "primitive_definitions": {"0": primitive_dict}, } with pytest.raises(RuntimeError) as excinfo: FeaturesDeserializer(dictionary) error_text = 'Primitive "Max" in module "fake.module" not found' assert error_text == str(excinfo.value) def test_feature_use_previous_pd_timedelta(es): value = IdentityFeature(es["log"].ww["id"]) td = pd.Timedelta(12, "W") count_primitive = Count() count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=td, ) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)} dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" deserializer = FeaturesDeserializer(dictionary) expected = [count_feature, value] assert expected == deserializer.to_list() def test_feature_use_previous_pd_dateoffset(es): value = IdentityFeature(es["log"].ww["id"]) do = pd.DateOffset(months=3) count_primitive = Count() count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=do, ) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": 
es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)} dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" deserializer = FeaturesDeserializer(dictionary) expected = [count_feature, value] assert expected == deserializer.to_list() value = IdentityFeature(es["log"].ww["id"]) do = pd.DateOffset(months=3, days=2, minutes=30) count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=do, ) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)} dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" deserializer = FeaturesDeserializer(dictionary) expected = [count_feature, value] assert expected == deserializer.to_list() def test_word_set_in_number_of_common_words_is_deserialized_back_into_a_set(es): id_feat = IdentityFeature(es["log"].ww["comments"]) number_of_common_words = NumberOfCommonWords(word_set={"hello", "my"}) transform_feat = TransformFeature(id_feat, number_of_common_words) dictionary = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [id_feat.unique_name(), transform_feat.unique_name()], "feature_definitions": { id_feat.unique_name(): id_feat.to_dictionary(), transform_feat.unique_name(): transform_feat.to_dictionary(), }, "primitive_definitions": {"0": 
serialize_primitive(number_of_common_words)}, } dictionary["feature_definitions"][transform_feat.unique_name()]["arguments"][ "primitive" ] = "0" deserializer = FeaturesDeserializer(dictionary) assert isinstance( deserializer.features_dict["primitive_definitions"]["0"]["arguments"][ "word_set" ], set, ) def _check_schema_version(version, es, warning_text, caplog, warning_type=None): dictionary = { "ft_version": __version__, "schema_version": version, "entityset": es.to_dictionary(), "feature_list": [], "feature_definitions": {}, "primitive_definitions": {}, } if warning_type == "warn" and warning_text: with pytest.warns(UserWarning) as record: FeaturesDeserializer(dictionary) assert record[0].message.args[0] == warning_text elif warning_type == "log": logger = logging.getLogger("featuretools") logger.propagate = True FeaturesDeserializer(dictionary) if warning_text: assert warning_text in caplog.text else: assert not len(caplog.text) logger.propagate = False ================================================ FILE: featuretools/tests/primitive_tests/test_features_serializer.py ================================================ import pandas as pd from featuretools import ( AggregationFeature, Feature, IdentityFeature, TransformFeature, __version__, ) from featuretools.entityset.deserialize import description_to_entityset from featuretools.feature_base.features_serializer import FeaturesSerializer from featuretools.primitives import ( Count, Max, MultiplyNumericScalar, NMostCommon, NumUnique, ) from featuretools.primitives.utils import serialize_primitive from featuretools.version import FEATURES_SCHEMA_VERSION def test_single_feature(es): feature = IdentityFeature(es["log"].ww["value"]) serializer = FeaturesSerializer([feature]) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [feature.unique_name()], "feature_definitions": {feature.unique_name(): feature.to_dictionary()}, 
"primitive_definitions": {}, } _compare_feature_dicts(expected, serializer.to_dict()) def test_base_features_in_list(es): value = IdentityFeature(es["log"].ww["value"]) max_feature = AggregationFeature(value, "sessions", Max) features = [max_feature, value] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feature.unique_name(), value.unique_name()], "feature_definitions": { max_feature.unique_name(): max_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(max_feature.primitive), } expected["feature_definitions"][max_feature.unique_name()]["arguments"][ "primitive" ] = "0" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def test_multi_output_features(es): product_id = IdentityFeature(es["log"].ww["product_id"]) threecommon = NMostCommon() num_unique = NumUnique() tc = Feature(product_id, parent_dataframe_name="sessions", primitive=threecommon) features = [tc, product_id] for i in range(3): features.append( Feature( tc[i], parent_dataframe_name="customers", primitive=num_unique, ), ) features.append(tc[i]) serializer = FeaturesSerializer(features) flist = [feat.unique_name() for feat in features] fd = [feat.to_dictionary() for feat in features] fdict = dict(zip(flist, fd)) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": flist, "feature_definitions": fdict, } expected["primitive_definitions"] = { "0": serialize_primitive(tc.primitive), "1": serialize_primitive(features[2].primitive), } expected["feature_definitions"][flist[0]]["arguments"]["primitive"] = "0" expected["feature_definitions"][flist[2]]["arguments"]["primitive"] = "1" expected["feature_definitions"][flist[4]]["arguments"]["primitive"] = "1" 
expected["feature_definitions"][flist[6]]["arguments"]["primitive"] = "1" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def test_base_features_not_in_list(es): max_primitive = Max() mult_primitive = MultiplyNumericScalar(value=2) value = IdentityFeature(es["log"].ww["value"]) value_x2 = TransformFeature(value, mult_primitive) max_feature = AggregationFeature(value_x2, "sessions", max_primitive) features = [max_feature] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feature.unique_name()], "feature_definitions": { max_feature.unique_name(): max_feature.to_dictionary(), value_x2.unique_name(): value_x2.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(max_feature.primitive), "1": serialize_primitive(value_x2.primitive), } expected["feature_definitions"][max_feature.unique_name()]["arguments"][ "primitive" ] = "0" expected["feature_definitions"][value_x2.unique_name()]["arguments"][ "primitive" ] = "1" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def test_where_feature_dependency(es): max_primitive = Max() value = IdentityFeature(es["log"].ww["value"]) is_purchased = IdentityFeature(es["log"].ww["purchased"]) max_feature = AggregationFeature( value, "sessions", max_primitive, where=is_purchased, ) features = [max_feature] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [max_feature.unique_name()], "feature_definitions": { max_feature.unique_name(): max_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), is_purchased.unique_name(): is_purchased.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(max_feature.primitive), } 
expected["feature_definitions"][max_feature.unique_name()]["arguments"][ "primitive" ] = "0" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def test_feature_use_previous_pd_timedelta(es): value = IdentityFeature(es["log"].ww["id"]) td = pd.Timedelta(12, "W") count_primitive = Count() count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=td, ) features = [count_feature, value] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(count_feature.primitive), } expected["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def test_feature_use_previous_pd_dateoffset(es): value = IdentityFeature(es["log"].ww["id"]) do = pd.DateOffset(months=3) count_primitive = Count() count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=do, ) features = [count_feature, value] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(count_feature.primitive), } expected["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) value = IdentityFeature(es["log"].ww["id"]) do = 
pd.DateOffset(months=3, days=2, minutes=30) count_feature = AggregationFeature( value, "customers", count_primitive, use_previous=do, ) features = [count_feature, value] serializer = FeaturesSerializer(features) expected = { "ft_version": __version__, "schema_version": FEATURES_SCHEMA_VERSION, "entityset": es.to_dictionary(), "feature_list": [count_feature.unique_name(), value.unique_name()], "feature_definitions": { count_feature.unique_name(): count_feature.to_dictionary(), value.unique_name(): value.to_dictionary(), }, } expected["primitive_definitions"] = { "0": serialize_primitive(count_feature.primitive), } expected["feature_definitions"][count_feature.unique_name()]["arguments"][ "primitive" ] = "0" actual = serializer.to_dict() _compare_feature_dicts(expected, actual) def _compare_feature_dicts(a_dict, b_dict): # We can't compare entityset dictionaries because column lists are not # guaranteed to be in the same order. es_a = description_to_entityset(a_dict.pop("entityset")) es_b = description_to_entityset(b_dict.pop("entityset")) assert es_a == es_b assert a_dict == b_dict ================================================ FILE: featuretools/tests/primitive_tests/test_groupby_transform_primitives.py ================================================ import numpy as np import pandas as pd from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools import ( Feature, GroupByTransformFeature, IdentityFeature, calculate_feature_matrix, feature_base, ) from featuretools.computational_backends.feature_set import FeatureSet from featuretools.computational_backends.feature_set_calculator import ( FeatureSetCalculator, ) from featuretools.primitives import CumCount, CumMax, CumMean, CumMin, CumSum, Last from featuretools.primitives.base import TransformPrimitive from featuretools.synthesis import dfs from featuretools.tests.testing_utils import feature_with_name class TestCumCount: primitive = CumCount def 
test_order(self): g = pd.Series(["a", "b", "a"]) answers = ([1, 2], [1]) function = self.primitive().get_function() for (_, group), answer in zip(g.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_regular(self): g = pd.Series(["a", "b", "a", "c", "d", "b"]) answers = ([1, 2], [1, 2], [1], [1]) function = self.primitive().get_function() for (_, group), answer in zip(g.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_discrete(self): g = pd.Series(["a", "b", "a", "c", "d", "b"]) answers = ([1, 2], [1, 2], [1], [1]) function = self.primitive().get_function() for (_, group), answer in zip(g.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) class TestCumSum: primitive = CumSum def test_order(self): v = pd.Series([1, 2, 2]) g = pd.Series(["a", "b", "a"]) answers = ([1, 3], [2]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_regular(self): v = pd.Series([101, 102, 103, 104, 105, 106]) g = pd.Series(["a", "b", "a", "c", "d", "b"]) answers = ([101, 204], [102, 208], [104], [105]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) class TestCumMean: primitive = CumMean def test_order(self): v = pd.Series([1, 2, 2]) g = pd.Series(["a", "b", "a"]) answers = ([1, 1.5], [2]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_regular(self): v = pd.Series([101, 102, 103, 104, 105, 106]) g = pd.Series(["a", "b", "a", "c", "d", "b"]) answers = ([101, 102], [102, 104], [104], [105]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) class TestCumMax: primitive = 
CumMax def test_order(self): v = pd.Series([1, 2, 2]) g = pd.Series(["a", "b", "a"]) answers = ([1, 2], [2]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_regular(self): v = pd.Series([101, 102, 103, 104, 105, 106]) g = pd.Series(["a", "b", "a", "c", "d", "b"]) answers = ([101, 103], [102, 106], [104], [105]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) class TestCumMin: primitive = CumMin def test_order(self): v = pd.Series([1, 2, 2]) g = pd.Series(["a", "b", "a"]) answers = ([1, 1], [2]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_regular(self): v = pd.Series([101, 102, 103, 104, 105, 106, 100]) g = pd.Series(["a", "b", "a", "c", "d", "b", "a"]) answers = ([101, 101, 100], [102, 102], [104], [105]) function = self.primitive().get_function() for (_, group), answer in zip(v.groupby(g), answers): np.testing.assert_array_equal(function(group), answer) def test_cum_sum(es): log_value_feat = IdentityFeature(es["log"].ww["value"]) dfeat = Feature( IdentityFeature(es["sessions"].ww["device_type"]), dataframe_name="log", ) cum_sum = Feature(log_value_feat, groupby=dfeat, primitive=CumSum) features = [cum_sum] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) cvalues = df[cum_sum.get_name()].values assert len(cvalues) == 15 cum_sum_values = [0, 5, 15, 30, 50, 0, 1, 3, 6, 6, 50, 55, 55, 62, 76] for i, v in enumerate(cum_sum_values): assert v == cvalues[i] def test_cum_min(es): log_value_feat = IdentityFeature(es["log"].ww["value"]) cum_min = Feature( log_value_feat, groupby=IdentityFeature(es["log"].ww["session_id"]), primitive=CumMin, ) features = [cum_min] df = calculate_feature_matrix( entityset=es, 
features=features, instance_ids=range(15), ) cvalues = df[cum_min.get_name()].values assert len(cvalues) == 15 cum_min_values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] for i, v in enumerate(cum_min_values): assert v == cvalues[i] def test_cum_max(es): log_value_feat = IdentityFeature(es["log"].ww["value"]) cum_max = Feature( log_value_feat, groupby=IdentityFeature(es["log"].ww["session_id"]), primitive=CumMax, ) features = [cum_max] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) cvalues = df[cum_max.get_name()].values assert len(cvalues) == 15 cum_max_values = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14] for i, v in enumerate(cum_max_values): assert v == cvalues[i] def test_cum_sum_group_on_nan(es): log_value_feat = IdentityFeature(es["log"].ww["value"]) es["log"]["product_id"] = ( ["coke zero"] * 3 + ["car"] * 2 + ["toothpaste"] * 3 + ["brown bag"] * 2 + ["shoes"] + [np.nan] * 4 + ["coke_zero"] * 2 ) es["log"]["value"][16] = 10 cum_sum = Feature( log_value_feat, groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=CumSum, ) features = [cum_sum] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(17), ) cvalues = df[cum_sum.get_name()].values assert len(cvalues) == 17 cum_sum_values = [ 0, 5, 15, 15, 35, 0, 1, 3, 3, 3, 0, np.nan, np.nan, np.nan, np.nan, np.nan, 10, ] assert len(cvalues) == len(cum_sum_values) for i, v in enumerate(cum_sum_values): if np.isnan(v): assert np.isnan(cvalues[i]) else: assert v == cvalues[i] def test_cum_sum_numpy_group_on_nan(es): class CumSumNumpy(TransformPrimitive): """Returns the cumulative sum after grouping""" name = "cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True def get_function(self): def cum_sum(values): return values.cumsum().values return cum_sum log_value_feat = IdentityFeature(es["log"].ww["value"]) es["log"]["product_id"] = ( 
["coke zero"] * 3 + ["car"] * 2 + ["toothpaste"] * 3 + ["brown bag"] * 2 + ["shoes"] + [np.nan] * 4 + ["coke_zero"] * 2 ) es["log"]["value"][16] = 10 cum_sum = Feature( log_value_feat, groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=CumSumNumpy, ) assert cum_sum.get_name() == "CUM_SUM(value) by product_id" features = [cum_sum] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(17), ) cvalues = df[cum_sum.get_name()].values assert len(cvalues) == 17 cum_sum_values = [ 0, 5, 15, 15, 35, 0, 1, 3, 3, 3, 0, np.nan, np.nan, np.nan, np.nan, np.nan, 10, ] assert len(cvalues) == len(cum_sum_values) for i, v in enumerate(cum_sum_values): if np.isnan(v): assert np.isnan(cvalues[i]) else: assert v == cvalues[i] def test_cum_handles_uses_full_dataframe(es): def check(feature): feature_set = FeatureSet([feature]) calculator = FeatureSetCalculator( es, feature_set=feature_set, time_last=None, ) df_1 = calculator.run(np.array([0, 1, 2])) df_2 = calculator.run(np.array([2, 4])) # check that the value for instance id 2 matches assert (df_2.loc[2] == df_1.loc[2]).all() for primitive in [CumSum, CumMean, CumMax, CumMin]: check( Feature( es["log"].ww["value"], groupby=IdentityFeature(es["log"].ww["session_id"]), primitive=primitive, ), ) check( Feature( es["log"].ww["product_id"], groupby=Feature(es["log"].ww["product_id"]), primitive=CumCount, ), ) def test_cum_mean(es): log_value_feat = IdentityFeature(es["log"].ww["value"]) cum_mean = Feature( log_value_feat, groupby=IdentityFeature(es["log"].ww["session_id"]), primitive=CumMean, ) features = [cum_mean] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) cvalues = df[cum_mean.get_name()].values assert len(cvalues) == 15 cum_mean_values = [0, 2.5, 5, 7.5, 10, 0, 0.5, 1, 1.5, 0, 0, 2.5, 0, 3.5, 7] for i, v in enumerate(cum_mean_values): assert v == cvalues[i] def test_cum_count(es): cum_count = Feature( IdentityFeature(es["log"].ww["product_id"]), 
groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=CumCount, ) features = [cum_count] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) cvalues = df[cum_count.get_name()].values assert len(cvalues) == 15 cum_count_values = [1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 4, 5, 6, 7] for i, v in enumerate(cum_count_values): assert v == cvalues[i] def test_rename(es): cum_count = Feature( IdentityFeature(es["log"].ww["product_id"]), groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=CumCount, ) copy_feat = cum_count.rename("rename_test") assert cum_count.unique_name() != copy_feat.unique_name() assert cum_count.get_name() != copy_feat.get_name() assert all( [ x.generate_name() == y.generate_name() for x, y in zip(cum_count.base_features, copy_feat.base_features) ], ) assert cum_count.dataframe_name == copy_feat.dataframe_name def test_groupby_no_data(es): cum_count = Feature( IdentityFeature(es["log"].ww["product_id"]), groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=CumCount, ) last_feat = Feature(cum_count, parent_dataframe_name="customers", primitive=Last) df = calculate_feature_matrix( entityset=es, features=[last_feat], cutoff_time=pd.Timestamp("2011-04-08"), ) cvalues = df[last_feat.get_name()].values assert len(cvalues) == 2 assert all([pd.isnull(value) for value in cvalues]) def test_groupby_uses_calc_time(es): def projected_amount_left(amount, timestamp, time=None): # cumulative sum of amount, with timedelta * constant subtracted delta = time - timestamp delta_seconds = delta / np.timedelta64(1, "s") return amount.cumsum() - (delta_seconds) class ProjectedAmountRemaining(TransformPrimitive): name = "projected_amount_remaining" uses_calc_time = True input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) uses_full_dataframe = True def get_function(self): return 
projected_amount_left time_since_product = GroupByTransformFeature( [ IdentityFeature(es["log"].ww["value"]), IdentityFeature(es["log"].ww["datetime"]), ], groupby=IdentityFeature(es["log"].ww["product_id"]), primitive=ProjectedAmountRemaining, ) df = calculate_feature_matrix( entityset=es, features=[time_since_product], cutoff_time=pd.Timestamp("2011-04-10 11:10:30"), ) answers = [ -88830, -88819, -88803, -88797, -88771, -88770, -88760, -88749, -88740, -88227, -1830, -1809, -1750, -1740, -1723, np.nan, np.nan, ] for x, y in zip(df[time_since_product.get_name()], answers): assert (pd.isnull(x) and pd.isnull(y)) or x == y def test_groupby_multi_output_stacking(es): class TestTime(TransformPrimitive): name = "test_time" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 6 fl = dfs( entityset=es, target_dataframe_name="sessions", agg_primitives=["sum"], groupby_trans_primitives=[TestTime], features_only=True, max_depth=4, ) for i in range(6): f = "SUM(log.TEST_TIME(datetime)[%d] by product_id)" % i assert feature_with_name(fl, f) assert ("customers.SUM(log.TEST_TIME(datetime)[%d] by session_id)" % i) in fl def test_serialization(es): value = IdentityFeature(es["log"].ww["value"]) zipcode = IdentityFeature(es["log"].ww["zipcode"]) primitive = CumSum() groupby = feature_base.GroupByTransformFeature(value, primitive, zipcode) dictionary = { "name": "CUM_SUM(value) by zipcode", "base_features": [value.unique_name()], "primitive": primitive, "groupby": zipcode.unique_name(), } assert dictionary == groupby.get_arguments() dependencies = { value.unique_name(): value, zipcode.unique_name(): zipcode, } assert groupby == feature_base.GroupByTransformFeature.from_dictionary( dictionary, es, dependencies, primitive, ) def test_groupby_with_multioutput_primitive(es): class MultiCumSum(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = 
ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 def get_function(self): def multi_cum_sum(x): return x.cumsum(), x.cummax(), x.cummin() return multi_cum_sum fm, _ = dfs( entityset=es, target_dataframe_name="customers", trans_primitives=[], agg_primitives=[], groupby_trans_primitives=[MultiCumSum, CumSum, CumMax, CumMin], ) # Calculate output in a separate DFS call to make sure the multi-output code # does not alter any values fm2, _ = dfs( entityset=es, target_dataframe_name="customers", trans_primitives=[], agg_primitives=[], groupby_trans_primitives=[CumSum, CumMax, CumMin], ) answer_cols = [ ["CUM_SUM(age) by cohort", "CUM_SUM(age) by région_id"], ["CUM_MAX(age) by cohort", "CUM_MAX(age) by région_id"], ["CUM_MIN(age) by cohort", "CUM_MIN(age) by région_id"], ] for i in range(3): # Check that multi-output gives correct answers f = "MULTI_CUM_SUM(age)[%d] by cohort" % i assert f in fm.columns for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values): assert x == y f = "MULTI_CUM_SUM(age)[%d] by région_id" % i assert f in fm.columns for x, y in zip(fm[f].values, fm[answer_cols[i][1]].values): assert x == y # Verify single output results are unchanged by inclusion of # multi-output primitive for x, y in zip(fm[answer_cols[i][0]], fm2[answer_cols[i][0]]): assert x == y for x, y in zip(fm[answer_cols[i][1]], fm2[answer_cols[i][1]]): assert x == y def test_groupby_with_multioutput_primitive_custom_names(es): class MultiCumSum(TransformPrimitive): name = "multi_cum_sum" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 def get_function(self): def multi_cum_sum(x): return x.cumsum(), x.cummax(), x.cummin() return multi_cum_sum def generate_names(primitive, base_feature_names): return ["CUSTOM_SUM", "CUSTOM_MAX", "CUSTOM_MIN"] fm, _ = dfs( entityset=es, target_dataframe_name="customers", trans_primitives=[], agg_primitives=[], groupby_trans_primitives=[MultiCumSum, 
CumSum, CumMax, CumMin], ) answer_cols = [ ["CUM_SUM(age) by cohort", "CUM_SUM(age) by région_id"], ["CUM_MAX(age) by cohort", "CUM_MAX(age) by région_id"], ["CUM_MIN(age) by cohort", "CUM_MIN(age) by région_id"], ] expected_names = [ ["CUSTOM_SUM by cohort", "CUSTOM_SUM by région_id"], ["CUSTOM_MAX by cohort", "CUSTOM_MAX by région_id"], ["CUSTOM_MIN by cohort", "CUSTOM_MIN by région_id"], ] for i in range(3): f = expected_names[i][0] assert f in fm.columns for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values): assert x == y f = expected_names[i][1] assert f in fm.columns for x, y in zip(fm[f].values, fm[answer_cols[i][1]].values): assert x == y ================================================ FILE: featuretools/tests/primitive_tests/test_identity_features.py ================================================ from featuretools import IdentityFeature from featuretools.primitives.utils import PrimitivesDeserializer def test_relationship_path(es): value = IdentityFeature(es["log"].ww["value"]) assert len(value.relationship_path) == 0 def test_serialization(es): value = IdentityFeature(es["log"].ww["value"]) dictionary = { "name": "value", "column_name": "value", "dataframe_name": "log", } assert dictionary == value.get_arguments() assert value == IdentityFeature.from_dictionary( dictionary, es, {}, PrimitivesDeserializer, ) ================================================ FILE: featuretools/tests/primitive_tests/test_overrides.py ================================================ from featuretools import Feature, calculate_feature_matrix from featuretools.primitives import ( AddNumeric, AddNumericScalar, Count, DivideByFeature, DivideNumeric, DivideNumericScalar, Equal, EqualScalar, GreaterThan, GreaterThanEqualTo, GreaterThanEqualToScalar, GreaterThanScalar, LessThan, LessThanEqualTo, LessThanEqualToScalar, LessThanScalar, ModuloByFeature, ModuloNumeric, ModuloNumericScalar, MultiplyNumeric, MultiplyNumericScalar, Negate, NotEqual, NotEqualScalar, 
ScalarSubtractNumericFeature, SubtractNumeric, SubtractNumericScalar, Sum, ) def test_overrides(es): value = Feature(es["log"].ww["value"]) value2 = Feature(es["log"].ww["value_2"]) feats = [ AddNumeric, SubtractNumeric, MultiplyNumeric, DivideNumeric, ModuloNumeric, GreaterThan, LessThan, Equal, NotEqual, GreaterThanEqualTo, LessThanEqualTo, ] assert Feature(value, primitive=Negate).unique_name() == (-value).unique_name() compares = [(value, value), (value, value2)] overrides = [ value + value, value - value, value * value, value / value, value % value, value > value, value < value, value == value, value != value, value >= value, value <= value, value + value2, value - value2, value * value2, value / value2, value % value2, value > value2, value < value2, value == value2, value != value2, value >= value2, value <= value2, ] for left, right in compares: for feat in feats: f = Feature([left, right], primitive=feat) o = overrides.pop(0) assert o.unique_name() == f.unique_name() def test_override_boolean(es): count = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) count_lo = Feature(count, primitive=GreaterThanScalar(1)) count_hi = Feature(count, primitive=LessThanScalar(10)) to_test = [[True, True, True], [True, True, False], [False, False, True]] features = [] features.append(count_lo.OR(count_hi)) features.append(count_lo.AND(count_hi)) features.append(~(count_lo.AND(count_hi))) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test def test_scalar_overrides(es): value = Feature(es["log"].ww["value"]) feats = [ AddNumericScalar, SubtractNumericScalar, MultiplyNumericScalar, DivideNumericScalar, ModuloNumericScalar, GreaterThanScalar, LessThanScalar, EqualScalar, NotEqualScalar, GreaterThanEqualToScalar, LessThanEqualToScalar, ] overrides = [ value + 2, value - 2, value * 2, value / 2, value % 2, value > 2, 
value < 2, value == 2, value != 2, value >= 2, value <= 2, ] for feat in feats: f = Feature(value, primitive=feat(2)) o = overrides.pop(0) assert o.unique_name() == f.unique_name() value2 = Feature(es["log"].ww["value_2"]) reverse_feats = [ AddNumericScalar, ScalarSubtractNumericFeature, MultiplyNumericScalar, DivideByFeature, ModuloByFeature, GreaterThanScalar, LessThanScalar, EqualScalar, NotEqualScalar, GreaterThanEqualToScalar, LessThanEqualToScalar, ] reverse_overrides = [ 2 + value2, 2 - value2, 2 * value2, 2 / value2, 2 % value2, 2 < value2, 2 > value2, 2 == value2, 2 != value2, 2 <= value2, 2 >= value2, ] for feat in reverse_feats: f = Feature(value2, primitive=feat(2)) o = reverse_overrides.pop(0) assert o.unique_name() == f.unique_name() def test_override_cmp_from_column(es): count_lo = Feature(es["log"].ww["value"]) > 1 to_test = [False, True, True] features = [count_lo] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2], ) v = df[count_lo.get_name()].tolist() for i, test in enumerate(to_test): assert v[i] == test def test_override_cmp(es): count = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) _sum = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) gt_lo = count > 1 gt_other = count > _sum ge_lo = count >= 1 ge_other = count >= _sum lt_hi = count < 10 lt_other = count < _sum le_hi = count <= 10 le_other = count <= _sum ne_lo = count != 1 ne_other = count != _sum to_test = [ [True, True, False], [False, False, True], [True, True, True], [False, False, True], [True, True, True], [True, True, False], [True, True, True], [True, True, False], ] features = [ gt_lo, gt_other, ge_lo, ge_other, lt_hi, lt_other, le_hi, le_other, ne_lo, ne_other, ] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test 
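The override tests above depend on Python's operator protocol: `value + 2` calls `__add__`, `2 - value` falls through to the reflected `__rsub__`, and `2 < value` is answered by the right operand's `__gt__`, which is why the reversed comparisons map to the opposite scalar primitive. A minimal pure-Python sketch of that mechanism follows. `NamedFeature` and the name strings it builds are hypothetical stand-ins for illustration only; they are not featuretools' `FeatureBase` implementation or its naming scheme.

```python
class NamedFeature:
    """Hypothetical stand-in for a feature; it only records a name."""

    def __init__(self, name):
        self.name = name

    def _binary(self, symbol, other, reverse=False):
        # Render the other operand: a feature by name, a scalar by repr.
        if isinstance(other, NamedFeature):
            other_name = other.name
        else:
            other_name = repr(other)
        if reverse:
            left, right = other_name, self.name
        else:
            left, right = self.name, other_name
        return NamedFeature(f"{left} {symbol} {right}")

    # feature + x and feature - x
    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    # x + feature and x - feature: Python calls these reflected methods
    # when the left operand (e.g. an int) returns NotImplemented.
    def __radd__(self, other):
        return self._binary("+", other, reverse=True)

    def __rsub__(self, other):
        return self._binary("-", other, reverse=True)

    # Comparisons have no reflected variants; `2 < feature` is resolved
    # by calling `feature.__gt__(2)` instead.
    def __gt__(self, other):
        return self._binary(">", other)

    def __lt__(self, other):
        return self._binary("<", other)


value = NamedFeature("value")
print((value + 2).name)  # value + 2
print((2 - value).name)  # 2 - value
print((2 < value).name)  # value > 2
```

Comparing the generated names, as the tests do with `unique_name()`, is enough to confirm that the operator form and the explicit primitive form describe the same feature.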
================================================ FILE: featuretools/tests/primitive_tests/test_primitive_base.py ================================================ from datetime import datetime import numpy as np import pandas as pd from pytest import raises from featuretools.primitives import Haversine, IsIn, IsNull, Max, TimeSinceLast from featuretools.primitives.base import TransformPrimitive def test_call_agg(): primitive = Max() # the assert is run twice on purpose for _ in range(2): assert 5 == primitive(range(6)) def test_call_trans(): primitive = IsNull() for _ in range(2): assert pd.Series([False] * 6).equals(primitive(range(6))) def test_uses_calc_time(): primitive = TimeSinceLast() primitive_h = TimeSinceLast(unit="hours") datetimes = pd.Series([datetime(2015, 6, 6), datetime(2015, 6, 7)]) answer = 86400.0 answer_h = 24.0 assert answer == primitive(datetimes, time=datetime(2015, 6, 8)) assert answer_h == primitive_h(datetimes, time=datetime(2015, 6, 8)) def test_call_multiple_args(): primitive = Haversine() data1 = [(42.4, -71.1), (40.0, -122.4)] data2 = [(40.0, -122.4), (41.2, -96.75)] answer = [2631.231, 1343.289] for _ in range(2): assert np.round(primitive(data1, data2), 3).tolist() == answer def test_get_function_called_once(): class TestPrimitive(TransformPrimitive): def __init__(self): self.get_function_call_count = 0 def get_function(self): self.get_function_call_count += 1 def test(x): return x return test primitive = TestPrimitive() for _ in range(2): primitive(range(6)) assert primitive.get_function_call_count == 1 def test_multiple_arg_string(): class Primitive(TransformPrimitive): def __init__(self, bool=True, int=0, float=None): self.bool = bool self.int = int self.float = float primitive = Primitive(bool=True, int=4, float=0.1) string = primitive.get_args_string() assert string == ", int=4, float=0.1" def test_single_args_string(): assert IsIn([1, 2, 3]).get_args_string() == ", list_of_outputs=[1, 2, 3]" def test_args_string_default(): 
assert IsIn().get_args_string() == "" def test_args_string_mixed(): class Primitive(TransformPrimitive): def __init__(self, bool=True, int=0, float=None): self.bool = bool self.int = int self.float = float primitive = Primitive(bool=False, int=0) string = primitive.get_args_string() assert string == ", bool=False" def test_args_string_undefined(): string = Max().get_args_string() assert string == "" def test_args_string_error(): class Primitive(TransformPrimitive): def __init__(self, bool=True, int=0, float=None): pass with raises(AssertionError, match="must be attribute"): Primitive(bool=True, int=4, float=0.1).get_args_string() ================================================ FILE: featuretools/tests/primitive_tests/test_primitive_utils.py ================================================ import os import pytest from featuretools import list_primitives, summarize_primitives from featuretools.primitives import ( AddNumericScalar, Age, Count, Day, Diff, GreaterThan, Haversine, IsFreeEmailDomain, IsNull, Last, Max, Mean, Min, Mode, Month, MultiplyBoolean, NMostCommon, NumCharacters, NumericLag, NumUnique, NumWords, PercentTrue, Skew, Std, Sum, Weekday, Year, get_aggregation_primitives, get_default_aggregation_primitives, get_default_transform_primitives, get_transform_primitives, ) from featuretools.primitives.base import PrimitiveBase from featuretools.primitives.base.transform_primitive_base import TransformPrimitive from featuretools.primitives.utils import ( _check_input_types, _get_descriptions, _get_summary_primitives, _get_unique_input_types, list_primitive_files, load_primitive_from_file, ) def test_list_primitives_order(): df = list_primitives() all_primitives = get_transform_primitives() all_primitives.update(get_aggregation_primitives()) for name, primitive in all_primitives.items(): assert name in df["name"].values row = df.loc[df["name"] == name].iloc[0] actual_desc = _get_descriptions([primitive])[0] if actual_desc: assert actual_desc == 
row["description"] assert row["valid_inputs"] == ", ".join( _get_unique_input_types(primitive.input_types), ) expected_return_type = ( str(primitive.return_type) if primitive.return_type is not None else None ) assert row["return_type"] == expected_return_type types = df["type"].values assert "aggregation" in types assert "transform" in types def test_valid_input_types(): actual = _get_unique_input_types(Haversine.input_types) assert actual == {"<ColumnSchema (Logical Type = LatLong)>"} actual = _get_unique_input_types(MultiplyBoolean.input_types) assert actual == { "<ColumnSchema (Logical Type = Boolean)>", "<ColumnSchema (Logical Type = BooleanNullable)>", } actual = _get_unique_input_types(Sum.input_types) assert actual == {"<ColumnSchema (Semantic Tags = ['numeric'])>"} def test_descriptions(): primitives = { NumCharacters: "Calculates the number of characters in a given string, including whitespace and punctuation.", Day: "Determines the day of the month from a datetime.", Last: "Determines the last value in a list.", GreaterThan: "Determines if values in one list are greater than another list.", } assert _get_descriptions(list(primitives.keys())) == list(primitives.values()) def test_get_descriptions_doesnt_truncate_primitive_description(): # single line descr = _get_descriptions([IsNull]) assert descr[0] == "Determines if a value is null." # multiple line; one sentence descr = _get_descriptions([Diff]) assert ( descr[0] == "Computes the difference between the value in a list and the previous value in that list." ) # multiple lines; multiple sentences class TestPrimitive(TransformPrimitive): """This is text that continues on after the line break and ends in a period. This is text on one line without a period Examples: >>> absolute = Absolute() >>> absolute([3.0, -5.0, -2.4]).tolist() [3.0, 5.0, 2.4] """ name = "test_primitive" descr = _get_descriptions([TestPrimitive]) assert ( descr[0] == "This is text that continues on after the line break and ends in a period. 
This is text on one line without a period" ) # docstring ends after description class TestPrimitive2(TransformPrimitive): """This is text that continues on after the line break and ends in a period. This is text on one line without a period """ name = "test_primitive" descr = _get_descriptions([TestPrimitive2]) assert ( descr[0] == "This is text that continues on after the line break and ends in a period. This is text on one line without a period" ) def test_get_default_aggregation_primitives(): primitives = get_default_aggregation_primitives() expected_primitives = [ Sum, Std, Max, Skew, Min, Mean, Count, PercentTrue, NumUnique, Mode, ] assert set(primitives) == set(expected_primitives) def test_get_default_transform_primitives(): primitives = get_default_transform_primitives() expected_primitives = [ Age, Day, Year, Month, Weekday, Haversine, NumWords, NumCharacters, ] assert set(primitives) == set(expected_primitives) @pytest.fixture def this_dir(): return os.path.dirname(os.path.abspath(__file__)) @pytest.fixture def primitives_to_install_dir(this_dir): return os.path.join(this_dir, "primitives_to_install") @pytest.fixture def bad_primitives_files_dir(this_dir): return os.path.join(this_dir, "bad_primitive_files") def test_list_primitive_files(primitives_to_install_dir): files = list_primitive_files(primitives_to_install_dir) custom_max_file = os.path.join(primitives_to_install_dir, "custom_max.py") custom_mean_file = os.path.join(primitives_to_install_dir, "custom_mean.py") custom_sum_file = os.path.join(primitives_to_install_dir, "custom_sum.py") assert {custom_max_file, custom_mean_file, custom_sum_file}.issubset(set(files)) def test_load_primitive_from_file(primitives_to_install_dir): primitve_file = os.path.join(primitives_to_install_dir, "custom_max.py") primitive_name, primitive_obj = load_primitive_from_file(primitve_file) assert issubclass(primitive_obj, PrimitiveBase) def test_errors_more_than_one_primitive_in_file(bad_primitives_files_dir): 
primitive_file = os.path.join(bad_primitives_files_dir, "multiple_primitives.py") error_text = "More than one primitive defined in file {}".format(primitive_file) with pytest.raises(RuntimeError) as excinfo: load_primitive_from_file(primitive_file) assert str(excinfo.value) == error_text def test_errors_no_primitive_in_file(bad_primitives_files_dir): primitive_file = os.path.join(bad_primitives_files_dir, "no_primitives.py") error_text = "No primitive defined in file {}".format(primitive_file) with pytest.raises(RuntimeError) as excinfo: load_primitive_from_file(primitive_file) assert str(excinfo.value) == error_text def test_check_input_types(): primitives = [Sum, Weekday, PercentTrue, Day, Std, NumericLag] log_in_type_checks = set() sem_tag_type_checks = set() unique_input_types = set() expected_log_in_check = { "boolean_nullable", "boolean", "datetime", } expected_sem_tag_type_check = {"numeric", "time_index"} expected_unique_input_types = { "<ColumnSchema (Logical Type = Datetime)>", "<ColumnSchema (Semantic Tags = ['numeric'])>", "<ColumnSchema (Logical Type = Boolean)>", "<ColumnSchema (Logical Type = BooleanNullable)>", "<ColumnSchema (Semantic Tags = ['time_index'])>", } for prim in primitives: input_types_flattened = prim.flatten_nested_input_types(prim.input_types) _check_input_types( input_types_flattened, log_in_type_checks, sem_tag_type_checks, unique_input_types, ) assert log_in_type_checks == expected_log_in_check assert sem_tag_type_checks == expected_sem_tag_type_check assert unique_input_types == expected_unique_input_types def test_get_summary_primitives(): primitives = [ Sum, Weekday, PercentTrue, Day, Std, NumericLag, AddNumericScalar, IsFreeEmailDomain, NMostCommon, ] primitives_summary = _get_summary_primitives(primitives) expected_unique_input_types = 7 expected_unique_output_types = 6 expected_uses_multi_input = 2 expected_uses_multi_output = 1 expected_uses_external_data = 1 expected_controllable = 3 expected_datetime_inputs = 2 expected_bool = 1 expected_bool_nullable = 1 expected_time_index_tag = 1 assert ( primitives_summary["general_metrics"]["unique_input_types"] == expected_unique_input_types ) assert ( 
primitives_summary["general_metrics"]["unique_output_types"] == expected_unique_output_types ) assert ( primitives_summary["general_metrics"]["uses_multi_input"] == expected_uses_multi_input ) assert ( primitives_summary["general_metrics"]["uses_multi_output"] == expected_uses_multi_output ) assert ( primitives_summary["general_metrics"]["uses_external_data"] == expected_uses_external_data ) assert ( primitives_summary["general_metrics"]["are_controllable"] == expected_controllable ) assert ( primitives_summary["semantic_tag_metrics"]["time_index"] == expected_time_index_tag ) assert ( primitives_summary["logical_type_input_metrics"]["datetime"] == expected_datetime_inputs ) assert primitives_summary["logical_type_input_metrics"]["boolean"] == expected_bool assert ( primitives_summary["logical_type_input_metrics"]["boolean_nullable"] == expected_bool_nullable ) def test_summarize_primitives(): df = summarize_primitives() trans_prims = get_transform_primitives() agg_prims = get_aggregation_primitives() tot_trans = len(trans_prims) tot_agg = len(agg_prims) tot_prims = tot_trans + tot_agg assert df["Count"].iloc[0] == tot_prims assert df["Count"].iloc[1] == tot_agg assert df["Count"].iloc[2] == tot_trans ================================================ FILE: featuretools/tests/primitive_tests/test_rolling_primitive_utils.py ================================================ from unittest.mock import patch import numpy as np import pandas as pd import pytest from featuretools.primitives import ( RollingCount, RollingMax, RollingMean, RollingMin, RollingSTD, RollingTrend, ) from featuretools.primitives.standard.transform.time_series.utils import ( _get_rolled_series_without_gap, apply_roll_with_offset_gap, roll_series_with_gap, ) from featuretools.tests.primitive_tests.utils import get_number_from_offset def test_get_rolled_series_without_gap(window_series): # Data is daily, so number of rows should be number of days not included in the gap assert 
len(_get_rolled_series_without_gap(window_series, "11D")) == 9 assert len(_get_rolled_series_without_gap(window_series, "0D")) == 20 assert len(_get_rolled_series_without_gap(window_series, "48H")) == 18 assert len(_get_rolled_series_without_gap(window_series, "4H")) == 19 def test_get_rolled_series_without_gap_not_uniform(window_series): non_uniform_series = window_series.iloc[[0, 2, 5, 6, 8, 9]] assert len(_get_rolled_series_without_gap(non_uniform_series, "10D")) == 0 assert len(_get_rolled_series_without_gap(non_uniform_series, "0D")) == 6 assert len(_get_rolled_series_without_gap(non_uniform_series, "48H")) == 4 assert len(_get_rolled_series_without_gap(non_uniform_series, "4H")) == 5 assert len(_get_rolled_series_without_gap(non_uniform_series, "4D")) == 3 assert len(_get_rolled_series_without_gap(non_uniform_series, "4D2H")) == 2 def test_get_rolled_series_without_gap_empty_series(window_series): empty_series = pd.Series([], dtype="object") assert len(_get_rolled_series_without_gap(empty_series, "1D")) == 0 assert len(_get_rolled_series_without_gap(empty_series, "0D")) == 0 def test_get_rolled_series_without_gap_large_bound(window_series): assert len(_get_rolled_series_without_gap(window_series, "100D")) == 0 assert ( len( _get_rolled_series_without_gap( window_series.iloc[[0, 2, 5, 6, 8, 9]], "20D", ), ) == 0 ) @pytest.mark.parametrize( "window_length, gap", [ (3, 2), (3, 4), # gap larger than window (2, 0), # gap explicitly set to 0 ("3d", "2d"), # using offset aliases ("3d", "4d"), ("4d", "0d"), ], ) def test_roll_series_with_gap(window_length, gap, window_series): rolling_max = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=1, ).max() rolling_min = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=1, ).min() assert len(rolling_max) == len(window_series) assert len(rolling_min) == len(window_series) gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) for i in 
range(len(window_series)): start_idx = i - gap_num - window_length_num + 1 if isinstance(gap, str): # No gap functionality is happening, so gap isn't taken account in the end index # it's like the gap is 0; it includes the row itself end_idx = i else: end_idx = i - gap_num # If start and end are negative, they're entirely before if start_idx < 0 and end_idx < 0: assert pd.isnull(rolling_max.iloc[i]) assert pd.isnull(rolling_min.iloc[i]) continue if start_idx < 0: start_idx = 0 # Because the row values are a range from 0 to 20, the rolling min will be the start index # and the rolling max will be the end idx assert rolling_min.iloc[i] == start_idx assert rolling_max.iloc[i] == end_idx @pytest.mark.parametrize("window_length", [3, "3d"]) def test_roll_series_with_no_gap(window_length, window_series): actual_rolling = roll_series_with_gap( window_series, window_length, gap=0, min_periods=1, ).mean() expected_rolling = window_series.rolling(window_length, min_periods=1).mean() pd.testing.assert_series_equal(actual_rolling, expected_rolling) @pytest.mark.parametrize( "window_length, gap", [ (6, 2), (6, 0), # No gap - changes early values ("6d", "0d"), # Uses offset aliases ("6d", "2d"), ], ) def test_roll_series_with_gap_early_values(window_length, gap, window_series): gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) # Default min periods is 1 - will include all default_partial_values = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=1, ).count() num_empty_aggregates = len(default_partial_values.loc[default_partial_values == 0]) num_partial_aggregates = len( (default_partial_values.loc[default_partial_values != 0]).loc[ default_partial_values < window_length_num ], ) assert num_partial_aggregates == window_length_num - 1 if isinstance(gap, str): # gap isn't handled, so we'll always at least include the row itself assert num_empty_aggregates == 0 else: assert num_empty_aggregates == gap_num # Make 
min periods the size of the window no_partial_values = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=window_length_num, ).count() num_null_aggregates = len(no_partial_values.loc[pd.isna(no_partial_values)]) num_partial_aggregates = len( no_partial_values.loc[no_partial_values < window_length_num], ) # because we shift, gap is included as nan values in the series. # Count treats nans in a window as values that don't get counted, # so the gap rows get included in the count for whether a window has "min periods". # This is different than max, for example, which does not count nans in a window as values towards "min periods" assert num_null_aggregates == window_length_num - 1 if isinstance(gap, str): # gap isn't handled, so we'll never have any partial aggregates assert num_partial_aggregates == 0 else: assert num_partial_aggregates == gap_num def test_roll_series_with_gap_nullable_types(window_series): window_length = 3 gap = 2 min_periods = 1 # Because we're inserting nans, confirm that nullability of the dtype doesn't have an impact on the results nullable_series = window_series.astype("Int64") non_nullable_series = window_series.astype("int64") nullable_rolling_max = roll_series_with_gap( nullable_series, window_length, gap=gap, min_periods=min_periods, ).max() non_nullable_rolling_max = roll_series_with_gap( non_nullable_series, window_length, gap=gap, min_periods=min_periods, ).max() pd.testing.assert_series_equal(nullable_rolling_max, non_nullable_rolling_max) def test_roll_series_with_gap_nullable_types_with_nans(window_series): window_length = 3 gap = 2 min_periods = 1 nullable_floats = window_series.astype("float64").replace( {1: np.nan, 3: np.nan}, ) nullable_ints = nullable_floats.astype("Int64") nullable_ints_rolling_max = roll_series_with_gap( nullable_ints, window_length, gap=gap, min_periods=min_periods, ).max() nullable_floats_rolling_max = roll_series_with_gap( nullable_floats, window_length, gap=gap, 
min_periods=min_periods, ).max() pd.testing.assert_series_equal( nullable_ints_rolling_max, nullable_floats_rolling_max, ) expected_early_values = [np.nan, np.nan, 0, 0, 2, 2, 4] + list( range(7 - gap, len(window_series) - gap), ) for i in range(len(window_series)): actual = nullable_floats_rolling_max.iloc[i] expected = expected_early_values[i] if pd.isnull(actual): assert pd.isnull(expected) else: assert actual == expected @pytest.mark.parametrize( "window_length, gap", [ ("3d", "2d"), ("3d", "4d"), ("4d", "0d"), ], ) def test_apply_roll_with_offset_gap(window_length, gap, window_series): def max_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=1) rolling_max_obj = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=1, ) rolling_max_series = rolling_max_obj.apply(max_wrapper) def min_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, min, min_periods=1) rolling_min_obj = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=1, ) rolling_min_series = rolling_min_obj.apply(min_wrapper) assert len(rolling_max_series) == len(window_series) assert len(rolling_min_series) == len(window_series) gap_num = get_number_from_offset(gap) window_length_num = get_number_from_offset(window_length) for i in range(len(window_series)): start_idx = i - gap_num - window_length_num + 1 # Now that we have the _apply call, this acts as expected end_idx = i - gap_num # If start and end are negative, they're entirely before if start_idx < 0 and end_idx < 0: assert pd.isnull(rolling_max_series.iloc[i]) assert pd.isnull(rolling_min_series.iloc[i]) continue if start_idx < 0: start_idx = 0 # Because the row values are a range from 0 to 20, the rolling min will be the start index # and the rolling max will be the end idx assert rolling_min_series.iloc[i] == start_idx assert rolling_max_series.iloc[i] == end_idx @pytest.mark.parametrize( "min_periods", [1, 0, None], ) def 
test_apply_roll_with_offset_gap_default_min_periods(min_periods, window_series): window_length = "5d" window_length_num = 5 gap = "3d" gap_num = 3 def count_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods) rolling_count_obj = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=min_periods, ) rolling_count_series = rolling_count_obj.apply(count_wrapper) # gap essentially creates a rolling series that has no elements; which should be nan # to differentiate from when a window only has null values num_empty_aggregates = rolling_count_series.isna().sum() num_partial_aggregates = len( (rolling_count_series.loc[rolling_count_series != 0]).loc[ rolling_count_series < window_length_num ], ) assert num_empty_aggregates == gap_num assert num_partial_aggregates == window_length_num - 1 @pytest.mark.parametrize( "min_periods", [2, 3, 4, 5], ) def test_apply_roll_with_offset_gap_min_periods(min_periods, window_series): window_length = "5d" window_length_num = 5 gap = "3d" gap_num = 3 def count_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods) rolling_count_obj = roll_series_with_gap( window_series, window_length, gap=gap, min_periods=min_periods, ) rolling_count_series = rolling_count_obj.apply(count_wrapper) # gap essentially creates rolling series that have no elements; which should be nan # to differentiate from when a window only has null values num_empty_aggregates = rolling_count_series.isna().sum() num_partial_aggregates = len( (rolling_count_series.loc[rolling_count_series != 0]).loc[ rolling_count_series < window_length_num ], ) assert num_empty_aggregates == min_periods - 1 + gap_num assert num_partial_aggregates == window_length_num - min_periods def test_apply_roll_with_offset_gap_non_uniform(): window_length = "3d" gap = "3d" min_periods = 1 # When the data isn't uniform, this impacts the number of values in each rolling window datetimes = ( 
list(pd.date_range(start="2017-01-01", freq="1d", periods=7)) + list(pd.date_range(start="2017-02-01", freq="2d", periods=7)) + list(pd.date_range(start="2017-03-01", freq="1d", periods=7)) ) no_freq_series = pd.Series(range(len(datetimes)), index=datetimes) assert pd.infer_freq(no_freq_series.index) is None expected_series = pd.Series( [None, None, None, 1, 2, 3, 3] + [None, None, 1, 1, 1, 1, 1] + [None, None, None, 1, 2, 3, 3], index=datetimes, ) def count_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods) rolling_count_obj = roll_series_with_gap( no_freq_series, window_length, gap=gap, min_periods=min_periods, ) rolling_count_series = rolling_count_obj.apply(count_wrapper) pd.testing.assert_series_equal(rolling_count_series, expected_series) def test_apply_roll_with_offset_data_frequency_higher_than_parameters_frequency(): window_length = "5D" # 120 hours window_length_num = 5 # In order for min periods to be the length of the window, we multiply 24hours*5 min_periods = window_length_num * 24 datetimes = list(pd.date_range(start="2017-01-01", freq="1H", periods=200)) high_frequency_series = pd.Series(range(200), index=datetimes) # Check without gap gap = "0d" gap_num = 0 def max_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods) rolling_max_obj = roll_series_with_gap( high_frequency_series, window_length, min_periods=min_periods, gap=gap, ) rolling_max_series = rolling_max_obj.apply(max_wrapper) assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num # Check with small gap gap = "3H" gap_num = 3 def max_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods) rolling_max_obj = roll_series_with_gap( high_frequency_series, window_length, min_periods=min_periods, gap=gap, ) rolling_max_series = rolling_max_obj.apply(max_wrapper) assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num # Check with large gap - in terms of 
days, so we'll multiply by 24hours for number of nans gap = "2D" gap_num = 2 def max_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods) rolling_max_obj = roll_series_with_gap( high_frequency_series, window_length, min_periods=min_periods, gap=gap, ) rolling_max_series = rolling_max_obj.apply(max_wrapper) assert rolling_max_series.isna().sum() == (min_periods - 1) + (gap_num * 24) def test_apply_roll_with_offset_data_min_periods_too_big(window_series): window_length = "5D" gap = "2d" # Since the data has a daily frequency, there will only be, at most, 5 rows in the window min_periods = 6 def max_wrapper(sub_s): return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods) rolling_max_obj = roll_series_with_gap( window_series, window_length, min_periods=min_periods, gap=gap, ) rolling_max_series = rolling_max_obj.apply(max_wrapper) # The resulting series is comprised entirely of nans assert rolling_max_series.isna().sum() == len(window_series) def test_roll_series_with_gap_different_input_types_same_result_uniform( window_series, ): # Offset inputs will only produce the same results as numeric inputs # when the data has a uniform frequency offset_gap = "2d" offset_window_length = "5d" int_gap = 2 int_window_length = 5 min_periods = 1 # Rolling series' with matching input types expected_rolling_numeric = roll_series_with_gap( window_series, window_length=int_window_length, gap=int_gap, min_periods=min_periods, ).max() def count_wrapper(sub_s): return apply_roll_with_offset_gap( sub_s, offset_gap, max, min_periods=min_periods, ) rolling_count_obj = roll_series_with_gap( window_series, window_length=offset_window_length, gap=offset_gap, min_periods=min_periods, ) expected_rolling_offset = rolling_count_obj.apply(count_wrapper) # confirm that the offset and gap results are equal to one another pd.testing.assert_series_equal(expected_rolling_numeric, expected_rolling_offset) # Rolling series' with mismatched input 
types mismatched_numeric_gap = roll_series_with_gap( window_series, window_length=offset_window_length, gap=int_gap, min_periods=min_periods, ).max() # Confirm the mismatched results also produce the same results pd.testing.assert_series_equal(expected_rolling_numeric, mismatched_numeric_gap) def test_roll_series_with_gap_incorrect_types(window_series): error = "Window length must be either an offset string or an integer." with pytest.raises(TypeError, match=error): ( roll_series_with_gap( window_series, window_length=4.2, gap=4, min_periods=1, ), ) error = "Gap must be either an offset string or an integer." with pytest.raises(TypeError, match=error): roll_series_with_gap(window_series, window_length=4, gap=4.2, min_periods=1) def test_roll_series_with_gap_negative_inputs(window_series): error = "Window length must be greater than zero." with pytest.raises(ValueError, match=error): roll_series_with_gap(window_series, window_length=-4, gap=4, min_periods=1) error = "Gap must be greater than or equal to zero." with pytest.raises(ValueError, match=error): roll_series_with_gap(window_series, window_length=4, gap=-4, min_periods=1) def test_roll_series_with_non_offset_string_inputs(window_series): error = "Cannot roll series. The specified gap, test, is not a valid offset alias." with pytest.raises(ValueError, match=error): roll_series_with_gap( window_series, window_length="4D", gap="test", min_periods=1, ) error = "Cannot roll series. The specified window length, test, is not a valid offset alias." with pytest.raises(ValueError, match=error): roll_series_with_gap( window_series, window_length="test", gap="7D", min_periods=1, ) # Test mismatched types error error = ( "Cannot roll series with offset gap, 2d, and numeric window length, 7. " "If an offset alias is used for gap, the window length must also be defined as an offset alias. " "Please either change gap to be numeric or change window length to be an offset alias." 
) with pytest.raises(TypeError, match=error): roll_series_with_gap( window_series, window_length=7, gap="2d", min_periods=1, ).max() @pytest.mark.parametrize( "primitive", [RollingCount, RollingMax, RollingMin, RollingMean, RollingSTD, RollingTrend], ) @patch( "featuretools.primitives.standard.transform.time_series.utils.apply_roll_with_offset_gap", ) def test_no_call_to_apply_roll_with_offset_gap_with_numeric( mock_apply_roll, primitive, window_series, ): assert not mock_apply_roll.called fully_numeric_primitive = primitive(window_length=3, gap=1) primitive_func = fully_numeric_primitive.get_function() if isinstance(fully_numeric_primitive, RollingCount): pd.Series(primitive_func(window_series.index)) else: pd.Series( primitive_func( window_series.index, pd.Series(window_series.values), ), ) assert not mock_apply_roll.called offset_window_primitive = primitive(window_length="3d", gap=1) primitive_func = offset_window_primitive.get_function() if isinstance(offset_window_primitive, RollingCount): pd.Series(primitive_func(window_series.index)) else: pd.Series( primitive_func( window_series.index, pd.Series(window_series.values), ), ) assert not mock_apply_roll.called no_gap_specified_primitive = primitive(window_length="3d") primitive_func = no_gap_specified_primitive.get_function() if isinstance(no_gap_specified_primitive, RollingCount): pd.Series(primitive_func(window_series.index)) else: pd.Series( primitive_func( window_series.index, pd.Series(window_series.values), ), ) assert not mock_apply_roll.called # Only an offset gap should route through apply_roll_with_offset_gap offset_gap_primitive = primitive(window_length="3d", gap="1d") primitive_func = offset_gap_primitive.get_function() if isinstance(offset_gap_primitive, RollingCount): pd.Series(primitive_func(window_series.index)) else: pd.Series( primitive_func( window_series.index, pd.Series(window_series.values), ), ) assert mock_apply_roll.called ================================================ FILE: 
featuretools/tests/primitive_tests/test_transform_features.py ================================================ from inspect import isclass import numpy as np import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import ( Boolean, BooleanNullable, Categorical, Datetime, Double, Integer, IntegerNullable, ) from featuretools import ( AggregationFeature, EntitySet, Feature, IdentityFeature, TransformFeature, calculate_feature_matrix, dfs, primitives, ) from featuretools.computational_backends.feature_set import FeatureSet from featuretools.computational_backends.feature_set_calculator import ( FeatureSetCalculator, ) from featuretools.primitives import ( Absolute, AddNumeric, AddNumericScalar, Age, Count, Day, Diff, DiffDatetime, DivideByFeature, DivideNumeric, DivideNumericScalar, Equal, EqualScalar, FileExtension, First, FullNameToFirstName, FullNameToLastName, FullNameToTitle, GreaterThan, GreaterThanEqualTo, GreaterThanEqualToScalar, GreaterThanScalar, Haversine, Hour, IsIn, IsNull, Lag, Latitude, LessThan, LessThanEqualTo, LessThanEqualToScalar, LessThanScalar, Longitude, Mode, MultiplyBoolean, MultiplyNumeric, MultiplyNumericBoolean, MultiplyNumericScalar, Not, NotEqual, NotEqualScalar, NumCharacters, NumericLag, NumWords, Percentile, ScalarSubtractNumericFeature, SubtractNumeric, SubtractNumericScalar, Sum, TimeSince, TransformPrimitive, get_transform_primitives, ) from featuretools.synthesis.deep_feature_synthesis import match def test_init_and_name(es): log = es["log"] rating = Feature(IdentityFeature(es["products"].ww["rating"]), "log") log_features = [Feature(es["log"].ww[col]) for col in log.columns] + [ Feature(rating, primitive=GreaterThanScalar(2.5)), Feature(rating, primitive=GreaterThanScalar(3.5)), ] # Add Timedelta feature # features.append(pd.Timestamp.now() - Feature(log['datetime'])) customers_features = [ Feature(es["customers"].ww[col]) for col in es["customers"].columns ] # check all 
transform primitives have a name for attribute_string in dir(primitives): attr = getattr(primitives, attribute_string) if isclass(attr): if issubclass(attr, TransformPrimitive) and attr != TransformPrimitive: assert getattr(attr, "name") is not None trans_primitives = get_transform_primitives().values() for transform_prim in trans_primitives: # skip automated testing for a few special cases features_to_use = log_features if transform_prim in [NotEqual, Equal, FileExtension]: continue if transform_prim in [ Age, FullNameToFirstName, FullNameToLastName, FullNameToTitle, ]: features_to_use = customers_features # use the input_types matching function from DFS input_types = transform_prim.input_types if isinstance(input_types[0], list): matching_inputs = match(input_types[0], features_to_use) else: matching_inputs = match(input_types, features_to_use) if len(matching_inputs) == 0: raise Exception("Transform Primitive %s not tested" % transform_prim.name) for prim in matching_inputs: instance = Feature(prim, primitive=transform_prim) # try to get name and calculate instance.get_name() calculate_feature_matrix([instance], entityset=es) def test_relationship_path(es): f = TransformFeature(Feature(es["log"].ww["datetime"]), Hour) assert len(f.relationship_path) == 0 def test_serialization(es): value = IdentityFeature(es["log"].ww["value"]) primitive = MultiplyNumericScalar(value=2) value_x2 = TransformFeature(value, primitive) dictionary = { "name": value_x2.get_name(), "base_features": [value.unique_name()], "primitive": primitive, } assert dictionary == value_x2.get_arguments() assert value_x2 == TransformFeature.from_dictionary( dictionary, es, {value.unique_name(): value}, primitive, ) def test_make_trans_feat(es): f = Feature(es["log"].ww["datetime"], primitive=Hour) feature_set = FeatureSet([f]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array([0])) v = df[f.get_name()][0] assert v == 10 @pytest.fixture def simple_es(): df = 
pd.DataFrame( { "id": range(4), "value": pd.Categorical(["a", "c", "b", "d"]), "value2": pd.Categorical(["a", "b", "a", "d"]), "object": ["time1", "time2", "time3", "time4"], "datetime": pd.Series( [ pd.Timestamp("2001-01-01"), pd.Timestamp("2001-01-02"), pd.Timestamp("2001-01-03"), pd.Timestamp("2001-01-04"), ], ), }, ) es = EntitySet("equal_test") es.add_dataframe(dataframe_name="values", dataframe=df, index="id") return es def test_equal_categorical(simple_es): f1 = Feature( [ IdentityFeature(simple_es["values"].ww["value"]), IdentityFeature(simple_es["values"].ww["value2"]), ], primitive=Equal, ) df = calculate_feature_matrix(entityset=simple_es, features=[f1]) assert set(simple_es["values"]["value"].cat.categories) != set( simple_es["values"]["value2"].cat.categories, ) assert df["value = value2"].to_list() == [ True, False, False, True, ] def test_equal_different_dtypes(simple_es): f1 = Feature( [ IdentityFeature(simple_es["values"].ww["object"]), IdentityFeature(simple_es["values"].ww["datetime"]), ], primitive=Equal, ) f2 = Feature( [ IdentityFeature(simple_es["values"].ww["datetime"]), IdentityFeature(simple_es["values"].ww["object"]), ], primitive=Equal, ) # verify that equals works for different dtypes regardless of order df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2]) assert df["object = datetime"].to_list() == [False, False, False, False] assert df["datetime = object"].to_list() == [False, False, False, False] def test_not_equal_categorical(simple_es): f1 = Feature( [ IdentityFeature(simple_es["values"].ww["value"]), IdentityFeature(simple_es["values"].ww["value2"]), ], primitive=NotEqual, ) df = calculate_feature_matrix(entityset=simple_es, features=[f1]) assert set(simple_es["values"]["value"].cat.categories) != set( simple_es["values"]["value2"].cat.categories, ) assert df["value != value2"].to_list() == [ False, True, True, False, ] def test_not_equal_different_dtypes(simple_es): f1 = Feature( [ 
IdentityFeature(simple_es["values"].ww["object"]), IdentityFeature(simple_es["values"].ww["datetime"]), ], primitive=NotEqual, ) f2 = Feature( [ IdentityFeature(simple_es["values"].ww["datetime"]), IdentityFeature(simple_es["values"].ww["object"]), ], primitive=NotEqual, ) # verify that not-equal works for different dtypes regardless of order df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2]) assert df["object != datetime"].to_list() == [True, True, True, True] assert df["datetime != object"].to_list() == [True, True, True, True] def test_diff(es): value = Feature(es["log"].ww["value"]) customer_id_feat = Feature(es["sessions"].ww["customer_id"], "log") diff1 = Feature( value, groupby=Feature(es["log"].ww["session_id"]), primitive=Diff, ) diff2 = Feature(value, groupby=customer_id_feat, primitive=Diff) feature_set = FeatureSet([diff1, diff2]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array(range(15))) val1 = df[diff1.get_name()].tolist() val2 = df[diff2.get_name()].tolist() correct_vals1 = [ np.nan, 5, 5, 5, 5, np.nan, 1, 1, 1, np.nan, np.nan, 5, np.nan, 7, 7, ] correct_vals2 = [np.nan, 5, 5, 5, 5, -20, 1, 1, 1, -3, np.nan, 5, -5, 7, 7] np.testing.assert_equal(val1, correct_vals1) np.testing.assert_equal(val2, correct_vals2) def test_diff_shift(es): value = Feature(es["log"].ww["value"]) customer_id_feat = Feature(es["sessions"].ww["customer_id"], "log") diff_periods = Feature(value, groupby=customer_id_feat, primitive=Diff(periods=1)) feature_set = FeatureSet([diff_periods]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array(range(15))) val3 = df[diff_periods.get_name()].tolist() correct_vals3 = [np.nan, np.nan, 5, 5, 5, 5, -20, 1, 1, 1, np.nan, np.nan, 5, -5, 7] np.testing.assert_equal(val3, correct_vals3) def test_diff_single_value(es): diff = Feature( es["stores"].ww["num_square_feet"], groupby=Feature(es["stores"].ww["région_id"]), primitive=Diff, ) feature_set 
= FeatureSet([diff]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array([4])) assert df[diff.get_name()][4] == 6000.0 def test_diff_reordered(es): sum_feat = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) diff = Feature(sum_feat, primitive=Diff) feature_set = FeatureSet([diff]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array([4, 2])) assert df[diff.get_name()][4] == 16 assert df[diff.get_name()][2] == -6 def test_diff_single_value_is_nan(es): diff = Feature( es["stores"].ww["num_square_feet"], groupby=Feature(es["stores"].ww["région_id"]), primitive=Diff, ) feature_set = FeatureSet([diff]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array([5])) assert df.shape[0] == 1 assert df[diff.get_name()].dropna().shape[0] == 0 def test_diff_datetime(es): diff = Feature( es["log"].ww["datetime"], primitive=DiffDatetime, ) feature_set = FeatureSet([diff]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array(range(15))) vals = pd.Series(df[diff.get_name()].tolist()) expected_vals = pd.Series( [ pd.NaT, pd.Timedelta(seconds=6), pd.Timedelta(seconds=6), pd.Timedelta(seconds=6), pd.Timedelta(seconds=6), pd.Timedelta(seconds=36), pd.Timedelta(seconds=9), pd.Timedelta(seconds=9), pd.Timedelta(seconds=9), pd.Timedelta(minutes=8, seconds=33), pd.Timedelta(days=1), pd.Timedelta(seconds=1), pd.Timedelta(seconds=59), pd.Timedelta(seconds=3), pd.Timedelta(seconds=3), ], ) pd.testing.assert_series_equal(vals, expected_vals) def test_diff_datetime_shift(es): diff = Feature( es["log"].ww["datetime"], primitive=DiffDatetime(periods=1), ) feature_set = FeatureSet([diff]) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array(range(6))) vals = pd.Series(df[diff.get_name()].tolist()) expected_vals = pd.Series( [ pd.NaT, pd.NaT, pd.Timedelta(seconds=6), 
pd.Timedelta(seconds=6), pd.Timedelta(seconds=6), pd.Timedelta(seconds=6), ], ) pd.testing.assert_series_equal(vals, expected_vals) def test_compare_of_identity(es): to_test = [ (EqualScalar, [False, False, True, False]), (NotEqualScalar, [True, True, False, True]), (LessThanScalar, [True, True, False, False]), (LessThanEqualToScalar, [True, True, True, False]), (GreaterThanScalar, [False, False, False, True]), (GreaterThanEqualToScalar, [False, False, True, True]), ] features = [] for test in to_test: features.append(Feature(es["log"].ww["value"], primitive=test[0](10))) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2, 3], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] def test_compare_of_direct(es): log_rating = Feature(es["products"].ww["rating"], "log") to_test = [ (EqualScalar, [False, False, False, False]), (NotEqualScalar, [True, True, True, True]), (LessThanScalar, [False, False, False, True]), (LessThanEqualToScalar, [False, False, False, True]), (GreaterThanScalar, [True, True, True, False]), (GreaterThanEqualToScalar, [True, True, True, False]), ] features = [] for test in to_test: features.append(Feature(log_rating, primitive=test[0](4.5))) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2, 3], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] def test_compare_of_transform(es): day = Feature(es["log"].ww["datetime"], primitive=Day) to_test = [ (EqualScalar, [False, True]), (NotEqualScalar, [True, False]), ] features = [] for test in to_test: features.append(Feature(day, primitive=test[0](10))) df = calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 14]) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] def test_compare_of_agg(es): count_logs = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", 
primitive=Count, ) to_test = [ (EqualScalar, [False, False, False, True]), (NotEqualScalar, [True, True, True, False]), (LessThanScalar, [False, False, True, False]), (LessThanEqualToScalar, [False, False, True, True]), (GreaterThanScalar, [True, True, False, False]), (GreaterThanEqualToScalar, [True, True, False, True]), ] features = [] for test in to_test: features.append(Feature(count_logs, primitive=test[0](2))) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2, 3], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] def test_compare_all_nans(es): nan_feat = Feature( es["log"].ww["product_id"], parent_dataframe_name="sessions", primitive=Mode, ) compare = nan_feat == "brown bag" # before all data time_last = pd.Timestamp("1/1/1993") df = calculate_feature_matrix( entityset=es, features=[nan_feat, compare], instance_ids=[0, 1, 2], cutoff_time=time_last, ) assert df[nan_feat.get_name()].dropna().shape[0] == 0 assert not df[compare.get_name()].any() def test_arithmetic_of_val(es): to_test = [ (AddNumericScalar, [2.0, 7.0, 12.0, 17.0]), (SubtractNumericScalar, [-2.0, 3.0, 8.0, 13.0]), (ScalarSubtractNumericFeature, [2.0, -3.0, -8.0, -13.0]), (MultiplyNumericScalar, [0, 10, 20, 30]), (DivideNumericScalar, [0, 2.5, 5, 7.5]), (DivideByFeature, [np.inf, 0.4, 0.2, 2 / 15.0]), ] features = [] for test in to_test: features.append(Feature(es["log"].ww["value"], primitive=test[0](2))) features.append(Feature(es["log"].ww["value"]) / 0) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2, 3], ) for f, test in zip(features, to_test): v = df[f.get_name()].tolist() assert v == test[1] test = [np.nan, np.inf, np.inf, np.inf] v = df[features[-1].get_name()].tolist() assert np.isnan(v[0]) assert v[1:] == test[1:] def test_arithmetic_two_vals_fails(es): error_text = "Not a feature" with pytest.raises(Exception, match=error_text): Feature([2, 2], primitive=AddNumeric) 
def test_arithmetic_of_identity(es): to_test = [ (AddNumeric, [0.0, 7.0, 14.0, 21.0]), (SubtractNumeric, [0, 3, 6, 9]), (MultiplyNumeric, [0, 10, 40, 90]), (DivideNumeric, [np.nan, 2.5, 2.5, 2.5]), ] features = [] for test in to_test: features.append( Feature( [ Feature(es["log"].ww["value"]), Feature(es["log"].ww["value_2"]), ], primitive=test[0], ), ) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 1, 2, 3], ) for i, test in enumerate(to_test[:-1]): v = df[features[i].get_name()].tolist() assert v == test[1] i, test = -1, to_test[-1] v = df[features[i].get_name()].tolist() assert np.isnan(v[0]) assert v[1:] == test[1][1:] def test_arithmetic_of_direct(es): rating = Feature(es["products"].ww["rating"]) log_rating = Feature(rating, "log") customer_age = Feature(es["customers"].ww["age"]) session_age = Feature(customer_age, "sessions") log_age = Feature(session_age, "log") to_test = [ (AddNumeric, [38, 37, 37.5, 37.5]), (SubtractNumeric, [28, 29, 28.5, 28.5]), (MultiplyNumeric, [165, 132, 148.5, 148.5]), (DivideNumeric, [6.6, 8.25, 22.0 / 3, 22.0 / 3]), ] features = [] for test in to_test: features.append(Feature([log_age, log_rating], primitive=test[0])) df = calculate_feature_matrix( entityset=es, features=features, instance_ids=[0, 3, 5, 7], ) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] @pytest.fixture def boolean_mult_es(): es = EntitySet() df = pd.DataFrame( { "index": [0, 1, 2], "bool": pd.Series([True, False, True]), "numeric": [2, 3, np.nan], }, ) es.add_dataframe( dataframe_name="test", dataframe=df, index="index", logical_types={"numeric": Double}, ) return es def test_boolean_multiply(boolean_mult_es): es = boolean_mult_es to_test = [ ("numeric", "numeric"), ("numeric", "bool"), ("bool", "numeric"), ("bool", "bool"), ] features = [] for row in to_test: features.append(Feature(es["test"].ww[row[0]]) * Feature(es["test"].ww[row[1]])) fm = 
calculate_feature_matrix(entityset=es, features=features) df = es["test"] for row in to_test: col_name = "{} * {}".format(row[0], row[1]) if row[0] == "bool" and row[1] == "bool": assert fm[col_name].equals((df[row[0]] & df[row[1]]).astype("boolean")) else: assert fm[col_name].equals(df[row[0]] * df[row[1]]) def test_arithmetic_of_transform(es): diff1 = Feature([Feature(es["log"].ww["value"])], primitive=Diff) diff2 = Feature([Feature(es["log"].ww["value_2"])], primitive=Diff) to_test = [ (AddNumeric, [np.nan, 7.0, -7.0, 10.0]), (SubtractNumeric, [np.nan, 3.0, -3.0, 4.0]), (MultiplyNumeric, [np.nan, 10.0, 10.0, 21.0]), (DivideNumeric, [np.nan, 2.5, 2.5, 2.3333333333333335]), ] features = [] for test in to_test: features.append(Feature([diff1, diff2], primitive=test[0]())) feature_set = FeatureSet(features) calculator = FeatureSetCalculator(es, feature_set=feature_set) df = calculator.run(np.array([0, 2, 12, 13])) for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert np.isnan(v.pop(0)) assert np.isnan(test[1].pop(0)) assert v == test[1] def test_not_feature(es): not_feat = Feature(es["customers"].ww["loves_ice_cream"], primitive=Not) features = [not_feat] df = calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 1]) v = df[not_feat.get_name()].values assert not v[0] assert v[1] def test_arithmetic_of_agg(es): customer_id_feat = Feature(es["customers"].ww["id"]) store_id_feat = Feature(es["stores"].ww["id"]) count_customer = Feature( customer_id_feat, parent_dataframe_name="régions", primitive=Count, ) count_stores = Feature( store_id_feat, parent_dataframe_name="régions", primitive=Count, ) to_test = [ (AddNumeric, [6, 2]), (SubtractNumeric, [0, -2]), (MultiplyNumeric, [9, 0]), (DivideNumeric, [1, 0]), ] features = [] for test in to_test: features.append(Feature([count_customer, count_stores], primitive=test[0]())) ids = ["United States", "Mexico"] df = calculate_feature_matrix(entityset=es, features=features, 
instance_ids=ids) df = df.loc[ids] for i, test in enumerate(to_test): v = df[features[i].get_name()].tolist() assert v == test[1] def test_latlong(es): log_latlong_feat = Feature(es["log"].ww["latlong"]) latitude = Feature(log_latlong_feat, primitive=Latitude) longitude = Feature(log_latlong_feat, primitive=Longitude) features = [latitude, longitude] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) latvalues = df[latitude.get_name()].values lonvalues = df[longitude.get_name()].values assert len(latvalues) == 15 assert len(lonvalues) == 15 real_lats = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14] real_lons = [0, 2, 4, 6, 8, 0, 1, 2, 3, 0, 0, 2, 0, 3, 6] for ( i, v, ) in enumerate(real_lats): assert v == latvalues[i] for ( i, v, ) in enumerate(real_lons): assert v == lonvalues[i] def test_latlong_with_nan(es): df = es["log"] df["latlong"][0] = np.nan df["latlong"][1] = (10, np.nan) df["latlong"][2] = (np.nan, 4) df["latlong"][3] = (np.nan, np.nan) es.replace_dataframe(dataframe_name="log", df=df) log_latlong_feat = Feature(es["log"].ww["latlong"]) latitude = Feature(log_latlong_feat, primitive=Latitude) longitude = Feature(log_latlong_feat, primitive=Longitude) features = [latitude, longitude] fm = calculate_feature_matrix(entityset=es, features=features) latvalues = fm[latitude.get_name()].values lonvalues = fm[longitude.get_name()].values assert len(latvalues) == 17 assert len(lonvalues) == 17 real_lats = [ np.nan, 10, np.nan, np.nan, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14, np.nan, np.nan, ] real_lons = [ np.nan, np.nan, 4, np.nan, 8, 0, 1, 2, 3, 0, 0, 2, 0, 3, 6, np.nan, np.nan, ] assert np.allclose(latvalues, real_lats, atol=0.0001, equal_nan=True) assert np.allclose(lonvalues, real_lons, atol=0.0001, equal_nan=True) def test_haversine(es): log_latlong_feat = Feature(es["log"].ww["latlong"]) log_latlong_feat2 = Feature(es["log"].ww["latlong2"]) haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine) 
features = [haversine] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) values = df[haversine.get_name()].values real = [ 0, 525.318462, 1045.32190304, 1554.56176802, 2047.3294327, 0, 138.16578931, 276.20524822, 413.99185444, 0, 0, 525.318462, 0, 741.57941183, 1467.52760175, ] assert len(values) == 15 assert np.allclose(values, real, atol=0.0001) haversine = Feature( [log_latlong_feat, log_latlong_feat2], primitive=Haversine(unit="kilometers"), ) features = [haversine] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) values = df[haversine.get_name()].values real_km = [ 0, 845.41812212, 1682.2825471, 2501.82467535, 3294.85736668, 0, 222.35628593, 444.50926278, 666.25531268, 0, 0, 845.41812212, 0, 1193.45638714, 2361.75676089, ] assert len(values) == 15 assert np.allclose(values, real_km, atol=0.0001) error_text = "Invalid unit inches provided. Must be one of" with pytest.raises(ValueError, match=error_text): Haversine(unit="inches") def test_haversine_with_nan(es): # Check some `nan` values df = es["log"] df["latlong"][0] = np.nan df["latlong"][1] = (10, np.nan) es.replace_dataframe(dataframe_name="log", df=df) log_latlong_feat = Feature(es["log"].ww["latlong"]) log_latlong_feat2 = Feature(es["log"].ww["latlong2"]) haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine) features = [haversine] df = calculate_feature_matrix(entityset=es, features=features) values = df[haversine.get_name()].values real = [ np.nan, np.nan, 1045.32190304, 1554.56176802, 2047.3294327, 0, 138.16578931, 276.20524822, 413.99185444, 0, 0, 525.318462, 0, 741.57941183, 1467.52760175, np.nan, np.nan, ] assert np.allclose(values, real, atol=0.0001, equal_nan=True) # Check all `nan` values df = es["log"] df["latlong2"] = np.nan es.replace_dataframe(dataframe_name="log", df=df) log_latlong_feat = Feature(es["log"].ww["latlong"]) log_latlong_feat2 = Feature(es["log"].ww["latlong2"]) haversine = 
Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine) features = [haversine] df = calculate_feature_matrix(entityset=es, features=features) values = df[haversine.get_name()].values real = [np.nan] * es["log"].shape[0] assert np.allclose(values, real, atol=0.0001, equal_nan=True) def test_text_primitives(es): words = Feature(es["log"].ww["comments"], primitive=NumWords) chars = Feature(es["log"].ww["comments"], primitive=NumCharacters) features = [words, chars] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) word_counts = [532, 3, 3, 653, 1306, 1305, 174, 173, 79, 246, 1253, 3, 3, 3, 3] char_counts = [ 3392, 10, 10, 4116, 7961, 7580, 992, 957, 437, 1325, 6322, 10, 10, 10, 10, ] word_values = df[words.get_name()].values char_values = df[chars.get_name()].values assert len(word_values) == 15 for i, v in enumerate(word_values): assert v == word_counts[i] for i, v in enumerate(char_values): assert v == char_counts[i] def test_isin_feat(es): isin = Feature( es["log"].ww["product_id"], primitive=IsIn(list_of_outputs=["toothpaste", "coke zero"]), ) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [True, True, True, False, False, True, True, True] v = df[isin.get_name()].tolist() assert true == v def test_isin_feat_other_syntax(es): isin = Feature(es["log"].ww["product_id"]).isin(["toothpaste", "coke zero"]) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [True, True, True, False, False, True, True, True] v = df[isin.get_name()].tolist() assert true == v def test_isin_feat_other_syntax_int(es): isin = Feature(es["log"].ww["value"]).isin([5, 10]) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [False, True, True, False, False, False, False, False] v = df[isin.get_name()].tolist() assert true == v def test_isin_feat_custom(es): 
class CustomIsIn(TransformPrimitive): name = "is_in" input_types = [ColumnSchema()] return_type = ColumnSchema(logical_type=Boolean) def __init__(self, list_of_outputs=None): self.list_of_outputs = list_of_outputs def get_function(self): def pd_is_in(array): return array.isin(self.list_of_outputs) return pd_is_in isin = Feature( es["log"].ww["product_id"], primitive=CustomIsIn(list_of_outputs=["toothpaste", "coke zero"]), ) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [True, True, True, False, False, True, True, True] v = df[isin.get_name()].tolist() assert true == v isin = Feature(es["log"].ww["product_id"]).isin(["toothpaste", "coke zero"]) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [True, True, True, False, False, True, True, True] v = df[isin.get_name()].tolist() assert true == v isin = Feature(es["log"].ww["value"]).isin([5, 10]) features = [isin] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(8), ) true = [False, True, True, False, False, False, False, False] v = df[isin.get_name()].tolist() assert true == v def test_isnull_feat(es): value = Feature(es["log"].ww["value"]) diff = Feature( value, groupby=Feature(es["log"].ww["session_id"]), primitive=Diff, ) isnull = Feature(diff, primitive=IsNull) features = [isnull] df = calculate_feature_matrix( entityset=es, features=features, instance_ids=range(15), ) correct_vals = [ True, False, False, False, False, True, False, False, False, True, True, False, True, False, False, ] values = df[isnull.get_name()].tolist() assert correct_vals == values def test_percentile(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) feature_set = FeatureSet([p]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array(range(10, 17))) true = es["log"][v.get_name()].rank(pct=True) true = true.loc[range(10, 17)] 
for t, a in zip(true.values, df[p.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_dependent_percentile(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) p2 = Feature(p - 1, primitive=Percentile) feature_set = FeatureSet([p, p2]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array(range(10, 17))) true = es["log"][v.get_name()].rank(pct=True) true = true.loc[range(10, 17)] for t, a in zip(true.values, df[p.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_agg_percentile(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) agg = Feature(p, parent_dataframe_name="sessions", primitive=Sum) feature_set = FeatureSet([agg]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) log_vals = es["log"][[v.get_name(), "session_id"]] log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True) true_p = log_vals.groupby("session_id")["percentile"].sum()[[0, 1]] for t, a in zip(true_p.values, df[agg.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_percentile_agg_percentile(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) agg = Feature(p, parent_dataframe_name="sessions", primitive=Sum) pagg = Feature(agg, primitive=Percentile) feature_set = FeatureSet([pagg]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) log_vals = es["log"][[v.get_name(), "session_id"]] log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True) true_p = log_vals.groupby("session_id")["percentile"].sum().fillna(0) true_p = true_p.rank(pct=True)[[0, 1]] for t, a in zip(true_p.values, df[pagg.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_percentile_agg(es): v = Feature(es["log"].ww["value"]) agg = Feature(v, parent_dataframe_name="sessions", primitive=Sum) pagg = Feature(agg, primitive=Percentile) feature_set 
= FeatureSet([pagg]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) log_vals = es["log"][[v.get_name(), "session_id"]] true_p = log_vals.groupby("session_id")[v.get_name()].sum().fillna(0) true_p = true_p.rank(pct=True)[[0, 1]] for t, a in zip(true_p.values, df[pagg.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_direct_percentile(es): v = Feature(es["customers"].ww["age"]) p = Feature(v, primitive=Percentile) d = Feature(p, "sessions") feature_set = FeatureSet([d]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) cust_vals = es["customers"][[v.get_name()]] cust_vals["percentile"] = cust_vals[v.get_name()].rank(pct=True) true_p = cust_vals["percentile"].loc[[0, 0]] for t, a in zip(true_p.values, df[d.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or t == a def test_direct_agg_percentile(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) agg = Feature(p, parent_dataframe_name="customers", primitive=Sum) d = Feature(agg, "sessions") feature_set = FeatureSet([d]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) log_vals = es["log"][[v.get_name(), "session_id"]] log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True) log_vals["customer_id"] = [0] * 10 + [1] * 5 + [2] * 2 true_p = log_vals.groupby("customer_id")["percentile"].sum().fillna(0) true_p = true_p[[0, 0]] for t, a in zip(true_p.values, df[d.get_name()].values): assert (pd.isnull(t) and pd.isnull(a)) or round(t, 3) == round(a, 3) def test_percentile_with_cutoff(es): v = Feature(es["log"].ww["value"]) p = Feature(v, primitive=Percentile) feature_set = FeatureSet([p]) calculator = FeatureSetCalculator( es, feature_set, pd.Timestamp("2011/04/09 10:30:13"), ) df = calculator.run(np.array([2])) assert df[p.get_name()].tolist()[0] == 1.0 def test_two_kinds_of_dependents(es): v = Feature(es["log"].ww["value"]) product = 
Feature(es["log"].ww["product_id"]) agg = Feature( v, parent_dataframe_name="customers", where=product == "coke zero", primitive=Sum, ) p = Feature(agg, primitive=Percentile) g = Feature(agg, primitive=Absolute) agg2 = Feature( v, parent_dataframe_name="sessions", where=product == "coke zero", primitive=Sum, ) agg3 = Feature(agg2, parent_dataframe_name="customers", primitive=Sum) feature_set = FeatureSet([p, g, agg3]) calculator = FeatureSetCalculator(es, feature_set) df = calculator.run(np.array([0, 1])) assert df[p.get_name()].tolist() == [2.0 / 3, 1.0] assert df[g.get_name()].tolist() == [15, 26] def test_get_filepath(es): class Mod4(TransformPrimitive): """Return base feature modulo 4""" name = "mod4" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) def get_function(self): filepath = self.get_filepath("featuretools_unit_test_example.csv") reference = pd.read_csv(filepath, header=None).squeeze("columns") def map_to_word(x): def _map(x): if pd.isnull(x): return x return reference[int(x) % 4] return x.apply(_map) return map_to_word feat = Feature(es["log"].ww["value"], primitive=Mod4) df = calculate_feature_matrix(features=[feat], entityset=es, instance_ids=range(17)) assert pd.isnull(df["MOD4(value)"][15]) assert df["MOD4(value)"][0] == 0 assert df["MOD4(value)"][14] == 2 fm, fl = dfs( entityset=es, target_dataframe_name="log", agg_primitives=[], trans_primitives=[Mod4], ) assert fm["MOD4(value)"][0] == 0 assert fm["MOD4(value)"][14] == 2 assert pd.isnull(fm["MOD4(value)"][15]) def test_override_multi_feature_names(es): def gen_custom_names(primitive, base_feature_names): return [ "Above18(%s)" % base_feature_names, "Above21(%s)" % base_feature_names, "Above65(%s)" % base_feature_names, ] class IsGreater(TransformPrimitive): name = "is_greater" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 3 def get_function(self): 
def is_greater(x): return x > 18, x > 21, x > 65 return is_greater def generate_names(primitive, base_feature_names): return gen_custom_names(primitive, base_feature_names) fm, features = dfs( entityset=es, target_dataframe_name="customers", instance_ids=[0, 1, 2], agg_primitives=[], trans_primitives=[IsGreater], ) expected_names = gen_custom_names(IsGreater, ["age"]) for name in expected_names: assert name in fm.columns def test_time_since_primitive_matches_all_datetime_types(es): fm, fl = dfs( target_dataframe_name="customers", entityset=es, trans_primitives=[TimeSince], agg_primitives=[], max_depth=1, ) customers_datetime_cols = [ id for id, t in es["customers"].ww.logical_types.items() if isinstance(t, Datetime) ] expected_names = [f"TIME_SINCE({v})" for v in customers_datetime_cols] for name in expected_names: assert name in fm.columns def test_cfm_with_numeric_lag_and_non_nullable_column(es): # fill nans so we can use non nullable numeric logical type in the EntitySet new_log = es["log"].copy() new_log["value"] = new_log["value"].fillna(0) new_log.ww.init( logical_types={"value": "Integer", "product_id": "Categorical"}, index="id", time_index="datetime", name="new_log", ) es.add_dataframe(new_log) rels = [ ("sessions", "id", "new_log", "session_id"), ("products", "id", "new_log", "product_id"), ] es = es.add_relationships(rels) assert isinstance(es["new_log"].ww.logical_types["value"], Integer) periods = 5 lag_primitive = NumericLag(periods=periods) cutoff_times = es["new_log"][["id", "datetime"]] fm, _ = dfs( target_dataframe_name="new_log", entityset=es, agg_primitives=[], trans_primitives=[lag_primitive], cutoff_time=cutoff_times, ) assert fm["NUMERIC_LAG(datetime, value, periods=5)"].head(periods).isnull().all() assert fm["NUMERIC_LAG(datetime, value, periods=5)"].isnull().sum() == periods assert "NUMERIC_LAG(datetime, value_2, periods=5)" in fm.columns assert "NUMERIC_LAG(datetime, products.rating, periods=5)" in fm.columns assert ( 
fm["NUMERIC_LAG(datetime, products.rating, periods=5)"] .head(periods) .isnull() .all() ) def test_cfm_with_lag_and_non_nullable_columns(es): # fill nans so we can use non nullable numeric logical type in the EntitySet new_log = es["log"].copy() new_log["value"] = new_log["value"].fillna(0) new_log["value_double"] = new_log["value"] new_log["purchased_with_nulls"] = new_log["purchased"] new_log["purchased_with_nulls"][0:4] = None new_log.ww.init( logical_types={ "value": "Integer", "value_2": "IntegerNullable", "product_id": "Categorical", "value_double": "Double", "purchased_with_nulls": "BooleanNullable", }, index="id", time_index="datetime", name="new_log", ) es.add_dataframe(new_log) rels = [ ("sessions", "id", "new_log", "session_id"), ("products", "id", "new_log", "product_id"), ] es = es.add_relationships(rels) assert isinstance(es["new_log"].ww.logical_types["value"], Integer) periods = 5 lag_primitive = Lag(periods=periods) cutoff_times = es["new_log"][["id", "datetime"]] fm, _ = dfs( target_dataframe_name="new_log", entityset=es, agg_primitives=[], trans_primitives=[lag_primitive], cutoff_time=cutoff_times, ) # Integer assert fm["LAG(value, datetime, periods=5)"].head(periods).isnull().all() assert fm["LAG(value, datetime, periods=5)"].isnull().sum() == periods assert isinstance( fm.ww.schema.logical_types["LAG(value, datetime, periods=5)"], IntegerNullable, ) # IntegerNullable assert "LAG(value_2, datetime, periods=5)" in fm.columns assert fm["LAG(value_2, datetime, periods=5)"].head(periods).isnull().all() assert isinstance( fm.ww.schema.logical_types["LAG(value_2, datetime, periods=5)"], IntegerNullable, ) # Categorical assert "LAG(product_id, datetime, periods=5)" in fm.columns assert fm["LAG(product_id, datetime, periods=5)"].head(periods).isnull().all() assert isinstance( fm.ww.schema.logical_types["LAG(product_id, datetime, periods=5)"], Categorical, ) # Double assert "LAG(value_double, datetime, periods=5)" in fm.columns assert 
fm["LAG(value_double, datetime, periods=5)"].head(periods).isnull().all() assert isinstance( fm.ww.schema.logical_types["LAG(value_double, datetime, periods=5)"], Double, ) # Boolean assert "LAG(purchased, datetime, periods=5)" in fm.columns assert fm["LAG(purchased, datetime, periods=5)"].head(periods).isnull().all() assert isinstance( fm.ww.schema.logical_types["LAG(purchased, datetime, periods=5)"], BooleanNullable, ) # BooleanNullable assert "LAG(purchased_with_nulls, datetime, periods=5)" in fm.columns assert ( fm["LAG(purchased_with_nulls, datetime, periods=5)"] .head(periods) .isnull() .all() ) assert isinstance( fm.ww.schema.logical_types["LAG(purchased_with_nulls, datetime, periods=5)"], BooleanNullable, ) def test_comparisons_with_ordinal_valid_inputs_that_dont_work_but_should(es): # TODO: Remove this test once the correct behavior is implemented in CFM # The following test covers a scenario where an intermediate feature doesn't have the correct type # because Woodwork has not yet been initialized. This calculation should work and return valid True/False # values. This should be fixed in a future PR, but until a fix is implemented null values are returned to # prevent calculate_feature_matrix from raising an Error when calculating features generated by DFS. 
priority_level = Feature(es["log"].ww["priority_level"]) first_priority = AggregationFeature( priority_level, parent_dataframe_name="customers", primitive=First, ) engagement = Feature(es["customers"].ww["engagement_level"]) invalid_but_should_be_valid = [ TransformFeature([engagement, first_priority], primitive=LessThan), TransformFeature([engagement, first_priority], primitive=LessThanEqualTo), TransformFeature([engagement, first_priority], primitive=GreaterThan), TransformFeature([engagement, first_priority], primitive=GreaterThanEqualTo), ] fm = calculate_feature_matrix( entityset=es, features=invalid_but_should_be_valid, ) feature_cols = [f.get_name() for f in invalid_but_should_be_valid] for col in feature_cols: assert fm[col].isnull().all() def test_multiply_numeric_boolean(): test_cases = [ {"val": 100, "mask": True, "expected": 100}, {"val": 100, "mask": False, "expected": 0}, {"val": 0, "mask": False, "expected": 0}, {"val": 100, "mask": pd.NA, "expected": pd.NA}, {"val": pd.NA, "mask": pd.NA, "expected": pd.NA}, {"val": pd.NA, "mask": True, "expected": pd.NA}, {"val": pd.NA, "mask": False, "expected": pd.NA}, ] multiply_numeric_boolean = MultiplyNumericBoolean() for input in test_cases: vals = pd.Series(input["val"]).astype("Int64") mask = pd.Series(input["mask"]) actual = multiply_numeric_boolean(vals, mask).tolist()[0] expected = input["expected"] if pd.isnull(expected): assert pd.isnull(actual) else: assert actual == input["expected"] def test_multiply_numeric_boolean_multiple_dtypes_no_nulls(): # Test without null values vals = pd.Series([1, 2, 3]) bools = pd.Series([True, False, True]) multiply_numeric_boolean = MultiplyNumericBoolean() numeric_dtypes = ["float64", "int64", "Int64"] boolean_dtypes = ["bool", "boolean"] for numeric_dtype in numeric_dtypes: for boolean_dtype in boolean_dtypes: actual = multiply_numeric_boolean( vals.astype(numeric_dtype), bools.astype(boolean_dtype), ) expected = pd.Series([1, 0, 3]) 
pd.testing.assert_series_equal(actual, expected, check_dtype=False) def test_multiply_numeric_boolean_multiple_dtypes_with_nulls(): # Test with null values vals = pd.Series([np.nan, 2, 3]) bools = pd.Series([True, False, pd.NA], dtype="boolean") multiply_numeric_boolean = MultiplyNumericBoolean() numeric_dtypes = ["float64", "Int64"] for numeric_dtype in numeric_dtypes: actual = multiply_numeric_boolean(vals.astype(numeric_dtype), bools) expected = pd.Series([np.nan, 0, np.nan]) pd.testing.assert_series_equal(actual, expected, check_dtype=False) def test_feature_multiplication(es): numeric_ft = Feature(es["customers"].ww["age"]) boolean_ft = Feature(es["customers"].ww["loves_ice_cream"]) mult_numeric = numeric_ft * numeric_ft mult_boolean = boolean_ft * boolean_ft mult_numeric_boolean = numeric_ft * boolean_ft mult_numeric_boolean2 = boolean_ft * numeric_ft assert issubclass(type(mult_numeric.primitive), MultiplyNumeric) assert issubclass(type(mult_boolean.primitive), MultiplyBoolean) assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean) assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean) # Test with nullable types es["customers"].ww.set_types( logical_types={"age": "IntegerNullable", "loves_ice_cream": "BooleanNullable"}, ) numeric_ft = Feature(es["customers"].ww["age"]) boolean_ft = Feature(es["customers"].ww["loves_ice_cream"]) mult_numeric = numeric_ft * numeric_ft mult_boolean = boolean_ft * boolean_ft mult_numeric_boolean = numeric_ft * boolean_ft mult_numeric_boolean2 = boolean_ft * numeric_ft assert issubclass(type(mult_numeric.primitive), MultiplyNumeric) assert issubclass(type(mult_boolean.primitive), MultiplyBoolean) assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean) assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/__init__.py 
================================================ ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_cumulative_time_since.py ================================================ from datetime import datetime import numpy as np import pandas as pd from featuretools.primitives import ( CumulativeTimeSinceLastFalse, CumulativeTimeSinceLastTrue, ) from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestCumulativeTimeSinceLastTrue(PrimitiveTestBase): primitive = CumulativeTimeSinceLastTrue booleans = pd.Series([False, True, False, True, False, False]) datetimes = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))], ) answer = pd.Series([np.nan, 0, 6, 0, 6, 12]) def test_regular(self): primitive_func = self.primitive().get_function() given_answer = primitive_func(self.datetimes, self.booleans) assert given_answer.equals(self.answer) def test_all_false(self): primitive_func = self.primitive().get_function() booleans = pd.Series([False, False, False]) datetimes = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))], ) given_answer = primitive_func(datetimes, booleans) answer = pd.Series([np.nan] * 3) assert given_answer.equals(answer) def test_all_nan(self): primitive_func = self.primitive().get_function() datetimes = pd.Series([np.nan] * 4) booleans = pd.Series([np.nan] * 4) given_answer = primitive_func(datetimes, booleans) answer = pd.Series([np.nan] * 4) assert given_answer.equals(answer) def test_some_nans(self): primitive_func = self.primitive().get_function() booleans = pd.Series( [ False, True, False, True, False, False, True, True, False, False, ], ) datetimes = pd.Series([np.nan] * 2) datetimes = pd.concat([datetimes, self.datetimes]) datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)]) datetimes = datetimes.reset_index(drop=True) answer = pd.Series( [ np.nan, np.nan, 
np.nan, 0, 6, 12, 0, 0, np.nan, np.nan, ], ) given_answer = primitive_func(datetimes, booleans) assert given_answer.equals(answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestCumulativeTimeSinceLastFalse(PrimitiveTestBase): primitive = CumulativeTimeSinceLastFalse booleans = pd.Series([True, False, True, False, True, True]) datetimes = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))], ) answer = pd.Series([np.nan, 0, 6, 0, 6, 12]) def test_regular(self): primitive_func = self.primitive().get_function() given_answer = primitive_func(self.datetimes, self.booleans) assert given_answer.equals(self.answer) def test_all_true(self): primitive_func = self.primitive().get_function() booleans = pd.Series([True, True, True]) datetimes = pd.Series( [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))], ) given_answer = primitive_func(datetimes, booleans) answer = pd.Series([np.nan] * 3) assert given_answer.equals(answer) def test_all_nan(self): primitive_func = self.primitive().get_function() datetimes = pd.Series([np.nan] * 4) booleans = pd.Series([np.nan] * 4) given_answer = primitive_func(datetimes, booleans) answer = pd.Series([np.nan] * 4) assert given_answer.equals(answer) def test_some_nans(self): primitive_func = self.primitive().get_function() booleans = pd.Series( [ True, False, True, False, True, True, False, False, True, True, ], ) datetimes = pd.Series([np.nan] * 2) datetimes = pd.concat([datetimes, self.datetimes]) datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)]) datetimes = datetimes.reset_index(drop=True) answer = pd.Series( [ np.nan, np.nan, np.nan, 0, 6, 12, 0, 0, np.nan, np.nan, ], ) given_answer = primitive_func(datetimes, booleans) assert given_answer.equals(answer) def test_with_featuretools(self, es): 
transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_datetoholiday_primitive.py ================================================ from datetime import datetime import numpy as np import pandas as pd import pytest from featuretools.primitives import DateToHoliday def test_datetoholiday(): date_to_holiday = DateToHoliday() dates = pd.Series( [ datetime(2016, 1, 1), datetime(2016, 2, 27), datetime(2017, 5, 29, 10, 30, 5), datetime(2018, 7, 4), ], ) holiday_series = date_to_holiday(dates).tolist() assert holiday_series[0] == "New Year's Day" assert np.isnan(holiday_series[1]) assert holiday_series[2] == "Memorial Day" assert holiday_series[3] == "Independence Day" def test_datetoholiday_error(): error_text = r"must be one of the available countries.*" with pytest.raises(ValueError, match=error_text): DateToHoliday(country="UNK") def test_nat(): date_to_holiday = DateToHoliday() case = pd.Series( [ "2019-10-14", "NaT", "2016-02-15", "NaT", ], ).astype("datetime64[ns]") answer = ["Columbus Day", np.nan, "Washington's Birthday", np.nan] given_answer = date_to_holiday(case).astype("str") np.testing.assert_array_equal(given_answer, answer) def test_valid_country(): date_to_holiday = DateToHoliday(country="Canada") case = pd.Series( [ "2016-07-01", "2016-11-11", "2018-12-25", ], ).astype("datetime64[ns]") answer = ["Canada Day", np.nan, "Christmas Day"] given_answer = date_to_holiday(case).astype("str") np.testing.assert_array_equal(given_answer, answer) def test_multiple_countries(): dth_mexico = DateToHoliday(country="Mexico") case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)]) assert len(dth_mexico(case)) > 1 dth_india = DateToHoliday(country="IND") case = pd.Series([datetime(2048, 1, 1), 
datetime(2048, 10, 2)]) assert len(dth_india(case)) > 1 dth_uk = DateToHoliday(country="UK") case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)]) assert len(dth_uk(case)) > 1 countries = [ "Argentina", "AU", "Austria", "BY", "Belgium", "Brazil", "Canada", "Colombia", "Croatia", "England", "Finland", "FRA", "Germany", "Germany", "Italy", "NewZealand", "PortugalExt", "PTE", "Spain", "ES", "Switzerland", "UnitedStates", "US", "UK", "UA", "CH", "SE", "ZA", ] for x in countries: DateToHoliday(country=x) def test_with_timezone_aware_datetimes(): df = pd.DataFrame( { "non_timezone_aware_with_time": pd.date_range( "2018-07-03 09:00", periods=3, ), "non_timezone_aware_no_time": pd.date_range("2018-07-03", periods=3), "timezone_aware_with_time": pd.date_range( "2018-07-03 09:00", periods=3, ).tz_localize(tz="US/Eastern"), "timezone_aware_no_time": pd.date_range( "2018-07-03", periods=3, ).tz_localize(tz="US/Eastern"), }, ) date_to_holiday = DateToHoliday(country="US") expected = [np.nan, "Independence Day", np.nan] for col in df.columns: actual = date_to_holiday(df[col]).astype("str") np.testing.assert_array_equal(actual, expected) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_distancetoholiday_primitive.py ================================================ from datetime import datetime import numpy as np import pandas as pd import pytest from featuretools.primitives import DistanceToHoliday def test_distanceholiday(): distance_to_holiday = DistanceToHoliday("New Year's Day") dates = pd.Series( [ datetime(2010, 1, 1), datetime(2012, 5, 31), datetime(2017, 7, 31), datetime(2020, 12, 31), ], ) expected = [0, -151, 154, 1] output = distance_to_holiday(dates).tolist() np.testing.assert_array_equal(output, expected) def test_unknown_country_error(): error_text = r"must be one of the available countries.*" with pytest.raises(ValueError, match=error_text): DistanceToHoliday("Victoria Day", 
country="UNK") def test_unknown_holiday_error(): error_text = r"must be one of the available holidays.*" with pytest.raises(ValueError, match=error_text): DistanceToHoliday("Alteryx Day") def test_nat(): date_to_holiday = DistanceToHoliday("New Year's Day") case = pd.Series( [ "2010-01-01", "NaT", "2012-05-31", "NaT", ], ).astype("datetime64[ns]") answer = [0, np.nan, -151, np.nan] given_answer = date_to_holiday(case).astype("float") np.testing.assert_array_equal(given_answer, answer) def test_valid_country(): distance_to_holiday = DistanceToHoliday("Canada Day", country="Canada") case = pd.Series( [ "2010-01-01", "2012-05-31", "2017-07-31", "2020-12-31", ], ).astype("datetime64[ns]") answer = [181, 31, -30, 182] given_answer = distance_to_holiday(case).astype("float") np.testing.assert_array_equal(given_answer, answer) def test_with_timezone_aware_datetimes(): df = pd.DataFrame( { "non_timezone_aware_with_time": pd.date_range( "2018-07-03 09:00", periods=3, ), "non_timezone_aware_no_time": pd.date_range("2018-07-03", periods=3), "timezone_aware_with_time": pd.date_range( "2018-07-03 09:00", periods=3, ).tz_localize(tz="US/Eastern"), "timezone_aware_no_time": pd.date_range( "2018-07-03", periods=3, ).tz_localize(tz="US/Eastern"), }, ) distance_to_holiday = DistanceToHoliday("Independence Day", country="US") expected = [1, 0, -1] for col in df.columns: actual = distance_to_holiday(df[col]) np.testing.assert_array_equal(actual, expected) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_expanding_primitives.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives.standard.transform.time_series.expanding import ( ExpandingCount, ExpandingMax, ExpandingMean, ExpandingMin, ExpandingSTD, ExpandingTrend, ) from featuretools.primitives.standard.transform.time_series.utils import ( _apply_gap_for_expanding_primitives, ) from 
featuretools.utils import calculate_trend @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), ], ) def test_expanding_count_series(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).count() num_nans = gap + min_periods - 1 expected[range(num_nans)] = np.nan primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance(window_series.index) pd.testing.assert_series_equal(pd.Series(actual), expected) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_count_date_range(window_date_range, min_periods, gap): test = _apply_gap_for_expanding_primitives(gap=gap, x=window_date_range) expected = test.expanding(min_periods=min_periods).count() num_nans = gap + min_periods - 1 expected[range(num_nans)] = np.nan primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance(window_date_range) pd.testing.assert_series_equal(pd.Series(actual), expected) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_min(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).min().values primitive_instance = ExpandingMin(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance( numeric=window_series, datetime=window_series.index, ) pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_max(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).max().values primitive_instance = ExpandingMax(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance( numeric=window_series, datetime=window_series.index, ) 
pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_std(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).std().values primitive_instance = ExpandingSTD(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance( numeric=window_series, datetime=window_series.index, ) pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_mean(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).mean().values primitive_instance = ExpandingMean(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance( numeric=window_series, datetime=window_series.index, ) pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "min_periods, gap", [ (5, 2), (5, 0), (0, 0), (0, 1), ], ) def test_expanding_trend(window_series, min_periods, gap): test = window_series.shift(gap) expected = test.expanding(min_periods=min_periods).aggregate(calculate_trend).values primitive_instance = ExpandingTrend(min_periods=min_periods, gap=gap).get_function() actual = primitive_instance( numeric=window_series, datetime=window_series.index, ) pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "primitive", [ ExpandingMax, ExpandingMean, ExpandingMin, ExpandingSTD, ExpandingTrend, ], ) def test_expanding_primitives_throw_error_when_given_string_offset( window_series, primitive, ): error_msg = ( "String offsets are not supported for the gap parameter in Expanding primitives" ) with pytest.raises(TypeError, match=error_msg): primitive(gap="2H").get_function()( numeric=window_series, datetime=window_series.index, ) def 
test_apply_gap_for_expanding_primitives_throws_error_when_given_string_offset( window_series, ): error_msg = ( "String offsets are not supported for the gap parameter in Expanding primitives" ) with pytest.raises(TypeError, match=error_msg): _apply_gap_for_expanding_primitives(window_series, gap="2H") @pytest.mark.parametrize( "gap", [ 2, 5, 3, 0, ], ) def test_apply_gap_for_expanding_primitives(window_series, gap): actual = _apply_gap_for_expanding_primitives(window_series, gap).values expected = window_series.shift(gap).values pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected)) @pytest.mark.parametrize( "gap", [ 2, 5, 3, 0, ], ) def test_apply_gap_for_expanding_primitives_handles_date_range( window_date_range, gap, ): actual = pd.Series( _apply_gap_for_expanding_primitives(window_date_range, gap).values, ) expected = pd.Series(window_date_range.to_series().shift(gap).values) pd.testing.assert_series_equal(actual, expected) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_exponential_primitives.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import ( ExponentialWeightedAverage, ExponentialWeightedSTD, ExponentialWeightedVariance, ) def test_regular_com_avg(): primitive_instance = ExponentialWeightedAverage(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series([1.0, 1.75, 5.384615384615384, 5.125]) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_span_avg(): primitive_instance = ExponentialWeightedAverage(span=1.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948]) pd.testing.assert_series_equal(answer, correct_answer) def 
test_regular_halflife_avg(): primitive_instance = ExponentialWeightedAverage(halflife=2.7) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [1.0, 1.563830114594977, 3.8556233149044865, 4.2592901785684205], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_alpha_avg(): primitive_instance = ExponentialWeightedAverage(alpha=0.8) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948]) pd.testing.assert_series_equal(answer, correct_answer) def test_na_avg(): primitive_instance = ExponentialWeightedAverage(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.053191489361702], ) pd.testing.assert_series_equal(answer, correct_answer) def test_ignorena_true_avg(): primitive_instance = ExponentialWeightedAverage(com=0.5, ignore_na=True) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.125], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_com_std(): primitive_instance = ExponentialWeightedSTD(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.7071067811865475, 3.584153156068229, 2.0048019276803304], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_span_std(): primitive_instance = ExponentialWeightedSTD(span=1.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = 
pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_halflife_std(): primitive_instance = ExponentialWeightedSTD(halflife=2.7) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.7071067811865475, 3.3565236098585416, 2.631776826295855], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_alpha_std(): primitive_instance = ExponentialWeightedSTD(alpha=0.8) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718], ) pd.testing.assert_series_equal(answer, correct_answer) def test_na_std(): primitive_instance = ExponentialWeightedSTD(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [ np.nan, 0.7071067811865475, 3.584153156068229, 3.5841531560682287, 1.8408520483016189, ], ) pd.testing.assert_series_equal(answer, correct_answer) def test_ignorena_true_std(): primitive_instance = ExponentialWeightedSTD(com=0.5, ignore_na=True) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [ np.nan, 0.7071067811865475, 3.584153156068229, 3.584153156068229, 2.0048019276803304, ], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_com_var(): primitive_instance = ExponentialWeightedVariance(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.49999999999999983, 
12.846153846153847, 4.019230769230769], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_span_var(): primitive_instance = ExponentialWeightedVariance(span=1.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series([np.nan, 0.5, 12.999999999999996, 2.996898263027294]) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_halflife_var(): primitive_instance = ExponentialWeightedVariance(halflife=2.7) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [np.nan, 0.49999999999999994, 11.266250743537816, 6.926249263427883], ) pd.testing.assert_series_equal(answer, correct_answer) def test_regular_alpha_var(): primitive_instance = ExponentialWeightedVariance(alpha=0.8) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series([np.nan, 0.5, 12.999999999999996, 2.996898263027294]) pd.testing.assert_series_equal(answer, correct_answer) def test_na_var(): primitive_instance = ExponentialWeightedVariance(com=0.5) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [ np.nan, 0.49999999999999983, 12.846153846153847, 12.846153846153843, 3.3887362637362655, ], ) pd.testing.assert_series_equal(answer, correct_answer) def test_ignorena_true_var(): primitive_instance = ExponentialWeightedVariance(com=0.5, ignore_na=True) primitive_func = primitive_instance.get_function() array = pd.Series([1, 2, 7, np.nan, 5]) answer = pd.Series(primitive_func(array)) correct_answer = pd.Series( [ np.nan, 0.49999999999999983, 12.846153846153847, 12.846153846153847, 4.019230769230769, ], ) pd.testing.assert_series_equal(answer, correct_answer) 
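# Note on the hard-coded expectations above: the values in test_regular_com_avg
# ([1.0, 1.75, 5.3846..., 5.125]) appear consistent with the adjusted
# exponentially weighted mean, where alpha = 1 / (1 + com) and the observation
# i steps back gets weight (1 - alpha) ** i. The sketch below recomputes them in
# plain Python. The helper name `ewm_average` is hypothetical and not part of
# featuretools; that the primitive uses this adjusted form is an assumption
# inferred from the expected values, not taken from the primitive's source.

```python
def ewm_average(values, com):
    """Adjusted exponentially weighted mean of a list of numbers.

    Hypothetical helper for illustration only; it does not handle NaN values.
    """
    alpha = 1.0 / (1.0 + com)  # smoothing factor derived from center of mass
    decay = 1.0 - alpha
    out = []
    for t in range(len(values)):
        # Weight (1 - alpha) ** i for the observation i steps before position t.
        weights = [decay**i for i in range(t + 1)]
        weighted_sum = sum(w * values[t - i] for i, w in enumerate(weights))
        out.append(weighted_sum / sum(weights))
    return out


result = ewm_average([1, 2, 7, 5], com=0.5)
```

# With com=0.5 the smoothing factor is 2/3, so each step back is down-weighted by
# a factor of 1/3; the second value is (2 + 1/3) / (1 + 1/3) = 1.75.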
================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_full_name_primitives.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import ( FullNameToFirstName, FullNameToLastName, FullNameToTitle, ) from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestFullNameToFirstName(PrimitiveTestBase): primitive = FullNameToFirstName def test_urls(self): # note this implementation incorrectly identifies the first # name for 'Oliva y Ocana, Dona. Fermina' primitive_func = self.primitive().get_function() names = pd.Series( [ "Spector, Mr. Woolf", "Oliva y Ocana, Dona. Fermina", "Saether, Mr. Simon Sivertsen", "Ware, Mr. Frederick", "Peter, Master. Michael J", ], ) answer = pd.Series(["Woolf", "Oliva", "Simon", "Frederick", "Michael"]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_no_title(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "James Masters", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Michael", "James", "Kate"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_empty_string(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Michael", np.nan, "Kate"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_single_name(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "James", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Michael", "James", "Kate"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_nan(self): primitive_func = 
self.primitive().get_function() names = pd.Series(["Mr. James Brown", np.nan, None]) answer = pd.Series(["James", np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestFullNameToLastName(PrimitiveTestBase): primitive = FullNameToLastName def test_urls(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Spector, Mr. Woolf", "Oliva y Ocana, Dona. Fermina", "Saether, Mr. Simon Sivertsen", "Ware, Mr. Frederick", "Peter, Master. Michael J", ], ) answer = pd.Series(["Spector", "Oliva y Ocana", "Saether", "Ware", "Peter"]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_no_title(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "James Masters", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Peter", "Masters", "Brown-Jones"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_empty_string(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Peter", np.nan, "Brown-Jones"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_single_name(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "James", "Kate Elizabeth Brown-Jones", ], ) answer = pd.Series(["Peter", np.nan, "Brown-Jones"], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_nan(self): primitive_func = self.primitive().get_function() names = pd.Series(["Mr. 
James Brown", np.nan, None]) answer = pd.Series(["Brown", np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestFullNameToTitle(PrimitiveTestBase): primitive = FullNameToTitle def test_urls(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Spector, Mr. Woolf", "Oliva y Ocana, Dona. Fermina", "Saether, Mr. Simon Sivertsen", "Ware, Mr. Frederick", "Peter, Master. Michael J", "Mr. Brown", ], ) answer = pd.Series(["Mr", "Dona", "Mr", "Mr", "Master", "Mr"]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_no_title(self): primitive_func = self.primitive().get_function() names = pd.Series( [ "Peter, Michael J", "James Master.", "Mrs Brown", "", ], ) answer = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=object) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_nan(self): primitive_func = self.primitive().get_function() names = pd.Series(["Mr. 
Brown", np.nan, None]) answer = pd.Series(["Mr", np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_is_federal_holiday.py ================================================ from datetime import datetime import numpy as np import pandas as pd from pytest import raises from featuretools.primitives import IsFederalHoliday def test_regular(): primitive_instance = IsFederalHoliday() primitive_func = primitive_instance.get_function() case = pd.Series( [ "2016-01-01", "2016-02-29", "2017-05-29", datetime(2019, 7, 4, 10, 0, 30), ], ).astype("datetime64[ns]") answer = pd.Series([True, False, True, True]) given_answer = pd.Series(primitive_func(case)) assert given_answer.equals(answer) def test_nat(): primitive_instance = IsFederalHoliday() primitive_func = primitive_instance.get_function() case = pd.Series( [ "2019-10-14", "NaT", "2016-02-29", "NaT", ], ).astype("datetime64[ns]") answer = pd.Series([True, np.nan, False, np.nan]) given_answer = pd.Series(primitive_func(case)) assert given_answer.equals(answer) def test_valid_country(): primitive_instance = IsFederalHoliday(country="Canada") primitive_func = primitive_instance.get_function() case = pd.Series( [ "2016-07-01", "2016-11-11", "2018-09-03", ], ).astype("datetime64[ns]") answer = pd.Series([True, False, True]) given_answer = pd.Series(primitive_func(case)) assert given_answer.equals(answer) def test_invalid_country(): error_text = "must be one of the available countries" with raises(ValueError, match=error_text): IsFederalHoliday(country="") def test_multiple_countries(): primitive_mexico = 
IsFederalHoliday(country="Mexico") primitive_func = primitive_mexico.get_function() case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)]) assert len(primitive_func(case)) > 1 primitive_india = IsFederalHoliday(country="IND") primitive_func = primitive_india.get_function() case = pd.Series([datetime(2048, 1, 1), datetime(2048, 10, 2)]) assert len(primitive_func(case)) > 1 primitive_uk = IsFederalHoliday(country="UK") primitive_func = primitive_uk.get_function() case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)]) assert len(primitive_func(case)) > 1 countries = [ "Argentina", "AU", "Austria", "BY", "Belgium", "Brazil", "Canada", "Colombia", "Croatia", "England", "Finland", "FRA", "Germany", "Italy", "NewZealand", "PortugalExt", "PTE", "Spain", "ES", "Switzerland", "UnitedStates", "US", "UK", "UA", "CH", "SE", "ZA", ] for x in countries: IsFederalHoliday(country=x) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_latlong_primitives.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import CityblockDistance, GeoMidpoint, IsInGeoBox def test_cityblock(): primitive_instance = CityblockDistance() latlong_1 = pd.Series([(i, i) for i in range(3)]) latlong_2 = pd.Series([(i, i) for i in range(3, 6)]) answer = pd.Series([414.56051391, 414.52893691, 414.43421555]) given_answer = primitive_instance(latlong_1, latlong_2) np.testing.assert_allclose(given_answer, answer, rtol=1e-09) primitive_instance = CityblockDistance(unit="kilometers") answer = primitive_instance(latlong_1, latlong_2) given_answer = pd.Series([667.1704814, 667.11966315, 666.96722389]) np.testing.assert_allclose(given_answer, answer, rtol=1e-09) def test_cityblock_nans(): primitive_instance = CityblockDistance() lats_longs_1 = [(i, i) for i in range(2)] lats_longs_2 = [(i, i)
for i in range(2, 4)] lats_longs_1 += [(1, 1), (np.nan, 3), (4, np.nan), (np.nan, np.nan)] lats_longs_2 += [(np.nan, np.nan), (np.nan, 5), (6, np.nan), (np.nan, np.nan)] given_answer = pd.Series(list([276.37367594, 276.35262728] + [np.nan] * 4)) answer = primitive_instance(lats_longs_1, lats_longs_2) np.testing.assert_allclose(given_answer, answer, rtol=1e-09) def test_cityblock_error(): error_text = "Invalid unit given" with pytest.raises(ValueError, match=error_text): CityblockDistance(unit="invalid") def test_midpoint(): latlong1 = pd.Series([(-90, -180), (90, 180)]) latlong2 = pd.Series([(+90, +180), (-90, -180)]) function = GeoMidpoint().get_function() answer = function(latlong1, latlong2) for lat, longi in answer: assert lat == 0.0 assert longi == 0.0 def test_midpoint_floating(): latlong1 = pd.Series([(-45.5, -100.5), (45.5, 100.5)]) latlong2 = pd.Series([(+45.5, +100.5), (-45.5, -100.5)]) function = GeoMidpoint().get_function() answer = function(latlong1, latlong2) for lat, longi in answer: assert lat == 0.0 assert longi == 0.0 def test_midpoint_zeros(): latlong1 = pd.Series([(0, 0), (0, 0)]) latlong2 = pd.Series([(0, 0), (0, 0)]) function = GeoMidpoint().get_function() answer = function(latlong1, latlong2) for lat, longi in answer: assert lat == 0.0 assert longi == 0.0 def test_midpoint_nan(): all_nan = pd.Series([(np.nan, np.nan), (np.nan, np.nan)]) latlong1 = pd.Series([(0, 0), (0, 0)]) function = GeoMidpoint().get_function() answer = function(all_nan, latlong1) for lat, longi in answer: assert np.isnan(lat) assert np.isnan(longi) def test_isingeobox(): latlong = pd.Series( [ (1, 2), (5, 7), (-5, 4), (2, 3), (0, 0), (np.nan, np.nan), (-2, np.nan), (np.nan, 1), ], ) bottomleft = (-5, -5) topright = (5, 5) primitive = IsInGeoBox(bottomleft, topright) function = primitive.get_function() primitive_answer = function(latlong) answer = pd.Series([True, False, True, True, True, False, False, False]) assert np.array_equal(primitive_answer, answer) def 
test_boston(): NYC = (40.7128, -74.0060) SF = (37.7749, -122.4194) Somerville = (42.3876, -71.0995) Beijing = (39.9042, 116.4074) CapeTown = (-33.9249, 18.4241) latlong = pd.Series([NYC, SF, Somerville, Beijing, CapeTown]) LynnMA = (42.4668, -70.9495) DedhamMA = (42.2436, -71.1677) primitive = IsInGeoBox(LynnMA, DedhamMA) function = primitive.get_function() primitive_answer = function(latlong) answer = pd.Series([False, False, True, False, False]) assert np.array_equal(primitive_answer, answer) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_change.py ================================================ import numpy as np import pandas as pd from pytest import raises from featuretools.primitives import PercentChange from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestPercentChange(PrimitiveTestBase): primitive = PercentChange def test_regular(self): data = pd.Series([2, 5, 15, 3, 3, 9, 4.5]) answer = pd.Series([np.nan, 1.5, 2.0, -0.8, 0, 2.0, -0.5]) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_raises(self): with raises(ValueError): self.primitive(fill_method="invalid") def test_period(self): data = pd.Series([2, 4, 8]) answer = pd.Series([np.nan, np.nan, 3]) primitive_func = self.primitive(periods=2).get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) data = pd.Series([2, 4, 8] + [np.nan] * 4) primitive_func = self.primitive(limit=2).get_function() answer = pd.Series([np.nan, 1, 1, 0, 0, np.nan, np.nan]) given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_nan(self): data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan]) answer = pd.Series([np.nan, np.nan, 1, 1, 0, -0.5, 0])
primitive_func = self.primitive().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_zero(self): data = pd.Series([2, 0, 0, 5, 0, -4]) answer = pd.Series([np.nan, -1, np.nan, np.inf, -1, -np.inf]) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_inf(self): data = pd.Series([0, np.inf, 0, 5, -np.inf, np.inf, -np.inf]) answer = pd.Series([np.nan, np.inf, -1, np.inf, -np.inf, np.nan, np.nan]) primitive_func = self.primitive().get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_freq(self): dates = pd.DatetimeIndex( ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-05"], ) data = pd.Series([1, 2, 3, 4], index=dates) answer = pd.Series([np.nan, 1.0, 0.5, np.nan]) date_offset = pd.tseries.offsets.DateOffset(days=1) primitive_func = self.primitive(freq=date_offset).get_function() given_answer = primitive_func(data) np.testing.assert_array_equal(given_answer, answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instantiate = self.primitive transform.append(primitive_instantiate) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_unique.py ================================================ import numpy as np import pandas as pd from featuretools.primitives import PercentUnique from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, ) class TestPercentUnique(PrimitiveTestBase): array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8]) primitive = PercentUnique def test_percent_unique(self): primitive_func = self.primitive().get_function() assert primitive_func(self.array) == (8 / 10.0) def test_nans(self): primitive_func = self.primitive().get_function()
array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])]) assert primitive_func(array_nans) == (8 / 11.0) primitive_func = self.primitive(skipna=False).get_function() assert primitive_func(array_nans) == (9 / 11.0) def test_multiple_nans(self): primitive_func = self.primitive().get_function() array_nans = pd.concat([self.array.copy(), pd.Series([np.nan] * 3)]) assert primitive_func(array_nans) == (8 / 13.0) primitive_func = self.primitive(skipna=False).get_function() assert primitive_func(array_nans) == (9 / 13.0) def test_empty_string(self): primitive_func = self.primitive().get_function() array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, "", ""])]) assert primitive_func(array_empty_string) == (9 / 13.0) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_postal_primitives.py ================================================ import pandas as pd from featuretools.primitives.standard.transform.postal import ( OneDigitPostalCode, TwoDigitPostalCode, ) def test_one_digit_postal_code(postal_code_dataframe): primitive = OneDigitPostalCode().get_function() for x in postal_code_dataframe: series = postal_code_dataframe[x] actual = primitive(series) expected = series.apply(lambda t: str(t)[0] if pd.notna(t) else pd.NA) pd.testing.assert_series_equal(actual, expected) def test_two_digit_postal_code(postal_code_dataframe): primitive = TwoDigitPostalCode().get_function() for x in postal_code_dataframe: series = postal_code_dataframe[x] actual = primitive(series) expected = series.apply(lambda t: str(t)[:2] if pd.notna(t) else pd.NA) pd.testing.assert_series_equal(actual, expected) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_same_as_previous.py ================================================ import numpy as np import pandas as pd import pytest from featuretools.primitives import SameAsPrevious class 
TestSameAsPrevious: def test_ints(self): primitive_func = SameAsPrevious().get_function() array = pd.Series([1, 2, 2, 3, 2], dtype="int64") answer = primitive_func(array) correct_answer = pd.Series([False, False, True, False, False]) pd.testing.assert_series_equal(answer, correct_answer) def test_int64(self): primitive_func = SameAsPrevious().get_function() array = pd.Series([1, 2, 2, 3, 2], dtype="Int64") answer = primitive_func(array) correct_answer = pd.Series([False, False, True, False, False], dtype="boolean") pd.testing.assert_series_equal(answer, correct_answer) def test_floats(self): primitive_func = SameAsPrevious().get_function() array = pd.Series([1.0, 2.5, 2.5, 3.0, 2.0], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, False, True, False, False]) pd.testing.assert_series_equal(answer, correct_answer) def test_mixed(self): primitive_func = SameAsPrevious().get_function() array = pd.Series([1, 2, 2.0, 3, 2.0], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, False, True, False, False]) np.testing.assert_array_equal(answer, correct_answer) def test_nan(self): primitive_instance = SameAsPrevious() primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, 3, np.nan, 2], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, True, False, True, False]) np.testing.assert_array_equal(answer, correct_answer) def test_all_nan(self): primitive_instance = SameAsPrevious() primitive_func = primitive_instance.get_function() array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, False, False, False]) np.testing.assert_array_equal(answer, correct_answer) def test_inf(self): primitive_instance = SameAsPrevious() primitive_func = primitive_instance.get_function() array = pd.Series([1, np.inf, 3, np.inf, 2], dtype="float64") answer = primitive_func(array) correct_answer = 
pd.Series([False, False, False, False, False]) np.testing.assert_array_equal(answer, correct_answer) def test_all_inf(self): primitive_instance = SameAsPrevious() primitive_func = primitive_instance.get_function() array = pd.Series([np.inf, np.inf, np.inf, np.inf], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, True, True, True]) np.testing.assert_array_equal(answer, correct_answer) def test_fill_method_bfill(self): primitive_instance = SameAsPrevious(fill_method="bfill") primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, 3, 2, 2], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, False, True, False, True]) np.testing.assert_array_equal(answer, correct_answer) def test_fill_method_bfill_with_limit(self): primitive_instance = SameAsPrevious(fill_method="bfill", limit=2) primitive_func = primitive_instance.get_function() array = pd.Series([1, np.nan, np.nan, np.nan, 2, 3], dtype="float64") answer = primitive_func(array) correct_answer = pd.Series([False, False, False, True, True, False]) np.testing.assert_array_equal(answer, correct_answer) def test_raises(self): with pytest.raises(ValueError): SameAsPrevious(fill_method="invalid") ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_savgol_filter.py ================================================ from math import floor import numpy as np import pandas as pd from pytest import raises from featuretools.primitives import SavgolFilter from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) class TestSavgolFilter(PrimitiveTestBase): primitive = SavgolFilter data = pd.Series( [ 0, 1, 1, 2, 3, 4, 5, 7, 8, 7, 9, 9, 12, 11, 12, 14, 15, 17, 17, 17, 20, 21, 20, 20, 22, 21, 25, 25, 26, 29, 30, 30, 28, 26, 34, 35, 33, 31, 38, 34, 39, 37, 42, 35, 36, 44, 46, 43, 39, 39, 44, 49, 45, 44, 44, 52, 50, 47, 58, 
59, 60, 55, 57, 63, 61, 65, 66, 57, 65, 61, 60, 71, 64, 62, 70, 65, 67, 77, 68, 75, 72, 69, 82, 66, 84, 80, 76, 87, 77, 73, 90, 91, 92, 93, 78, 76, 82, 96, 91, 94, ], ) expected_output = pd.Series( [ -0.24600037643516087, 0.6354225484660259, 1.518717742974036, 2.405318302343475, 3.296657321828948, 4.1941678966850615, 5.099283122166421, 6.0134360935276305, 6.938059906023296, 7.874587654908025, 8.824452435436303, 9.786858450473883, 10.923177508989724, 12.025171624713803, 13.009153318077633, 14.08041843739766, 14.900621118012227, 15.796338672768673, 16.77084014383764, 17.662961752206375, 18.472703497874882, 19.451454723765682, 20.530565544295253, 21.849950964367157, 22.478260869564927, 23.15233736515171, 24.12356979405003, 25.23962079110788, 26.000980712650854, 27.082379862699877, 27.787839163124843, 28.879045439685797, 29.762994442627924, 31.067342268714864, 32.11147433801854, 32.666557698593884, 33.06864988558309, 34.00098071265075, 35.134030728995945, 36.15135665250035, 36.945733899966825, 37.56227525335028, 38.55769859431137, 39.3975155279498, 39.87054593004198, 40.304347826086435, 41.11670480549146, 42.00948022229432, 41.982674076495044, 42.62798300098016, 43.15887544949274, 44.53481529911678, 45.680614579927486, 46.93886891140834, 47.98300098071202, 48.80549199084604, 50.28244524354299, 52.66851912389601, 54.28604118993064, 55.81529911735788, 57.10297482837455, 57.82641386073805, 59.45276234063342, 60.77280156913945, 61.23667865315383, 61.81660673422607, 62.60281137626594, 62.54004576658957, 62.78653154625613, 63.23046747302958, 64.09087937234307, 65.25661981039471, 65.19385420071833, 66.34161490683144, 66.65021248774022, 67.38280483818154, 68.8126838836212, 69.79470415168265, 70.943772474664, 72.74076495586698, 73.04020921869797, 73.3586139261187, 74.67734553775647, 75.71559333115299, 77.51814318404607, 79.62471395880902, 80.60150375939745, 80.61163779012645, 81.89342922523593, 82.41124550506593, 83.19293292519846, 83.97174920172642, 84.7620599588564, 
85.57823082079385, 86.4346274117442, 87.34561535591293, 88.32556027750543, 89.38882780072717, 90.54978354978357, 91.82279314888011, ], ) def test_error(self): window_length = 1 polyorder = 3 mode = "incorrect" error_text = "polyorder must be less than window_length." with raises(ValueError, match=error_text): self.primitive(window_length, polyorder) error_text = ( "Both window_length and polyorder must be defined if you define one." ) with raises(ValueError, match=error_text): self.primitive(window_length=window_length) with raises(ValueError, match=error_text): self.primitive(polyorder=polyorder) error_text = "mode must be 'mirror', 'constant', 'nearest', 'wrap' or 'interp'." with raises(ValueError, match=error_text): self.primitive( window_length=window_length, polyorder=polyorder, mode=mode, ) def test_less_window_size(self): primitive_func = self.primitive().get_function() for i in range(20): data = pd.Series(list(range(i)), dtype="float64") assert data.equals(primitive_func(data)) def test_regular(self): window_length = floor(len(self.data) / 10) * 2 + 1 polyorder = 3 primitive_func = self.primitive(window_length, polyorder).get_function() output = list(primitive_func(self.data)) for a, b in zip(self.expected_output, output): assert np.isclose(a, b) def test_nans(self): primitive_func = self.primitive().get_function() data_nans = self.data.copy() data_nans = pd.concat([data_nans, pd.Series([np.nan] * 5, dtype="float64")]) # more than 5 nans due to window assert sum(np.isnan(primitive_func(data_nans))) == 15 def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instantiate = self.primitive() transform.append(primitive_instantiate) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_season.py ================================================ from datetime import datetime import 
pandas as pd from featuretools.primitives import Season class TestSeason: def test_regular(self): primitive_instance = Season() primitive_func = primitive_instance.get_function() case = pd.date_range(start="2019-01", periods=12, freq="m").to_series() answer = pd.Series( [ "winter", "winter", "spring", "spring", "spring", "summer", "summer", "summer", "fall", "fall", "fall", "winter", ], dtype="string", ) given_answer = primitive_func(case) pd.testing.assert_series_equal( given_answer.reset_index(drop=True), answer.reset_index(drop=True), ) def test_nat(self): primitive_instance = Season() primitive_func = primitive_instance.get_function() case = pd.Series( [ "NaT", "2019-02", "2019-03", "NaT", ], ).astype("datetime64[ns]") answer = pd.Series([pd.NA, "winter", "winter", pd.NA], dtype="string") given_answer = pd.Series(primitive_func(case)) pd.testing.assert_series_equal(given_answer, answer) def test_datetime(self): primitive_instance = Season() primitive_func = primitive_instance.get_function() case = pd.Series( [ datetime(2011, 3, 1), datetime(2011, 6, 1), datetime(2011, 9, 1), datetime(2011, 12, 1), # leap year datetime(2020, 2, 29), ], ) answer = pd.Series( ["winter", "spring", "summer", "fall", "winter"], dtype="string", ) given_answer = primitive_func(case) pd.testing.assert_series_equal(given_answer, answer) ================================================ FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_transform_primitive.py ================================================ import warnings from datetime import datetime import numpy as np import pandas as pd import pytest from pytz import timezone from featuretools.primitives import ( Age, DateToTimeZone, DayOfYear, DaysInMonth, EmailAddressToDomain, FileExtension, IsFirstWeekOfMonth, IsFreeEmailDomain, IsLeapYear, IsLunchTime, IsMonthEnd, IsMonthStart, IsQuarterEnd, IsQuarterStart, IsWorkingHours, IsYearEnd, IsYearStart, Lag, NthWeekOfMonth, NumericLag, PartOfDay, Quarter, 
RateOfChange, TimeSince, URLToDomain, URLToProtocol, URLToTLD, Week, get_transform_primitives, ) from featuretools.tests.primitive_tests.utils import ( PrimitiveTestBase, find_applicable_primitives, valid_dfs, ) def test_time_since(): time_since = TimeSince() # class datetime.datetime(year, month, day[, hour[, minute[, second[, microsecond[, times = pd.Series( [ datetime(2019, 3, 1, 0, 0, 0, 1), datetime(2019, 3, 1, 0, 0, 1, 0), datetime(2019, 3, 1, 0, 2, 0, 0), ], ) cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0) values = time_since(array=times, time=cutoff_time) assert list(map(int, values)) == [0, -1, -120] time_since = TimeSince(unit="nanoseconds") values = time_since(array=times, time=cutoff_time) assert list(map(round, values)) == [-1000, -1000000000, -120000000000] time_since = TimeSince(unit="milliseconds") values = time_since(array=times, time=cutoff_time) assert list(map(int, values)) == [0, -1000, -120000] time_since = TimeSince(unit="Milliseconds") values = time_since(array=times, time=cutoff_time) assert list(map(int, values)) == [0, -1000, -120000] time_since = TimeSince(unit="Years") values = time_since(array=times, time=cutoff_time) assert list(map(int, values)) == [0, 0, 0] times_y = pd.Series( [ datetime(2019, 3, 1, 0, 0, 0, 1), datetime(2020, 3, 1, 0, 0, 1, 0), datetime(2017, 3, 1, 0, 0, 0, 0), ], ) time_since = TimeSince(unit="Years") values = time_since(array=times_y, time=cutoff_time) assert list(map(int, values)) == [0, -1, 1] error_text = "Invalid unit given, make sure it is plural" with pytest.raises(ValueError, match=error_text): time_since = TimeSince(unit="na") time_since(array=times, time=cutoff_time) def test_age(): age = Age() dates = pd.Series(datetime(2010, 2, 26)) ages = age(dates, time=datetime(2020, 2, 26)) correct_ages = [10.005] # .005 added due to leap years np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3) def test_age_two_years_quarterly(): age = Age() dates = pd.Series(pd.date_range("2010-01-01", 
"2011-12-31", freq="Q")) ages = age(dates, time=datetime(2020, 2, 26)) correct_ages = [9.915, 9.666, 9.414, 9.162, 8.915, 8.666, 8.414, 8.162] np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3) def test_age_leap_year(): age = Age() dates = pd.Series([datetime(2016, 1, 1)]) ages = age(dates, time=datetime(2016, 3, 1)) correct_ages = [(31 + 29) / 365.0] np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3) # born leap year date dates = pd.Series([datetime(2016, 2, 29)]) ages = age(dates, time=datetime(2020, 2, 29)) correct_ages = [4.0027] # .0027 added due to leap year np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3) def test_age_nan(): age = Age() dates = pd.Series([datetime(2010, 1, 1), np.nan, datetime(2012, 1, 1)]) ages = age(dates, time=datetime(2020, 2, 26)) correct_ages = [10.159, np.nan, 8.159] np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3) def test_day_of_year(): doy = DayOfYear() dates = pd.Series([datetime(2019, 12, 31), np.nan, datetime(2020, 12, 31)]) days_of_year = doy(dates) correct_days = [365, np.nan, 366] np.testing.assert_array_equal(days_of_year, correct_days) def test_days_in_month(): dim = DaysInMonth() dates = pd.Series( [datetime(2010, 1, 1), datetime(2019, 2, 1), np.nan, datetime(2020, 2, 1)], ) days_in_month = dim(dates) correct_days = [31, 28, np.nan, 29] np.testing.assert_array_equal(days_in_month, correct_days) def test_is_leap_year(): ily = IsLeapYear() dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 1, 1)]) leap_year_bools = ily(dates) correct_bools = [True, False] np.testing.assert_array_equal(leap_year_bools, correct_bools) def test_is_month_end(): ime = IsMonthEnd() dates = pd.Series( [datetime(2019, 3, 1), datetime(2021, 2, 28), datetime(2020, 2, 29)], ) ime_bools = ime(dates) correct_bools = [False, True, True] np.testing.assert_array_equal(ime_bools, correct_bools) def test_is_month_start(): ims = IsMonthStart() dates = pd.Series( [datetime(2019, 3, 1), 
datetime(2020, 2, 28), datetime(2020, 2, 29)], ) ims_bools = ims(dates) correct_bools = [True, False, False] np.testing.assert_array_equal(ims_bools, correct_bools) def test_is_quarter_end(): iqe = IsQuarterEnd() dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)]) iqe_bools = iqe(dates) correct_bools = [False, True] np.testing.assert_array_equal(iqe_bools, correct_bools) def test_is_quarter_start(): iqs = IsQuarterStart() dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)]) iqs_bools = iqs(dates) correct_bools = [True, False] np.testing.assert_array_equal(iqs_bools, correct_bools) def test_is_lunch_time_default(): is_lunch_time = IsLunchTime() dates = pd.Series( [ datetime(2022, 6, 26, 12, 12, 12), datetime(2022, 6, 28, 12, 3, 4), datetime(2022, 6, 28, 11, 3, 4), np.nan, ], ) actual = is_lunch_time(dates) expected = [True, True, False, False] np.testing.assert_array_equal(actual, expected) def test_is_lunch_time_configurable(): is_lunch_time = IsLunchTime(14) dates = pd.Series( [ datetime(2022, 6, 26, 12, 12, 12), datetime(2022, 6, 28, 14, 3, 4), datetime(2022, 6, 28, 11, 3, 4), np.nan, ], ) actual = is_lunch_time(dates) expected = [False, True, False, False] np.testing.assert_array_equal(actual, expected) def test_is_working_hours_standard_hours(): is_working_hours = IsWorkingHours() dates = pd.Series( [ datetime(2022, 6, 21, 16, 3, 3), datetime(2019, 1, 3, 4, 4, 4), datetime(2022, 1, 1, 12, 1, 2), ], ) actual = is_working_hours(dates).tolist() expected = [True, False, True] np.testing.assert_array_equal(actual, expected) def test_is_working_hours_configured_hours(): is_working_hours = IsWorkingHours(15, 18) dates = pd.Series( [ datetime(2022, 6, 21, 16, 3, 3), datetime(2022, 6, 26, 14, 4, 4), datetime(2022, 1, 1, 12, 1, 2), ], ) answer = is_working_hours(dates).tolist() expected = [True, False, False] np.testing.assert_array_equal(answer, expected) def test_part_of_day(): pod = PartOfDay() dates = pd.Series( [ datetime(2020, 1, 11, 0, 2, 
1), datetime(2020, 1, 11, 1, 2, 1), datetime(2021, 3, 31, 4, 2, 1), datetime(2020, 3, 4, 6, 2, 1), datetime(2020, 3, 4, 8, 2, 1), datetime(2020, 3, 4, 11, 2, 1), datetime(2020, 3, 4, 14, 2, 3), datetime(2020, 3, 4, 17, 2, 3), datetime(2020, 2, 2, 20, 2, 2), np.nan, ], ) actual = pod(dates) expected = pd.Series( [ "midnight", "midnight", "dawn", "early morning", "late morning", "noon", "afternoon", "evening", "night", np.nan, ], ) pd.testing.assert_series_equal(expected, actual) def test_is_year_end(): is_year_end = IsYearEnd() dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)]) answer = is_year_end(dates) correct_answer = [True, False, False] np.testing.assert_array_equal(answer, correct_answer) def test_is_year_start(): is_year_start = IsYearStart() dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)]) answer = is_year_start(dates) correct_answer = [False, False, True] np.testing.assert_array_equal(answer, correct_answer) def test_quarter_regular(): q = Quarter() array = pd.Series( [ pd.to_datetime("2018-01-01"), pd.to_datetime("2018-04-01"), pd.to_datetime("2018-07-01"), pd.to_datetime("2018-10-01"), ], ) answer = q(array) correct_answer = pd.Series([1, 2, 3, 4]) np.testing.assert_array_equal(answer, correct_answer) def test_quarter_leap_year(): q = Quarter() array = pd.Series( [ pd.to_datetime("2016-02-29"), pd.to_datetime("2018-04-01"), pd.to_datetime("2018-07-01"), pd.to_datetime("2018-10-01"), ], ) answer = q(array) correct_answer = pd.Series([1, 2, 3, 4]) np.testing.assert_array_equal(answer, correct_answer) def test_quarter_nan_and_nat_input(): q = Quarter() array = pd.Series( [ pd.to_datetime("2016-02-29"), np.nan, np.datetime64("NaT"), pd.to_datetime("2018-10-01"), ], ) answer = q(array) correct_answer = pd.Series([1, np.nan, np.nan, 4]) np.testing.assert_array_equal(answer, correct_answer) def test_quarter_year_before_1970(): q = Quarter() array = pd.Series( [ pd.to_datetime("2018-01-01"), 
pd.to_datetime("1950-04-01"), pd.to_datetime("1874-07-01"), pd.to_datetime("2018-10-01"), ], ) answer = q(array) correct_answer = pd.Series([1, 2, 3, 4]) np.testing.assert_array_equal(answer, correct_answer) def test_quarter_year_after_2038(): q = Quarter() array = pd.Series( [ pd.to_datetime("2018-01-01"), pd.to_datetime("2050-04-01"), pd.to_datetime("2174-07-01"), pd.to_datetime("2018-10-01"), ], ) answer = q(array) correct_answer = pd.Series([1, 2, 3, 4]) np.testing.assert_array_equal(answer, correct_answer) def test_quarter(): q = Quarter() dates = [datetime(2019, 12, 1), datetime(2019, 1, 3), datetime(2020, 2, 1)] quarter = q(dates) correct_quarters = [4, 1, 1] np.testing.assert_array_equal(quarter, correct_quarters) def test_week_no_deprecation_message(): dates = [ datetime(2019, 1, 3), datetime(2019, 6, 17, 11, 10, 50), datetime(2019, 11, 30, 19, 45, 15), ] with warnings.catch_warnings(): warnings.simplefilter("error") week = Week() week(dates).tolist() def test_url_to_domain_urls(): url_to_domain = URLToDomain() urls = pd.Series( [ "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22", "http://mplay.google.co.in/sadfask/asdkfals?dk=10", "http://lplay.google.co.in/sadfask/asdkfals?dk=10", "http://play.google.co.in/sadfask/asdkfals?dk=10", "http://tplay.google.co.in/sadfask/asdkfals?dk=10", "http://www.google.co.in/sadfask/asdkfals?dk=10", "www.google.co.in/sadfask/asdkfals?dk=10", "http://user:pass@google.com/?a=b#asdd", "https://www.compzets.com?asd=10", "www.compzets.com?asd=10", "facebook.com", "https://www.compzets.net?asd=10", "http://www.featuretools.org", ], ) correct_urls = [ "play.google.com", "mplay.google.co.in", "lplay.google.co.in", "play.google.co.in", "tplay.google.co.in", "google.co.in", "google.co.in", "google.com", "compzets.com", "compzets.com", "facebook.com", "compzets.net", "featuretools.org", ] np.testing.assert_array_equal(url_to_domain(urls), correct_urls) def test_url_to_domain_long_url(): url_to_domain = 
URLToDomain() urls = pd.Series( [ "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \ 100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \ 000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \ 2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \ 6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \ adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \ Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \ %7Cadb%7Csollections%7Cactivity|Chart", ], ) correct_urls = ["chart.apis.google.com"] results = url_to_domain(urls) np.testing.assert_array_equal(results, correct_urls) def test_url_to_domain_nan(): url_to_domain = URLToDomain() urls = pd.Series(["www.featuretools.com", np.nan], dtype="object") correct_urls = pd.Series(["featuretools.com", np.nan], dtype="object") results = url_to_domain(urls) pd.testing.assert_series_equal(results, correct_urls) def test_url_to_protocol_urls(): url_to_protocol = URLToProtocol() urls = pd.Series( [ "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22", "http://mplay.google.co.in/sadfask/asdkfals?dk=10", "http://lplay.google.co.in/sadfask/asdkfals?dk=10", "www.google.co.in/sadfask/asdkfals?dk=10", "http://user:pass@google.com/?a=b#asdd", "https://www.compzets.com?asd=10", "www.compzets.com?asd=10", "facebook.com", "https://www.compzets.net?asd=10", "http://www.featuretools.org", "https://featuretools.com", ], ) correct_urls = pd.Series( [ "https", "http", "http", np.nan, "http", "https", np.nan, np.nan, "https", "http", "https", ], ) results = url_to_protocol(urls) pd.testing.assert_series_equal(results, correct_urls) def test_url_to_protocol_long_url(): url_to_protocol = URLToProtocol() urls = pd.Series( [ "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \ 100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \ 000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \ 
2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \ 6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \ adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \ Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \ %7Cadb%7Csollections%7Cactivity|Chart", ], ) correct_urls = ["http"] results = url_to_protocol(urls) np.testing.assert_array_equal(results, correct_urls) def test_url_to_protocol_nan(): url_to_protocol = URLToProtocol() urls = pd.Series(["www.featuretools.com", np.nan, ""], dtype="object") correct_urls = pd.Series([np.nan, np.nan, np.nan], dtype="object") results = url_to_protocol(urls) pd.testing.assert_series_equal(results, correct_urls) def test_url_to_tld_urls(): url_to_tld = URLToTLD() urls = pd.Series( [ "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22", "http://mplay.google.co.in/sadfask/asdkfals?dk=10", "http://lplay.google.co.in/sadfask/asdkfals?dk=10", "http://play.google.co.in/sadfask/asdkfals?dk=10", "http://tplay.google.co.in/sadfask/asdkfals?dk=10", "http://www.google.co.in/sadfask/asdkfals?dk=10", "www.google.co.in/sadfask/asdkfals?dk=10", "http://user:pass@google.com/?a=b#asdd", "https://www.compzets.dev?asd=10", "www.compzets.com?asd=10", "https://www.compzets.net?asd=10", "http://www.featuretools.org", "featuretools.org", ], ) correct_urls = [ "com", "in", "in", "in", "in", "in", "in", "com", "dev", "com", "net", "org", "org", ] np.testing.assert_array_equal(url_to_tld(urls), correct_urls) def test_url_to_tld_long_url(): url_to_tld = URLToTLD() urls = pd.Series( [ "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \ 100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \ 000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \ 2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \ 6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \ adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \ 
Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \ %7Cadb%7Csollections%7Cactivity|Chart", ], ) correct_urls = ["com"] np.testing.assert_array_equal(url_to_tld(urls), correct_urls) def test_url_to_tld_nan(): url_to_tld = URLToTLD() urls = pd.Series( ["www.featuretools.com", np.nan, "featuretools", ""], dtype="object", ) correct_urls = pd.Series(["com", np.nan, np.nan, np.nan], dtype="object") results = url_to_tld(urls) pd.testing.assert_series_equal(results, correct_urls, check_names=False) def test_is_free_email_domain_valid_addresses(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series( [ "test@hotmail.com", "name@featuretools.com", "nobody@yahoo.com", "free@gmail.com", ], ) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([True, False, True, True]) pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_valid_addresses_whitespace(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series( [ " test@hotmail.com", " name@featuretools.com", "nobody@yahoo.com ", " free@gmail.com ", ], ) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([True, False, True, True]) pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_nan(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series([np.nan, "name@featuretools.com", "nobody@yahoo.com"]) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([np.nan, False, True]) pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_empty_string(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series(["", "name@featuretools.com", "nobody@yahoo.com"]) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([np.nan, False, True]) pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_empty_series(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series([], 
dtype="category") answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([], dtype="category") pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_invalid_email(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series( [ np.nan, "this is not an email address", "name@featuretools.com", "nobody@yahoo.com", 1234, 1.23, True, ], ) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([np.nan, np.nan, False, True, np.nan, np.nan, np.nan]) pd.testing.assert_series_equal(answers, correct_answers) def test_is_free_email_domain_all_nan(): is_free_email_domain = IsFreeEmailDomain() array = pd.Series([np.nan, np.nan]) answers = pd.Series(is_free_email_domain(array)) correct_answers = pd.Series([np.nan, np.nan], dtype=object) pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_valid_addresses(): email_address_to_domain = EmailAddressToDomain() array = pd.Series( [ "test@hotmail.com", "name@featuretools.com", "nobody@yahoo.com", "free@gmail.com", ], ) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series( ["hotmail.com", "featuretools.com", "yahoo.com", "gmail.com"], ) pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_valid_addresses_whitespace(): email_address_to_domain = EmailAddressToDomain() array = pd.Series( [ " test@hotmail.com", " name@featuretools.com", "nobody@yahoo.com ", " free@gmail.com ", ], ) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series( ["hotmail.com", "featuretools.com", "yahoo.com", "gmail.com"], ) pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_nan(): email_address_to_domain = EmailAddressToDomain() array = pd.Series([np.nan, "name@featuretools.com", "nobody@yahoo.com"]) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series([np.nan, "featuretools.com", "yahoo.com"]) 
pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_empty_string(): email_address_to_domain = EmailAddressToDomain() array = pd.Series(["", "name@featuretools.com", "nobody@yahoo.com"]) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series([np.nan, "featuretools.com", "yahoo.com"]) pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_empty_series(): email_address_to_domain = EmailAddressToDomain() array = pd.Series([], dtype="category") answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series([], dtype="category") pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_invalid_email(): email_address_to_domain = EmailAddressToDomain() array = pd.Series( [ np.nan, "this is not an email address", "name@featuretools.com", "nobody@yahoo.com", 1234, 1.23, True, ], ) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series( [np.nan, np.nan, "featuretools.com", "yahoo.com", np.nan, np.nan, np.nan], ) pd.testing.assert_series_equal(answers, correct_answers) def test_email_address_to_domain_all_nan(): email_address_to_domain = EmailAddressToDomain() array = pd.Series([np.nan, np.nan]) answers = pd.Series(email_address_to_domain(array)) correct_answers = pd.Series([np.nan, np.nan], dtype=object) pd.testing.assert_series_equal(answers, correct_answers) def test_trans_primitives_can_init_without_params(): trans_primitives = get_transform_primitives().values() for trans_primitive in trans_primitives: trans_primitive() def test_numeric_lag_future_warning(): warning_text = "NumericLag is deprecated and will be removed in a future version. Please use the 'Lag' primitive instead." 
    with pytest.warns(FutureWarning, match=warning_text):
        NumericLag()


def test_lag_regular():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(array, time_array))
    correct_answer = pd.Series([np.nan, 1, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_period():
    primitive_instance = Lag(periods=3)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(array, time_array))
    correct_answer = pd.Series([np.nan, np.nan, np.nan, 1])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_negative_period():
    primitive_instance = Lag(periods=-2)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(array, time_array))
    correct_answer = pd.Series([3, 4, np.nan, np.nan])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_starts_with_nan():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()
    array = pd.Series([np.nan, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(array, time_array))
    correct_answer = pd.Series([np.nan, np.nan, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_ends_with_nan():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 3, np.nan])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(array, time_array))
    correct_answer = pd.Series([np.nan, 1, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)
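

# Illustrative sketch, not part of the featuretools test suite: the expected
# values in the Lag tests above are consistent with a plain positional shift of
# the input by `periods` rows. `lag_sketch` is a hypothetical pure-Python helper
# written for this note, with None standing in for NaN; it makes no claim about
# how the Lag primitive is actually implemented.
def lag_sketch(values, periods=1):
    """Shift ``values`` by ``periods`` positions, padding vacated slots with None."""
    n = len(values)
    shifted = [None] * n
    for i, value in enumerate(values):
        j = i + periods  # destination index after the shift
        if 0 <= j < n:
            shifted[j] = value
    return shifted


# Usage mirroring the cases above (None plays the role of np.nan):
#   lag_sketch([1, 2, 3, 4])              gives [None, 1, 2, 3]
#   lag_sketch([1, 2, 3, 4], periods=3)   gives [None, None, None, 1]
#   lag_sketch([1, 2, 3, 4], periods=-2)  gives [3, 4, None, None]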
@pytest.mark.parametrize( "input_array,expected_output", [ ( pd.Series(["hello", "world", "foo", "bar"], dtype="string"), pd.Series([np.nan, "hello", "world", "foo"], dtype="string"), ), ( pd.Series(["cow", "cow", "pig", "pig"], dtype="category"), pd.Series([np.nan, "cow", "cow", "pig"], dtype="category"), ), ( pd.Series([True, False, True, False], dtype="bool"), pd.Series([np.nan, True, False, True], dtype="object"), ), ( pd.Series([True, False, True, False], dtype="boolean"), pd.Series([np.nan, True, False, True], dtype="boolean"), ), ( pd.Series([1.23, 2.45, 3.56, 4.98], dtype="float"), pd.Series([np.nan, 1.23, 2.45, 3.56], dtype="float"), ), ( pd.Series([1, 2, 3, 4], dtype="Int64"), pd.Series([np.nan, 1, 2, 3], dtype="Int64"), ), ( pd.Series([1, 2, 3, 4], dtype="int64"), pd.Series([np.nan, 1, 2, 3], dtype="float64"), ), ], ) def test_lag_with_different_dtypes(input_array, expected_output): primitive_instance = Lag() primitive_func = primitive_instance.get_function() time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D")) answer = pd.Series(primitive_func(input_array, time_array)) pd.testing.assert_series_equal(answer, expected_output) def test_date_to_time_zone_primitive(): primitive_func = DateToTimeZone().get_function() x = pd.Series( [ datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")), datetime(2010, 1, 10, tzinfo=timezone("Singapore")), datetime(2020, 1, 1, tzinfo=timezone("UTC")), datetime(2010, 1, 1, tzinfo=timezone("Europe/London")), ], ) answer = pd.Series(["America/Los_Angeles", "Singapore", "UTC", "Europe/London"]) pd.testing.assert_series_equal(primitive_func(x), answer) def test_date_to_time_zone_datetime64(): primitive_func = DateToTimeZone().get_function() x = pd.Series( [ datetime(2010, 1, 1), datetime(2010, 1, 10), datetime(2020, 1, 1), ], ).astype("datetime64[ns]") x = x.dt.tz_localize("America/Los_Angeles") answer = pd.Series(["America/Los_Angeles"] * 3) pd.testing.assert_series_equal(primitive_func(x), 
answer) def test_date_to_time_zone_naive_dates(): primitive_func = DateToTimeZone().get_function() x = pd.Series( [ datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")), datetime(2010, 1, 1), datetime(2010, 1, 2), ], ) answer = pd.Series(["America/Los_Angeles", np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(x), answer) def test_date_to_time_zone_nan(): primitive_func = DateToTimeZone().get_function() x = pd.Series( [ datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")), pd.NaT, np.nan, ], ) answer = pd.Series(["America/Los_Angeles", np.nan, np.nan]) pd.testing.assert_series_equal(primitive_func(x), answer) def test_rate_of_change_primitive_regular_interval(): rate_of_change = RateOfChange() times = pd.date_range(start="2019-01-01", freq="2s", periods=5) values = [0, 30, 180, -90, 0] expected = pd.Series([np.nan, 15, 75, -135, 45]) actual = rate_of_change(values, times) pd.testing.assert_series_equal(actual, expected) def test_rate_of_change_primitive_uneven_interval(): rate_of_change = RateOfChange() times = pd.to_datetime( [ "2019-01-01 00:00:00", "2019-01-01 00:00:01", "2019-01-01 00:00:03", "2019-01-01 00:00:07", "2019-01-01 00:00:08", ], ) values = [0, 30, 180, -90, 0] expected = pd.Series([np.nan, 30, 75, -67.5, 90]) actual = rate_of_change(values, times) pd.testing.assert_series_equal(actual, expected) def test_rate_of_change_primitive_with_nan(): rate_of_change = RateOfChange() times = pd.date_range(start="2019-01-01", freq="2s", periods=5) values = [0, 30, np.nan, -90, 0] expected = pd.Series([np.nan, 15, np.nan, np.nan, 45]) actual = rate_of_change(values, times) pd.testing.assert_series_equal(actual, expected) class TestFileExtension(PrimitiveTestBase): primitive = FileExtension def test_filepaths(self): primitive_func = FileExtension().get_function() array = pd.Series( [ "doc.txt", "~/documents/data.json", "data.JSON", "C:\\Projects\\apilibrary\\apilibrary.sln", ], dtype="string", ) answer = pd.Series([".txt", ".json", 
".json", ".sln"], dtype="string") pd.testing.assert_series_equal(primitive_func(array), answer) def test_invalid(self): primitive_func = FileExtension().get_function() array = pd.Series(["doc.txt", "~/documents/data", np.nan], dtype="string") answer = pd.Series([".txt", np.nan, np.nan], dtype="string") pd.testing.assert_series_equal(primitive_func(array), answer) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs( es, aggregation, transform, self.primitive, target_dataframe_name="sessions", ) class TestIsFirstWeekOfMonth(PrimitiveTestBase): primitive = IsFirstWeekOfMonth def test_valid_dates(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), pd.to_datetime("03/03/2019"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array).tolist() correct_answers = [True, False, False, False] np.testing.assert_array_equal(answers, correct_answers) def test_leap_year(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), pd.to_datetime("02/29/2016"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array).tolist() correct_answers = [True, False, False, False] np.testing.assert_array_equal(answers, correct_answers) def test_year_before_1970(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("06/01/1965"), pd.to_datetime("03/02/2019"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array).tolist() correct_answers = [True, True, False, False] np.testing.assert_array_equal(answers, correct_answers) def test_year_after_2038(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("12/31/2040"), pd.to_datetime("01/01/2040"), 
pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array).tolist() correct_answers = [False, True, False, False] np.testing.assert_array_equal(answers, correct_answers) def test_nan_input(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), np.nan, np.datetime64("NaT"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array).tolist() correct_answers = [True, np.nan, np.nan, False] np.testing.assert_array_equal(answers, correct_answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) class TestNthWeekOfMonth(PrimitiveTestBase): primitive = NthWeekOfMonth def test_valid_dates(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), pd.to_datetime("03/03/2019"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), pd.to_datetime("09/01/2019"), ], ) answers = primitive_func(array) correct_answers = [1, 2, 6, 5, 1] np.testing.assert_array_equal(answers, correct_answers) def test_leap_year(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), pd.to_datetime("02/29/2016"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array) correct_answers = [1, 5, 6, 5] np.testing.assert_array_equal(answers, correct_answers) def test_year_before_1970(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("06/06/1965"), pd.to_datetime("03/02/2019"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array) correct_answers = [2, 1, 6, 5] np.testing.assert_array_equal(answers, correct_answers) def test_year_after_2038(self): primitive_func = self.primitive().get_function() array 
= pd.Series( [ pd.to_datetime("12/31/2040"), pd.to_datetime("01/01/2001"), pd.to_datetime("03/31/2019"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array) correct_answers = [6, 1, 6, 5] np.testing.assert_array_equal(answers, correct_answers) def test_nan_input(self): primitive_func = self.primitive().get_function() array = pd.Series( [ pd.to_datetime("03/01/2019"), np.nan, np.datetime64("NaT"), pd.to_datetime("03/30/2019"), ], ) answers = primitive_func(array) correct_answers = [1, np.nan, np.nan, 5] np.testing.assert_array_equal(answers, correct_answers) def test_with_featuretools(self, es): transform, aggregation = find_applicable_primitives(self.primitive) primitive_instance = self.primitive() transform.append(primitive_instance) valid_dfs(es, aggregation, transform, self.primitive) ================================================ FILE: featuretools/tests/primitive_tests/utils.py ================================================ from inspect import signature import pytest from featuretools import ( FeatureBase, calculate_feature_matrix, dfs, encode_features, list_primitives, load_features, save_features, ) from featuretools.primitives.base import AggregationPrimitive, PrimitiveBase from featuretools.tests.testing_utils import make_ecommerce_entityset PRIMITIVES = list_primitives() def get_number_from_offset(offset): """Extract the numeric element of a potential offset string. Args: offset (int, str): If offset is an integer, that value is returned. If offset is a string, it's assumed to be an offset string of the format nD where n is a single digit integer. Note: This helper utility should only be used with offset strings that only have one numeric character. Only the first character will be returned, so if an offset string 24H is used, it will incorrectly return the integer 2. Additionally, any of the offset timespans (H for hourly, D for daily, etc.) 
        can be used here; however, care should be taken by the user to remember what that timespan is when
        writing tests, as comparing 7 from 7D to 1 from 1W may not behave as expected.
    """
    if isinstance(offset, str):
        return int(offset[0])
    else:
        return offset


class PrimitiveTestBase:
    primitive = None

    @pytest.fixture()
    def es(self):
        es = make_ecommerce_entityset()
        return es

    def test_name_and_desc(self):
        assert self.primitive.name is not None
        assert self.primitive.__doc__ is not None
        docstring = self.primitive.__doc__
        short_description = docstring.splitlines()[0]
        first_word = short_description.split(" ", 1)[0]
        valid_verbs = [
            "Calculates",
            "Determines",
            "Transforms",
            "Computes",
            "Shifts",
            "Extracts",
            "Applies",
        ]
        assert any(s in first_word for s in valid_verbs)
        assert self.primitive.input_types is not None

    def test_name_in_primitive_list(self):
        assert PRIMITIVES.name.eq(self.primitive.name).any()

    def test_arg_init(self):
        primitive_ = self.primitive()
        # determine the optional arguments in the __init__
        init_params = signature(self.primitive.__init__)
        for name, parameter in init_params.parameters.items():
            if parameter.default is not parameter.empty:
                assert hasattr(primitive_, name)

    def test_serialize(self, es, target_dataframe_name="log"):
        # Pass the argument through instead of hard-coding "log", so that an
        # overridden default is respected.
        check_serialize(
            primitive=self.primitive,
            es=es,
            target_dataframe_name=target_dataframe_name,
        )


def check_serialize(primitive, es, target_dataframe_name="log"):
    trans_primitives = []
    agg_primitives = []
    if issubclass(primitive, AggregationPrimitive):
        agg_primitives = [primitive]
    else:
        trans_primitives = [primitive]
    features = dfs(
        entityset=es,
        target_dataframe_name=target_dataframe_name,
        agg_primitives=agg_primitives,
        trans_primitives=trans_primitives,
        max_features=-1,
        max_depth=3,
        features_only=True,
        return_types="all",
    )

    feat_to_serialize = None
    for feature in features:
        if feature.primitive.__class__ == primitive:
            feat_to_serialize = feature
            break
        for base_feature in feature.get_dependencies(deep=True):
            if base_feature.primitive.__class__ == primitive:
                feat_to_serialize = base_feature
                break

    assert feat_to_serialize is not None

    # Skip calculating feature matrix for long running primitives
    skip_primitives = ["elmo"]

    if primitive.name not in skip_primitives:
        df1 = calculate_feature_matrix([feat_to_serialize], entityset=es)

    new_feat = load_features(save_features([feat_to_serialize]))[0]
    assert isinstance(new_feat, FeatureBase)

    if primitive.name not in skip_primitives:
        df2 = calculate_feature_matrix([new_feat], entityset=es)
        assert df1.equals(df2)


def find_applicable_primitives(primitive):
    from featuretools.primitives.utils import (
        get_aggregation_primitives,
        get_transform_primitives,
    )

    all_transform_primitives = list(get_transform_primitives().values())
    all_aggregation_primitives = list(get_aggregation_primitives().values())
    applicable_transforms = find_stackable_primitives(
        all_transform_primitives,
        primitive,
    )
    applicable_aggregations = find_stackable_primitives(
        all_aggregation_primitives,
        primitive,
    )
    return applicable_transforms, applicable_aggregations


def find_stackable_primitives(all_primitives, primitive):
    applicable_primitives = []
    for x in all_primitives:
        if x.input_types == [primitive.return_type]:
            applicable_primitives.append(x)
    return applicable_primitives


def valid_dfs(
    es,
    aggregations,
    transforms,
    feature_substrings,
    target_dataframe_name="log",
    multi_output=False,
    max_depth=3,
    max_features=-1,
    instance_ids=[0, 1, 2, 3],
):
    if not isinstance(feature_substrings, list):
        feature_substrings = [feature_substrings]

    if any([issubclass(x, PrimitiveBase) for x in feature_substrings]):
        feature_substrings = [x.name.upper() for x in feature_substrings]

    features = dfs(
        entityset=es,
        target_dataframe_name=target_dataframe_name,
        agg_primitives=aggregations,
        trans_primitives=transforms,
        max_features=max_features,
        max_depth=max_depth,
        features_only=True,
    )

    applicable_features = []
    for feat in features:
        applicable_features += [
            feat for x in feature_substrings if x in feat.get_name()
        ]
    if len(applicable_features) == 0:
        raise ValueError(
            "No feature names with %s, verify the name attribute \
                          is defined and/or generate_name() is defined to \
                          return %s "
            % (feature_substrings, feature_substrings),
        )
    df = calculate_feature_matrix(
        entityset=es,
        features=applicable_features,
        instance_ids=instance_ids,
        n_jobs=1,
    )

    encode_features(df, applicable_features)

    # TODO: check the multi_output shape by checking
    # feature.number_output_features for each feature
    # and comparing it with the matrix shape
    if not multi_output:
        assert len(applicable_features) == df.shape[1]
    return


================================================
FILE: featuretools/tests/profiling/__init__.py
================================================


================================================
FILE: featuretools/tests/profiling/dfs_profile.py
================================================
"""
dfs_profile.py

Helper module to allow profiling of the dfs operations. At some point we may
want to use pstats to output the results to a log, but I'm anticipating that
LookingGlass will provide the performance data we want.

Notes:
    - output currently goes to the root directory and is in dfs_profile.stats
    - *.stats is gitignored
    - it uses the demo retail dataset (target dataframe: customers) for testing
    - max_depth > 2 is very slow (currently)
    - stats output can be viewed online with https://nejc.saje.info/pstats-viewer.html
"""

import cProfile
from pathlib import Path

import featuretools as ft
import featuretools.demo as demo
from featuretools.synthesis.dfs import dfs

es = demo.load_retail()

all_aggs = ft.primitives.get_aggregation_primitives()
all_trans = ft.primitives.get_transform_primitives()

profiler = cProfile.Profile(builtins=False)
profiler.enable()
feature_defs = dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=all_trans,
    agg_primitives=all_aggs,
    max_depth=2,
    features_only=True,
)
profiler.disable()
profiler.dump_stats(Path.cwd() / "dfs_profile.stats")


================================================
FILE: featuretools/tests/requirement_files/latest_requirements.txt
================================================
cloudpickle==3.0.0
dask==2024.6.2
dask-expr==1.1.6
distributed==2024.6.2
holidays==0.51
numpy==1.26.4
pandas==2.2.2
psutil==6.0.0
scipy==1.13.1
tqdm==4.66.4
woodwork==0.31.0


================================================
FILE: featuretools/tests/requirement_files/minimum_core_requirements.txt
================================================
cloudpickle==1.5.0
holidays==0.17
numpy==1.25.0
packaging==20.0
pandas==2.0.0
psutil==5.7.0
scipy==1.10.0
tqdm==4.66.3
woodwork==0.28.0


================================================
FILE: featuretools/tests/requirement_files/minimum_dask_requirements.txt
================================================
cloudpickle==1.5.0
dask[dataframe]==2023.2.0
distributed==2023.2.0
holidays==0.17
numpy==1.25.0
packaging==20.0
pandas==2.0.0
psutil==5.7.0
scipy==1.10.0
tqdm==4.66.3
woodwork==0.28.0


================================================
FILE: featuretools/tests/requirement_files/minimum_test_requirements.txt
================================================ boto3==1.34.32 cloudpickle==1.5.0 composeml==0.8.0 graphviz==0.8.4 holidays==0.17 moto[all]==5.0.0 numpy==1.25.0 packaging==20.0 pandas==2.0.0 pip==23.3.0 psutil==5.7.0 pyarrow==14.0.1 pympler==0.8 pytest-cov==3.0.0 pytest-timeout==2.1.0 pytest-xdist==2.5.0 pytest==7.1.2 scipy==1.10.0 smart-open==5.0.0 tqdm==4.66.3 urllib3==1.26.18 woodwork==0.28.0 ================================================ FILE: featuretools/tests/selection/__init__.py ================================================ ================================================ FILE: featuretools/tests/selection/test_selection.py ================================================ import numpy as np import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Boolean, BooleanNullable, NaturalLanguage from featuretools import EntitySet, Feature, dfs from featuretools.selection import ( remove_highly_correlated_features, remove_highly_null_features, remove_low_information_features, remove_single_value_features, ) from featuretools.tests.testing_utils import make_ecommerce_entityset @pytest.fixture def feature_matrix(): feature_matrix = pd.DataFrame( { "test": [0, 1, 2], "no_null": [np.nan, 0, 0], "some_null": [np.nan, 0, 0], "all_null": [np.nan, np.nan, np.nan], "many_value": [1, 2, 3], "dup_value": [1, 1, 2], "one_value": [1, 1, 1], }, ) return feature_matrix @pytest.fixture def test_es(es, feature_matrix): es.add_dataframe(dataframe_name="test", dataframe=feature_matrix, index="test") return es def test_remove_low_information_feature_names(feature_matrix): feature_matrix = remove_low_information_features(feature_matrix) assert feature_matrix.shape == (3, 5) assert "one_value" not in feature_matrix.columns assert "all_null" not in feature_matrix.columns def test_remove_low_information_features(test_es, feature_matrix): features = [Feature(test_es["test"].ww[col]) for col in test_es["test"].columns] 
feature_matrix, features = remove_low_information_features(feature_matrix, features) assert feature_matrix.shape == (3, 5) assert len(features) == 5 for f in features: assert f.get_name() in feature_matrix.columns assert "one_value" not in feature_matrix.columns assert "all_null" not in feature_matrix.columns def test_remove_highly_null_features(): nulls_df = pd.DataFrame( { "id": [0, 1, 2, 3], "half_nulls": [None, None, 88, 99], "all_nulls": [None, None, None, None], "quarter": ["a", "b", None, "c"], "vals": [True, True, False, False], }, ) es = EntitySet("data", {"nulls": (nulls_df, "id")}) es["nulls"].ww.set_types( logical_types={"all_nulls": "categorical", "quarter": "categorical"}, ) fm, features = dfs( entityset=es, target_dataframe_name="nulls", trans_primitives=["is_null"], max_depth=1, ) with pytest.raises( ValueError, match="pct_null_threshold must be a float between 0 and 1, inclusive.", ): remove_highly_null_features(fm, pct_null_threshold=1.1) with pytest.raises( ValueError, match="pct_null_threshold must be a float between 0 and 1, inclusive.", ): remove_highly_null_features(fm, pct_null_threshold=-0.1) no_thresh = remove_highly_null_features(fm) no_thresh_cols = set(no_thresh.columns) diff = set(fm.columns) - no_thresh_cols assert len(diff) == 1 assert "all_nulls" not in no_thresh_cols half = remove_highly_null_features(fm, pct_null_threshold=0.5) half_cols = set(half.columns) diff = set(fm.columns) - half_cols assert len(diff) == 2 assert "all_nulls" not in half_cols assert "half_nulls" not in half_cols no_tolerance = remove_highly_null_features(fm, pct_null_threshold=0) no_tolerance_cols = set(no_tolerance.columns) diff = set(fm.columns) - no_tolerance_cols assert len(diff) == 3 assert "all_nulls" not in no_tolerance_cols assert "half_nulls" not in no_tolerance_cols assert "quarter" not in no_tolerance_cols ( with_features_param, with_features_param_features, ) = remove_highly_null_features(fm, features) assert len(with_features_param_features) == 
len(no_thresh.columns) for i in range(len(with_features_param_features)): assert with_features_param_features[i].get_name() == no_thresh.columns[i] assert with_features_param.columns[i] == no_thresh.columns[i] def test_remove_single_value_features(): same_vals_df = pd.DataFrame( { "id": [0, 1, 2, 3], "all_numeric": [88, 88, 88, 88], "with_nan": [1, 1, None, 1], "all_nulls": [None, None, None, None], "all_categorical": ["a", "a", "a", "a"], "all_bools": [True, True, True, True], "diff_vals": ["hi", "bye", "bye", "hi"], }, ) es = EntitySet("data", {"single_vals": (same_vals_df, "id")}) es["single_vals"].ww.set_types( logical_types={ "all_nulls": "categorical", "all_categorical": "categorical", "diff_vals": "categorical", }, ) fm, features = dfs( entityset=es, target_dataframe_name="single_vals", trans_primitives=["is_null"], max_depth=1, ) no_params, no_params_features = remove_single_value_features(fm, features) no_params_cols = set(no_params.columns) assert len(no_params_features) == 2 assert "IS_NULL(with_nan)" in no_params_cols assert "diff_vals" in no_params_cols nan_as_value, nan_as_value_features = remove_single_value_features( fm, features, count_nan_as_value=True, ) nan_cols = set(nan_as_value.columns) assert len(nan_as_value_features) == 3 assert "IS_NULL(with_nan)" in nan_cols assert "diff_vals" in nan_cols assert "with_nan" in nan_cols without_features_param = remove_single_value_features(fm) assert len(no_params.columns) == len(without_features_param.columns) for i in range(len(no_params.columns)): assert no_params.columns[i] == without_features_param.columns[i] assert no_params_features[i].get_name() == without_features_param.columns[i] def test_remove_highly_correlated_features(): correlated_df = pd.DataFrame( { "id": [0, 1, 2, 3], "diff_ints": [34, 11, 29, 91], "words": ["test", "this is a short sentence", "foo bar", "baz"], "corr_words": [4, 24, 7, 3], "corr_1": [99, 88, 77, 33], "corr_2": [99, 88, 77, 33], }, ) es = EntitySet( "data", {"correlated": 
(correlated_df, "id", None, {"words": NaturalLanguage})}, ) fm, _ = dfs( entityset=es, target_dataframe_name="correlated", trans_primitives=["num_characters"], max_depth=1, ) with pytest.raises( ValueError, match="pct_corr_threshold must be a float between 0 and 1, inclusive.", ): remove_highly_correlated_features(fm, pct_corr_threshold=1.1) with pytest.raises( ValueError, match="pct_corr_threshold must be a float between 0 and 1, inclusive.", ): remove_highly_correlated_features(fm, pct_corr_threshold=-0.1) with pytest.raises( AssertionError, match="feature named not_a_feature is not in feature matrix", ): remove_highly_correlated_features(fm, features_to_check=["not_a_feature"]) to_check = remove_highly_correlated_features( fm, features_to_check=["corr_words", "NUM_CHARACTERS(words)", "diff_ints"], ) to_check_columns = set(to_check.columns) assert len(to_check_columns) == 4 assert "NUM_CHARACTERS(words)" not in to_check_columns assert "corr_1" in to_check_columns assert "corr_2" in to_check_columns to_keep = remove_highly_correlated_features( fm, features_to_keep=["NUM_CHARACTERS(words)"], ) to_keep_names = set(to_keep.columns) assert len(to_keep_names) == 4 assert "corr_words" in to_keep_names assert "NUM_CHARACTERS(words)" in to_keep_names assert "corr_2" not in to_keep_names new_fm = remove_highly_correlated_features(fm) assert len(new_fm.columns) == 3 assert "corr_2" not in new_fm.columns assert "NUM_CHARACTERS(words)" not in new_fm.columns diff_threshold = remove_highly_correlated_features(fm, pct_corr_threshold=0.8) diff_threshold_cols = diff_threshold.columns assert len(diff_threshold_cols) == 2 assert "corr_words" in diff_threshold_cols assert "diff_ints" in diff_threshold_cols def test_remove_highly_correlated_features_init_woodwork(): correlated_df = pd.DataFrame( { "id": [0, 1, 2, 3], "diff_ints": [34, 11, 29, 91], "words": ["test", "this is a short sentence", "foo bar", "baz"], "corr_words": [4, 24, 7, 3], "corr_1": [99, 88, 77, 33], "corr_2": [99, 
88, 77, 33], }, ) es = EntitySet( "data", {"correlated": (correlated_df, "id", None, {"words": NaturalLanguage})}, ) fm, _ = dfs( entityset=es, target_dataframe_name="correlated", trans_primitives=["num_characters"], max_depth=1, ) no_ww_fm = fm.copy() ww_fm = fm.copy() ww_fm.ww.init() new_no_ww_fm = remove_highly_correlated_features(no_ww_fm) new_ww_fm = remove_highly_correlated_features(ww_fm) pd.testing.assert_frame_equal(new_no_ww_fm, new_ww_fm) def test_multi_output_selection(): df1 = pd.DataFrame({"id": [0, 1, 2, 3]}) df2 = pd.DataFrame( { "index": [0, 1, 2, 3], "first_id": [0, 1, 1, 3], "all_nulls": [None, None, None, None], "quarter": ["a", "b", None, "c"], }, ) dataframes = { "first": (df1, "id"), "second": (df2, "index"), } relationships = [("first", "id", "second", "first_id")] es = EntitySet("data", dataframes, relationships=relationships) es["second"].ww.set_types( logical_types={"all_nulls": "categorical", "quarter": "categorical"}, ) fm, features = dfs( entityset=es, target_dataframe_name="first", trans_primitives=[], agg_primitives=["n_most_common"], max_depth=1, ) multi_output, multi_output_features = remove_single_value_features(fm, features) assert multi_output.columns == ["N_MOST_COMMON(second.quarter)[0]"] assert len(multi_output_features) == 1 assert multi_output_features[0].get_name() == multi_output.columns[0] es = make_ecommerce_entityset() fm, features = dfs( entityset=es, target_dataframe_name="régions", trans_primitives=[], agg_primitives=["n_most_common"], max_depth=2, ) matrix_with_slices, unsliced_features = remove_highly_null_features(fm, features) assert len(matrix_with_slices.columns) == 18 assert len(unsliced_features) == 14 matrix_columns = set(matrix_with_slices.columns) for f in unsliced_features: for f_name in f.get_feature_names(): assert f_name in matrix_columns def test_remove_highly_correlated_features_on_boolean_cols(): correlated_df = pd.DataFrame( { "id": [0, 1, 2, 3], "diff_ints": [34, 11, 29, 91], "corr_words": [4, 
24, 7, 3], "bools": [True, True, False, True], }, ) es = EntitySet( "data", {"correlated": (correlated_df, "id", None, {"bools": Boolean})}, ) feature_matrix, features = dfs( entityset=es, target_dataframe_name="correlated", trans_primitives=["equal"], agg_primitives=[], max_depth=1, return_types=[ ColumnSchema(logical_type=BooleanNullable), ColumnSchema(logical_type=Boolean), ], ) # Confirm both boolean logical types are included so that we know we're checking the correct types assert { ltype.type_string for ltype in feature_matrix.ww.logical_types.values() } == {Boolean.type_string, BooleanNullable.type_string} to_keep = remove_highly_correlated_features( feature_matrix=feature_matrix, features=features, pct_corr_threshold=0.3, ) assert len(to_keep[0].columns) < len(feature_matrix.columns) ================================================ FILE: featuretools/tests/synthesis/__init__.py ================================================ ================================================ FILE: featuretools/tests/synthesis/test_deep_feature_synthesis.py ================================================ import copy import re import pandas as pd import pytest from woodwork.column_schema import ColumnSchema from woodwork.logical_types import Datetime from featuretools import EntitySet, Feature, GroupByTransformFeature from featuretools.entityset.entityset import LTI_COLUMN_NAME from featuretools.feature_base import ( AggregationFeature, DirectFeature, IdentityFeature, TransformFeature, ) from featuretools.feature_base.utils import is_valid_input from featuretools.primitives import ( Absolute, AddNumeric, Count, CumCount, CumMean, CumMin, CumSum, Day, Diff, Equal, Hour, IsIn, IsNull, Last, Mean, Mode, Month, Negate, NMostCommon, Not, NotEqual, NumCharacters, NumTrue, NumUnique, RollingCount, RollingMax, RollingMean, RollingMin, RollingOutlierCount, RollingSTD, Sum, TimeSincePrevious, TransformPrimitive, Trend, Year, ) from featuretools.synthesis import DeepFeatureSynthesis 
from featuretools.tests.testing_utils import ( feature_with_name, make_ecommerce_entityset, number_of_features_with_name_like, ) def test_makes_agg_features_from_str(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=["sum"], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "SUM(log.value)") def test_makes_agg_features_from_mixed_str(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Count, "sum"], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "SUM(log.value)") assert feature_with_name(features, "COUNT(log)") def test_makes_agg_features(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "SUM(log.value)") def test_only_makes_supplied_agg_feat(es): kwargs = dict( target_dataframe_name="customers", entityset=es, max_depth=3, ) dfs_obj = DeepFeatureSynthesis(agg_primitives=[Sum], **kwargs) features = dfs_obj.build_features() def find_other_agg_features(features): return [ f for f in features if (isinstance(f, AggregationFeature) and not isinstance(f.primitive, Sum)) or len( [ g for g in f.base_features if isinstance(g, AggregationFeature) and not isinstance(g.primitive, Sum) ], ) > 0 ] other_agg_features = find_other_agg_features(features) assert len(other_agg_features) == 0 def test_error_for_missing_target_dataframe(es): error_text = ( "Provided target dataframe missing_dataframe does not exist in ecommerce" ) with pytest.raises(KeyError, match=error_text): DeepFeatureSynthesis( target_dataframe_name="missing_dataframe", entityset=es, agg_primitives=[Last], trans_primitives=[], ignore_dataframes=["log"], ) es_without_id = EntitySet() error_text = ( "Provided target dataframe missing_dataframe does not exist in entity set" ) with 
pytest.raises(KeyError, match=error_text): DeepFeatureSynthesis( target_dataframe_name="missing_dataframe", entityset=es_without_id, agg_primitives=[Last], trans_primitives=[], ignore_dataframes=["log"], ) def test_ignores_dataframes(es): error_text = "ignore_dataframes must be a list" with pytest.raises(TypeError, match=error_text): DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], ignore_dataframes="log", ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], ignore_dataframes=["log"], ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) dataframes = [d.dataframe_name for d in deps] assert "log" not in dataframes def test_ignores_columns(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], ignore_columns={"log": ["value"]}, ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) identities = [d for d in deps if isinstance(d, IdentityFeature)] columns = [d.column_name for d in identities if d.dataframe_name == "log"] assert "value" not in columns def test_ignore_columns_input_type(es): error_msg = r"ignore_columns should be dict\[str -> list\]" # need to use string literals to avoid regex params wrong_input_type = {"log": "value"} with pytest.raises(TypeError, match=error_msg): DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, ignore_columns=wrong_input_type, ) def test_ignore_columns_with_nonstring_values(es): error_msg = "list in ignore_columns must only have string values" wrong_input_list = {"log": ["a", "b", 3]} with pytest.raises(TypeError, match=error_msg): DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, ignore_columns=wrong_input_list, ) def test_ignore_columns_with_nonstring_keys(es): error_msg = r"ignore_columns should be dict\[str -> 
list\]" # need to use string literals to avoid regex params wrong_input_keys = {1: ["a", "b", "c"]} with pytest.raises(TypeError, match=error_msg): DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, ignore_columns=wrong_input_keys, ) def test_makes_dfeatures(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "customers.age") def test_makes_trans_feat(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[Hour], ) features = dfs_obj.build_features() assert feature_with_name(features, "HOUR(datetime)") def test_handles_diff_dataframe_groupby(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], groupby_trans_primitives=[Diff], ) features = dfs_obj.build_features() assert feature_with_name(features, "DIFF(value) by session_id") assert feature_with_name(features, "DIFF(value) by product_id") def test_handles_time_since_previous_dataframe_groupby(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], groupby_trans_primitives=[TimeSincePrevious], ) features = dfs_obj.build_features() assert feature_with_name(features, "TIME_SINCE_PREVIOUS(datetime) by session_id") # M TODO # def test_handles_cumsum_dataframe_groupby(es): # dfs_obj = DeepFeatureSynthesis(target_dataframe_name='sessions', # entityset=es, # agg_primitives=[], # trans_primitives=[CumMean]) # features = dfs_obj.build_features() # assert (feature_with_name(features, u'customers.CUM_MEAN(age by région_id)')) def test_only_makes_supplied_trans_feat(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[Hour], ) features = dfs_obj.build_features() other_trans_features = [ f for f in features if (isinstance(f, TransformFeature) and not 
isinstance(f.primitive, Hour)) or len( [ g for g in f.base_features if isinstance(g, TransformFeature) and not isinstance(g.primitive, Hour) ], ) > 0 ] assert len(other_trans_features) == 0 def test_makes_dfeatures_of_agg_primitives(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=["max"], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "customers.MAX(log.value)") def test_makes_agg_features_of_trans_primitives(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Mean], trans_primitives=[NumCharacters], ) features = dfs_obj.build_features() assert feature_with_name(features, "MEAN(log.NUM_CHARACTERS(comments))") def test_makes_agg_features_with_where(es): es.add_interesting_values() dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Count], where_primitives=[Count], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "COUNT(log WHERE priority_level = 0)") # make sure they are made using direct features too assert feature_with_name(features, "COUNT(log WHERE products.department = food)") def test_make_groupby_features(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[], groupby_trans_primitives=["cum_sum"], ) features = dfs_obj.build_features() assert feature_with_name(features, "CUM_SUM(value) by session_id") def test_make_indirect_groupby_features(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[], groupby_trans_primitives=["cum_sum"], ) features = dfs_obj.build_features() assert feature_with_name(features, "CUM_SUM(products.rating) by session_id") def test_make_groupby_features_with_id(es): # Need to convert customer_id to categorical column in order to build desired feature es["sessions"].ww.set_types( 
logical_types={"customer_id": "Categorical"}, semantic_tags={"customer_id": "foreign_key"}, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[], trans_primitives=[], groupby_trans_primitives=["cum_count"], ) features = dfs_obj.build_features() assert feature_with_name(features, "CUM_COUNT(customer_id) by customer_id") def test_make_groupby_features_with_diff_id(es): # Need to convert cohort to categorical column in order to build desired feature es["customers"].ww.set_types( logical_types={"cohort": "Categorical"}, semantic_tags={"cohort": "foreign_key"}, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[], trans_primitives=[], groupby_trans_primitives=["cum_count"], ) features = dfs_obj.build_features() groupby_with_diff_id = "CUM_COUNT(cohort) by région_id" assert feature_with_name(features, groupby_with_diff_id) def test_make_groupby_features_with_agg(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, agg_primitives=["sum"], trans_primitives=[], groupby_trans_primitives=["cum_sum"], ) features = dfs_obj.build_features() agg_on_groupby_name = "SUM(customers.CUM_SUM(age) by région_id)" assert feature_with_name(features, agg_on_groupby_name) def test_bad_groupby_feature(es): msg = re.escape( "Unknown groupby transform primitive max. 
" "Call ft.primitives.list_primitives() to get " "a list of available primitives", ) with pytest.raises(ValueError, match=msg): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["sum"], trans_primitives=[], groupby_trans_primitives=["Max"], ) @pytest.mark.parametrize( "rolling_primitive", [ RollingMax, RollingMean, RollingMin, RollingOutlierCount, RollingSTD, ], ) @pytest.mark.parametrize( "window_length, gap", [ (7, 3), ("7d", "3d"), ], ) def test_make_rolling_features(window_length, gap, rolling_primitive, es): rolling_primitive_obj = rolling_primitive( window_length=window_length, gap=gap, min_periods=5, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[rolling_primitive_obj], ) features = dfs_obj.build_features() rolling_transform_name = f"{rolling_primitive.name.upper()}(datetime, value_many_nans, window_length={window_length}, gap={gap}, min_periods=5)" assert feature_with_name(features, rolling_transform_name) @pytest.mark.parametrize( "window_length, gap", [ (7, 3), ("7d", "3d"), ], ) def test_make_rolling_count_off_datetime_feature(window_length, gap, es): rolling_count = RollingCount(window_length=window_length, min_periods=gap) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[rolling_count], ) features = dfs_obj.build_features() rolling_transform_name = ( f"ROLLING_COUNT(datetime, window_length={window_length}, min_periods={gap})" ) assert feature_with_name(features, rolling_transform_name) def test_abides_by_max_depth_param(es): for i in [0, 1, 2, 3]: dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], max_depth=i, ) features = dfs_obj.build_features() for f in features: assert f.get_depth() <= i def test_max_depth_single_table(transform_es): assert len(transform_es.dataframe_dict) == 1 def make_dfs_obj(max_depth): dfs_obj 
= DeepFeatureSynthesis( target_dataframe_name="first", entityset=transform_es, trans_primitives=[AddNumeric], max_depth=max_depth, ) return dfs_obj for i in [-1, 0, 1, 2]: if i in [-1, 2]: match = ( "Only one dataframe in entityset, changing max_depth to 1 " "since deeper features cannot be created" ) with pytest.warns(UserWarning, match=match): dfs_obj = make_dfs_obj(i) else: dfs_obj = make_dfs_obj(i) features = dfs_obj.build_features() assert len(features) > 0 if i != 0: # at least one depth 1 feature made assert any([f.get_depth() == 1 for f in features]) # no depth 2 or higher even with max_depth=2 assert all([f.get_depth() <= 1 for f in features]) else: # no depth 1 or higher features with max_depth=0 assert all([f.get_depth() == 0 for f in features]) def test_drop_contains(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], max_depth=1, seed_features=[], drop_contains=[], ) features = dfs_obj.build_features() to_drop = features[2] partial_name = to_drop.get_name()[:5] dfs_drop = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], max_depth=1, seed_features=[], drop_contains=[partial_name], ) features = dfs_drop.build_features() assert to_drop.get_name() not in [f.get_name() for f in features] def test_drop_exact(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], max_depth=1, seed_features=[], drop_exact=[], ) features = dfs_obj.build_features() to_drop = features[2] name = to_drop.get_name() dfs_drop = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Sum], trans_primitives=[], max_depth=1, seed_features=[], drop_exact=[name], ) features = dfs_drop.build_features() assert name not in [f.get_name() for f in features] def test_seed_features(es): seed_feature_sessions = ( Feature(es["log"].ww["id"], 
parent_dataframe_name="sessions", primitive=Count) > 2 ) seed_feature_log = Feature(es["log"].ww["comments"], primitive=NumCharacters) session_agg = Feature( seed_feature_log, parent_dataframe_name="sessions", primitive=Mean, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Mean], trans_primitives=[], max_depth=2, seed_features=[seed_feature_sessions, seed_feature_log], ) features = dfs_obj.build_features() assert seed_feature_sessions.get_name() in [f.get_name() for f in features] assert session_agg.get_name() in [f.get_name() for f in features] def test_does_not_make_agg_of_direct_of_target_dataframe(es): count_sessions = Feature( es["sessions"].ww["id"], parent_dataframe_name="customers", primitive=Count, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[Last], trans_primitives=[], max_depth=2, seed_features=[count_sessions], ) features = dfs_obj.build_features() # this feature is meaningless because customers.COUNT(sessions) is already defined on # the customers dataframe assert not feature_with_name(features, "LAST(sessions.customers.COUNT(sessions))") assert not feature_with_name(features, "LAST(sessions.customers.age)") def test_dfs_builds_on_seed_features_more_than_max_depth(es): seed_feature_sessions = Feature( es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count, ) seed_feature_log = Feature(es["log"].ww["datetime"], primitive=Hour) session_agg = Feature( seed_feature_log, parent_dataframe_name="sessions", primitive=Last, ) # Depth of this feat is 2 relative to session_agg, the seed feature, # which is greater than max_depth so it shouldn't be built session_agg_trans = DirectFeature( Feature(session_agg, parent_dataframe_name="customers", primitive=Mode), "sessions", ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Last, Count], trans_primitives=[], max_depth=1, 
seed_features=[seed_feature_sessions, seed_feature_log], ) features = dfs_obj.build_features() assert seed_feature_sessions.get_name() in [f.get_name() for f in features] assert session_agg.get_name() in [f.get_name() for f in features] assert session_agg_trans.get_name() not in [f.get_name() for f in features] def test_dfs_includes_seed_features_greater_than_max_depth(es): session_agg = Feature( es["log"].ww["value"], parent_dataframe_name="sessions", primitive=Sum, ) customer_agg = Feature( session_agg, parent_dataframe_name="customers", primitive=Mean, ) assert customer_agg.get_depth() == 2 dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[Mean], trans_primitives=[], max_depth=1, seed_features=[customer_agg], ) features = dfs_obj.build_features() assert feature_with_name(features=features, name=customer_agg.get_name()) def test_allowed_paths(es): kwargs = dict( target_dataframe_name="customers", entityset=es, agg_primitives=[Last], trans_primitives=[], max_depth=2, seed_features=[], ) dfs_unconstrained = DeepFeatureSynthesis(**kwargs) features_unconstrained = dfs_unconstrained.build_features() unconstrained_names = [f.get_name() for f in features_unconstrained] customers_session_feat = Feature( es["sessions"].ww["device_type"], parent_dataframe_name="customers", primitive=Last, ) customers_session_log_feat = Feature( es["log"].ww["value"], parent_dataframe_name="customers", primitive=Last, ) assert customers_session_feat.get_name() in unconstrained_names assert customers_session_log_feat.get_name() in unconstrained_names dfs_constrained = DeepFeatureSynthesis( allowed_paths=[["customers", "sessions"]], **kwargs ) features = dfs_constrained.build_features() names = [f.get_name() for f in features] assert customers_session_feat.get_name() in names assert customers_session_log_feat.get_name() not in names def test_max_features(es): kwargs = dict( target_dataframe_name="customers", entityset=es, agg_primitives=[Sum], 
trans_primitives=[], max_depth=2, seed_features=[], ) dfs_unconstrained = DeepFeatureSynthesis(**kwargs) features_unconstrained = dfs_unconstrained.build_features() dfs_unconstrained_with_arg = DeepFeatureSynthesis(max_features=-1, **kwargs) feats_unconstrained_with_arg = dfs_unconstrained_with_arg.build_features() dfs_constrained = DeepFeatureSynthesis(max_features=1, **kwargs) features = dfs_constrained.build_features() assert len(features_unconstrained) == len(feats_unconstrained_with_arg) assert len(features) == 1 def test_where_primitives(es): es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]}) kwargs = dict( target_dataframe_name="customers", entityset=es, agg_primitives=[Count, Sum], trans_primitives=[Absolute], max_depth=3, ) dfs_unconstrained = DeepFeatureSynthesis(**kwargs) dfs_constrained = DeepFeatureSynthesis(where_primitives=["sum"], **kwargs) features_unconstrained = dfs_unconstrained.build_features() features = dfs_constrained.build_features() where_feats_unconstrained = [ f for f in features_unconstrained if isinstance(f, AggregationFeature) and f.where is not None ] where_feats = [ f for f in features if isinstance(f, AggregationFeature) and f.where is not None ] assert len(where_feats_unconstrained) >= 1 assert ( len([f for f in where_feats_unconstrained if isinstance(f.primitive, Sum)]) == 0 ) assert ( len([f for f in where_feats_unconstrained if isinstance(f.primitive, Count)]) > 0 ) assert len([f for f in where_feats if isinstance(f.primitive, Sum)]) > 0 assert len([f for f in where_feats if isinstance(f.primitive, Count)]) == 0 assert ( len( [ d for f in where_feats for d in f.get_dependencies(deep=True) if isinstance(d.primitive, Absolute) ], ) > 0 ) def test_stacking_where_primitives(es): es = copy.deepcopy(es) es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]}) es.add_interesting_values( dataframe_name="log", values={"product_id": ["coke_zero"]}, ) kwargs = dict( 
target_dataframe_name="customers", entityset=es, agg_primitives=[Count, Last], max_depth=3, ) dfs_where_stack_limit_1 = DeepFeatureSynthesis( where_primitives=["last", Count], **kwargs ) dfs_where_stack_limit_2 = DeepFeatureSynthesis( where_primitives=["last", Count], where_stacking_limit=2, **kwargs ) stack_limit_1_features = dfs_where_stack_limit_1.build_features() stack_limit_2_features = dfs_where_stack_limit_2.build_features() where_stack_1_feats = [ f for f in stack_limit_1_features if isinstance(f, AggregationFeature) and f.where is not None ] where_stack_2_feats = [ f for f in stack_limit_2_features if isinstance(f, AggregationFeature) and f.where is not None ] assert len(where_stack_1_feats) >= 1 assert len(where_stack_2_feats) >= 1 assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Last)]) > 0 assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Count)]) > 0 assert len([f for f in where_stack_2_feats if isinstance(f.primitive, Last)]) > 0 assert len([f for f in where_stack_2_feats if isinstance(f.primitive, Count)]) > 0 stacked_where_limit_1_feats = [] stacked_where_limit_2_feats = [] where_double_where_tuples = [ (where_stack_1_feats, stacked_where_limit_1_feats), (where_stack_2_feats, stacked_where_limit_2_feats), ] for where_list, double_where_list in where_double_where_tuples: for feature in where_list: for base_feat in feature.base_features: if ( isinstance(base_feat, AggregationFeature) and base_feat.where is not None ): double_where_list.append(feature) assert len(stacked_where_limit_1_feats) == 0 assert len(stacked_where_limit_2_feats) > 0 def test_where_different_base_feats(es): es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]}) kwargs = dict( target_dataframe_name="customers", entityset=es, agg_primitives=[Sum, Count], where_primitives=[Sum, Count], max_depth=3, ) dfs_unconstrained = DeepFeatureSynthesis(**kwargs) features = dfs_unconstrained.build_features() where_feats = [ 
f.unique_name() for f in features if isinstance(f, AggregationFeature) and f.where is not None ] not_where_feats = [ f.unique_name() for f in features if isinstance(f, AggregationFeature) and f.where is None ] for name in not_where_feats: assert name not in where_feats def test_dfeats_where(es): es.add_interesting_values() dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Count], trans_primitives=[], ) features = dfs_obj.build_features() # test to make sure we build direct features of agg features with where clause assert feature_with_name(features, "customers.COUNT(log WHERE priority_level = 0)") assert feature_with_name( features, "COUNT(log WHERE products.department = electronics)", ) def test_commutative(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[Sum], trans_primitives=[AddNumeric], max_depth=3, ) feats = dfs_obj.build_features() add_feats = [f for f in feats if isinstance(f.primitive, AddNumeric)] # Check that there are no two AddNumeric features with the same base # features. 
unordered_args = set() for f in add_feats: arg1, arg2 = f.base_features args_set = frozenset({arg1.unique_name(), arg2.unique_name()}) unordered_args.add(args_set) assert len(add_feats) == len(unordered_args) def test_transform_consistency(transform_es): # Generate features transform_es["first"].ww.set_types( logical_types={"b": "BooleanNullable", "b1": "BooleanNullable"}, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="first", entityset=transform_es, trans_primitives=["and", "add_numeric", "or"], max_depth=1, ) feature_defs = dfs_obj.build_features() # Check for correct ordering of features assert feature_with_name(feature_defs, "a") assert feature_with_name(feature_defs, "b") assert feature_with_name(feature_defs, "b1") assert feature_with_name(feature_defs, "b12") assert feature_with_name(feature_defs, "P") assert feature_with_name(feature_defs, "AND(b, b1)") assert not feature_with_name( feature_defs, "AND(b1, b)", ) # make sure it doesn't exist the other way assert feature_with_name(feature_defs, "a + P") assert feature_with_name(feature_defs, "b12 + P") assert feature_with_name(feature_defs, "a + b12") assert feature_with_name(feature_defs, "OR(b, b1)") def test_transform_no_stack_agg(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[NMostCommon], trans_primitives=[NotEqual], max_depth=3, ) feature_defs = dfs_obj.build_features() assert not feature_with_name( feature_defs, "id != N_MOST_COMMON(sessions.device_type)", ) def test_initialized_trans_prim(es): prim = IsIn(list_of_outputs=["coke zero"]) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[prim], ) features = dfs_obj.build_features() assert feature_with_name(features, "product_id.isin(['coke zero'])") def test_initialized_agg_prim(es): ThreeMost = NMostCommon(n=3) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[ThreeMost], 
trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "N_MOST_COMMON(log.subregioncode)") def test_return_types(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Count, NMostCommon], trans_primitives=[Absolute, Hour, IsIn], ) discrete = ColumnSchema(semantic_tags={"category"}) numeric = ColumnSchema(semantic_tags={"numeric"}) datetime = ColumnSchema(logical_type=Datetime) f1 = dfs_obj.build_features(return_types=None) f2 = dfs_obj.build_features(return_types=[discrete]) f3 = dfs_obj.build_features(return_types="all") f4 = dfs_obj.build_features(return_types=[datetime]) f1_types = [f.column_schema for f in f1] f2_types = [f.column_schema for f in f2] f3_types = [f.column_schema for f in f3] f4_types = [f.column_schema for f in f4] assert any([is_valid_input(schema, discrete) for schema in f1_types]) assert any([is_valid_input(schema, numeric) for schema in f1_types]) assert not any([is_valid_input(schema, datetime) for schema in f1_types]) assert any([is_valid_input(schema, discrete) for schema in f2_types]) assert not any([is_valid_input(schema, numeric) for schema in f2_types]) assert not any([is_valid_input(schema, datetime) for schema in f2_types]) assert any([is_valid_input(schema, discrete) for schema in f3_types]) assert any([is_valid_input(schema, numeric) for schema in f3_types]) assert any([is_valid_input(schema, datetime) for schema in f3_types]) assert not any([is_valid_input(schema, discrete) for schema in f4_types]) assert not any([is_valid_input(schema, numeric) for schema in f4_types]) assert any([is_valid_input(schema, datetime) for schema in f4_types]) def test_checks_primitives_correct_type(es): error_text = ( "Primitive in " "agg_primitives is not an aggregation primitive" ) with pytest.raises(ValueError, match=error_text): DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[Hour], trans_primitives=[], ) error_text = ( 
"Primitive in trans_primitives " "is not a transform primitive" ) with pytest.raises(ValueError, match=error_text): DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[], trans_primitives=[Sum], ) def test_makes_agg_features_along_multiple_paths(diamond_es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="regions", entityset=diamond_es, agg_primitives=["mean"], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "MEAN(customers.transactions.amount)") assert feature_with_name(features, "MEAN(stores.transactions.amount)") def test_makes_direct_features_through_multiple_relationships(games_es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="games", entityset=games_es, agg_primitives=["mean"], trans_primitives=[], ) features = dfs_obj.build_features() teams = ["home", "away"] for forward in teams: for backward in teams: for col in teams: f = "teams[%s_team_id].MEAN(games[%s_team_id].%s_team_score)" % ( forward, backward, col, ) assert feature_with_name(features, f) def test_stacks_multioutput_features(es): class TestTime(TransformPrimitive): name = "test_time" input_types = [ColumnSchema(logical_type=Datetime)] return_type = ColumnSchema(semantic_tags={"numeric"}) number_output_features = 6 def get_function(self): def test_f(x): times = pd.Series(x) units = ["year", "month", "day", "hour", "minute", "second"] return [times.apply(lambda x: getattr(x, unit)) for unit in units] return test_f dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[NumUnique, NMostCommon(n=3)], trans_primitives=[TestTime, Diff], max_depth=4, ) feat = dfs_obj.build_features() for i in range(3): f = "NUM_UNIQUE(sessions.N_MOST_COMMON(log.countrycode)[%d])" % i assert feature_with_name(feat, f) def test_seed_multi_output_feature_stacking(es): threecommon = NMostCommon(3) tc = Feature( es["log"].ww["product_id"], parent_dataframe_name="sessions", 
primitive=threecommon, ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, seed_features=[tc], agg_primitives=[NumUnique], trans_primitives=[], max_depth=4, ) feat = dfs_obj.build_features() for i in range(3): f = "NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[%d])" % i assert feature_with_name(feat, f) def test_makes_direct_features_along_multiple_paths(diamond_es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="transactions", entityset=diamond_es, max_depth=3, agg_primitives=[], trans_primitives=[], ) features = dfs_obj.build_features() assert feature_with_name(features, "customers.regions.name") assert feature_with_name(features, "stores.regions.name") def test_does_not_make_trans_of_single_direct_feature(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[], trans_primitives=["weekday"], max_depth=2, ) features = dfs_obj.build_features() assert not feature_with_name(features, "WEEKDAY(customers.signup_date)") assert feature_with_name(features, "customers.WEEKDAY(signup_date)") def test_makes_trans_of_multiple_direct_features(diamond_es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="transactions", entityset=diamond_es, agg_primitives=["mean"], trans_primitives=[Equal], max_depth=4, ) features = dfs_obj.build_features() # Make trans of direct and non-direct assert feature_with_name(features, "amount = stores.MEAN(transactions.amount)") # Make trans of direct features on different dataframes assert feature_with_name( features, "customers.MEAN(transactions.amount) = stores.square_ft", ) # Make trans of direct features on same dataframe with different paths. assert feature_with_name(features, "customers.regions.name = stores.regions.name") # Don't make trans of direct features with same path. 
assert not feature_with_name( features, "stores.square_ft = stores.MEAN(transactions.amount)", ) assert not feature_with_name( features, "stores.MEAN(transactions.amount) = stores.square_ft", ) # The naming of the below is confusing but this is a direct feature of a transform. assert feature_with_name(features, "stores.MEAN(transactions.amount) = square_ft") def test_makes_direct_of_agg_of_trans_on_target(es): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=["mean"], trans_primitives=[Absolute], max_depth=3, ) features = dfs_obj.build_features() assert feature_with_name(features, "sessions.MEAN(log.ABSOLUTE(value))") def test_primitive_options_errors(es): wrong_key_options = {"mean": {"ignore_dataframe": ["sessions"]}} wrong_type_list = {"mean": {"ignore_dataframes": "sessions"}} wrong_type_dict = {"mean": {"ignore_columns": {"sessions": "product_id"}}} conflicting_primitive_options = { ("count", "mean"): {"ignore_dataframes": ["sessions"]}, "mean": {"include_dataframes": ["sessions"]}, } invalid_dataframe = {"mean": {"include_dataframes": ["invalid_dataframe"]}} invalid_column_dataframe = { "mean": {"include_columns": {"invalid_dataframe": ["product_id"]}}, } invalid_column = {"mean": {"include_columns": {"sessions": ["invalid_column"]}}} key_error_text = "Unrecognized primitive option 'ignore_dataframe' for mean" list_error_text = "Incorrect type formatting for 'ignore_dataframes' for mean" dict_error_text = "Incorrect type formatting for 'ignore_columns' for mean" conflicting_error_text = "Multiple options found for primitive mean" invalid_dataframe_warning = "Dataframe 'invalid_dataframe' not in entityset" invalid_column_warning = "Column 'invalid_column' not in dataframe 'sessions'" with pytest.raises(KeyError, match=key_error_text): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=wrong_key_options, ) with pytest.raises(TypeError, 
match=list_error_text): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=wrong_type_list, ) with pytest.raises(TypeError, match=dict_error_text): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=wrong_type_dict, ) with pytest.raises(KeyError, match=conflicting_error_text): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=conflicting_primitive_options, ) with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record: DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=invalid_dataframe, ) assert len(record) == 1 with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record: DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=invalid_column_dataframe, ) assert len(record) == 1 with pytest.warns(UserWarning, match=invalid_column_warning) as record: DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=invalid_column, ) assert len(record) == 1 def test_primitive_options(es): options = { "sum": {"include_columns": {"customers": ["age"]}}, "mean": {"include_dataframes": ["customers"]}, "mode": {"ignore_dataframes": ["sessions"]}, "num_unique": {"ignore_columns": {"customers": ["engagement_level"]}}, } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, primitive_options=options, ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] columns = [d for d in deps if isinstance(d, IdentityFeature)] if isinstance(f.primitive, Sum): for identity_base 
in columns: if identity_base.dataframe_name == "customers": assert identity_base.get_name() == "age" if isinstance(f.primitive, Mean): assert all([df_name in ["customers"] for df_name in df_names]) if isinstance(f.primitive, Mode): assert "sessions" not in df_names if isinstance(f.primitive, NumUnique): for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and identity_base.get_name() == "engagement_level" ) options = { "month": {"ignore_columns": {"customers": ["birthday"]}}, "day": {"include_columns": {"customers": ["signup_date", "upgrade_date"]}}, "num_characters": {"ignore_dataframes": ["customers"]}, "year": {"include_dataframes": ["customers"]}, } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[], ignore_dataframes=["cohort"], primitive_options=options, ) features = dfs_obj.build_features() # Check the primitive, not the feature object itself: a feature is never an instance of NumCharacters. assert not any([isinstance(f.primitive, NumCharacters) for f in features]) for f in features: deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] columns = [d for d in deps if isinstance(d, IdentityFeature)] if isinstance(f.primitive, Month): for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and identity_base.get_name() == "birthday" ) if isinstance(f.primitive, Day): for identity_base in columns: if identity_base.dataframe_name == "customers": assert ( identity_base.get_name() == "signup_date" or identity_base.get_name() == "upgrade_date" ) if isinstance(f.primitive, Year): assert all([df_name in ["customers"] for df_name in df_names]) def test_primitive_options_with_globals(es): # non-overlapping ignore_dataframes options = {"mode": {"ignore_dataframes": ["sessions"]}} dfs_obj = DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, ignore_dataframes=["régions"], primitive_options=options, ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in
deps] assert "régions" not in df_names if isinstance(f.primitive, Mode): assert "sessions" not in df_names # non-overlapping ignore_columns options = {"num_unique": {"ignore_columns": {"customers": ["engagement_level"]}}} dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, ignore_columns={"customers": ["région_id"]}, primitive_options=options, ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) columns = [d for d in deps if isinstance(d, IdentityFeature)] for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and identity_base.get_name() == "région_id" ) if isinstance(f.primitive, NumUnique): for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and identity_base.get_name() == "engagement_level" ) # Overlapping globals/options with ignore_dataframes options = { "mode": { "include_dataframes": ["sessions", "customers"], "ignore_columns": {"customers": ["région_id"]}, }, "num_unique": { "include_dataframes": ["sessions", "customers"], "include_columns": {"sessions": ["device_type"], "customers": ["age"]}, }, "month": {"ignore_columns": {"cohorts": ["cohort_end"]}}, } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, ignore_dataframes=["sessions"], ignore_columns={"customers": ["age"]}, primitive_options=options, ) features = dfs_obj.build_features() for f in features: assert f.primitive.name != "month" # ignoring cohorts means no features are created assert not isinstance(f.primitive, Month) deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] columns = [d for d in deps if isinstance(d, IdentityFeature)] if isinstance(f.primitive, Mode): assert [all([df_name in ["sessions", "customers"] for df_name in df_names])] for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and ( identity_base.get_name() == "age" or identity_base.get_name() == 
"région_id" ) ) elif isinstance(f.primitive, NumUnique): assert [all([df_name in ["sessions", "customers"] for df_name in df_names])] for identity_base in columns: if identity_base.dataframe_name == "sessions": assert identity_base.get_name() == "device_type" # All other primitives ignore 'sessions' and 'age' else: assert "sessions" not in df_names for identity_base in columns: assert not ( identity_base.dataframe_name == "customers" and identity_base.get_name() == "age" ) def test_primitive_options_groupbys(es): options = { "cum_count": {"include_groupby_dataframes": ["log", "customers"]}, "cum_sum": {"ignore_groupby_dataframes": ["sessions"]}, "cum_mean": { "ignore_groupby_columns": { "customers": ["région_id"], "log": ["session_id"], }, }, "cum_min": { "include_groupby_columns": {"sessions": ["customer_id", "device_type"]}, }, } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[], max_depth=3, groupby_trans_primitives=["cum_sum", "cum_count", "cum_min", "cum_mean"], primitive_options=options, ) features = dfs_obj.build_features() for f in features: if isinstance(f, GroupByTransformFeature): deps = f.groupby.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] + [f.groupby.dataframe_name] columns = [d for d in deps if isinstance(d, IdentityFeature)] columns += [f.groupby] if isinstance(f.groupby, IdentityFeature) else [] if isinstance(f.primitive, CumMean): for identity_groupby in columns: assert not ( identity_groupby.dataframe_name == "customers" and identity_groupby.get_name() == "région_id" ) assert not ( identity_groupby.dataframe_name == "log" and identity_groupby.get_name() == "session_id" ) if isinstance(f.primitive, CumCount): assert all([name in ["log", "customers"] for name in df_names]) if isinstance(f.primitive, CumSum): assert "sessions" not in df_names if isinstance(f.primitive, CumMin): for identity_groupby in columns: if identity_groupby.dataframe_name == "sessions": 
assert ( identity_groupby.get_name() == "customer_id" or identity_groupby.get_name() == "device_type" ) def test_primitive_options_multiple_inputs(es): too_many_options = { "mode": [{"include_dataframes": ["logs"]}, {"ignore_dataframes": ["sessions"]}], } error_msg = "Number of options does not match number of inputs for primitive mode" with pytest.raises(AssertionError, match=error_msg): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=["mode"], trans_primitives=[], primitive_options=too_many_options, ) unknown_primitive = Trend() unknown_primitive.name = "unknown_primitive" unknown_primitive_option = { "unknown_primitive": [ {"include_dataframes": ["logs"]}, {"ignore_dataframes": ["sessions"]}, ], } error_msg = "Unknown primitive with name 'unknown_primitive'" with pytest.raises(ValueError, match=error_msg): DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, agg_primitives=[unknown_primitive], trans_primitives=[], primitive_options=unknown_primitive_option, ) options1 = { "trend": [ {"include_dataframes": ["log"], "ignore_columns": {"log": ["value"]}}, {"include_dataframes": ["log"], "include_columns": {"log": ["datetime"]}}, ], } dfs_obj1 = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=["trend"], trans_primitives=[], primitive_options=options1, ) features1 = dfs_obj1.build_features() for f in features1: deps = f.get_dependencies() df_names = [d.dataframe_name for d in deps] columns = [d.get_name() for d in deps] if f.primitive.name == "trend": assert all([df_name in ["log"] for df_name in df_names]) assert "datetime" in columns if len(columns) == 2: assert "value" != columns[0] options2 = { Trend: [ {"include_dataframes": ["log"], "ignore_columns": {"log": ["value"]}}, {"include_dataframes": ["log"], "include_columns": {"log": ["datetime"]}}, ], } dfs_obj2 = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=["trend"], 
trans_primitives=[], primitive_options=options2, ) features2 = dfs_obj2.build_features() assert set(features2) == set(features1) def test_primitive_options_class_names(es): options1 = {"mean": {"include_dataframes": ["customers"]}} options2 = {Mean: {"include_dataframes": ["customers"]}} bad_options = { "mean": {"include_dataframes": ["customers"]}, Mean: {"ignore_dataframes": ["customers"]}, } conflicting_error_text = "Multiple options found for primitive mean" primitives = [["mean"], [Mean]] options = [options1, options2] features = [] for primitive in primitives: with pytest.raises(KeyError, match=conflicting_error_text): DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, agg_primitives=primitive, trans_primitives=[], primitive_options=bad_options, ) for option in options: dfs_obj = DeepFeatureSynthesis( target_dataframe_name="cohorts", entityset=es, agg_primitives=primitive, trans_primitives=[], primitive_options=option, ) features.append(set(dfs_obj.build_features())) for f in features[0]: deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] if isinstance(f.primitive, Mean): assert all(df_name == "customers" for df_name in df_names) assert features[0] == features[1] == features[2] == features[3] def test_primitive_options_instantiated_primitive(es): warning_msg = ( "Options present for primitive instance and generic " "primitive class \\(mean\\), primitive instance will not use generic " "options" ) skipna_mean = Mean(skipna=False) options = { skipna_mean: {"include_dataframes": ["stores"]}, "mean": {"ignore_dataframes": ["stores"]}, } with pytest.warns(UserWarning, match=warning_msg): dfs_obj = DeepFeatureSynthesis( target_dataframe_name="régions", entityset=es, agg_primitives=["mean", skipna_mean], trans_primitives=[], primitive_options=options, ) features = dfs_obj.build_features() for f in features: deps = f.get_dependencies(deep=True) df_names = [d.dataframe_name for d in deps] if f.primitive == skipna_mean: 
assert all(df_name == "stores" for df_name in df_names) elif isinstance(f.primitive, Mean): assert "stores" not in df_names def test_primitive_options_commutative(es): class AddThree(TransformPrimitive): name = "add_three" input_types = [ ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"}), ] return_type = ColumnSchema(semantic_tags={"numeric"}) commutative = True def generate_name(self, base_feature_names): return "%s + %s + %s" % ( base_feature_names[0], base_feature_names[1], base_feature_names[2], ) options = { "add_numeric": [ {"include_columns": {"log": ["value_2"]}}, {"include_columns": {"log": ["value"]}}, ], AddThree: [ {"include_columns": {"log": ["value_2"]}}, {"include_columns": {"log": ["value_many_nans"]}}, {"include_columns": {"log": ["value"]}}, ], } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[AddNumeric, AddThree], primitive_options=options, max_depth=1, ) features = dfs_obj.build_features() add_numeric = [f for f in features if isinstance(f.primitive, AddNumeric)] assert len(add_numeric) == 1 deps = add_numeric[0].get_dependencies(deep=True) assert deps[0].get_name() == "value_2" and deps[1].get_name() == "value" add_three = [f for f in features if isinstance(f.primitive, AddThree)] assert len(add_three) == 1 deps = add_three[0].get_dependencies(deep=True) assert ( deps[0].get_name() == "value_2" and deps[1].get_name() == "value_many_nans" and deps[2].get_name() == "value" ) def test_primitive_options_include_over_exclude(es): options = { "mean": {"ignore_dataframes": ["stores"], "include_dataframes": ["stores"]}, } dfs_obj = DeepFeatureSynthesis( target_dataframe_name="régions", entityset=es, agg_primitives=["mean"], trans_primitives=[], primitive_options=options, ) features = dfs_obj.build_features() at_least_one_mean = False for f in features: deps = f.get_dependencies(deep=True) dataframes = 
[d.dataframe_name for d in deps] if isinstance(f.primitive, Mean): at_least_one_mean = True assert "stores" in dataframes assert at_least_one_mean def test_primitive_ordering(): # Test that the order of the input primitives impacts neither # which features are created nor their order es = make_ecommerce_entityset() trans_prims = [AddNumeric, Absolute, "divide_numeric", NotEqual, "is_null"] groupby_trans_prim = ["cum_mean", CumMin, CumSum] agg_prims = [NMostCommon(n=3), Sum, Mean, Mean(skipna=False), "min", "max"] where_prims = ["count", Sum] seed_num_chars = Feature( es["customers"].ww["favorite_quote"], primitive=NumCharacters, ) seed_is_null = Feature(es["customers"].ww["age"], primitive=IsNull) seed_features = [seed_num_chars, seed_is_null] dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, trans_primitives=trans_prims, groupby_trans_primitives=groupby_trans_prim, agg_primitives=agg_prims, where_primitives=where_prims, seed_features=seed_features, max_features=-1, max_depth=2, ) features1 = dfs_obj.build_features() trans_prims.reverse() groupby_trans_prim.reverse() agg_prims.reverse() where_prims.reverse() seed_features.reverse() dfs_obj = DeepFeatureSynthesis( target_dataframe_name="customers", entityset=es, trans_primitives=trans_prims, groupby_trans_primitives=groupby_trans_prim, agg_primitives=agg_prims, where_primitives=where_prims, seed_features=seed_features, max_features=-1, max_depth=2, ) features2 = dfs_obj.build_features() assert len(features1) == len(features2) for i in range(len(features2)): assert features1[i].unique_name() == features2[i].unique_name() def test_no_transform_stacking(): df1 = pd.DataFrame({"id": [0, 1, 2, 3], "A": [0, 1, 2, 3]}) df2 = pd.DataFrame( {"index": [0, 1, 2, 3], "first_id": [0, 1, 1, 3], "B": [99, 88, 77, 66]}, ) dataframes = {"first": (df1, "id"), "second": (df2, "index")} relationships = [("first", "id", "second", "first_id")] es = EntitySet("data", dataframes, relationships) dfs_obj = 
DeepFeatureSynthesis( target_dataframe_name="second", entityset=es, trans_primitives=["negate", "add_numeric"], agg_primitives=["sum"], max_depth=4, ) feature_defs = dfs_obj.build_features() expected = [ "first_id", "B", "-(B)", "first.A", "first.SUM(second.B)", "first.-(A)", "B + first.A", "first.SUM(second.-(B))", "first.A + SUM(second.B)", "first.-(SUM(second.B))", "B + first.SUM(second.B)", "first.A + SUM(second.-(B))", "first.SUM(second.-(B)) + SUM(second.B)", "first.-(SUM(second.-(B)))", "B + first.SUM(second.-(B))", ] assert len(feature_defs) == len(expected) for feature_name in expected: assert feature_with_name(feature_defs, feature_name) def test_builds_seed_features_on_foreign_key_col(es): seed_feature_sessions = Feature(es["sessions"].ww["customer_id"], primitive=Negate) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, agg_primitives=[], trans_primitives=[], max_depth=2, seed_features=[seed_feature_sessions], ) features = dfs_obj.build_features() assert feature_with_name(features, "-(customer_id)") def test_does_not_build_features_on_last_time_index_col(es): es.add_last_time_indexes() dfs_obj = DeepFeatureSynthesis(target_dataframe_name="log", entityset=es) features = dfs_obj.build_features() for feature in features: assert LTI_COLUMN_NAME not in feature.get_name() def test_builds_features_using_all_input_types(es): new_log_df = es["log"] new_log_df.ww["purchased_nullable"] = es["log"]["purchased"] new_log_df.ww.set_types(logical_types={"purchased_nullable": "boolean_nullable"}) es.replace_dataframe("log", new_log_df) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, trans_primitives=[Not], max_depth=1, ) trans_features = dfs_obj.build_features() assert feature_with_name(trans_features, "NOT(purchased)") assert feature_with_name(trans_features, "NOT(purchased_nullable)") dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, groupby_trans_primitives=[Not], max_depth=1, ) 
groupby_trans_features = dfs_obj.build_features() assert feature_with_name(groupby_trans_features, "NOT(purchased) by session_id") assert feature_with_name( groupby_trans_features, "NOT(purchased_nullable) by session_id", ) dfs_obj = DeepFeatureSynthesis( target_dataframe_name="sessions", entityset=es, trans_primitives=[], agg_primitives=[NumTrue], ) agg_features = dfs_obj.build_features() assert feature_with_name(agg_features, "NUM_TRUE(log.purchased)") assert feature_with_name(agg_features, "NUM_TRUE(log.purchased_nullable)") def test_make_groupby_features_with_depth_none(es): # If max_depth is set to -1, it sets it to None internally, so this # test validates code paths that have a None max_depth dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[], trans_primitives=[], groupby_trans_primitives=["cum_sum"], max_depth=-1, ) features = dfs_obj.build_features() assert feature_with_name(features, "CUM_SUM(value) by session_id") def test_check_stacking_when_building_transform_features(es): class NewMean(Mean): name = "NEW_MEAN" base_of_exclude = [Absolute] dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[NewMean, "mean"], trans_primitives=["absolute"], max_depth=-1, ) features = dfs_obj.build_features() assert number_of_features_with_name_like(features, "ABSOLUTE(MEAN") > 0 assert number_of_features_with_name_like(features, "ABSOLUTE(NEW_MEAN") == 0 def test_check_stacking_when_building_groupby_features(es): class NewMean(Mean): name = "NEW_MEAN" base_of_exclude = [CumSum] dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=[NewMean, "mean"], groupby_trans_primitives=["cum_sum"], max_depth=5, ) features = dfs_obj.build_features() assert number_of_features_with_name_like(features, "CUM_SUM(MEAN") > 0 assert number_of_features_with_name_like(features, "CUM_SUM(NEW_MEAN") == 0 def test_check_stacking_when_building_agg_features(es): class 
NewAbsolute(Absolute): name = "NEW_ABSOLUTE" base_of_exclude = [Mean] dfs_obj = DeepFeatureSynthesis( target_dataframe_name="log", entityset=es, agg_primitives=["mean"], trans_primitives=[NewAbsolute, "absolute"], max_depth=5, ) features = dfs_obj.build_features() assert number_of_features_with_name_like(features, "MEAN(log.ABSOLUTE") > 0 assert number_of_features_with_name_like(features, "MEAN(log.NEW_ABSOLUTE") == 0 ================================================ FILE: featuretools/tests/synthesis/test_dfs_method.py ================================================ import warnings from unittest.mock import patch import composeml as cp import numpy as np import pandas as pd import pytest from packaging.version import parse from woodwork.column_schema import ColumnSchema from woodwork.logical_types import NaturalLanguage from featuretools.computational_backends.calculate_feature_matrix import ( FEATURE_CALCULATION_PERCENTAGE, ) from featuretools.entityset import EntitySet, Timedelta from featuretools.exceptions import UnusedPrimitiveWarning from featuretools.primitives import GreaterThanScalar, Max, Mean, Min, Sum from featuretools.primitives.base import AggregationPrimitive, TransformPrimitive from featuretools.synthesis import dfs from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis @pytest.fixture def datetime_es(): cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]}) transactions_df = pd.DataFrame( { "id": [1, 2, 3, 4, 5], "card_id": [1, 1, 5, 1, 5], "transaction_time": pd.to_datetime( [ "2011-2-28 04:00", "2012-2-28 05:00", "2012-2-29 06:00", "2012-3-1 08:00", "2014-4-1 10:00", ], ), "fraud": [True, False, False, False, True], }, ) datetime_es = EntitySet(id="fraud_data") datetime_es = datetime_es.add_dataframe( dataframe_name="transactions", dataframe=transactions_df, index="id", time_index="transaction_time", ) datetime_es = datetime_es.add_dataframe( dataframe_name="cards", dataframe=cards_df, index="id", ) datetime_es = 
datetime_es.add_relationship("cards", "id", "transactions", "card_id") datetime_es.add_last_time_indexes() return datetime_es def test_dfs_empty_features(): error_text = "No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data." teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]}) games = pd.DataFrame( { "id": range(5), "home_team_id": [2, 2, 1, 0, 1], "away_team_id": [1, 0, 2, 1, 0], "home_team_score": [3, 0, 1, 0, 4], "away_team_score": [2, 1, 2, 0, 0], }, ) dataframes = { "teams": (teams, "id", None, {"name": "natural_language"}), "games": (games, "id"), } relationships = [("teams", "id", "games", "home_team_id")] with patch.object(DeepFeatureSynthesis, "build_features", return_value=[]): features = dfs( dataframes, relationships, target_dataframe_name="teams", features_only=True, ) assert features == [] with ( pytest.raises(AssertionError, match=error_text), patch.object( DeepFeatureSynthesis, "build_features", return_value=[], ), ): dfs( dataframes, relationships, target_dataframe_name="teams", features_only=False, ) def test_passing_strings_to_logical_types_dfs(): teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]}) games = pd.DataFrame( { "id": range(5), "home_team_id": [2, 2, 1, 0, 1], "away_team_id": [1, 0, 2, 1, 0], "home_team_score": [3, 0, 1, 0, 4], "away_team_score": [2, 1, 2, 0, 0], }, ) dataframes = { "teams": (teams, "id", None, {"name": "natural_language"}), "games": (games, "id"), } relationships = [("teams", "id", "games", "home_team_id")] features = dfs( dataframes, relationships, target_dataframe_name="teams", features_only=True, ) name_logical_type = features[0].dataframe["name"].ww.logical_type assert isinstance(name_logical_type, NaturalLanguage) def test_accepts_cutoff_time_df(dataframes, relationships): cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]}) 
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
    )
    assert len(feature_matrix.index) == 3
    assert len(feature_matrix.columns) == len(features)


def test_accepts_cutoff_time_compose(dataframes, relationships):
    def fraud_occurred(df):
        return df["fraud"].any()

    kwargs = {
        "time_index": "transaction_time",
        "labeling_function": fraud_occurred,
        "window_size": 1,
    }

    if parse(cp.__version__) >= parse("0.10.0"):
        kwargs["target_dataframe_index"] = "card_id"
    else:
        kwargs["target_dataframe_name"] = "card_id"  # pragma: no cover

    lm = cp.LabelMaker(**kwargs)

    transactions_df = dataframes["transactions"][0]

    labels = lm.search(transactions_df, num_examples_per_instance=-1)

    labels["time"] = pd.to_numeric(labels["time"])
    labels.rename({"card_id": "id"}, axis=1, inplace=True)

    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="cards",
        cutoff_time=labels,
    )
    assert len(feature_matrix.index) == 6
    assert len(feature_matrix.columns) == len(features) + 1


def test_accepts_single_cutoff_time(dataframes, relationships):
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=20,
    )
    assert len(feature_matrix.index) == 5
    assert len(feature_matrix.columns) == len(features)


def test_accepts_no_cutoff_time(dataframes, relationships):
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        instance_ids=[1, 2, 3, 5, 6],
    )
    assert len(feature_matrix.index) == 5
    assert len(feature_matrix.columns) == len(features)


def test_ignores_instance_ids_if_cutoff_df(dataframes, relationships):
    cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]})

    instance_ids = [1, 2, 3, 4, 5]
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
target_dataframe_name="transactions", cutoff_time=cutoff_times_df, instance_ids=instance_ids, ) assert len(feature_matrix.index) == 3 assert len(feature_matrix.columns) == len(features) def test_approximate_features(dataframes, relationships): cutoff_times_df = pd.DataFrame( {"instance_id": [1, 3, 1, 5, 3, 6], "time": [11, 16, 16, 26, 17, 22]}, ) # force column to BooleanNullable dataframes["transactions"] += ({"fraud": "BooleanNullable"},) feature_matrix, features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", cutoff_time=cutoff_times_df, approximate=5, cutoff_time_in_index=True, ) direct_agg_feat_name = "cards.PERCENT_TRUE(transactions.fraud)" assert len(feature_matrix.index) == 6 assert len(feature_matrix.columns) == len(features) truth_values = pd.Series(data=[1.0, 0.5, 0.5, 1.0, 0.5, 1.0]) assert (feature_matrix[direct_agg_feat_name] == truth_values.values).all() def test_all_columns(dataframes, relationships): cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]}) feature_matrix, features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", cutoff_time=cutoff_times_df, agg_primitives=[Max, Mean, Min, Sum], trans_primitives=[], groupby_trans_primitives=["cum_sum"], max_depth=3, allowed_paths=None, ignore_dataframes=None, ignore_columns=None, seed_features=None, ) assert len(feature_matrix.index) == 3 assert len(feature_matrix.columns) == len(features) def test_features_only(dataframes, relationships): if len(dataframes["transactions"]) > 3: dataframes["transactions"][3]["fraud"] = "BooleanNullable" else: dataframes["transactions"] += ({"fraud": "BooleanNullable"},) features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", features_only=True, ) expected_features = 11 assert len(features) == expected_features def test_accepts_relative_training_window(datetime_es): feature_matrix, _ = 
dfs(entityset=datetime_es, target_dataframe_name="transactions") feature_matrix_2, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-4-1 04:00"), ) feature_matrix_3, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-4-1 04:00"), training_window=Timedelta("3 months"), ) feature_matrix_4, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-4-1 04:00"), training_window="3 months", ) assert (feature_matrix.index == [1, 2, 3, 4, 5]).all() assert (feature_matrix_2.index == [1, 2, 3, 4]).all() assert (feature_matrix_3.index == [2, 3, 4]).all() assert (feature_matrix_4.index == [2, 3, 4]).all() # Test case for leap years feature_matrix_5, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-2-29 04:00"), training_window=Timedelta("1 year"), include_cutoff_time=True, ) assert (feature_matrix_5.index == [2]).all() feature_matrix_5, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-2-29 04:00"), training_window=Timedelta("1 year"), include_cutoff_time=False, ) assert (feature_matrix_5.index == [1, 2]).all() def test_accepts_pd_timedelta_training_window(datetime_es): feature_matrix, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-3-31 04:00"), training_window=pd.Timedelta(61, "D"), ) assert (feature_matrix.index == [2, 3, 4]).all() def test_accepts_pd_dateoffset_training_window(datetime_es): feature_matrix, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-3-31 04:00"), training_window=pd.DateOffset(months=2), ) feature_matrix_2, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.Timestamp("2012-3-31 04:00"), training_window=pd.offsets.BDay(44), ) assert (feature_matrix.index == [2, 3, 4]).all() 
assert (feature_matrix.index == feature_matrix_2.index).all() def test_accepts_datetime_and_string_offset(datetime_es): feature_matrix, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time=pd.to_datetime("2012-3-31 04:00"), training_window=pd.DateOffset(months=2), ) feature_matrix_2, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time="2012-3-31 04:00", training_window=pd.offsets.BDay(44), ) assert (feature_matrix.index == [2, 3, 4]).all() assert (feature_matrix.index == feature_matrix_2.index).all() def test_handles_pandas_parser_error(datetime_es): with pytest.raises(ValueError): _, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time="2--012-----3-----31 04:00", training_window=pd.DateOffset(months=2), ) def test_handles_pandas_overflow_error(datetime_es): # pandas 1.5.0 raises ValueError, older versions raised OverflowError with pytest.raises((OverflowError, ValueError)): _, _ = dfs( entityset=datetime_es, target_dataframe_name="transactions", cutoff_time="200000000000000000000000000000000000000000000000000000000000000000-3-31 04:00", training_window=pd.DateOffset(months=2), ) def test_warns_with_unused_primitives(es): trans_primitives = ["num_characters", "num_words", "add_numeric"] agg_primitives = [Max, "min"] warning_text = ( "Some specified primitives were not used during DFS:\n" + " trans_primitives: ['add_numeric']\n agg_primitives: ['max', 'min']\n" + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." 
) with pytest.warns(UnusedPrimitiveWarning) as record: dfs( entityset=es, target_dataframe_name="customers", trans_primitives=trans_primitives, agg_primitives=agg_primitives, max_depth=1, features_only=True, ) assert record[0].message.args[0] == warning_text # Should not raise a warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="customers", trans_primitives=trans_primitives, agg_primitives=agg_primitives, max_depth=2, features_only=True, ) def test_no_warns_with_camel_and_title_case(es): for trans_primitive in ["isNull", "IsNull"]: # Should not raise a UnusedPrimitiveWarning warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="customers", trans_primitives=[trans_primitive], max_depth=1, features_only=True, ) for agg_primitive in ["numUnique", "NumUnique"]: # Should not raise a UnusedPrimitiveWarning warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="customers", agg_primitives=[agg_primitive], max_depth=2, features_only=True, ) def test_does_not_warn_with_stacking_feature(es): with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="régions", agg_primitives=["percent_true"], trans_primitives=[GreaterThanScalar(5)], primitive_options={ "greater_than_scalar": {"include_dataframes": ["stores"]}, }, features_only=True, ) def test_warns_with_unused_where_primitives(es): warning_text = ( "Some specified primitives were not used during DFS:\n" + " where_primitives: ['count', 'sum']\n" + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." 
) with pytest.warns(UnusedPrimitiveWarning) as record: dfs( entityset=es, target_dataframe_name="customers", agg_primitives=["count"], where_primitives=["sum", "count"], max_depth=1, features_only=True, ) assert record[0].message.args[0] == warning_text def test_warns_with_unused_groupby_primitives(es): warning_text = ( "Some specified primitives were not used during DFS:\n" + " groupby_trans_primitives: ['cum_sum']\n" + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." ) with pytest.warns(UnusedPrimitiveWarning) as record: dfs( entityset=es, target_dataframe_name="sessions", groupby_trans_primitives=["cum_sum"], max_depth=1, features_only=True, ) assert record[0].message.args[0] == warning_text # Should not raise a warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="customers", groupby_trans_primitives=["cum_sum"], max_depth=1, features_only=True, ) def test_warns_with_unused_custom_primitives(es): class AboveTen(TransformPrimitive): name = "above_ten" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) trans_primitives = [AboveTen] warning_text = ( "Some specified primitives were not used during DFS:\n" + " trans_primitives: ['above_ten']\n" + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." 
) with pytest.warns(UnusedPrimitiveWarning) as record: dfs( entityset=es, target_dataframe_name="sessions", trans_primitives=trans_primitives, max_depth=1, features_only=True, ) assert record[0].message.args[0] == warning_text # Should not raise a warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="customers", trans_primitives=trans_primitives, max_depth=1, features_only=True, ) class MaxAboveTen(AggregationPrimitive): name = "max_above_ten" input_types = [ColumnSchema(semantic_tags={"numeric"})] return_type = ColumnSchema(semantic_tags={"numeric"}) agg_primitives = [MaxAboveTen] warning_text = ( "Some specified primitives were not used during DFS:\n" + " agg_primitives: ['max_above_ten']\n" + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, " + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call " + "contained multiple instances of a primitive in the list above, none of them were used." 
) with pytest.warns(UnusedPrimitiveWarning) as record: dfs( entityset=es, target_dataframe_name="stores", agg_primitives=agg_primitives, max_depth=1, features_only=True, ) assert record[0].message.args[0] == warning_text # Should not raise a warning with warnings.catch_warnings(): warnings.simplefilter("error") dfs( entityset=es, target_dataframe_name="sessions", agg_primitives=agg_primitives, max_depth=1, features_only=True, ) def test_calls_progress_callback(dataframes, relationships): class MockProgressCallback: def __init__(self): self.progress_history = [] self.total_update = 0 self.total_progress_percent = 0 def __call__(self, update, progress_percent, time_elapsed): self.total_update += update self.total_progress_percent = progress_percent self.progress_history.append(progress_percent) mock_progress_callback = MockProgressCallback() dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", progress_callback=mock_progress_callback, ) # second to last entry is the last update from feature calculation assert np.isclose( mock_progress_callback.progress_history[-2], FEATURE_CALCULATION_PERCENTAGE * 100, ) assert np.isclose(mock_progress_callback.total_update, 100.0) assert np.isclose(mock_progress_callback.total_progress_percent, 100.0) def test_calls_progress_callback_cluster(dataframes, relationships, dask_cluster): class MockProgressCallback: def __init__(self): self.progress_history = [] self.total_update = 0 self.total_progress_percent = 0 def __call__(self, update, progress_percent, time_elapsed): self.total_update += update self.total_progress_percent = progress_percent self.progress_history.append(progress_percent) mock_progress_callback = MockProgressCallback() dkwargs = {"cluster": dask_cluster.scheduler.address} dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", progress_callback=mock_progress_callback, dask_kwargs=dkwargs, ) assert 
np.isclose(mock_progress_callback.total_update, 100.0) assert np.isclose(mock_progress_callback.total_progress_percent, 100.0) def test_dask_kwargs(dataframes, relationships, dask_cluster): cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]}) feature_matrix, features = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", cutoff_time=cutoff_times_df, ) dask_kwargs = {"cluster": dask_cluster.scheduler.address} feature_matrix_2, features_2 = dfs( dataframes=dataframes, relationships=relationships, target_dataframe_name="transactions", cutoff_time=cutoff_times_df, dask_kwargs=dask_kwargs, ) assert all( f1.unique_name() == f2.unique_name() for f1, f2 in zip(features, features_2) ) for column in feature_matrix: for x, y in zip(feature_matrix[column], feature_matrix_2[column]): assert (pd.isnull(x) and pd.isnull(y)) or (x == y) ================================================ FILE: featuretools/tests/synthesis/test_encode_features.py ================================================ import pandas as pd import pytest from featuretools import EntitySet, calculate_feature_matrix, dfs from featuretools.feature_base import Feature, IdentityFeature from featuretools.primitives import NMostCommon from featuretools.synthesis import encode_features def test_encodes_features(es): f1 = IdentityFeature(es["log"].ww["product_id"]) f2 = IdentityFeature(es["log"].ww["purchased"]) f3 = IdentityFeature(es["log"].ww["value"]) features = [f1, f2, f3] feature_matrix = calculate_feature_matrix( features, es, instance_ids=[0, 1, 2, 3, 4, 5], ) _, features_encoded = encode_features(feature_matrix, features) assert len(features_encoded) == 6 _, features_encoded = encode_features(feature_matrix, features, top_n=2) assert len(features_encoded) == 5 _, features_encoded = encode_features( feature_matrix, features, include_unknown=False, ) assert len(features_encoded) == 5 def test_inplace_encodes_features(es): f1 = 
IdentityFeature(es["log"].ww["product_id"]) features = [f1] feature_matrix = calculate_feature_matrix( features, es, instance_ids=[0, 1, 2, 3, 4, 5], ) feature_matrix_shape = feature_matrix.shape feature_matrix_encoded, _ = encode_features(feature_matrix, features) assert feature_matrix_encoded.shape != feature_matrix_shape assert feature_matrix.shape == feature_matrix_shape # inplace they should be the same feature_matrix_encoded, _ = encode_features(feature_matrix, features, inplace=True) assert feature_matrix_encoded.shape == feature_matrix.shape def test_to_encode_features(es): f1 = IdentityFeature(es["log"].ww["product_id"]) f2 = IdentityFeature(es["log"].ww["value"]) f3 = IdentityFeature(es["log"].ww["datetime"]) features = [f1, f2, f3] feature_matrix = calculate_feature_matrix( features, es, instance_ids=[0, 1, 2, 3, 4, 5], ) feature_matrix_encoded, _ = encode_features(feature_matrix, features) feature_matrix_encoded_shape = feature_matrix_encoded.shape # to_encode should keep product_id as a string and datetime as a date, # and not have the same shape as previous encoded matrix due to fewer encoded features to_encode = [] feature_matrix_encoded, _ = encode_features( feature_matrix, features, to_encode=to_encode, ) assert feature_matrix_encoded_shape != feature_matrix_encoded.shape assert feature_matrix_encoded["datetime"].dtype == "datetime64[ns]" assert feature_matrix_encoded["product_id"].dtype == "category" to_encode = ["value"] feature_matrix_encoded, _ = encode_features( feature_matrix, features, to_encode=to_encode, ) assert feature_matrix_encoded_shape != feature_matrix_encoded.shape assert feature_matrix_encoded["datetime"].dtype == "datetime64[ns]" assert feature_matrix_encoded["product_id"].dtype == "category" def test_encode_features_handles_pass_columns(es): f1 = IdentityFeature(es["log"].ww["product_id"]) f2 = IdentityFeature(es["log"].ww["value"]) features = [f1, f2] cutoff_time = pd.DataFrame( { "instance_id": range(6), "time": 
es["log"]["datetime"][0:6], "label": [i % 2 for i in range(6)], }, columns=["instance_id", "time", "label"], ) feature_matrix = calculate_feature_matrix(features, es, cutoff_time) assert "label" in feature_matrix.columns feature_matrix_encoded, _ = encode_features(feature_matrix, features) feature_matrix_encoded_shape = feature_matrix_encoded.shape # to_encode should keep product_id as a string, and not create 3 additional columns to_encode = [] feature_matrix_encoded, _ = encode_features( feature_matrix, features, to_encode=to_encode, ) assert feature_matrix_encoded_shape != feature_matrix_encoded.shape to_encode = ["value"] feature_matrix_encoded, _ = encode_features( feature_matrix, features, to_encode=to_encode, ) assert feature_matrix_encoded_shape != feature_matrix_encoded.shape assert "label" in feature_matrix_encoded.columns def test_encode_features_catches_features_mismatch(es): f1 = IdentityFeature(es["log"].ww["product_id"]) f2 = IdentityFeature(es["log"].ww["value"]) f3 = IdentityFeature(es["log"].ww["session_id"]) features = [f1, f2] cutoff_time = pd.DataFrame( { "instance_id": range(6), "time": es["log"]["datetime"][0:6], "label": [i % 2 for i in range(6)], }, columns=["instance_id", "time", "label"], ) feature_matrix = calculate_feature_matrix(features, es, cutoff_time) assert "label" in feature_matrix.columns error_text = "Feature session_id not found in feature matrix" with pytest.raises(AssertionError, match=error_text): encode_features(feature_matrix, [f1, f3]) def test_encode_unknown_features(): # Dataframe with categorical column with "unknown" string df = pd.DataFrame({"category": ["unknown", "b", "c", "d", "e"]}).astype( {"category": "category"}, ) es = EntitySet("test") es.add_dataframe( dataframe_name="a", dataframe=df, index="index", make_index=True, ) features, feature_defs = dfs( entityset=es, target_dataframe_name="a", max_depth=1, ) # Specify unknown token for replacement features_enc, _ = encode_features(features, feature_defs, 
include_unknown=True) assert list(features_enc.columns) == [ "category = unknown", "category = e", "category = d", "category = c", "category = b", "category is unknown", ] def test_encode_features_topn(es): topn = Feature( Feature(es["log"].ww["product_id"]), parent_dataframe_name="customers", primitive=NMostCommon(n=3), ) features, feature_defs = dfs( entityset=es, instance_ids=[0, 1, 2], target_dataframe_name="customers", agg_primitives=[NMostCommon(n=3)], ) features_enc, feature_defs_enc = encode_features( features, feature_defs, include_unknown=True, ) assert topn.unique_name() in [feat.unique_name() for feat in feature_defs_enc] for name in topn.get_feature_names(): assert name in features_enc.columns assert features_enc.columns.tolist().count(name) == 1 def test_encode_features_drop_first(): df = pd.DataFrame({"category": ["ao", "b", "c", "d", "e"]}).astype( {"category": "category"}, ) es = EntitySet("test") es.add_dataframe( dataframe_name="a", dataframe=df, index="index", make_index=True, ) features, feature_defs = dfs( entityset=es, target_dataframe_name="a", max_depth=1, ) features_enc, _ = encode_features( features, feature_defs, drop_first=True, include_unknown=False, ) assert len(features_enc.columns) == 4 features_enc, feature_defs = encode_features( features, feature_defs, top_n=3, drop_first=True, include_unknown=False, ) assert len(features_enc.columns) == 2 def test_encode_features_handles_dictionary_input(es): f1 = IdentityFeature(es["log"].ww["product_id"]) f2 = IdentityFeature(es["log"].ww["purchased"]) f3 = IdentityFeature(es["log"].ww["session_id"]) features = [f1, f2, f3] feature_matrix = calculate_feature_matrix(features, es, instance_ids=range(16)) feature_matrix_encoded, features_encoded = encode_features(feature_matrix, features) true_values = [ "product_id = coke zero", "product_id = toothpaste", "product_id = car", "product_id = brown bag", "product_id = taco clock", "product_id = Haribo sugar-free gummy bears", "product_id is 
unknown", "purchased", "session_id = 0", "session_id = 1", "session_id = 4", "session_id = 3", "session_id = 5", "session_id = 2", "session_id is unknown", ] assert len(features_encoded) == 15 for col in true_values: assert col in list(feature_matrix_encoded.columns) top_n_dict = {} feature_matrix_encoded, features_encoded = encode_features( feature_matrix, features, top_n=top_n_dict, ) assert len(features_encoded) == 15 for col in true_values: assert col in list(feature_matrix_encoded.columns) top_n_dict = {f1.get_name(): 4, f3.get_name(): 3} feature_matrix_encoded, features_encoded = encode_features( feature_matrix, features, top_n=top_n_dict, ) assert len(features_encoded) == 10 true_values = [ "product_id = coke zero", "product_id = toothpaste", "product_id = car", "product_id = brown bag", "product_id is unknown", "purchased", "session_id = 0", "session_id = 1", "session_id = 4", "session_id is unknown", ] for col in true_values: assert col in list(feature_matrix_encoded.columns) feature_matrix_encoded, features_encoded = encode_features( feature_matrix, features, top_n=top_n_dict, include_unknown=False, ) true_values = [ "product_id = coke zero", "product_id = toothpaste", "product_id = car", "product_id = brown bag", "purchased", "session_id = 0", "session_id = 1", "session_id = 4", ] assert len(features_encoded) == 8 for col in true_values: assert col in list(feature_matrix_encoded.columns) def test_encode_features_matches_calculate_feature_matrix(): df = pd.DataFrame({"category": ["b", "c", "d", "e"]}).astype( {"category": "category"}, ) es = EntitySet("test") es.add_dataframe( dataframe_name="a", dataframe=df, index="index", make_index=True, ) features, feature_defs = dfs( entityset=es, target_dataframe_name="a", max_depth=1, ) features_enc, feature_defs_enc = encode_features( features, feature_defs, to_encode=["category"], ) features_calc = calculate_feature_matrix(feature_defs_enc, entityset=es) pd.testing.assert_frame_equal(features_enc, features_calc) 
    assert features_calc.ww._schema == features_enc.ww._schema

================================================
FILE: featuretools/tests/synthesis/test_get_valid_primitives.py
================================================
import pytest
from woodwork.column_schema import ColumnSchema

from featuretools.primitives import (
    AggregationPrimitive,
    Count,
    Hour,
    IsIn,
    Not,
    TimeSincePrevious,
    TransformPrimitive,
)
from featuretools.synthesis.get_valid_primitives import get_valid_primitives


def test_get_valid_primitives_selected_primitives(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "log",
        selected_primitives=[Hour, Count],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=[Hour],
        max_depth=1,
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()


def test_get_valid_primitives_selected_primitives_strings(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "log",
        selected_primitives=["hour", "count"],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=["hour"],
        max_depth=1,
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()


def test_invalid_primitive(es):
    with pytest.raises(ValueError, match="'foobar' is not a recognized primitive name"):
        get_valid_primitives(
            es,
            target_dataframe_name="log",
            selected_primitives=["foobar"],
        )

    msg = (
        "Selected primitive "
        "is not an AggregationPrimitive, TransformPrimitive, or str"
    )
    with pytest.raises(ValueError, match=msg):
        get_valid_primitives(
            es,
            target_dataframe_name="log",
            selected_primitives=[ColumnSchema],
        )


def test_primitive_compatibility(es):
    _, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[TimeSincePrevious],
    )
    assert len(trans_prims) == 1


def test_get_valid_primitives_custom_primitives(es):
    class ThreeMostCommonCat(AggregationPrimitive):
        name
= "n_most_common_categorical"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})
        number_output_features = 3

    class AddThree(TransformPrimitive):
        name = "add_three"
        input_types = [
            ColumnSchema(semantic_tags="numeric"),
            ColumnSchema(semantic_tags="numeric"),
            ColumnSchema(semantic_tags="numeric"),
        ]
        return_type = ColumnSchema(semantic_tags="numeric")
        commutative = True

    agg_prims, trans_prims = get_valid_primitives(es, "log")
    assert ThreeMostCommonCat not in agg_prims
    assert AddThree not in trans_prims

    with pytest.raises(
        ValueError,
        match="'add_three' is not a recognized primitive name",
    ):
        agg_prims, trans_prims = get_valid_primitives(
            es,
            "log",
            2,
            [ThreeMostCommonCat, "add_three"],
        )


def test_get_valid_primitives_all_primitives(es):
    agg_prims, trans_prims = get_valid_primitives(es, "customers")
    assert Count in agg_prims
    assert Hour in trans_prims


def test_get_valid_primitives_single_table(transform_es):
    msg = "Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created"
    with pytest.warns(UserWarning, match=msg):
        agg_prims, trans_prims = get_valid_primitives(transform_es, "first")

    assert set(agg_prims) == set()
    assert IsIn in trans_prims


def test_get_valid_primitives_with_dfs_kwargs(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[Hour, Count, Not],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour, Not])

    # Can use other dfs parameters and they get applied
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[Hour, Count, Not],
        ignore_columns={"customers": ["loves_ice_cream"]},
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=[Hour, Count],
        ignore_dataframes=["log"],
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()
================================================
FILE: featuretools/tests/test_version.py
================================================
from featuretools import __version__


def test_version():
    assert __version__ == "1.31.0"

================================================
FILE: featuretools/tests/testing_utils/__init__.py
================================================
# flake8: noqa
from featuretools.tests.testing_utils.cluster import (
    MockClient,
    mock_cluster,
    get_mock_client_cluster,
)
from featuretools.tests.testing_utils.es_utils import get_df_tags
from featuretools.tests.testing_utils.features import (
    feature_with_name,
    number_of_features_with_name_like,
    backward_path,
    forward_path,
    check_rename,
    check_names,
)
from featuretools.tests.testing_utils.mock_ds import make_ecommerce_entityset

================================================
FILE: featuretools/tests/testing_utils/cluster.py
================================================
from psutil import virtual_memory


def mock_cluster(
    n_workers=1,
    threads_per_worker=1,
    diagnostics_port=8787,
    memory_limit=None,
    **dask_kwarg,
):
    return (n_workers, threads_per_worker, diagnostics_port, memory_limit)


class MockClient:
    def __init__(self, cluster):
        self.cluster = cluster

    def scheduler_info(self):
        return {"workers": {"worker 1": {"memory_limit": virtual_memory().total}}}


def get_mock_client_cluster():
    return MockClient, mock_cluster

================================================
FILE: featuretools/tests/testing_utils/es_utils.py
================================================
def get_df_tags(df):
    """Gets a DataFrame's semantic tags without index or time index tags for Woodwork init"""
    semantic_tags = {}
    for col_name in df.columns:
        semantic_tags[col_name] = df.ww.semantic_tags[col_name] - {
            "time_index",
            "index",
        }
    return semantic_tags

================================================
FILE: featuretools/tests/testing_utils/features.py
================================================
import re

from
featuretools.entityset.relationship import RelationshipPath def feature_with_name(features, name): for f in features: if f.get_name() == name: return True return False def number_of_features_with_name_like(features, pattern): """Returns number of features with names that match the provided regex pattern""" pattern = re.compile(re.escape(pattern)) names = [f.get_name() for f in features] return len([name for name in names if pattern.search(name)]) def backward_path(es, dataframe_ids): """ Create a backward RelationshipPath through the given dataframes. Assumes only one such path is possible. """ def _get_relationship(child, parent): return next( r for r in es.get_forward_relationships(child) if r._parent_dataframe_name == parent ) relationships = [ _get_relationship(child, parent) for parent, child in zip(dataframe_ids[:-1], dataframe_ids[1:]) ] return RelationshipPath([(False, r) for r in relationships]) def forward_path(es, dataframe_ids): """ Create a forward RelationshipPath through the given dataframes. Assumes only one such path is possible. 
""" def _get_relationship(child, parent): return next( r for r in es.get_forward_relationships(child) if r._parent_dataframe_name == parent ) relationships = [ _get_relationship(child, parent) for child, parent in zip(dataframe_ids[:-1], dataframe_ids[1:]) ] return RelationshipPath([(True, r) for r in relationships]) def check_rename(feat, new_name, new_names): copy_feat = feat.rename(new_name) assert feat.unique_name() != copy_feat.unique_name() assert feat.get_name() != copy_feat.get_name() assert ( feat.base_features[0].generate_name() == copy_feat.base_features[0].generate_name() ) assert feat.dataframe_name == copy_feat.dataframe_name assert feat.get_feature_names() != copy_feat.get_feature_names() check_names(copy_feat, new_name, new_names) def check_names(feat, new_name, new_names): assert feat.get_name() == new_name assert feat.get_feature_names() == new_names ================================================ FILE: featuretools/tests/testing_utils/generate_fake_dataframe.py ================================================ import random from datetime import datetime as dt import pandas as pd import woodwork.type_sys.type_system as ww_type_system from woodwork import logical_types from featuretools.feature_discovery.utils import flatten_list logical_type_mapping = { logical_types.Boolean.__name__: [True, False], logical_types.BooleanNullable.__name__: [True, False, pd.NA], logical_types.Categorical.__name__: ["A", "B", "C"], logical_types.Datetime.__name__: [ dt(2020, 1, 1, 12, 0, 0), dt(2020, 6, 1, 12, 0, 0), ], logical_types.Double.__name__: [1.2, 2.3, 3.4], logical_types.Integer.__name__: [1, 2, 3], logical_types.IntegerNullable.__name__: [1, 2, 3, pd.NA], logical_types.EmailAddress.__name__: [ "john.smith@example.com", "sally.jones@example.com", ], logical_types.LatLong.__name__: [(1, 2), (3, 4)], logical_types.NaturalLanguage.__name__: [ "This is sentence 1", "This is sentence 2", ], logical_types.Ordinal.__name__: [1, 2, 3], logical_types.URL.__name__: 
        ["https://www.example.com", "https://www.example2.com"],
    logical_types.PostalCode.__name__: ["60018", "60018-0123"],
}


def generate_fake_dataframe(
    col_defs=[("f_1", "Numeric"), ("f_2", "Datetime", "time_index")],
    n_rows=10,
    df_name="df",
):
    def randomize(values_):
        random.seed(10)
        values = values_.copy()
        random.shuffle(values)
        return values

    def gen_series(values):
        values = [values] * n_rows
        if isinstance(values, list):
            values = flatten_list(values)

        return randomize(values)[:n_rows]

    def get_tags(lt, tags=set()):
        inferred_tags = ww_type_system.str_to_logical_type(lt).standard_tags
        assert isinstance(inferred_tags, set)
        return inferred_tags.union(tags) - {"index", "time_index"}

    other_kwargs = {}

    df = pd.DataFrame()
    lt_dict = {}
    tags_dict = {}
    for name, lt_name, *rest in col_defs:
        if lt_name in logical_type_mapping:
            values = logical_type_mapping[lt_name]
            if lt_name == logical_types.Ordinal.__name__:
                lt = logical_types.Ordinal(order=values)
            else:
                lt = lt_name
            values = gen_series(values)
        else:
            raise Exception(f"Unknown logical type {lt_name}")

        lt_dict[name] = lt

        if len(rest):
            tags = rest[0]
            if "index" in tags:
                other_kwargs["index"] = name
                values = range(n_rows)
            if "time_index" in tags:
                other_kwargs["time_index"] = name
                values = pd.date_range("2000-01-01", periods=n_rows)
            tags_dict[name] = get_tags(lt_name, tags)
        else:
            tags_dict[name] = get_tags(lt_name)

        s = pd.Series(values, name=name)
        df = pd.concat([df, s], axis=1)

    df.ww.init(
        name=df_name,
        logical_types=lt_dict,
        semantic_tags=tags_dict,
        **other_kwargs,
    )

    return df


================================================
FILE: featuretools/tests/testing_utils/mock_ds.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
from woodwork.logical_types import (
    URL,
    Boolean,
    Categorical,
    CountryCode,
    Datetime,
    Double,
    EmailAddress,
    Filepath,
    Integer,
    IPAddress,
    LatLong,
    NaturalLanguage,
    Ordinal,
    PersonFullName,
    PhoneNumber,
    PostalCode,
    SubRegionCode,
)

from featuretools.entityset import EntitySet


def make_ecommerce_entityset(with_integer_time_index=False):
    """Makes an entityset with the following shape:

      R         Régions
     / \\       .
    S   C       Stores, Customers
        |       .
        S   P   Sessions, Products
         \\ /   .
          L     Log
    """
    dataframes = make_ecommerce_dataframes(
        with_integer_time_index=with_integer_time_index,
    )
    dataframe_names = dataframes.keys()
    es_id = "ecommerce"
    if with_integer_time_index:
        es_id += "_int_time_index"

    logical_types = make_logical_types(with_integer_time_index=with_integer_time_index)
    semantic_tags = make_semantic_tags()
    time_indexes = make_time_indexes(with_integer_time_index=with_integer_time_index)

    es = EntitySet(id=es_id)

    for df_name in dataframe_names:
        time_index = time_indexes.get(df_name, None)
        ti_name = None
        secondary = None
        if time_index is not None:
            ti_name = time_index["name"]
            secondary = time_index["secondary"]
        df = dataframes[df_name]
        es.add_dataframe(
            df,
            dataframe_name=df_name,
            index="id",
            logical_types=logical_types[df_name],
            semantic_tags=semantic_tags[df_name],
            time_index=ti_name,
            secondary_time_index=secondary,
        )

    es.normalize_dataframe(
        "customers",
        "cohorts",
        "cohort",
        additional_columns=["cohort_name"],
        make_time_index=True,
        new_dataframe_time_index="cohort_end",
    )

    es.add_relationships(
        [
            ("régions", "id", "customers", "région_id"),
            ("régions", "id", "stores", "région_id"),
            ("customers", "id", "sessions", "customer_id"),
            ("sessions", "id", "log", "session_id"),
            ("products", "id", "log", "product_id"),
        ],
    )

    return es


def make_ecommerce_dataframes(with_integer_time_index=False):
    region_df = pd.DataFrame(
        {"id": ["United States", "Mexico"], "language": ["en", "sp"]},
    )

    store_df = pd.DataFrame(
        {
            "id": range(6),
            "région_id": ["United States"] * 3 + ["Mexico"] * 2 + [np.nan],
            "num_square_feet": list(range(30000, 60000, 6000)) + [np.nan],
        },
    )

    product_df = pd.DataFrame(
        {
            "id": [
                "Haribo sugar-free gummy bears",
                "car",
                "toothpaste",
                "brown bag",
                "coke zero",
                "taco clock",
            ],
            "department": [
                "food",
                "electronics",
                "health",
                "food",
                "food",
                "electronics",
            ],
            "rating": [3.5, 4.0, 4.5, 1.5, 5.0, 5.0],
            "url": [
                "google.com",
                "https://www.featuretools.com/",
                "amazon.com",
                "www.featuretools.com",
                "bit.ly",
                "featuretools.com/demos/",
            ],
        },
    )
    customer_times = {
        "signup_date": [
            datetime(2011, 4, 8),
            datetime(2011, 4, 9),
            datetime(2011, 4, 6),
        ],
        # some point after signup date
        "upgrade_date": [
            datetime(2011, 4, 10),
            datetime(2011, 4, 11),
            datetime(2011, 4, 7),
        ],
        "cancel_date": [
            datetime(2011, 6, 8),
            datetime(2011, 10, 9),
            datetime(2012, 1, 6),
        ],
        "birthday": [datetime(1993, 3, 8), datetime(1926, 8, 2), datetime(1993, 4, 20)],
    }
    if with_integer_time_index:
        customer_times["signup_date"] = [6, 7, 4]
        customer_times["upgrade_date"] = [18, 26, 5]
        customer_times["cancel_date"] = [27, 28, 29]
        customer_times["birthday"] = [2, 1, 3]

    customer_df = pd.DataFrame(
        {
            "id": pd.Categorical([0, 1, 2]),
            "age": [33, 25, 56],
            "région_id": ["United States"] * 3,
            "cohort": [0, 1, 0],
            "cohort_name": ["Early Adopters", "Late Adopters", "Early Adopters"],
            "loves_ice_cream": [True, False, True],
            "favorite_quote": [
                "The proletariat have nothing to lose but their chains",
                "Capitalism deprives us all of self-determination",
                "All members of the working classes must seize the "
                "means of production.",
            ],
            "signup_date": customer_times["signup_date"],
            # some point after signup date
            "upgrade_date": customer_times["upgrade_date"],
            "cancel_date": customer_times["cancel_date"],
            "cancel_reason": ["reason_1", "reason_2", "reason_1"],
            "engagement_level": [1, 3, 2],
            "full_name": ["Mr. John Doe", "Doe, Mrs. Jane", "James Brown"],
            "email": ["john.smith@example.com", np.nan, "team@featuretools.com"],
            "phone_number": ["555-555-5555", "555-555-5555", "1-(555)-555-5555"],
            "birthday": customer_times["birthday"],
        },
    )

    ips = [
        "192.168.0.1",
        "2001:4860:4860::8888",
        "0.0.0.0",
        "192.168.1.1:2869",
        np.nan,
        np.nan,
    ]
    filepaths = [
        "/home/user/docs/Letter.txt",
        "./inthisdir",
        "C:\\user\\docs\\Letter.txt",
        "~/.rcinfo",
        "../../greatgrandparent",
        "data.json",
    ]

    session_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3, 4, 5],
            "customer_id": pd.Categorical([0, 0, 0, 1, 1, 2]),
            "device_type": [0, 1, 1, 0, 0, 1],
            "device_name": ["PC", "Mobile", "Mobile", "PC", "PC", "Mobile"],
            "ip": ips,
            "filepath": filepaths,
        },
    )

    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    if with_integer_time_index:
        times = list(range(8, 18)) + list(range(19, 26))

    values = list(
        [i * 5 for i in range(5)]
        + [i * 1 for i in range(4)]
        + [0]
        + [i * 5 for i in range(2)]
        + [i * 7 for i in range(3)]
        + [np.nan] * 2,
    )

    values_2 = list(
        [i * 2 for i in range(5)]
        + [i * 1 for i in range(4)]
        + [0]
        + [i * 2 for i in range(2)]
        + [i * 3 for i in range(3)]
        + [np.nan] * 2,
    )

    values_many_nans = list(
        [np.nan] * 5
        + [i * 1 for i in range(4)]
        + [0]
        + [np.nan] * 2
        + [i * 3 for i in range(3)]
        + [np.nan] * 2,
    )

    latlong = list([(values[i], values_2[i]) for i, _ in enumerate(values)])
    latlong2 = list([(values_2[i], -values[i]) for i, _ in enumerate(values)])
    zipcodes = list(
        ["02116"] * 5
        + ["02116-3899"] * 4
        + ["0"]
        + ["1234567890"] * 2
        + ["12345-6789"] * 2
        + [np.nan] * 3,
    )
    countrycodes = list(["US"] * 5 + ["AL"] * 4 + [np.nan] * 5 + ["ALB"] * 2 + ["USA"])
    subregioncodes = list(
        ["US-AZ"] * 5 + ["US-MT"] * 4 + [np.nan] * 3 + ["UG-219"] * 2 + ["ZM-06"] * 3,
    )
    log_df = pd.DataFrame(
        {
            "id": range(17),
            "session_id": [0] * 5 + [1] * 4 + [2] * 1 + [3] * 2 + [4] * 3 + [5] * 2,
            "product_id": ["coke zero"] * 3
            + ["car"] * 2
            + ["toothpaste"] * 3
            + ["brown bag"] * 2
            + ["Haribo sugar-free gummy bears"]
            + ["coke zero"] * 4
            + ["taco clock"] * 2,
            "datetime": times,
            "value": values,
            "value_2": values_2,
            "latlong": latlong,
            "latlong2": latlong2,
            "zipcode": zipcodes,
            "countrycode": countrycodes,
            "subregioncode": subregioncodes,
            "value_many_nans": values_many_nans,
            "priority_level": [0] * 2 + [1] * 5 + [0] * 6 + [2] * 2 + [1] * 2,
            "purchased": [True] * 11 + [False] * 4 + [True, False],
            "url": ["https://www.featuretools.com/"] * 2
            + ["amazon.com"] * 2
            + [
                "www.featuretools.com",
                "bit.ly",
                "featuretools.com/demos/",
                "www.google.co.in/" "http://lplay.google.co.in",
                " ",
                "invalid_url",
                "an",
                "microsoft.com/search/",
            ]
            + [np.nan] * 5,
            "email_address": ["john.smith@example.com", np.nan, "team@featuretools.com"]
            * 5
            + [" prefix@space.com", "suffix@space.com "],
            "comments": [coke_zero_review()]
            + ["I loved it"] * 2
            + car_reviews()
            + toothpaste_reviews()
            + brown_bag_reviews()
            + [gummy_review()]
            + ["I loved it"] * 4
            + taco_clock_reviews(),
        },
    )

    return {
        "régions": region_df,
        "stores": store_df,
        "products": product_df,
        "customers": customer_df,
        "sessions": session_df,
        "log": log_df,
    }


def make_semantic_tags():
    store_semantic_tags = {"région_id": "foreign_key"}

    customer_semantic_tags = {"région_id": "foreign_key", "birthday": "date_of_birth"}

    session_semantic_tags = {"customer_id": "foreign_key"}

    log_semantic_tags = {"session_id": "foreign_key"}

    return {
        "customers": customer_semantic_tags,
        "sessions": session_semantic_tags,
        "log": log_semantic_tags,
        "products": {},
        "stores": store_semantic_tags,
        "régions": {},
    }


def make_logical_types(with_integer_time_index=False):
    region_logical_types = {"id": Categorical, "language": Categorical}

    store_logical_types = {
        "id": Integer,
        "région_id": Categorical,
        "num_square_feet": Double,
    }

    product_logical_types = {
        "id": Categorical,
        "rating": Double,
        "department": Categorical,
        "url": URL,
    }

    customer_logical_types = {
        "id": Integer,
        "age": Integer,
        "région_id": Categorical,
        "loves_ice_cream": Boolean,
        "favorite_quote": NaturalLanguage,
        "signup_date": Datetime(datetime_format="%Y-%m-%d"),
        "upgrade_date": Datetime(datetime_format="%Y-%m-%d"),
        "cancel_date": Datetime(datetime_format="%Y-%m-%d"),
        "cancel_reason": Categorical,
        "engagement_level": Ordinal(order=[1, 2, 3]),
        "full_name": PersonFullName,
        "email": EmailAddress,
        "phone_number": PhoneNumber,
        "birthday": Datetime(datetime_format="%Y-%m-%d"),
        "cohort_name": Categorical,
        "cohort": Integer,
    }

    session_logical_types = {
        "id": Integer,
        "customer_id": Integer,
        "device_type": Categorical,
        "device_name": Categorical,
        "ip": IPAddress,
        "filepath": Filepath,
    }

    log_logical_types = {
        "id": Integer,
        "session_id": Integer,
        "product_id": Categorical,
        "datetime": Datetime(datetime_format="%Y-%m-%d"),
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "url": URL,
        "email_address": EmailAddress,
        "comments": NaturalLanguage,
    }

    if with_integer_time_index:
        log_logical_types["datetime"] = Integer
        customer_logical_types["signup_date"] = Integer
        customer_logical_types["upgrade_date"] = Integer
        customer_logical_types["cancel_date"] = Integer
        customer_logical_types["birthday"] = Integer

    return {
        "customers": customer_logical_types,
        "sessions": session_logical_types,
        "log": log_logical_types,
        "products": product_logical_types,
        "stores": store_logical_types,
        "régions": region_logical_types,
    }


def make_time_indexes(with_integer_time_index=False):
    return {
        "customers": {
            "name": "signup_date",
            "secondary": {"cancel_date": ["cancel_reason"]},
        },
        "log": {"name": "datetime", "secondary": None},
    }


def coke_zero_review():
    return """ When it comes to
Coca-Cola products, people tend to be die-hard fans. Many of us know someone who can't go a day without a Diet Coke (or two or three). And while Diet Coke has been a leading sugar-free soft drink since it was first released in 1982, it came to light that young adult males shied away from this beverage — identifying diet cola as a woman's drink. The company's answer to that predicament came in 2005 - in the form of a shiny black can - with the release of Coca-Cola Zero. While Diet Coke was created with its own flavor profile and not as a sugar-free version of the original, Coca-Cola Zero aims to taste just like the "real Coke flavor." Despite their polar opposite advertising campaigns, the contents and nutritional information of the two sugar-free colas is nearly identical. With that information in hand we at HuffPost Taste needed to know: Which of these two artificially-sweetened Coca-Cola beverages actually tastes better? And can you even tell the difference between them? Before we get to the results of our taste test, here are the facts: Diet Coke Motto: Always Great Tast Nutritional Information: Many say that a can of Diet Coke actually contains somewhere between 1-4 calories, but if a serving size contains fewer than 5 calories a company is not obligated to note it in its nutritional information. Diet Coke's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein. Ingredients: Carbonated water, caramel color, aspartame, phosphoric acid, potassium benzonate, natural flavors, citric acid, caffeine. Artificial sweetener: Aspartame Coca-Cola Zero Motto: Real Coca-Cola Taste AND Zero Calories Nutritional Information: While the label clearly advertises this beverage as a zero calorie cola, we are not entirely certain that its minimal calorie content is simply not required to be noted in the nutritional information. Coca-Cola Zero's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein. 
Artificial sweetener: Aspartame and acesulfame potassium Ingredients: Carbonated water, caramel color, phosphoric acid, aspartame, potassium benzonate, natural flavors, potassium citrate, acesulfame potassium, caffeine. The Verdict: Twenty-four editors blind-tasted the two cokes, side by side, and... 54 percent of our tasters were able to distinguish Diet Coke from Coca-Cola Zero 50 percent of our tasters preferred Diet Coke to Coca-Cola Zero, and vice versa Here’s what our tasters thought of the two sugar-free soft drinks: Diet Coke: "Tastes fake right away." "Much fresher brighter, crisper." "Has the wonderful flavors of Diet Coke’s artificial sweeteners." Coca-Cola Zero: "Has more of a sharply sweet aftertaste I associate with diet sodas." "Tastes more like regular coke, less like fake sweetener." "Has an odd taste." "Tastes more like regular." "Very sweet." Overall comments: "That was a lot more difficult than I though it would be." "Both equally palatable." A few people said Diet Coke tasted much better ... unbeknownst to them, they were actually referring to Coca-Cola Zero. IN SUMMARY: It is a real toss up. There is not one artificially-sweetened Coca-Cola beverage that outshines the other. So how do people choose between one or the other? It is either a matter of personal taste, or maybe the marketing campaigns will influence their choice. """ def gummy_review(): return """ The place: BMO Harris Bradley Center The event: Bucks VS Spurs The snack: Satan's Diarrhea Hate Bears made by Haribo I recently took my 4 year old son to his first NBA game. He was very excited to go to the game, and I was excited because we had fantastic seats. Row C center court to be exact. I've never sat that close before. I've never had to go DOWN stairs to get to my seats. 24 stairs to get to my seats to be exact. His favorite candy is Skittles. Mine are anything gummy. 
I snuck in a bag of skittles for my son, and grabbed a handful of gummy bears for myself, to be later known as Satan's Diarrhea Hate Bears, that I received for Christmas in bulk from my parents, and put them in a zip lock bag. After the excitement of the 1st quarter has ended I take my son out to get him a bottled water and myself a beer. We return to our seats to enjoy our candy and drinks. ..............fast forward until 1 minute before half time........... I have begun to sweat a sweat that is only meant for a man on mile 19 of a marathon. I have kicked out my legs out so straight that I am violently pushing the gentleman wearing a suit seat in front of me forward. He is not happy, I do not care. My hands are on the side of my seat not unlike that of a gymnast on a pommel horse, lifting me off my chair. My son is oblivious to what is happening next to him, after all, there is a mascot running around somewhere and he is eating candy. I realize that at some point in the very near to immediate future I am going to have to allow this lava from Satan to forcefully expel itself from my innards. I also realize that I have to walk up 24 stairs just to get to level ground in hopes to make it to the bathroom. I’ll just have to sit here stiff as a board for a few moments waiting for the pain to subside. About 30 seconds later there is a slight calm in the storm of the violent hurricane that is going on in my lower intestine. I muster the courage to gently relax every muscle in my lower half and stand up. My son stands up next to me and we start to ascend up the stairs. I take a very careful and calculated step up the first stair. Then a very loud horn sounds. Halftime. Great. It’s going to be crowded. The horn also seems to have awaken the Satan's Diarrhea Hate Bears that are having a mosh pit in my stomach. 
It literally felt like an avalanche went down my stomach and I again have to tighten every muscle and stand straight up and focus all my energy on my poor sphincter to tighten up and perform like it has never performed before. Taking another step would be the worst idea possible, the flood gates would open. Don’t worry, Daddy has a plan. I some how mumble the question, “want to play a game?” to my son, he of course says “yes”. My idea is to hop on both feet allllll the way up the stairs, using the center railing to propel me up each stair. My son is always up for a good hopping game, so he complies and joins in on the “fun”. Some old lady 4 steps up thinks its cute that we are doing this, obviously she wasn’t looking at the panic on my face. 3 rows behind her a man about the same age as me, who must have had similar situations, notices the fear/panic/desperation on my face understands the danger that I along with my pants and anyone within a 5 yard radius spray zone are in. He just mouths the words “good luck man” to me and I press on. Half way up and there is no leakage, but my legs are getting tired and my sphincter has never endured this amount of pressure for this long of time. 16 steps/hops later…….4 steps to go…….My son trips and falls on the stairs, I have two options: keep going knowing he will catch up or bend down to pick him up relieving my sphincter of all the pressure and commotion while ruining the day of roughly the 50 people that are now watching a grown man hop up stairs while sweating profusely next to a 4 year old boy. Luckily he gets right back up and we make it to the top of the stairs. Good, the hard part was over. Or so I thought. I managed to waddle like a penguin, or someone who is about to poop their pants in 2.5 seconds, to the men's room only to find that every stall is being used. EVERY STALL. It's halftime, of course everyone has to poop at that moment. 
I don't know if I can wait any longer, do I go ahead and fulfil the dream of every high school boy and poop in the urinal? What kind of an example would that set for my son? On the other hand, what kind of an example would it be for his father to fill his pants with a substance that probably will be unrecognizable to man. Suddenly a stall door opens, and I think I manage to actually levitate over to the stall. I my son follows me in, luckily it was the handicap stall so there was room for him to be out of the way. I get my pants off and start to sit. I know what taking a giant poo feels like. I also know what vomiting feels like. I can now successfully say that I know what it is like to vomit out my butt. I wasn't pooping, those Satan's Diarrhea Hate Bears did something to my insides that made my sphincter vomit our the madness. I am now conscious of my surroundings. Other than the war that the bottom half of my body is currently having with this porcelain chair, it is quiet as a pin drop in the bathroom. The other men in there can sense that something isn't right, no one has heard anyone ever poop vomit before. I can sense that the worst part is over. But its not stopping, nor can I physically stop it at this point, I am leaking..it's horrible. I call out "does anyone have a diaper?" hoping that some gentleman was changing a baby. Nothing. No one said a word. I know people are in there, I can see the toes of shoes pointed in my direction under the stall.. "DOES ANYONE HAVE A DIAPER!?!" I am screaming, my son is now crying, he thinks he is witnessing the death of his father. I can't even assure him that I will make it. Not a word was said, but a diaper was thrown over the stall. I catch it, line my underwear with it, put my pants back on, and walk out of that bathroom like a champ. We go straight to our seats, grab out coats and go home. As we are walking out, the gentleman that wished me good luck earlier simply put his fist out, and I happily bumped it. 
My son asks me, "Daddy, why are we leaving early?" "Well son, I need to change my diaper" """


def taco_clock_reviews():
    return [
        """ This timer does what it is supposed to do. Setup is elementary. Replacing the old one (after 12 years) was relatively easy. It has performed flawlessly since. I'm delighted I could find an esoteric product like this at Amazon. Their service, and the customer reviews, are just excellent. """,
        """ Funny, cute clock. A little spendy for how light the clock is, but its hard to find a taco clock. """,
    ]


def brown_bag_reviews():
    return [
        """ These bags looked exactly like I'd hoped, however, the handles broke off of almost every single bag as soon as items were placed in them! I used these as gift bags for out-of-town guests at my wedding, so imagine my embarassment as the handles broke off as I was handing them out. I would not recommend purchaing these bags unless you plan to fill them with nothing but paper! Anything heavier will cause the handles to snap right off. """,
        """ I purchased these in August 2014 from Big Blue Supplies. I have no problem with the seller, these arrived new condition, fine shape. I do have a slight problem with the bags. In case someone might want to know, the handles on these bags are set inside against the top. Then a piece of Kraft type packing tape is placed over the handles to hold them in place. On some of the bags, the tape is already starting to peel off. I would be really hesitant about using these bags unless I reinforced the current tape with a different adhesive. I will keep the bags, and make a tape of a holiday or decorative theme and place over in order to make certain the handles stay in place. Also in case anybody is wondering, the label on the plastic packaging bag states these are from ORIENTAL TRADING COMPANY. On the bottom of each bag it is stamped MADE IN CHINA. Again, I will be placing a sticker over that.
Even the dollar store bags I normally purchase do not have that stamped on the bottom in such prominent lettering. I purchased these because they were plain and I wanted to decorate them. I do not think I would purchase again for all the reasons stated above. Another thing for those still wanting to purchase, the ones I received were: 12 3/4 inches high not including handle, 10 1/4 inches wide and a 5 1/4 inch depth. """,
    ]


def car_reviews():
    return [
        """ The full-size pickup truck and the V-8 engine were supposed to be inseparable, like the internet and cat videos. You can’t have one without the other—or so we thought. In America’s most popular vehicle, the Ford F-150, two turbocharged six-cylinder engines marketed under the EcoBoost name have dethroned the naturally aspirated V-8. Ford’s new 2.7-liter twin-turbo V-6 is the popular choice, while the 3.5-liter twin-turbo V-6 is the top performer. The larger six allows for greater hauling capacity, accelerates the truck more quickly, and swills less gas in EPA testing than the V-8 alternative. It’s enough to make even old-school truck buyers acknowledge that there actually is a replacement for displacement. And yet a V-8 in a big pickup truck still feels so natural, so right. In the F-150, the Coyote 5.0-liter V-8 is tuned for torque more so than power, yet it still revs with an enthusiastic giddy-up that reminds us that this engine’s other job is powering the Mustang. The response follows the throttle pedal faithfully while the six-speed automatic clicks through gears smoothly and easily. Together they pull this 5220-pound F-150 to 60 mph in 6.3 seconds, which is 0.4 second quicker than the 5.3-liter Chevrolet Silverado with the six-speed automatic and 0.9 second quicker than the 5.3 Silverado with the new eight-speed auto. The 3.5-liter EcoBoost, though, can do the deed another half-second quicker, but its synthetic soundtrack doesn’t have the rich, multilayered tone of the V-8.
It wasn’t until we saddled our test truck with a 6400-pound trailer (well under its 9000-pound rating) that we fully understood the case for upgrading to the 3.5-liter EcoBoost. The twin-turbo engine offers an extra 2500 pounds of towing capability and handles lighter tasks with considerably less strain. The 5.0-liter truck needs more revs and a wider throttle opening to accelerate its load, so we were often coaxed into pressing the throttle to the floor for even modest acceleration. The torquier EcoBoost engine offers a heartier response at part throttle. In real-world, non-towing situations, the twin-turbo 3.5-liter doesn’t deliver on its promise of increased fuel economy, with both the 5.0-liter V-8 and that V-6 returning 16 mpg in our hands. But given the 3.5-liter’s virtues, we can forgive it that trespass. Trucks Are the New Luxury Pickups once were working-class transportation. Today, they’re proxy luxury vehicles—or at least that’s how they’re priced. If you think our test truck’s $57,240 window sticker is steep, consider that our model, the Lariat, is merely a mid-spec trim. There are three additional grades—King Ranch, Platinum, and Limited—positioned and priced above it, plus the 3.5-liter EcoBoost that costs an extra $400 as well as a plethora of options to inflate the price past 60 grand. Squint and you can almost see the six-figure trucks of the future on the horizon. For the most part, though, the equipment in this particular Lariat lives up to the price tag. The driver and passenger seats are heated and cooled, with 10-way power adjustability and supple leather. The technology includes blind-spot monitoring, navigation, and a 110-volt AC outlet. Nods to utility include spotlights built into the side mirrors and Ford’s Pro Trailer Backup Assist, which makes reversing with a trailer as easy as turning a tiny knob on the dashboard. 
Middle-Child Syndrome In the F-150, Ford has a trifecta of engines (the fourth, a naturally aspirated 3.5-liter V-6, is best left to the fleet operators). The 2.7-liter twin-turbo V-6 delivers remarkable performance at an affordable price. The 3.5-liter twin-turbo V-6 is the workhorse, with power, torque, and hauling capability to spare. Compared with those two logical options, the middle-child 5.0-liter V-8 is the right-brain choice. Its strongest selling points may be its silky power delivery and the familiar V-8 rumble. That’s a flimsy argument when it comes to rationalizing a $50,000-plus purchase, though, so perhaps it’s no surprise that today’s boosted six-cylinders are now the engines of choice in the F-150. """, """ THE GOOD The Tesla Model S 90D's electric drivetrain is substantially more efficient than any internal combustion engine, and gives the car smooth and quick acceleration. All-wheel drive comes courtesy of a smart dual motor system. The new Autopilot feature eases the stress of stop-and-go traffic and long road trips. THE BAD Even at Tesla's Supercharger stations, recharging the battery takes significantly longer than refilling an internal combustion engine car's gas tank, limiting where you can drive. Tesla hasn't improved its infotainment system much from the Model S' launch. THE BOTTOM LINE Among the different flavors of Tesla Model S, the 90D is the one to get, exhibiting the best range and all-wheel drive, while offering an uncomplicated, next-generation driving experience that shows very well against equally priced competitors. REVIEW SPECIFICATIONS PHOTOS Roadshow Automobiles Tesla 2016 Tesla Model S Having tested driver assistance systems in many cars, and even ridden in fully self-driving cars, I should have been ready for Tesla's new Autopilot feature. But engaging it while cruising the freeway in the Model S 90D, I kept my foot hovering over the brake. 
My trepidation didn't come so much from the adaptive cruise control, which kept the Model S following traffic ahead at a set distance, but from the self-steering, this part of Autopilot managing to keep the Model S well-centered in its lane with no help from me. Over many miles, I built up more trust in the system, letting the car do the steering in situations from bumper-to-bumper traffic and a winding road through the hills. 2016 Tesla Model S 90DEnlarge Image Although the middle of the Model S range, the 90D offers the best range and a wealth of useful tech, such as Autopilot self-driving. Wayne Cunningham/Roadshow Tesla added Autopilot to its Model S line as an option last year, along with all-wheel-drive. More recently, the high-tech automaker improved its batteries, upgrading its cars from their former 65 and 85 kilowatt-hour capacity to 70 and 90 kilowatt-hour. The example I drove, the 90D, represents all these advances. More importantly, the 90D is the current range-leader among the Model S line, boasting 288 miles on a full battery charge. The Model S' improvements fall outside of typical automotive industry product cycles, fulfilling Tesla's promise of acting more like a technology company, constantly building and deploying new features. Tesla accomplishes that goal partially through over-the-air software updates, improving existing cars, but the 90D presents significant hardware updates over the original Model S launched four years ago. Sit and go Of course, this Model S exhibited the ease of use of the original. Walking up to the car with the key fob in my pocket, it automatically unlocked. When I got in the car, it powered up without me having to push a start button, so I only needed to put it in drive to get on the road. Likewise, the design hasn't changed, its sleek, hatchback four-door body offering excellent cargo room, both front and back, and seating space. 
The cabin feels less cramped than most cars due to the lack of a transmission tunnel and a dashboard bare of buttons or dials. 2016 Tesla Model S 90DEnlarge Image The flat floor in the Model S' cabin makes for enhanced passenger room. Wayne Cunningham/Roadshow The big, 17-inch touchscreen in the center of the dashboard shows navigation, stereo, phone, energy consumption and car settings. I easily went from full-screen to a split-screen view, the windows showing each appearing instantly. A built-in 4G/LTE data connection powers Google maps and Internet-based audio. The LCD instrument panel in front of me showed my speed, energy usage, remaining range, and intelligently swapped audio information for turn-by-turn directions when started navigation. The instrument panel actually made the experience of driving under Autopilot more comfortable, reassuring me with graphics that showed when the Model S' sensors were detecting the lane lines and the traffic around me. Impressively, the sensors could differentiate, as shown on the screen's graphics, a passenger car from a big truck. At speed on the freeway, Autopilot smoothly maintained the car's position in its lane, and when I took my hands off the wheel for too long, it flashed a warning on the instrument panel. In stop-and-go traffic approaching a toll booth, the car did an even better job of self-driving, recognizing traffic around it and maintaining appropriate distances. Handling surprise Taking over the driving myself, the ride quality proved as comfortable as any sport-luxury car, as this Model S had its optional air suspension. The electric power steering is well-tuned, turning the wheels with a quiet, natural feel and good heft. Audi S7 vs Tesla Model S Shootout: Audi S7 vs. Tesla Model S Wayne Cunningham/Roadshow The biggest surprise came when I spent the day doing laps at the Thunderhill Raceway, negotiating a series of tight, technical turns in competition with an Audi S7. 
I expected the Model S to get out-of-shape in the turns, but instead it proved steady and solid. The Model S' 4,647-pound curb weight made it less than ideal for a track test, but much of that weight is in the battery pack, mounted low in the chassis. That low center of gravity helped limit body roll, ensuring good grip from all four tires. In the turns, the Model S felt nicely balanced, although not entirely nimble. Helping its grip was its native all-wheel drive, gained from having motors driving each set of wheels. The combined output of the motors comes to 417 horsepower and 485 pound-feet of torque, those numbers expressed in 0-to-60 mph times of well under 5 seconds. That thrust made for fast runs down the race track's straightaways, or simply giving me the ability to take advantage of gaps in traffic on public roads. 288 miles is more than enough for most people's daily driving needs, and if you plug in every night, you will wake up to a fully charged car every morning. The Model S makes for a far different experience than driving an internal combustion car, where you need to go to a gas station to refuel. However, longer trips in the Model S require some planning, such as scheduling stops at Tesla's free Supercharger stations. Charging times are much lengthier than refilling a tank with gasoline. From a Level 2, 240-volt station, you get 29 miles added every hour. Tesla's Supercharger, a Level 3 charger, takes 75 minutes to fully recharge the Model S 90D's battery. 2016 Tesla Model S 90DEnlarge Image Despite its high initial price, the Model S 90D costs less to run on a daily basis than a combustion engine car. Wayne Cunningham/Roadshow Low maintenance The 2016 Tesla Model S 90D adds features to keep it competitive against the internal combustion cars in its sport luxury set. More importantly, it remains very easy to live with. In fact, the electric drivetrain should mean greatly decreased maintenance, as there are fewer moving parts. 
The EPA estimates that annual electricity costs for the Model S 90D should run $650, much less than buying gasoline for an equivalent internal combustion car. Lengthy charging times mean longer trips are either out of the question or require more planning than with an internal combustion car. And while the infotainment system responds quickly to touch inputs and offers useful screens, it hasn't changed much in four years. Most notably, Tesla hasn't added any music apps beyond the ones it launched with. Along with new, useful apps, it would be nice to have some themes or other aesthetic changes to the infotainment interface. The Model S 90D's base price of $88,000 puts it out of reach of the average buyer, and the model I drove was optioned up to around $95,000. Against its Audi, BMW and Mercedes-Benz competition, however, it makes a compelling argument, especially for its uncomplicated nature. """, ] def toothpaste_reviews(): return [ """ Toothpaste can do more harm than good The next time a patient innocently asks me, “What’s the best toothpaste to use?” I’m going to unleash a whole Chunky Soup can of “You Want The Truth? You CAN’T HANDLE THE TRUTH!!!” Gosh, that’s such an overused movie quote. Sorry about that, but still. If you’re a dental professional, isn’t this the most annoying question you get, day after day? Do you even care which toothpaste your patients use? No. You don’t. Asking a dentist what toothpaste to use is like asking your physician which bar of soap or body scrub you should use to clean your skin. Your dentist and dental hygienist have never seen a tube of toothpaste that singlehandedly improves the health of all patients in their practice, and the reason is simple: Toothpaste is a cosmetic. We brush our teeth so that out mouths no longer taste like… mouth. Mouth tastes gross, right? It tastes like putrefied skin. It tastes like tongue cheese. It tastes like Cream of Barf. 
On the other hand, toothpaste has been exquisitely designed to bring you a brisk rush of York Peppermint Patty, or Triple Cinnamon Heaven, or whatever flavor that drives those tubes off of the shelves in the confusing dental aisle of your local supermarket or drugstore. Toothpaste definitely tastes better than Cream of Barf. And that’s why you use it. Not because it’s good for you. You use toothpaste because it tastes good, and because it makes you accept your mouth as part of your face again. From a marketing perspective, all of the other things that are in your toothpaste are in there to give it additional perceived value. So let’s deconstruct these ingredients, shall we? 1. Fluoride. This was probably the first additive to toothpaste that brought it under the jurisdiction of the Food & Drug Administration and made toothpaste part drug, part cosmetic. Over time, a fluoride toothpaste can improve the strength of teeth, but the fluoride itself does nothing to make teeth cleaner. Some people are scared of fluoride so they don’t use it. Their choice. Professionally speaking, I know that the benefits of a fluoride additive far outweigh the risks. 2. Foam. Sodium Lauryl Sulfate is soap. Soap has a creamy, thick texture that American tongues especially like and equate to the feeling of cleanliness. There’s not enough surfactant, though, in toothpaste foam to break up the goo that grows on your teeth. If these bubbles scrubbed, you’d better believe that they would also scrub your delicate gum tissues into a bloody pulp. 3. Abrasive particles. Most toothpastes use hydrated silica as the grit that polishes teeth. You’re probably most familiar with it as the clear beady stuff in the “Do Not Eat” packets. Depending on the size and shape of the particles, silica is the whitening ingredient in most whitening toothpastes. But whitening toothpaste cannot get your teeth any whiter than a professional dental cleaning, because it only cleans the surface. 
Two weeks to a whiter smile? How about 30 minutes with your hygienist? It’s much more efficient and less harsh. 4. Desensitizers. Teeth that are sensitive to hot, cold, sweets, or a combination can benefit from the addition of potassium nitrate or stannous fluoride to a toothpaste. This is more of a palliative treatment, when the pain is the problem. Good old Time will usually make teeth feel better, too, unless the pain is coming from a cavity. Yeah, I’m talking to you, the person who is trying to heal the hole in their tooth with Sensodyne. 5. Tartar control. It burns! It burns! If your toothpaste has a particular biting flavor, it might contain tetrasodium pyrophosphate, an ingredient that is supposed to keep calcium phosphate salts (tartar, or calculus) from fossilizing on the back of your lower front teeth. A little tartar on your teeth doesn’t harm you unless it gets really thick and you can no longer keep it clean. One problem with tartar control toothpastes is that in order for the active ingredient to work, it has to be dissolved in a stronger detergent than usual, which can affect people that are sensitive to a high pH. 6. Triclosan. This antimicrobial is supposed to reduce infections between the gum and tooth. However, if you just keep the germs off of your teeth in the first place it’s pretty much a waste of an extra ingredient. Its safety has been questioned but, like fluoride, the bulk of the scientific research easily demonstrates that the addition of triclosan in toothpaste does much more good than harm. Why toothpaste can be bad for you. Let’s just say it’s not the toothpaste’s fault. It’s yours. The toothpaste is just the co-dependent enabler. You’re the one with the problem. Remember, toothpaste is a cosmetic, first and foremost. It doesn’t clean your teeth by itself. Just in case you think I’m making this up I’ve included clinical studies in the references at the end of this article that show how ineffective toothpaste really is. 
peasized • You’re using too much. Don’t be so suggestible! Toothpaste ads show you how to use up the tube more quickly. Just use 1/3 as much, the size of a pea. It will still taste good, I promise! And too much foam can make you lose track of where your teeth actually are located. • You’re not taking enough time. At least two minutes. Any less and you’re missing spots. Just ’cause it tastes better doesn’t mean you did a good job. • You’re not paying attention. I’ve seen people brush the same four spots for two minutes and miss the other 60% of their mouth.brushguide The toothbrush needs to touch every crevice of every tooth, not just where it lands when you go into autopilot and start thinking about what you’re going to wear that day. It’s the toothbrush friction that cleans your teeth, not the cleaning product. Plaque is a growth, like the pink or grey mildew that grows around the edges of your shower. You’ve gotta rub it off to get it off. No tooth cleaning liquid, paste, creme, gel, or powder is going to make as much of a difference as your attention to detail will. The solution. Use what you like. It’s that simple. If it tastes good and feels clean to you, you’ll use it more often, brush longer, feel better, be healthier. You can use baking soda, or coconut oil, or your favorite toothpaste, or even just plain water. The key is to have a good technique and to brush often. A music video makes this demonstration a little more fun than your usual lecture at the dental office, although, in my opinion you really still need to feel what it is like to MASH THE BRISTLES OF A SOFT TOOTHBRUSH INTO YOUR GUMS: A little more serious video from my pal Dr. Mark Burhenne where he demonstrates how to be careful with your toothbrush bristles: Final word. ♬ It’s all about that Bass, ’bout that Bass, no bubbles. ♬ Heh, dentistry in-joke there. 
Seriously, though, the bottom line is that your paste will mask brushing technique issues, so don’t put so much faith in the power of toothpaste. Also you may have heard that some toothpastes contain decorative plastic that can get swallowed. Yeah, that was a DentalBuzz report I wrote that went viral earlier this year. And while I can’t claim total victory on that front, at least the company in question has promised that the plastic will no longer be added to their toothpaste lines very soon due to the overwhelming amount of letters, emails, and phone calls that they received as a result of people reading that article and making a difference. But now I’m tired of talking about toothpaste. Next topic? I’m bringing pyorrhea back. """, """ I’ve been a user of Colgate Total Whitening Toothpaste for many years because I’ve always tried to maintain a healthy smile (I’m a receptionist so I need a white smile). But because I drink coffee at least twice a day (sometimes more!) and a lot of herbal teas, I’ve found that using just this toothpaste alone doesn’t really get my teeth white... The best way to get white teeth is to really try some professional products specifically for tooth whitening. I’ve tried a few products, like Crest White Strips and found that the strips are really not as good as the trays. Although the Crest White Strips are easy to use, they really DO NOT cover your teeth perfectly like some other professional dental whitening kits. This Product did cover my teeth well however because of their custom heat trays, and whitening my teeth A LOT. I would say if you really want white teeth, use the Colgate Toothpaste and least 2 times a day, along side a professional Gel product like Shine Whitening. """, """ The first feature is the price, and it is right. Next, I consider whether it will be neat to use. It is. Sometimes when I buy those new hard plastic containers, they actually get messy. Also I cannot get all the toothpaste out. 
It is easy to get the paste out of Colgate Total Whitening Paste without spraying it all over the cabinet. If it does not taste good, I won't use it. Some toothpaste burns my mouth so bad that brushing my teeth is a painful experience. This one doesn't burn. It tastes simply the way toothpaste is supposed to taste. Whitening is important. This one is supposed ot whiten. After spending money to whiten my teeth, I need a product to help ward off the bad effects of coffee and tea. Avoiding all kinds of oral pathology is a major consideration. This toothpaste claims that it can help fight cavities, gingivitis, plaque, tartar, and bad breath. I hope this product stays on the market a long time and does not change. """, ] ================================================ FILE: featuretools/tests/utils_tests/__init__.py ================================================ ================================================ FILE: featuretools/tests/utils_tests/test_config.py ================================================ import logging import os from featuretools.config_init import initialize_logging logging_env_vars = { "FEATURETOOLS_LOG_LEVEL": "debug", "FEATURETOOLS_ES_LOG_LEVEL": "critical", "FEATURETOOLS_BACKEND_LOG_LEVEL": "error", } def test_logging_defaults(): old_env_vars = {} for env_var in logging_env_vars: old_env_vars[env_var] = os.environ.get(env_var, None) if old_env_vars[env_var] is not None: del os.environ[env_var] initialize_logging() main_logger = logging.getLogger("featuretools") assert main_logger.getEffectiveLevel() == logging.INFO es_logger = logging.getLogger("featuretools.entityset") assert es_logger.getEffectiveLevel() == logging.INFO backend_logger = logging.getLogger("featuretools.computation_backend") assert backend_logger.getEffectiveLevel() == logging.INFO for env_var, value in old_env_vars.items(): if value is not None: os.environ[env_var] = value def test_logging_set_via_env(): old_env_vars = {} for env_var, value in logging_env_vars.items(): 
old_env_vars[env_var] = os.environ.get(env_var, None) os.environ[env_var] = value initialize_logging() main_logger = logging.getLogger("featuretools") assert main_logger.getEffectiveLevel() == logging.DEBUG es_logger = logging.getLogger("featuretools.entityset") assert es_logger.getEffectiveLevel() == logging.CRITICAL backend_logger = logging.getLogger("featuretools.computation_backend") assert backend_logger.getEffectiveLevel() == logging.ERROR for env_var, value in old_env_vars.items(): if value is not None: os.environ[env_var] = value ================================================ FILE: featuretools/tests/utils_tests/test_description_utils.py ================================================ from featuretools.utils.description_utils import convert_to_nth def test_first(): assert convert_to_nth(1) == "1st" assert convert_to_nth(21) == "21st" assert convert_to_nth(131) == "131st" def test_second(): assert convert_to_nth(2) == "2nd" assert convert_to_nth(22) == "22nd" assert convert_to_nth(232) == "232nd" def test_third(): assert convert_to_nth(3) == "3rd" assert convert_to_nth(23) == "23rd" assert convert_to_nth(133) == "133rd" def test_nth(): assert convert_to_nth(4) == "4th" assert convert_to_nth(11) == "11th" assert convert_to_nth(12) == "12th" assert convert_to_nth(13) == "13th" assert convert_to_nth(111) == "111th" assert convert_to_nth(112) == "112th" assert convert_to_nth(113) == "113th" ================================================ FILE: featuretools/tests/utils_tests/test_entry_point.py ================================================ import pandas as pd import pytest from featuretools import dfs @pytest.fixture def entry_points_dfs(): cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]}) transactions_df = pd.DataFrame( { "id": [1, 2, 3, 4, 5, 6], "card_id": [1, 2, 1, 3, 4, 5], "transaction_time": [10, 12, 13, 20, 21, 20], "fraud": [True, False, True, False, True, True], }, ) return cards_df, transactions_df class MockEntryPoint(object): def on_call(self, 
kwargs): self.kwargs = kwargs def on_error(self, error, runtime): self.error = error def on_return(self, return_value, runtime): self.return_value = return_value def load(self): return self def __call__(self): return self class MockPkgResources(object): def __init__(self, entry_point): self.entry_point = entry_point def iter_entry_points(self, name): return [self.entry_point] def test_entry_point(es, monkeypatch): entry_point = MockEntryPoint() # overrides a module used in the entry_point decorator for dfs # so the decorator will use this mock entry point monkeypatch.setitem( dfs.__globals__["entry_point"].__globals__, "pkg_resources", MockPkgResources(entry_point), ) fm, fl = dfs(entityset=es, target_dataframe_name="customers") assert "entityset" in entry_point.kwargs.keys() assert "target_dataframe_name" in entry_point.kwargs.keys() assert (fm, fl) == entry_point.return_value def test_entry_point_error(es, monkeypatch): entry_point = MockEntryPoint() monkeypatch.setitem( dfs.__globals__["entry_point"].__globals__, "pkg_resources", MockPkgResources(entry_point), ) with pytest.raises(KeyError): dfs(entityset=es, target_dataframe_name="missing_dataframe") assert isinstance(entry_point.error, KeyError) def test_entry_point_detect_arg(monkeypatch, entry_points_dfs): cards_df, transactions_df = entry_points_dfs dataframes = { "cards": (cards_df, "id"), "transactions": (transactions_df, "id", "transaction_time"), } relationships = [("cards", "id", "transactions", "card_id")] entry_point = MockEntryPoint() monkeypatch.setitem( dfs.__globals__["entry_point"].__globals__, "pkg_resources", MockPkgResources(entry_point), ) fm, fl = dfs(dataframes, relationships, target_dataframe_name="cards") assert "dataframes" in
entry_point.kwargs.keys() assert "relationships" in entry_point.kwargs.keys() assert "target_dataframe_name" in entry_point.kwargs.keys() ================================================ FILE: featuretools/tests/utils_tests/test_gen_utils.py ================================================ import pandas as pd import pytest from woodwork import list_logical_types as ww_list_logical_types from woodwork import list_semantic_tags as ww_list_semantic_tags from featuretools import list_logical_types, list_semantic_tags from featuretools.utils.gen_utils import ( camel_and_title_to_snake, import_or_none, import_or_raise, ) def test_import_or_raise_errors(): with pytest.raises(ImportError, match="error message"): import_or_raise("_featuretools", "error message") def test_import_or_raise_imports(): math = import_or_raise("math", "error message") assert math.ceil(0.1) == 1 def test_import_or_none(): math = import_or_none("math") assert math.ceil(0.1) == 1 bad_lib = import_or_none("_featuretools") assert bad_lib is None @pytest.fixture def df(): return pd.DataFrame({"id": range(5)}) def test_list_logical_types(): ft_ltypes = list_logical_types() ww_ltypes = ww_list_logical_types() assert ft_ltypes.equals(ww_ltypes) def test_list_semantic_tags(): ft_semantic_tags = list_semantic_tags() ww_semantic_tags = ww_list_semantic_tags() assert ft_semantic_tags.equals(ww_semantic_tags) def test_camel_and_title_to_snake(): assert camel_and_title_to_snake("Top3Words") == "top_3_words" assert camel_and_title_to_snake("top3Words") == "top_3_words" assert camel_and_title_to_snake("Top100Words") == "top_100_words" assert camel_and_title_to_snake("top100Words") == "top_100_words" assert camel_and_title_to_snake("Top41") == "top_41" assert camel_and_title_to_snake("top41") == "top_41" assert camel_and_title_to_snake("41TopWords") == "41_top_words" assert camel_and_title_to_snake("TopThreeWords") == "top_three_words" assert camel_and_title_to_snake("topThreeWords") == "top_three_words" assert 
camel_and_title_to_snake("top_three_words") == "top_three_words" assert camel_and_title_to_snake("over_65") == "over_65" assert camel_and_title_to_snake("65_and_over") == "65_and_over" assert camel_and_title_to_snake("USDValue") == "usd_value" ================================================ FILE: featuretools/tests/utils_tests/test_recommend_primitives.py ================================================ import logging import pandas as pd import pytest from woodwork.logical_types import NaturalLanguage from woodwork.table_schema import ColumnSchema from featuretools import EntitySet from featuretools.primitives import Day, TransformPrimitive from featuretools.utils.recommend_primitives import ( DEFAULT_EXCLUDED_PRIMITIVES, TIME_SERIES_PRIMITIVES, _recommend_non_numeric_primitives, _recommend_skew_numeric_primitives, get_recommended_primitives, ) @pytest.fixture def moderate_right_skewed_df(): return pd.DataFrame( {"moderately right skewed": [2, 3, 4, 4, 4, 5, 5, 7, 9, 11, 12, 13, 15]}, ) @pytest.fixture def heavy_right_skewed_df(): return pd.DataFrame( {"heavy right skewed": [1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 9, 11, 13]}, ) @pytest.fixture def left_skewed_df(): return pd.DataFrame( {"left skewed": [2, 3, 4, 5, 7, 9, 11, 11, 11, 12, 12, 12, 13, 15]}, ) @pytest.fixture def skewed_df_zeros(): return pd.DataFrame({"zeros": [-1, 0, 0, 1, 2, 2, 3, 4, 5, 7, 9]}) @pytest.fixture def normal_df(): return pd.DataFrame({"normal": [2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]}) @pytest.fixture def right_skew_moderate_and_heavy_df(moderate_right_skewed_df, heavy_right_skewed_df): return pd.concat([moderate_right_skewed_df, heavy_right_skewed_df], axis=1) @pytest.fixture def es_with_skewed_dfs( moderate_right_skewed_df, heavy_right_skewed_df, left_skewed_df, skewed_df_zeros, normal_df, right_skew_moderate_and_heavy_df, ): es = EntitySet() es.add_dataframe(moderate_right_skewed_df, "moderate_right_skewed_df", "id") es.add_dataframe(heavy_right_skewed_df, "heavy_right_skewed_df", "id") 
es.add_dataframe(left_skewed_df, "left_skewed_df", "id") es.add_dataframe(skewed_df_zeros, "skewed_df_zeros", "id") es.add_dataframe(normal_df, "normal_df", "id") es.add_dataframe( right_skew_moderate_and_heavy_df, "right_skew_moderate_and_heavy_df", "id", ) return es def test_recommend_skew_numeric_primitives(es_with_skewed_dfs): valid_skew_primitives = set(["square_root", "natural_logarithm"]) valid_prims = [ "cosine", "square_root", "natural_logarithm", "sine", ] assert _recommend_skew_numeric_primitives( es_with_skewed_dfs, "moderate_right_skewed_df", valid_prims, ) == set(["square_root"]) assert _recommend_skew_numeric_primitives( es_with_skewed_dfs, "heavy_right_skewed_df", valid_skew_primitives, ) == set(["natural_logarithm"]) assert ( _recommend_skew_numeric_primitives( es_with_skewed_dfs, "left_skewed_df", valid_skew_primitives, ) == set() ) assert ( _recommend_skew_numeric_primitives( es_with_skewed_dfs, "skewed_df_zeros", valid_skew_primitives, ) == set() ) assert ( _recommend_skew_numeric_primitives( es_with_skewed_dfs, "normal_df", valid_skew_primitives, ) == set() ) assert ( _recommend_skew_numeric_primitives( es_with_skewed_dfs, "right_skew_moderate_and_heavy_df", valid_skew_primitives, ) == valid_skew_primitives ) def test_recommend_non_numeric_primitives(make_es): ecom_es_customers = EntitySet() ecom_es_customers.add_dataframe(make_es["customers"]) valid_primitives = [ "day", "num_characters", "natural_logarithm", "sine", ] actual_recommendations = _recommend_non_numeric_primitives( ecom_es_customers, "customers", valid_primitives, ) expected_recommendations = set( [ "day", "num_characters", ], ) assert expected_recommendations == actual_recommendations def test_recommend_non_numeric_primitives_exception(make_es, caplog): class MockExceptionPrimitive(TransformPrimitive): """Mock primitive whose function always raises an exception.""" name = "mock_primitive_with_exception" input_types = [ColumnSchema(logical_type=NaturalLanguage)] return_type =
ColumnSchema(semantic_tags={"numeric"}) def get_function(self): def make_exception(column): raise Exception("this primitive has an exception") return make_exception ecom_es_customers = EntitySet() ecom_es_customers.add_dataframe(make_es["customers"]) valid_primitives = [MockExceptionPrimitive(), Day()] logger = logging.getLogger("featuretools") logger.propagate = True actual_recommendations = _recommend_non_numeric_primitives( ecom_es_customers, "customers", valid_primitives, ) logger.propagate = False expected_recommendations = set(["day"]) assert expected_recommendations == actual_recommendations assert ( "Exception with feature MOCK_PRIMITIVE_WITH_EXCEPTION(favorite_quote) with primitive mock_primitive_with_exception: this primitive has an exception" in caplog.text ) def test_get_recommended_primitives_time_series(make_es): ecom_es_log = EntitySet() ecom_es_log.add_dataframe(make_es["log"]) ecom_es_log["log"].ww.set_time_index("datetime") actual_recommendations_ts = get_recommended_primitives( ecom_es_log, True, ) for ts_prim in TIME_SERIES_PRIMITIVES: assert ts_prim in actual_recommendations_ts def test_get_recommended_primitives(make_es): ecom_es_customers = EntitySet() ecom_es_customers.add_dataframe(make_es["customers"]) actual_recommendations = get_recommended_primitives( ecom_es_customers, False, ) expected_recommendations = [ "day", "num_characters", "natural_logarithm", "punctuation_count", "mean_characters_per_word", "is_weekend", "whitespace_count", "median_word_length", "month", "total_word_length", "weekday", "day_of_year", "week", "quarter", "email_address_to_domain", "number_of_common_words", "num_words", "num_unique_separators", "age", "year", "is_leap_year", "days_in_month", "is_free_email_domain", "number_of_unique_words", ] for prim in expected_recommendations: assert prim in actual_recommendations for ts_prim in TIME_SERIES_PRIMITIVES: assert ts_prim not in actual_recommendations def test_get_recommended_primitives_exclude(make_es): 
ecom_es_customers = EntitySet() ecom_es_customers.add_dataframe(make_es["customers"]) extra_exclude = ["num_characters", "natural_logarithm"] prims_to_exclude = DEFAULT_EXCLUDED_PRIMITIVES + extra_exclude actual_recommendations = get_recommended_primitives( ecom_es_customers, False, prims_to_exclude, ) for ex_prim in extra_exclude: assert ex_prim not in actual_recommendations def test_get_recommended_primitives_empty_es_error(): error_msg = "No DataFrame in EntitySet found. Please add a DataFrame." empty_es = EntitySet() with pytest.raises(IndexError, match=error_msg): get_recommended_primitives( empty_es, False, ) def test_get_recommended_primitives_multi_table_es_error(make_es): error_msg = "Multi-table EntitySets are currently not supported. Please only use a single table EntitySet." with pytest.raises(IndexError, match=error_msg): get_recommended_primitives( make_es, False, ) ================================================ FILE: featuretools/tests/utils_tests/test_time_utils.py ================================================ from datetime import datetime, timedelta from itertools import chain import numpy as np import pandas as pd import pytest from featuretools.utils import convert_time_units, make_temporal_cutoffs from featuretools.utils.time_utils import ( calculate_trend, convert_datetime_to_floats, convert_timedelta_to_floats, ) def test_make_temporal_cutoffs(): instance_ids = pd.Series(range(10)) cutoffs = pd.date_range(start="1/2/2015", periods=10, freq="1d") temporal_cutoffs_by_nwindows = make_temporal_cutoffs( instance_ids, cutoffs, window_size="1h", num_windows=2, ) assert temporal_cutoffs_by_nwindows.shape[0] == 20 actual_instances = list(chain.from_iterable([[i, i] for i in range(10)])) actual_times = [ "1/1/2015 23:00:00", "1/2/2015 00:00:00", "1/2/2015 23:00:00", "1/3/2015 00:00:00", "1/3/2015 23:00:00", "1/4/2015 00:00:00", "1/4/2015 23:00:00", "1/5/2015 00:00:00", "1/5/2015 23:00:00", "1/6/2015 00:00:00", "1/6/2015 23:00:00", "1/7/2015 00:00:00",
"1/7/2015 23:00:00", "1/8/2015 00:00:00", "1/8/2015 23:00:00", "1/9/2015 00:00:00", "1/9/2015 23:00:00", "1/10/2015 00:00:00", "1/10/2015 23:00:00", "1/11/2015 00:00:00", "1/11/2015 23:00:00", ] actual_times = [pd.Timestamp(c) for c in actual_times] for computed, actual in zip( temporal_cutoffs_by_nwindows["instance_id"], actual_instances, ): assert computed == actual for computed, actual in zip(temporal_cutoffs_by_nwindows["time"], actual_times): assert computed == actual cutoffs = [pd.Timestamp("1/2/2015")] * 9 + [pd.Timestamp("1/3/2015")] starts = [pd.Timestamp("1/1/2015")] * 9 + [pd.Timestamp("1/2/2015")] actual_times = ["1/1/2015 00:00:00", "1/2/2015 00:00:00"] * 9 actual_times += ["1/2/2015 00:00:00", "1/3/2015 00:00:00"] actual_times = [pd.Timestamp(c) for c in actual_times] temporal_cutoffs_by_wsz_start = make_temporal_cutoffs( instance_ids, cutoffs, window_size="1d", start=starts, ) for computed, actual in zip( temporal_cutoffs_by_wsz_start["instance_id"], actual_instances, ): assert computed == actual for computed, actual in zip(temporal_cutoffs_by_wsz_start["time"], actual_times): assert computed == actual cutoffs = [pd.Timestamp("1/2/2015")] * 9 + [pd.Timestamp("1/3/2015")] starts = [pd.Timestamp("1/1/2015")] * 10 actual_times = ["1/1/2015 00:00:00", "1/2/2015 00:00:00"] * 9 actual_times += ["1/1/2015 00:00:00", "1/3/2015 00:00:00"] actual_times = [pd.Timestamp(c) for c in actual_times] temporal_cutoffs_by_nw_start = make_temporal_cutoffs( instance_ids, cutoffs, num_windows=2, start=starts, ) for computed, actual in zip( temporal_cutoffs_by_nw_start["instance_id"], actual_instances, ): assert computed == actual for computed, actual in zip(temporal_cutoffs_by_nw_start["time"], actual_times): assert computed == actual def test_convert_time_units(): units = { "years": 31540000, "months": 2628000, "days": 86400, "hours": 3600, "minutes": 60, "seconds": 1, "milliseconds": 0.001, "nanoseconds": 0.000000001, } for each in units: assert 
convert_time_units(units[each] * 2, each) == 2 assert np.isclose(convert_time_units(float(units[each] * 2), each), 2) error_text = "Invalid unit given, make sure it is plural" with pytest.raises(ValueError, match=error_text): convert_time_units(10, "jnkwjgn") @pytest.mark.parametrize( "dt, expected_floats", [ ( pd.Series( [ datetime(2010, 1, 1, 11, 45, 0), datetime(2010, 1, 1, 12, 55, 15), datetime(2010, 1, 1, 11, 57, 30), datetime(2010, 1, 1, 11, 12), datetime(2010, 1, 1, 11, 12, 15), ], ), pd.Series([21039105.0, 21039175.25, 21039117.5, 21039072.0, 21039072.25]), ), ( pd.Series( list(pd.date_range(start="2017-01-01", freq="1d", periods=3)) + list(pd.date_range(start="2017-01-10", freq="2d", periods=4)) + list(pd.date_range(start="2017-01-22", freq="1d", periods=7)), ), pd.Series( [ 17167.0, 17168.0, 17169.0, 17176.0, 17178.0, 17180.0, 17182.0, 17188.0, 17189.0, 17190.0, 17191.0, 17192.0, 17193.0, 17194.0, ], ), ), ], ) def test_convert_datetime_floats(dt, expected_floats): actual_floats = convert_datetime_to_floats(dt) pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats) @pytest.mark.parametrize( "td, expected_floats", [ ( pd.Series( [ pd.Timedelta(2, "day"), pd.Timedelta(120000000), pd.Timedelta(48, "sec"), pd.Timedelta(30, "min"), pd.Timedelta(12, "hour"), ], ), pd.Series( [ 2.0, 1.388888888888889e-06, 0.0005555555555555556, 0.020833333333333332, 0.5, ], ), ), ( pd.Series( [ timedelta(days=4), timedelta(milliseconds=4000000), timedelta(hours=2, seconds=49), ], ), pd.Series([4.0, 0.0462962962962963, 0.08390046296296297]), ), ], ) def test_convert_timedelta_to_floats(td, expected_floats): actual_floats = convert_timedelta_to_floats(td) pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats) @pytest.mark.parametrize( "series,expected_trends", [ ( # using datetimes pd.Series( data=[0, 5, 10], index=[ datetime(2019, 1, 1), datetime(2019, 1, 2), datetime(2019, 1, 3), ], ), 5.0, ), ( # using pd.Timestamp pd.Series( data=[0, -5,
3], index=pd.date_range(start="2019-01-01", freq="1D", periods=3), ), 1.4999999999999998, ), ( pd.Series( data=[1, 2, 4, 8, 16], index=pd.date_range(start="2019-01-01", freq="1D", periods=5), ), 3.6000000000000005, ), ( # using pd.Timedelta with no change in time pd.Series( data=[1, 2, 3], index=[ pd.Timedelta(120000000), pd.Timedelta(120000000), pd.Timedelta(120000000), ], ), 0, ), ], ) def test_calculate_trend(series, expected_trends): actual_trends = calculate_trend(series) assert np.isclose(actual_trends, expected_trends) ================================================ FILE: featuretools/tests/utils_tests/test_trie.py ================================================ from featuretools.utils import Trie def test_get_node(): t = Trie(default=lambda: "default") t.get_node([1, 2, 3]).value = "123" t.get_node([1, 2, 4]).value = "124" sub = t.get_node([1, 2]) assert sub.get_node([3]).value == "123" assert sub.get_node([4]).value == "124" sub.get_node([4, 5]).value = "1245" assert t.get_node([1, 2, 4, 5]).value == "1245" def test_setting_and_getting(): t = Trie(default=lambda: "default") assert t.get_node([1, 2, 3]).value == "default" t.get_node([1, 2, 3]).value = "123" t.get_node([1, 2, 4]).value = "124" assert t.get_node([1, 2, 3]).value == "123" assert t.get_node([1, 2, 4]).value == "124" assert t.get_node([1]).value == "default" t.get_node([1]).value = "1" assert t.get_node([1]).value == "1" t.get_node([1, 2, 3]).value = "updated" assert t.get_node([1, 2, 3]).value == "updated" def test_iteration(): t = Trie(default=lambda: "default", path_constructor=tuple) t.get_node((1, 2, 3)).value = "123" t.get_node((1, 2, 4)).value = "124" expected = [ ((), "default"), ((1,), "default"), ((1, 2), "default"), ((1, 2, 3), "123"), ((1, 2, 4), "124"), ] for i, value in enumerate(t): assert value == expected[i] ================================================ FILE: featuretools/tests/utils_tests/test_utils_info.py ================================================ import os import 
pytest from featuretools import __version__ from featuretools.utils import ( get_featuretools_root, get_installed_packages, get_sys_info, show_info, ) @pytest.fixture def this_dir(): return os.path.dirname(os.path.abspath(__file__)) def test_show_info(capsys): show_info() captured = capsys.readouterr() assert "Featuretools version" in captured.out assert "Featuretools installation directory:" in captured.out assert __version__ in captured.out assert "SYSTEM INFO" in captured.out def test_sys_info(): sys_info = get_sys_info() info_keys = [ "python", "python-bits", "OS", "OS-release", "machine", "processor", "byteorder", "LC_ALL", "LANG", "LOCALE", ] found_keys = [k for k, _ in sys_info] assert set(info_keys).issubset(found_keys) def test_installed_packages(): installed_packages = get_installed_packages() # Per PEP 426, package names are case insensitive # Underscore and hyphen are equivalent installed_set = { name.lower().replace("-", "_") for name in installed_packages.keys() } requirements = [ "pandas", "numpy", "tqdm", "cloudpickle", "psutil", ] assert set(requirements).issubset(installed_set) def test_get_featuretools_root(this_dir): root = os.path.abspath(os.path.join(this_dir, "..", "..")) assert get_featuretools_root() == root ================================================ FILE: featuretools/utils/__init__.py ================================================ # flake8: noqa from featuretools.utils.api import * ================================================ FILE: featuretools/utils/api.py ================================================ # flake8: noqa from featuretools.utils.entry_point import entry_point from featuretools.utils.gen_utils import make_tqdm_iterator from featuretools.utils.time_utils import ( calculate_trend, convert_time_units, make_temporal_cutoffs, ) from featuretools.utils.trie import Trie from featuretools.utils.utils_info import ( get_featuretools_root, get_installed_packages, get_sys_info, show_info, ) 
================================================ FILE: featuretools/utils/common_tld_utils.py ================================================ # put longer TLDs first to avoid catching a small part of a longer TLD and escape periods COMMON_TLDS = [ "management", "technology", "solutions", "delivery", "services", "software", "digital", "finance", "monster", "network", "support", "systems", "website", "agency", "design", "events", "global", "health", "online", "stream", "studio", "travel", "apple", "click", "cloud", "email", "games", "group", "media", "ninja", "press", "rocks", "space", "store", "today", "tools", "video", "works", "world", "aero", "arpa", "asia", "bank", "best", "blog", "buzz", "care", "casa", "chat", "club", "coop", "cyou", "desi", "farm", "goog", "guru", "host", "info", "jobs", "life", "link", "live", "mobi", "name", "news", "page", "plus", "shop", "site", "team", "tech", "work", "zone", "app", "aws", "bid", "biz", "box", "cam", "cat", "com", "dev", "edu", "eus", "fun", "gov", "icu", "int", "ltd", "mil", "net", "nyc", "one", "onl", "org", "ovh", "pro", "pub", "run", "sap", "top", "vip", "win", "xxx", "xyz", "ac", "ad", "ae", "ag", "ai", "al", "am", "ar", "at", "au", "az", "ba", "bd", "be", "bg", "br", "by", "bz", "ca", "cc", "cf", "ch", "cl", "cm", "cn", "co", "cr", "cu", "cx", "cy", "cz", "de", "dk", "do", "ec", "ee", "eg", "es", "eu", "fi", "fm", "fr", "ga", "ge", "gg", "gl", "gq", "gr", "gs", "gt", "hk", "hn", "hr", "hu", "id", "ie", "il", "im", "in", "io", "ir", "is", "it", "jo", "jp", "ke", "kh", "ki", "kr", "kw", "kz", "la", "lb", "li", "lk", "lt", "lu", "lv", "ly", "ma", "md", "me", "mk", "ml", "mm", "mn", "ms", "mu", "mx", "my", "nf", "ng", "nl", "no", "np", "nu", "nz", "om", "pa", "pe", "ph", "pk", "pl", "pr", "ps", "pt", "pw", "py", "qa", "re", "ro", "rs", "ru", "sa", "sc", "se", "sg", "sh", "si", "sk", "so", "st", "su", "sv", "sx", "th", "tj", "tk", "tn", "to", "tr", "tt", "tv", "tw", "ua", "ug", "uk", "us", "uy", "vc", "ve", "vn", "ws", 
"za", ] ================================================ FILE: featuretools/utils/description_utils.py ================================================ def convert_to_nth(integer): string_nth = str(integer) end_int = integer % 10 if end_int == 1 and integer % 100 != 11: return str(integer) + "st" elif end_int == 2 and integer % 100 != 12: return str(string_nth) + "nd" elif end_int == 3 and integer % 100 != 13: return str(string_nth) + "rd" else: return str(string_nth) + "th" ================================================ FILE: featuretools/utils/entry_point.py ================================================ import time from functools import wraps from inspect import signature import pkg_resources def entry_point(name): def inner_function(func): @wraps(func) def function_wrapper(*args, **kwargs): """function_wrapper of greeting""" # add positional args as named kwargs on_call_kwargs = kwargs.copy() sig = signature(func) for arg, parameter in zip(args, sig.parameters): on_call_kwargs[parameter] = arg # collect and initialize all registered entry points entry_points = [] for entry_point in pkg_resources.iter_entry_points(name): entry_point = entry_point.load() entry_points.append(entry_point()) # send arguments before function is called for ep in entry_points: ep.on_call(on_call_kwargs) try: # call function start = time.time() return_value = func(*args, **kwargs) runtime = time.time() - start except Exception as e: runtime = time.time() - start # send error for ep in entry_points: ep.on_error(error=e, runtime=runtime) raise e # send return value for ep in entry_points: ep.on_return(return_value=return_value, runtime=runtime) return return_value return function_wrapper return inner_function ================================================ FILE: featuretools/utils/gen_utils.py ================================================ import importlib import logging import re import sys from tqdm import tqdm logger = logging.getLogger("featuretools.utils") def 
make_tqdm_iterator(**kwargs):
    options = {"file": sys.stdout, "leave": True}
    options.update(kwargs)
    return tqdm(**options)


def get_relationship_column_id(path):
    _, r = path[0]
    child_link_name = r._child_column_name
    for _, r in path[1:]:
        parent_link_name = child_link_name
        child_link_name = "%s.%s" % (r.parent_name, parent_link_name)
    return child_link_name


def find_descendents(cls):
    """
    A generator which yields all descendent classes of the given class
    (including the given class)

    Args:
        cls (Class): the class to find descendents of
    """
    yield cls
    for sub in cls.__subclasses__():
        for c in find_descendents(sub):
            yield c


def import_or_raise(library, error_msg):
    """
    Attempts to import the requested library. If the import fails, raises an
    ImportError with the supplied error message.

    Args:
        library (str): the name of the library
        error_msg (str): error message to raise if the import fails
    """
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg)


def import_or_none(library):
    """
    Attempts to import the requested library.

    Args:
        library (str): the name of the library
    Returns: the library if it is installed, else None
    """
    try:
        return importlib.import_module(library)
    except ImportError:
        return None


def camel_and_title_to_snake(name):
    name = re.sub(r"([^_\d]+)(\d+)", r"\1_\2", name)
    name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()


================================================
FILE: featuretools/utils/plot_utils.py
================================================
from featuretools.utils.gen_utils import import_or_raise


def check_graphviz():
    GRAPHVIZ_ERR_MSG = (
        "Please install graphviz to plot."
+ " (See https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz for" + " details)" ) graphviz = import_or_raise("graphviz", GRAPHVIZ_ERR_MSG) # Try rendering a dummy graph to see if a working backend is installed try: graphviz.Digraph().pipe(format="svg") except graphviz.backend.ExecutableNotFound: raise RuntimeError( "To plot entity sets, a graphviz backend is required.\n" + "Install the backend using one of the following commands:\n" + " Mac OS: brew install graphviz\n" + " Linux (Ubuntu): $ sudo apt install graphviz\n" + " Windows (conda): conda install -c conda-forge python-graphviz\n" + " Windows (pip): pip install graphviz\n" + " Windows (EXE required if graphviz was installed via pip): https://graphviz.org/download/#windows" + " For more details visit: https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz", ) return graphviz def get_graphviz_format(graphviz, to_file): if to_file: # Explicitly cast to str in case a Path object was passed in to_file = str(to_file) split_path = to_file.split(".") if len(split_path) < 2: raise ValueError( "Please use a file extension like '.pdf'" + " so that the format can be inferred", ) format_ = split_path[-1] valid_formats = graphviz.FORMATS if format_ not in valid_formats: raise ValueError( "Unknown format. 
Make sure your format is"
                + " amongst the following: %s" % valid_formats,
            )
    else:
        format_ = None
    return format_


def save_graph(graph, to_file, format_):
    # Graphviz always appends the format to the file name, so we need to
    # remove it manually to avoid file names like 'file_name.pdf.pdf'
    offset = len(format_) + 1  # Add 1 for the dot
    output_path = to_file[:-offset]
    graph.render(output_path, cleanup=True)


================================================
FILE: featuretools/utils/recommend_primitives.py
================================================
import logging
from typing import List

from featuretools.computational_backends import calculate_feature_matrix
from featuretools.entityset import EntitySet
from featuretools.primitives.utils import get_transform_primitives
from featuretools.synthesis import dfs, get_valid_primitives

ORDERED_PRIMITIVES = [  # non-numeric primitives that require specific ordering or a time index to be set
    "cum_count",
    "cumulative_time_since_last_false",
    "cumulative_time_since_last_true",
    "diff",
    "diff_datetime",
    "is_first_occurrence",
    "is_last_occurrence",
    "time_since_previous",
]

DEPRECATED_PRIMITIVES = [
    "multiply_boolean",  # functionality duplicated by 'and' primitive
    "numeric_lag",  # deprecated and replaced with `lag`
]

REQUIRED_INPUT_PRIMITIVES = [  # non-numeric primitives that require input
    "count_string",
    "distance_to_holiday",
    "is_in_geobox",
    "not_equal_scalar",
    "equal_scalar",
    "time_since",
    "isin",
]

OTHER_PRIMITIVES_TO_EXCLUDE = [  # Excluding some primitives that can produce too many features or aren't useful in extracting information
    "not",
    "and",
    "or",
    "equal",
    "not_equal",
]

DEFAULT_EXCLUDED_PRIMITIVES = (
    REQUIRED_INPUT_PRIMITIVES
    + DEPRECATED_PRIMITIVES
    + ORDERED_PRIMITIVES
    + OTHER_PRIMITIVES_TO_EXCLUDE
)

# TODO: Make this list more dynamic
TIME_SERIES_PRIMITIVES = [
    "expanding_count",
    "expanding_max",
    "expanding_mean",
    "expanding_min",
    "expanding_std",
    "expanding_trend",
    "lag",
    "rolling_count",
    "rolling_outlier_count",
    "rolling_max",
    "rolling_mean",
    "rolling_min",
    "rolling_std",
    "rolling_trend",
]


# TODO: Support multi-table
def get_recommended_primitives(
    entityset: EntitySet,
    include_time_series_primitives: bool = False,
    excluded_primitives: List[str] = DEFAULT_EXCLUDED_PRIMITIVES,
) -> List[str]:
    """Get a list of recommended primitives given an entity set.

    Description:
        This function works by first getting a list of valid primitives that could be applied
        to a single-table EntitySet, withholding any primitives specified in `excluded_primitives`.
        Secondly, engineered features are created for non-numeric fields and are checked to see
        whether they take more than one distinct value. If a feature does, its primitive is added
        to the recommendation list. Then, numeric fields are checked for skewness. Depending on
        how skewed a column is, `square_root` or `natural_logarithm` will be recommended.
        Lastly, if `include_time_series_primitives` is specified as `True`, `Lag` will always be
        recommended, as well as all Rolling and Expanding primitives if numeric columns are present.

    Args:
        entityset (EntitySet): EntitySet that only contains one dataframe.
        include_time_series_primitives (bool): Whether or not time-series primitives should be considered. Defaults to False.
        excluded_primitives (List[str]): List of transform primitives to exclude from recommendations. Defaults to DEFAULT_EXCLUDED_PRIMITIVES.

    Note:
        The main objective of this function is to recommend primitives that could potentially provide
        important features to the modeling process. Non-numeric primitives do a great job in mainly
        serving as a way to extract information from origin features that may essentially be meaningless
        by themselves (e.g., NaturalLanguage, Datetime, LatLong). That is why they are the main focus of
        this function. Numeric transform primitives are very case-by-case dependent and therefore it is
        hard to mathematically quantify which should be recommended.
        Therefore, only transform primitives that address skewed numeric columns are included,
        as this is a standard and quantifiable transformation step. The only exception to this rule
        is time series problems. Because there are so few primitives that are only applicable
        to time series, all of them are included in the recommended primitives list.

    Note:
        This function currently only works for a single table and will only recommend transform primitives.
    """
    es_dataframe_list = entityset.dataframes
    if len(es_dataframe_list) == 0:
        raise IndexError("No DataFrame in EntitySet found. Please add a DataFrame.")
    if len(es_dataframe_list) > 1:
        raise IndexError(
            "Multi-table EntitySets are currently not supported. Please only use a single table EntitySet.",
        )
    target_dataframe_name = es_dataframe_list[0].ww.name
    recommended_primitives = set()
    if not include_time_series_primitives:
        # Build a new list instead of using `+=`, which would extend the caller's
        # list (or the shared DEFAULT_EXCLUDED_PRIMITIVES default) in place and
        # leak the time series exclusions into later calls.
        excluded_primitives = list(excluded_primitives) + TIME_SERIES_PRIMITIVES

    all_trans_primitives = get_transform_primitives()
    selected_trans_primitives = [
        p for name, p in all_trans_primitives.items() if name not in excluded_primitives
    ]

    valid_primitive_names = [
        prim.name
        for prim in get_valid_primitives(
            entityset,
            target_dataframe_name,
            1,
            selected_trans_primitives,
        )[1]
    ]
    recommended_primitives.update(
        _recommend_non_numeric_primitives(
            entityset,
            target_dataframe_name,
            valid_primitive_names,
        ),
    )
    recommended_primitives.update(
        _recommend_skew_numeric_primitives(
            entityset,
            target_dataframe_name,
            valid_primitive_names,
        ),
    )
    recommended_primitives.update(
        set(TIME_SERIES_PRIMITIVES).intersection(
            valid_primitive_names,
        ),
    )

    return list(recommended_primitives)


def _recommend_non_numeric_primitives(
    entityset: EntitySet,
    target_dataframe_name: str,
    valid_primitives: List[str],
) -> set:
    """Get a set of non-numeric primitives for a given dataset and a list of primitives.
Description: Given a single table entity set with a `target_dataframe_name` and an applicable list of `valid_primitives`, get a set of primitives which produce non-unique features. Args: entityset (EntitySet): EntitySet that only contains one dataframe. target_dataframe_name (str): Name of target dataframe to access in `entityset`. valid_primitives (List[str]): List of primitives to calculate and check output features. """ recommended_non_numeric_primitives: set[str] = set() # Only want to run feature generation on non numeric primitives numeric_columns_to_ignore = list( entityset[target_dataframe_name] .ww.select(include="numeric", return_schema=True) .columns, ) features = dfs( entityset=entityset, target_dataframe_name=target_dataframe_name, trans_primitives=valid_primitives, max_depth=1, features_only=True, ignore_columns={target_dataframe_name: numeric_columns_to_ignore}, ) for f in features: if ( f.primitive.name is not None and f.primitive.name not in recommended_non_numeric_primitives ): try: matrix = calculate_feature_matrix([f], entityset) for f_name in f.get_feature_names(): if len(matrix[f_name].unique()) > 1: recommended_non_numeric_primitives.add(f.primitive.name) except ( Exception ) as e: # If error in calculating feature matrix pass on the recommendation logger = logging.getLogger("featuretools") logger.error( f"Exception with feature {f.get_name()} with primitive {f.primitive.name}: {str(e)}", ) return recommended_non_numeric_primitives def _recommend_skew_numeric_primitives( entityset: EntitySet, target_dataframe_name: str, valid_primitives: List[str], ) -> set: """Get a set of recommended skew numeric primitives given an entity set. Description: Given woodwork initialized dataframe of origin features with only `numeric` semantic tags and an applicable list of `valid_skew_primitives`, get a set of primitives which could be applied to address right skewness. Args: entityset (EntitySet): EntitySet that only contains one dataframe. 
        target_dataframe_name (str): Name of target dataframe to access in `entityset`.
        valid_primitives (List[str]): List of primitives to compare.

    Note:
        We currently only have primitives to address right skewness.
    """
    recommended_skew_primitives: set[str] = set()
    skew_numeric_primitives = {"square_root", "natural_logarithm"}
    valid_skew_primitives = skew_numeric_primitives.intersection(valid_primitives)

    if valid_skew_primitives:
        numerics_only_df = entityset[target_dataframe_name].ww.select("numeric")
        for col in numerics_only_df:
            # Shouldn't recommend log or sqrt if NaNs, zeros, or negative numbers are present
            contains_nan = numerics_only_df[col].isnull().any()
            all_above_zero = (numerics_only_df[col] > 0).all()
            if all_above_zero and not contains_nan:
                skew = numerics_only_df[col].skew()
                # We currently don't have anything in featuretools to automatically handle
                # left skewed data or skewed data with negative values
                if skew > 0.5 and skew < 1 and "square_root" in valid_skew_primitives:
                    recommended_skew_primitives.add("square_root")
                    # TODO: Add Box Cox here when available
                if skew > 1 and "natural_logarithm" in valid_skew_primitives:
                    recommended_skew_primitives.add("natural_logarithm")
                    # TODO: Add log base 10 transform primitive when available
    return recommended_skew_primitives


================================================
FILE: featuretools/utils/s3_utils.py
================================================
import json
import shutil

from featuretools.utils.gen_utils import import_or_raise


def use_smartopen_es(file_path, path, transport_params=None, read=True):
    # `open` here is smart_open's open, which also handles local paths
    open = import_or_raise("smart_open", SMART_OPEN_ERR_MSG).open
    if read:
        with open(path, "rb", transport_params=transport_params) as fin:
            with open(file_path, "wb") as fout:
                shutil.copyfileobj(fin, fout)
    else:
        with open(file_path, "rb") as fin:
            with open(path, "wb", transport_params=transport_params) as fout:
                shutil.copyfileobj(fin, fout)


def
use_smartopen_features(path, features_dict=None, transport_params=None, read=True): open = import_or_raise("smart_open", SMART_OPEN_ERR_MSG).open if read: with open(path, "r", encoding="utf-8", transport_params=transport_params) as f: features_dict = json.load(f) return features_dict else: with open(path, "w", transport_params=transport_params) as f: json.dump(features_dict, f) def get_transport_params(profile_name): boto3 = import_or_raise("boto3", BOTO3_ERR_MSG) UNSIGNED = import_or_raise("botocore", BOTOCORE_ERR_MSG).UNSIGNED Config = import_or_raise("botocore.config", BOTOCORE_ERR_MSG).Config if isinstance(profile_name, str): session = boto3.Session(profile_name=profile_name) transport_params = {"client": session.client("s3")} elif profile_name is False or boto3.Session().get_credentials() is None: session = boto3.Session() client = session.client("s3", config=Config(signature_version=UNSIGNED)) transport_params = {"client": client} else: transport_params = None return transport_params BOTO3_ERR_MSG = ( "The boto3 library is required to read and write from URLs and S3.\n" "Install via pip:\n" " pip install boto3\n" "Install via conda:\n" " conda install -c conda-forge boto3" ) BOTOCORE_ERR_MSG = ( "The botocore library is required to read and write from URLs and S3.\n" "Install via pip:\n" " pip install botocore\n" "Install via conda:\n" " conda install -c conda-forge botocore" ) SMART_OPEN_ERR_MSG = ( "The smart_open library is required to read and write from URLs and S3.\n" "Install via pip:\n" " pip install 'smart-open>=5.0.0'\n" "Install via conda:\n" " conda install -c conda-forge 'smart_open>=5.0.0'" ) ================================================ FILE: featuretools/utils/schema_utils.py ================================================ import logging import warnings from packaging.version import parse from featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION logger = logging.getLogger("featuretools.utils") def 
check_schema_version(cls, cls_type): """ If the saved schema version is newer than the current featuretools schema version, this function will output a warning saying so. If the saved schema version is a major release or more behind the current featuretools schema version, this function will log a message saying so. """ if isinstance(cls_type, str): current = None saved = None if cls_type == "entityset": current = ENTITYSET_SCHEMA_VERSION saved = cls.get("schema_version") elif cls_type == "features": current = FEATURES_SCHEMA_VERSION saved = cls.features_dict["schema_version"] if parse(current) < parse(saved): warning_text_upgrade = ( "The schema version of the saved %s" "(%s) is greater than the latest supported (%s). " "You may need to upgrade featuretools. Attempting to load %s ..." % (cls_type, saved, current, cls_type) ) warnings.warn(warning_text_upgrade) if parse(current).major > parse(saved).major: warning_text_outdated = ( "The schema version of the saved %s" "(%s) is no longer supported by this version " "of featuretools. Attempting to load %s ..." % (cls_type, saved, cls_type) ) logger.warning(warning_text_outdated) ================================================ FILE: featuretools/utils/time_utils.py ================================================ from datetime import datetime, timedelta import numpy as np import pandas as pd def make_temporal_cutoffs( instance_ids, cutoffs, window_size=None, num_windows=None, start=None, ): """Makes a set of equally spaced cutoff times prior to a set of input cutoffs and instance ids. If window_size and num_windows are provided, then num_windows of size window_size will be created prior to each cutoff time If window_size and a start list is provided, then a variable number of windows will be created prior to each cutoff time, with the corresponding start time as the first cutoff. 
    If num_windows and a start list are provided, then num_windows of variable size will be created prior to each
    cutoff time, with the corresponding start time as the first cutoff.

    Args:
        instance_ids (list, np.ndarray, or pd.Series): list of instance ids. This function will make a
            new datetime series of multiple cutoff times for each value in this array.
        cutoffs (list, np.ndarray, or pd.Series): list of datetime objects associated with each
            instance id. Each one of these will be the last time in the new datetime series for each instance id.
        window_size (pd.Timedelta or str, optional): amount of time between each datetime in each new cutoff series
            (for example, "1d")
        num_windows (int, optional): number of windows in each new cutoff series
        start (list, optional): list of start times for each instance id
    """
    if window_size is not None and num_windows is not None and start is not None:
        raise ValueError(
            "Only supply 2 of the 3 optional args, window_size, num_windows and start",
        )
    out = []
    for i, id_time in enumerate(zip(instance_ids, cutoffs)):
        _id, time = id_time
        _window_size = window_size
        _start = None
        if start is not None:
            if window_size is None:
                # num_windows cutoffs evenly spaced from start[i] to the cutoff time
                _window_size = (time - start[i]) / (num_windows - 1)
            else:
                _start = start[i]

        to_add = pd.DataFrame()
        to_add["time"] = pd.date_range(
            end=time,
            periods=num_windows,
            freq=_window_size,
            start=_start,
        )
        to_add["instance_id"] = [_id] * len(to_add["time"])
        out.append(to_add)
    return pd.concat(out).reset_index(drop=True)


def convert_time_units(secs, unit):
    """
    Converts a time specified in seconds to a time in the given units

    Args:
        secs (integer): number of seconds. This function will convert the units of this number.
        unit (str): units to be converted to.
acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds """ unit_divs = { "years": 31540000, "months": 2628000, "days": 86400, "hours": 3600, "minutes": 60, "seconds": 1, "milliseconds": 0.001, "nanoseconds": 0.000000001, } if unit not in unit_divs: raise ValueError("Invalid unit given, make sure it is plural") return secs / (unit_divs[unit]) def convert_datetime_to_floats(x): first = int(x.iloc[0].value * 1e-9) x = pd.to_numeric(x).astype(np.float64).values dividend = find_dividend_by_unit(first) x *= 1e-9 / dividend return x def convert_timedelta_to_floats(x): first = int(x.iloc[0].total_seconds()) dividend = find_dividend_by_unit(first) x = pd.TimedeltaIndex(x).total_seconds().astype(np.float64) / dividend return x def find_dividend_by_unit(time): """Finds whether time best corresponds to a value in days, hours, minutes, or seconds. """ for dividend in [86400, 3600, 60]: div = time / dividend if round(div) == div: return dividend return 1 def calculate_trend(series): # numpy can't handle `Int64` values, so cast to float if series.dtype == "Int64": series = series.astype("float64") df = pd.DataFrame({"x": series.index, "y": series.values}).dropna() if df.shape[0] <= 2: return np.nan if isinstance(df["x"].iloc[0], (datetime, pd.Timestamp)): x = convert_datetime_to_floats(df["x"]) else: x = df["x"].values if isinstance(df["y"].iloc[0], (datetime, pd.Timestamp)): y = convert_datetime_to_floats(df["y"]) elif isinstance(df["y"].iloc[0], (timedelta, pd.Timedelta)): y = convert_timedelta_to_floats(df["y"]) else: y = df["y"].values x = x - x.mean() y = y - y.mean() # prevent divide by zero error if len(np.unique(x)) == 1: return 0 # consider scipy.stats.linregress for large n cases coefficients = np.polyfit(x, y, 1) return coefficients[0] ================================================ FILE: featuretools/utils/trie.py ================================================ class Trie(object): """ A trie (prefix tree) where the keys are 
sequences of hashable objects. It behaves similarly to a dictionary, except that the keys can be lists or other sequences. Examples: >>> from featuretools.utils import Trie >>> trie = Trie(default=str) >>> # Set a value >>> trie.get_node([1, 2, 3]).value = '123' >>> # Get a value >>> trie.get_node([1, 2, 3]).value '123' >>> # Overwrite a value >>> trie.get_node([1, 2, 3]).value = 'updated' >>> trie.get_node([1, 2, 3]).value 'updated' >>> # Getting a key that has not been set returns the default value. >>> trie.get_node([1, 2]).value '' """ def __init__(self, default=lambda: None, path_constructor=list): """ default: A function returning the value to use for new nodes. path_constructor: A function which constructs a path from a list. The path type must support addition (concatenation). """ self.value = default() self._children = {} self._default = default self._path_constructor = path_constructor def children(self): """ A list of pairs of the edges from this node and the nodes they point to. Examples: >>> from featuretools.utils import Trie >>> trie = Trie(default=str) >>> trie.get_node([1, 2]).value = '12' >>> trie.get_node([3]).value = '3' >>> children = trie.children() >>> first_edge, first_child = children[0] >>> first_edge 1 >>> first_child.value '' >>> second_edge, second_child = children[1] >>> second_edge 3 >>> second_child.value '3' """ return list(self._children.items()) def get_node(self, path): """ Get the sub-trie at the given path. If it does not yet exist initialize it with the default value. 
Examples: >>> from featuretools.utils import Trie >>> t = Trie() >>> t.get_node([1, 2, 3]).value = '123' >>> t.get_node([1, 2, 4]).value = '124' >>> sub = t.get_node([1, 2]) >>> sub.get_node([3]).value '123' >>> sub.get_node([4]).value '124' """ if path: first = path[0] rest = path[1:] if first in self._children: sub_trie = self._children[first] else: sub_trie = Trie( default=self._default, path_constructor=self._path_constructor, ) self._children[first] = sub_trie return sub_trie.get_node(rest) else: return self def __iter__(self): """ Iterate over all values in the trie. Yields tuples of (path, value). Implemented using depth first search. """ yield self._path_constructor([]), self.value for key, sub_trie in self.children(): path_to_children = self._path_constructor([key]) for sub_path, value in sub_trie: path = path_to_children + sub_path yield path, value ================================================ FILE: featuretools/utils/utils_info.py ================================================ import locale import os import platform import struct import sys import pkg_resources import featuretools deps = [ "numpy", "pandas", "tqdm", "cloudpickle", "dask", "distributed", "psutil", "pip", "setuptools", ] def show_info(): print("Featuretools version: %s" % featuretools.__version__) print("Featuretools installation directory: %s" % get_featuretools_root()) print_sys_info() print_deps(deps) def print_sys_info(): print("\nSYSTEM INFO") print("-----------") sys_info = get_sys_info() for k, stat in sys_info: print("{k}: {stat}".format(k=k, stat=stat)) def print_deps(dependencies): print("\nINSTALLED VERSIONS") print("------------------") installed_packages = get_installed_packages() package_dep = [] for x in dependencies: # prevents uninstalled deps from being printed if x in installed_packages: package_dep.append((x, installed_packages[x])) for k, stat in package_dep: print("{k}: {stat}".format(k=k, stat=stat)) # Modified from here # 
https://github.com/pandas-dev/pandas/blob/d9a037ec4ad0aab0f5bf2ad18a30554c38299e57/pandas/util/_print_versions.py#L11
def get_sys_info():
    "Returns system information as a list of (key, value) tuples"
    blob = []
    try:
        (sysname, nodename, release, version, machine, processor) = platform.uname()
        blob.extend(
            [
                ("python", ".".join(map(str, sys.version_info))),
                ("python-bits", struct.calcsize("P") * 8),
                ("OS", "{sysname}".format(sysname=sysname)),
                ("OS-release", "{release}".format(release=release)),
                ("machine", "{machine}".format(machine=machine)),
                ("processor", "{processor}".format(processor=processor)),
                ("byteorder", "{byteorder}".format(byteorder=sys.byteorder)),
                ("LC_ALL", "{lc}".format(lc=os.environ.get("LC_ALL", "None"))),
                ("LANG", "{lang}".format(lang=os.environ.get("LANG", "None"))),
                ("LOCALE", ".".join(map(str, locale.getlocale()))),
            ],
        )
    except (KeyError, ValueError):
        pass

    return blob


def get_installed_packages():
    installed_packages = {}
    for d in pkg_resources.working_set:
        installed_packages[d.project_name] = d.version
    return installed_packages


def get_featuretools_root():
    return os.path.dirname(featuretools.__file__)


================================================
FILE: featuretools/utils/wrangle.py
================================================
import re
import tarfile
from datetime import datetime

import numpy as np
import pandas as pd
from woodwork.logical_types import Datetime, Ordinal

from featuretools.entityset.timedelta import Timedelta


def _check_timedelta(td):
    """
    Convert strings to Timedelta objects

    Allows for both shortform and longform units, as well as any form of capitalization
    '2 Minutes'
    '2 minutes'
    '2 m'
    '1 Minute'
    '1 minute'
    '1 m'
    '1 units'
    '1 Units'
    '1 u'
    Shortform is fine if space is dropped
    '2m'
    '1u'
    If a pd.Timedelta object is passed, units will be converted to seconds due to the underlying representation
        of pd.Timedelta.
    If a pd.DateOffset object is passed, it will be converted to a Featuretools Timedelta if it has one
        temporal parameter.
    Otherwise, it will remain a pd.DateOffset.
    """
    if td is None:
        return td
    if isinstance(td, Timedelta):
        return td
    elif not isinstance(td, (int, float, str, pd.DateOffset, pd.Timedelta)):
        raise ValueError("Unable to parse timedelta: {}".format(td))
    if isinstance(td, pd.Timedelta):
        unit = "s"
        value = td.total_seconds()
        times = {unit: value}
        return Timedelta(times, delta_obj=td)
    elif isinstance(td, pd.DateOffset):
        # DateOffsets
        if td.__class__.__name__ != "DateOffset":
            if hasattr(td, "__dict__"):
                # Special offsets (such as BDay) - prior to pandas 1.0.0
                value = td.__dict__["n"]
            else:
                # Special offsets (such as BDay) - after pandas 1.0.0
                value = td.n
            unit = td.__class__.__name__
            times = dict([(unit, value)])
        else:
            times = dict()
            for td_unit, td_value in td.kwds.items():
                times[td_unit] = td_value
        return Timedelta(times, delta_obj=td)
    else:
        pattern = "([0-9]+) *([a-zA-Z]+)$"
        match = re.match(pattern, td)
        value, unit = match.groups()
        try:
            value = int(value)
        except Exception:
            try:
                value = float(value)
            except Exception:
                raise ValueError(
                    "Unable to parse value {} from ".format(value)
                    + "timedelta string: {}".format(td),
                )
        times = {unit: value}
        return Timedelta(times)


def _check_time_against_column(time, time_column):
    """
    Check to make sure that time is compatible with time_column,
    where time could be a timestamp, or a Timedelta, number, or None,
    and time_column is a Woodwork initialized column.
    Compatibility means that arithmetic can be performed between time
    and elements of time_column.

    If time is None, then we don't care if arithmetic can be performed
    (presumably it won't ever be performed).
    """
    if time is None:
        return True
    elif isinstance(time, (int, float)):
        return time_column.ww.schema.is_numeric
    elif isinstance(time, (pd.Timestamp, datetime, pd.DateOffset)):
        return time_column.ww.schema.is_datetime
    elif isinstance(time, Timedelta):
        if time_column.ww.schema.is_datetime:
            return True
        elif time.unit not in Timedelta._time_units:
            if (
                isinstance(time_column.ww.logical_type, Ordinal)
                or "numeric" in time_column.ww.semantic_tags
                or "time_index" in time_column.ww.semantic_tags
            ):
                return True
    return False


def _check_time_type(time):
    """
    Checks if `time` is an instance of common int, float, or datetime types.
    Returns "numeric" or Datetime based on results, or None if neither applies.
    """
    time_type = None
    if isinstance(time, (datetime, np.datetime64)):
        time_type = Datetime
    elif (
        isinstance(time, (int, float))
        or np.issubdtype(time, np.integer)
        or np.issubdtype(time, np.floating)
    ):
        time_type = "numeric"
    return time_type


def _is_s3(string):
    """
    Checks if the given string is an S3 path.
    Returns a boolean.
    """
    return string.startswith("s3://")


def _is_url(string):
    """
    Checks if the given string is a URL.
    Returns a boolean.
    """
    return string.startswith("http")


def _is_local_tar(string):
    """
    Checks if the given string is a local tarfile path.
    Returns a boolean.
""" return string.endswith(".tar") and tarfile.is_tarfile(string) ================================================ FILE: featuretools/version.py ================================================ __version__ = "1.31.0" ENTITYSET_SCHEMA_VERSION = "9.0.0" FEATURES_SCHEMA_VERSION = "10.0.0" ================================================ FILE: pyproject.toml ================================================ [project] name = "featuretools" readme = "README.md" description = "a framework for automated feature engineering" dynamic = ["version"] classifiers = [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Science/Research", "Intended Audience :: Developers", "Topic :: Software Development", "Topic :: Scientific/Engineering", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.9", "Programming Language :: Python :: 3.10", "Programming Language :: Python :: 3.11", "Programming Language :: Python :: 3.12", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Operating System :: Unix", "Operating System :: MacOS", ] authors = [ {name="Alteryx, Inc.", email="open_source_support@alteryx.com"} ] maintainers = [ {name="Alteryx, Inc.", email="open_source_support@alteryx.com"} ] keywords = ["feature engineering", "data science", "machine learning"] license = {text = "BSD 3-clause"} requires-python = ">=3.9,<4" dependencies = [ "cloudpickle >= 1.5.0", "holidays >= 0.17", "numpy >= 1.25.0, < 2.0.0", "packaging >= 20.0", "pandas >= 2.0.0", "psutil >= 5.7.0", "scipy >= 1.10.0", "tqdm >= 4.66.3", "woodwork >= 0.28.0", ] [project.urls] "Documentation" = "https://featuretools.alteryx.com" "Source Code"= "https://github.com/alteryx/featuretools/" "Changes" = "https://featuretools.alteryx.com/en/latest/release_notes.html" "Issue Tracker" = "https://github.com/alteryx/featuretools/issues" "Twitter" = "https://twitter.com/alteryxoss" "Chat" = 
"https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA" [project.optional-dependencies] test = [ "boto3 >= 1.34.32", "composeml >= 0.8.0", "graphviz >= 0.8.4", "moto[all] >= 5.0.0", "pip >= 23.3.0", "pyarrow >= 14.0.1", "pympler >= 0.8", "pytest >= 7.1.2", "pytest-cov >= 3.0.0", "pytest-xdist >= 2.5.0", "smart-open >= 5.0.0", "urllib3 >= 1.26.18", "pytest-timeout >= 2.1.0", ] dask = [ "dask[dataframe] >= 2023.2.0", "distributed >= 2023.2.0", ] tsfresh = [ "featuretools-tsfresh-primitives >= 1.0.0", ] autonormalize = [ "autonormalize >= 2.0.1", ] sql = [ "featuretools_sql >= 0.0.1", "psycopg2-binary >= 2.9.3", ] sklearn = [ "featuretools-sklearn-transformer >= 1.0.0", ] premium = [ "premium-primitives >= 0.0.3", ] nlp = [ "nlp-primitives >= 2.12.0", ] docs = [ "ipython == 8.4.0", "jupyter == 1.0.0", "jupyter-client >= 8.0.2", "matplotlib == 3.7.2", "Sphinx == 5.1.1", "nbsphinx == 0.8.9", "nbconvert == 6.5.0", "pydata-sphinx-theme == 0.9.0", "sphinx-inline-tabs == 2022.1.2b11", "sphinx-copybutton == 0.5.0", "myst-parser == 0.18.0", "autonormalize >= 2.0.1", "click >= 7.0.0", "featuretools[dask,test]", ] dev = [ "ruff >= 0.1.6", "black[jupyter] >= 23.1.0", "pre-commit >= 2.20.0", "featuretools[docs,dask,test]", ] complete = [ "featuretools[premium,nlp,dask]", ] [tool.setuptools] include-package-data = true license-files = [ "LICENSE", "featuretools/primitives/data/free_email_provider_domains_license" ] [tool.setuptools.packages.find] namespaces = true [tool.setuptools.package-data] "*" = [ "*.txt", "README.md", ] "featuretools" = [ "primitives/data/*.csv", "primitives/data/*.txt", ] [tool.setuptools.exclude-package-data] "*" = [ "* __pycache__", "*.py[co]", "docs/*" ] [tool.setuptools.dynamic] version = {attr = "featuretools.version.__version__"} [tool.pytest.ini_options] addopts = "--doctest-modules --ignore=featuretools/tests/entry_point_tests/add-ons" testpaths = [ "featuretools/tests/*" ] filterwarnings = [ 
"ignore::DeprecationWarning", "ignore::PendingDeprecationWarning" ] [tool.ruff] line-length = 88 target-version = "py311" lint.ignore = ["E501"] lint.select = [ # Pyflakes "F", # Pycodestyle "E", "W", # isort "I001" ] src = ["featuretools"] [tool.ruff.lint.per-file-ignores] "__init__.py" = ["E402", "F401", "I001", "E501"] [tool.ruff.lint.isort] known-first-party = ["featuretools"] [tool.coverage.run] source = ["featuretools"] omit = [ "*/add-ons/**/*" ] [tool.coverage.report] exclude_lines =[ "pragma: no cover", "def __repr__", "raise AssertionError", "raise NotImplementedError", "if __name__ == .__main__.:", "if self._verbose:", "if verbose:", "if profile:", "pytest.skip" ] [build-system] requires = [ "setuptools >= 61.0.0", "wheel" ] build-backend = "setuptools.build_meta" ================================================ FILE: release.md ================================================ # Release Process ## 0. Pre-Release Checklist Before starting the release process, verify the following: - All work required for this release has been completed and the team is ready to release. - [All Github Actions Tests are green on main](https://github.com/alteryx/featuretools/actions?query=branch%3Amain). - EvalML Tests are green with Featuretools main - [![Unit Tests - EvalML with Featuretools main branch](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml/badge.svg?branch=main)](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml) - Looking Glass performance tests runs should not show any significant performance regressions when comparing the last commit on `main` with the previous release of Featuretools. See Step 1 below for instructions on manually launching the performance tests runs. - The [ReadtheDocs build](https://readthedocs.com/projects/feature-labs-inc-featuretools/) for "latest" is marked as passed. 
  To avoid mysterious errors, best practice is to empty your browser cache when reading new versions of the docs!
- The [public documentation for the "latest" branch](https://featuretools.alteryx.com/en/latest/) looks correct, and the [release notes](https://featuretools.alteryx.com/en/latest/release_notes.html) include the last change which was made on `main`.
- Get agreement on the version number to use for the release.

#### Version Numbering

Featuretools uses [semantic versioning](https://semver.org/). Every release has a major, minor and patch version number, displayed like so: `<major>.<minor>.<patch>`.

In certain instances, it may be necessary to create a backport release. This is when commits from a newer version of a library are ported to an older version of the software and then released. This occurs when anything but the latest commit on main is used as the target for release, and can go so far as adding a further patch release, such as 0.11.2, after a 0.12.0 version has already been released.

If a backport release is being performed, please see the [Backport Release Guide](docs/backport_release.md) for instructions on how to proceed, as some steps from this guide should be performed differently.

If you'd like to create a development release, which won't be deployed to pypi and conda and marked as a generally-available production release, please add a "dev" prefix to the patch version, i.e. `X.X.devX`. Note this claims the patch number--if the previous release was `0.12.0`, a subsequent dev release would be `0.12.dev1`, and the following release would be `0.12.2`, _not_ `0.12.1`. Development releases deploy to [test.pypi.org](https://test.pypi.org/project/featuretools/) instead of to [pypi.org](https://pypi.org/project/featuretools).

## 1.
Evaluate Performance Test Results

Before releasing Featuretools, the person performing the release should launch a performance test run and evaluate the results to make sure no significant performance regressions will be introduced by the release. This can be done by launching a Looking Glass performance test run, which will then post results to Slack.

To manually launch a Looking Glass performance test run, follow these steps:

1. Navigate to the [Looking Glass performance tests](https://github.com/alteryx/featuretools/actions/workflows/looking_glass_performance_tests.yaml) GitHub action
2. Click on the Run workflow dropdown to set up the run
3. Make sure that the "use workflow from" dropdown is set to `main` to use the workflow version in Featuretools `main`
4. Enter the hash of the most recent commit to `main` in the "new commit to evaluate" field. For example: `cee9607`
5. Enter the version tag of the last release of Featuretools in the "previous commit to evaluate" field. For example, if the last release of Featuretools was version 1.20.0, you would enter `v1.20.0` here.
6. Click the "Run workflow" button to launch the jobs

Once the job has been completed, the results summaries will be posted to Slack automatically. Review the results and make sure the performance has not degraded. If any significant performance issues are noted, discuss with the development team before proceeding.

Note: The procedure above can also be used to launch performance test runs at any time, even outside of the release process. When launching a test run, the commit fields can take any commit hash, GitHub branch or tag as input to specify the new and previous commits to compare.

## 2. Create Featuretools release on Github

#### Create Release Branch

1. Branch off of featuretools main. For the branch name, please use "release_vX.Y.Z" as the naming scheme (e.g. "release_v0.13.3").
   Doing so will bypass our release notes check-in test, which requires all other PRs to add a release note entry.

#### Bump Version Number

1. Bump `__version__` in `featuretools/version.py`, and `featuretools/tests/test_version.py`.

#### Update Release Notes

1. Replace "Future Release" in `docs/source/release_notes.rst` with the current date

   ```
   v0.13.3 Sep 28, 2020
   ====================
   ```

2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)
3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**
4. The release PR does not need to be mentioned in the list of changes
5. Add a commented out "Future Release" section with all of the Release Notes sections above the current section

   ```
   .. Future Release
     ==============
       * Enhancements
       * Fixes
       * Changes
       * Documentation Changes
       * Testing Changes

   .. Thanks to the following people for contributing to this release:
   ```

#### Create Release PR

A [release pr](https://github.com/alteryx/featuretools/pull/856) should have **the version number as the title** and the release notes for that release as the PR body text. The contributors list is not necessary. The special sphinx docs syntax (:pr:\`547\`) needs to be changed to github link syntax (#547).

Checklist before merging:

- The title of the PR is the version number.
- All tests are currently green on check-in and on `main`.
- The ReadtheDocs build for the release PR branch has passed, and the resulting docs contain the expected release notes.
- PR has been reviewed and approved.
- Confirm with the team that `main` will be frozen until step 3 (Github Release) is complete.

After merging, verify again that ReadtheDocs "latest" is correct.

## 3. Create Github Release

After the release pull request has been merged into the `main` branch, it is time to draft the GitHub release.
[Example release](https://github.com/alteryx/featuretools/releases/tag/v0.13.3)

- The target should be the `main` branch
- The tag should be the version number with a v prefix (e.g. v0.13.3)
- Release title is the same as the tag
- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\`gsheni\`) to github syntax (@gsheni)
- This is not a pre-release
- Publishing the release will automatically upload the package to PyPI

## 4. Release on conda-forge

In order to release on conda-forge, you can either wait for a bot to create a pull request, or use a GitHub Actions workflow.

### Option a: Use a GitHub Action workflow

1. After the package has been uploaded on PyPI, the **Create Feedstock Pull Request** workflow should automatically kick off a job.
    * If it does not, go [here](https://github.com/alteryx/featuretools/actions/workflows/create_feedstock_pr.yaml)
    * Click **Run workflow** and input the letter `v` followed by the release version (e.g. `v0.13.3`)
    * Kick off the GitHub Action, and monitor the Job Summary.
2. Once the job has been completed, you will see summary output, with a URL.
    * Visit that URL and create a pull request.
    * Alternatively, create the pull request by clicking the branch name (e.g. `v0.13.3`):
      - https://github.com/alteryx/featuretools-feedstock/branches
3. Verify that the PR has the following:
    * The `build['number']` is 0 (in __recipe/meta.yaml__).
    * The `requirements['run']` (in __recipe/meta.yaml__) matches the `[project]['dependencies']` in __featuretools/pyproject.toml__.
    * The `test['requires']` (in __recipe/meta.yaml__) matches the `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__
    > There will be 2 entries for graphviz: `graphviz` and `python-graphviz`.
    > Make sure `python-graphviz` (in __recipe/meta.yaml__) matches `graphviz` in `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__.
4. Satisfy the conditions in the pull request description and **merge it if the CI passes**.

### Option b: Waiting for bot to create new PR

1. A bot should automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls) - note, the PR may take up to a few hours to be created
2. Update requirements changes in `recipe/meta.yaml` (bot should have handled version and source links on its own)
3. After tests pass, a maintainer will merge the PR in

# Miscellaneous

## Add new maintainers to featuretools-feedstock

Per the instructions [here](https://conda-forge.org/docs/maintainer/updating_pkgs.html#updating-the-maintainer-list):

1. Ask an existing maintainer to create an issue on the [repo](https://github.com/conda-forge/featuretools-feedstock).

    a. Select *Bot commands* and put the following title (change `username`):

    ```text
    @conda-forge-admin, please add user @username
    ```

2. A PR will be auto-created on the repo, and will need to be merged by an existing maintainer.
3. The new user will need to **check their email for an invite link to click**, which should be https://github.com/conda-forge
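
## Sketch: deriving release branch and tag names

The naming rules used in the release steps (a `release_vX.Y.Z` branch, a tag with a `v` prefix, and the `X.X.devX` shape for development releases) can be collected into one small helper. The following is a minimal sketch and is not part of the Featuretools repository: `release_names` and `_VERSION_RE` are hypothetical names, and applying the same branch naming to development releases is an assumption, since this guide only shows production examples.

```python
import re

# Hypothetical pattern: accepts "X.Y.Z" production versions and
# "X.Y.devN" development versions, the two shapes named in this guide.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(?:(\d+)|dev(\d+))")


def release_names(version):
    """Return the branch name, tag name, and dev flag for a version string.

    Hypothetical helper for illustration only.
    """
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return {
        # Branch naming scheme from "Create Release Branch"
        "branch": f"release_v{version}",
        # Tag is the version number with a "v" prefix
        "tag": f"v{version}",
        # Group 4 is only populated for "X.Y.devN" versions
        "is_dev": match.group(4) is not None,
    }


if __name__ == "__main__":
    print(release_names("0.13.3"))
    print(release_names("0.12.dev1"))
```

Validating the string before building the names means a malformed version is rejected early instead of ending up in a branch or tag.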